Wikifunctions:Type proposals/Syntactic type
The sketch towards Abstract Wikipedia discusses our path to natural language generation. But it works directly on Wikidata Lexemes. Wikidata Lexemes can be very versatile: they can be French verbs, Hausa nouns, or Malayalam adjectives. At the same time, Lexemes in Wikidata are not particularly constrained and can differ vastly in how complete their forms are.
This has a number of disadvantages. Three important disadvantages for our use case are:
- a function can declare that it takes or returns a Lexeme, and while its name or description can make clear that it is supposed to take an English noun, the system cannot guide the user in actually ensuring that. Instead, the function would accept any Lexeme, and may need to perform checks to ensure the language, part of speech, and available forms are correct.
- it is difficult to write tests that don’t implicitly depend on the completeness of Wikidata. Lexemes are always Wikidata Lexemes, and therefore we cannot create test Lexemes in Wikifunctions.
- we lose the advantages of a more strictly typed system, e.g. as guidance in composition, in being more tightly restricted in selecting a function, etc.
It also has advantages, foremost a significant reduction in the number of types.
From Lexemes to Syntactic Types
We suggest introducing types for language and part of speech pairs (not for all of them, but for open classes). To give a simple example, we could have a type for "English noun", which then can be used with only the appropriate functions: we can ask for the singular or plural, maybe the possessive, but we cannot ask for the past tense or for the superlative or the accusative form of an English noun.
This would also allow combining, for example, explicitly stated forms with morphological functions.
An English noun can now have several constructors: most notably, it would take a Lexeme that is an English noun and use that to create the English noun object. But it could also use a string and treat it as the lemma of the noun (for example, to represent neologisms), or take two strings (one for the singular and one for the plural), or take the label of an item on Wikidata, or take a Lexeme with a noun from another language (in order to form a loan word), or take an English verb and noun it (for example to go from "verb" to "verbing"). An English noun or adjective could be created from a place name, or from a person's name. This allows us to be much more flexible and cover the tail end of the lexicon with alternative approaches.
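To make the idea of multiple constructors concrete, here is a minimal sketch in Python. All names (`EnglishNoun`, `from_lemma`, `from_forms`) are invented for illustration and are not part of any actual Wikifunctions API; the pluralisation rule is deliberately naive and would only be appropriate for regular neologisms.

```python
# Hypothetical sketch of an "English noun" type with several constructors,
# as described above. Names and rules are illustrative assumptions only.
class EnglishNoun:
    def __init__(self, singular: str, plural: str):
        self.singular = singular
        self.plural = plural

    @classmethod
    def from_lemma(cls, lemma: str) -> "EnglishNoun":
        # Naive regular pluralisation, suitable only for neologisms.
        if lemma.endswith(("s", "x", "z", "ch", "sh")):
            return cls(lemma, lemma + "es")
        return cls(lemma, lemma + "s")

    @classmethod
    def from_forms(cls, singular: str, plural: str) -> "EnglishNoun":
        # Explicitly stated forms, e.g. for irregular nouns.
        return cls(singular, plural)

print(EnglishNoun.from_lemma("wug").plural)                # wugs
print(EnglishNoun.from_forms("child", "children").plural)  # children
```

Further constructors (from a Wikidata Lexeme, from an item label, from a verb) would follow the same pattern: different inputs, one resulting typed object that the noun-specific functions can rely on.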
Such a type would allow us to create functions that build sentences from more specific parts of speech, reducing ambiguity and making it easier to compose sentences. This makes it easier to write the sentence generating functions, and it makes it easier and more predictable to use such functions.
Note that besides actual parts of speech, in many languages we would also have types for relevant phrases (e.g. noun phrases, verb phrases, etc.).
This has the disadvantage that we would need to create a lot of types. That would benefit from allowing the community to create types, in order to scale better.
(Note: the name "syntactic type" isn't very good. These types are language-specific types for representing grammatically typed building blocks of natural language. If you can come up with a better name, please do!)
A common building block for creating linguistic types
We introduce a type that is helpful in representing many syntactic types in many languages, together with one particular function operating on this type. We introduce the "table" type and the "merge" function.
Note that the following proposal is heavily inspired by the book and software Grammatical Framework.
Table type
A table type consists of two parts: a list of inherent features, and a list of options. Features are grammatical features, represented either by Wikidata item references or by Wikifunctions enumerations representing the same. Options are a map from a list of features to a dictionary of word forms.
The inherent features, the keys in the options map, and the structure of the dictionary of forms are all constrained by the syntactic type. For example, a typical German noun has one inherent feature (its grammatical gender: masculine, feminine, or neuter) and usually eight options: two for the grammatical number (singular or plural) times four for the case (nominative, accusative, dative, genitive). Since German nouns are usually continuous (i.e. they are not separated by other words, in contrast to some German verbs), the dictionary of forms for each option has a single entry.
Here is an example of a table for the German noun "Stadt" (meaning city):
| inherent | feminine | | | |
| options | nominative | genitive | dative | accusative |
| singular | Stadt | Stadt | Stadt | Stadt |
| plural | Städte | Städte | Städten | Städte |
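The table above can be sketched as a simple data structure. This is a minimal illustration, not an actual Wikifunctions type definition: features are plain strings here, whereas the proposal would use Wikidata item references or Wikifunctions enumerations, and the per-option dictionary of forms is reduced to a list because German nouns are continuous.

```python
from dataclasses import dataclass

# Minimal sketch of the proposed "table" type (illustrative names only).
@dataclass(frozen=True)
class Table:
    inherent: frozenset  # inherent grammatical features
    options: dict        # map: frozenset of features -> list of word forms

# The German noun "Stadt": one inherent feature (feminine), eight options.
stadt = Table(
    inherent=frozenset({"feminine"}),
    options={
        frozenset({"singular", "nominative"}): ["Stadt"],
        frozenset({"singular", "genitive"}):   ["Stadt"],
        frozenset({"singular", "dative"}):     ["Stadt"],
        frozenset({"singular", "accusative"}): ["Stadt"],
        frozenset({"plural", "nominative"}):   ["Städte"],
        frozenset({"plural", "genitive"}):     ["Städte"],
        frozenset({"plural", "dative"}):       ["Städten"],
        frozenset({"plural", "accusative"}):   ["Städte"],
    },
)

print(stadt.options[frozenset({"plural", "dative"})])  # ['Städten']
```

The syntactic type "German noun" would then constrain exactly this shape: one gender value in `inherent`, and option keys drawn from the number × case combinations.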
German adjectives on the other hand have no inherent grammatical features, but many options: for the grammatical gender, the grammatical number, the definiteness, the level of comparison (positive, comparative, superlative), etc.
Here is an example of a table for the German adjective "süddeutsch" (meaning southern German). Note that the adjective "süddeutsch" does not necessarily have its own Lexeme, but may have been constructed from a cardinal direction ("Süden", meaning south) and a demonym ("deutsch", meaning German). See the table on English Wiktionary for süddeutsch.
This is why they are called tables: these words are usually represented as tables in dictionaries and grammar books, with the grammatical dimensions as rows and columns and the word forms as cells. The grammatical features do not always need to be fully combinatorial. (For example, an English verb may have different forms for the different grammatical persons and numbers in the present tense, but not in the past tense.)
Merge function
The merge function takes two tables and returns a table.
The inherent features of each table are used to filter the options of the other table. In addition, the dictionaries of forms are combined to create a new map of forms.
For example, if we wanted to create the noun phrase "eine süddeutsche Stadt" (meaning "a city in the south of Germany"), we build a function that takes the noun and the adjective from the previous section and merges them, creating a German noun phrase. Since German noun phrases are continuous, it can immediately concatenate the different forms and other words (in this example the indefinite article "ein" or "eine", again based on the inherent feature of the noun).
The result of the merge is still open with regard to the case, but would have an inherent gender (feminine) and number (singular). A function that takes the noun phrase as an input for a specific sentence can then select the right case depending on the phrase's function in the sentence. For example, for a sentence such as "Anton kommt aus einer süddeutschen Stadt." (meaning "Anton comes from a city in the south of Germany"), we would select the dative case, while for a different sentence such as "Anton zieht in eine süddeutsche Stadt." (meaning "Anton moves to a city in the south of Germany"), we would need the accusative case. The predicate and the preposition together determine which case is needed.
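A self-contained sketch of such a merge follows. All names are invented for illustration; features are plain strings, each belonging to a grammatical dimension so that contradictions can be detected. One simplification relative to the text: the result here keeps "singular" in the option keys rather than promoting it to an inherent feature, and only a tiny slice of the adjective table is included.

```python
from collections import namedtuple

# Minimal table representation (illustrative, not a Wikifunctions API).
Table = namedtuple("Table", ["inherent", "options"])

# Each feature value belongs to a grammatical dimension, so we can tell
# when two feature sets contradict each other.
DIM = {"feminine": "gender", "masculine": "gender", "neuter": "gender",
       "singular": "number", "plural": "number",
       "nominative": "case", "genitive": "case",
       "dative": "case", "accusative": "case",
       "indefinite": "definiteness", "definite": "definiteness"}

def conflicts(feats_a, feats_b):
    """True if the two feature sets assign different values to one dimension."""
    chosen = {DIM[f]: f for f in feats_a}
    return any(DIM[f] in chosen and chosen[DIM[f]] != f for f in feats_b)

def merge(a, b):
    """Filter each table's options by the other's inherent features, join
    compatible rows, and concatenate forms (assumes continuous phrases)."""
    options = {}
    for ka, fa in a.options.items():
        if conflicts(ka, b.inherent):
            continue
        for kb, fb in b.options.items():
            if conflicts(kb, a.inherent) or conflicts(ka, kb):
                continue
            options[ka | kb] = fa + fb
    return Table(a.inherent | b.inherent, options)

stadt = Table(frozenset({"feminine"}), {
    frozenset({"singular", c}): ["Stadt"]
    for c in ("nominative", "genitive", "dative", "accusative")})

# A tiny slice of the adjective table for "süddeutsch" (indefinite,
# feminine singular only, for brevity):
sueddeutsch = Table(frozenset(), {
    frozenset({"indefinite", "feminine", "singular", "nominative"}): ["süddeutsche"],
    frozenset({"indefinite", "feminine", "singular", "dative"}): ["süddeutschen"]})

np_table = merge(sueddeutsch, stadt)
key = frozenset({"indefinite", "feminine", "singular", "dative"})
print(" ".join(np_table.options[key]))  # süddeutschen Stadt
```

Note how the noun's inherent gender (feminine) filters the adjective's rows, and the case dimension keeps only rows where adjective and noun agree, which is exactly the agreement behaviour the proposal is after.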
Most concrete natural language generation functions can be built by using the merge function, thus reducing the implementation work in many languages.
Summary
The table type and the merge function would become the backbone for natural language generation, enabling the support of agreement which is crucial in many languages.
To give an example of how this could look, still using German as the example:
- We would create a type for German nouns, which consists of a table type that has
- one inherent feature, the grammatical gender with the three values feminine, masculine and neuter (proposed here)
- And options based on the two-valued grammatical number and the four-valued grammatical case
- We would create a type for German adjectives, which consists of a table type that has
- No inherent features
- And options based on the three-valued grammatical gender mentioned above, the two-valued grammatical number, the definiteness (two-valued), and the four German cases
- We would create types for German noun phrases, verb phrases, and for sentences
Furthermore, we would create a number of functions. Relevant for our example are the following:
- Create a sentence from a noun phrase and a verb phrase
- Create a verb phrase from a German verb, a preposition, and a noun phrase for the object in the given case
- Create a noun phrase from an adjective and a noun
Now we can create a noun phrase from the adjective and noun given in the example above. Internally, that would use the merge function we have described above.
That noun phrase can then be used as the argument for a verb phrase that takes the verb "kommen", the German preposition "aus", and the noun phrase we just constructed, and states that the latter has to be in the dative. We would select the dative form for the verb phrase and concatenate the parts of the sentence.
Finally, we can build the whole sentence from a noun phrase (which we may just create on Wikifunctions from the string “Anton”, since there is currently no such German proper noun) and we would get the sentence “Anton kommt aus einer süddeutschen Stadt”.
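The three functions listed above can be sketched end to end as follows. Everything here is a hypothetical illustration: the function names are invented, and the noun phrase's per-case renderings (with the matching indefinite article) are written out by hand where the real system would obtain them from the merge of the adjective and noun tables.

```python
# Hypothetical end-to-end sketch of the example sentence (all names and
# helpers are invented for illustration, not a Wikifunctions API).
# A noun phrase is modelled as a function from a case to a string.

def noun_phrase_anton():
    # Proper noun created directly from a string, as suggested above;
    # "Anton" is invariant across the cases used here.
    return lambda case: "Anton"

def noun_phrase_sueddeutsche_stadt():
    # Pre-rendered per case, with the matching indefinite article; in
    # the proposal this would come from the merged adjective+noun table.
    forms = {"nominative": "eine süddeutsche Stadt",
             "accusative": "eine süddeutsche Stadt",
             "dative": "einer süddeutschen Stadt"}
    return lambda case: forms[case]

def verb_phrase(verb_form, preposition, obj_np, case):
    # "Create a verb phrase from a German verb, a preposition, and a
    # noun phrase for the object in the given case."
    return f"{verb_form} {preposition} {obj_np(case)}"

def sentence(subject_np, vp):
    # "Create a sentence from a noun phrase and a verb phrase."
    return f"{subject_np('nominative')} {vp}."

vp = verb_phrase("kommt", "aus", noun_phrase_sueddeutsche_stadt(), "dative")
print(sentence(noun_phrase_anton(), vp))
# Anton kommt aus einer süddeutschen Stadt.
```

The design choice illustrated here is that case selection stays with the verb phrase, which knows that "aus" governs the dative, while the noun phrase remains open to all cases until it is placed in a sentence.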
Without introducing syntactic types, we would miss a crucial abstraction level here, making the implementation of such sentence generation much more fragile.
This proposal lives within Stage 2 of the path towards Abstract Wikipedia. Note that this does not yet presuppose nor determine how abstract content will be structured. But no matter what the answer to that question will be, it makes it easier to scale up writing functions that can generate sentences in many languages.
Discussion and comments
Please use the reply link to add your comments --GrounderUK (talk) 19:02, 18 February 2025 (UTC)
- Why have language-specific types? GrounderUK (talk) 19:08, 18 February 2025 (UTC)
- It makes more sense to me to extend the Z11 pattern, so that we have, for example, monolingual noun (phrase) variant and monolingual verb (phrase) variant. It may be considered a disadvantage that such an approach does not prevent a German noun pattern from being used in a French noun phrase. However, when considering phrases rather than lexeme forms, it is not clear that focusing on syntax rather than semantics is the better approach. A noun phrase representing an indirect object, for example, may have a different form or position from a direct object whether or not any word is marked by a specific inflection. Whether we describe this as “indirect object”, “dative” or something else hardly matters, but since we intend to generate natural language from language-neutral semantic representations, I think it makes more sense to refine by semantic function first, language second and syntactic considerations last. GrounderUK (talk) 19:59, 18 February 2025 (UTC)
- Oh, I am not saying that syntactic types should be done instead of semantic or abstract types. I think that both are necessary. The suggestion here is firmly on the syntactic side. I totally agree that we also need something to express the semantic side as well.
- If I understand your suggestion, it is that we should figure out the semantic step first, before we get to the syntactic layer. Hmm, I see your point. I was thinking we could build up from the concrete towards the abstract. But maybe the other approach has benefits that I missed, and we should leave this rest for now and see if we need it once the abstract content questions are a bit more settled. --Denny (talk) 13:20, 19 February 2025 (UTC)
- It’s hard to say exactly what I mean and harder still to mean exactly what I say. I’m not saying that we should figure out the semantic step first, I’m saying that the syntactic step is downstream of the semantic in the NLG pipeline. We need to think ourselves a little upstream of the Great Lexemification Rapids to provide sufficient context to evaluate options, I think. I believe that there is sufficient overlap between the semantic and syntactic challenges that it is at least plausible that the solutions can evolve along similar lines. But I see the end-to-end pipeline as a sequence of successive refinements, where the availability of Wikidata content and Wikifunctions capabilities together determine the available refinements in any particular case. It is conceptually convenient to consider that at each stage a function converts content objects from one type to the next, the advantage of which is that the functions available for the next refinement are readily apparent, but it will be convenient, in practice (I’m sure), to avoid multiplying up these types by the available languages. GrounderUK (talk) 15:04, 19 February 2025 (UTC)
- Strongly oppose, this adds debt of creating new enums and pollutes the wiki. I would be fine with this if there were software changes to make the footprint of an enum smaller. Feeglgeef (talk) 19:35, 18 February 2025 (UTC)
- That's not a bad idea, to make enums more light weight. You mentioned it before. @Jdforrester (WMF) was already suggesting something -- let's see what'll come out of that. --Denny (talk) 13:15, 19 February 2025 (UTC)
- Oppose, even if there wasn't so much 'debt'/'pollution', since the extra complexity appears unnecessarily limiting and is unlikely to handle morphological exceptions (whether due to syntactic or semantic constraints) in a given language. Languages that are more analytic (and thus with closer to one form per lexeme), and those that are more agglutinative/polysynthetic (and thus with separate inflectional morpheme lexemes), are also unlikely to benefit from such types. Mahir256 (talk) 23:25, 18 February 2025 (UTC)
- That's actually a great point -- there is no need to do it for all languages, and for all parts of speech, but only for the ones where it makes sense! --Denny (talk) 13:16, 19 February 2025 (UTC)
- Wikifunctions lexemes? I’ve always assumed that we would need a type for representing lexemes that are absent from Wikidata or incomplete. We sort of agreed in our discussion on Wikifunctions:Type proposals/Wikidata based types. How does the current proposal align with or shape our approach to this? GrounderUK (talk) 09:39, 20 February 2025 (UTC)
- Yes, I think so too. And one of the constructors for that type should be a Wikidata Lexeme. So that we can then use the resulting type in constructing phrases.
- So, basically, we could have that new type Wf lexeme, and it can be created either from a Wikidata Lexeme, or from a Wikidata Item, or from a string (or set of strings), and these then would be usable for building a phrase. It seems we agree on that part.
- What I am saying additionally is that we shouldn't have a single type for Wikifunctions lexeme, but several, based on their part of speech. But I am happy to find already agreement on the first step. I think that eventually we'll get to the whole way anyway :) --Denny (talk) 10:19, 26 February 2025 (UTC)
- Yes, I agree with a Lexeme type for different parts of speech; I’m just not convinced that we need separate sets of Lexeme types for each language. So we have Noun but not English noun. Clearly, a function that operates only on Nouns for the English language could usefully distinguish by type from one that operates on Nouns for the German language. One way of doing this is to pair the Noun type with the relevant language, as a “generic” type. To be clear, though, I don’t object to there being multiple Noun (and other part-of-speech) types to cater for the different patterns that languages follow with regard to number, gender, case etc., so long as the types correspond to the patterns followed rather than the languages that follow those patterns. GrounderUK (talk) 11:14, 26 February 2025 (UTC)
- comment I'm not sure what to think. On one hand, sure an English noun and a German noun is not the same thing and should be handled differently ; but on the other noun, even English nouns are not all the same ("dog"@en has singular and plural while "clothes"@en has only a plural as it's a plurale tantum) and English and German adverbs are the same. So a type "lexical category + language" seems both too wide and too narrow. Also, why not go more "atomic" and having two types : one for category and one for language? Cheers, VIGNERON (talk) 08:16, 24 February 2025 (UTC)
- Comment: I don't think that I understood how exactly is the table with Stadt, Städte, Städten, etc. is generated. Is the "Städten" string, for example, supposed to be generated by code stored on Wikifunctions? Or is this proposed type supposed to be a representation for Wikifunctions of the data that is already within Wikidata? --Amir E. Aharoni (talk) 14:17, 3 March 2025 (UTC)
- Oppose, since (as already pointed out in previous comments) the bottom-up strategy to language generation implied by this proposal creates a lot of useless computation and memory (with a complexity that could grow exponentially), besides failing to account for the exceptions that all natural languages are filled with. I think that to generate human language text we should instead use a top-down approach (starting from an abstract representation of an entire sentence and from there generating the single components), since I think it's also closer to how our brain actually generates sentences. Dv103 (talk) 13:37, 16 June 2025 (UTC)