Wikifunctions:Project chat
Welcome to the Project chat, a place to discuss any and all aspects of Wikifunctions: the project itself, policy and proposals, individual data items, technical issues, etc.
Other places to find help:
- Wikifunctions:Administrators' noticeboard
- Wikifunctions:Report a technical problem
- Wikifunctions:FAQ
SpBot archives all sections tagged with {{Section resolved|1=~~~~}} after 1 day and sections whose most recent comment is older than 30 days.
Natural language functions
Hello everyone,
We’re working on functions that return natural-language outputs, and we should think about how to handle both these outputs and their inputs consistently. For example:
- How should we represent inputs like Wikidata’s “grammatical features” (gender, number, etc.) across different languages?
- How do we decide whether a function’s output should be a simple string or monolingual text?
- How can we create functions that work across languages and sentence structures, or that can be combined with other functions to do so?
What are your thoughts? Any suggestions or examples? GrounderUK (talk) 13:55, 1 February 2025 (UTC)
- Thanks. These are important questions. I don't have a good answer for now, but I did see some strange and inconsistent things, so we need some clarity. For example, "[gender] is a [country] [professional]", English Lexemes (Z21765) uses the datatype Sign (Z16659) as the input for gender (it works in this case but it still feels wrong, and it won't work for most languages and grammatical features); there is also Conjugate regular -er verb (Z21617), which uses an arbitrary Natural number (Z13518) (see the two proposals WF:FRENCHSUBJ and WF:FRENCHTENSE by MolecularPilot).
- My two cents to help move forward:
- Grammatical features can be strange (natural languages are full of exceptions and unexpected things; in Breton, for example, prepositions are conjugated like verbs), so whatever we choose needs to be flexible enough.
- At the same time, I guess we don't want to recreate a list/datatype for each language, as most behave similarly and it would be redundant (6000+ languages, with most of the non-genderless ones having masculine/feminine; see https://wals.info/feature/30A for instance).
- Some of these functions will rely on Wikidata, so Wikifunctions should understand and accept the grammatical features used on Lexemes (and there are a lot, 989 right now: https://qlever.cs.uni-freiburg.de/wikidata/7n3eYj ).
- Right now, it seems most natural language functions return a simple string (String (Z6)). It's not wrong, but a monolingual text (Monolingual text (Z11)) would be cleaner and clearer. Removing the language tag afterwards is super easy (we already have string of monolingual text (Z14396)), but adding it could be trickier: when the label of a function says "English", is it American English, British English, both, or neither? A monolingual text would be explicit.
- Cheers, VIGNERON (talk) 14:27, 1 February 2025 (UTC)
- The 989 grammatical features are probably all useful, but the first few I looked at (e.g. "singular") would just be enumeration values for a type (in this case "grammatical number", of which there are currently 24 possible values). I think we should have a single type for grammatical number, and some languages will hardly use any values, but every language would find what they need. Similar for other categories of grammatical features. At present, this would result in long dropdowns (e.g. 24 items of which I might only know what two of them are), but if sorted reasonably well, I think that would be fine. 99of9 (talk) 10:55, 2 February 2025 (UTC)
- If we are ramping up the use of monolingual texts (I'm not opposed, when functions are returning sentences or phrases directly for a language), then we should start building quite a few more monolingual text helper functions (e.g. join texts - even the simple ones haven't been written). I'm a little bit concerned that we'll need separate functions to generate each of the monolingual texts for each of the English variants. It would be good to call a single one and share results whenever possible. 99of9 (talk) 11:03, 2 February 2025 (UTC)
- As I suggested on Telegram, I think the “gender” input for "[gender] is a [country] [professional]", English Lexemes (Z21765) is best understood as a placeholder for a noun phrase. The function supplies the copula and sentence complement for an indeterminate person who is the grammatical subject. This is an English language (family) function that assumes a third-person (semantically) singular subject (a living human being who is currently active in some profession or role). For English, such a context is not sufficient to determine the required placeholder (a pronoun), because third-person (semantically) singular pronouns are marked for “gender”. This additional context is therefore required as input. The use of Sign (Z16659) here is unfortunate, but we do not have a general-purpose Type to represent “one of three options” (and I’m not suggesting we should). A similar solution was not available for Conjugate regular -er verb (Z21617), hence the use of arbitrary natural numbers (and I’m not suggesting we should do that, either).
- Although I don’t object to "[gender] is a [country] [professional]", English Lexemes (Z21765), I don’t believe it provides a useful pattern for future functions. To be useful in a Wikipedia context, the end result (like “they are an American actor”) would need to change when the person gives up acting or dies, or changes their pronoun preference or nationality. Of course, we could have a separate function to handle the past tense or whatever and rely on prior functions to call a different function when the context changes, but I don’t think that would be sensible. In a more multilingual context, we would presumably characterise the context in a language-neutral way but expect (more) language-specific functions to determine the form of the copula (if any), and the required forms of any article, adjective or noun (or, indeed, a different sentence structure altogether). None of this is straightforward but it is characterising the context that poses a particular challenge, as more languages are considered.
- The “grammatical features” for a lexeme form on Wikidata suggest a way forward, since they can account for the variety of forms that are available. In effect, a particular function will produce sensible results for some subset of all supported contexts. However, we need to be able to handle the normal cases where the available context provides information that is unnecessary for the function, as well as the cases where the function supports distinctions that the context does not. This suggests the need for some intermediate interface function(s) that can reduce or extend the available context according to the expectations of the function being called. For example, if the mood is not available in context when calling a French conjugation function, it would default to the indicative mood. This implies that the user interface for such a function would support the provision of the context as an input object, presumably (at its most basic) as a list of grammatical features. How we could restrict such a list to values for relevant grammatical features is an open question (see, for example, phab:T379338 and Wikifunctions talk:Representing identity#Functionally constrained lists). GrounderUK (talk) 12:06, 2 February 2025 (UTC)
- Using grammatical features in this way has now been prototyped at Breton verb form (Z22097). This calls one of these existing functions based on a supplied list of grammatical features. It has two implementations but these are set up to call only a few functions while we evaluate this approach. GrounderUK (talk) 21:20, 2 February 2025 (UTC)
- I’ve also created grammatical features list from Wikidata items (Z22107) to demonstrate the expansion of composite items like first-person singular (Q51929218) into its basic components. It currently recognises only three such items and passes any other grammatical features through unaltered.
- After discussions with @99of9 and @Feeglgeef on Telegram, I created an implementation of Breton verb form (Z22097) that uses N-ifs (Z19601). This allows a flatter conditional structure similar to a case construct, which is easier to work with but doesn’t scale well to support a large number of function calls. This implementation currently supports calls to seven of the Breton conjugation functions. GrounderUK (talk) 13:25, 3 February 2025 (UTC)
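The "intermediate interface" idea discussed above (expand composite features, drop context the callee does not need, and fall back to a default such as the indicative mood) can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `expand`, `fit_context`, `COMPOSITES` and `CATEGORY_OF` are hypothetical, plain labels stand in for Wikidata item references, and nothing here reflects how Breton verb form (Z22097) or grammatical features list from Wikidata items (Z22107) are actually implemented.

```python
# Hypothetical sketch: plain labels stand in for Wikidata item references.

# Composite features and the basic components they expand into.
COMPOSITES = {"first-person singular": ["first person", "singular"]}

# Which grammatical category each basic feature belongs to.
CATEGORY_OF = {
    "first person": "person",
    "singular": "number",
    "plural": "number",
    "present": "tense",
    "indicative": "mood",
    "masculine": "gender",
}


def expand(features):
    """Replace composite features by their components; pass others through."""
    expanded = []
    for feature in features:
        expanded.extend(COMPOSITES.get(feature, [feature]))
    return expanded


def fit_context(features, expected, defaults):
    """Reduce the context to the expected categories and fill gaps from defaults."""
    context = {}
    for feature in expand(features):
        category = CATEGORY_OF.get(feature)
        if category in expected:  # ignore context the callee does not use
            context[category] = feature
    for category in sorted(expected):
        if category not in context:
            if category not in defaults:
                raise ValueError(f"missing grammatical feature: {category}")
            context[category] = defaults[category]
    return context


french_verb_context = fit_context(
    ["first-person singular", "present", "masculine"],
    expected={"person", "number", "tense", "mood"},
    defaults={"mood": "indicative"},
)
```

In this sketch the unused "masculine" feature is silently dropped and the missing mood defaults to "indicative", while a missing category without a default raises an error.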
- No opinion
- I'd actually quite like to use monolingual texts for ones that we don't intend to use on Wikipedia
- Generally I think we should try to have the same input types, even if that means a lot of redundant inputs.
- Feeglgeef (talk) 16:50, 1 February 2025 (UTC)
- I like it when outputs are simple strings. I usually try not to care about types while programming and leave that decision to the language interpreter, so this seems the easiest option to me. As different people implement functions, it will not be possible to be completely consistent here. I prefer referring to objects. As for using functions across languages, I need to think about how far that is possible. I will write something about it, maybe in the next few days. Hogü-456 (talk) 22:58, 1 February 2025 (UTC)
- Thanks.
- Do you have any concerns about the use of grammatical features?
- Same here, although my comment seems to have been overlooked (except by you).
- The problem I have with redundant inputs is that they are liable to be inconsistent with the grammatical features that are actually present on Wikidata. The approach I’ve adopted so far with Breton verb form (Z22097) and grammatical features list from Wikidata items (Z22107) is tolerant of redundancy and intolerant of deficiencies, but that is more “line of least resistance” than a firm conviction.
- GrounderUK (talk) 14:41, 3 February 2025 (UTC)
- Since Return monolingual text from grammatical features (Z19530) takes a list of Wikidata item reference (Z6091), I'd expect other functions to use that too (though I'm not sure how you'd include number). For languages with only a few cases, there could be persistent (named) lists for each as a shorthand. YoshiRulz (talk) 06:41, 2 February 2025 (UTC)
- Please see, for example, present indicative of “labour” 1st singular (Z22098) specifying grammatical number (Q104083) as singular (Q110786). We might also consider using expansions of items like first-person singular (Q51929218), as suggested by User:VIGNERON on Talk:Z22097 GrounderUK (talk) 22:58, 2 February 2025 (UTC)
- My preference would be to introduce precise enumerations for grammatical features. For example, we would have one enumeration for grammatical genders for languages that have feminine and masculine genders (e.g. for Spanish and French), and one for languages that have three grammatical genders such as German, and so on. Then there are individual enumerations for grammatical numbers: there's one for languages with singular and plural, one for singular, dual, and plural, etc. And each language-part of speech would only use the relevant enumerations.
- This means creating quite a few enumerations, but I think that's OK.
- Furthermore I think we should have individual types for each pair of language and part of speech, i.e. a type for English noun, a type for Breton verb, a type for Hausa verb, a type for Ukrainian adjective, etc. And each of these would be using the right enumerations as created above.
- I know that it is a bit of work, but in the end it allows the user experience to provide much more guidance.
- I think this is a really important discussion, and it would be good to get this right! --Denny (talk) 15:25, 3 February 2025 (UTC)
- I would be happy to create a few grammatical gender enumerations for now, e.g. one for feminine / masculine and one for feminine / masculine / neuter, and maybe a few more, depending on the languages people would like to work on. Creating enumerations is not that much work, and I agree that it would be good to get rid of using sign to represent gender rather sooner than later. --Denny (talk) 15:28, 3 February 2025 (UTC)
- Instead of this, could we have a "flexible enum" type that uses key-value pairs for dropdowns?
- {"Masculine": 1, "Feminine": 2, "Neuter": 3}
- This would create a dropdown, and the selected value would be passed into the function. It would still be reusable and would scale much better. Feeglgeef (talk) 15:47, 3 February 2025 (UTC)
- The idea of having a separate type for each pairing of language and part of speech sounds pretty scary. Or were you thinking of generic types?
- I’m not opposed to specific enumerations for grammatical features like “gender” and “number” but naming might be a problem. For pronouns and verb conjugations, a pairing of gender and number (like first-person singular (Q51929218), as suggested on Talk:Z22097) should be considered (and deferred).
- It seems important that we have seamless conversion to and from Wikidata item references so that we can select appropriate forms. This is why I still prefer to use Wikidata item reference (Z6091) directly. But I recognise the usability advantages of a dropdown that is limited to the tiny subset of all Wikidata items that are the relevant kind of grammatical feature. This is why I favour “functionally constrained lists” or something similar, in the medium term. This month, I support a few gender and number types, as you suggest. I would include common/neuter for Swedish etc. (For English, we should think about gender-neutral/masculine/feminine nouns like actor/actress and (s)he/they/it singular pronouns.) GrounderUK (talk) 15:21, 5 February 2025 (UTC)
- Yes, I fully agree that we should have a smooth transition from Wikidata Items to e.g. enumeration values. So if we had, e.g., French tenses, with a value for "Présent", we should have functions that map to the appropriate Item and back. I see that Gregorian calendar month to Wikidata reference (Z22240) is heading in exactly that direction, thank you!
- This would allow us to have exactly the values that we need, without needing to force Wikidata to follow the same structure. For example, if we decide that person and number should be combined, we could have mappings to the number, mappings to the person, and mappings to combinations.
- Regarding separate types, I wasn't thinking of generic types, but of normal types. Do you find it scary due to the potential number of types, or due to other reasons? I want to write up more on my thinking around that, so questions are good. --Denny (talk) 12:29, 7 February 2025 (UTC)
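A small enumeration with a two-way mapping to Wikidata items, shared by several languages, might look something like the sketch below. It is not how Wikifunctions types are defined; the names `SingularPlural`, `to_item`, `from_item` and `NUMBER_ENUM_FOR` are hypothetical. The QID for singular is the one cited earlier in this thread; the QID for plural is an assumption that should be checked against Wikidata.

```python
from enum import Enum


class SingularPlural(Enum):
    """Hypothetical grammatical-number enumeration for singular/plural languages."""

    SINGULAR = "Q110786"  # singular, as cited in this thread
    PLURAL = "Q146786"    # assumed QID for plural; verify on Wikidata


def to_item(value):
    """Map an enumeration value to its Wikidata item ID."""
    return value.value


def from_item(qid):
    """Map a Wikidata item ID back to the enumeration value, if it has one."""
    try:
        return SingularPlural(qid)
    except ValueError:
        raise ValueError(f"{qid} is not a value of this enumeration") from None


# The same enumeration can be reused by every language with this number system.
NUMBER_ENUM_FOR = {"French": SingularPlural, "Spanish": SingularPlural}
```

A language with a dual would point to a different, equally reusable enumeration rather than extending this one.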
- 1000 languages * 10 enums per language * 10 items per enums gives us a rough estimate of 100,000 objects. The number is definitely the problem. Feeglgeef (talk) 13:48, 7 February 2025 (UTC)
- Oh, I don't think that's how it would play out. The enums are often reusable across different languages. For example, I expect a few grammatical gender enums, but certainly not one per language, more like in the area of low tens. Same thing for grammatical numbers, where there are probably fewer than ten different ones. --Denny (talk) 13:53, 7 February 2025 (UTC)
- You mentioned one per part of speech per language. That's not reusable (and would have much much more than 10 items). Perhaps the team can provide a way to filter a lexeme/item search box with a generic type? I've created a task for this, phab:T385895. Feeglgeef (talk) 16:54, 7 February 2025 (UTC)
- Ah, yes, but those wouldn't be enums. But yes, I currently don't see how those could be made generic, nor how this could be avoided without the system becoming very fragile. But I am afraid this requires an essay to explain. I will do so, but give me a few days. Thanks for prompting. --Denny (talk) 07:27, 8 February 2025 (UTC)
- I made proposals for grammatical gender enumerations: masculine / feminine, masculine / feminine / neuter, common / neuter, animate / inanimate, in order to demonstrate what I mean. I am not sure if it makes sense to discuss them individually; instead, I suggest discussing them all together here. --Denny (talk) 14:52, 7 February 2025 (UTC)
- Hi @Denny, just a small question about "Gregorian calendar month to Wikidata reference": I found that the returned object in the return statement works fine in JS, but in Python it always gives errors. I tried rewriting it from scratch but got the same object. Shouldn't it have the same form in Python and JS? Thx --Mohanad (talk) 20:21, 7 February 2025 (UTC)
- The way I did it for JS was to copy from and adapt one of the converters from code. So for Python, I would start with Python converter to natural number (Z13532). Is that roughly what you already tried? Would you like to try this way? I can try too if you like. @GrounderUK also mentioned problems when he tried with Python a few weeks ago. 99of9 (talk) 01:32, 8 February 2025 (UTC)
- I tried at month to Wikidata reference, python (Z22256) (direct copy of the natural number converter). I have the same problem as you. So yes @DVrandecic (WMF) this is worth a better investigation. 99of9 (talk) 02:14, 8 February 2025 (UTC)
- Thanks @Mohanad for raising it, and @99of9 for looking into it! I looked into it and am at a loss too. I will raise it with the team next week. Thanks! --Denny (talk) 08:17, 8 February 2025 (UTC)
- @Denny Thx in advance. @99of9 I followed the function model to write what I thought would work, and it didn't. Thank you for contributing to the discussion --Mohanad (talk) 09:05, 8 February 2025 (UTC)
- @Mohanad: Cory on our team fixed it. This is not yet documented, so yeah, there wasn't much chance to get this right, apologies. But now it works! --DVrandecic (WMF) (talk) 13:53, 10 February 2025 (UTC)
- @Denny oh that's different, thx again --Mohanad (talk) 14:20, 10 February 2025 (UTC)
- Yes, sorry for the missing documentation! --Denny (talk) 18:57, 10 February 2025 (UTC)
- I was close! Feeglgeef (talk) 14:36, 10 February 2025 (UTC)
- Closer than me! --Denny (talk) 18:57, 10 February 2025 (UTC)
- Maybe it’s just scary because I’ve always assumed we definitely wouldn’t be having separate types for each language. I won’t try to second guess your thinking now, if you’re planning to write it up in the next week or two. At the phrase/lexeme level, we just need to refer to the linguistic context. The language, part of speech, person, number, tense etc are all included in that context and if you multiply them up, you end up with a big number. The two biggest contributors to the final answer are language and “etc”. We can quantify language at around a thousand, but I don’t have a good sense of how large “etc” might be. In any event, since you mention only language x part of speech, my concern is how we represent functions that determine the part of speech and whose return type represents “noun or verb (infinitive or participle)”, for example (“I need to think” or “I need a think” or “thought is needed” or “thinking is necessary” etc). GrounderUK (talk) 10:55, 8 February 2025 (UTC)
More detailed request
Moved from Administrators' noticeboard
Hi, I just noticed that the template here doesn't have any prominent information indicating exactly what permission is requested, which makes the sub-pages of the archive page here (all requests together) a bit less clear.
Maybe a line like this one used on mediawiki wiki will be helpful. --Mohanad (talk) 10:37, 3 February 2025 (UTC)
Wikifunctions & Abstract Wikipedia Newsletter #188 is out: Invitation to the Natural Language Generation Special Interest Group
There is a new update for Abstract Wikipedia and Wikifunctions. Please, come and read it!
In this issue, we present a proposal to restructure our Natural Language Generation Special Interest Group (NLG SIG) meeting, we announce the creation of a new type, and we take a look at the latest software developments.
Want to catch up with the previous updates? Check our archive!
Enjoy the reading! -- User:Sannita (WMF) (talk) 17:17, 6 February 2025 (UTC)
- @99of9 I saw that you mentioned the algebraic formula implementation of Kleenean and in the Function of the week section; just a random question, is there a list of algebraic equation equivalents of three-valued logical functions, and are there equivalents in Boolean algebra as well? Xeroctic (talk) 12:03, 8 February 2025 (UTC)
- @Xeroctic I got that one from the three-valued logic Wikipedia page. Sorry, I don't know if they're neatly listed somewhere. It seems like something that could be done computationally without too much difficulty. 99of9 (talk) 12:10, 8 February 2025 (UTC)
New Codex table
Hi, I don't know if it's just me or everyone, but there's a display problem with the word "Passed" that makes it wrap onto new lines letter by letter in some table cells. You can see that in the test table of this function. I think some CSS rule is responsible for that (it appears in the browser dev tools), maybe overflow-wrap: anywhere;. The issue depends on column width: in narrow columns, the word "Passed" appears like that. --Mohanad (talk) 20:04, 11 February 2025 (UTC)
- Yes, this sometimes happens for me too. 99of9 (talk) 09:42, 12 February 2025 (UTC)
- I cannot reproduce the issue on the provided link in Firefox, Chrome or Safari. Would you mind telling me which browser and dimensions (screen size) you are using when you see this? DSmit-WMF (talk) 10:26, 12 February 2025 (UTC)
- Hi @DSmit-WMF see this function --Mohanad (talk) 10:41, 12 February 2025 (UTC)
- Or this --Mohanad (talk) 10:45, 12 February 2025 (UTC)
- I am not seeing it locally so my guess is this reverted mixin in Codex.
- I asked internally. I will get back to you! DSmit-WMF (talk) 11:16, 12 February 2025 (UTC)
- I'm seeing it right now at [1] on Chrome in a full screen browser on a 1920x1080 display. 99of9 (talk) 11:30, 12 February 2025 (UTC)
- @99of9 I tested it on different desktop & mobile browsers and got the same issue; it also affects the words "composition" & "javascript" in the Implementations table. --Mohanad (talk) 11:49, 12 February 2025 (UTC)
- Interesting. I don't get that behaviour in the implementation table. 99of9 (talk) 11:59, 12 February 2025 (UTC)
- Just confirmed: this week's Codex release will fix the table behaviour. DSmit-WMF (talk) 14:39, 12 February 2025 (UTC)
equality function for natural languages
Please can we add same language (Z14326) as the equality function for Natural language (Z60)? As far as I understand, the only consequence of adding it to the Type is that when we create tests on functions which return languages, it shows up as the default result validator. This would almost always be an improvement over having no default and having to choose a function every time (especially for newcomers who don't know which equality functions are on offer). 99of9 (talk) 04:03, 13 February 2025 (UTC)
- @99of9: Done. If there is disagreement on this, we can still change it. Thanks for the suggestion! --DVrandecic (WMF) (talk) 14:12, 13 February 2025 (UTC)
- I very much disagree with this change. I think that to newcomers, prefilling often seems mandatory (which may mean we need UI improvements), but I also think that to an average person, American English and British English are the same language, so I oppose this change. Feeglgeef (talk) 21:44, 13 February 2025 (UTC)
- I don't think we should treat American English and British English as identical or the same. They have different QIDs and different IETF codes.
- On the other hand, it would make a lot of sense to introduce a weaker notion of "similar enough", which could be based solely on the base language code being the same (in which case "en-GB" and "en-US" could be the same, just as "pt-PT" and "pt-BR" could be the same), and even a mutual-intelligibility test, in which case e.g. Urdu and Hindi or Serbian and Croatian may pass (I am not a linguist and might have gotten my examples wrong).
- For testing though, I think that strict identity is the right thing to test. --Denny (talk) 10:35, 14 February 2025 (UTC)
- An equality with the semantics of "is mutually intelligible" would be neither symmetric nor transitive, limiting how it could be used. YoshiRulz (talk) 17:37, 14 February 2025 (UTC)
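The difference between strict identity, a weaker "same base code" comparison, and a relation that fails transitivity can be illustrated with a short sketch. It makes no claim about how same language (Z14326) is implemented: the function names are hypothetical, language tags are treated as plain strings, and the languages "A", "B" and "C" are made-up placeholders for a dialect chain, not real data.

```python
def same_language(a, b):
    """Strict identity: the full language tags must match."""
    return a.lower() == b.lower()


def same_base_language(a, b):
    """Weaker notion: only the primary subtag (before the first '-') must match."""
    return a.split("-")[0].lower() == b.split("-")[0].lower()


def is_transitive(pairs):
    """Check that (a, b) and (b, d) in the relation always imply (a, d)."""
    return all(
        (a, d) in pairs
        for (a, b) in pairs
        for (c, d) in pairs
        if b == c
    )


# Made-up chain: A and B understand each other, B and C do too, but A and C do not.
INTELLIGIBLE = {
    ("A", "A"), ("B", "B"), ("C", "C"),
    ("A", "B"), ("B", "A"),
    ("B", "C"), ("C", "B"),
}

strict = same_language("en-GB", "en-US")            # False
weak = same_base_language("en-GB", "en-US")         # True
chain_is_transitive = is_transitive(INTELLIGIBLE)   # False
```

Comparing base codes is still a proper equivalence relation, so it can serve as an equality; the made-up chain shows why an intelligibility-style relation cannot.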
Byte type
We implemented the proposal to fix up the Byte type. We removed the markers from the type to invite usage, and now more functions can be created for the Z80/Byte type. We are inviting you to suggest display and read functions for this type, too. If you find any issues with the type, please let us know. Enjoy! -- DVrandecic (WMF) (talk) 15:06, 13 February 2025 (UTC)
Wikifunctions & Abstract Wikipedia Newsletter #189 is out: Restricting the World, redux
There is a new update for Abstract Wikipedia and Wikifunctions. Please, come and read it!
In this issue, we have an essay from Denny, we discuss the fix to the Byte Type, and we take a look at the latest software developments.
Want to catch up with the previous updates? Check our archive!
Enjoy the reading! -- User:Sannita (WMF) (talk) 11:19, 14 February 2025 (UTC)