
Wikifunctions talk:Abstract Wikipedia/2025 fragment experiments

From Wikifunctions

Danish?

@Fnielsen Danish seems to be missing; would you like to help add support for it? So9q (talk) 05:16, 15 September 2025 (UTC)

Proposed recommendation: Fragments should return Z11/monolingual strings

[Starting this as prompted by @Hogü-456 in Wikifunctions:Project chat#c-GrounderUK-20251004090900-Jdforrester (WMF)-20251003171500.]

I propose that in our recommendations, we say that fragment providers should return Z11s, so that users know which language we used in practice.

This is because the difference may be relatively trivial (e.g. you asked for American English and we only have international English), may have odd words (e.g. you asked for Portuguese and we only have Brazilian Portuguese), or may be non-idiomatic more generally (e.g. if you ask for something in Hong Kong Chinese and we only have a Traditional Han Chinese fragment function).
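The recommendation can be illustrated with a minimal Python sketch. The type and the provider below are hypothetical stand-ins, not actual Wikifunctions objects; the point is only that returning a language-tagged value instead of a bare string lets the caller see when a fallback occurred.

```python
from dataclasses import dataclass

@dataclass
class MonolingualText:
    """Rough stand-in for a Z11: the rendered text plus the language
    actually used, so callers can detect fallbacks."""
    language: str  # e.g. "en", "pt-br"
    text: str

# Hypothetical fragment provider; the available renderings are illustrative.
_AVAILABLE = {
    "en": "Paris is the capital of France",
    "pt-br": "Paris é a capital da França",
}

def instance_fragment(requested_language: str) -> MonolingualText:
    """Return the fragment in the requested language if available,
    otherwise fall back, recording the language actually used."""
    if requested_language in _AVAILABLE:
        return MonolingualText(requested_language, _AVAILABLE[requested_language])
    base = requested_language.split("-")[0]
    for lang, text in _AVAILABLE.items():
        if lang.split("-")[0] == base:
            # Closest available variant, e.g. "pt-br" for a "pt-pt" request.
            return MonolingualText(lang, text)
    # Last resort: English, but the language tag makes that visible.
    return MonolingualText("en", _AVAILABLE["en"])
```

A caller asking for European Portuguese receives the Brazilian Portuguese text, and the "pt-br" tag on the result makes the substitution explicit rather than silent.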

What do people think? Jdforrester (WMF) (talk) 18:05, 6 October 2025 (UTC)

I'm okay with this being the best current advice. --99of9 (talk) 23:29, 6 October 2025 (UTC)
It’s okay so long as all the resolved text has the same language tag. Labels and lexeme form representations for a particular language variant will typically be absent when they don’t differ, and we don’t know when it’s reasonable to infer that fallback text is valid in a more particular variant. We can make the heroic assumption, of course, but logically we should somehow be distinguishing any part of the text that differs from the target language (which may be the whole text). GrounderUK (talk) 10:09, 7 October 2025 (UTC)
@GrounderUK: That's true. In the text "In L'Étranger, Zola made use of…", the outer fragment is English but the quoted object label is in French. Maybe we will need to revisit it. Jdforrester (WMF) (talk) 19:48, 7 October 2025 (UTC)
…noting that whether to emphasise foreign text will depend on the language pairing and the type of object. We wouldn’t emphasise the name of a person or place (in the target script) if there is no representation specific to the target language, but we might want to emphasise a form representation from the target language’s own lexeme, like (en) déjà vu (L565846). (In the case of L’Étranger, the italics arise from the fact that it is a title, but they served to remind me of this nicety.) GrounderUK (talk) 15:25, 8 October 2025 (UTC)
@GrounderUK: Indeed! Maybe the recommendation should be couched in wording that makes clear we expect this might later change? Jdforrester (WMF) (talk) 23:01, 8 October 2025 (UTC)
Yes, it becomes more relevant when we address language fallbacks. get label of item according to language fallbacks (Z24766) has always been an option but it returns text. select monolingual text labels from Wikidata item (Z24139) provides it with a list of Z11s in the preferred order, so a Z11 equivalent would be pretty trivial. I feel a separate topic coming on… GrounderUK (talk) 11:43, 11 October 2025 (UTC)
I have been working on an external prototype toolkit called Abstract Wiki Architect, and my experience strongly supports this recommendation to have a top-level all-language function that fans out to language-specific ones.
In the toolkit I model “instantiation” patterns very close to the fragments here (“X is a Y”, “X is a Y in Z”, “X is the Y of Z”, etc.). The architecture is:
  • one language-agnostic fragment definition, expressed as a semantic frame (roles like ENTITY, TYPE, LOCATION, ROLE, etc.);
  • a small number of family-level realisation engines (e.g. Romance, Slavic, Bantu, Japonic);
  • per-language configuration cards (JSON) that specify morphology, agreement, word order options, determiners, etc., plus a library of cross-linguistic constructions for these patterns.
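As a rough illustration of this split, here is a toy Python sketch. All names, configuration data, and outputs are invented for this example; it deliberately omits morphology and agreement (the French article is hard-coded here), which real engines must handle.

```python
# Toy sketch of the frame / engine / config-card split described above.
# All names and data are invented for illustration only.

# Language-agnostic fragment definition: a semantic frame with roles.
FRAME = {"construction": "instantiation", "roles": ["ENTITY", "TYPE"]}

# Per-language configuration "cards": word order plus function words.
# (Real cards must derive agreement; "une" is hard-coded for this demo.)
CONFIGS = {
    "en": {"order": ["ENTITY", "copula", "article", "TYPE"],
           "copula": "is", "article": "a"},
    "fr": {"order": ["ENTITY", "copula", "article", "TYPE"],
           "copula": "est", "article": "une"},
}

def realise(entity: str, type_: str, lang: str) -> str:
    """Family-level realisation engine, drastically simplified: walk the
    configured word order, filling role slots or emitting function words."""
    cfg = CONFIGS[lang]
    fillers = {"ENTITY": entity, "TYPE": type_}
    words = [fillers.get(slot) or cfg.get(slot, "") for slot in cfg["order"]]
    return " ".join(w for w in words if w)
```

Under these toy configs, realise("Paris", "city", "en") yields "Paris is a city" and realise("Paris", "ville", "fr") yields "Paris est une ville"; adding a further language of the same family would mostly mean adding another configuration card.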
A few observations that might be useful for the fragment work:
  • Having a single top-level fragment function with a stable semantic interface makes it much easier to add new languages later, because family- or language-specific implementations can evolve underneath without changing callers.
  • Grouping languages by family (with shared code and tests) has been effective: adding a new Romance or Slavic language is mostly configuration, not new code.
  • Keeping the output type consistently as “monolingual string/text” from the top level down (rather than sometimes Z6, sometimes Z11) simplifies both composition and testing, especially when generating larger sentences from fragments.
I am not suggesting that this prototype be adopted directly, but if it would be helpful I can try to map one or two of the current fragment functions to this style and share concrete examples of the semantic frame + family config for them. Réjean McCormick (talk) 19:55, 4 December 2025 (UTC)

Proposed recommendation: Fragments should map from a top-level all-language function to language-specific ones.

[Starting this as prompted by @Hogü-456 in Wikifunctions:Project chat#c-GrounderUK-20251004090900-Jdforrester (WMF)-20251003171500.]

I propose that in our recommendations, we say that fragment providers should be multi-lingual, and fan out to smaller, language-specific functions, rather than try to solve for all languages in one place.

This is because editing a big function is scary, and once a function is connected it is hard for people to get things fixed or to propose new languages, as connected functions will be protected.

What do people think? Jdforrester (WMF) (talk) 18:07, 6 October 2025 (UTC)

Yes, this is a good idea. I wish there were a way to copy the required inputs from the top-level all-language function, and to connect a language-specific implementation to the top-level all-language function from within the language-specific ones, with a field that says "this is an implementation in language xxx for function yyy". As far as I understand it, the top-level function has to provide as much information as is needed by the language whose fragment implementation requires the most information to generate the text. From my point of view, that is difficult to find out when implementing it for just one language. Hogü-456 (talk) 19:57, 6 October 2025 (UTC)
@Hogü-456: Yes, I think we can do better. I'll have a think about what work we can pitch to help here. Jdforrester (WMF) (talk) 19:56, 7 October 2025 (UTC)
Yes, this is correct. --99of9 (talk) 23:24, 6 October 2025 (UTC)
I broadly agree but I think we should do more to isolate the data access logic, which can be language neutral. One approach I tried is location in, composition (interleaving English) (Z26933). The idea is that making the function language-specific might be as simple as specifying appropriate linking texts. Of course, for some languages, the order in which the fetched terms should appear may be different, so I made promote indexed objects (Z27014) (which I haven’t tried using yet).
One advantage of the generalised function, labels for some Wikidata items (in one language) (Z26929), is that it avoids difficulties with the case where there are more than two fetched terms. I think it also supports content re-arrangement, like “Paris is the capital of France”, “France’s capital is Paris”, “the capital of France is Paris”, “Paris, the capital of France,…”. GrounderUK (talk) 10:01, 13 October 2025 (UTC)
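The interleaving-plus-reordering idea can be sketched in a few lines of Python. The helper names here are hypothetical and do not correspond to the actual Z-functions; they only show the shape of composing language-neutral fetched terms with language-specific linking texts, reordering first where a language's surface order differs.

```python
# Hypothetical sketch of "interleave fetched labels with linking texts".

def interleave(labels: list[str], linking_texts: list[str]) -> str:
    """Compose text by alternating linking texts and fetched labels:
    t0 + labels[0] + t1 + labels[1] + ... (len(linking_texts) == len(labels) + 1)."""
    parts = [linking_texts[0]]
    for label, text in zip(labels, linking_texts[1:]):
        parts.append(label)
        parts.append(text)
    return "".join(parts)

def promote(items: list[str], order: list[int]) -> list[str]:
    """Reorder fetched terms for languages whose surface order differs
    (a toy stand-in for the 'promote indexed objects' idea)."""
    return [items[i] for i in order]

labels = ["Paris", "France"]
# English only needs linking texts:
english = interleave(labels, ["", " is the capital of ", ""])
# A language placing the country first reorders before interleaving
# (romanised Japanese, purely as an illustration):
reordered = interleave(promote(labels, [1, 0]), ["", " no shuto wa ", " desu"])
```

Making the function language-specific then amounts to supplying the linking texts and, where needed, an index order; agreement and morphology, as noted above, would still need language-specific handling beyond this.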
@GrounderUK: That's an interesting approach. I imagine the list-interleaving trick won't work in many languages due to agreement or ordering, but it is indeed a bit neater to have it apart. Accessing the data is fundamentally language-specific, as that determines what part of the data (grammatical gender, case, etc.) is used in so many cases. Jdforrester (WMF) (talk) 19:05, 14 October 2025 (UTC)
I guess it depends what you mean by “the data”. The information to be expressed is fundamentally language-neutral, whereas the information that governs the text that surrounds it tends to be language-specific. That just means that linking text is not static; it’s a function. What it’s a function of varies according to the language, but we won’t expect a Wikidata statement about the population of London, say, to tell us the grammatical gender of London or of the “population” concept.
It’s a hard problem, of course! But I think we should try to ground “fragment experiments” in Wikidata statements. This would go some way to answering this question. The distinction between “article-less” and “article-ful” is language-dependent; the current English “Models” are not really appropriate for a language-neutral project. Even in English, “France is a country” and “the United States is a country”; “antelopes are mammals” and “the violin is a string instrument”; “a frog is an amphibian”, “humans are primates” but “the blue whale is a cetacean”. But I digress… in some imminent (?) repository of language-neutral content, how will the first instance of (P31) relation know which “model” applies for its expression? GrounderUK (talk) 21:30, 14 October 2025 (UTC)

Turning this into a WikiProject?

I've been struck with the idea that this work could better be organised as a project page, or maybe as part of the catalogue. What do others think? Jdforrester (WMF) (talk) 19:50, 7 October 2025 (UTC)

Seems perfect as a section of WF:catalogue/Natural language operations. Maybe divided into ~a dozen subsections rather than one big table, though. Arlo Barnes (talk) 20:26, 7 October 2025 (UTC)
@Arlo Barnes: I see there's Wikifunctions:Catalogue/Natural language operations/Global language functions, but none of the things in the "Cross-lingual sentence creation" section are creating sentences. Maybe it shouldn't go there? Jdforrester (WMF) (talk) 13:46, 8 October 2025 (UTC)
As a follow-up, today I started restructuring the page to split the mega-table up, and add the background and how-to bits, in placeholder form for now. Jdforrester (WMF) (talk) 19:06, 14 October 2025 (UTC)

Fallbacks

We should provide fallbacks, particularly for proper-noun labels. The current approach in select monolingual text labels from Wikidata item (Z24139) returns no labels if none are found for the list of languages supplied. I would expect the calling function to supply the required languages via cascading first Object or default (Z22839) calls, with each “default” specifying a broader set of languages. Ultimately, perhaps “any” label is better than none, but the calling function is free to avoid that fallback and insert a placeholder or (maybe) an error. GrounderUK (talk) 12:00, 11 October 2025 (UTC)
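Sketched in Python, such a cascade might look like the following. The data and both helpers are hypothetical stand-ins for Z24139- and Z22839-style calls, included only to make the cascading shape concrete.

```python
# Toy label store: item -> {language code -> label}. Invented data.
LABELS = {
    "Q90": {"fr": "Paris", "en": "Paris"},
    "Q142": {"fr": "France"},
}

def labels_for(item: str, languages: list[str]) -> list[str]:
    """Like 'select monolingual text labels': labels in preference order,
    and an empty list if none of the requested languages has one."""
    found = LABELS.get(item, {})
    return [found[lang] for lang in languages if lang in found]

def first_or_default(item, *language_lists):
    """Cascade in the style of 'first Object or default': try each
    (progressively broader) language list in turn, taking the first
    label found; None if every list comes up empty."""
    for languages in language_lists:
        labels = labels_for(item, languages)
        if labels:
            return labels[0]
    return None

# A caller wanting German falls back through English to a broader list:
label = first_or_default("Q142", ["de"], ["de", "en"], ["fr", "en", "de"])
```

The calling function stays in control: by choosing how broad the final list is, it decides whether to accept "any" label, and a None result is where it would instead insert a placeholder or raise an error.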

@GrounderUK: I think this is a good idea. Maybe (eventually) we should return HTML fragments with a "fix this" link/call-to-action when e.g. a label isn't available in your language in Wikidata for an entity, or there's no matching lexeme, or… — but not today. Jdforrester (WMF) (talk) 19:02, 14 October 2025 (UTC)

Translating example sentences

I want to create a table with the German translations of the example sentences and a link to the outer function for each fragment. Maybe people can help translate it into other languages, with one table per language. At the moment I think it can help make it easier to create fragments in a specific language. Hogü-456 (talk) 20:02, 21 October 2025 (UTC)

@Hogü-456: Maybe we should make the background section (at least) translatable via the Translate extension? I worry about extending the table; it's already too large for mobile devices with just one example and the 7 target languages' statuses. Jdforrester (WMF) (talk) 16:11, 7 November 2025 (UTC)
I am interested in the example sentences and what they look like in different languages. As I understand it, after using the Translate extension there is only one section. As it makes translation easier, I support using the Translate extension. Hogü-456 (talk) 21:15, 9 November 2025 (UTC)
@Hogü-456: We are not using the Translate extension for content on this wiki. Using it in documentation to show what the output might look like, but not actually using it, seems like it would be confusing? Jdforrester (WMF) (talk) 21:07, 13 November 2025 (UTC)
I have seen translations of the Status Update; what is used to translate those? I think for the example sentences there can be subpages per language, so if someone wants to translate the example sentences it can be done by adding a subpage. This avoids huge pages that are not easy to use on a mobile phone. The subpages would be monolingual, at least in the beginning. I think if there is enough information about the content included, it is possible to understand at least a bit of it. For this, I think it is useful to write down the Wikidata items used in each example sentence in its English version. Hogü-456 (talk) 19:33, 14 November 2025 (UTC)
Maybe we could use Tatoeba for that purpose; for example would fit Z27243. YoshiRulz (talk) 14:21, 20 November 2025 (UTC)
I had not visited this page before. The page is interesting and it is possible to learn something from it. For example, the example sentence fitting Z23743 is licensed under CC BY-SA 2.0 FR. Is this upward-compatible with CC BY-SA 4.0, which is used on Wikifunctions? I think it can help to look at the page. In the end, content should be in Wikifunctions or later in Abstract Wikipedia, so it is important to check whether the transfer of content is allowed. Hogü-456 (talk) 20:43, 24 November 2025 (UTC)

re: promoting these experiments on Help:Multilingual

I'd appreciate more pairs of eyes on my addition to Help:Multilingual#Wikidata lexemes, since I don't know how frequently that's being translated and I don't want to send people to do fruitless work. (Also would appreciate more language-specific implementations of that function, but that was a given.) YoshiRulz (talk) 00:48, 28 November 2025 (UTC)

Abstract Wiki Architect

Hello,

I offer a stable version, aligned with your goals.

v0.9.0-matrix-stable is on https://github.com/Rejean-McCormick/abstract-wiki-architect

https://meta.wikimedia.org/w/index.php?title=Abstract_Wikipedia/Tools/abstract-wiki-architect

For full disclosure: I am not supported by, or representing, any member or group of the Wikimedia communities; I offer this solution on my own initiative.

I am not familiar with your procedures and habits, so I hope my contribution will be appreciated even tough the situation is highly unusual. I couldn't wait any longer for Wiki Abstract, I needed it. I can pull out with my project if it's against your regulations or whatever, I just hope here is the building ground where AWA can evolve. Réjean McCormick (talk) 03:03, 11 December 2025 (UTC)Reply