Talk:Z6

Detailed definition of a String needed

The following should be described:

What encoding is used for String? e.g. UTF-8?
Is there a maximum length for String?
Are all control characters/codepoints permitted to be used? e.g. CRLF, bidirectional text overrides?

Dhx1 (talk) 01:45, 4 August 2023 (UTC)

@Dhx1: The function model says:

String := "Character*" // to be specific, as in JSON / ECMA-404

The ECMA-404 standard says (p. 4) "A string is a sequence of Unicode code points..." and then discusses the escapes used to represent special characters.

To answer your specific questions:

What encoding is used for String? e.g. UTF-8?

A string is a list of Unicode codepoints. Whether it is serialized as UTF-8 or UTF-16 depends on the environment: some programming languages require a specific encoding, and the string is encoded accordingly. From the point of view of Wikifunctions, a string is defined to be a compact representation of a "list of codepoints", i.e. Z881(Z86).

Is there a maximum length for String?

Only in practice. We have not set a limit on the length, though our APIs might limit the length of inputs. If you want to try it out to see where it breaks, please use the Betacluster installation.

Are all control characters/codepoints permitted to be used? e.g. CRLF, bidirectional text overrides?

All code points are currently permitted. This might change if we discover that some cause issues. We expect that the UI will not deal gracefully with some of them. If you want to try it out to see where it breaks, please use the Betacluster installation.

Hope that helps! --DVrandecic (WMF) (talk) 19:29, 8 August 2023 (UTC)
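For readers who want to see the "list of code points" view concretely, here is a minimal Python sketch. It is only an illustration and not Wikifunctions code; the sample string and variable names are made up for this example. It shows that the same sequence of code points yields different bytes depending on the encoding chosen, while decoding either byte sequence gives back the same string.

```python
# Illustration only: one sequence of code points, two byte encodings.
text = "a\u2020\U0001F600"  # 'a', DAGGER (U+2020), GRINNING FACE (U+1F600)

# The abstract view: a list of Unicode code points.
code_points = [ord(ch) for ch in text]

# Two concrete serializations of that same list.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte-order mark

print([f"U+{cp:04X}" for cp in code_points])
print(utf8_bytes.hex(" "))
print(utf16_bytes.hex(" "))

# The bytes differ, but both decode back to the same code points.
same_text = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == text
```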

Please note that the API currently requires strings in requests to be normalized to Unicode NFC form. To use a string that is not in NFC form, see hex (string) to string (UTF-8) (Z10373). GZWDer (talk) 19:33, 8 August 2023 (UTC)
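To make the NFC point concrete, here is a small Python sketch using the standard `unicodedata` module. The helper name `is_nfc` is hypothetical, and the final lines only mimic the general idea of turning hex-encoded UTF-8 bytes into a string; they are an assumption for illustration, not the actual implementation of Z10373 or of the API's check.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """Return True if s is already in Unicode NFC form (hypothetical helper)."""
    return unicodedata.normalize("NFC", s) == s


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (two code points)
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(is_nfc(decomposed), is_nfc(composed))
print(len(decomposed), len(composed))

# Assumed idea behind a hex-to-string helper: pass the raw UTF-8 bytes as hex
# so that a non-NFC string is not normalized away in transit.
from_hex = bytes.fromhex("65cc81").decode("utf-8")
```

The two strings render identically but are different code point sequences, which is why a request containing the decomposed form would need a workaround of this kind.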

Thanks @DVrandecic (WMF) for the link to the function model. Is it possible to narrow the definition of Z6/String to only ever be UTF-8 encoded, per the requirement in RFC8259? Code point (Z86) also has an alias "UTF-8 code point", so I wonder whether this UTF-8-only definition is already in effect? Detecting whether a string is UTF-8 or UTF-16 encoded is ambiguous and error-prone, e.g. the bytes 0x20 0x20 are two spaces in UTF-8 and a dagger (U+2020) in UTF-16. RFC8259 also allows JSON implementations to ignore byte-order marks, which would otherwise have told an implementation whether a sequence of bytes is a UTF-8 or UTF-16 encoded string.

Additionally, it appears that the ECMA-404/RFC8259 requirement to escape some control codes, quotation marks, etc. is not a restriction on Z6/String either, other than that APIs need to escape these characters before a Z6/String is used in a JSON response? Dhx1 (talk) 02:30, 9 August 2023 (UTC)
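Both points in the comment above can be checked with a short Python sketch. It is an illustration only, using the standard `json` module rather than any Wikifunctions API, and the sample values are made up. The first part decodes the same two bytes under each encoding; the second shows that JSON escaping changes only the serialized form, not the string that comes back out.

```python
import json

# Part 1: the same two bytes read under two different encodings.
raw = bytes([0x20, 0x20])
as_utf8 = raw.decode("utf-8")       # two SPACE characters
as_utf16 = raw.decode("utf-16-be")  # one DAGGER, U+2020

print(repr(as_utf8), repr(as_utf16))

# Part 2: control characters, quotes and a bidi override survive a JSON round trip.
tricky = 'line1\r\nline2 "quoted" \u202e'  # CRLF, quotation marks, RIGHT-TO-LEFT OVERRIDE
encoded = json.dumps(tricky)   # escapes are applied only in the serialized text
decoded = json.loads(encoded)  # the original code points come back unchanged

print(encoded)
```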

JavaScript encodes strings as UTF-16, and much of our code is running on JavaScript. By requiring UTF-8, we might be introducing unnecessary specificity that might make everyone's life harder. Right now it feels to me like it is easier to allow each programming language to do what it wants as long as the sequence of code points is NFC equivalent. But I reserve the right to change my mind on this as we learn more about the system (and having this discussion and the material you collected here is exactly the right place to find it should we need to revisit this decision).

In short: it feels to me premature to commit to UTF-8. --DVrandecic (WMF) (talk) 05:22, 9 August 2023 (UTC)