Wikifunctions:Type proposals/Unicode codepoint
Done: Unicode code point (Z86): A single code point in Unicode --DVrandecic (WMF) (talk) 11:02, 26 February 2025 (UTC)
Summary
A Unicode codepoint represents a single code point in the Unicode standard. It is often misunderstood as a single Unicode character, but the two concepts do not coincide: a user-perceived character may be composed of several code points. This proposal fixes up the current Z86 type.
Uses
- Why should this exist?
Unicode is the most widely used standard for encoding text. Since our goal is to generate texts, it would be good to work effectively with Unicode concepts.
- What kinds of functions would be created using this?
Functions that break a string into its constituent code points, inspect them, and modify or reassemble them.
- What standard concepts, if any, does this align with?
Unicode
Structure
A Unicode codepoint consists of a single key with a natural number as the value type.
Example values
{
  "type": "Code point",
  "value": {
    "type": "Natural number",
    "value": "85"
  }
}

{
  "Z1K1": "Zxyz",
  "ZxyzK1": {
    "Z1K1": "Z13518",
    "Z13518K1": "85"
  }
}
Validator
The validator ensures that the natural number is at most 1,114,111 (0x10FFFF), the largest valid Unicode code point.
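A minimal sketch of such a range check in Python (the function name is hypothetical, not part of the proposal):

```python
def validate_code_point(value: int) -> bool:
    """Return True if value is a valid Unicode code point (U+0000 through U+10FFFF)."""
    return 0 <= value <= 0x10FFFF  # 0x10FFFF == 1,114,111
```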
Identity
Two code points are the same if their numbers are the same.
Note that two code points which have the same glyph are not the same.
Converting to code
Python
In Python, a character is turned into a code point with the built-in function ord, which returns an int. Therefore an int seems to be the idiomatic representation of a code point in Python.
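For illustration, ord and its inverse chr behave as follows:

```python
# ord() maps a one-character string to its code point (an int);
# chr() is the inverse.
assert ord("A") == 65
assert chr(233) == "é"
assert ord("😀") == 0x1F600  # code points outside the BMP are still single ints
assert chr(ord("π")) == "π"  # round trip
```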
JavaScript
In JavaScript, a character is turned into a code point with the string method .codePointAt(), which returns a number. Therefore a number seems to be the idiomatic representation of a code point in JavaScript.
Renderer
Render the number as plain decimal digits, without commas or other grouping separators.
Parsers
Accept a number only.
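The renderer and parser together could look like this sketch in Python (function names hypothetical; the real implementation would operate on the wrapped natural number):

```python
def render_code_point(cp: int) -> str:
    # Plain decimal digits, no grouping separators
    return str(cp)

def parse_code_point(s: str) -> int:
    # Accept ASCII decimal digits only; int() alone would also accept
    # signs, surrounding whitespace, and non-ASCII digits
    if not (s.isascii() and s.isdigit()):
        raise ValueError(f"not a plain number: {s!r}")
    return int(s)
```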
Alternatives
(Option 1) is the main proposal as described.
(Option 2) The validator could additionally disallow the 2,048 values reserved for UTF-16 surrogates, i.e. the values from 55,296 (0xD800) to 57,343 (0xDFFF), inclusive.
(Option 3) The converters could turn code points into one-character strings instead of numeric values. That seems closer to the notion of a character.
(Option 4) Renderers and parsers could use localization, i.e. display the numbers as appropriate for the given language.
(Option 5) Renderers and parsers could work on individual glyphs or strings. This seems error-prone.
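The extra restriction of Option 2 can be sketched in Python as a check for Unicode scalar values (the function name is hypothetical):

```python
def is_scalar_value(cp: int) -> bool:
    """True for Unicode scalar values: valid code points minus the
    UTF-16 surrogate range U+D800 through U+DFFF (55,296..57,343)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
```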
Comments
Support as proposer.
Oppose I disagree with the necessity. A one-character string is better and makes more sense here.
Support lgtm Feeglgeef (talk) 00:53, 1 December 2024 (UTC)
- I don't mind having a one-character datatype, and I think we should have one, but I wouldn't even know how to figure out what a character is if you cannot talk about codepoints. --DVrandecic (WMF) (talk) 09:35, 16 December 2024 (UTC)
- I mean a Z6 with a length of 1. I disagree with your "how to figure out what a character is" comment, because we define a string as being before a code point (the very opposite of what the computer is actually doing), so we already kinda do know what a character is. Feeglgeef (talk) 03:17, 17 December 2024 (UTC)
- @Feeglgeef I don't understand. Here is an example: "👩🏿🔬". When I ask JavaScript to split this up, it returns me four codepoints (correctly). The four codepoints cannot be represented by four strings of length 1 (especially not the zero-width joiner). --Denny (talk) 17:38, 12 February 2025 (UTC)
- No, that would be one one-char string. Feeglgeef (talk) 23:11, 12 February 2025 (UTC)
- @Feeglgeef: So, "👩🏿🔬" is a one-char string, fine (although, Z11040 says it has length 4).
- But if I want to know the codepoints this string is constituted from, we would need a data type to represent the codepoints. This proposal is that. It is also helpful, as pointed out below, to build complex Unicode strings together, such as diacritics etc.
- Are you not convinced the codepoint datapoint is useful, or is your oppose for something else? Or are you saying a "one-char string", whatever that means, would be more useful? If the latter, we can have both. --Denny (talk) 10:08, 13 February 2025 (UTC)
Support --99of9 (talk) 00:45, 21 January 2025 (UTC)
Support It might be useful when dealing with characters like diacritics marks, zero spaces, etc. --Mohanad (talk) 20:05, 12 February 2025 (UTC)
Implementation notes
All related functions have been updated. Unfortunately, Z888, Z868, and Z886 had to be deprecated and replaced with new functions.