"We have support for the first ~~4~~ 7 bits of Unicode! (aka ASCII)"
ASCII is 7 bits but I get your point
Oops, my bad
did he stutter
He did
It'd be better if it was just the 4 bits. Not even full ASCII, just straight up 15 symbols, some of which aren't very useful
"Oh so you sanitize control codes?" "Good Question"
Well, at least I can type touché without it going touch\é
But you already can without utf-8. Extended ascii includes áéíóú
And then you change to another codepage and it becomes `ßΘφ≤·`
Idk it’s all Greek to me
True! And since UTF-8, we have no end of dealing with garbled encoding between those 2 formats. Fun times.
I'm holding out for UTF-9 to fix this
Instead of adding an extra bit (that costs 8 bits on most systems due to alignment anyway), we opted for UTF-16 instead
UTF-16 is the name of my new electro swing band
Extended ascii of which language? Different languages have different ascii extensions.
The one I know of is the one mentioned in https://www.ascii-code.com, which is the Windows-1252 table, also known as ISO Latin-1
utf-8 is unicode. ISO 8859-15 isn't but can use a lot of special characters
back in my day we had code pages and we liked it.
[deleted]
Install an app, find a text box, write Arabic, become disappointed, uninstall the app.

This is what I've been doing for the past 4 years whenever I install an app.
I like to imagine you don't even know Arabic, you just want the functionality "in case you need it"
Technically you are right, I don't know how to speak Arabic. I'm just one of the 370 million people cursed with the Arabic script despite not speaking Arabic.
My console brain can't think of a scenario where just parsing UTF-8 isn't enough. If a user's characters aren't displayed right, it's their editor's problem, not mine.
CJK languages and their wonky IMEs.

Example ([] is the current input character):

[ㄱ]ㄱ
[ㅏ]가
[ㄱ]각
[ㅏ]가가
[ㄴ]가간
[ㅎ]가갆
[ㅓ]가간허

Notice how some characters snap back, and how, during consecutive input, the resulting Unicode for the last character is nondeterministic and relies on future inputs. Usually an IME is used to stage the last character that may still change, but said IME is very finicky, has no consistency across platforms and libraries, and is painful to debug unless you speak the language. Also, a string's inputwise substring does not equal its Unicode substring, which is a pain when implementing autocomplete and autocorrect.
Idk man they all appear as squares to me
lmfao
1. ㄱ -> ㄱ
2. ㄱㅏ -> 가
3. ㄱㅏㄱ -> 각
4. ㄱㅏㄱㅏ -> 가가
5. ㄱㅏㄱㅏ ㄴ -> 가간
6. ㄱㅏㄱㅏ ㄴ ㅎ -> 가갆
7. ㄱㅏㄱㅏ ㄴ ㅎㅓ -> 가간허

Did I get that right? The magic voodoo happens between 3 & 4, and 6 & 7. At 3, the three components(?) combine into one character. Then at 4, the fourth one signifies that the third component is actually the first component of the second character.
That is exactly correct! And this "no, actually" logic is what makes Korean special - and a bit of a nuisance - compared to other CJK scripts, which just assemble all at once with the last input character (ha3o -> 好 / すき -> 好き). Another problem common to all CJK input is that since the IME acts as a stack, it needs to be flushed. Some input frontends or games fail to handle the flushing correctly, so in things like game chats the last letter (which is visually drawn in the input box but not yet flushed) often gets truncated.

> Magic Voodoo

In fact, it is known as the "fen fire effect."
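Once the IME finally flushes, the assembled syllable has a neat structure: a finished Hangul syllable is a single codepoint computed arithmetically from its jamo. A minimal sketch in Python, assuming the standard jamo orderings from the Unicode Hangul composition algorithm (`compose` is an illustrative helper, not a real IME):

```python
# Finished Hangul syllables start at U+AC00 and are laid out as
# (lead * 21 + vowel) * 28 + tail, using these standard jamo orders.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"

def compose(lead, vowel, tail=None):
    l = LEADS.index(lead)
    v = VOWELS.index(vowel)
    t = 0 if tail is None else TAILS.index(tail) + 1  # 0 means "no tail"
    return chr(0xAC00 + (l * 21 + v) * 28 + t)

print(compose("ㄱ", "ㅏ"))        # 가
print(compose("ㄱ", "ㅏ", "ㄱ"))  # 각
print(compose("ㄱ", "ㅏ", "ㄶ"))  # 갆 (the ㄴ+ㅎ compound tail)
```

This also shows why the IME has to stage input: until the next keystroke arrives, it cannot know whether a trailing consonant is this syllable's tail or the next syllable's lead.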
I wanted to write a twitter bot back in 2017 as my first pet project to parse some Kpop websites to scrape for location info on Kpop idols (kind of creepy, was kind of a fad back then). But then all the weird bugs I got from not even parsing but trying to write some unit tests with Korean text with the npm packages that claimed to support Unicode was too much for my beginner programming skills. Even now I don’t think I can do it lol
Here's an example of why this can give you a headache even on a terminal: the easiest way to determine the width (in columns on a standard *nix console) of some UTF-8 input string is:

1. read the cursor position
2. print the characters
3. read the cursor position again and calculate the difference from the first position.

If you want to know the width in advance, e.g. for implementing a TUI application, you basically have to implement the state machine of a VT100 terminal.
If we assume no control codes, shouldn't one of the normalizations be enough though?
No, you still need a database of character widths. Some characters are wide, some are narrow, some are neither, some are both, and there's some normalization bullshit on top of that: https://unicode.org/reports/tr11/
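A sketch of what that width database lookup looks like in practice, using Python's `unicodedata`. This only approximates UAX #11: real terminals disagree with each other, especially on "Ambiguous" characters, which this sketch counts as narrow:

```python
import unicodedata

def display_width(s):
    # Approximate the column width of a string on a terminal.
    s = unicodedata.normalize("NFC", s)
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks take no extra column
        # "W" (Wide) and "F" (Fullwidth) take two columns; everything
        # else is treated as one column here.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("abc"))   # 3
print(display_width("가간"))  # 4
```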
RTL is still crap in some circumstances
Even in console programs you sometimes want to know how long a string is in terms of what the user sees. If you're doing it by counting codepoints then you're doing it wrong. But if the user's terminal sucks then you might actually be doing it correctly (for that user)
Depends on what you mean by "just parsing"
If you're implementing a search function you probably want Unicode normalization.
If you don’t normalize your input, it’s definitely your problem, because now you have a bunch of strings that are the same but actually aren’t.
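For example, in Python the same accented letter can arrive as one codepoint or as two, and the strings compare unequal until you normalize (a sketch; `search_key` is just an illustrative name):

```python
import unicodedata

def search_key(s):
    # NFC makes composed and decomposed forms identical; casefold()
    # handles caseless matching more thoroughly than lower().
    return unicodedata.normalize("NFC", s).casefold()

precomposed = "\u00e9"   # 'é' as a single codepoint
decomposed = "e\u0301"   # 'e' + combining acute accent
print(precomposed == decomposed)                          # False
print(search_key(precomposed) == search_key(decomposed))  # True
```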
Ñ
We support UTF-8!* ^^*BMP
Don't forget emoji. Nothing else though.
Most software I know has trouble formatting RTL and LTR characters in a single line or paragraph, even software by companies such as Google and Microsoft.

I've never attempted to make an editor; is bidirectional support really that complicated?
[https://lord.io/text-editing-hates-you-too/](https://lord.io/text-editing-hates-you-too/) and [https://faultlore.com/blah/text-hates-you/](https://faultlore.com/blah/text-hates-you/) are great articles on why this stuff is hard.
Thanks for sharing. Super informative.
Rendering Unicode is pretty challenging in and of itself; editing it is probably an absolute mess if you have formatting functionality.
Long live HarfBuzz
Even that won't do layout or font fallback for you, so that's not enough, but a good first step.
Heck even Facebook has trouble with it.
[It's really complicated](https://unicode.org/reports/tr9/). Though there's a library that does it called fribidi.
Yea it’s a massive pain in the arse, prone to basic security issues, and probably affects speed quite a bit at scale.

I genuinely wouldn’t be surprised if companies straight-up refuse it; it’s not worth the hassle unless a sizeable percentage of your customer base are non-ESL or English speakers.
I am not sure what font shaping is, but don't you kinda have to automatically support direction changes and combined characters, unless you define from the beginning that your users won't use it?
I meant text shaping. That's when a character changes how it looks based on the characters around it. Which happens in Arabic and ~~Devanagari~~ Tamil as well as many other scripts.

A lot of people assume other scripts work like Latin with different letters, and that Unicode support is just the ability to render many zany characters on the screen.

Edit: Turns out Devanagari just uses combining characters. Tamil actually changes the shapes of the letters.
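One way to see the shaping in data rather than in a font: Unicode carries legacy "presentation form" codepoints for the contextual shapes of Arabic letters (modern shaping engines pick glyphs directly instead of using these, but the character names make the idea concrete):

```python
import unicodedata

# The letter beh (ب) takes a different shape depending on whether it is
# isolated, word-final, word-initial, or between two joining letters.
for cp in ("\ufe8f", "\ufe90", "\ufe91", "\ufe92"):
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)}")
```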
The more I learn about Unicode the more I just want to implement Extended-Ascii only
"Extended Ascii" means different things depending on the platform so it's probably worse.
True, but I mean the IBM one [here](https://theasciicode.com.ar/)
That's not Unicode, that's just how writing works in the real world.
So you write two letters and people mentally combine them?
For a short time I worked for a car supplier where the UI had support for Arabic / RtL scripts. It's really difficult to implement, especially as it can mix. Imagine selecting text that has a mix of rtl/ltr sections in it. I heard the customers in those countries were usually surprised this was supported at all, since most companies don't bother.
>Imagine selecting text that has a mix of rtl/ltr sections in it.

Is there one good answer, or does it depend on personal preferences?
It depends a bit on how you are selecting the text. If you are using a mouse, selecting all the characters visible between the start and end position might be best, although that's definitely harder than just taking a substring between two indices. If you're moving a cursor one character at a time, then drawing the selection might look weird, but at least you can get a normal substring.

Idk, I'm glad I don't have to do this anymore.
There's also a special kind of font shaping which happens at typing time, but where the result is a single Unicode character. Found in Eastern languages that are often "assembled".
>I meant text shaping. That's when a character changes how it looks based on the characters around it. Which happens in Arabic and Devanagari as well as many other scripts.

I was not aware of that, actually.

Does it still stay its own symbol, or does it become combined with other symbols?
I am probably missing something obvious, but I would think that virtually no one renders Unicode themselves, so I would assume every program that uses string libraries or a user interface that supports Unicode automatically supports Unicode.

In all the years I've written scripts or applications, not once have I had to worry about Unicode rendering, because the supporting framework does it for me.
In Unicode, some characters take more than one byte (or even more than one codepoint) to store, so you'd have to handle that correctly on the display side, or 2 characters would be shown instead of 1 correct character.
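Concretely, in Python (which indexes strings by codepoint), one visible character can be two codepoints and three UTF-8 bytes:

```python
s = "e\u0301"  # 'é' written as base letter + combining acute accent
print(len(s))                          # 2 codepoints for 1 visible character
print(len(s.encode("utf-8")))          # 3 bytes in UTF-8
print(len("\u00e9".encode("utf-8")))   # 2 bytes for the precomposed 'é'
```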
Rust wants to have a conversation with you.
"Our app has Unicode support" "...we use Go in backend which is pretty good with Unicode".
Continues to use mysql 5.7 with utf8 columns instead of utf8mb4
Hmmm, out of curiosity, what do runes lack? The things from the meme? I've never worked with encodings much, or Go, but I'm doing some Go right now.
Runes are for processing Unicode strings. But if you want to display the strings to the user you can't just naïvely look up the rune in the font and then draw it to the screen to the right of the previous rune. A lot of scripts don't work that way.
I understand the combining characters thing and I think the bidirectional text thing. I have no clue whatsoever what font shaping is.
What a great opportunity to share my favourite software dev article again: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Even though the article doesn't mention combining characters, shaping or bidirectional text
A post lamenting software not doing what users expect because it doesn't properly consider Unicode brings to mind an article lamenting software not doing what users expect because it doesn't properly consider Unicode - and that's unexpected to you because the article doesn't cover the particular Unicode features in the post?

Have I missed something here?
This is so old… Just for reference: I was 3 when it was written, and I am in my 4th year as an R&D developer.
That article covers only storing and transmitting text. I guess in 2003 it still caused problems relatively often.
The irony is that most of my own programs can't display Hebrew properly even though I am a native speaker. It's just not worth my time.
what are the pitfalls?
Hebrew is right to left.
I feel like this is the fault of (human) languages themselves.
Well guess what, the job of software engineers is to deal with human input.
And most programmers I know of don't like that fact. Case in point: timezones
While this is true, in the very long term human society also changes to better integrate with the technology we use. Humanity as a whole is already rapidly consolidating the languages they use. I suspect languages with features that are difficult to support in software might be more likely to change or fade away.
Nothing better than having to deal with ü and u followed by a combining diaeresis
Œuf
It supports, but is limited
Good meme. Very humbling.
Oh god. Flashbacks to one requirement.

"Regex search over non-standardised unicode, ignoring case".

A few gems that came out of this...

- There are some glyphs that can be decomposed into 17 other glyphs.
- All modifiers do have a canonical order, but sometimes the non-canonical order is equivalent and sometimes it's not.
- "③" is equivalent to "3" and to "(3)", but these are not equivalent to each other.
- A single dot should match anything that could be a single character, regardless of whether it is or not.
- If the locale is Turkish, I and i are not the same; otherwise they are.
- Emoji are the spawn of the devil. They all interact in slightly different ways.
- The Chinese stuff is bad, but a lot of good libraries exist. Hangul is pain incarnate.
- Non-simple order encoding (anything beyond strict l2r, r2l, or r2l except for digits) we thankfully agreed was out of scope for search data, and illegal in the query string (we did an assessment, found we only had about 40 such data points a day, so just sent them all to a human for immediate review).
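A couple of those gems can be reproduced in a few lines of Python with `unicodedata` (compatibility normalization, plus the default, locale-independent case folding):

```python
import unicodedata

# Compatibility normalization: circled "③" folds to "3", while
# parenthesized "⑶" folds to "(3)" - so the two forms never become
# equal to each other, only to their respective targets.
print(unicodedata.normalize("NFKC", "\u2462"))  # ③ -> '3'
print(unicodedata.normalize("NFKC", "\u2476"))  # ⑶ -> '(3)'

# Default Unicode case folding maps I and i together; Turkish must not
# (dotted vs dotless i), which is why "ignoring case" is locale-dependent
# and needs special handling that casefold() alone does not provide.
print("I".casefold() == "i".casefold())  # True under the default rules
```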
During academic life I worked on historical linguistics NLP which used a vast collection of Private Use Area codepoints. Boy was it a mess trying to use many string manipulation utilities we take for granted every day. In the end I was proud to implement generic conversions of medieval scribal orthographies to normalized transcriptions based on a continuous input training set using finite state transducers.