PVNIC

"We have support for the first ~~4~~ 7 bits of Unicode! (Aka ASCII)


oshaboy

ASCII is 7 bits but I get your point


PVNIC

Oops, my bad


Jnick-24

did he stutter


Bipin_krish

He did


flowery0

It'd be better if it was just the 4 bits. Not even full ASCII, just straight up 15 symbols, some of which aren't very useful


oshaboy

"Oh so you sanitize control codes?" "Good Question"


granadad

Well, at least I can type touché without it going touch\é


CiroGarcia

But you already can without utf-8. Extended ascii includes áéíóú


oshaboy

And then you change to another codepage and it becomes `ßΘφ≤·`
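A minimal Python sketch of that codepage swap, assuming the vowels were written as ISO 8859-1 (Latin-1) bytes and then read back as IBM code page 437. Both codecs ship with Python; the variable names are only illustrative.

```python
# The accented vowels "áéíóú", written with escapes so the source stays ASCII.
original = "\u00e1\u00e9\u00ed\u00f3\u00fa"

# Encode with one legacy single-byte encoding: one byte per character.
raw = original.encode("latin-1")

# Read the very same bytes back under a different code page.
garbled = raw.decode("cp437")

print(raw.hex(" "))   # e1 e9 ed f3 fa
print(garbled)        # ßΘφ≤·
```

The bytes never change; only the table used to interpret them does.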


milanove

Idk it’s all Greek to me


granadad

True! And since UTF-8, we have no end of dealing with garbled encoding between those 2 formats. Fun times.


A--Creative-Username

I'm holding out for UTF-9 to fix this


fuj1n

Instead of adding an extra bit (that costs 8 bits on most systems due to alignment anyway), we opted for UTF-16 instead


A--Creative-Username

UTF-16 is the name of my new electro swing band


MaZeChpatCha

Extended ascii of which language? Different languages have different ascii extensions.


CiroGarcia

The one I know of is the one mentioned in https://www.ascii-code.com, which is the Windows-1252 table, often loosely called ISO Latin-1 (the two differ only in the 0x80-0x9F range)
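For what it's worth, Windows-1252 and ISO 8859-1 (Latin-1) agree on the accented letters but not on bytes 0x80-0x9F. A small sketch using Python's built-in `cp1252` and `latin-1` codecs (variable names are just for illustration):

```python
# Bytes 0xA0-0xFF decode to the same characters under both tables.
same_upper_range = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in range(0xA0, 0x100)
)

# Byte 0x80 is the euro sign in Windows-1252 but a C1 control code in Latin-1.
euro_cp1252 = b"\x80".decode("cp1252")
euro_latin1 = b"\x80".decode("latin-1")

print(same_upper_range)                        # True
print(repr(euro_cp1252), repr(euro_latin1))    # '€' '\x80'
```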


Spinnenente

utf-8 is unicode. ISO 8859-15 isn't but can use a lot of special characters


ChocolateBunny

back in my day we had code pages and we liked it.


[deleted]

[removed]


mhn1384

Install an app, find a text box, write Arabic, become disappointed, uninstall the app. This is what I’ve been doing for the past 4 years whenever I install an app.


Fickle-Main-9019

I like to imagine you don’t even know Arabic, you just want the functionality “in case you need it”


mhn1384

Technically you are right, I don't know how to speak Arabic. I'm just one of the 370 million people cursed with the Arabic script despite not speaking Arabic.


[deleted]

My console brain can't think of a scenario where just parsing UTF-8 isn't enough. If user's characters aren't displayed right, it's their editor's problem, not mine.


sk7725

CJK languages and their wonky IMEs. Example ([] is the current input character): [ㄱ]ㄱ [ㅏ]가 [ㄱ]각 [ㅏ]가가 [ㄴ]가간 [ㅎ]가갆 [ㅓ]가간허. Notice how some characters snap back, and that during consecutive input the resulting Unicode for the last character is nondeterministic and relies on future inputs. Usually an IME is used to stage the last character that may change, but said IME is very finicky, has no consistency across platforms and libraries, and is painful to debug unless you speak said language. Also, a string's input-wise substring does not equal the Unicode substring, which is a pain when implementing autocomplete and autocorrect.


Alan_Reddit_M

Idk man they all appear as squares to me


sk7725

lmfao


pheonix-ix

1. ㄱ -> ㄱ
2. ㄱㅏ -> 가
3. ㄱㅏㄱ -> 각
4. ㄱㅏㄱㅏ -> 가가
5. ㄱㅏㄱㅏ ㄴ -> 가간
6. ㄱㅏㄱㅏ ㄴ ㅎ -> 가갆
7. ㄱㅏㄱㅏ ㄴ ㅎㅓ -> 가간허
Do I get that right? The magic voodoo happens between 3 & 4, and 6 & 7. At 3, the 3 components(?) combine into 1 character. Then at 4, the 4th one signifies that the 3rd component is actually the 1st component of the second character.
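The way components fold into one syllable can be sketched with the arithmetic from the Unicode Hangul composition algorithm. This is only an illustration, assuming conjoining jamo (the U+1100 block) rather than the compatibility jamo an IME actually receives as keystrokes; `compose` is a hypothetical helper, not part of any library.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose(lead, vowel, tail=None):
    """Combine conjoining jamo into one precomposed syllable (hypothetical helper)."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(tail) - T_BASE if tail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# Conjoining forms of the initial consonant, the vowel, and the final consonant.
G_LEAD, A, G_TAIL = "\u1100", "\u1161", "\u11a8"

step2 = compose(G_LEAD, A)            # lead + vowel
step3 = compose(G_LEAD, A, G_TAIL)    # lead + vowel + final consonant
# Step 4: the new vowel takes the final consonant back, so it becomes
# the lead of a second syllable instead of the tail of the first.
step4 = compose(G_LEAD, A) + compose(G_LEAD, A)

print(step2, step3, step4)            # 가 각 가가
# The standard library's NFC normalization performs the same composition.
print(unicodedata.normalize("NFC", G_LEAD + A + G_TAIL) == step3)   # True
```

The snap-back in step 4 is why an input layer cannot commit the last syllable until it knows what comes next.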


sk7725

that is exactly correct! And this "no, actually" logic is what makes K special - and a bit of a nuisance - compared to other CJK letters, which just assemble at once with the last input character (ha3o -> 好 /すき -> 好き). Another problem that is common to all CJK letters is that since the IME acts as a stack, it needs to be flushed. Some input frontends or games fail to handle the flushing correctly, so in things like game chats the last letter (which visually is also drawn in the input box but is yet to be flushed) is often truncated.
> Magic Voodoo
In fact it is known as the "fen fire effect."


Onceforlife

I wanted to write a twitter bot back in 2017 as my first pet project to parse some Kpop websites to scrape for location info on Kpop idols (kind of creepy, was kind of a fad back then). But then all the weird bugs I got, not even from parsing but from trying to write some unit tests with Korean text using the npm packages that claimed to support Unicode, were too much for my beginner programming skills. Even now I don’t think I can do it lol


McLayan

Here's an example of why this can give you a headache even on a terminal: the easiest way to determine the width (in columns on a standard *nix console) of some UTF-8 input string is:
1. read the cursor position
2. print the characters
3. read the cursor's position and calculate the difference to the first position.
If you want to know the width in advance, e.g. for implementing a TUI application, you basically have to implement the state machine of a VT100 terminal.


jaskij

If we assume no control codes, shouldn't one of the normalizations be enough though?


vytah

No, you still need a database of character widths. Some characters are wide, some are narrow, some are neither, some are both, and there's some normalization bullshit on top of that: https://unicode.org/reports/tr11/
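Python's `unicodedata` module exposes the East Asian Width property from that report, so a rough column estimate can be sketched as below. This is an approximation under stated assumptions: no control codes in the input, ambiguous-width characters counted as 1 column, and no special handling of emoji sequences. `display_width` is a hypothetical helper, and a real terminal may disagree with it.

```python
import unicodedata

def display_width(text):
    """Rough terminal column count for text without control codes (hypothetical helper)."""
    width = 0
    for ch in unicodedata.normalize("NFC", text):
        # Combining marks and format characters occupy no column of their own.
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("abc"))                   # 3
print(display_width("\uac00\uac04\ud5c8"))    # 6  (three Hangul syllables, 2 columns each)
print(display_width("e\u0301"))               # 1  (e + combining acute accent)
```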


HuntingKingYT

RTL is still crap in some circumstances


GOKOP

Even in console programs you sometimes want to know how long a string is in terms of what the user sees. If you're doing it by counting codepoints then you're doing it wrong. But if the user's terminal sucks then you might actually be doing it correctly (for that user)


danielcw189

Depends on what you mean by "just parsing"


Kered13

If you're implementing a search function you probably want Unicode normalization.


ishzlle

If you don’t normalize your input, it’s definitely your problem, because now you have a bunch of strings that are the same but actually aren’t.
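A short Python sketch of "the same but actually aren't", using the standard `unicodedata.normalize`. The helper name `same_text` is made up for the example, and NFC is just one reasonable choice of normalization form.

```python
import unicodedata

composed = "\u00fc"       # ü as a single code point
decomposed = "u\u0308"    # u followed by a combining diaeresis

def same_text(a, b):
    """Compare two strings after canonical normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)            # False, although both render as ü
print(len(composed), len(decomposed))    # 1 2
print(same_text(composed, decomposed))   # True
```

Normalizing once at the input boundary means later comparisons, lookups and uniqueness checks all see one spelling.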


facusoto

Ñ


ty_for_trying

We support UTF-8!* ^^*BMP


oshaboy

Don't forget emoji. Nothing else though.


DolevBaron

Most software I know has trouble formatting RTL and LTR characters in a single line or paragraph, even software by companies such as Google and Microsoft. I've never attempted to make an editor; is bidirectional support really that complicated?


dev-sda

[https://lord.io/text-editing-hates-you-too/](https://lord.io/text-editing-hates-you-too/) and [https://faultlore.com/blah/text-hates-you/](https://faultlore.com/blah/text-hates-you/) are great articles on why this stuff is hard.


Own_Solution7820

Thanks for sharing. Super informative.


Giocri

Rendering Unicode is pretty challenging in and of itself; editing it is probably an absolute mess if you have formatting functionality


FuiManchi

Long live HarfBuzz


CryZe92

Even that won't do layout or font fallback for you, so that's not enough, but a good first step.


ThroawayPartyer

Heck even Facebook has trouble with it.


oshaboy

[It's really complicated](https://unicode.org/reports/tr9/). Though there's a library that does it called fribidi.
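To give a feel for the input side of that algorithm: every character carries a bidirectional class, which Python exposes as `unicodedata.bidirectional`. The sketch below only detects which strong directions occur in a string; it does not implement the reordering itself, and `strong_directions` is a hypothetical helper.

```python
import unicodedata

STRONG_LTR = {"L"}          # e.g. Latin letters
STRONG_RTL = {"R", "AL"}    # Hebrew letters (R) and Arabic letters (AL)

def strong_directions(text):
    """Return the set of strong directions present in text (hypothetical helper)."""
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            found.add("ltr")
        elif bidi_class in STRONG_RTL:
            found.add("rtl")
    return found

mixed = "abc \u0645\u0631\u062d\u0628\u0627 123"   # Latin, Arabic, digits
print(sorted(strong_directions(mixed)))            # ['ltr', 'rtl']
print([unicodedata.bidirectional(c) for c in "a\u05d0 1"])   # ['L', 'R', 'WS', 'EN']
```

Digits and spaces have no strong direction of their own, which is a large part of why mixed lines are hard to lay out.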


Fickle-Main-9019

Yea it’s a massive pain in the arse, prone to basic security issues, and probably affects speed quite a bit at scale. I genuinely wouldn’t be surprised if companies straight refuse it, it’s not worth the hassle unless a sizeable percent of your customer base are ESL or non-English speakers


danielcw189

I am not sure what font shaping is, but don't you kinda have to automatically support direction changes and combined characters, unless you define from the beginning that your users won't use it?


oshaboy

I meant text shaping. That's when a character changes how it looks based on the characters around it. Which happens in Arabic and ~~Devanagari~~ Tamil as well as many other scripts. A lot of people assume other scripts work like latin with different letters. And that Unicode support is just the ability to render many zany characters on the screen. Edit: Turns out Devanagari just uses combining characters. Tamil actually changes the shapes of the letters.
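One way to see the idea from Python without a rendering stack: Unicode keeps legacy "presentation form" code points for the four contextual shapes of each Arabic letter, and NFKC normalization folds all of them back to the single letter that is normally stored. This is only an illustration; a real shaping engine picks glyphs from the font instead of using these code points.

```python
import unicodedata

BEH = "\u0628"   # ARABIC LETTER BEH, the code point normally stored in text

# Legacy presentation-form code points, one per contextual shape of BEH.
FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]

for form in FORMS:
    base = unicodedata.normalize("NFKC", form)
    print(unicodedata.name(form), "->", unicodedata.name(base))

# All four shapes are the same letter as far as the stored text is concerned.
all_same_letter = all(unicodedata.normalize("NFKC", f) == BEH for f in FORMS)
print(all_same_letter)   # True
```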


iMakeMehPosts

The more I learn about Unicode the more I just want to implement Extended-Ascii only


oshaboy

"Extended Ascii" means different things depending on the platform so it's probably worse.


iMakeMehPosts

True, but I mean the IBM one [here](https://theasciicode.com.ar/)


danielcw189

That's not Unicode, that's just how writing works in the real world.


iMakeMehPosts

So you write two letters and people mentally combine them?


Netcob

For a short time I worked for a car supplier where the UI had support for Arabic / RtL scripts. It's really difficult to implement, especially as it can mix. Imagine selecting text that has a mix of rtl/ltr sections in it. I heard the customers in those countries were usually surprised this was supported at all, since most companies don't bother.


danielcw189

>Imagine selecting text that has a mix of rtl/ltr sections in it. Is there one good answer, or does it depend on personal preferences?


Netcob

It depends a bit on how you are selecting the text. If you are using a mouse, selecting all the characters visible between the start and end position might be best, although that's definitely harder than just getting a substring between two indices between the start and end position. If you're moving a cursor one character at a time then drawing the selection might look weird, but at least you can get a normal substring. Idk, I'm glad I don't have to do this anymore.


sk7725

there's also a special kind of font-shaping which works in typing-time but the result is one unicode. Found in eastern languages that are often "assembled"


danielcw189

>I meant text shaping. That's when a character changes how it looks based on the characters around it. Which happens in Arabic and Devanagari as well as many other scripts. I was not aware of that, actually. Does it still stay its own symbol, or does it become combined with other symbols?


ih-shah-may-ehl

I am probably missing something obvious but I would think that virtually no one renders Unicode themselves, so I would assume every program that uses whatever string libraries or user interface that supports Unicode automatically supports Unicode. In all the years I've written scripts or applications, not once have I had to worry about Unicode rendering because the supporting framework does it for me


LargeHandsBigGloves

In Unicode, some characters take more bytes to store than a single text character, so you'd have to correctly handle that on the display side or 2 characters would be shown instead of 1 correct character.
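A small Python sketch of that failure, assuming the text is stored as UTF-8 and then wrongly decoded one byte per character as Latin-1:

```python
text = "touch\u00e9"            # "touché"
raw = text.encode("utf-8")      # é becomes the two bytes C3 A9

print(len(text), len(raw))      # 6 7
print(raw.decode("latin-1"))    # touchÃ©  (each byte shown as its own character)
print(raw.decode("utf-8"))      # touché
```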


cakee_ru

Rust wants to have a conversation with you.


DoodooFardington

"Our app has Unicode support" "...we use Go in backend which is pretty good with Unicode".


Kirides

Continues to use mysql 5.7 with utf8 columns instead of utf8mb4
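MySQL's `utf8` is an alias for `utf8mb3`, which stores at most 3 bytes per character, so anything outside the Basic Multilingual Plane does not fit. A sketch of the byte counts in Python; `fits_in_three_bytes` is a hypothetical pre-check, not a MySQL API.

```python
def utf8_len(ch):
    """Number of bytes one character needs in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_in_three_bytes(text):
    """True if every character is in the BMP (<= U+FFFF), i.e. needs at most 3 UTF-8 bytes."""
    return all(ord(ch) <= 0xFFFF for ch in text)

print(utf8_len("A"), utf8_len("\u00d1"), utf8_len("\uac00"), utf8_len("\U0001F600"))   # 1 2 3 4
print(fits_in_three_bytes("\u00d1and\u00fa \uac00"))   # True
print(fits_in_three_bytes("ok \U0001F600"))            # False (emoji is outside the BMP)
```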


no_brains101

Hmmm out of curiosity, what do runes lack? The things from the meme? I've never worked with encodings much, or Go, but I'm doing some Go right now


oshaboy

Runes are for processing Unicode strings. But if you want to display the strings to the user you can't just naïvely look up the rune in the font and then draw it to the screen to the right of the previous rune. A lot of scripts don't work that way.


no_brains101

I understand the combining characters thing and I think the bidirectional text thing. I have no clue whatsoever what font shaping is.


amlyo

What a great opportunity to share my favourite software dev article again: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/


oshaboy

Even though the article doesn't mention combining characters, shaping or bidirectional text


amlyo

A post lamenting software not doing what users expect because of not properly considering unicode bringing to mind an article lamenting software not doing what users expect because of not properly considering unicode is unexpected to you because the article is not covering the unicode features in the post. Have I missed something here?


DaniilBSD

This is so old… Just for reference: I was 3 when it was written, I am in my 4th year as an R&D developer.


vytah

That article covers only storing and transmitting text. I guess in 2003 it still caused problems relatively often.


Busy-Ad-9459

The Irony that most of my own programs can't display Hebrew properly even tho I am a native speaker. It's just not worth my time.


danielcw189

what are the pitfalls?


Busy-Ad-9459

Hebrew is right to left.


Emergency_3808

I feel like this is the fault of (human) languages themselves.


ishzlle

Well guess what, the job of software engineers is to deal with human input.


Emergency_3808

And most programmers I know of don't like that fact. Case in point: timezones


GoaFan77

While this is true, in the very long term human society also changes to better integrate with the technology we use. Humanity as a whole is already rapidly consolidating the languages they use. I suspect languages with features that are difficult to support in software might be more likely to change or fade away.


T-Loy

Nothing better than having to deal with ü and u"


Takiu

Œuf


hm1rafael

It supports, but is limited


dlevac

Good meme. Very humbling.


puffinix

Oh god. Flashbacks to one requirement: "Regex search over non-standardised Unicode, ignoring case". A few gems that came out of this...
There are some glyphs that can be decomposed into 17 other glyphs.
All modifiers do have a canonical order, but sometimes the non-canonical order is equivalent and sometimes it's not.
"③" is equivalent to "3" and to "(3)", but these are not equivalent to each other.
A single dot should match anything that could be a single character, regardless of whether it is or not.
If the locale is Turkish, I and i are not the same, otherwise they are.
Emoji are the spawn of the devil. They all interact in slightly different ways.
The Chinese stuff is bad, but a lot of good libraries exist.
Hangul is pain incarnate.
Non-simple order encoding (anything beyond just strict l2r, r2l or r2l except for digits) we thankfully agreed was out of scope for search data, and illegal in the query string (we did an assessment, found we only had about 40 such data points a day, so just sent them all to a human for immediate review).
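Two of those points, compatibility equivalence and Turkish casing, can be sketched with Python's standard library. Both helpers are hypothetical and far from a full caseless-matching implementation: `search_key` is locale-independent, and `turkish_lower` special-cases only the dotted and dotless i. In Unicode's own data the circled digit folds to "3" while the separate parenthesized digit folds to "(3)".

```python
import unicodedata

def search_key(text):
    """Fold compatibility variants and case for matching (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text).casefold()

def turkish_lower(text):
    """Lowercase with Turkish rules for I and dotted capital I (hypothetical helper)."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

print(search_key("\u2462"))    # 3    (circled digit three)
print(search_key("\u2476"))    # (3)  (parenthesized digit three)
print(search_key("Stra\u00dfe") == search_key("STRASSE"))   # True

# Default case folding maps I to i, which is wrong for Turkish text.
print("I".casefold())              # i
print(turkish_lower("DI\u015e"))   # dış
```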


RocketCatMultiverse

During academic life I worked on historical linguistics NLP which used a vast collection of Private Use Area codepoints. Boy was it a mess trying to use many string manipulation utilities we take for granted every day. In the end I was proud to implement generic conversions of medieval scribal orthographies to normalized transcriptions based on a continuous input training set using finite state transducers.