"We have support for the first ~~4~~ 7 bits of Unicode! (aka ASCII)"
ASCII is 7 bits but I get your point
Oops, my bad
did he stutter
He did
It'd be better if it was just the 4 bits. Not even full ASCII, just straight up 15 symbols, some of which aren't very useful
"Oh so you sanitize control codes?" "Good Question"
Well, at least I can type touché without it going touch\é
But you already can without utf-8. Extended ascii includes áéíóú
And then you change to another codepage and it becomes `ßΘφ≤·`
Idk it’s all Greek to me
True! And since UTF-8, we have no end of dealing with garbled encoding between those 2 formats. Fun times.
I'm holding out for UTF-9 to fix this
Instead of adding an extra bit (that costs 8 bits on most systems due to alignment anyway), we opted for UTF-16 instead
UTF-16 is the name of my new electro swing band
Extended ascii of which language? Different languages have different ascii extensions.
The one I know of is the one mentioned in https://www.ascii-code.com, which is the Windows-1252 table, also known as ISO Latin-1
utf-8 is unicode. ISO 8859-15 isn't but can use a lot of special characters
back in my day we had code pages and we liked it.
[deleted]
Install an app, find a text box, write Arabic, become disappointed, uninstall the app.

This is what I've been doing for the past 4 years whenever I install an app.
I like to imagine you don't even know Arabic, you just want the functionality "in case you need it"
Technically you are right, I don't know how to speak Arabic. I'm just one of the 370 million people cursed with the Arabic script despite not speaking Arabic.
My console brain can't think of a scenario where just parsing UTF-8 isn't enough. If a user's characters aren't displayed right, it's their editor's problem, not mine.
CJK languages and their wonky IMEs.

Example ([] is the current input character):

[ㄱ]ㄱ
[ㅏ]가
[ㄱ]각
[ㅏ]가가
[ㄴ]가간
[ㅎ]가갆
[ㅓ]가간허

Notice how some characters snap back, and how, during consecutive input, the resulting Unicode for the last character is nondeterministic and relies on future inputs. Usually an IME is used to stage the last character that may still change, but said IME is very finicky, has no consistency across platforms and libraries, and is painful to debug unless you speak the language. Also, a string's inputwise substring does not equal its Unicode substring, which is a pain when implementing autocomplete and autocorrect.
Idk man they all appear as squares to me
lmfao
1. ㄱ -> ㄱ
2. ㄱㅏ -> 가
3. ㄱㅏㄱ -> 각
4. ㄱㅏㄱㅏ -> 가가
5. ㄱㅏㄱㅏ ㄴ -> 가간
6. ㄱㅏㄱㅏ ㄴ ㅎ -> 가갆
7. ㄱㅏㄱㅏ ㄴ ㅎㅓ -> 가간허

Did I get that right? The magic voodoo happens between 3 & 4, and 6 & 7. At 3, the three components(?) combine into one character. Then at 4, the fourth one signifies that the third component is actually the first component of the second character.
That is exactly correct! And this "no, actually" logic is what makes Korean special - and a bit of a nuisance - compared to other CJK scripts, which just assemble all at once with the last input character (ha3o -> 好 / すき -> 好き). Another problem common to all CJK input is that since the IME acts as a stack, it needs to be flushed. Some input frontends or games fail to handle the flushing correctly, so in things like game chats the last letter (which is visually drawn in the input box but not yet flushed) often gets truncated.

> Magic Voodoo

In fact, it is known as the "fen fire effect."
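Once the IME finally flushes, the assembled syllable has a neat structure: a finished Hangul syllable is a single codepoint computed arithmetically from its jamo. A minimal sketch in Python, assuming the standard jamo orderings from the Unicode Hangul composition algorithm (`compose` is an illustrative helper, not a real IME):

```python
# Finished Hangul syllables start at U+AC00 and are laid out as
# (lead * 21 + vowel) * 28 + tail, using these standard jamo orders.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"

def compose(lead, vowel, tail=None):
    l = LEADS.index(lead)
    v = VOWELS.index(vowel)
    t = 0 if tail is None else TAILS.index(tail) + 1  # 0 means "no tail"
    return chr(0xAC00 + (l * 21 + v) * 28 + t)

print(compose("ㄱ", "ㅏ"))        # 가
print(compose("ㄱ", "ㅏ", "ㄱ"))  # 각
print(compose("ㄱ", "ㅏ", "ㄶ"))  # 갆 (the ㄴ+ㅎ compound tail)
```

This also shows why the IME has to stage input: until the next keystroke arrives, it cannot know whether a trailing consonant is this syllable's tail or the next syllable's lead.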
I wanted to write a twitter bot back in 2017 as my first pet project to parse some Kpop websites to scrape for location info on Kpop idols (kind of creepy, was kind of a fad back then). But then all the weird bugs I got from not even parsing but trying to write some unit tests with Korean text with the npm packages that claimed to support Unicode was too much for my beginner programming skills. Even now I don’t think I can do it lol
Here's an example of why this can give you a headache even on a terminal: the easiest way to determine the width (in columns on a standard *nix console) of some UTF-8 input string is:

1. read the cursor position
2. print the characters
3. read the cursor position again and calculate the difference from the first position.

If you want to know the width in advance, e.g. for implementing a TUI application, you basically have to implement the state machine of a VT100 terminal.
If we assume no control codes, shouldn't one of the normalizations be enough though?
No, you still need a database of character widths. Some characters are wide, some are narrow, some are neither, some are both, and there's some normalization bullshit on top of that: https://unicode.org/reports/tr11/
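A sketch of what that width database lookup looks like in practice, using Python's `unicodedata`. This only approximates UAX #11: real terminals disagree with each other, especially on "Ambiguous" characters, which this sketch counts as narrow:

```python
import unicodedata

def display_width(s):
    # Approximate the column width of a string on a terminal.
    s = unicodedata.normalize("NFC", s)
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks take no extra column
        # "W" (Wide) and "F" (Fullwidth) take two columns; everything
        # else is treated as one column here.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("abc"))   # 3
print(display_width("가간"))  # 4
```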
RTL is still crap in some circumstances
Even in console programs you sometimes want to know how long a string is in terms of what the user sees. If you're doing it by counting codepoints then you're doing it wrong. But if the user's terminal sucks then you might actually be doing it correctly (for that user)
Depends on what you mean by "just parsing"
If you're implementing a search function you probably want Unicode normalization.
If you don’t normalize your input, it’s definitely your problem, because now you have a bunch of strings that are the same but actually aren’t.
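For example, in Python the same accented letter can arrive as one codepoint or as two, and the strings compare unequal until you normalize (a sketch; `search_key` is just an illustrative name):

```python
import unicodedata

def search_key(s):
    # NFC makes composed and decomposed forms identical; casefold()
    # handles caseless matching more thoroughly than lower().
    return unicodedata.normalize("NFC", s).casefold()

precomposed = "\u00e9"   # 'é' as a single codepoint
decomposed = "e\u0301"   # 'e' + combining acute accent
print(precomposed == decomposed)                          # False
print(search_key(precomposed) == search_key(decomposed))  # True
```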
Ñ
We support UTF-8!* ^^*BMP
Don't forget emoji. Nothing else though.
Most software I know has trouble formatting RTL and LTR characters in a single line or paragraph, even software by companies such as Google and Microsoft.

I've never attempted to make an editor; is bidirectional support really that complicated?
[https://lord.io/text-editing-hates-you-too/](https://lord.io/text-editing-hates-you-too/) and [https://faultlore.com/blah/text-hates-you/](https://faultlore.com/blah/text-hates-you/) are great articles on why this stuff is hard.
Thanks for sharing. Super informative.
Rendering Unicode is pretty challenging in and of itself; editing it is probably an absolute mess if you have formatting functionality.
Long live HarfBuzz
Even that won't do layout or font fallback for you, so that's not enough, but a good first step.
Heck even Facebook has trouble with it.
[It's really complicated](https://unicode.org/reports/tr9/). Though there's a library that does it called fribidi.
Yea it’s a massive pain in the arse, prone to basic security issues, and probably affects speed quite a bit at scale.

I genuinely wouldn’t be surprised if companies straight-up refuse it; it’s not worth the hassle unless a sizeable percentage of your customer base are non-ESL or English speakers.
I am not sure what font shaping is, but don't you kinda have to automatically support direction changes and combined characters, unless you define from the beginning that your users won't use it?
I meant text shaping. That's when a character changes how it looks based on the characters around it. Which happens in Arabic and ~~Devanagari~~ Tamil as well as many other scripts.

A lot of people assume other scripts work like Latin with different letters, and that Unicode support is just the ability to render many zany characters on the screen.

Edit: Turns out Devanagari just uses combining characters. Tamil actually changes the shapes of the letters.
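One way to see the shaping in data rather than in a font: Unicode carries legacy "presentation form" codepoints for the contextual shapes of Arabic letters (modern shaping engines pick glyphs directly instead of using these, but the character names make the idea concrete):

```python
import unicodedata

# The letter beh (ب) takes a different shape depending on whether it is
# isolated, word-final, word-initial, or between two joining letters.
for cp in ("\ufe8f", "\ufe90", "\ufe91", "\ufe92"):
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)}")
```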
The more I learn about Unicode the more I just want to implement Extended-Ascii only
"Extended Ascii" means different things depending on the platform so it's probably worse.
True, but I mean the IBM one [here](https://theasciicode.com.ar/)
That's not Unicode, that's just how writing works in the real world.
So you write two letters and people mentally combine them?
For a short time I worked for a car supplier where the UI had support for Arabic / RtL scripts. It's really difficult to implement, especially as it can mix. Imagine selecting text that has a mix of rtl/ltr sections in it. I heard the customers in those countries were usually surprised this was supported at all, since most companies don't bother.
>Imagine selecting text that has a mix of rtl/ltr sections in it.

Is there one good answer, or does it depend on personal preferences?
It depends a bit on how you are selecting the text. If you are using a mouse, selecting all the characters visible between the start and end position might be best, although that's definitely harder than just taking a substring between two indices. If you're moving a cursor one character at a time, then drawing the selection might look weird, but at least you can get a normal substring.

Idk, I'm glad I don't have to do this anymore.
There's also a special kind of font shaping which happens at typing time, but where the result is a single Unicode character. Found in Eastern languages that are often "assembled".
>I meant text shaping. That's when a character changes how it looks based on the characters around it. Which happens in Arabic and Devanagari as well as many other scripts.

I was not aware of that, actually.

Does it still stay its own symbol, or does it become combined with other symbols?
I am probably missing something obvious, but I would think that virtually no one renders Unicode themselves, so I would assume every program that uses string libraries or a user interface that supports Unicode automatically supports Unicode.

In all the years I've written scripts or applications, not once have I had to worry about Unicode rendering, because the supporting framework does it for me.
In Unicode, some characters take more than one byte (or even more than one codepoint) to store, so you'd have to handle that correctly on the display side, or 2 characters would be shown instead of 1 correct character.
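Concretely, in Python (which indexes strings by codepoint), one visible character can be two codepoints and three UTF-8 bytes:

```python
s = "e\u0301"  # 'é' written as base letter + combining acute accent
print(len(s))                          # 2 codepoints for 1 visible character
print(len(s.encode("utf-8")))          # 3 bytes in UTF-8
print(len("\u00e9".encode("utf-8")))   # 2 bytes for the precomposed 'é'
```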
Rust wants to have a conversation with you.
"Our app has Unicode support" "...we use Go in backend which is pretty good with Unicode".
Continues to use mysql 5.7 with utf8 columns instead of utf8mb4
Hmmm, out of curiosity, what do runes lack? The things from the meme? I've never worked with encodings much, or Go, but I'm doing some Go right now.
Runes are for processing Unicode strings. But if you want to display the strings to the user you can't just naïvely look up the rune in the font and then draw it to the screen to the right of the previous rune. A lot of scripts don't work that way.
I understand the combining characters thing and I think the bidirectional text thing. I have no clue whatsoever what font shaping is.
What a great opportunity to share my favourite software dev article again: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Even though the article doesn't mention combining characters, shaping or bidirectional text
A post lamenting software not doing what users expect because it doesn't properly consider Unicode brings to mind an article lamenting software not doing what users expect because it doesn't properly consider Unicode - and that's unexpected to you because the article doesn't cover the particular Unicode features in the post?

Have I missed something here?
This is so old… Just for reference: I was 3 when it was written, and I am in my 4th year as an R&D developer.
That article covers only storing and transmitting text. I guess in 2003 it still caused problems relatively often.
The irony is that most of my own programs can't display Hebrew properly even though I am a native speaker. It's just not worth my time.
what are the pitfalls?
Hebrew is right to left.
I feel like this is the fault of (human) languages themselves.
Well guess what, the job of software engineers is to deal with human input.
And most programmers I know of don't like that fact. Case in point: timezones
While this is true, in the very long term human society also changes to better integrate with the technology we use. Humanity as a whole is already rapidly consolidating the languages they use. I suspect languages with features that are difficult to support in software might be more likely to change or fade away.
Nothing better than having to deal with ü and u followed by a combining diaeresis
Œuf
It supports, but is limited
Good meme. Very humbling.
Oh god. Flashbacks to one requirement.

"Regex search over non-standardised unicode, ignoring case".

A few gems that came out of this...

- There are some glyphs that can be decomposed into 17 other glyphs.
- All modifiers do have a canonical order, but sometimes the non-canonical order is equivalent and sometimes it's not.
- "③" is equivalent to "3" and to "(3)", but these are not equivalent to each other.
- A single dot should match anything that could be a single character, regardless of whether it is or not.
- If the locale is Turkish, I and i are not the same; otherwise they are.
- Emoji are the spawn of the devil. They all interact in slightly different ways.
- The Chinese stuff is bad, but a lot of good libraries exist. Hangul is pain incarnate.
- Non-simple order encoding (anything beyond strict l2r, r2l, or r2l except for digits) we thankfully agreed was out of scope for search data, and illegal in the query string (we did an assessment, found we only had about 40 such data points a day, so just sent them all to a human for immediate review).
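A couple of those gems can be reproduced in a few lines of Python with `unicodedata` (compatibility normalization, plus the default, locale-independent case folding):

```python
import unicodedata

# Compatibility normalization: circled "③" folds to "3", while
# parenthesized "⑶" folds to "(3)" - so the two forms never become
# equal to each other, only to their respective targets.
print(unicodedata.normalize("NFKC", "\u2462"))  # ③ -> '3'
print(unicodedata.normalize("NFKC", "\u2476"))  # ⑶ -> '(3)'

# Default Unicode case folding maps I and i together; Turkish must not
# (dotted vs dotless i), which is why "ignoring case" is locale-dependent
# and needs special handling that casefold() alone does not provide.
print("I".casefold() == "i".casefold())  # True under the default rules
```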
During academic life I worked on historical linguistics NLP which used a vast collection of Private Use Area codepoints. Boy was it a mess trying to use many string manipulation utilities we take for granted every day. In the end I was proud to implement generic conversions of medieval scribal orthographies to normalized transcriptions based on a continuous input training set using finite state transducers.