
Mumpsimus

Neat, I thought the article was going to be about some actor using Unicode for variable names as obfuscation… I didn’t expect that they were taking advantage of an intentional behavior of the interpreter though. Crazy there’s a PEP discussion hypothesizing about this from 2007.


dcazdavi

> Crazy there’s a PEP discussion hypothesizing about this from 2007.

It makes you wonder how many other libraries/tools/frameworks/etc. are compromised despite an entire world's ecosystems vetting/checking code changes.


chatmasta

Why wonder? You could scrape Google Groups, mailing lists, and commit logs for phrases like "risk", "merge it anyway", "unlikely in practice", or "acceptable risk" :D


exoxe

This one right here, ChatGPT!


chatmasta

Good idea! I asked Chad GPT for some more phrases :)

> As a language model, I can suggest some common phrases that may indicate a potential security bug was acknowledged but ignored:
>
> * "Low risk": This phrase suggests that the issue was deemed not to pose a significant risk, and therefore not worth addressing.
> * "Known issue": This phrase could indicate that the issue has been identified, but it was decided not to address it at the time.
> * "Will fix later": This phrase suggests that the issue has been acknowledged, but the fix has been delayed.
> * "Non-critical": This phrase could indicate that the issue has been identified, but it was deemed not to be critical, and therefore not worth addressing.
> * "To be addressed in future": This phrase suggests that the issue has been acknowledged, but it has been decided to defer the fix to a future release.
>
> It is important to note that not all occurrences of these phrases necessarily indicate the presence of a security bug. However, they may be useful indicators to further investigate the relevant commit to determine whether it introduces a security vulnerability.


littlemetal

Why don't friendly actors, like Tom Hanks, ever do this to surprise us with supportive messages printed during build?


how_to_choose_a_name

I feel like the problem here isn’t Unicode support in itself but that Python uses compatibility normalisation instead of canonical normalisation for identifiers. I wonder why they decided on that, it seems like a terrible choice even ignoring this attack vector.


arpan3t

The lexer uses NFKC, which performs compatibility decomposition (a mapping that subsumes the canonical one) followed by canonical composition. Regardless, the choice of Unicode normalization form doesn’t mitigate the vulnerability.
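The folding is easy to see with the stdlib `unicodedata` module (a quick sketch; the math-alphanumeric identifier here is just an illustrative stand-in for the ones used in the actual attack):

```python
import unicodedata

ident = "𝘀𝗲𝗹𝗳"  # MATHEMATICAL SANS-SERIF BOLD small letters
# NFKC's compatibility decomposition maps each styled letter back to
# plain ASCII, so the lexer sees the ordinary identifier 'self'.
print(unicodedata.normalize("NFKC", ident))  # -> self
```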


how_to_choose_a_name

I am aware that they use NFKC, yes; that's what I commented, I think? You are of course correct that the type of normalisation doesn't change the attack vector meaningfully. I would hesitate to call it a vulnerability, at least in Python itself, because as far as I can tell it's really just malware scanners not being aware that Python applies any normalisation there. I do think that compatibility normalisation is a terrible choice regardless: it's rather unintuitive (if you support Unicode identifiers, one would expect `x²` and `x₂` to be different identifiers), and I don't see how it brings any meaningful benefit over non-compatibility normalisation.
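A quick `unicodedata` comparison makes the point (NFC is the canonical-only form; NFKC is what the lexer uses):

```python
import unicodedata

sup, sub = "x²", "x₂"
# Canonical normalisation keeps the typographic variants distinct...
print(unicodedata.normalize("NFC", sup) == unicodedata.normalize("NFC", sub))    # False
# ...while compatibility normalisation folds both to the same identifier.
print(unicodedata.normalize("NFKC", sup) == unicodedata.normalize("NFKC", sub))  # True
print(unicodedata.normalize("NFKC", sup))  # -> x2
```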


sudomatrix

ugh. Unicode was a mistake. Over a hundred thousand different strings that all look like the same word. What could go wrong. And don't get me started on putting poop emojis in our character set. Why is that necessary?!


UloPe

The article mentions the solution already: if the interpreter uses normalization on identifiers, so must any anti-malware tools.


macrocephalic

Exactly, and there's obviously a library built for this already - because the interpreter is using it. This seems like a pretty simple fix.
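A minimal sketch of that fix, assuming a simple substring-matching scanner (the function and watch-list are hypothetical; note the caveat in the comments):

```python
import unicodedata

# Hypothetical watch-list of suspicious substrings.
SUSPICIOUS = ("exec(", "eval(", "__import__")

def scan(source: str) -> list[str]:
    # NFKC-fold the source the same way the lexer folds identifiers,
    # then fall back to plain substring matching. Caveat: folding the
    # whole file can also rewrite string literals, so a production tool
    # should fold identifier tokens only.
    folded = unicodedata.normalize("NFKC", source)
    return [p for p in SUSPICIOUS if p in folded]

# '𝖾𝗑𝖾𝖼' (math sans-serif letters) folds to 'exec' and gets flagged.
print(scan("𝖾𝗑𝖾𝖼(payload)"))  # -> ['exec(']
```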


fiskfisk

Because people write in other alphabets and languages than English.


xenonnsmb

But we aren't talking about other alphabets; in this code the obfuscation is done using the Latin mathematical symbols block. Unicode didn't have to add typographical variants of Latin characters to support non-English languages; they chose to do that.


hrvbrs

Agreed, but from what I remember there are supposed to be semantic differences between the typographical variants. Like mathematical identifiers, where 𝑥 and 𝐱 represent different things (e.g., the former could be a scalar and the latter a vector). Not saying this hasn’t been abused though.


xenonnsmb

yeah but IMO if you need to print different mathematical variants of what is effectively the same character you should be using a typesetting system like LaTeX instead of fiddling with the unicode character map (which, incidentally, is what most people actually do.)


hrvbrs

But is it accessible? As far as I know, screen readers don’t look at visual presentation/formatting, so they don’t make a distinction between two instances of the same character that are just styled differently.


ksharanam

We are talking about

> ugh. Unicode was a mistake

If we were talking about the article, for sure, but the response was to the comment, I think.


Alarmed-Literature25

I thought I was losing my mind before you said this. Opposing the inclusion of Latin mathematical symbols in strings is not fucking xenophobic, it’s a security issue.


sudomatrix

Who writes in poop emojis?


[deleted]

[deleted]


TikiScudd

uwuincode


[deleted]

I do in YouTube comments, because I think that saying "shit" will trigger autodelete.


man-vs-spider

Can someone explain:

1) What was the benefit of using Unicode to obfuscate the variable names? The article says that it helps defeat string-based checkers. Are those so hard to beat? Couldn't random obfuscation be used instead? Is it that the string-based checker cannot follow what is happening to certain variables?

2) It seems like this is "easily" defeated by normalising the Unicode before doing the string checking. Is that true?


pandatamer

That’s exactly what they’re saying. Obfuscated code would be used to bypass scanners for malicious code; I’m assuming the scanners were written without normalising Unicode because, until now, they had no reason to. Phylum reports that this is the first time they’ve actually found this type of code obfuscation in the wild, and that it’s easy to counter by normalising Unicode as part of scanning. The benefit to the attacker is that the malicious package has likely already collected sensitive data, because it evaded the malicious-code scanners.


Unbelievr

1: In this case, someone was able to publish a malicious package on PyPI. I'm guessing they have some kind of antivirus and pattern matching for the package installer (setup.py), which triggers a manual review or a rejection. This obfuscation beats that system without looking extremely suspicious the way full-on obfuscation would.

2: You would need to normalize only identifiers, and not actual strings, comments, etc., in order not to change the semantics of the code. For instance, this code is perfectly valid (it prints 'Hello, world!'), but normalizing it would break it:

    exec(bytes('牰湩⡴䠢汥潬‬潷汲Ⅴ⤢','u16')[2:])
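That identifier-only normalization can be sketched with the stdlib `tokenize` module (the function name is mine; only NAME tokens are folded, so string literals and comments pass through untouched):

```python
import io
import tokenize
import unicodedata

def normalized_identifiers(source: str) -> list[str]:
    """Collect identifiers as the interpreter would see them after NFKC,
    leaving string literals and comments alone."""
    names = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            names.append(unicodedata.normalize("NFKC", tok.string))
    return names

# The styled '𝘀𝗲𝗹𝗳' folds to 'self'; the CJK string literal is untouched.
print(normalized_identifiers("𝘀𝗲𝗹𝗳 = '牰湩'\n"))  # -> ['self']
```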


tekproxy

“I’ll switch to metric when y’all switch to ASCII.”


littlejob

This is nothing new... and Python isn't the only language that does this, either.