My clawbot & other AI agents already have this figured out.
/s
Sure, third-party services like the OP can provide bots that can scan. But if you create an ecosystem in which PRs can be submitted by threat actors, part of your commitment to the community should be to provide visibility into attacks that cannot be seen by the naked eye, and make that protection the norm rather than the exception.
[0] https://docs.github.com/en/get-started/learning-about-github...
It makes the product better
I know people love to talk money and costs and "value", but HN is a space for developers, not the business people. Our primary concern, as developers, is to make the product better. The business people need us to make the product better, keep the company growing, and beat out the competition. We need them to keep us from fixating on things that are useful but low priority and to ensure we keep having money. The contention between us is good, it keeps balance. It even ensures things keep getting better even if an effective monopoly forms, as they still need us, the developers, to make the company continue growing (look at monopolies people aren't angry at and how they're different). And they need us more than we need them.
So I'd argue it's the responsibility of the developers, hired by GitHub, to create this feature because it makes the product better. Because that's the thing you've been hired for: to make the product better. Your concern isn't about the money, your concern is about the product. That's what you're hired for.
See commenter on their 2025 bounty for reporting it, won't-fix resolution: https://news.ycombinator.com/item?id=47393393
i see squares on a properly configured vim on xterm.
https://docs.github.com/en/authentication/managing-commit-si...
I first heard about the possibility of this kind of attack >10 years ago, and I'll sometimes do an xxd if I'm feeling a bit paranoid.
The mere fact that a software maintainer would merge code without knowing what it does says more about the terrible state of software.
I don't know if it is relevant in any specific case being discussed here, but if the exploit route is via gaining access to the accounts of previously trusted submitters (or otherwise being able to impersonate them), it could be a case of a team with a pile of PRs to review (many of which are the sloppy unverified LLM output that is causing a problem for some popular projects) letting through an update from a trusted source that has been compromised.
It could correctly be argued that this is a problem caused by laziness and corner cutting, but it is still understandable because projects that are essentially run by a volunteer workforce have limited time resources available.
Of course, it doesn't work though. I reported this to their bug bounty, they paid me a bounty, and told me "we won't be fixing it": https://joshua.hu/2025-bug-bounty-stories-fail#githubs-utf-f...
The exact quote is "Thanks for the submission! We have reviewed your report and validated your findings. After internally assessing your report based on factors including the complexity of successfully exploiting the vulnerability, the potential data and information exposure, as well as the systems and users that would be impacted, we have determined that they do not present a significant security risk to be eligible under our rewards structure." The funny thing is, they actually gave me $500 and a lifetime GitHub Pro for the submission.
The article is about JavaScript, although it can apply to other programming languages as well. However, even in JavaScript, you can use \u escapes in place of the non-ASCII characters. (One of my ideas for a programming language design intended as a better alternative to C is that it forces visible ASCII (and a few control characters, with some restrictions on their use), unless you specify by a directive or switch that you want to allow non-ASCII bytes.)
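To illustrate the \u escape point, here is a minimal sketch of rewriting a string so that only visible ASCII remains in the source. The helper name `escapeNonAscii` is hypothetical, and the choice to escape everything above 0x7E (and to leave ASCII control characters alone) is an assumption, not something from the article.

```javascript
// Hypothetical helper: rewrite every UTF-16 code unit above 0x7E as a
// \uXXXX escape. Characters outside the BMP become two escapes (a
// surrogate pair), which is still a valid JavaScript string literal body.
function escapeNonAscii(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    out += unit > 0x7e
      ? "\\u" + unit.toString(16).toUpperCase().padStart(4, "0")
      : text[i];
  }
  return out;
}

console.log(escapeNonAscii("caf\u00E9"));   // caf\u00E9
console.log(escapeNonAscii("a\u{E0100}b")); // a\uDB40\uDD00b
```

The escaped form is longer, but every character that affects behavior is now visible in a diff.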
Same. And I enforce it. I've got scripts and hooks that enforce that source files only ever contain a subset of ASCII (not even all ASCII codes have their place in source code).
Unicode character strings are perfectly fine in resource files. You can build perfectly good i18n/l10n apps and webapps without ever using a single Unicode character in a source file. And if you really do need one, there's indeed ASCII escaping available in many languages.
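A check of the kind described above could be sketched roughly like this. The allowed subset (tab plus printable ASCII 0x20-0x7E, with lines split on LF) and the function name `findDisallowedLines` are assumptions for illustration; wiring it into an actual commit hook is left out.

```javascript
// Assumption: a line is acceptable if it contains only tab and printable
// ASCII (0x20-0x7E). A carriage return would be flagged, so CRLF files
// would need the pattern relaxed.
const ALLOWED_LINE = /^[\t\x20-\x7E]*$/;

// Returns the 1-based numbers of lines that contain anything else.
function findDisallowedLines(source) {
  const offending = [];
  source.split("\n").forEach((line, index) => {
    if (!ALLOWED_LINE.test(line)) offending.push(index + 1);
  });
  return offending;
}

const sample = 'const a = 1;\nconst b = "\u00E9";\n\tok();';
console.log(findDisallowedLines(sample)); // [ 2 ]
```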
Some shall complain that their name as "Author: ..." in comments cannot be written properly in ASCII. If I wanted to be facetious I'd say that soon we'll see:
# Author: Claude Opus 27.2
and so the point shall be moot anyway.
The biggest use of Unicode in source repos now might be LLM slop, so I certainly don't miss it at all.
The malicious code was introduced in this commit - https://github.com/pedronauck/reworm/commit/d50cd8c8966893c6...
It says coauthored by dependabot and refers to a PR opened in 2020 (https://github.com/pedronauck/reworm/pull/28).
That PR itself was merged in 2020 here - https://github.com/pedronauck/reworm/commit/df8c1803c519f599...
But the commit with the worm (d50cd8c), re-introduces the same change from df8c180 to the file `yarn.lock`.
And when you look at the history of yarn.lock inside of github, all references to the original version bump (df8c180) are gone...? In fact if you look at the overall commit history, the clean df8c180 commit does not exist.
I'm struggling to understand what kind of shenanigans happened here exactly.
Innocuous PR (but do note the line about "pedronauck pushed a commit that referenced this pull request last week"): https://github.com/pedronauck/reworm/pull/28
Original commit: https://github.com/pedronauck/reworm/commit/df8c18
Amended commit: https://github.com/pedronauck/reworm/commit/d50cd8
Either way, pretty clear sign that the owner's creds (and possibly an entire machine) are compromised.
But really, it still has to be injected after the fact. Even the most superficial code review should catch it.
Sure, the payload is invisible (although tbh I'm surprised it is; PUA characters usually show up as boxes with hex codes for me), but the part where you put an "empty" string through eval isn't.
If you are not reviewing your code enough to notice something as nonsensical as calling eval() on an empty string, would you really notice the non-obfuscated payload either?
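For anyone whose editor does not draw those boxes, a rough sketch of finding private-use characters programmatically follows. It relies on the Unicode property escape `\p{Co}` (General_Category = Private_Use); the function name and the sample string are made up for illustration and are not the payload from the article.

```javascript
// \p{Co} matches private-use code points: U+E000-U+F8FF plus the
// private-use ranges in planes 15 and 16.
const PRIVATE_USE = /\p{Co}/gu;

// Hypothetical helper: list every private-use character with its offset.
function findPrivateUse(source) {
  return [...source.matchAll(PRIVATE_USE)].map((match) => ({
    index: match.index,
    codePoint:
      "U+" + match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

// Looks like eval("") in many fonts, but the string is not empty.
const suspicious = 'eval("\uE000\u{F0001}")';
console.log(findPrivateUse(suspicious));
```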
It sounds like Python only allows approved Unicode characters to start a variable name but if it allowed any you could do something like `nonprintable = lambda x: insert exploit code here`. If that was hidden in what looked like a blank line between other additions would you catch it?
I'm sure there's some other language out there that has similar syntax and lax Unicode rules this could be used in.
The solution is that this and many other Unicode formatting characters should be ignored and converted to a visible indicator in all code views when you expect plain text.
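One way such a visible indicator could work is sketched below. Treating "invisible" as the format (`\p{Cf}`) and private-use (`\p{Co}`) categories is an assumption, and the `<U+XXXX>` marker style and the name `revealHidden` are invented for the example.

```javascript
// Assumption: the characters worth surfacing are format characters (Cf,
// which includes zero-width space/joiners and direction marks) and
// private-use characters (Co).
const HIDDEN = /[\p{Cf}\p{Co}]/gu;

// Replace each hidden character with a visible <U+XXXX> marker.
function revealHidden(text) {
  return text.replace(
    HIDDEN,
    (ch) =>
      "<U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0") + ">"
  );
}

console.log(revealHidden("a\u200Bb\uE000c")); // a<U+200B>b<U+E000>c
```

A code view would apply this only when rendering, leaving the underlying file untouched.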
This isn't about formatting characters, this is about private use characters.
For data or code hiding the Acme::Bleach Perl module is an old example though by no means the oldest example of such. This is largely irrelevant given how relevant not learning from history is for most.
Invisible characters may also cause hard to debug issues, such as lpr(1) not working for a user, who turned out to have a control character hiding in their .cshrc. Such things as hex viewers and OCD levels of attention to detail are suggested.
Then, any appearance of unprintable characters should also be flagged. There are rather few legitimate uses of some zero-width characters, like ZWJ in emoji composition. Ideally all such characters should be inserted as \uNNNN escape sequences, and not literal characters.
Simple lint rules would suffice for that, with zero AI involvement.
Emojis are another abomination that should be removed from Unicode. If you want pictures, use a gif.
I have considered allowing a short list that does not include emojis, joining characters, and so on - basically just currency symbols, accent marks, and everything else you'd find in CP-1252 - but never got around to it.
grep -P '[\x{200B}\x{200C}\x{200D}\x{FEFF}]' code.ts
See https://stackoverflow.com/q/78129129/223424
And please, everyone arguing the code snippet should never have passed review - do you honestly believe this is the only kind of attack that can exploit invisible characters?
Is there ever a circumstance where the invisible characters are both legitimate and you as a software developer wouldn't want to see them in the source code?
And, yes, there is a circumstance if you want to include Arabic or Hebrew in comments or strings. You need the zero width left-right markers to make that work.
Looks like it's pilfering Solana wallets that don't belong to Russians.
Claude’s analysis seems solid here based on reading the snippets it tested.
A purpose-built linter could be cross-language, it’s pretty reasonable to blanket ban these characters entirely, or at least allowlist them.
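Because such a linter only needs the raw text, it does not care which programming language it is looking at. A minimal sketch, assuming the banned set is the format and private-use categories and that the allowlist holds only the two direction marks mentioned earlier in the thread (both choices, and the name `lintInvisible`, are hypothetical):

```javascript
// Hypothetical default allowlist: LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
const DEFAULT_ALLOWLIST = new Set([0x200e, 0x200f]);
const BANNED = /[\p{Cf}\p{Co}]/u;

// Report every banned character that is not explicitly allowlisted.
function lintInvisible(text, allowlist = DEFAULT_ALLOWLIST) {
  const violations = [];
  let line = 1;
  for (const ch of text) {
    if (ch === "\n") {
      line++;
      continue;
    }
    const codePoint = ch.codePointAt(0);
    if (BANNED.test(ch) && !allowlist.has(codePoint)) {
      violations.push({ line, codePoint });
    }
  }
  return violations;
}

// The LRM on line 1 is allowed; the zero-width joiner on line 2 is not.
console.log(lintInvisible("x\u200Ey\n\u200Dz"));
```

Passing an empty `Set` as the allowlist gives the blanket ban.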
Things that vanish on a printout should not be in Unicode.
Remove them from Unicode.
I am wondering: now that they have LLMs, are people using them to make new kinds of malicious code, more sophisticated than before?
Besides, that's why the ban only extends to syntax and string literals (use escapes instead), and not comments.
From my experience, the only two nationalities that insist on mixing their native languages with the mostly English syntax of programming languages are the French and the Japanese. And they can just suck it up for the other 8 billion of us.
Are people using eval() in production code?
Notice that the original commit is verified: https://github.com/pedronauck/reworm/commit/df8c1803c519f599...
While the malicious one is not: https://github.com/pedronauck/reworm/commit/d50cd8c8966893c6...
I still don't quite understand what GitHub is doing to allow someone to say that dependabot coauthored a spoofed commit. This isn't the commit message itself I'm talking about. It's the GitHub interface that officially recognizes this as a dependabot co-authored commit. My hunch is that the malicious author squashed two commits, the original good commit to yarn.lock and a malicious change to package.json, and that somehow maintains the dependabot authorship instead of reassigning it fully to the squash-er.
Like the following example (you can paste it into node to verify), could be spread out over multiple source files to make it even harder to follow:
// prelude 1, obfuscate the constructor property name to avoid raising simple analyser alarms
const prefix = "construction".substring(0,7);
const suffix = "tractor".substring(3);
const obfuscatedConstructorName = prefix + suffix; // innocent looking, but we have the indexing name.
// prelude 2, get the Function class by indexing a function object with our constructor property name (that does not show up in source-code)
const existingFunction = ()=>"nothing here";
const InnocentLookingClass = existingFunction[obfuscatedConstructorName];
// payload decoding elsewhere (this is where we decode our nasty source)
const nastyPayloadDisguisedAsData = "console.log('sourced string that could be malicious')";
// Unrelated location where payload gets executed
const hardToMissFun = new InnocentLookingClass(nastyPayloadDisguisedAsData);
hardToMissFun(); // when this function is run somewhere.. the nasty things happen.
Unless you have a data-tracing verifier or a sandbox that is continuously run, it's going to be very hard to even come close to determining that arbitrary code is being evaluated in this example. Not a single trace of eval or even that the property name constructor is used.
Yes, it's a red flag. Yes, there are legitimate uses. Yes, you should always interrogate evals more closely. All these are true.
Unicode needs tab, space, form feed, and carriage return.
Unicode needs U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK to switch between left-to-right and right-to-left languages.
Unicode needs U+115F HANGUL CHOSEONG FILLER and U+1160 HANGUL JUNGSEONG FILLER to typeset Korean.
Unicode needs U+200C ZERO WIDTH NON-JOINER to encode that two characters should not be connected by a ligature.
Unicode needs U+200B ZERO WIDTH SPACE to indicate a word break opportunity without actually inserting a visible space.
Unicode needs MONGOLIAN FREE VARIATION SELECTORs to encode the traditional Mongolian alphabet.
I should be able to use Ü as a cursed smiley in text, and many more writing systems supported by Unicode support even more funny things. That's a good thing.
On the other hand, if technical and display file names (to GUI users) were separate, my need for crazy characters in file names, code bases and such would be very limited. Lower ASCII for actual file names consumed by technical people is sufficient for me.
Sure, but more crazy stuff gets added all the time.
Rule of thumb: two Unicode sequences that look identical when printed should consist of the same code points.
And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
Some middle ground so that you can use greek letters in Julia might be nice as well.
But I don't see any purpose in using the Private Use Areas (PUA) in programming.
Do you honestly think this is a workable solution?
Also, this attack doesn't seem to use invisible characters, just characters that don't have an assigned meaning.
Using eval for JSON also led to other security issues, like XSSI.
And when the incremental cost to build a feature is low in an age of agentic AI, there should be no barrier to a member of the technical staff (and hopefully they're not divided into devs/test/PM like in decades past) putting a prototype together for this.
Engineers and developers are especially sensitive. It's our job to find problems and fix them. I don't trust engineers that aren't a bit grumpy because it usually means they don't know what the problems are (just like when they don't dogfood). Though I'll also clarify that what distinguishes a grumpy engineer from your average redditor is that they have critiques rather than just complaints. Being critique-oriented means searching for solutions to problems; you can't just stop at problem identification.
> And when the incremental cost to build a feature is low in an age of agentic AI
I'm not sure that's even necessary. A very quick but still helpful patch would be to display invisible characters. Just like we often do with whitespace characters. The diff can be a bit noisier and it's the perfect place for this even if you purposefully use invisible characters in your programming environment.
Though we're also talking about an organization that couldn't, for a year, merge a PR that fixed a one-liner. A mistake that should never have gotten through review. Seriously, who uses a while loop counter checking for equality?!? I'm still convinced they left the "bug" because it made them money.
What is this in reference to? I tried to search for it but only found this comment. “Github while loop fix that was in review for a year”?
Making the product better generally stems from acting in their interest, honing the tool you offer to provide the best possible experience, and making business decisions that respect their dignity.
Your comment talks a lot about product and I agree with it, I just mentioned this so we don't lose sight of the fact this is ultimately about people.
But I also think we've had a culture shift that's hurting our field. Where engineers are arguing about if we should implement certain features based on the monetary value (which are all fictional anyways). But that's not our job. At best, it's the job of the engineering manager to convince the business people that it has not only utility value, but monetary.
According to whom? Certainly not the people who did the hiring.
I somewhat agree that developers should optimize for something other than pure monetary value, but it has nothing to do with the hiring relationship, just the moral duty to use what power you have to make the world better. In general, this can easily conflict with "what you're hired for."
In this case I think showing suspicious (or even all) invisible Unicode in PRs is even a monetarily valuable feature, so the moral angle is mostly moot. And I would put the primary moral burden on the product management either way, since they're the ones with the most power to affect the product, potentially either ordering the right thing to be done or stopping the devs when they try to do it on their own.
> According to whom? Certainly not the people who did the hiring.
Actually yes, according to them. Maybe they'll say that you should also be concerned about the money but that just makes the business people redundant now doesn't it? So is it better if I clarify and say that the product is your primary concern?
As a developer you have a de facto primary concern with the product. They hire you to... develop. They do not hire you to manage finances, they hire you to manage the product. Doing both is more the job of the engineering manager. But as a developer your expertise is in developing. I don't think this is a crazy viewpoint.
You were hired for your technical skills, not your MBA.
> In this case I think showing suspicious (or even all) invisible Unicode in PRs is even a monetarily valuable feature
I agree. Though I also think this is true for many things that improve the product.
Also note that I'm writing to my audience.
>> but HN is a space for developers, not the business people.
How I communicate with management is different, but I'm exhausted when talking to fellow developers and the first question is about monetary value. That's not the first question on our side of things. Our first question is "is this useful?" or "does this improve the product?" If the answer is "yes" then I am /okay/ talking about monetary value. If it's easy to implement and helps the product, just implement it. If it requires time and the utility is valuable then yes, it helps to formulate an argument about monetary value since management doesn't understand any other language, but between developers that is a rather crazy place to start out (unless the proposal is clearly extremely costly. But then say "I don't think you'd ever convince management" instead of "okay, but what is the 'value' of that feature?"). If I wanted to talk to business people I'd talk to the business people, not another developer...
The switch in text direction has resulted in malicious code injection attacks, as the reversed text becomes invisible. I had to change my compiler to reject those Unicode characters for that reason. It can be used in other cases to have hidden, malicious text.
Have you checked your SQL code for invisible backwards text that injects malware?
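A check for the direction-switching characters could look something like the sketch below. The set used here (the embedding, override and isolate controls U+202A-U+202E and U+2066-U+2069) is my assumption about which characters matter; it is not necessarily the list any particular compiler rejects, and the function name and SQL sample are invented.

```javascript
// Assumed set of bidirectional controls: LRE, RLE, PDF, LRO, RLO
// (U+202A-U+202E) and LRI, RLI, FSI, PDI (U+2066-U+2069).
const BIDI_CONTROLS = /[\u202A-\u202E\u2066-\u2069]/;

// Returns the 1-based line numbers that contain a direction control.
function linesWithBidiControls(source) {
  return source
    .split("\n")
    .map((text, index) => ({ line: index + 1, text }))
    .filter(({ text }) => BIDI_CONTROLS.test(text))
    .map(({ line }) => line);
}

const sql = "SELECT 1;\n-- note \u202E reversed \u202C\nSELECT 2;";
console.log(linesWithBidiControls(sql)); // [ 2 ]
```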
How would that work with Text-To-Speech output?
1. Tell the TTS program that the text is RTOL.
2. If the TTS program can speak Arabic, it can detect RTOL Arabic text.
The only purpose for RTOL English I can think of is to insert hidden text for malicious purposes.
I would say they are arguing that in bad faith, so I wanted to enter a dialogue where they are either forced to agree, or more likely, not respond at all.
Yes. Unicode should not be about semantic meaning, it should be about the visual. Like text in a book.
> And, for example, Greek words containing this letter should be encoded with a mix of Latin and Greek characters?
Yup. Consider a printed book. How can you tell if a letter is a Greek letter or a Latin letter?
Those Unicode homonyms are a solution looking for a problem.
Do you think 1, l and I should be encoded as the same character, or does this logic only extend to characters pesky foreigners use?
And that's where it went off the rails into lala land. 'a' can have all kinds of distinct meanings. How are you going to make that work? It's hopeless.
I can absolutely tell Cyrillic k from the Latin к and Latin u from the Cyrillic и.
>should not be about semantic meaning,
It's always better to be able to preserve more information in a text and not less.
They look visually distinct to me. I don't get your point.
> It's always better to be able to preserve more information in a text and not less.
Text should not lose information by printing it and then OCR'ing it.
And what about the round-trip rule?
And ligatures? Aren't those a semantic distinction?
That's a problem with the fonts.
> And what about the round-trip rule?
Print Unicode on paper, then OCR it, and you'll get different Unicode. Oh, and normalization.
> ligatures
Generally an issue with rendering.
> semantic distinction
Unicode isn't about semantics (or shouldn't be). Consider 'a'. It's used for all kinds of meanings.
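On the normalization point, a small sketch may help show what it does and does not unify. It uses the standard `String.prototype.normalize`; the helper name `sameAfterNormalization` is made up for the example.

```javascript
// Hypothetical helper: compare two strings after Unicode normalization.
function sameAfterNormalization(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

const composed = "\u00E9";    // e with acute as a single code point
const decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

console.log(composed === decomposed);                      // false
console.log(sameAfterNormalization(composed, decomposed)); // true

// Look-alikes from different scripts stay distinct even under NFKC:
// Latin "a" (U+0061) versus Cyrillic "a" (U+0430).
console.log(sameAfterNormalization("a", "\u0430", "NFKC")); // false
```

So normalization covers identical-looking sequences within a script, but not the Latin/Greek/Cyrillic look-alikes being argued about here.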
While at it we could also unify I, | and l. It's too confusing sometimes.
They render differently, so it's not a problem.
> And why do we not anymore make use of it, but instead implemented separate JSON loading functionality in JavaScript?
In other words: I'm asking why the native JSON module was created in JavaScript, if we already had eval.
> Can you think of any reasons beyond performance?
One of the reasons is that the native JSON parser is faster than eval: give some other reason.
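One reason beyond speed can be shown in a few lines: eval runs whatever expression it is given, while JSON.parse accepts only the JSON grammar. The sketch below uses a made-up payload and hypothetical helper names; the global `__demoSideEffect` exists only to make the side effect observable.

```javascript
// Old-style "parsing": wrap the text in parentheses and evaluate it.
function parseWithEval(text) {
  return eval("(" + text + ")");
}

// Strict parsing: anything outside the JSON grammar is rejected.
function parseStrict(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

// Hypothetical payload: a valid JavaScript object literal, but not JSON,
// because one property value is an assignment expression.
const untrusted = '{ "ok": true, "side": (globalThis.__demoSideEffect = "ran") }';

console.log(parseWithEval(untrusted).ok);   // true
console.log(globalThis.__demoSideEffect);   // ran
console.log(parseStrict(untrusted));        // { ok: false, error: 'SyntaxError' }
```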
I very much disagree that you start with money and work backwards to technical problems. I do not think this approach would make you efficient at solving problems nor at increasing profits for the business.
And I still firmly believe they need us more than we need them. At the end of the day this is why they want AI coding agents to work out but I do not think that even in the best situation we'll end up in any different of a situation than COBOL. You can make developers more efficient, but replacing them requires an entirely different set of skills.
An MBA-type, with no programming background, has a better chance getting their photos taken with their iPhone in a museum than they do replacing a developer. I'm sure there will be some successful at it, but exceptions do not define the rule.
> I do not think this approach would make you efficient at solving problems nor at increasing profits for the business.
If optimizing for profit doesn't result in profit, it's not the fault of the goal. That company was just incompetent. However many companies are, in fact, moderately competent, and optimizing for profit works fine for them. It even has a pretty heavy overlap with optimizing for good products, so that's nice.
It's fine. We agree on the ideal outcome in this situation.
Tell me what the problem is and what your proposed solution would be.
a) it's a bullet point
b) a+b means a is a variable
c) apple means a means the sound "aaaah"
d) ape means a means the sound "aye"
e) 0xa means a means "10"
f) "a" on my test paper means I did well on it
g) grade "a" means I bought the good bolts
h) "achtung" means it's a German "a"
I didn't need 8 different Unicode characters. And so on.