Glassworm is back: A new wave of invisible Unicode attacks hits repositories (aikido.dev)

303 points by robinhouston 109 days ago | 193 comments

btown 109 days ago |

IMO while the bar is high to say "it's the responsibility of the repository operator itself to guard against a certain class of attack" - I think this qualifies. The same way GitHub provides Secret Scanning [0], it should alert upon spans of zero-width characters that are not used in a linguistically standard way (don't need an LLM for this, just n-tuples).

Sure, third-party services like the OP can provide bots that can scan. But if you create an ecosystem in which PRs can be submitted by threat actors, part of your commitment to the community should be to provide visibility into attacks that cannot be seen by the naked eye, and make that protection the norm rather than the exception.

[0] https://docs.github.com/en/get-started/learning-about-github...
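A minimal sketch of the "spans of zero-width characters" check described above, in JavaScript. The function name `findInvisibleRuns`, the `minRun` threshold, and the character set are assumptions made for illustration; the set is not exhaustive and this is not how GitHub or any existing scanner works. The idea is that a single ZWJ inside an emoji sequence is normal, while a long uninterrupted run of invisible code points is not.

```javascript
// Illustrative (not exhaustive) set of invisible code points:
// zero-width space/joiners, word joiner, BOM, variation selectors, tag characters.
const INVISIBLE_CLASS =
  "\\u200B-\\u200D\\u2060\\uFEFF\\uFE00-\\uFE0F\\u{E0100}-\\u{E01EF}\\u{E0000}-\\u{E007F}";

// Hypothetical helper: report every run of at least `minRun` invisible code points.
function findInvisibleRuns(text, minRun = 4) {
  const pattern = new RegExp(`[${INVISIBLE_CLASS}]{${minRun},}`, "gu");
  const runs = [];
  for (const match of text.matchAll(pattern)) {
    runs.push({
      offset: match.index,              // UTF-16 offset where the run starts
      codePoints: [...match[0]].length, // run length counted in code points
    });
  }
  return runs;
}
```

With the default threshold, an emoji such as a ZWJ family sequence produces no findings, while a string literal stuffed with dozens of invisible code points does.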

andrewflnr 109 days ago | |

Regardless of the thorny question of whether it's GitHub's responsibility, it sure would be a good thing for them to do ASAP.

godelski 109 days ago | | |

Here's the big reason GitHub should do it:

  It makes the product better

I know people love to talk money and costs and "value", but HN is a space for developers, not the business people. Our primary concern, as developers, is to make the product better. The business people need us to make the product better, keep the company growing, and beat out the competition. We need them to keep us from fixating on things that are useful but low priority and ensuring we keep having money. The contention between us is good, it keeps balance. It even ensures things keep getting better even if an effective monopoly forms as they still need us, the developers, to make the company continue growing (look at monopolies people aren't angry at and how they're different). And they need us more than we need them.

So I'd argue it's the responsibility of the developers, hired by GitHub, to create this feature because it makes the product better. Because that's the thing you've been hired for: to make the product better. Your concern isn't about the money, your concern is about the product. That's what you're hired for.

jacquesm 109 days ago | | |

It absolutely is. They are simply spreading malware. You can't claim to be a 'dumb pipe' when your whole reason for existence is to make something people deemed 'too complex' simple enough for others to use. At that point you have an immediate responsibility not only to reduce complexity but also to ensure safety. Dumbing stuff down comes with a duty of care.

zzo38computer 109 days ago | |

I think a "force visible ASCII for files whose names match a specific pattern" mode would be a simple thing to help. (You might be able to use the "encoding" attribute in the .gitattributes file for this, although I don't know if this would cause errors or warnings to be reported, and it might depend on the implementation.)

OJFord 108 days ago | |

They advertise that they do do it, they just don't/it doesn't work.

See commenter on their 2025 bounty for reporting it, won't-fix resolution: https://news.ycombinator.com/item?id=47393393

iririririr 109 days ago | |

Especially because it's literally a problem with their code viewer (and vscode, which is also theirs).

i see squares on a properly configured vim on xterm.

RVuRnvbM2e 108 days ago | |

Vigilant mode exists, and would have flagged the malicious commit as unverified in this case. Maybe it should be the default.

https://docs.github.com/en/authentication/managing-commit-si...

athrowaway3z 108 days ago | |

For some reason I was under the impression this was already the default.

I first heard about the possibility of this kind of attack >10 years ago, and I'll sometimes do an xxd if I'm feeling a bit paranoid.

ocornut 109 days ago |

It baffles me that any maintainer would merge code like the one highlighted in the issue, without knowing what it does. That’s regardless of being or not being able to see the “invisible” characters. There’s a transforming function here and an eval() call.

The mere fact that a software maintainer would merge code without knowing what it does says more about the terrible state of software.

dspillett 109 days ago | |

> It baffles me that any maintainer would merge code like the one highlighted in the issue, without knowing what it does.

I don't know if it is relevant in any specific case being discussed here, but if the exploit route is via gaining access to the accounts of previously trusted submitters (or otherwise being able to impersonate them), it could be a case of a team with a pile of PRs to review (many of which are the sloppy, unverified LLM output that is causing a problem for some popular projects) letting through an update from a trusted source that has been compromised.

It could correctly be argued that this is a problem caused by laziness and corner cutting, but it is still understandable because projects that are essentially run by a volunteer workforce have limited time resources available.

mmlb 109 days ago | |

In this instance the PR that was merged was from 6 years ago and was clean: https://github.com/pedronauck/reworm/pull/28. Looks to me like a force push overwrote the original commit, since the commit that now exists in history was made 6 years later.

globular-toast 108 days ago | | |

So who force pushed and why?

pdonis 109 days ago | |

Wish I could upvote this more.

mmsc 109 days ago |

GitHub advertises itself as warning about those Unicode characters: https://github.blog/changelog/2025-05-01-github-now-provides...

Of course, it doesn't work though. I reported this to their bug bounty, they paid me a bounty, and told me "we won't be fixing it": https://joshua.hu/2025-bug-bounty-stories-fail#githubs-utf-f...

The exact quote is "Thanks for the submission! We have reviewed your report and validated your findings. After internally assessing your report based on factors including the complexity of successfully exploiting the vulnerability, the potential data and information exposure, as well as the systems and users that would be impacted, we have determined that they do not present a significant security risk to be eligible under our rewards structure." The funny thing is, they actually gave me $500 and a lifetime GitHub Pro for the submission.

OJFord 108 days ago | |

That's bizarre. They won't be fixing it, and yet the changelog post is unretracted.

user_7832 107 days ago | |

Tangential, but that's quite interesting, I had no idea you could get GitHub Pro for life, and certainly not through something as "accessible" as bug bounties.

zzo38computer 109 days ago |

I use non-Unicode mode in the terminal emulator (and text editors, etc.), I use a non-Unicode locale, and I always use ASCII for most kinds of source code files (mainly C); in some cases another character set such as the PC character set is used, but usually it is ASCII. Doing this mitigates much of this when maintaining your own software. I am apparently not the only one; I have seen others suggest similar things. (If you need non-ASCII text, e.g. for documentation, you might store it in separate files instead. If you only need a small number of non-ASCII characters in a few string literals, then you might use \x escapes; add comments if necessary to explain them.)

The article is about JavaScript, although the same applies to other programming languages as well. However, even in JavaScript, you can use \u escapes in place of the non-ASCII characters. (One of my ideas for a programming language intended as a better alternative to C is that it forces visible ASCII (and a few control characters, with some restrictions on their use), unless you specify by a directive or switch that you want to allow non-ASCII bytes.)
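To illustrate the \u escape point, here is a small sketch that rewrites the body of a JavaScript string literal so that every non-ASCII code point becomes a visible `\u{...}` escape. The name `escapeNonAscii` is hypothetical, and the sketch assumes its input is text destined for a string or template literal (ES2015 or later), not a whole source file.

```javascript
// Hypothetical helper: replace every code point outside tab, newline and
// printable ASCII (0x20-0x7E) with a \u{...} escape.
function escapeNonAscii(literalBody) {
  let out = "";
  for (const ch of literalBody) {          // iterates by code point, not UTF-16 unit
    const cp = ch.codePointAt(0);
    const keep = cp === 0x09 || cp === 0x0a || (cp >= 0x20 && cp <= 0x7e);
    out += keep ? ch : `\\u{${cp.toString(16).toUpperCase()}}`;
  }
  return out;
}
```

After this transformation a zero-width space shows up in the diff as the six visible characters `\u{200B}`, so a reviewer can no longer miss it.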

TacticalCoder 109 days ago | |

> ... and will always use ASCII for most kind of source code files

Same. And I enforce it. I've got scripts and hooks that enforce that source files only ever contain a subset of ASCII (not even all ASCII codes have their place in source code).

Unicode strings are perfectly fine in resource files. You can build perfectly good i18n/l10n apps and webapps without ever using a single Unicode character in a source file. And if you really do need one, there's indeed ASCII escaping available in many languages.

Some shall complain that their name as "Author: ..." in comments cannot be written properly in ASCII. If I wanted to be facetious I'd say that soon we'll see:

    # Author: Claude Opus 27.2

and so the point shall be moot anyway.
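For reference, an ASCII-subset check of the kind such a hook might run can be sketched in a few lines of JavaScript. This is not the commenter's actual script; the function name `findNonAsciiViolations` and the allowed set (tab plus printable ASCII, with LF or CRLF line endings) are assumptions chosen for illustration.

```javascript
// Hypothetical hook helper: list every character that is not tab or
// printable ASCII (0x20-0x7E). Line endings (LF or CRLF) are consumed by the split.
function findNonAsciiViolations(text) {
  const violations = [];
  text.split(/\r?\n/).forEach((line, lineIndex) => {
    let column = 0;
    for (const ch of line) {               // one step per code point
      column += 1;
      const cp = ch.codePointAt(0);
      const allowed = cp === 0x09 || (cp >= 0x20 && cp <= 0x7e);
      if (!allowed) {
        violations.push({
          line: lineIndex + 1,             // 1-based line number
          column,                          // 1-based column, in code points
          codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
        });
      }
    }
  });
  return violations;
}
```

A pre-commit hook could run this over each staged source file and reject the commit when the returned list is non-empty, while leaving resource files out of the check.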

amake 108 days ago | |

That’s great for you. Isn’t feasible for software development by teams that are native in a language with a non-Latin script.

1718627440 107 days ago | | |

Do you write the code itself in a language other than English? Localizations typically are in different files.

userbinator 109 days ago | |

CP437 forever!

The biggest use of Unicode in source repos now might be LLM slop, so I certainly don't miss its absence at all.

nstart 108 days ago |

I don't quite understand how this is working tbh. I looked at one of the affected repos, ironically named "reworm".

The malicious code was introduced in this commit - https://github.com/pedronauck/reworm/commit/d50cd8c8966893c6...

It says coauthored by dependabot and refers to a PR opened in 2020 (https://github.com/pedronauck/reworm/pull/28).

That PR itself was merged in 2020 here - https://github.com/pedronauck/reworm/commit/df8c1803c519f599...

But the commit with the worm (d50cd8c), re-introduces the same change from df8c180 to the file `yarn.lock`.

And when you look at the history of yarn.lock inside of github, all references to the original version bump (df8c180) are gone...? In fact if you look at the overall commit history, the clean df8c180 commit does not exist.

I'm struggling to understand what kind of shenanigans happened here exactly.

vitus 109 days ago |

Looks like the repo owner force-pushed a bad commit to replace an existing one. But then, why not forge it to maintain the existing timestamp + author, e.g. via `git commit --amend -C df8c18`?

Innocuous PR (but do note the line about "pedronauck pushed a commit that referenced this pull request last week"): https://github.com/pedronauck/reworm/pull/28

Original commit: https://github.com/pedronauck/reworm/commit/df8c18

Amended commit: https://github.com/pedronauck/reworm/commit/d50cd8

Either way, pretty clear sign that the owner's creds (and possibly an entire machine) are compromised.

chrismorgan 109 days ago | |

The value of the technique, I suppose, is that it hides a large payload a bit better. The part you can see stinks (a bunch of magic numbers and eval), but I suppose it’s still easier to overlook than a 9000-character line of hexadecimal (if still encoded or even decoded but still encrypted) or stuff mentioning Solana and Russian timezones (I just decoded and decrypted the payload out of curiosity).

But really, it still has to be injected after the fact. Even the most superficial code review should catch it.

vitus 109 days ago | | |

Agreed on all those fronts. I'm just dismayed by all the comments suggesting that maintainers just merged PRs with this trojan, when the attack vector implies a more mundane form of credential compromise (and not, as the article implies, AI being used to sneak malicious changes past code review at scale).

minus7 109 days ago |

The `eval` alone should be enough of a red flag

gnabgib 109 days ago |

Small discussion yesterday (9+9 points, 9+4 comments) https://news.ycombinator.com/item?id=47374479 https://news.ycombinator.com/item?id=47385244

bawolff 109 days ago |

I feel like the threat of this type of thing is really overstated.

Sure the payload is invisible (although tbh im surprised it is. PUA characters usually show up as boxes with hexcodes for me), but the part where you put an "empty" string through eval isn't.

If you are not reviewing your code enough to notice something as nonsensical as calling eval() on an empty string, would you really notice the non-obfuscated payload either?

loumf 109 days ago | |

The threat is that you depend on this library or use the VS Code Extension.

Arrowmaster 108 days ago | |

Honestly I was expecting more. There are many languages that support Unicode in variable or function names and I expected it to be used there.

It sounds like Python only allows approved Unicode characters to start a variable name but if it allowed any you could do something like `nonprintable = lambda x: insert exploit code here`. If that was hidden in what looked like a blank line between other additions would you catch it?

I'm sure there's some other language out there that has similar syntax and lax Unicode rules this could be used in.

The solution is that this and many other Unicode formatting characters should be ignored and converted to a visible indicator in all code views when you expect plain text.
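A rough sketch of such a "visible indicator" pass, in JavaScript. The name `revealInvisible`, the `[U+XXXX]` marker format, and the exact character classes (format characters, private use characters, and the two variation selector blocks) are assumptions for illustration, not what any particular code viewer does.

```javascript
// Format characters (Cf), private use characters (Co), and variation selectors.
// Variation selectors are category Mn, so they are listed by explicit range.
const HIDDEN_CHARS = /[\p{Cf}\p{Co}\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu;

// Hypothetical helper: replace each hidden code point with a visible marker.
function revealInvisible(text) {
  return text.replace(HIDDEN_CHARS, (ch) => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    return `[U+${hex}]`;
  });
}
```

Because the class covers both format and private use characters, the same pass handles zero-width joiners, bidi controls, and a payload encoded in private use or variation selector code points.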

bawolff 107 days ago | | |

> The solution is that this and many other Unicode formatting characters

This isn't about formatting characters, this is about private use characters.

tolciho 109 days ago |

Attacks employing invisible characters are not a new thing. Prior efforts here include terminal escape sequences, possibly hidden with CSS that if blindly copied and pasted would execute who knows what if the particular terminal allowed escape sequences to do too much (a common feature of featuritis) or the terminal had errors in its invisible character parsing code.

For data or code hiding the Acme::Bleach Perl module is an old example though by no means the oldest example of such. This is largely irrelevant given how relevant not learning from history is for most.

Invisible characters may also cause hard to debug issues, such as lpr(1) not working for a user, who turned out to have a control character hiding in their .cshrc. Such things as hex viewers and OCD levels of attention to detail are suggested.

DropDead 109 days ago |

Why didn't someone make an AV rule to find stuff like this? They are just plain text files.

nine_k 109 days ago | |

The rule must be very simple: any occurrence of `eval()` should be a BIG RED FLAG. It should be handled like a live bomb, which it is.

Then, any appearance of unprintable characters should also be flagged. There are rather few legitimate uses of some zero-width characters, like ZWJ in emoji composition. Ideally all such characters should be inserted as \uNNNN escape sequences, and not literal characters.

Simple lint rules would suffice for that, with zero AI involvement.
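As a sketch of how simple those two rules can be, the following JavaScript scans source text line by line. The function name `lintSource` and the rule names are made up for this example, and it is purely text-based: it will not catch indirect ways of reaching `Function`, so treat it as a first filter rather than a complete check.

```javascript
// Rule 1: direct dynamic evaluation (eval(...) or new Function(...)).
const EVAL_CALL = /\beval\s*\(|\bnew\s+Function\s*\(/;
// Rule 2: non-printing characters (format, private use, variation selectors).
const NON_PRINTING = /[\p{Cf}\p{Co}\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/u;

// Hypothetical linter: returns one finding per rule per offending line.
function lintSource(text) {
  const findings = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (EVAL_CALL.test(line)) {
      findings.push({ line: index + 1, rule: "dynamic-eval" });
    }
    if (NON_PRINTING.test(line)) {
      findings.push({ line: index + 1, rule: "non-printing-character" });
    }
  });
  return findings;
}
```

A line that both calls `eval` and contains hidden characters produces two findings, which is exactly the combination seen in the payloads discussed here.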

hamburglar 109 days ago | | |

I think there’s debate (which I don’t want to participate in) over whether or not invisible characters have their uses in Unicode. But I hope we can all agree that invisible characters have no business in code, and banishing them is reasonable.

WalterBright 109 days ago | | |

> There are rather few legitimate uses of some zero-width characters, like ZWJ in emoji composition.

Emojis are another abomination that should be removed from Unicode. If you want pictures, use a gif.

trollbridge 109 days ago | | |

In our repos, we have some basic stuff like ruff that runs, and that includes a hard error on any Unicode characters. We mostly did this after some un-fun times when byte order marks somehow ended up in a file and it made something fail.

I have considered allowing a short list that does not include emojis, joining characters, and so on (basically just currency symbols, accent marks, and everything else you'd find in CP-1252), but never got around to it.

abound 109 days ago | |

Yeah it would have been nice to end with "and here's a five-line shell script to check if your project is likely affected". But to their credit, they do have an open-source tool [1], I'm just not willing to install a big blob of JavaScript to look for vulns in my other big blobs of JavaScript

[1] https://github.com/AikidoSec/safe-chain

nine_k 109 days ago | | |

Something like this should work, assuming your encoding is Unicode (normally UTF-8), which grep would interpret:

  grep -P '[\x{200B}\x{200C}\x{200D}\x{FEFF}]' code.ts

See https://stackoverflow.com/q/78129129/223424

codechicago277 109 days ago |

I wonder if this could be used for prompt injection: if you copy and paste the seemingly empty string into an LLM, does it understand it? Maybe the affected Unicode characters aren’t tokenized.

ancillary 108 days ago | |

There's at least one paper (though pretty recent) about it: https://arxiv.org/html/2603.00164v1

jibal 108 days ago | |

Yes, and that happens.

like_any_other 109 days ago |

Invisible characters, lookalike characters, reversing text order attacks [1].. the only way to use unicode safely seems to be by whitelisting a small subset of it.

And please, everyone arguing the code snippet should never have passed review - do you honestly believe this is the only kind of attack that can exploit invisible characters?

[1] https://attack.mitre.org/techniques/T1036/002/

NoMoreNicksLeft 109 days ago |

Why can't code editors have a default-on feature where they show any invisible character (other than newlines)? I seem to remember Sublime doing this at least in some cases... the characters were rendered as a lozenge shape with the hex value of the character.

Is there ever a circumstance where the invisible characters are both legitimate and you as a software developer wouldn't want to see them in the source code?

ted_dunning 109 days ago | |

Check out emacs for options like this.

And, yes, there is such a circumstance if you want to include Arabic or Hebrew in comments or strings. You need the zero-width left-to-right and right-to-left marks to make that work.

retropragma 108 days ago |

If anyone's curious what the malware does: https://pastebin.com/raw/KiuwueMU

Looks like it's pilfering Solana wallets that don't belong to Russians.

herpdyderp 109 days ago |

I keep seeing this and wondering if the ESLint default rules against weird characters would catch this? But I can’t figure out how to check.

CGamesPlay 109 days ago | |

Appears not to. https://claude.ai/share/ac070cf5-0034-4f3c-9a8c-1c43a58eea36

Claude’s analysis seems solid here based on reading the snippets it tested.

A purpose-built linter could be cross-language, it’s pretty reasonable to blanket ban these characters entirely, or at least allowlist them.

mhitza 109 days ago |

Their button animations almost "crash" Firefox mobile. As soon as I reach them the entire page scrolls at single digit FPS.

P-MATRIX 108 days ago |

This gets a lot worse when a coding agent is in the loop. A human at least has a review step—an autonomous agent that reads a Glassworm-infected file just acts on it. The fix probably needs to happen at the tool result layer, before the payload ever enters the agent's context, not just on what the agent writes out.
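One way to picture a filter at the tool result layer is the JavaScript sketch below. The function name `sanitizeToolResult` and the return shape are hypothetical and not part of any agent framework; it also assumes that dropping all format, private use, and variation selector code points is acceptable, which would break ZWJ emoji and bidi marks in legitimate text.

```javascript
// Code points to strip before text reaches the agent's context (assumed set).
const HIDDEN = /[\p{Cf}\p{Co}\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu;

// Hypothetical filter: returns the cleaned text plus a count of removed
// code points, so the caller can log or refuse suspicious tool output.
function sanitizeToolResult(text) {
  let removed = 0;
  const clean = text.replace(HIDDEN, () => {
    removed += 1;
    return "";
  });
  return { text: clean, removed };
}
```

Returning the count rather than silently cleaning lets the caller decide on a policy, for example refusing to act on a file when `removed` exceeds some threshold.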

WalterBright 109 days ago |

Unicode should be for visible characters. Invisible characters are an abomination. So are ways to hide text by using Unicode so-called "characters" to cause the cursor to go backwards.

Things that vanish on a printout should not be in Unicode.

Remove them from Unicode.

faangguyindia 109 days ago |

Back in the day I was on hacking forums where a lot of script kiddies used to write malicious code.

Now that they have LLMs, I wonder whether people are using them to write new kinds of malicious code, more sophisticated than before?

Yokohiii 109 days ago | |

In this case LLMs were obviously used to dress the code up as more legitimate, adding more human or project relevant noise. It's social engineering, but you leave the tedious bits to an LLM. The sophisticated part is the obscurity in the whole process, not the code.

rvnx 109 days ago |

This shows the failure of human review alone; an LLM-based reviewer would have caught it. The two approaches are complementary.

rhysfonixone 108 days ago | |

Exactly this. I think a hybrid approach is going to be mandatory before long, if it's not already. A well-prompted frontier-lab LLM would catch things like this easily.

hananova 109 days ago |

My hot take is that all programming languages should go back to only accepting source code saved in 7-bit ASCII. With perhaps an exception for comments.

krior 109 days ago | |

Yeah, fuck those non-english-speaking peasants /s.

hananova 108 days ago | | |

I'm a non-english-speaking peasant. I code in English because it's the lingua franca of coding, and because its characters are the only ones you can reliably use everywhere.

Besides, that's why the ban only extends to syntax and string literals (use escapes instead), and not comments.

From my experience, the only two nationalities that insist on mixing their native languages with the mostly English syntax of programming languages are the French and the Japanese. And they can just suck it up for the other 8 billion of us.

chairmansteve 109 days ago |

eval() used to be evil....

Are people using eval() in production code?

max_ 109 days ago |

I don't have to worry about any of this.

My clawbot & other AI agents already have this figured out.

  // prelude 1, obfuscate the constructor property name to avoid raising simple analyser alarms
  const prefix = "construction".substring(0,7);
  const suffix = "tractor".substring(3);
  const obfuscatedConstructorName = prefix + suffix; // innocent looking, but we have the indexing name.

  // prelude 2, get the Function class by indexing a function object with our constructor property name (that does not show up in source-code)
  const existingFunction = ()=>"nothing here";
  const InnocentLookingClass = existingFunction[obfuscatedConstructorName];

  // payload decoding elsewhere (this is where we decode our nasty source)
  const nastyPayloadDisguisedAsData = "console.log('sourced string that could be malicious')";

  // Unrelated location where payload gets executed
  const hardToMissFun = new InnocentLookingClass(nastyPayloadDisguisedAsData);
  hardToMissFun(); // when this function is run somewhere.. the nasty things happen.