An Urgent Notice from AssemblyScript (assemblyscript.org)

76 points by iMuzz 4 years ago | 52 comments

> On August 3rd, the WebAssembly CG will poll on whether JavaScript string semantics/encoding are out of scope of the Interface Types proposal. This decision will likely be backed by Google (C++), Mozilla (Rust) and the Bytecode Alliance (WASI), who appear to have a common interest to exclusively promote C++, Rust respectively non-Web semantics and concepts in WebAssembly.

> If the poll passes, which is likely, AssemblyScript will be severely impacted as the tools it has developed must be deprecated due to unresolvable correctness and security problems the decision imposes upon languages utilizing JavaScript-like 16-bit string semantics and its users.

So, the problem is that AssemblyScript wants to keep using UTF-16? I'm not sure I understand.

Is AssemblyScript the thing that lets you hand-write WebAsm?

andyferris 4 years ago | |

Yes, it seems they want to use UTF-16 strings.

I’m confused why they can’t just switch their (nascent) language to UTF-8, and if so why the alarmist attitude? I didn’t think they were mature enough to claim no breaking changes, for example.

I probably prefer we drag the web (and .Net and Java) platforms towards UTF-8, to be honest… but maybe that’s just me.

kevingadd 4 years ago | | |

Realistically speaking, you can't "switch" AssemblyScript to UTF-8 unless you also decide it only can run in UTF-8 host environments (i.e. not web browsers). Right now it uses UTF-16, which is what the web uses. If you move it over to UTF-8 now every operation that passes strings to web APIs has to perform encoding and decoding, and you end up with a bunch of new performance and correctness issues. It's a very complex migration.
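To make the "encoding and decoding" cost concrete, here is a rough sketch of what every string would go through at a UTF-8-only boundary. `crossUtf8Boundary` is a made-up helper name for illustration, not an AssemblyScript or WebAssembly API; it only uses the standard `TextEncoder`/`TextDecoder` that browsers and Node provide.

```typescript
// Sketch only: `crossUtf8Boundary` is a hypothetical helper, not a real Wasm API.
const encoder = new TextEncoder(); // always emits UTF-8
const decoder = new TextDecoder("utf-8");

function crossUtf8Boundary(s: string): { bytes: Uint8Array; back: string } {
  const bytes = encoder.encode(s);    // pass 1: UTF-16 code units -> UTF-8 bytes
  const back = decoder.decode(bytes); // pass 2: UTF-8 bytes -> UTF-16 code units
  return { bytes, back };
}

// "héllo" plus a globe emoji, written with escapes to avoid editor encoding issues.
const sample = "h\u00e9llo \uD83C\uDF0D";
const result = crossUtf8Boundary(sample);

// 8 UTF-16 code units become 11 UTF-8 bytes, and both passes touch every character.
console.log(sample.length, result.bytes.length, result.back === sample); // 8 11 true
```

A well-formed string survives the round trip, so the cost here is the two extra copies per call rather than data loss.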

P.S. the web will never switch to UTF-8. It would break too many web pages. Most browser vendors won't even accept breaking 0.1% of web pages, unless they're doing it to show you more ads (i.e. Chrome).

trusktr 4 years ago | | |

Because if they did, then interop with JS would require performance-losing conversion any time a string needs to be sent from one side to the other, making Web a secondary and irrelevant target compared to native.

That's not what the web needs. The web needs WebAssembly to work flawlessly with JavaScript for maximal potential, so the web will be great and not just a performance landmine that native developers will laugh (as much) at.

aaron-santos 4 years ago | | |

What should Blazor and TeaVM do when existing code allows for isolated surrogates? If they perform implicit conversions to UTF-8, they can either trap or perform a lossy conversion, which has immense security and data integrity implications.
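A small sketch of those two options, assuming the conversion happens through something like the standard `TextEncoder`. The helper names (`hasIsolatedSurrogate`, `toUtf8OrTrap`, `toUtf8Lossy`) are hypothetical and only illustrate the trade-off.

```typescript
// Matches a high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one. Without the `u` flag the regex sees code units.
const ISOLATED_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

function hasIsolatedSurrogate(s: string): boolean {
  return ISOLATED_SURROGATE.test(s);
}

// Option 1 (hypothetical): trap, i.e. refuse to convert.
function toUtf8OrTrap(s: string): Uint8Array {
  if (hasIsolatedSurrogate(s)) {
    throw new RangeError("isolated surrogate cannot be represented in UTF-8");
  }
  return new TextEncoder().encode(s);
}

// Option 2 (hypothetical): lossy. TextEncoder substitutes U+FFFD for each
// isolated surrogate, so the original code unit is gone after conversion.
function toUtf8Lossy(s: string): Uint8Array {
  return new TextEncoder().encode(s);
}

// Two different strings on the UTF-16 side...
const first: string = "key\uD800";
const second: string = "key\uDBFF";
// ...collapse to identical bytes on the UTF-8 side.
const sameBytes = toUtf8Lossy(first).join() === toUtf8Lossy(second).join();
console.log(first === second, sameBytes); // false true
```

The collision in the last lines is the data-integrity concern: two distinct keys on one side become the same key on the other.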

rryan 4 years ago | |

AIUI, AssemblyScript is a TypeScript-like language that is designed to compile to wasm.

syrusakbary 4 years ago |

I'm not going to enter the discussion regarding UTF-8 vs WTF-16 for representing strings, as I lack the context to determine which one is the right approach if everything has to fit the same model. However, I think an approach that allows multiple serialization/deserialization mechanisms depending on the host/guest language seems like a nice way to move it forward.

If you want to chime in and retrieve more context, here are some relevant issues:

* https://github.com/WebAssembly/interface-types/issues/135

* https://github.com/WebAssembly/interface-types/issues/136

* https://github.com/WebAssembly/design/issues/1419

duped 4 years ago |

Can the authors expound on the reasons why they can't compile their language's string semantics into whatever representation will be used by WASI? Both C++ and Rust support numerous string representations, C++ even more so than Rust.

trusktr 4 years ago | |

How would an end developer write code in one fashion (f.e. `let foo: string = "hello "`) while the compiler makes that work perfectly in every scenario? It would take a high amount of engineering effort compared to having one format that works well in the web to begin with.

How does a compiler ensure that when that string is passed to a Rust Wasm module it goes to it in UTF-8 and then when moments later the same string is passed by the same module to JS it goes over as WTF-16?

How will the compiler know where the string is being passed after compilation (at runtime)?

What new syntax would you propose for TypeScript to make it possible to work with all strings types? How would you keep TS/JS developer ergonomics up to par with what currently exists?

If Interface Types were to consider the web a first-class citizen (because Wasm originated as a web feature), then interop between Wasm modules and JS would be considered of utmost importance, without forcing a web language (such as AssemblyScript) to go to great lengths to engineer around the complication described above.

duped 4 years ago | | |

I don't really understand the questions. Have you looked at any prior art to see how C++ and Rust handle different string representations? C++ is probably the best influence due to type coercion since it sounds like you care about ergonomics over correctness.

For FFI there's nothing a compiler can do. That's why FFI is unsafe and restricted to rudimentary types in most languages - it's up to the caller to ensure the data is laid out as the callee expects.

I also don't know what interface types have to do with anything. Wasm is far lower level than interfaces, and nothing is stopping you from implementing interfaces in your language and doing automatic type conversion through them to handle string representations as required.

Look past the web for a moment - wasm is a competitor with the JVM, GraalVM, and LLVM as a platform and implementation independent byte code. Think about how your language would be implemented on those targets before the web.

conrad-watt 4 years ago |

Full disclosure, I am an active participant in WebAssembly standardisation, my github is here (https://github.com/conrad-watt). What follows is purely my personal opinion.

This announcement is deliberately phrased to scare people who do not have sufficient context. I don't know why some AssemblyScript maintainers have decided to act in this extreme way over what is quite a niche issue. The vote that this announcement is sounding the alarm over is _not_ a vote on whether UTF-16 should be supported.

There has been a longstanding debate as part of the Wasm interface types proposal regarding whether UTF-8 should be privileged as a canonical string representation. Recently, we have moved in the direction of supporting both UTF-8 and UTF-16, although a vote to confirm this is still pending (but I personally believe would pass uncontroversially).

However, JavaScript strings are not always well-formed UTF-16 - in particular some validation is deferred for performance reasons, meaning that strings can contain invalid code points called isolated surrogates. Again, the referenced vote is _not_ a vote on whether UTF-16 should be supported, but is in fact a vote on whether we should require that invalid code points should be sanitised when strings are copied across component boundaries. Some AS maintainers have developed a strong opinion that such sanitisation would somehow be a webcompat/security hazard and have campaigned stridently against it. However sanitising strings in this way is actually a recommended security practice (https://websec.github.io/unicode-security-guide/character-tr...), so they haven't gained the traction they were hoping for with their objections.
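For anyone wondering what "sanitised when strings are copied across component boundaries" might look like in practice, here is a minimal sketch. It assumes the sanitisation replaces each isolated surrogate with U+FFFD; `sanitizeForBoundary` is a hypothetical name, and nothing here is taken from the interface types proposal itself. (Recent JS engines also ship `String.prototype.isWellFormed()` and `toWellFormed()`, which cover the same ground.)

```typescript
// Hypothetical helper: copy a string, replacing isolated surrogates with U+FFFD.
// Well-formed UTF-16 passes through unchanged.
function sanitizeForBoundary(s: string): string {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s.charAt(i) + s.charAt(i + 1); // valid pair: copy both units
        i++;
        continue;
      }
    }
    // Any surrogate that reaches this point is isolated.
    out += isHigh || isLow ? "\uFFFD" : s.charAt(i);
  }
  return out;
}

console.log(sanitizeForBoundary("a\uD800b") === "a\uFFFDb");           // true
console.log(sanitizeForBoundary("\uD83C\uDF0D") === "\uD83C\uDF0D");   // true
```

Strings that are already valid UTF-16 are unaffected; only the malformed code units change.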

The announcement is worded to obscure this point - talking about "JavaScript-like 16-bit string semantics" (i.e. where isolated surrogates are not sanitised) as opposed to merely "UTF-16", which forbids isolated surrogates by definition, but inviting the conflation of the two.

AS does not need to radically alter its string representation - if we were to support UTF-16 with sanitisation, they could simply document that their potentially invalid UTF-16 strings will be sanitised when passed between components. Note that the component model is actually still being specified, so this design choice doesn't even affect any currently existing AS code. I interpret the announcement's threat of radical change as some maintainers holding AS hostage over the (again, very niche) string sanitisation issue, which is frankly pretty poor behaviour.

qalmakka 4 years ago |

This is an unfortunate consequence of the poor choice of keeping UCS-2 alive as UTF-16 for way too long. The plug on 16-bit encodings should have been pulled a long time ago, but some people were and still are so focused on backwards compatibility that they didn't see they were just pushing the issue to another decade. UTF-8 has won, completely. UTF-16 is basically a zombie nobody wants anymore, kept artificially alive by the fear of big 90s frameworks of clean breaks with the past.

We must get rid of legacy encodings no matter the cost, I'm tired of seeing Java and Qt apps wasting millions of CPU cycles mindlessly converting stuff back and forth from UTF-16. It's plain madness, and sometimes you just need the courage to destroy everything and start again.

jiggawatts 4 years ago | |

I love reading stuff like this, because it reminds me that there are two entire universes of IT, and both are mostly filled with people blissfully unaware of the other.

UTF-8 is a great hack that works wonderfully on Linux and BSD, because neither actually supported internationalisation properly until recently. They clung to 8-bit ASCII with white knuckles until they could bear it no longer, but then UTF-8 came to the rescue and there was much rejoicing. "It's the inevitable future!" cried millions of Linux devs... in English. I mention this because UTF-8 is a bit... shit... if you're from Asia.

Meanwhile, in the other universe, UCS-2 or UTF-16 have been around for forever because in that Universe people do things for money and had to take internationalisation seriously. Not just recently, but decades ago. Before some Linux developers were born. In this Universe, an ungodly amount of Real Important Code was written by Big Business and Big Government. The type of code that processes trillions of dollars, not the type used to call MySQL unreliably from some Python ML bullshit running in a container or whatever the kids are doing these days.

So, yes. Clearly UTF-16 has to "die" because it's inconvenient for C developers who never figured out how to deal with strings based on more than one encoding.

PS: There are several Unicode compression formats that blow UTF-8 out of the water if used in the right way. If you can support those, then you can support UTF-16. If you can't, then you can't claim that you chose UTF-8 because you care about performance.

maxgraey 4 years ago | |

Are you sure UTF-8 is the ideal format? After all, we have grapheme clusters that cannot be rendered as text units using UTF-8. Maybe UTF-8 is already obsolete and never took over the world? I am more than sure that soon we will see the new Unicode format ;)

AndrewDucker 4 years ago |

This seems to be the discussion thread related to this.

https://github.com/WebAssembly/interface-types/issues/13

ferdowsi 4 years ago | |

The tone of this discussion makes it seem like there are long-running interpersonal conflicts between discussants. Anyone know the context here?

kevingadd 4 years ago | | |

That dates back to the origins of the WASM spec process. It's always been very combative because the Google side of things was reluctantly shooting Native Client in the head while the Mozilla side had basically done all of the initial heavy lifting to prove their model with asm.js. I would have preferred a more mature process, personally, but the tension didn't really interfere with the actual outcome as far as I could tell; it just made it a bit more stressful. Because WASM sits on top of existing JS runtimes, each JS vendor also had to make compromises in order for it to be possible to implement it across all browsers (the very strange control flow model is a good example of this - some JS engines couldn't handle unconstrained control flow).

The needs of all the different WASM consumers also create tension here. A C# programmer trying to ship a webapp has very different needs from a C programmer trying to run WASM on a Cloudflare edge node, and you can't really satisfy both of them, so you end up having to tell one of them to go take a walk into the sea.

felipellrocha 4 years ago |

Can someone explain the issue at hand? I'm not sure I have enough context to understand the problem

trusktr 4 years ago | |

The upcoming Interface Types spec for WebAssembly was considering not supporting the WTF-16 string format, which means any Wasm modules (for example those written in AssemblyScript, a web-inspired language) passing strings from one side to the other (f.e. from Wasm to JS) will experience two things:

- an extra performance cost due to format conversion at the boundary,

- as well as negative implications on security and data integrity,

thus making this a loss for the web if Interface Types will not be fully compatible with the web (JavaScript) by default.

Hope that sums it up in one sentence. :)

TeaVMFan 4 years ago |

This seems to also impact Java and TeaVM, see this post:

https://groups.google.com/g/teavm/c/gpy0JoKYqbU

trusktr 4 years ago | |

Yep, it will impact any language with WTF-16. Those languages may incur a performance hit if the vote passes to not support "expressive UTF-16", but more notably, there will be data integrity issues that can lead to security issues.

amluto 4 years ago |

Is there a link to the actual poll or its content?

trusktr 4 years ago | |

The poll will happen in this meeting on August 3rd:

https://github.com/WebAssembly/meetings/blob/main/main/2021/...

jalino23 4 years ago |

this sucks. poor web assemblyscript but i really like the rust way

trusktr 4 years ago | |

It is fair to like Rust, but there is nonetheless an influx of web developers who already know JavaScript and TypeScript moving to AssemblyScript to (finally) experience what Wasm is all about. They don't want to move to other languages. I believe their experience should be highly valued, and as optimal as possible.

This influx is the reason why AssemblyScript is now in the top three WebAssembly languages next to C++ and Rust (https://blog.scottlogic.com/2021/06/21/state-of-wasm.html) and should not be taken lightly.

There is a huge opportunity here to build an optimal foundation for these incoming developers, so that they won't be let down.

The influx has only just begun.

Ideally though, interface types would give languages options: the ability to choose which format their boundary will use. Obviously a JS host and a language like AssemblyScript would align on WTF-16, while a Rust Wasm module running on a Rust-powered Wasm runtime like wasmtime could optimally choose UTF-8.

I'm hoping things will be designed with flexibility in mind for this upcoming most-generic runtime feature.

greendream17 4 years ago | |

Yea, my first contact with wasm was through AS but I switched to rust soon after, it just doesn't have the same momentum.

trusktr 4 years ago | | |

I think you're mistaken there: AssemblyScript is much newer, and its momentum is only just getting started. AssemblyScript is one of the top three most desired languages for WebAssembly now: https://blog.scottlogic.com/2021/06/21/state-of-wasm.html

In the past year it has gained numerous libraries and bindings, including from Surma from Google. Stay tuned...

xvilka 4 years ago |

UTF-16 was always a mistake[1]. Good riddance. Time to get it out of LSP specification[2] as well.

[1] http://utf8everywhere.org/

[2] https://github.com/microsoft/language-server-protocol/issues...

maxgraey 4 years ago | |

So basically, even in UTF-8 you can create a malformed string. For example, ðŸŒ is missing one byte and may cause problems in editors and text viewers that don't handle or pre-verify such cases. Valid UTF-8 has a specific binary format. A single-byte UTF-8 character always has the form '0xxxxxxx', where 'x' is any binary digit. A two-byte character always has the form '110xxxxx 10xxxxxx'. Similarly, three- and four-byte characters start with '1110xxxx' and '11110xxx' respectively, followed by one '10xxxxxx' byte for each remaining byte in the sequence.

So it's not just UTF-16 that has these problems and can cause security issues. I just wanted to emphasize that.
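The byte patterns described above can be checked mechanically. The sketch below is only a shape check on lead and continuation bytes: it does not reject overlong encodings, encoded surrogates, or code points above U+10FFFF, which a full validator would also need to do. The function names and the sample bytes (a four-byte sequence cut to three) are assumptions for illustration.

```typescript
// Expected sequence length for a lead byte, or 0 if it cannot start a sequence.
function sequenceLength(lead: number): number {
  if ((lead & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((lead & 0xe0) === 0xc0) return 2; // 110xxxxx
  if ((lead & 0xf0) === 0xe0) return 3; // 1110xxxx
  if ((lead & 0xf8) === 0xf0) return 4; // 11110xxx
  return 0;                             // 10xxxxxx or 11111xxx
}

// Shape check only: every lead byte is followed by the right number of
// 10xxxxxx continuation bytes.
function hasValidUtf8Shape(bytes: number[]): boolean {
  let i = 0;
  while (i < bytes.length) {
    const len = sequenceLength(bytes[i]);
    if (len === 0 || i + len > bytes.length) return false; // bad lead or truncated
    for (let k = 1; k < len; k++) {
      if ((bytes[i + k] & 0xc0) !== 0x80) return false;    // not a continuation byte
    }
    i += len;
  }
  return true;
}

console.log(hasValidUtf8Shape([0xf0, 0x9f, 0x8c, 0x8d])); // true  (complete 4-byte sequence)
console.log(hasValidUtf8Shape([0xf0, 0x9f, 0x8c]));       // false (one byte missing)
```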

lokedhs 4 years ago | | |

The point is that UTF-16 is the worst of both worlds. It's not ASCII compatible like UTF-8, but it still has the disadvantage of being a variable-length encoding.

Every problem that UTF-8 has, it shares with UTF-16. It also shares every problem with UTF-32.
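Both halves of that claim are easy to see from JavaScript, whose strings are sequences of 16-bit code units. This is just a sketch; `countCodePoints` and `utf16leBytes` are hypothetical helper names.

```typescript
// Count code points by treating each valid surrogate pair as one character.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      i++; // skip the low half of the pair
    }
    count++;
  }
  return count;
}

// Serialize code units as little-endian bytes, as UTF-16LE would store them.
function utf16leBytes(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    out.push(unit & 0xff, unit >> 8); // low byte first
  }
  return out;
}

const globe = "\uD83C\uDF0D"; // one emoji, stored as a surrogate pair
console.log(globe.length, countCodePoints(globe)); // 2 1  (variable length)
console.log(utf16leBytes("A"));                    // [ 65, 0 ]  (not the single ASCII byte 0x41)
```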

kevingadd 4 years ago | |

The fact that UTF-16 is bad doesn't mean you should necessarily get rid of it. We keep all sorts of bad stuff around, like C strings (and C).

You can certainly decide you don't care about any existing code, and that anyone using UTF-16 based platforms (Windows, .NET, Java, JavaScript) should get a bad experience, but I don't think the case for that is as obvious as you believe it is.