Escaping user input is ridonkulously hard (codeofhonor.substack.com)

63 points by yimby 3 years ago | 86 comments

teddyh 3 years ago |

An alternate view: “string” is not a granular enough type, just like “bitfield” is not a type. Firstly, a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4), and you absolutely need to know which it is. But let’s assume that you’ve been a diligent programmer and filtered all that at the edges, and now have a sequence of Unicode code points (or possibly graphemes). You still need to know the escaped-ness of the string! This is also a form of typing. Perl was early with its concept of “tainted” strings, but modern languages can use types to mark this concept in the code. At all points in your code, you should be sure what type the value you have is. If you need to use the types in your language to ensure this, then use types. But make sure of it somehow.

tialaramex 3 years ago | |

> Firstly, a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4), and you absolutely need to know which it is.

This is a language defect. If your language was invented in the 1960s it's an understandable defect, but it's still a defect. I do not want to write computer software with strings in a language that doesn't even have an actual string type rather than "Eh, maybe this is a string or maybe it's just some random bytes, who cares".

Only in very low level software should it make a difference whether the string is in fact represented as UTF-8 or UTF-16 or whatever, but Rust shows that you can write software at a low level and still enforce type safety for strings.

I agree though that here once again the Right Thing™ is a strong type system. If I've got a Microsoft Graph username, a URL, an email address and a UUID, that's four types, those are not four strings with human names to distinguish them. We don't need to escape some or any of these types - in their context.

FreezerburnV 3 years ago | | |

A type system isn't going to save you from users submitting all kinds of different encodings. It also depends on what kind of user input is being handled: Is it from OS-provided UI? Is it being sent to a service accessible on the internet? Is it from a CLI? Is it from a file? Context determines the space of data you might be operating on. Where you control the input, you can know the encoding; where you don't, as when reading an arbitrary file, you have to detect it or be told (correctly). All of that is external to the type system and has to happen before you can tag the data with the correct type. Some languages might attempt to detect this for you, but that could itself be considered a language defect, since it's hard to tell what a string contains without other input saying so, such as a header in an HTTP request declaring that it's UTF-8.
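A rough sketch of the "be told, correctly" case, in Python: trust the charset declared in a Content-Type header, decode strictly, and refuse to guess when nothing is declared. The function name `decode_body` is made up for illustration; the header parsing uses the standard library's `email.message.Message`.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str) -> str:
    """Decode request bytes using the charset the sender declared."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased name, or None if absent
    if charset is None:
        raise ValueError("no charset declared; refusing to guess")
    # Strict decoding: raises UnicodeDecodeError if the bytes do not match
    # the declared encoding, and LookupError if the charset is unknown.
    return body.decode(charset)
```

Only after this step succeeds does it make sense to tag the value as text; if the sender lied about the encoding, the decode fails instead of producing garbage.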

atoav 3 years ago | | |

The way Rust does it is IMO interesting. There is, e.g., an OsStr type for strings such as the filenames in a directory listing, because these might not be valid UTF-8 but your program might still need to be able to handle them.

So when you wanna convert that OsStr to a String you are forced to handle this in one way or another. This is less comfortable, but describes the underlying systems more accurately.
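For comparison only (this is Python, not the Rust API described above): Python's closest analogue is the `surrogateescape` error handler, which `os.fsdecode` relies on for filenames on POSIX systems. A small sketch with a made-up filename:

```python
raw_name = b"caf\xe9.txt"  # Latin-1 bytes, not valid UTF-8

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape carries the bad byte through as a lone surrogate...
name = raw_name.decode("utf-8", errors="surrogateescape")

# ...so the value round-trips back to the original bytes,
round_trip = name.encode("utf-8", errors="surrogateescape")

# but it cannot be encoded as real UTF-8 without an explicit decision.
try:
    name.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

The difference from Rust is that Python keeps this inside the ordinary `str` type, so nothing forces you to handle the bad case until an encode fails.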

simiones 3 years ago | |

There is no such thing as an "escaped string". Escaping is not a general concept, it is something that differs given the intended destination of that string. For example, "I am a %SYSTEM% person" is a perfectly fine escaped bash string, but an unsafe CMD string; it is also fine as a C# format string, but potentially unsafe as a C format string, depending on your actual implementation of printf; it is an escaped MSSQL filter string, but not an escaped PostgreSQL filter string.
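To make the "depends on the destination" point concrete, here is a small Python sketch that escapes one input for three different sinks using only standard-library functions. The sample text is made up.

```python
import html
import shlex
from urllib.parse import quote

text = "Tom & Jerry's <cat>"

as_html = html.escape(text)    # Tom &amp; Jerry&#x27;s &lt;cat&gt;
as_shell = shlex.quote(text)   # 'Tom & Jerry'"'"'s <cat>'
as_url = quote(text, safe="")  # Tom%20%26%20Jerry%27s%20%3Ccat%3E
```

Three destinations, three different results, and none of them is safe in either of the other two contexts.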

Also, not all strings/texts should be thought of as Unicode code points/graphemes.

mjw1007 3 years ago | | |

Sure, you need a different type for each different form of escaping you want to track, but that doesn't make the idea unworkable.

A type that says (say) "this is a string containing html PCDATA" is a useful thing to have.
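A minimal sketch of what such a type could look like in Python. The names `HtmlContent` and `paragraph` are hypothetical, and Python only checks this at runtime (or via an external type checker), so treat it as an illustration of the idea rather than a hardened design.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class HtmlContent:
    """Text that is already safe to emit as HTML element content."""
    value: str

    @classmethod
    def from_plain(cls, plain: str) -> "HtmlContent":
        # quote=False: character data only needs &, < and > escaped.
        return cls(html.escape(plain, quote=False))


def paragraph(body: HtmlContent) -> HtmlContent:
    """Wrap content in <p>; a plain str is rejected, not silently trusted."""
    if not isinstance(body, HtmlContent):
        raise TypeError("expected HtmlContent, got " + type(body).__name__)
    return HtmlContent("<p>" + body.value + "</p>")
```

The only ways to get an `HtmlContent` are to escape plain text or to construct one deliberately, which makes the "I vouch for this markup" decision visible in the code.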

teddyh 3 years ago | | |

(Just for the record, I agree completely, and nothing of what I wrote should be construed as contradicting any of that.)

bcrosby95 3 years ago | |

Escape strings at the last possible moment, and ideally it's done by whatever library you're using so you never have to worry about it. It has always been clear to me in our codebases whether I'm dealing with a raw string or a safe one. They're all unsafe, because you have no clue what context they're going to be used in.

If you're writing a web framework or a DB library things might be different though - in that case a different class probably makes sense. If you have a module for a certain communication medium, then yeah you might use it in that module. But if you're writing a webapp, passing around escaped strings is a bad idea 99% of the time. It creates code highly coupled to one aspect of your system.

Just imagine if you did this with networking. I'm glad we're not in a world where we're passing around TCPString or UDPString or IPString or EthernetString or TokenRingString or CarrierPigeonString because that happens to be a networking stack the app uses sometimes. It sounds like hell.

masklinn 3 years ago | | |

> They're all unsafe, because you have no clue what context they're going to be used in.

That's correct, but it argues for the reverse of escaping at the last moment.

If escaping happens at the last possible moment, then the decision not to escape also happens at the last possible moment, and that's a sure-fire way to launder attacker-controlled data.

Instead you should escape everything, and opt-out as early as possible.

> But if you're writing a webapp, passing around escaped strings is a bad idea 99% of the time. It creates code highly coupled to one aspect of your system.

That's why you do the reverse: most strings are unsafe to everything, but the strings which are safe are generally safe to one specific subsystem. So you say that.

> Just imagine if you did this with networking. I'm glad we're not in a world where we're passing around TCPString or UDPString or IPString or EthernetString or TokenRingString or CarrierPigeonString because that happens to be a networking stack the app uses sometimes. It sounds like hell.

It sounds like hell because it makes no sense, there's no such thing as a TCPString because TCP is not string-based and TCP messages are not composed that way.

specialist 3 years ago | |

> a string could be raw unknown bytes, verified UTF-8, or UCS-2 (or even UTF-16 or UCS-4)

Agreed. My future perfect programming language has the predefined types 'ascii', 'utf-8', 'url', 'base64', etc. for misc kinds of character sequences.

Just like how raw bits are different from numerals: short vs byte, word vs int, 64-bits vs double, etc.

(Anyone have a better naming system for 8, 16, 32, and 64 bit chunks of raw data? 'byte', 'word', 'doubleword', 'quadword'?)

Per this "ridonkulously hard" OC article, I'll also ponder predefined types for raw 'html5', 'json', etc (as in unparsed, char sequence vs DOM).

> Perl was early with its concept of “tainted” strings.

Not being a Perl dev, I'm unfamiliar with "taint". Quickly found articles like this: https://www.geeksforgeeks.org/perl-taint-method/

In my future perfect language, char seqs cannot be cast. They must be converted. Basically syntactic sugar for Java-style char encoding infrastructure.

I have assumed that disallowing casting was sufficient. But now I'll have to ponder "taint" too. From the hip, I really like the notion of tracking the provenance of data, a la defensive programming.

Great idea. Thanks.

tinus_hn 3 years ago | |

No. The way to solve this is to recognize where the problem lies. The problem does not lie with storing user input. The problem lies with improperly putting strings in other data.

So all you need to do is to do that properly. Either you commit to using constructs like parameterized queries instead of concatenating strings and use the DOM to put together HTML the way you want, or you escape as you concatenate the strings.

Don’t store escaped strings, it’s a recipe for disaster.
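A quick Python illustration of why storing escaped strings goes wrong, assuming (as is common) that the output layer escapes everything it is handed. The sample value is made up.

```python
import html

surname = "O'Neill & Sons"

# Stored raw, escaped once at output time: correct.
rendered_once = html.escape(surname)

# Stored already escaped, then escaped again by an output layer that
# (reasonably) treats everything as untrusted: the entities get mangled.
stored_escaped = html.escape(surname)
rendered_twice = html.escape(stored_escaped)
```

The browser would display the second result as literal `O&#x27;Neill &amp; Sons`, and the stored value is also wrong for every non-HTML consumer (CSV export, email, search index).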

jerf 3 years ago |

It really isn't. Proof: Most people who try, largely succeed. Those who do something silly like try to do it 100% manually generally rapidly realize that's not a good plan, and usually there is a not-very-hard way to encapsulate it somehow, since that's pretty much what our languages do, encapsulate things.

I'm not saying it's completely trivial or that there's never an issue here or there. What I'm saying is, it's on par with any of dozens of other issues in programming. Bugs happen, errors happen, but no more so than anywhere else. A series of systems with slightly different encoding practices can also cause some headaches, but, again, these are on par with a number of other issues that can emerge in such systems, not especially bad. I've seen a lot of crappy code that gets this wrong at scale, written by programmers who don't really know or care what they're doing, but the same code was crap in a dozen other ways too, and generally screwed up even easier things as well.

Where you get problems is, from largest to smallest: 1. people who don't realize it's an issue at all and concatenate everything, and 2. people who have just been taught about it and are doing the wrong thing, most often trying to filter on the way "in" instead of the way "out". ("Sanitize user input" delenda est. Stop saying it. It's wrong.) This is also not an exceptional case, because again any number of things in the programming world have the exact same characteristics.

I would expect "ridonkulously hard" to encompass something that even when tried is super hard and often a failure, and this isn't that case.

phyzome 3 years ago |

It's not, though. It's the easiest thing in the world: Just use a library that never emits unescaped content by default, not even if you make a single-character typo.

The problem is that most of the libraries aren't that.

evilDagmar 3 years ago |

I've always found it more useful to just discard user input that doesn't come in the format you're asking for, and bail on the entire operation.

Like, if the user might be attempting something fishy, there's no reason to try and "clean it up" and have your program "do its best" with the remainder. Throw an error back at the user and move on to the next query.

jiggawatts 3 years ago | |

User: "My surname is O'Neill"

Server: HTTP/403, begone with you, foul SQL-injecting hacker!

8n4vidtmkvmk 3 years ago | |

that sounds awful. you probably reject phone numbers that use spaces instead of dashes or something? if it's correctable, just correct it and don't hassle the user. if it's ambiguous, then fine, ask the user to clarify.

OkayPhysicist 3 years ago | |

This is the way. Parse, don't sanitize, and fail early. 99% of traffic follows the pretty path, and most of the rest is actively hostile.
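A sketch of "parse, don't sanitize" in Python that also accommodates the correctable cases mentioned upthread (spaces versus dashes). The 10-digit format and the names `PhoneNumber` and `parse_phone` are assumptions for illustration; real phone number rules are far messier.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneNumber:
    digits: str  # exactly 10 ASCII digits, no separators


_SEPARATORS = re.compile(r"[ \-.()]")


def parse_phone(raw: str) -> PhoneNumber:
    """Accept common separators, normalise, and fail early on anything else."""
    digits = _SEPARATORS.sub("", raw.strip())
    if not (digits.isascii() and digits.isdigit() and len(digits) == 10):
        raise ValueError("not a 10-digit phone number: " + repr(raw))
    return PhoneNumber(digits)
```

Harmless formatting differences are normalised away, while anything that does not parse is rejected whole instead of being "cleaned up" into something the user never typed.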

shadowgovt 3 years ago |

This is a space where type systems can be extremely helpful.

Escaped input and unescaped input are separate types. And a robust type system will allow you to craft your functions so that the streams cannot be crossed without going through translation layers.

In fact, the most robust type systems will offer things like automatic function composition so that you have to write a minimum of code... If a type coercion function is available, the type system can be taught to just automatically apply that coercion function before dropping the string into the relevant processing.
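Python has no static coercion machinery like that, but the same shape can be approximated at runtime. A hedged sketch follows, with hypothetical names `SafeHtml`, `to_html`, and `render`; the template string itself is assumed to be written by the developer, never by a user.

```python
import html


class SafeHtml(str):
    """Marker subclass: this text is already safe to emit as HTML."""


def to_html(value: object) -> SafeHtml:
    # The coercion function: pass trusted markup through, escape the rest.
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(str(value)))


def render(template: str, **fields: object) -> SafeHtml:
    """Fill a str.format template, coercing every field through to_html."""
    safe = {name: to_html(value) for name, value in fields.items()}
    return SafeHtml(template.format(**safe))
```

Because `render` returns `SafeHtml`, fragments compose without double escaping: `render("<ul>{item}</ul>", item=render("<li>{n}</li>", n="a&b"))` escapes the `&` exactly once.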

8n4vidtmkvmk 3 years ago | |

+1 wrapping safe strings with a type is the way to go if you need to compose things together

js2 3 years ago |

The interesting part of the article is below the fold and not reflected in the headline:

https://codeofhonor.substack.com/i/78789944/security-theater...

"Most of the rendering bugs I’ve seen in security audits don’t matter. This is not how your organization will be pwned. ... What would fix this? Layered security built around a plausible threat model. What would not help? Removing reflected ASCII text from Shodan’s API error message. I’m not saying that small security bugs aren’t worth fixing, or that organizational security always trumps application security. Rather, real damage usually does not come from where security engineers tend to expect, because they spend their time on pentests and CTFs that differ substantially from the approaches popular among actual attackers."

Everyone commenting that dealing with user input is easy: if it were really easy, we wouldn't keep making the same mistakes. I fixed my first SQL injection attack by switching some code to bind variables over 20 years ago, yet we still have Little Bobby Tables showing up in our collective databases. The fix may be easy ("just do X"), but the mistake is even easier.
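For anyone who hasn't seen the bind-variable fix, a minimal Python sketch using the standard library's `sqlite3` module and an in-memory database. The table and the input are made up, in the spirit of Little Bobby Tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

name = "Robert'); DROP TABLE students;--"

# The ? placeholder sends the value separately from the SQL text, so the
# quote and semicolon are stored as data, never parsed as SQL.
conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

rows = conn.execute("SELECT name FROM students").fetchall()
```

The mistake being "even easier" is visible here: building the same statement with string formatting is one keystroke away and looks almost identical.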

AtlasBarfed 3 years ago | |

Breadth-first security attacks will exploit input-sanitizing holes like that. Security audits can certainly help with that, assuming they don't impose a huge security infrastructure and review process that crushes developer productivity, which always seems to happen.

Depth-first attacks as described are a different class of attack, and of course "audit" won't help that much. Education, penetration testing, and honeypots are some of the stuff that works for that.

Ultimately, if an organization treats its work force like crap, then depth-first attacks are unstoppable. The crypto-locker attackers are strangely pro-worker, because they highlight how disgruntled employees are such effective attack vectors via bribery, vengeance, or apathy.

iLoveOncall 3 years ago |

After reading the article I fail to see what is hard about escaping user input.

It seems like what the author means is that it's hard to think of all the places where user input should be escaped, but even then, if you use any modern framework, everything is escaped by default.

kayodelycaon 3 years ago |

Ruby on Rails pretty much handles this. Regular strings are always escaped in views. Only html_safe strings will emit html. For user input, you should always use the sanitize method instead of raw. :)

jiggawatts 3 years ago | |

Razor pages in ASP will do this too: https://learn.microsoft.com/en-us/dotnet/api/system.web.ihtm...

DeathArrow 3 years ago |

I use a strongly typed language, repository pattern and an ORM. Good luck trying SQL injections. Also input is sanitized at framework level so good luck with XSS.

Also the input has to bypass validation (for which I have unit tests) and the DTOs are mapped to database models before being written.

6510 3 years ago |

It is exactly as hard as one would expect it to be if the only document format becomes an application platform but you still want to do documents.

AtlasBarfed 3 years ago |

Bobby Tables: that is sanitizing inputs, not escaping. Related, but not the same.