EOF is not a character (ruslanspivak.com)
What if instead of a char, getchar() returned an Option<char>? Then you can pattern match, something like this Rust/C mashup:
match getchar() {
    Some(c) => putchar(c),
    None => break,
}
Magical sentinels crammed into return values (like EOF returned by getchar(), -1 returned by ftell(), or NULL returned by malloc()) are one of C's drawbacks.
#include <stdio.h>
struct { int err; char c; } myfunc() {
    return { 0, 'a' };
}
int main(int argc, const char *argv[]) {
    { int err; char c; } = myfunc();
    if (err) {
        // handle
        return err;
    }
    printf("Hello %c\n", c);
    return 0;
}
This is (semantically) perfectly possible today; you just have to jump through some syntactic hoops, explicitly naming that return struct type (because, among other reasons, anonymous structs, even when structurally equivalent, aren't equivalent types unless they're named...). Compilers could easily do that for us! It would be such a simple extension to the standard with, imo, huge benefits.
Every time I have to check for in-band errors in C, or pass a pointer to a function as a "return value", I think of this and cringe.
#include <stdio.h>
#include <tuple>
std::tuple<int, char> myfunc() {
    return { 0, 'a' };
}
int main(int argc, const char *argv[]) {
    auto [ err, c ] = myfunc();
    if (err) {
        // handle
        return err;
    }
    printf("Hello %c\n", c);
    return 0;
}
More stuff like this in https://pdfs.semanticscholar.org/31ac/b7abaf3a1962b27be9faa2...
AFAIK, no? You can return a pointer to a struct, and you can pass whole structs as arguments, but not, IIRC, return them from functions.
EDIT: Apparently you can, sort of, but not portably; how exactly it is defined to work depends on the compiler, and each compiler might define it differently. This means that if you’re using a library which returns a struct and your program uses a different C compiler than the one the library was compiled with, your program will not work. I.e. there is no one defined stable ABI for functions returning structs.
Therefore I think it’s reasonable to regard it as impossible in practice.
Getchar doesn’t return a char; it returns an int (https://en.cppreference.com/w/c/io/getchar).
⇒ if C didn’t do automatic conversions from int to char, we would have that (in a minimalistic sense)
That wouldn’t work for ftell and malloc (and, in general, most of the calls that set errno), though.
Dammit, I knew that. Thank you for flagging my blunder; being precise is really important in this case. The Linux manpage better explains the return value of getchar:
https://linux.die.net/man/3/getchar
"fgetc(), getc() and getchar() return the character read as an unsigned char cast to an int or EOF on end of file or error."
getchar() needs to return an object the width of an unsigned char, but all the values in that range are taken by possible character values. The return type had to be expanded to int in order to accommodate the sentinel.
The alternative of using an algebraic type is superior because the end-of-stream condition has a different type (so to speak), and furthermore, the programmer has no choice but to deal with it because the character value comes wrapped inside an Option which must be stripped away before the character value can be used.
Really, you also want the type system to express all possible error conditions as well, since getchar() returning EOF can mean either that end-of-file was reached or that some other error occurred!
As someone who has written lots of C code and worked hard to account for all possibilities manually, I really appreciate it when the type system and APIs can express all possibilities and back me up.
They're part of the C standard library. The POSIX I/O APIs don't have these problems. The Linux I/O system calls are even better because they don't have errno.
Honestly, the C standard library just isn't that good. Freestanding C is a better language precisely because it omits the library and allows the programmer to come up with something better.
That would be the textbook case of stupid over-engineering.
Programs retrieve the data in a file by a system call ... called read. Each time read is called, it returns the next part of a file ... read also says how many bytes of the file were returned, so end of file is assumed when a read says "zero bytes are being returned" ... Actually, it makes sense not to represent end of file by a special byte value, because, as we said earlier, the meaning of the bytes depends on the interpretation of the file. But all files must end, and since all files must be accessed through read, returning zero is an interpretation-independent way to represent the end of a file without introducing a new special character.
Read what follows in the book if you want to understand Ctrl-D down cold.
It's an artifact of that era. Along with "BREAK", which isn't a character either.
GCC only outputs a warning by default: "warning: return type defaults to ‘int’ [-Wimplicit-int]"
Procedural programmers don't generally have a problem with this -- getchar() returns an int, after all, so of course it can return non-characters, and did you know that IEEE-754 floating point can represent a "negative zero" that you can use for an error code in functions that return float or double?
Functional programmers worry about this much more, and I got a bit of an education a couple of years ago when I dabbled in Haskell, where I engaged with the issue of what to do when a nominally-pure function gets an error.
I'm not sure I really got it, but I started thinking a lot more clearly about some programming concepts.
For example,
$ python3 -c 'print("".join(chr(c) for c in range(10)))' | python3 -c 'print(list(ord(c) for c in input()))'
will confirm that it doesn't happen in a pipe (the ASCII 4 character there is totally unrelated to EOF).
It was sometimes used to have TYPE print something human readable and stop before the remaining (binary) file data would scroll everything away
So, is the length of each file stored as an integer, along with the other metadata? This reminds me of how in JavaScript the length of an array is a property, instead of a function that counts it right then, like say in PHP.
Apparently it works. I've never heard of a situation where the file size number did not match the actual file size, nor of a time when the JavaScript array length got messed up. But it seems fragile. File operations would need to be ACID-compliant, like database operations (and likewise JavaScript array operations). It seems like you would have to guard against race conditions.
Does anyone have a favorite resource that explains how such things are implemented safely?
EDIT: Seems like 26 = EOF is a DOS thing.
EDIT 2: Some confusing comments: https://www.perlmonks.org/bare/?node_id=228760
EDIT 3: A pretty good thread (read NigelQ's reply): http://forums.codeguru.com/showthread.php?181171-End-of-File...
Hoping Cunningham's Law comes into play with this comment. :)
Since I am more used to Windows where ctrl-c is copy, I followed other people's suggestion and mapped ctrl-x to do what ctrl-c usually does, with:
stty intr ^X -ixon
This is because X and C are very close, and I couldn't sacrifice ctrl-v (paste) or ctrl-z (background) while I seldom use ctrl-c.
I'm sure you could do the same with ctrl-d if you really wanted to.
[1]: https://doc.rust-lang.org/std/io/trait.Read.html#method.read...
(In fact, thinking better about it, there are some cases where `read()` could legitimately return `UnexpectedEof`, like when it's a wrapper for a compressed stream which has fixed-size fields, and that stream was truncated in the middle of one of these fields. It's clear that, in that case, `UnexpectedEof` is not an end-of-file for the wrapper; it should be treated as an I/O error.)
Yes, you can. You just end your stream by closing the pipe.
The exception even tells you that "chr() arg not in range(0x110000)" which has nothing to do with range of C's character types.
https://sourceware.org/bugzilla/show_bug.cgi?id=1190
https://sourceware.org/legacy-ml/libc-alpha/2018-08/msg00003...
> All stdio functions now treat end-of-file as a sticky condition. If you read from a file until EOF, and then the file is enlarged by another process, you must call clearerr or another function with the same effect (e.g. fseek, rewind) before you can read the additional data. This corrects a longstanding C99 conformance bug. It is most likely to affect programs that use stdio to read interactive input from a terminal.
Although interestingly somehow I'm still seeing the old behavior in Debian Buster with glibc 2.28 with python3.
import sys
while True:
    b = sys.stdin.read(1)
    print(repr(b))
With old glibc with both python2 and python3 the EOF isn't sticky (as expected). With 2.28 with python2 the EOF is sticky (like you said). With 2.28 with python3 it's not sticky for some reason.
^D (0x04) is EOT and 0x03 is ETX (End of Text): https://www.systutorials.com/ascii-table-and-ascii-code/
So, kinda, but somehow I'm happy it never got turned into weird combinations depending on the OS.
ISO C says that char must be at least 8 bits, and that int must be at least 16. It is entirely legal to have an implementation that has 16-bit signed char and sizeof(int)==1. In which case -1 is a valid char, and there's no way to distinguish between reading it and getting EOF from getchar().
Large swaths of the C standard were built during the heyday of computer design, when you had all sorts of wacky sizes, behaviors and abstractions. Lots of "undefined behavior" is effectively deterministic, because all modern computers have converged to do so many things the same way.
I am begging, please never ever do this. NaN literally exists for this reason. NaN even allows you to encode additional error context and details into the value.
This is a supplementary source of confusion.
> Character 26 was used to mark "End of file" even if the ASCII calls it Substitute, and has other characters for this. Number 28 which is called "File Separator" has also been used for similar purposes. [1]
I think today we would think of character 4 (End of Transmission, Ctrl-D) as the end of file/input marker, but historically Character 26/Ctrl-Z was used, even on disk.
If by procedural you mean, nonsense, then sure... I agree that a function named `getchar` returning an `int` is procedural. :P
(Though by the way: having functions that evaluate to a value when executed is itself a feature that belongs to the functional paradigm, although one so trivial and common that it’s not usually thought of as such. But a purely imperative/procedural way of returning values would be via out parameters or global variables.)
When Rust introduced ADTs they were recognizably a concept from functional programming. It's a place or community of practice, not a purely descriptive adjective.
Why are you being snarky?
They clearly mean the issue of modelling partial functions which would normally be done by a side-effect in a procedural language but can’t in a functional language.
For binary files, you just assume there is padding at the end of the file to the end of the sector. For text files, the SUB code was used to indicate where the file ended.
One gives a priori information the other a posteriori.
It's amusing that almost the same can be said about NT: for any given eccentricity of Windows NT it's a good bet that it came from VMS, since the two had the same principal designer.
Either way, no platform defines bytes to be Unicode code points.
Whether this is an advantage is heavily domain dependent.
Notably, in the PNG file format (created back when MS-DOS was still very relevant):
"The first eight bytes of a PNG file always contain the following values: [...] The control-Z character stops file display under MS-DOS. [...]" (http://www.libpng.org/pub/png/spec/1.2/PNG-Rationale.html#R....)
Do architectures like that have non-freestanding C implementations, though? It's kinda moot if there's no getchar()...
Implementing IO in a "pure" way, is however another discussion.
The DOS syscall interface has no concept of an EOF character. ^Z being considered EOF was a feature of the COPY command, later replicated by the runtimes of various languages targeting DOS.
When the TTY device receives (by default) Ctrl+C or Ctrl+D, it sends the signals to the program. The TTY's 'line discipline' (the policy for when the program's STDIN can read from a line of input) can be changed from the default 'cooked' mode to a 'raw' mode. With the raw-mode line discipline, Ctrl+C doesn't send the signal. Presumably that's why e.g. vi or emacs don't just close on Ctrl+C.
This helps me, thanks for pointing me back at this great write-up.
> 'stty -icanon' still interprets control characters such as Ctrl-C whereas 'stty raw' disables even this and is the real raw mode.
From the very detailed link posted by rgoulter above.
Still, in raw mode, Ctrl+D will send EOT, and thus end your shell. While Ctrl+C won't.
What kind of wicked education did you have for this to be the case?
My dad taught me about bits and bytes and words when I was a kid, and by 16 I had a quite solid grasp of it (without any textbook). Then I studied several years and got a PhD in applied math (mostly numerical pde, and that involved a lot of programming). Then I have spent 15 more years doing math and programming in several languages (mostly C and Python) and getting paid for teaching data science and signal processing to people who went on to have fruitful jobs in industry. Today, I read the wikipedia page about "option type" [1] and the one about type theory [2], which seems a prerequisite, and couldn't understand a word.
I'm not sure what you mean about compilers.
But structure type return values are well specified for most calling conventions, and quite a number of compilers support explicitly specifying the calling convention for mixed-language or mixed-compiler situations.
Also from that link:
> 32-bit cdecl calling convention
> For return values of structure or class type, there is wide incompatibility amongst compilers. Some make the return thread-safe, by breaking compatibility with the 16-bit cdecl calling convention. Some retain compatibility, at the expense of their 32-bit cdecl calling convention not being thread-safe. The ones that break compatibility don't all agree with one another on how to do so.
https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Incompatibiliti...
https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Code-Gen-Option...
https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Warning-Options...
This is mostly not a practically relevant issue. (Nor are pre-K&R compilers relevant, although something like this could arise among modern compilers.) As far as oddball situations go, it's far from the thorniest to deal with - it doesn't even involve C++.
https://play.rust-lang.org/?version=stable&mode=debug&editio...
I'm being snarky, as is my nature, to highlight the madness of a function called `getchar` returning anything but a `char`.
Integers are numbers like -1337, 0, and 42.
Characters are things that compose strings of text.
These are not the same kind of thing at all. That APIs may be leaky, and that some of these APIs are held in very high regard, doesn't change that fact.
It’s effectively returning a Maybe(char).
A `Maybe<char>` has exactly one `None` variant. While an `int` has many, many negative values.
Also, just calling it `None` (or similar) makes clear what is meant, while `-1` is some magic value.
It's a documented return value. Nothing magic about it.
You do not need to understand theoretical type theory to understand options. It's just like a pointer that can be NULL except the compiler makes sure you can't accidentally dereference it if it is. Algebraic data types in general are basically just structs and tagged unions, except the compiler makes sure you can't screw the tags up.
Like, dude, by your own account, you're pretty smart; that's the point of your last paragraph, right? There are, at this point, hordes of Rust and Scala and Swift and Kotlin programmers who can figure out how option types work, and don't seem to have too much of a problem with it and pretty much universally think they're great. Are they actually just smarter than you?
Sure they are. Or at least they do not hold an irrational, primary hatred of over-abstraction like I do. In math there's also people like this, who work in stuff like category theory, logic and whatnot. Fortunately, they are a mostly controlled minority.
It’s just a wrapper around some value that is either Some(value) or None and you need to unwrap it and handle both possibilities for your code to compile.
You don’t need to know anything about monads or ADTs to understand it.
If you understand both, which one is conceptually easier to you?
1) It applies exactly the same way to any type, not just char.
2) You don't need to read the man page for every single function that returns an int on the off chance that said int actually contains a bool, a char, or a short plus additional flags.
I just do not understand what the problem with Option is.
It's either Some(1) or None. If it's some you have the value, otherwise you handle the fact you don't have it.
It's so simple and basically every modern language uses it to handle nullable types.
Rust has tons of help dealing with options built into the language.
Which answers your second question: "bitstreams" would be terrible because they are not well connected with a hardware reality. Unless you have a bitstream-oriented CPU, it is a bad idea for a basic type to go against the hardware.
Why even have types... Well, yes, there are languages without type checking where the notion still exists. For instance Forth has no type checking but two types are implied: the "byte" type and the "machine word size" type, maybe three if you count strings.
getchar() gets a char, not a character.