Hints for writing Unix tools (monkey.org)

294 points by mariusae 11 years ago | 125 comments

hoggle 11 years ago |

“One thing well” misses the point: it should be “One thing well AND COMPOSES WELL”

If the implementation isn't respecting the Rule of Composition, it's not adhering to the Unix philosophy in the first place. The tweet is referring to a famous quote by Doug McIlroy (one of the Unix founders and inventor of the Unix pipe):

"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface."

Pure beauty, but it's almost too concise a definition if you haven't experienced the culture of Unix (many years of usage / reading code / writing code / communication with other followers). ESR's exhaustive list of Unix rules in plain English might be a better start for the uninitiated (among which one will find the aforementioned Rule of Composition).

For all those seeking enlightenment, go forth and read The Art of Unix Programming:

https://en.wikipedia.org/wiki/The_Art_of_Unix_Programming

17 Unix Rules:

https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E...

jzwinck 11 years ago |

Here's one more tip: did you ever notice that "ls" displays multiple columns, but "ls | cat" prints only one filename per line? Or how "ps -f" truncates long lines instead of wrapping, while "ps -f | cat" lets the long lines live?

You can do it too, and if you're serious about writing Unix-style filter programs, you will someday need to. How do you know which format to write? Call "isatty(STDOUT_FILENO)" in C or C++, "sys.stdout.isatty()" in Python, etc. This returns true if stdout is a terminal, in which case you can provide pretty output for humans and machine-readable output for programs, automatically.
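A minimal Python sketch of that switch, for what it's worth. The helper name `format_names` and the two-space separator are made up for illustration; real "ls" does far more layout work than this.

```python
import sys


def format_names(names, for_terminal):
    """Return the listing as one string.

    for_terminal=True  -> names joined on a single line (human-friendly)
    for_terminal=False -> one name per line (easy to grep, cut, or sort)
    """
    if for_terminal:
        return "  ".join(names) + "\n"
    return "".join(name + "\n" for name in names)


if __name__ == "__main__":
    names = ["a.txt", "b.txt", "c.txt"]  # placeholder data
    # isatty() is True only when stdout is connected to a terminal.
    sys.stdout.write(format_names(names, sys.stdout.isatty()))
```

Run it directly and you should get the single-line form; pipe it through "cat" and you should get one name per line.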

dap 11 years ago | |

IMO, this is an anti-pattern. It violates the principle of least surprise. (How come I see X when I run the command, but I can't grep for X in its output? How come it works when I run it from my interactive shell, but it's broken when I run it from a script? And things like that.)

teddyh 11 years ago | | |

Indeed, the GNU Coding Standards explicitly argue against doing that:

“please don’t make the behavior of a command-line program depend on the type of output device it gets as standard output or standard input.” [1]

[1] https://www.gnu.org/prep/standards/standards.html#User-Inter...

burke 11 years ago | | |

I think it depends what sort of things you use it for. I often use it to switch on or off ANSI colourization, which doesn't really violate the principle of least surprise.

When used sparingly and thoughtfully, I've never personally had an issue with it.
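Roughly what I mean, as a Python sketch. The "always"/"never"/"auto" modes are my own assumption here (modelled loosely on the --color option some tools offer), and `want_color` and `paint` are hypothetical helper names.

```python
import sys

RED = "\033[31m"
RESET = "\033[0m"


def want_color(mode, stream):
    """Decide whether to emit ANSI colour codes.

    mode is "always", "never", or "auto"; "auto" colours only when
    the stream is a terminal.
    """
    if mode == "always":
        return True
    if mode == "never":
        return False
    return stream.isatty()


def paint(text, enabled):
    """Wrap text in red escape codes when colour is enabled."""
    return f"{RED}{text}{RESET}" if enabled else text


if __name__ == "__main__":
    print(paint("error: something failed", want_color("auto", sys.stdout)))
```

The text itself is identical either way, so grep sees the same words whether or not the colour codes are emitted.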

userbinator 11 years ago | | |

I agree, especially for the behaviour from the parent:

"ps -f" truncates long lines instead of wrapping, while "ps -f | cat" lets the long lines live

How people usually discover what these commands do is by running them interactively, and if that results in some output being hidden vs being run noninteractively, then they have little reason to believe that it could yield more output than what they're used to seeing. I think a certain number of "ps" users don't know it can display full paths and commands, if they've only ever used it interactively.

vog 11 years ago | | |

Indeed. I've seen people stumble over this weird behaviour over and over.

It may have some merits, but as general advice this is definitely an anti-pattern.

Another example is "curl", where "curl URL >outfile" is chatty on stderr, while "curl URL" is quiet on stderr. That's very annoying for scripting: because of that behaviour, you easily forget to set "-s" in your scripts.

emmelaich 11 years ago | | |

And yet .. programs are not just for composition. They have to behave sensibly for people.

I love that 'git log' outputs in a pager. 'svn log' by comparison is nuts.

gohrt 11 years ago | | |

This behavior of Unix programs is basically the same concept as Perl's "context" (list vs scalar), but even more so.

easytiger 11 years ago | | |

Yeah, most commands don't do this though. It's most commonly used for colouring output.

ls is a bit more than just a command though. It's part of the furniture and prehistoric.

oakwhiz 11 years ago | | |

I have run into this problem when trying to automate certain tasks on UNIX boxes.

Dealing with programs that act differently depending on their output device is very annoying.

pessimizer 11 years ago | | |

I despise it when commands do this - mysql -e results are formatted differently depending on whether the output is directed to the terminal or to a file.

burke 11 years ago | |

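Or, execute "/bin/[ -t 1 ]" (or "test -t 1", or "[[ -t 1 ]]", or ...). This is handy in shell scripts (obviously), but also in languages like Go, which lack a builtin way to test whether stdout is a TTY, e.g.:

    // "[" needs its closing "]" as the last argument.
    cmd := exec.Command("/bin/[", "-t", "1", "]")
    // Hand our stdout to the child so it tests the same descriptor.
    cmd.Stdout = os.Stdout
    isatty := cmd.Run() == nil // exit status 0 means fd 1 is a terminal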

ksherlock 11 years ago | | |

Please just call Fstat and check Stat_t.Mode & S_IFCHR
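For comparison, a sketch of the same check in Python (`is_char_device` is a made-up name). One caveat I'm fairly sure of: a character device is not necessarily a terminal. /dev/null is a character device too, so this check and isatty() can disagree.

```python
import os
import stat
import sys


def is_char_device(fd):
    """True when fd refers to a character device (the S_IFCHR check)."""
    return stat.S_ISCHR(os.fstat(fd).st_mode)


if __name__ == "__main__":
    fd = sys.stdout.fileno()
    # Report on stderr so the answer is visible even when stdout is redirected.
    print("character device:", is_char_device(fd), file=sys.stderr)
    print("isatty:", os.isatty(fd), file=sys.stderr)
```

Redirecting stdout to /dev/null is a quick way to see the two answers differ.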

grymoire1 11 years ago | |

As I recall, the original ls didn't have that feature.

Examining the characteristics of the output stream and changing behavior is another "rule" that is not mentioned often. Another example is buffering the output to a large block if sending to a pipe, but making it line-buffered if going to a terminal.

voltagex_ 11 years ago |

I'm not sure I agree with the "no JSON, please" remark. If I'm parsing normal *nix output I'm going to have to use sed, grep, awk, cut or whatever and the invocation is probably going to be different for each tool.

If it's JSON and I know what object I want, I just have to pipe to something like jq [1].

PowerShell takes this further and uses the concept of passing objects around, so I can do things like ls | % { $_.Name } and extract a list of file names (or paths, or extensions etc)

[1]: http://stedolan.github.io/jq/
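To make that concrete, here is a small Python sketch of a tool that emits one JSON object per line. The record fields and the function name `to_json_lines` are invented for the example.

```python
import json
import sys


def to_json_lines(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(rec, sort_keys=True) + "\n" for rec in records)


if __name__ == "__main__":
    # Placeholder records; a real tool would gather these itself.
    files = [
        {"name": "notes.txt", "size": 120},
        {"name": "my file.wav", "size": 48000},
    ]
    sys.stdout.write(to_json_lines(files))
```

Assuming it is saved as listing.py (a hypothetical name), "python3 listing.py | jq -r .name" pulls out just the names, spaces and all, with no field splitting to get wrong.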

osandov 11 years ago |

A nitpicky tip: --help is normal execution, not an error, so the usage information should be printed to stdout, not stderr (and it should exit with a successful status). Nothing is more annoying than trying to use a convoluted program with a million flags (which should have a man page in the first place) and piping --help into less with no success.

grymoire1 11 years ago | |

I hate it when a program has a huge --help output, and the man page is nearly empty, and says "see the --help option for more details." Things like examples, see also, etc. are very valuable to someone trying to figure out how to use a program....

foobarbaz1234 11 years ago | |

I am not so sure about that. Say your program is used in a shell script and is invoked badly; you might want to print its usage then. If you exit normally your shell script might break weirdly, but if you exit with an error it's easier to spot the reason for the failure.

On the other hand, you made me think, and you should probably have three code paths by default:

  [0] normal behaviour (exit 0)
  [1] bad arguments (exit EINVAL)
  [2] --usage (print to stdout but exit != 0)?

Anyway I am not sure if it makes sense to declare "usage" as normal behaviour.

Someone 11 years ago | | |

In my book, there is a difference between explicitly asking for help/usage and passing arguments that do not make sense, which triggers the output of help/usage.

The former, I think, should write to stdout and return 0, the latter should write to stderr and return something non-zero.

Giving help if the user asks for it is normal behaviour.
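A sketch of that split in Python, under my own assumptions: "frob" is a made-up tool name, the exit status 2 for bad arguments is just a convention I picked, and `run` takes its streams as parameters only so the behaviour is easy to check.

```python
import sys

USAGE = "usage: frob [--help] FILE\n"  # "frob" is a made-up tool name


def run(argv, out=sys.stdout, err=sys.stderr):
    """Return the exit status for the given argument list."""
    if "--help" in argv:
        out.write(USAGE)  # help was requested: normal output, success
        return 0
    if len(argv) != 1:
        err.write(USAGE)  # bad invocation: diagnostic output, failure
        return 2
    out.write(f"processing {argv[0]}\n")
    return 0


# A real script would end with: sys.exit(run(sys.argv[1:]))
```

With this layout "frob --help | less" works as expected, while a bad invocation inside a script still fails loudly.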

pimlottc 11 years ago | |

This annoys me to no end. Of course, you can work around it:

    annoying_program 2>&1 | less

but it is very unfriendly to stymie a user's attempt to get help when they're already probably confused.

Animats 11 years ago |

1978 called. It wants its pipes back.

That approach dates from the days when you got multi-column directory listings with

  ls | mc

Putting multi-column output code in "ls" wasn't consistent with the UNIX philosophy.

There's a property of UNIX program interconnection that almost nobody thinks about. You can feed named environment variables into a program, but you can't get them back out when the program exits. This is a lack. "exit()" should have taken an optional list of name/value pairs as an argument, and the calling program (probably a shell) should have been able to use them. With that, calling programs would be more like calling subroutines.

PowerShell does something like that.

grosskur 11 years ago | |

You can simulate this with so-called "Bernstein chaining". Basically, each program takes another program as an argument, and finishes by calling exec() on it rather than exit(), which preserves the environment. See:

http://www.catb.org/~esr/writings/taoup/html/ch06s06.html

Or write environment variables to stdout in Bourne shell syntax so the caller can run "eval" on it. Like ssh-agent, for example.
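A rough Python sketch of the eval approach. The variable names are hypothetical, `export_lines` is a made-up helper, and it assumes the names are already valid shell identifiers (only the values are quoted).

```python
import shlex


def export_lines(variables):
    """Render name/value pairs as Bourne shell 'export' statements.

    Values are quoted with shlex.quote so that spaces and shell
    metacharacters survive the caller's eval.
    """
    return "".join(
        f"export {name}={shlex.quote(value)}\n"
        for name, value in variables.items()
    )


if __name__ == "__main__":
    # Hypothetical variable names, for illustration only.
    print(export_lines({"TOOL_SOCK": "/tmp/my tool.sock", "TOOL_PID": "1234"}), end="")
```

The caller would then do something like: eval "$(python3 tool.py)" (tool.py being a placeholder name), after which the variables are set in the calling shell.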

gohrt 11 years ago | | |

Continuation Passing Style! http://en.wikipedia.org/wiki/Continuation-passing_style

agumonkey 11 years ago | | |

Oh wow, unix continuation passing style. Never heard of that o_o;

oneeyedpigeon 11 years ago | |

I agree that the column formatting code shouldn't be in ls. However, if it were removed (which it won't ever be, of course: theoretical) I would want every system I ever access via a terminal to somehow alias ls to "ls | mc". To support full working of ls, though, that can't just be a straight alias, so I need a shell script to handle things like parameters to ls, which itself is then aliased to ls ... is that really better?

4ad 11 years ago | |

In Plan 9 programs return strings instead of numeric codes.

to3m 11 years ago |

Additional tip: if writing a tool that prints a list of file names, provide a -0 option that prints them separated by '\x0' rather than white space. Then the output can be piped through xargs -0 and it won't go wrong if there are files with spaces in their paths.

I suggest -0 for symmetry with xargs. find calls it -print0, I think.
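Something like this Python sketch, where `write_names` is a made-up helper and the file names are placeholders:

```python
import sys


def write_names(names, out, nul=False):
    """Write file names terminated by newline, or by NUL when nul=True."""
    terminator = "\0" if nul else "\n"
    for name in names:
        out.write(name + terminator)


if __name__ == "__main__":
    # "-0" mirrors the xargs flag; the names are placeholders.
    use_nul = "-0" in sys.argv[1:]
    write_names(["plain.txt", "with space.txt"], sys.stdout, nul=use_nul)
```

Assuming it is saved as names.py, "python3 names.py -0 | xargs -0 ls -l" should hand "with space.txt" to ls as a single argument.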

(In my view, this is poor design on xargs's part; it should be reading a newline-separated list of unescaped file names, as produced by many versions of ls (when stdout isn't a tty) and find -print, and doing the escaping itself (or making up its own argv for the child process, or whatever it does). But it's too late to fix now I suppose.)

fragmede 11 years ago | |

> newline-separated list of unescaped file names

That breaks when you have newlines in filenames, no?

to3m 11 years ago | | |

File names often have spaces in them, but very rarely newlines. Based on xargs's current behaviour, it's clearly no problem to just not support certain characters in file names by default. I just think it would have been more useful for it to not support a smaller set of names.

pstuart 11 years ago | | |

> That breaks when you have newlines in filenames, no?

That seems like an extremely pathological case.

mappu 11 years ago | | |

And \x0 separator breaks when you have \x0 in filenames. Pragmatically it's a question of rarity, but ultimately the shell should support something like prepared queries in SQL.

ole_tange 11 years ago | |

Your view was heard in the design of GNU Parallel: It defaults to newline separation, escapes the argument, and is for most cases a drop-in replacement of xargs.

This does what you would expect:

  echo My brother\'s 12\" records.txt | parallel touch

wahern 11 years ago | |

xargs assumes the input is composed of quoted and escaped atoms. Compare

  $ printf '"foo bar"' | xargs -n1

and

  $ printf '"foo" "bar"' | xargs -n1

and

  $ printf "%s" '\\"foo bar\\"' | xargs -n1

acabal 11 years ago |

Great article. The other thing I've always wished for command-line tools is some kind of consistency for flags and arguments. Kind of like a HIG for the command line. I know some distros have something like this, and that it's not practical to do as many common commands evolved decades ago and changing the interface would break pretty much everything. But things like `grep -E,--extended-regexp` vs `sed -r,--regexp-extended` and `dd if=/a/b/c` (no dashes) drive me nuts.

In a magical dream world I'd start a distro where every command has its interface rewritten to conform to a command line HIG. Single-letter flags would always mean only one thing, common long flags would be consistent, and no new tools would be added to the distro until they conformed. But at this point everyone's used to (and more importantly, the entire system relies on) the weird mismatches and historical leftovers from older commands. Too bad!

dap 11 years ago |

Lots of great points here, but as always, these can be taken too far. Header lines are really useful for human-readable output, and can be easily skipped with an optional flag. (-H is common for this).

The "portable output" thing is especially subjective. I buy that it probably makes sense for compilers to print full paths. But it's nice that tools like ls(1) and find(1) use paths in the same form you gave them on the command-line (i.e., absolute pathnames in output if given absolute paths, but relative pathnames if given relative paths). For one, it means that when you provide instructions to someone (e.g., a command to run on a cloned git repo), and you want to include sample output, the output matches exactly what they'd see. Similarly, it makes it easier to write test suites that check for expected stdout contents. And if you want absolute paths in the output, you can specify the input that way.

zaptheimpaler 11 years ago | |

I also think headers should be included. It's really annoying to have to pore through a man page just to see what the columns mean. You could use flags, or maybe send headers to STDERR.

peterwwillis 11 years ago |

Not every program will be able to take input on stdin and write output to stdout. If you have a --file (or -f) option, you'd do well to support a "-" file argument, which means either stdin or stdout, depending on whether -f is being read or written. But you won't support "-" if the -f option requires seeking backwards in a file. Neither will you be using stdin or stdout if binary is involved (because tty drivers).
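A minimal Python sketch of the "-" convention; `open_input` and `open_output` are hypothetical helper names, and text mode is assumed throughout.

```python
import sys


def open_input(path):
    """Return a readable text stream; "-" means standard input."""
    return sys.stdin if path == "-" else open(path)


def open_output(path):
    """Return a writable text stream; "-" means standard output."""
    return sys.stdout if path == "-" else open(path, "w")
```

One thing to watch with this shape: the caller should close only the streams that came from open(), not stdin or stdout.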

'One thing well' is often intended to make people's lives easier on the console. Sometimes this means assuming sane defaults, and sometimes just a simpler program that does/assumes less. Take these two examples and tell me which you'd prefer to type:

  user@host~$ ls *.wav | xargs processAudio -e mu-law --endian swap -c 2 -r 16000
  user@host~$ find . -type f -maxdepth 1 -name '*.wav' -exec processAudio -e mu-law --endian swap -c 2 -r 16000 {} \;

Write concise technical documentation. Imagine it's your first day on a new job and you need to learn how all your new team's tools work; do you want to read every line of code they've written just to find out how it works, or do you want to read a couple pages of technical docs to understand in general how it works? (That's a rhetorical question)

Definitely provide a verbose mode. When your program doesn't work as expected, the user should be able to figure it out without spending hours debugging it.

_pmf_ 11 years ago |

I have a strong bias against people who quote their own tweets in their own blog posts. I find this to be highly narcissistic.

1amzave 11 years ago | |

I sympathize, but I have to say I find it far less annoying than the constant implorings to "follow me on Twitter!" that have become obnoxiously ubiquitous in the last few years.

RexRollman 11 years ago |

Wow, it's been a while since I've seen a monkey.org link. I thought the site was dead. Nice to see I was wrong.

mseepgood 11 years ago |

Another tip: don't do colored output. I don't want to deal with ANSI codes in your output.

arh68 11 years ago |

I think it's insane to restrict programs to just STDOUT & STDERR. Why 2? Why not use another file descriptor, maybe STDFMT, to capture all the formatting markup? This would avoid -0 options (newlines are markup sent to stdfmt, all strings on stdout are 0-terminated), it would avoid -H options (headers go straight to STDFMT), it would allow for less -R to still work, etc.

It's possible other descriptors would be useful, like stdlog for insecure local logs, stddebug for sending gobs of information to a debugger. It's certainly not in POSIX, so too bad, but honestly stdout is hard to keep readable and pipe-able. Adding just one more file descriptor separates the model from the view.

chilicuil 11 years ago |

I agree with what is laid out in the article, and I've actually added more details on how to apply these "principles" to shell scripting:

http://javier.io/blog/en/2014/10/21/hints-in-writing-unix-to...

jwr 11 years ago |

I would add to this list:

If you are intercepting UNIX signals (starting with SIGINT), go back to the drawing board and think again. Don't do it. There is almost never a good reason for doing it, and you will likely get it wrong and frustrate users.

pjc50 11 years ago | |

I wrote one of these ages ago that was very useful (regain interactive control of an otherwise batch program) but broke all sorts of 'rules', including doing blocking IO in the signal handler.

edwintorok 11 years ago | |

How about cleaning up tempfiles on ^C?

renox 11 years ago | | |

YMMV but I prefer cleaning the old tempfiles at start-up. It allows you to get the content of the tempfiles after the program stopped, very handy for debugging.
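Roughly like this Python sketch. The "mytool-" style prefix and both function names are made up, and it assumes the tool owns every file in the directory that starts with its prefix.

```python
import os
import tempfile


def remove_stale(directory, prefix):
    """Delete temp files left by earlier runs; return the removed paths."""
    stale = [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if name.startswith(prefix)
    ]
    for path in stale:
        os.remove(path)
    return stale


def new_tempfile(directory, prefix):
    """Create this run's temp file; it is deliberately left behind on exit."""
    fd, path = tempfile.mkstemp(prefix=prefix, dir=directory)
    os.close(fd)
    return path
```

Call remove_stale() first thing at start-up, then new_tempfile(); no signal handler is involved, and after a ^C the last run's files are still there to inspect.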