Pnut: A C to POSIX shell compiler you can trust (pnut.sh)

193 points by feeley 1 year ago | 118 comments

"Because Pnut can be distributed as a human-readable shell script (`pnut.sh`), it can serve as the basis for a reproducible build system. With a POSIX compliant shell, `pnut.sh` is sufficiently powerful to compile itself and, with some effort, [TCC](https://bellard.org/tcc/). Because TCC can be used to bootstrap GCC, this makes it possible to bootstrap a fully featured build toolchain from only human-readable source files and a POSIX shell.

Because Pnut doesn't support certain C features used in TCC, Pnut features a native code backend that supports a larger subset of C99. We call this compiler `pnut-exe`, and it can be compiled using `pnut.sh`. This makes it possible to compile `pnut-exe.c` using `pnut.sh`, and then compile TCC, all from a POSIX shell."

Is there anywhere we can see a step-by-step demo of this process?

Curious if the authors tried NetBSD or OpenBSD, or using another small C compiler, e.g., pcc.

Historically, tcc was problematic for NetBSD and its forks. Not sure about today, but tcc is still in NetBSD pkgsrc WIP which suggests problems remain.

kazinator 1 year ago | |

Problem is:

- a shell is required, which has to be built from sources, using a compiler which was also built from sources using a compiler binary. That's the real bootstrap.

- even if you could pick some shell, and compiled it with pnut.exe, the compiled code requires interpretation by an executable shell.

- there is no such thing as a "POSIX compliant shell"; that's an abstract category. All this amounts to is a promise that pnut.sh will not generate code that uses non-POSIX features.

theamk 1 year ago |

If you are wondering how it handles C-only functions.. it does not.

open(..., O_RDWR | O_EXCL) -> runtime error, "echo "Unknow file mode" ; exit 1"

lseek(fd, 1, SEEK_HOLE); -> invalid code (uses undefined _lseek)

socket(AF_UNIX, SOCK_STREAM, 0); -> same (uses undefined _socket)

Looking closer at the "cp" and "cat" examples, the write() call does not handle errors at all. Forget about partial writes, it does not even return -1 on failures.

"Compiler you can Trust", indeed... maybe you can trust it to get all the details wrong?

Cloudef 1 year ago | |

There seems to be a libc in the repo, but many functions are TODO: https://github.com/udem-dlteam/pnut/tree/main/portable_libc

Otherwise the builtins seem to be here: https://github.com/udem-dlteam/pnut/blob/main/runtime.sh

FYI, none of your functions are "C functions"; they are POSIX functions. I did not expect it to be complete, but it's still impressive for what it is.

westurner 1 year ago | | |

There are Linux ports of the plan9 `syscall` binary, which is presumably necessary to implement parts of libc with shell scripts: https://stackoverflow.com/questions/10196395/os-system-calls...

I don't remember there being a way to keep a server listening on a /dev/tcp/$ip/$port port, for sockets from shell scripts with shellcheck at least

vlovich123 1 year ago | |

I suspect the “trust” is a reference to Ken Thompson’s Turing Award lecture “Reflections on Trusting Trust”, where he laid out the concern of a back door in a compiler that survives updates to the compiler. In other words, the compiler injects a back door into future versions of itself, in addition to your programs, in a way that source-level analysis of the code will never reveal.

I think the pitch here is that it can compile TCC which can then compile GCC which makes it much more difficult for a backdoor to survive potentially, especially if the shell code is easier to read and verify than the corresponding assembly.

Within that context, an incomplete libc is irrelevant.

PhilipRoman 1 year ago | |

Implementation issues aside, while technically it should be possible to seek a file descriptor from shell through a suitable helper program in C, I believe none of the POSIX utilities provide this facility

oguz-ismail 1 year ago | | |

head, read, and sed can be used for seeking forward according to POSIX (see the INPUT FILES section here <https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V...>). I doubt non-GNU implementations support it though.
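
To make the "seek forward by consuming input" idea concrete, here is a minimal sketch. It swaps in `dd` with `bs=1` rather than the head/read/sed approach mentioned above, since that consumes an exact byte count; the helper name `skip_bytes` and the temp file path are made up for illustration. It only moves forward, so it is not a real lseek.

```shell
#!/bin/sh
# Sketch: emulate a forward-only seek by consuming bytes from a shared stdin.
# skip_bytes is a hypothetical helper, not part of Pnut or any POSIX utility.

# skip_bytes N: read and discard exactly N bytes from standard input.
skip_bytes() {
  dd bs=1 count="$1" >/dev/null 2>&1
}

tmp="${TMPDIR:-/tmp}/seek_demo.$$"
printf 'hello world\n' > "$tmp"

# dd and cat share the same open file description, so cat continues
# reading where dd stopped (offset 6).
rest=$( { skip_bytes 6; cat; } < "$tmp" )
echo "$rest"

rm -f "$tmp"
```

This works on a regular file because the file offset is shared between the two commands; seeking backwards would still need a helper outside the shell.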

x5a17ed 1 year ago | |

Maybe access to libc functions can be achieved through something like <https://github.com/taviso/ctypes.sh>, although that specific implementation seems to explicitly require bash and is not broadly POSIX shell compatible the way Pnut wants to be.

cozzyd 1 year ago |

Can finally port systemd to shell to quell the rebellion.

carapace 1 year ago | |

Damned if that isn't the funniest thing I've heard in a long time.

okaleniuk 1 year ago |

I love things like these because they shake our perception of normal loose. And who said our perception of normal doesn't deserve a good shake?

A C to shell compiler might seem impractical, but you know what is even more impractical? Having a separate language for a build system. And yet, here we are. Using Shell, Make or CMake to build a C program is only acceptable because it has always been so. It's a "perceived normality" in the C world.

There is no good reason, however, that CMake isn't a C library. With the build system being a library, we could write, read, and, most importantly, debug build scripts just like any other part of the buildable. We already have includeOS, why not includeMake?

wahern 1 year ago |

This is very cool, regardless of how serious it was intended to be taken. Before base-64 encoders/decoders became more common as preinstalled commands in the environments I found myself on, I wrote a base64 utility in mostly pure POSIX shell:

  https://25thandClement.com/~william/2023/base64.sh

If this project had existed I might have opted to compile my C-based base-64 encoder and decoder routines, suitably tweaked for pnut's limitations.

I say base64.sh is mostly pure not because it relies on shell extensions, but because the only non-builtins it depends on are od(1) or, alternatively, dd(1) to assist with binary I/O. And preferably od(1), as reading certain control characters, like NUL, into a shell variable is especially dubious. The encoder is designed to operate on a stream of decimal encoded bytes. (See decimals_fast for using od to encode stdin to decimals, and decimals_slow for using dd for the same.)

It looks like pnut uses `read -r` for reading input. In addition to NULs and related raw byte issues, I was worried about chunking issues (e.g. truncation or errors) on binary data, e.g. no newlines within LINE_BUF bytes. Have you tested binary I/O much? Relatedly, how many different shell implementations have you tested your core scheme with? In addition to bash, dash, and various incarnations of /bin/sh on the BSDs, I also tested base64.sh with Solaris' system shells (ksh88 and ksh93 derivatives), as well as AIX's (ksh88 derivative). AIX had some odd quirks with pipelines even with plain text I/O. (Unfortunately Polar Home is gone, now, so I have no easy way to play with AIX; maybe that's for the better.)

voidUpdate 1 year ago |

When I'm told that "I can trust" something that I feel like I had no reason to distrust, it makes me feel even more suspicious of it

Q-Q3 1 year ago | |

Hi there! I believe the mention of "trust" is related to the paper Reflections on Trusting Trust by Ken Thompson https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref... Though I do think the tagline used could definitely be improved from a marketing standpoint.

tzot 1 year ago | |

Perhaps you're old enough to remember the Sledge's[1] motto: “Trust me… I know what I'm doing.” HHBS Perusing the pnut site I did not understand either why this is software I can trust.

[1] https://www.imdb.com/title/tt0090525/

throwaway2037 1 year ago | |

Yeah, I cringed when I saw that too. It violates an important rule of selling: Never tell the customer "Trust me".

leni536 1 year ago | |

https://www.smbc-comics.com/comic/2008-09-15

akoboldfrying 1 year ago |

I was puzzled by the example C function containing pointers. Do I understand correctly that you implement pointers in shell by having a shell variable _0 for the first "byte" of "memory", a shell variable _1 for the second, etc.?

laurenth 1 year ago | |

Author here,

That's correct! Unlike Bash and other modern shells, the POSIX standard doesn't include arrays or any other data structures. The way we found around this limitation is to use arithmetic expansion and indexed shell variables (that are starting with `_` as you noted) to get random memory access.
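
As a rough illustration of that scheme, the following sketch emulates a flat memory with variables `_0`, `_1`, `_2`, and so on. The names `heap_ptr`, `alloc`, `poke`, and `peek` are invented for this example and are not taken from Pnut's runtime; treat it as a sketch of the idea, not the generated code.

```shell
#!/bin/sh
# Sketch: random-access "memory" built from indexed shell variables.
# heap_ptr, alloc, poke and peek are hypothetical names for illustration.

heap_ptr=0

# alloc SIZE: reserve SIZE cells and leave the base address in $addr.
alloc() {
  addr=$heap_ptr
  heap_ptr=$((heap_ptr + $1))
}

# poke ADDR VALUE: store VALUE in cell ADDR (like *addr = value in C).
# $1 is expanded first, so the arithmetic sees e.g. "_1 = 20".
poke() {
  : $((_$1 = $2))
}

# peek ADDR: load cell ADDR into $val (like val = *addr in C).
peek() {
  val=$((_$1))
}

alloc 3; arr=$addr          # int arr[3];
poke $((arr + 0)) 10        # arr[0] = 10;
poke $((arr + 1)) 20        # arr[1] = 20;
poke $((arr + 2)) 30        # arr[2] = 30;

peek $((arr + 1))
echo "arr[1] = $val"
```

A pointer is then just the integer index of a cell, which is why pointer arithmetic reduces to ordinary arithmetic expansion.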

osmsucks 1 year ago | | |

Since I experimented with something similar in the past to mimic multidimensional arrays: depending on the implementation this can absolutely _kill_ performance. IIRC, Dash does a linear lookup of variable names, so when you create tons of variables each lookup starts taking longer and longer.

thesnide 1 year ago | | |

I used almost the same idea, but with files in my https://github.com/steveschnepp/shlibs

rubicks 1 year ago |

I can't wait to see the shell equivalents for ptrace, setjmp, and dlopen.

actionfromafar 1 year ago | |

Do you really?

Maybe then I can also interest you in an exception handler for DOS batch scripts:

https://stackoverflow.com/a/55501133/193892

metadat 1 year ago |

Also see this related submission from May, 2024:

Amber: Programming language compiled to Bash https://news.ycombinator.com/item?id=40431835 (318 comments)

---

Pnut doesn't seem to differentiate between `int' and `int*' function parameters. That's weird, and doesn't come across as trustworthy at all! Shouldn't the use of pointers be disallowed instead?

  int test1(int a, int len) {
    return a;
  }
  
  int test2(int* a, int len) {
    return a;
  }

Both compile to the exact same thing:

  : $((len = a = 0))
  _test1() { let a $2; let len $3
    : $(($1 = a))
    endlet $1 len a
  }
  
  : $((len = a = 0))
  _test2() { let a $2; let len $3
    : $(($1 = a))
    endlet $1 len a
  }

The "runtime library" portion at the bottom of every script is nigh unreadable.

Even still, it's a cool concept.

teo_zero 1 year ago |

Just to be clear, the input must be written in a subset of C, because many constructs are not recognized, like unsigned types, static variables, [] arrays, etc.

Is there a plan to remove such limitations?

blueflow 1 year ago | |

These are restrictions of the target language and there isn't much pnut can do about this.

fulafel 1 year ago | | |

Surely unsigned (aka modulo) arithmetic and arrays are expressible in shell script?

edit: For reference, someone's take on building out better bash-like array functionality in posix shell: https://github.com/friendly-bits/POSIX-arrays (there's only very rudimentary array support built-in to posix sh, basically working with stuff in $@ using set -- arg1 arg2..)
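
A minimal sketch of both points, assuming the shell's arithmetic is wider than 32 bits (POSIX only guarantees signed long, so this is an assumption about the platform). The names `MASK32`, `u32_add`, `u32_sub`, and `array_demo` are made up for this example.

```shell
#!/bin/sh
# Sketch: 32-bit unsigned (modulo 2^32) arithmetic by masking, plus the
# positional parameters used as a rudimentary array.
# Assumes arithmetic expansion uses at least 64-bit signed integers.

MASK32=4294967295   # 2^32 - 1

# u32_add A B: leave (A + B) mod 2^32 in $u32.
u32_add() { u32=$(( ($1 + $2) & MASK32 )); }

# u32_sub A B: leave (A - B) mod 2^32 in $u32.
u32_sub() { u32=$(( ($1 - $2) & MASK32 )); }

u32_add 4294967295 1; sum=$u32    # wraps around to 0
u32_sub 0 1;          diff=$u32   # wraps around to 4294967295
echo "$sum $diff"

# Inside a function, "set --" replaces the function's own positional
# parameters, which then behave like a small list.
array_demo() {
  set -- 10 20 30     # three elements
  set -- "$@" 40      # append a fourth
  count=$#
  last=$4
}
array_demo
echo "$count $last"
```

Multiplication is left out on purpose: the intermediate product of two 32-bit values can overflow a signed 64-bit integer, so it needs more care than a single mask.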

lmm 1 year ago | | |

Shell is Turing complete, you could implement anything there with enough effort.

itvision 1 year ago |

Instantly make your C code 200 times slower without any effort!

chasil 1 year ago | |

It would actually be interesting to see how much faster dash is than everything else.

laurenth 1 year ago | | |

From our experience, ksh is generally faster, and dash sits between ksh and bash. One reason is that dash stores variables using a very small hash table with only 37 entries[0] meaning variable access quickly becomes linear as memory usage grows. But even with that, dash is still surprisingly fast -- when compiling `pnut.c` with `pnut.sh`, dash comes in second place:

  ksh93: 31s
  dash:  1m06s
  bash:  1m19s
  zsh:   >15m

[0]: https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/...

EDIT: ksh93, not ksh

throwaway2037 1 year ago | | |

Why is Dash frequently touted as so much faster than Bash? What is different?

actionfromafar 1 year ago | |

I think it probably takes some effort; not all C programs will compile with this thing.

andrewf 1 year ago |

Looking forward to the point where this can build autoconf. It's great that the generated ./configure script is portable but if I want to make substantial changes to the project I need to find a binary for my machine (and version differences can be quite substantial)

jcranmer 1 year ago | |

> Looking forward to the point where this can build autoconf.

Autoconf is a perl program that turns (heavily customized) m4 files into shell scripts. How does a C compiler help there?

andrewf 1 year ago | | |

> Autoconf is a perl program

Oof, did not realize.

akdev1l 1 year ago | |

This is going further into the hell that is shell-generated scripts that culminated in the xz-utils attack.

We would benefit from steering away from auto-generated scripts. Autoconf included.

kazinator 1 year ago |

This is not useful if it doesn't call external libraries.

Even POSIX standard ones. Chokes on:

  #include <glob.h>

  int main()  // must be (); (void) results in syntax error.
  {
    glob_t gb; // syntax error here
    glob("abc", 0, NULL, &gb);
    return 0;
  }

Nobody needs entirely self-contained C programs with no libraries to be turned into shell scripts; Unix people switch to C when there is a library function they need to call for which there is no command in /bin or /usr/bin.

If I reduce it to:

  #include <glob.h>

  int main()
  {
    glob("abc", 0, NULL, 0);
    return 0;
  }

it "compiles" into something with a main function like:

  _main() {
    defstr __str_0 "abc"
    _glob __ $__str_0 0 $_NULL 0
    : $(($1 = 0))
  }

but what good is that without a definition of _glob?

forrestthewoods 1 year ago |

Hrmmm. But why?

Quite frankly I think Bash scripting is awful and frequently wish shell scripts were written in a real and debuggable language. For anything non-trivial that is.

I feel like I’d rather write C and compile it with Cosmopolitan C to give me a cross-platform binary than this.

Neat project. Definitely clever. But it’s headed in the opposite direction from what I’d prefer...

vermon 1 year ago |

If the end goal is portability for C, would Cosmopolitan Libc be a better choice because it supports a lot more features and probably runs faster?

Y_Y 1 year ago | |

I can't run cosmolibc on Android, for example. Then again this converter is somewhat limited and didn't accept any of the IOCCC code I gave it.

hnlmorg 1 year ago | | |

> I can't run cosmolibc on Android, for example.

You can:

https://justine.lol/cosmo3/

> After nearly one year of development, I'm pleased to announce our version 3.0 release of the Cosmopolitan library. [...] we invented a new linker that lets you build fat binaries which can run on these platforms: AMD ... ARM64

https://github.com/jart/cosmopolitan/releases/tag/3.5.3

> This release fixes Android support. You can now run LLMs on your phone using Cosmopolitan software like llamafile. See 78d3b86 for further details. Thank you @aj47 (techfren.net) for bug reports and and testing efforts.

itsmemario77777 1 year ago | | |

Bad intention hackers are using these llm's to run extremely sophisticated hacking software. It's such a shame that AI is being taught such nasty things. Then bad apples will regret it once these things evolve into something much powerful than we can imagine with that taste for corruption. Anyhow. Me > gpt besides the fact I lost my identity forever. But I broke it .bhaha

iod 1 year ago |

I am sorry if this comes off as negative, but every example provided on the site, when compiled and then fed into ShellCheck¹, generates warnings about non-portable and ambiguous constructs in the script. What exactly are we supposed to trust?

¹ https://www.shellcheck.net

laurenth 1 year ago | |

It seems ShellCheck errs on the side of caution when checking arithmetic expansions and some of its recommendations are not relevant in the context they are given. For example, on `cat.sh`, one of the lines that are marked in red is:

  In examples/compiled/cat.sh line 7:
    : $((_$__ALLOC = $2)) # Track object size
      ^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
      ^-----------------^ SC2046 (warning): Quote this to prevent word splitting.
        ^--------------^ SC2205 (warning): (..) is a subshell. Did you mean [ .. ], a test expression?
                   ^-- SC2283 (error): Remove spaces around = to assign (or use [ ] to compare, or quote '=' if literal).
                     ^-- SC2086 (info): Double quote to prevent globbing and word splitting.

It seems to be parsing the arithmetic expansion as a command substitution, which then causes the analyzer to produce errors that aren't relevant. ShellCheck's own documentation[0] mentions this in the exceptions section, and the code is generated such that quoting and word splitting are not an issue (because variables never contain whitespace or special characters).

It also warns about `let` being undefined in POSIX shell, but `let` is defined in the shell script so it's a false positive that's caused by the use of the `let` keyword specifically.

If you think there are other issues or ways to improve Pnut's compatibility with ShellCheck, please let us know!

[0]: https://www.shellcheck.net/wiki/SC1102

osmsucks 1 year ago |

I'm writing something similar, but it's based on its own scripting language. The idea of transpiling C sounds appealing but impractical: how do they plan to compile, say, things using mmap, setjmp, pthreads, ...? It would be better to clearly promise only a restricted subset of C.

kxndnenfn 1 year ago |

This is quite interesting! Without having dug deeper into it, seeing the human readable output I assume quite different semantics from C?

The C to shell transpiler I'm aware of will output unreadable code (elvm using 8cc with sh backend)

dsp_person 1 year ago |

I use linux-vt-setcolors in my startup, which would be a bit more convenient if it was a shell script instead of C, but it uses ioctl.

Trying to compile with this tool fails with "comp_glo_decl: unexpected declaration"

Retr0id 1 year ago |

Can it do wrapping arithmetic?

The `sum` example doesn't seem to do wrapping, but signed int overflow is technically UB so I guess they're fine not to.

Switching it to `unsigned int` gives me:

code.c:1:1 syntax error: unsupported type

yencabulator 1 year ago |

It seems to have practically no error checking. Try compiling

    int why(int unused) {
      wat_why_does_this_compile;
      no_error_checking();
    }

atilaneves 1 year ago |

I'm still figuring out why anyone would want to write a shell script in C. That sounds like torture to me.

JoshTriplett 1 year ago |

Several times I've found myself wishing for the reverse: a shell-to-binary compiler or JIT.

layer8 1 year ago |

Can you trust that it faithfully reproduces undefined behavior? ;)

gojomybeloved 1 year ago |

Love this!

o11c 1 year ago |

It's a bad sign when I immediately look at the screenshot and see quoting bugs.

laurenth 1 year ago | |

Author here,

Because all shell variables in code generated by pnut are numbers, variables never contain whitespace or special characters and don't need to be quoted. We considered quoting all variable expansions as this is generally seen as best practice in shell programming, but thought it hurt readability and decided not to.

If you think there are other issues, please let me know!

taviso 1 year ago | | |

I think they're talking about the cp example, doesn't seem like it would handle filenames with spaces!

Super neat project, btw!
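
For anyone following the quoting discussion, here is a small sketch of the difference, assuming the default IFS. The helper `count_args` and the sample filename are invented for illustration and are not taken from the cp example.

```shell
#!/bin/sh
# Sketch: when unquoted expansion is harmless and when it is not.
# count_args is a hypothetical helper; assumes the default IFS.

# count_args ARGS...: record how many arguments were received in $nargs.
count_args() { nargs=$#; }

n=42
count_args $n;      numeric=$nargs    # a plain number stays one word

name="my file.txt"
count_args $name;   unquoted=$nargs   # split on the space into two words
count_args "$name"; quoted=$nargs     # quoting keeps it as one word

echo "$numeric $unquoted $quoted"
```

Numeric values survive unquoted expansion, which matches the author's reasoning, while a filename coming from the command line needs quotes to stay a single argument.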