Because all shell variables in code generated by pnut are numbers, variables never contain whitespace or special characters and don't need to be quoted. We considered quoting all variable expansions as this is generally seen as best practice in shell programming, but thought it hurt readability and decided not to.
If you think there are other issues, please let me know!
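To make the quoting argument concrete, here is a minimal sketch. It assumes the default `IFS` (an `IFS` containing digits would split numbers too), and `count_words` is a hypothetical helper, not part of pnut:

```sh
#!/bin/sh
# count_words is a hypothetical helper: it prints how many arguments it received.
count_words() { echo "$#"; }

n=12345          # a numeric value, like the variables in pnut-generated code
s="two words"    # a value containing whitespace

count_words $n   # unquoted number expands to a single word
count_words $s   # unquoted string is split into two words
count_words "$s" # quoting keeps it as one word
```

With only digits in a variable there is nothing for word splitting or globbing to act on, which is why leaving out the quotes is safe in this narrow case.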
Super neat project, btw!
Because Pnut doesn't support certain C features used in TCC, Pnut features a native code backend that supports a larger subset of C99. We call this compiler `pnut-exe`, and it can be compiled using `pnut.sh`. This makes it possible to compile `pnut-exe.c` using `pnut.sh`, and then compile TCC, all from a POSIX shell."
Is there anywhere we can see a step-by-step demo of this process?
Curious if the authors tried NetBSD or OpenBSD, or using another small C compiler, e.g., pcc.
Historically, tcc was problematic for NetBSD and its forks. Not sure about today, but tcc is still in NetBSD pkgsrc WIP which suggests problems remain.
- a shell is required, which has to be built from sources, using a compiler which was also built from sources using a compiled binary. That's the real bootstrap.
- even if you could pick some shell, and compiled it with pnut.exe, the compiled code requires interpretation by an executable shell.
- there is no such thing as a "POSIX compliant shell"; that's an abstract category. All this amounts to is a promise that pnut.sh will not generate code that uses non-POSIX features.
open(..., O_RDWR | O_EXCL) -> runtime error, "echo "Unknow file mode" ; exit 1"
lseek(fd, 1, SEEK_HOLE); -> invalid code (uses undefined _lseek)
socket(AF_UNIX, SOCK_STREAM, 0); -> same (uses undefined _socket)
looking closer at "cp" and "cat" examples, write() call does not handle errors at all. Forget about partial writes, it does not even return -1 on failures.
"Compiler you can Trust", indeed... maybe you can trust it to get all the details wrong?
Otherwise the builtins seem to be here https://github.com/udem-dlteam/pnut/blob/main/runtime.sh
FYI all your functions are not "C functions", but rather POSIX functions. I did not expect it to be complete, but it's still impressive for what it is.
I don't remember there being a way to keep a server listening on a /dev/tcp/$ip/$port port, for sockets from shell scripts with shellcheck at least
I think the pitch here is that it can compile TCC which can then compile GCC which makes it much more difficult for a backdoor to survive potentially, especially if the shell code is easier to read and verify than the corresponding assembly.
Within that context, an incomplete libc is irrelevant.
A C to shell compiler might seem impractical, but you know what is even more impractical? Having a separate language for a build system. And yet, here we are. Using Shell, Make or CMake to build a C program is only acceptable because it has always been so. It's a "perceived normality" in the C world.
There is no good reason, however, CMake isn't a C library. With build system being a library, we could write, read, and, most importantly, debug build scripts just like any other part of the buildable. We already have includeOS, why not includeMake?
https://25thandClement.com/~william/2023/base64.sh
If this project had existed I might have opted to compile my C-based base-64 encoder and decoder routines, suitably tweaked for pnut's limitations.
I say base64.sh is mostly pure not because it relies on shell extensions, but because the only non-builtins it depends on are od(1) or, alternatively, dd(1) to assist with binary I/O. And preferably od(1), as reading certain control characters, like NUL, into a shell variable is especially dubious. The encoder is designed to operate on a stream of decimal encoded bytes. (See decimals_fast for using od to encode stdin to decimals, and decimals_slow for using dd for the same.)
It looks like pnut uses `read -r` for reading input. In addition to NULs and related raw byte issues, I was worried about chunking issues (e.g. truncation or errors) on binary data, e.g. no newlines within LINE_BUF bytes. Have you tested binary I/O much? Relatedly, how many different shell implementations have you tested your core scheme with? In addition to bash, dash, and various incarnations of /bin/sh on the BSDs, I also tested base64.sh with Solaris' system shells (ksh88 and ksh93 derivatives), as well as AIX's (ksh88 derivative). AIX had some odd quirks with pipelines even with plain text I/O. (Unfortunately Polar Home is gone, now, so I have no easy way to play with AIX; maybe that's for the better.)
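For readers who haven't seen the trick, turning raw bytes into decimals with od can be sketched roughly like this. It is not the actual decimals_fast from base64.sh, just an illustration of the idea; `bytes_to_decimals` is a made-up name:

```sh
#!/bin/sh
# bytes_to_decimals: read raw bytes on stdin, print one decimal value per line.
# Hypothetical helper; NUL bytes survive because the shell only ever sees digits.
bytes_to_decimals() {
  # -An: no address column, -v: don't collapse repeated lines, -tu1: unsigned bytes
  od -An -v -tu1 | while read -r line; do
    for byte in $line; do     # unquoted on purpose: split od's padded columns
      echo "$byte"
    done
  done
}

printf 'A\000B' | bytes_to_decimals
```

The point is that the shell never has to hold a NUL or a partial line in a variable; od does the binary read and the script only handles numbers.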
That's correct! Unlike Bash and other modern shells, the POSIX standard doesn't include arrays or any other data structures. The way we found around this limitation is to use arithmetic expansion and indexed shell variables (that are starting with `_` as you noted) to get random memory access.
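As a rough sketch of that idea (the `store` and `load` helpers are hypothetical names, not pnut's runtime), each "memory cell" is just a shell variable whose name is `_` followed by the address:

```sh
#!/bin/sh
# store ADDR VALUE: write VALUE to the "memory cell" named _ADDR.
store() { : $((_$1 = $2)); }

# load VAR ADDR: read cell _ADDR into the variable named VAR (no subshell needed).
load() { : $(($1 = _$2)); }

# Fill cells 100..104 with squares, then read one back by computed address.
i=0
while [ "$i" -lt 5 ]; do
  store $((100 + i)) $((i * i))
  i=$((i + 1))
done

base=100
load x $((base + 3))
echo "$x"   # cell 103 holds 3 * 3
```

The parameter expansion runs before the arithmetic is evaluated, so `_$1` becomes a plain variable name like `_103` by the time the shell parses the expression.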
Maybe then I can also interest you in an exception handler for DOS batch scripts:
Amber: Programming language compiled to Bash https://news.ycombinator.com/item?id=40431835 (318 comments)
---
Pnut doesn't seem to differentiate between `int' and `int*' function parameters. That's weird, and doesn't come across as trustworthy at all! Shouldn't the use of pointers be disallowed instead?
int test1(int a, int len) {
return a;
}
int test2(int* a, int len) {
return a;
}
Both compile to the exact same thing:
: $((len = a = 0))
_test1() { let a $2; let len $3
: $(($1 = a))
endlet $1 len a
}
: $((len = a = 0))
_test2() { let a $2; let len $3
: $(($1 = a))
endlet $1 len a
}
The "runtime library" portion at the bottom of every script is nigh unreadable.
Even still, it's a cool concept.
Is there a plan to remove such limitations?
edit: For reference, someone's take on building out better bash-like array functionality in posix shell: https://github.com/friendly-bits/POSIX-arrays (there's only very rudimentary array support built-in to posix sh, basically working with stuff in $@ using set -- arg1 arg2..)
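A quick sketch of what that rudimentary `$@` support looks like in practice. `nth` is a hypothetical helper written for this example and is not taken from the linked repo:

```sh
#!/bin/sh
# nth N ARGS...: print the N-th of ARGS (1-based). Hypothetical helper.
nth() {
  n=$1
  shift "$n"        # drops N words: the index itself plus the first N-1 elements
  printf '%s\n' "$1"
}

set -- apple banana cherry     # "assign" the array
set -- "$@" "date palm"        # append; quoting preserves the embedded space

echo "$#"                      # element count
nth 2 "$@"                     # second element
nth 4 "$@"                     # fourth element, space intact
```

There is only one such list per function scope, which is the main reason people reach for the numbered-variable trick instead.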
ksh93: 31s
dash: 1m06s
bash: 1m19s
zsh: >15m
[0]: https://git.kernel.org/pub/scm/utils/dash/dash.git/tree/src/...
EDIT: ksh93, not ksh
Autoconf is a perl program that turns (heavily customized) m4 files into shell scripts. How does a C compiler help there?
Oof, did not realize.
We would benefit from steering away from auto-generated scripts. Autoconf included.
Even POSIX standard ones. Chokes on:
#include <glob.h>
int main() // must be (); (void) results in syntax error.
{
glob_t gb; // syntax error here
glob("abc", 0, NULL, &gb);
return 0;
}
Nobody needs entirely self-contained C programs with no libraries to be turned into shell scripts; Unix people switch to C when there is a library function they need to call for which there is no command in /bin or /usr/bin.
If I reduce it to:
#include <glob.h>
int main()
{
glob("abc", 0, NULL, 0);
return 0;
}
it "compiles" into something with a main function like:
_main() {
defstr __str_0 "abc"
_glob __ $__str_0 0 $_NULL 0
: $(($1 = 0))
}
but what good is that without a definition of _glob?
Quite frankly I think Bash scripting is awful and frequently wish shell scripts were written in a real and debuggable language. For anything non-trivial that is.
I feel like I’d rather write C and compile it with Cosmopolitan C to give me a cross-platform binary than this.
Neat project. Definitely clever. But it’s headed in the opposite direction from what I’d prefer...
You can:
> After nearly one year of development, I'm pleased to announce our version 3.0 release of the Cosmopolitan library. [...] we invented a new linker that lets you build fat binaries which can run on these platforms: AMD ... ARM64
https://github.com/jart/cosmopolitan/releases/tag/3.5.3
> This release fixes Android support. You can now run LLMs on your phone using Cosmopolitan software like llamafile. See 78d3b86 for further details. Thank you @aj47 (techfren.net) for bug reports and and testing efforts.
In examples/compiled/cat.sh line 7:
: $((_$__ALLOC = $2)) # Track object size
^-- SC1102 (error): Shells disambiguate $(( differently or not at all. For $(command substitution), add space after $( . For $((arithmetics)), fix parsing errors.
^-----------------^ SC2046 (warning): Quote this to prevent word splitting.
^--------------^ SC2205 (warning): (..) is a subshell. Did you mean [ .. ], a test expression?
^-- SC2283 (error): Remove spaces around = to assign (or use [ ] to compare, or quote '=' if literal).
^-- SC2086 (info): Double quote to prevent globbing and word splitting.
It seems to be parsing the arithmetic expansion as a command substitution, which then causes the analyzer to produce errors that aren't relevant. ShellCheck's own documentation[0] mentions this in the exceptions section, and the code is generated such that quoting and word splitting are not an issue (because variables never contain whitespace or special characters).
It also warns about `let` being undefined in POSIX shell, but `let` is defined in the shell script so it's a false positive that's caused by the use of the `let` keyword specifically.
If you think there are other issues or ways to improve Pnut's compatibility with Shellcheck, please let us know!
The C to shell transpiler I'm aware of will output unreadable code (elvm using 8cc with sh backend)
Trying to compile with this tool fails with "comp_glo_decl: unexpected declaration"
The `sum` example doesn't seem to do wrapping, but signed int overflow is technically UB so I guess they're fine not to.
Switching it to `unsigned int` gives me:
code.c:1:1 syntax error: unsupported type
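For what it's worth, unsigned 32-bit wrapping is easy to emulate by hand in shell arithmetic. This is only a sketch, not something pnut is known to do, and it assumes the shell's integers are wider than 32 bits (POSIX only promises signed long). `add_u32` is a hypothetical name:

```sh
#!/bin/sh
# add_u32 A B: print (A + B) modulo 2^32 by masking off the high bits.
# Hypothetical helper; assumes shell arithmetic is at least 64 bits wide.
add_u32() { echo $(( ($1 + $2) & 0xFFFFFFFF )); }

add_u32 4294967295 1    # UINT32_MAX + 1 wraps to 0
add_u32 4294967290 10   # wraps to 4
```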
int why(int unused) {
wat_why_does_this_compile;
no_error_checking();
}
Nah, using shell, make or cmake is acceptable because C is obviously a terrible language for doing things. (Those languages are also all terrible, but not quite as terrible as C).
> There is no good reason, however, CMake isn't a C library.
Isn't it the other way round? There's no good reason people write programs in C rather than CMake.
> With build system being a library, we could write, read, and, most importantly, debug build scripts just like any other part of the buildable.
Which is to say, with extreme difficulty?
Like, I agree with where you're coming from, it is absolutely a damning indictment of C that people don't want to express their builds in it. But writing a build in C really would be terrible.
What Pnut shows us is that the language itself is a very thin construct. C could be as low-level as you want, but it can also... compile to shell. Pnut shows that C is only a set of grammatical rules, and the source code in C doesn't necessarily reflect the binary program, it's only a script for the C compiler. A compiler then decides how to interpret the source and what to do with it.
Now back to builds. The difference between:
set(SOME_VARIABLE "SOME VALUE")
and
set(SOME_VARIABLE, "SOME VALUE");
is purely grammatical. The underlying functionality is the same. When I'm saying, CMake could be a C library, I'm not saying we should ditch CMake and everything it brings to the table and start writing build scripts in pure C. I'm saying we can use both C language and CMake functionality with very little, skin deep, adjustments.
The only thing that keeps us down is the perception of C as a low-level language for low-level applications. C is for drivers and shell is for moving files around. And that's when Pnut comes up and tells us: "hold on, are they?"
I disagree. For a very simple example it really makes life easier to not have to care about quoting filenames in build systems and just list a.c b.cpp etc., while you really want strings to be quoted in normal programming languages. Build systems that tried to be based on syntax of existing PLs (for instance Meson, QBS) are a real PITA for me when compared to CMake due to a lot of such affordances.
Why is it you think that?
https://github.com/udem-dlteam/pnut/blob/main/examples/compiled/base64.sh
It doesn't support NULs as you pointed out, but it's interesting to see similarities between your implementation and the one generated by Pnut.
Because we use `read -r`, we haven't tested reading binary files. Fortunately, the shell's `printf` function can emit all 256 characters so Pnut can at least output binary files. This makes it possible for Pnut to have an x86 backend for the use of reproducible builds.
Regarding the use of `read`, one constraint we set ourselves when writing Pnut is to not use any external utilities, including those that are specified by the POSIX standard (other than `read` and `printf`). This maximizes portability of the code generated by Pnut and is enough for the reproducible build use case.
We're still looking for ways to integrate existing shell code with C. One way this can be done is through the use of the `#include_shell` directive which includes existing shell code in the generated shell script. This makes it possible to call the necessary utilities to read raw bytes without having Pnut itself depend on less portable utilities.
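To illustrate the point about `printf` being able to emit any byte, here is a minimal sketch. It is not lifted from Pnut's runtime, and `emit_byte` is a hypothetical name. It builds a three-digit octal escape with arithmetic expansion, so no external utility or subshell is involved:

```sh
#!/bin/sh
# emit_byte N: write the single byte with value N (0-255) to stdout.
# The three arithmetic expansions produce the octal digits of N.
emit_byte() {
  printf "\\$(($1 / 64))$(($1 / 8 % 8))$(($1 % 8))"
}

emit_byte 72; emit_byte 105; emit_byte 10   # writes "Hi" and a newline
```

Because the byte value only ever exists as a number inside the script, even NUL (`emit_byte 0`) can be written even though it can't be stored in a shell variable.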
I'd choose a different example to showcase pnut.
The programmer, who was very proud of his mastery of C, said: “How can this be? C is the language in which the very kernel of Unix is implemented!”
Master Foo replied: “That is so. Nevertheless, there is more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
The programmer grew distressed. “But through the C language we experience the enlightenment of the Patriarch Ritchie! We become as one with the operating system and the machine, reaping matchless performance!”
Master Foo replied: “All that you say is true. But there is still more Unix-nature in one line of shell script than there is in ten thousand lines of C.”
The programmer scoffed at Master Foo and rose to depart. But Master Foo nodded to his student Nubi, who wrote a line of shell script on a nearby whiteboard, and said: “Master programmer, consider this pipeline. Implemented in pure C, would it not span ten thousand lines?”
The programmer muttered through his beard, contemplating what Nubi had written. Finally he agreed that it was so.
“And how many hours would you require to implement and debug that C program?” asked Nubi.
“Many,” admitted the visiting programmer. “But only a fool would spend the time to do that when so many more worthy tasks await him.”
“And who better understands the Unix-nature?” Master Foo asked. “Is it he who writes the ten thousand lines, or he who, perceiving the emptiness of the task, gains merit by not coding?”
Upon hearing this, the programmer was enlightened.
Master Foo is shorthand for Fool.
GNU Mes: https://www.gnu.org/software/mes/
Stage0: https://bootstrapping.miraheze.org/wiki/Stage0
Ribbit (same authors): https://github.com/udem-dlteam/ribbit
stage0-posix: https://github.com/oriansj/stage0-posix
Bootstrappable Builds: https://bootstrappable.org/
See also this LWN article about bootstrappable and reproducible builds: https://lwn.net/Articles/841797/ It contains a plethora of interesting links.
I.e., you can take your compiled.sh and run in an obscure processor with an obscure OS, as long as it's POSIX, it should work...
I suppose the trust moves to the shell executable then, but at least you could run the bootstrapping with multiple shells and expect identical output.
because Bash goes brrrr
$ ll /bin/bash /bin/dash /bin/ksh93 /bin/ls /bin/mksh
-rwxr-xr-x. 1 root root 1389064 May 1 00:59 /bin/bash
-rwxr-xr-x. 1 root root 128608 May 9 2023 /bin/dash
-rwxr-xr-x. 1 root root 1414912 Apr 9 07:26 /bin/ksh93
-rwxr-xr-x. 1 root root 140920 Apr 8 08:20 /bin/ls
-rwxr-xr-x. 1 root root 325208 Jan 9 2022 /bin/mksh
$ rpm -qi dash | tail -4
Description :
DASH is a POSIX-compliant implementation of /bin/sh that aims to be as small as
possible. It does this without sacrificing speed where possible. In fact, it is
significantly faster than bash (the GNU Bourne-Again SHell) for most tasks.
As you point out, it moves the trust from the binary to the shell executable, but the shell is already a key piece of any build process and requires a minimum level of trust. The technique of bootstrapping on multiple shells and comparing the outputs is known as Double Diverse Compiling[0] and we think POSIX shell is particularly suited for this use case since it has so many implementations from different and likely independent sources.
The age and stability of the POSIX shell standard also play in our favor. Old shell binaries should be able to bootstrap Pnut, and those binaries may be less likely to be compromised as the trusting trust attack was less known at that time, akin to low-background steel[1] that was made before nuclear bombs contaminated the atmosphere and steel produced after that time.
[0]: https://dwheeler.com/trusting-trust/
[1]: https://en.wikipedia.org/wiki/Low-background_steel
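The "run it under several shells and compare" step could look something like the sketch below. The helper name `run_in_shells` and the list of shells are assumptions made for illustration; `cksum` is the POSIX checksum utility:

```sh
#!/bin/sh
# run_in_shells SCRIPT SHELL...: run SCRIPT under each installed SHELL, print a
# checksum of its stdout, and return non-zero if any checksum differs.
# Hypothetical helper, not part of Pnut.
run_in_shells() {
  script=$1; shift
  first=
  status=0
  for sh_name in "$@"; do
    command -v "$sh_name" >/dev/null 2>&1 || continue   # skip shells not installed
    sum=$("$sh_name" "$script" | cksum)
    echo "$sh_name: $sum"
    if [ -z "$first" ]; then
      first=$sum
    elif [ "$sum" != "$first" ]; then
      status=1
    fi
  done
  return "$status"
}

# Example (file name is hypothetical):
#   run_in_shells compiled.sh dash bash ksh93 mksh && echo "outputs match"
```

A mismatch doesn't say which shell is wrong, only that at least one of them disagrees, which is the signal you want before trusting the output.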
When people discuss Turing completeness and related concepts one of the unstated caveats is that neither the concept itself, nor most solutions or environments, meaningfully address the problem of I/O with the external environment. pnut is kind of exceptional in this regard, even with the limitations.
I think seeking a specific number of bytes and then writing data there will be a problem, though.
For seeking n bytes, neither read nor sed will work; they work with lines.
sed is the only one of those that can write, and POSIX doesn’t appear to have the -i option for in-place editing (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/s...)
So, I think head for seeking followed by sed (or ed or vi, but sed is the simpler tool, I think) for replacing the first n characters, redirecting to a temp file and then doing a mv is your only option.
Advantage will be that writes will be atomic; disadvantage that it will be slow
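Here is a rough sketch of the temp-file-then-mv approach. Note it swaps in dd(1) for the head and sed steps, since dd is in POSIX and counts bytes rather than lines. `patch_bytes` is a hypothetical helper and it assumes the replacement data is plain ASCII text:

```sh
#!/bin/sh
# patch_bytes FILE OFFSET DATA: overwrite the bytes starting at OFFSET with DATA
# by assembling a temp file and renaming it over FILE. Hypothetical helper.
patch_bytes() {
  file=$1 off=$2 data=$3
  tmp=$file.tmp.$$
  {
    dd if="$file" bs=1 count="$off" 2>/dev/null              # bytes before OFFSET
    printf '%s' "$data"                                       # the replacement
    dd if="$file" bs=1 skip=$((off + ${#data})) 2>/dev/null   # bytes after it
  } > "$tmp" && mv "$tmp" "$file"
}

# Example (file name is hypothetical):
#   patch_bytes image.bin 16 "PNUT"
```

With `bs=1` this is about as slow as predicted, but the final rename means readers see either the old file or the new one, never a half-written mix.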
But in a build script you don't want to be doing either. You want SOME_VARIABLE = SOME VALUE, or at most "SOME VALUE". Grammar and syntax matter.
> Pnut shows that C is only a set of grammatical rules, and the source code in C doesn't necessarily reflect the binary program, it's only a script for the C compiler.
The only thing worse than writing C is writing something that looks like C but doesn't follow the rules of C, where you have to use some other logic to understand what it actually does. Build tools that do that kind of thing have been tried and they have not turned out well.
> When I'm saying, CMake could be a C library, I'm not saying we should ditch CMake and everything it brings to the table and start writing build scripts in pure C. I'm saying we can use both C language and CMake functionality with very little, skin deep, adjustments.
"Skin deep" perhaps, but making your language uglier and weirder is still unpleasant (and CMake is unpleasant and weird enough as it is).
> The only thing that keeps us down is the perception of C as a low-level language for low-level applications.
No, the other thing is the perception of C as a crude, inexpressive language full of weird edge cases that requires dozens of lines to write even simple things, and that in turn comes from the reality of C as a crude, inexpressive language full of weird edge cases that requires dozens of lines to write even simple things.
Syntax-wise C is fine. I personally have a soft spot for Rebol's "syntax free" approach, but the world prefers C. Five out of ten TIOBE's most popular languages have C-like syntax.
And you're right that the perception of C comes from the usage of C. Of course it does. But this creates the vicious cycle, the cycle things like Pnut are trying to break.
I don't know which five you're classifying that way, but even for languages that started off C-like the trend is in the direction of less C-like. Even for C++ the big popular changes recently have been things like auto; similarly for Java, and C# always had a more lightweight syntax for expressing values. And certainly JavaScript has an object literal syntax good enough that people use it separately. Python is admittedly weirdly bad for writing values in; I wonder if that's why Scons has more or less failed.
One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?
Or... work with me: Make does that, well enough.
For the sake of mental experiment, let's pretend Make is a separate executable, separate process, but with some sort of API. You can manage dynamic dependency graphs by calling its routines from C.
Now let's say Make is a dynamic library with all functionality exposed. You can invoke and manage subprocesses using its functions, but now your C program and the Make share a process together.
Now let's say Make is a C library. GNU Make is written in C so this is not impossible to imagine. Your C program shares the process, and the names on compilation+linking phase with Make, which is annoying. But you can still work with metadata using Make's facilities. Also now you can use all the tools: debuggers, profilers, static analyzers, dynamic analyzers - you use for the rest of your codebase.
We perceive C as a low-level language but, and Pnut shows it well, C is only a set of rules. We can write shell scripts with C rules. Why can't we then write build scripts?
https://web.archive.org/web/20180722051250/http://discuss.jo...
http://www.mirbsd.org/mksh.htm
OpenBSD switched their default shell to their own pdksh-derivative known as oksh.
There was an effort to (re)start ksh93 development, but AT&T halted this effort. The bugfixes from the failed effort have moved back into Korn's last release.
My comment was based on cloning master yesterday and trying to build redbean but hitting what looks like https://github.com/jart/cosmopolitan/issues/940
Indeed it looks like the commit you mentioned should have fixed the issue with the pointer having too many bits for the weird kernel used on android and some raspis. Fingers crossed that release works.
edit:
Testing that release on Termux 118, stock Android 14 on a moto g73 5G (XT2237-2):
~/cosmopolitan $ uname -a
Linux localhost 5.10.205-android12-9-00027-g4d6c07fc6342-ab11525972 #1 SMP PREEMPT Mon Mar 4 18:49:33 UTC 2024 aarch64 Android
~/cosmopolitan $ /data/data/com.termux/files/home/cosmopolitan/build/bootstrap/cocmd
ape error: /data/data/com.termux/files/home/cosmopolitan/build/bootstrap/cocmd: prog mmap failed w/ errno 12
Interesting. When you say "even when hundreds of KBs are allocated", do you mean this is allocating variables with large values, or tons of small variables? My case was the latter, and with that I saw a noticeable slowdown on Dash.
Simplest repro case:
$ cat many_vars_bench.sh
#!/bin/sh
_side=500
i=0
while [ "${i}" -lt "${_side}" ]; do
j=0
while [ "${j}" -lt "${_side}" ]; do
eval "matrix_${i}_${j}=$((i+j))" || exit 1
: $(( j+=1 ))
done
i=$((i+1))
done
$ time bash many_vars_bench.sh
5.60user 0.12system 0:05.78elapsed 99%CPU (0avgtext+0avgdata 57636maxresident)k
0inputs+0outputs (0major+13020minor)pagefaults 0swaps
$ time dash many_vars_bench.sh
40.75user 0.14system 0:41.22elapsed 99%CPU (0avgtext+0avgdata 19972maxresident)k
0inputs+0outputs (0major+4951minor)pagefaults 0swaps
Dash was ~8 times slower. Increase the side of the square "matrix" for a proportionally bigger slowdown (this one uses 250003 variables).
> One thing we did notice is that subshells can be a bottleneck when the environment is large, and so we avoided subshells as much as possible in the runtime library. Did you observe the same in your testing?
Yes, launching a new process is generally expensive and so is spawning a subshell. If the shell is something like Bash (with a lot of startup/environment setup cost) then you'll feel this more than something like Dash, where the whole point was to make the shell small and snappy for init scripts: https://wiki.ubuntu.com/DashAsBinSh#Why_was_this_change_made...
In my limited testing, Bash generally came out on top for single-process performance, while Dash came out on top for scripts with more use of subshells.
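For anyone wondering what avoiding subshells looks like in practice, here is a small sketch contrasting the two ways of getting a result out of a function. The function names are made up; the second style mirrors the `: $(($1 = a))` return convention visible in the generated code quoted elsewhere in this thread:

```sh
#!/bin/sh
# Subshell style: the result comes back through command substitution,
# which forks a copy of the shell (and its whole environment) per call.
square_sub() { echo $(($1 * $1)); }

# Variable style: the caller passes the NAME of the result variable and the
# function assigns to it with arithmetic expansion. No fork involved.
square_var() { : $(($1 = $2 * $2)); }

r1=$(square_sub 12)
square_var r2 12
echo "$r1 $r2"
```

Both give the same answer; the difference only shows up in a tight loop or when the environment holds a very large number of variables.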
Shell: <= 5 lines
Python: <= 500 lines
Rust: > 500 lines
Although to be honest I'd be perfectly happy if Shell was restricted to single line commands only.
I've wasted a lot of time and energy deciphering undebuggable shell scripts that were written to "save programmer time". Not a fan.
I wasn't the strictest reviewer (most feared, sure, but not strictest) at least partly because my personal line for "oh that bit of shell is obvious" is way too high.
Sometimes you just want to execute 50 lines with little logic.
Sometimes you just have some simple logic that needs to be repeated.
Sometimes that logic is complicated, sometimes it is not.
But let's not blind ourselves with survivor bias. Not everything new and very bright will stand the test of time.
So let's take everything with a grain of salt, and wait until time has chosen its champions. Which might not be the best technology, as we learned.
If shell scripting didn’t exist I would be totally fine with that. There are far more scripts that I wish were written in a real language than the other way around.