GNU Parallel, where have you been all my life?

GNU Parallel, where have you been all my life? (alexplescan.com)

448 points by alexpls 2 years ago | 262 comments

BoppreH 2 years ago |

It's a nice tool, but it also shows the shortcomings of shell commands.

In a proper programming language, we'd have something like

    parallel [1..5], i => { sleep random()*10+5; possibly_flaky i }
    // [{"Seq": 4, "Host": ":", "Starttime": 1692491267...

And `parallel` would only have to worry about parallelization.

Instead, the shell environment forces programs to invent their own parameter separator (:::), a templating format ({1}), and a way to output a list of structures (CSV-like). You can see the same issues in `find`, where the exec separator is `\;`, the template is `{}`, and the output is delimited by \n or \0. And `xargs` does it in yet another different way.

It's very hard to acquire and retain mastery over a toolbox where every tool reinvents the basics. If you ever found yourself searching "find exec syntax" multiple times in a week, it's not your fault.

As for alternatives, I'm a fan of YSH[1] (JavaScript-like), Nushell[2] (reinvented from first principles for simplicity and safety) and Fish[3] (bash-like but without the footguns). Nushell is probably my favorite of the bunch; here's a parallel example:

    ls | where type == dir | par-each { |it|
        { name: $it.name, len: (ls $it.name | length) }
    }

[1] https://www.oilshell.org/release/latest/doc/ysh-tour.html

[2] https://github.com/nushell/nushell

[3] https://fishshell.com/

JNRowe 2 years ago | |

[I'm not recommending this, but maybe… No, no. I'm not sure…]

It isn't even just the newer shells that have solved this, zsh also has a solution out of the box¹. The extensive globbing support in zsh can largely replace `find`, and things like zargs allow you to reuse your common knowledge throughout the shell.

For example, performing your first example with zargs would use regular option separators(`--`), regular expansion(`{1..5}`), and standard shell constructs for the commands to execute.

I'll contrive up an example based around your file counter, but slightly different to show some other functionality.

    f() { fs=($1/*(.)); jo $1=$#fs }
    zargs -P 32 -n1 -- **/*(/) -- f

That should recursively list directories, counting only the files within each, and output² jsonl that can be further mangled within the shell³. You could just as easily populate an associative array for further work, or $whatever. Unlike bash, zsh has reasonable behaviour around quoting and whitespace too.

Edit to add: I'm not suggesting zargs is a replacement for parallel, but if you're only using a small subset of its functionality then it may be able to replace that.

¹ https://zsh.sourceforge.io/Doc/Release/User-Contributions.ht...

² https://github.com/jpmens/jo

³ https://github.com/stedolan/jq

coliveira 2 years ago | |

What you mention is the main reason why shell script is not a decent language to write long programs. It is full of inconsistencies, and since it depends on other commands, you have to learn the quirks of each command you use. Moreover, good luck if you need to debug this. Shell should only be used for small scripts that are easy to debug.

runeks 2 years ago | | |

If doing even simple things requires looking up documentation, why does it matter whether the shell script is long or short?

Spending extra time doing simple things — because you need to Google e.g. "how to pass multiple space-separated arguments from a string to a command" — is also a waste of time.

lysium 2 years ago | | |

Do you recommend any good alternative when your shell program gets too large?

Honest question, as I’m struggling to leave the shell environment once the program gets too large. I could use Perl, but $? and the likes get quickly out of hand. Python’s support for pipes was difficult last time I used it, but that may have changed. What would you recommend?

chasil 2 years ago | |

GNU Parallel is also based on perl, so the footprint is quite large.

GNU xargs implements limited parallelization, and is compiled C. This functionality is present within busybox, including the Windows version.

https://www.linuxjournal.com/content/parallel-shells-xargs-u...

GNU Parallel will have much greater functionality, but it will not reach as far as xargs.

reddit_clone 2 years ago | | |

> GNU Parallel is also based on perl

Time to rewrite it in Rust /s

mistrial9 2 years ago | | |

meanwhile, python DASK is very well funded to be cloud-native, and also local.. however it relies on a python runtime, so you know .. also not sure about the DASK license terms

salawat 2 years ago | |

Your find exec problem can be trivially solved with either `-exec /bin/bash -c "script"`, or you can spend a little extra time figuring out how to structure your scripts so that the invocations just flow with little more than an invocation + getopts.

If you feel like the answer is rewriting the shell, the answer is practically never rewriting the shell. It's learning to use it.
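
A minimal sketch of the `-exec ... -c "script"` pattern mentioned above, assuming a POSIX `sh` and a scratch directory made with `mktemp` (the directory and file names are invented for illustration). The matched paths are passed as positional arguments instead of being pasted into the script text, so odd filenames are not re-read as shell code.

```shell
# Hypothetical scratch directory; one file has a space in its name.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt" "$demo_dir/with space.txt"

# The script text is a fixed string. find appends the matched paths as
# positional parameters ("$@"); the lone "sh" after the script fills $0.
exec_listing=$(
  find "$demo_dir" -type f -name '*.txt' -exec sh -c '
    for path in "$@"; do
      printf "%s\n" "${path##*/}"
    done
  ' sh {} + | sort
)
printf '%s\n' "$exec_listing"
rm -rf "$demo_dir"
```

The `{} +` form batches many paths into one `sh` invocation; `{} \;` would start one shell per file.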

caro11ne 2 years ago | |

Do you mean like:

    parallel 'sleep {= $_=rand()*10+5; =} ; possibly_flaky {}' ::: {1..5}

The {= =} escapes to perl, so you have a full programming language available.

zackmorris 2 years ago |

Since nobody asked, I'm reiterating my position that computers to effectively utilize parallel functionality simply aren't available today. I've always wanted a computer with at least 256 cores and local content-addressable memories beside each core to send data where it's needed. By Moore's Law, we could have had MIPS machines with 1000 cores around 2010, and 100,000 to 1 million cores today, for under $1000.

Contrast that with GPU shaders where one C-style loop operates on buffers separate from system memory, and can't access system services like network sockets or files. GPUs have around 32 or 64 physical cores, so theoretically that many shaders could run simultaneously, although we rarely see that in practice. And we'd need bare-metal drivers to access the GPU cores directly, does anyone know of any?

The closest thing now is Apple's M1 line, but it has specialized NN and GPU cores, so missed out on the potential of true symmetric multiprocessing.

The reason I care about this so much is that with this amount of computing power, kids could run genetic algorithms and other "embarrassingly parallel" code that solves problems about as well as NNs in many cases. Instead we're going to end up with yet another billion dollar bubble that locks us into whatever AI status quo that the tech industry manages to come up with. And everyone seems to love it. It reminds me of the scene in Star Wars III when Padme notes how liberty dies with thunderous applause.

ketanmaheshwari 2 years ago |

GNU Parallel has been one of my go-to tools for getting more done on the terminal. Generating test data, transferring data from one node to another using rsync, running many-task, embarrassingly parallel jobs on HPC, and running pipelines with simple data dependencies over hundreds of files are some of the places where I use GNU Parallel.

Many thanks to Ole Tange for developing the wonderful tool and helping the users on Stack Overflow sites to this day.

Shameless plug, I am developing a tutorial on GNU Parallel to be presented at eScience conference in Cyprus this year: https://www.escience-conference.org/2023/tutorials/gnu_paral...

juujian 2 years ago | |

I'm surprised the CPU would in any way be the bottleneck for transferring data. Is it really faster to parallelize that?

Ultimatt 2 years ago | | |

It's more that GNU Parallel has host groups in a config, so you can send the files for a job to the host where it's going to execute and bring the results back. Essentially it can turn a local xargs-type job into any kind of remote task execution, including handling local files that need to be on the remote.

Aissen 2 years ago |

GNU parallel is great for the kind of tasks highlighted in the post. Note that, being written in Perl, it's slower than its simpler C counterpart, moreutils parallel, and that in many use cases xargs --max-procs=$(nproc) can replace it.

astrodust 2 years ago | |

`xargs` has you covered in more cases than most realize.

cb321 2 years ago | | |

This really is true and you may be understating with "most". Here are a couple:

    mkdir /tmp/g
    seq 1 10 | tr \\n \\0 |
      xargs -0n2 -P4 bash -c 't=$EPOCHREALTIME; sleep $((RANDOM%5)); echo "$@" >/tmp/g/$t' d0
    cat /tmp/g/*

Another one is

    xargs -P "$(nproc)" --process-slot-var=s sh -c 'grep X "$@" >>/tmp/g.$s' d0
    cat /tmp/g.*

You can also cobble together that second style with a custom config setup wherein a command is given $s and responds with some host names and there might be an `ssh` in front of the `grep`, for example. That `d0` argument (for $0) is a bit janky and there can be shell quoting issues, of course. But then again, you may not have hostile filenames/whatever. Remote loadavg adaptation might be nice, but then again, maybe you control all the remotes. Similarly, I could not get back-to-back executions of the EPOCHREALTIME thing closer than 250 microseconds. So, collision basically will not happen even though it probably could in theory.

cstrahan 2 years ago | |

I also recommend checking out `xe`: https://github.com/leahneukirchen/xe

It’s like xargs with sane defaults and a couple tricks of its own.

green-orca 2 years ago |

I'm using task spooler a lot for parallel background processing. What I like the most is the ability to add further tasks to the queue after processing has already started.

https://manpages.ubuntu.com/manpages/xenial/man1/tsp.1.html

pdimitar 2 years ago | |

Never knew about this, thanks! I'll definitely try it because `parallel` has bitten me before in a few more advanced cases. It has rough edges here and there.

NelsonMinar 2 years ago | |

Wow this tool is fantastic, thank you! The UI is very nice and simple. How has this not existed in Unix for 30+ years?

https://github.com/justanhduc/task-spooler

gjvc 2 years ago | |

from that man page, there is a name clash with "ts" from moreutils

codetrotter 2 years ago | | |

I installed task-spooler just now, because I’ve been wanting something like this for a long time.

It looks like the actual name of the task-spooler command on Debian after install is “tsp”, not “ts”. So no collision :)

Now it just remains to be seen if the package by default allows the tasks to continue to run after I log out, or if systemd will annoyingly kill the tasks after I disconnect from ssh the same way systemd annoyingly kills my “screen” sessions when I disconnect ssh, and there is some cumbersome thing you have to do on each of your systems to have systemd not kill “screen” :(

chriswarbo 2 years ago | | |

Some distros rename the binary to 'tsp' (I think Debian does that)

aftbit 2 years ago | | |

moreutils also clashes with parallel, does it not? i remember installing some package for chronic and thus breaking GNU parallel, at least back in the late 2010s.

throwaway277432 2 years ago |

Is the author still adding the "cite me or pay 10000€" notice to the output? And calling that GPL?

And still answering every xargs Stack Overflow question with "you should use GNU Parallel" instead of answering the question? That really gets old quickly when googling for xargs answers.

These are just some of the reasons I'll never use parallel. xargs is perfectly fine for most usecases, and it can do everything I need it to.

ssddanbrown 2 years ago |

Love finding a good use-case of parallel as an easy way to gain massive time savings, especially on the modern high-threaded CPUs of today. Most recently found it useful when batch-compressing large jpeg images to smaller webp files, via use with find and ImageMagick:

    find ./ -type f -iname '*.jpg' -size +1M -print0 | parallel -0 mogrify -format webp -quality 80 {}

wiredfool 2 years ago | |

Xargs is a nearly drop in replacement and probably already installed by default in most distros. You may need the -n 1 (one file per) and -P to parallelize.

    xargs -n 1 -P 8
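
For reference, a rough sketch of the whole pipeline with those flags filled in. It assumes an `xargs` that supports `-0` and `-P` (GNU and BSD versions do; neither flag is in POSIX), and it uses a tiny `sh -c` command as a stand-in for `mogrify`, since ImageMagick may not be installed. The scratch directory and file names are made up.

```shell
work_dir=$(mktemp -d)
printf 'one\n'   > "$work_dir/a.jpg"
printf 'three\n' > "$work_dir/b c.jpg"
printf 'skip\n'  > "$work_dir/notes.txt"

# -print0 / -0 keep names with spaces intact; -n 1 runs one file per
# invocation; -P 4 allows up to four invocations at once.
processed=$(
  find "$work_dir" -type f -iname '*.jpg' -print0 |
    xargs -0 -n 1 -P 4 sh -c 'printf "%s\n" "${1##*/}"' sh |
    sort
)
printf '%s\n' "$processed"
rm -rf "$work_dir"
```

The trailing `sort` is only there because completion order is not deterministic once `-P` is above 1.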

c-hendricks 2 years ago | | |

find + xargs has become my go-to "process files in parallel". Tho now I'm wondering if I should be using `-n` instead of `-L`

    #!/usr/bin/env bash
    set -e

    main() {
      if [ "$1" = "handle-file" ]; then
        shift
        handle-file "$@"
      else
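        # Hand each path to the script as its own argument. Splicing {} into a
        # bash -c string would let quotes or $ in a filename be re-parsed as code.
        find . \
          -type f \
          -not -path '*/optimized/*' \
          -print0 \
          | xargs \
            -0 \
            -n 1 \
            -P 8 \
            "$0" handle-file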
      fi
    }

    handle-file() {
      echo "handle-file $1 ..."
    }

    main "$@"

indymike 2 years ago | | |

Actually, parallel is a drop in for xargs as xargs has been around longer. Parallel has a few big improvements:

* Grouped output (prevents one process from writing output in the middle of another's output)
* In-order output (task a output first, task b output second even though they ran in parallel)
* Better handling of special characters
* Remote execution

More here: https://www.gnu.org/software/parallel/parallel_alternatives....

toastal 2 years ago | |

You should batch compress to JPEG XL too with cjxl --lossless_jpeg=1 --quality=80 --effort=9 {} {/.}.jxl (or magick)

asicsp 2 years ago | |

Any particular reason to use -print0 and pipe instead of -exec?

Gabrys1 2 years ago | | |

-exec would not be parallel, pipe to parallel makes it parallel

titzer 2 years ago |

I didn't know about this, and reading through the comments, I found out that xargs can also do batching and parallelism (nice!). However, it appears that if you pipe the output of an xargs-parallel command into another utility, it jumbles the output of the multiple subprocesses, whereas GNU parallel does not.

I was a little put off by the annoying/scary citation issue mentioned by another commenter, so I am not sure I will use parallel.

I want to pipe the output of parallel processes into a utility that I wrote for progress printing (https://github.com/titzer/progress), but I think that neither of these solutions work; my progress utility will have to do this on its own.
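
One way to get un-jumbled, in-order output from `xargs -P`, sketched under some assumptions: every job writes to its own file (here named after its input, which is a made-up convention) and the files are replayed in input order afterwards. This only emits output after all jobs finish, so it is not a substitute for live progress reporting.

```shell
out_dir=$(mktemp -d)

# Each job writes to a file named after its input, so writers never share a
# stream. The sleep makes later inputs finish first on purpose.
printf '%s\n' 1 2 3 |
  xargs -n 1 -P 3 sh -c '
    sleep "$((3 - $2))"
    printf "start %s\nend %s\n" "$2" "$2" > "$1/$2.out"
  ' sh "$out_dir"

# Replay the per-job files in input order once everything has finished.
grouped=$(cat "$out_dir/1.out" "$out_dir/2.out" "$out_dir/3.out")
printf '%s\n' "$grouped"
rm -rf "$out_dir"
```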

cb321 2 years ago | |

You can probably do something that creates as many FIFOs as you have parallelism and just be careful about emitting whole records like https://github.com/c-blake/bu/blob/main/doc/funnel.md . That one's Nim, but the meat is only like 50 lines and easily ported to C like your progress tool. ( EDIT: and it will also probably be drastically lower overhead than `parallel` which has over 70X worse time overhead and 10X the RAM overhead of tools written in fast, native-compiled languages: https://github.com/c-blake/bu/blob/main/tests/strench.sh )

titzer 2 years ago | | |

Thanks for the suggestion!

bloopernova 2 years ago |

There's a shell script version of GNU parallel that's great for CI/CD pipeline tasks. You just keep it in your repo and source it as needed. It's incredibly useful, we use it in one build to batch process a few thousand things in groups of 25.

Edited to add: finally got signed in to work, you create the script via:

    parallel --embed > scriptname.sh

It's about 14,000 lines of awesome and works on "ash, bash, dash, ksh, sh, and zsh"

notpachet 2 years ago | |

Maybe this is a silly question, but what advantage do you get from checking that huge file into VC instead of just installing parallel ahead of time on the CI images?

bloopernova 2 years ago | | |

Not a silly question!

In this case, we don't have control over the docker images used to build our apps.

ilyt 2 years ago | | |

Parallel was born way before docker and modern CI practices. Having one script that did it all was more of a benefit before those became commonplace.

rhysrhaven 2 years ago |

I much prefer rush over parallel. Namely that everything is executed as a bash shell.

https://github.com/shenwei356/rush

Decabytes 2 years ago |

I’ve been writing a lot of PowerShell recently and discovered the ForEach-Object cmdlets with the -parallel parameter and it has been addicting to parallelize my scripts, so I totally understand why parallelizing using a command line tool is attractive

asicsp 2 years ago |

Didn't know about the book: https://zenodo.org/record/1146014 (discussed 4 years back: https://news.ycombinator.com/item?id=20726631)

See also https://hn.algolia.com/?q=gnu+parallel for other related discussions.

SPBS 2 years ago |

xargs is more useful because it's posix so you can always guarantee it to be there (whereas with GNU Parallel you probably have to reach for a package manager to install it first). The ergonomics are worse though, as usual.

ketanmaheshwari 2 years ago | |

The entirety of GNU Parallel is just one Perl program. It could be copied over and used in a pinch. The installation itself is very simple and no special dependencies or privileges are needed.

em500 2 years ago | | |

Except Perl isn't always present by default either (e.g. in Arch Linux or FreeBSD).

bloopernova 2 years ago | |

See my comment above, there's a shell version you can store in your project repository and use wherever you want with zero installation!

https://news.ycombinator.com/item?id=37208250

Joel_Mckay 2 years ago | |

Indeed, xargs can be a better option, but it has trouble doing some tasks efficiently.

For example, translating a large list of IPv4 ranges into a standard format for a firewall rule-set parser:

    cat ~/blacklist.p2p | parallel --ungroup --eta --jobs 20 "ipcalc {} | sed '2!d' " | grep -Ev '^(0.|255.|127.)' >> ~/blacklist_p2p_converted

Makes an annoyingly slow task tolerable, as parallel doesn't block while fetching to preserve order. We probably should rewrite this to be more efficient, but this task is run infrequently.

Happy computing =)

CJefferson 2 years ago | |

Last time I checked (which was a few years ago, admittedly), some popular systems' xargs were too old to support parallelism -- Mac in particular.

krackers 2 years ago | | |

This is not the case I think, xargs on mac supports parallel, and does so back to 10.9 or older

adrian_b 2 years ago | |

GNU Parallel has been created precisely for solving some deficiencies of xargs.

While there are cases when it makes sense to stick to what is specified by POSIX, there are also cases when the POSIX specification is so obsolete that using POSIX instead of some free ubiquitous programs is a big mistake.

Among these latter cases are writing scripts for a POSIX shell instead of writing them for bash and using xargs instead of parallel.

TZubiri 2 years ago |

First paragraph: I want to test my tests.

Second paragraph: I want to test my test-tester.

OP 100% fell down a rabbit-hole.

latchkey 2 years ago | |

Exactly! I was kind of shaking my head over this one...

"they execute extensive scenarios against a live service over HTTP"

Any time I've seen people think they've needed to test live services, over HTTP... it means that there are far deeper issues.

AvImd 2 years ago |

If none of the examples from the article work, make sure you are running GNU Parallel and not an identically named utility from moreutils.

ranting-moth 2 years ago |

Learning Parallel pays high dividends for the rest of your life.

bloopernova 2 years ago | |

Similarly with the command line in general. Yet you'd think it was torture to some developers I know!

pimpl 2 years ago |

Having a layer of parallelisation on top of good old sequential code seems like a very neat idea. It resolves headaches of learning how to run code in parallel in languages that aren’t necessarily my primary language (e.g. short, one-off scripts). Thanks for sharing!!

ogou 2 years ago |

Someone gifted an old blade server to me a few years ago. Very slow, but 16 cores and 24 gig of RAM. At the time I was making a lot of video art with ffmpeg, without a GPU. That version of ffmpeg wasn't optimized for multiple cores so rendering was really slow and sequential. I discovered Parallel and set the server to process large videos with most of the cores in parallel. Voila, it chewed through a massive amount of media fairly quickly. Faster than the hard drives actually.

bcjordan 2 years ago |

Folks who are here and interested in parallelization for CI/CD may also be interested in Dagger.io — I had heard about it on HN over the years but not played w it. It's basically a more fine-grained Docker-like executor with better caching and utilities for spinning up services and running tests.

Curious if anyone else has experiences with it, honestly been surprised at how little I've heard about it

jamietanna 2 years ago |

One thing I've used parallel for before is adding straightforward retry mechanisms, and it was great! https://www.jvt.me/posts/2022/04/28/shell-queue/

figomore 2 years ago |

I use GNU Parallel to render Blender videos distributed by a bunch of nodes https://github.com/tfmoraes/blender_gnu_parallel_render

rubicks 2 years ago |

I can appreciate that GNU parallel exists. I always use `xargs -P0` in my own work, though.

sneak 2 years ago |

See also: ppss (parallel processing shell script) https://github.com/louwrentius/PPSS#

nateb2022 2 years ago |

There's also PaSh: https://github.com/binpash/pash

jooz 2 years ago |

I tried to use it last week to run 10 instances of curl against a webserver.

I was expecting something as simple as 'parallel -j10 curl https://whatever' but couldn't find the right syntax in less time than it took me to prepare a dirty shell script that did the same.

brabel 2 years ago | |

If you want a simple load testing tool for HTTP, use wrk2[1].

    wrk -t2 -c100 -d30s -R2000 http://127.0.0.1:8080/index.html

> This runs a benchmark for 30 seconds, using 2 threads, keeping 100 HTTP connections open, and a constant throughput of 2000 requests per second (total, across all connections combined).

Some distros include `ab`[2] which is also good, but wrk2 improves on it (and on wrk version 1) in multiple ways, so that's what I use myself.

[1] https://github.com/giltene/wrk2

[2] https://httpd.apache.org/docs/2.2/programs/ab.html

b5n 2 years ago | |

Quick solution:

    parallel -j 10 curl 2> /dev/null \
        ::: $(for i in {1..10};do echo 'https://whatever.com';done)

grepfru_it 2 years ago |

The same can be implemented with just bash using jobs and wait. Useful if parallel is not available in your pipeline
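
A bare-bones sketch of the `&` plus `wait` approach, for anyone who hasn't seen it. The task body (a `sleep` and an `echo` into a scratch directory) is a placeholder. Note there is no limit on how many jobs run at once.

```shell
results_dir=$(mktemp -d)

# Launch each task in the background; "&" returns immediately.
for n in 1 2 3 4; do
  ( sleep 1; echo "task $n done" > "$results_dir/$n" ) &
done

# A bare "wait" blocks until every background child has exited.
wait
finished=$(ls "$results_dir" | wc -l | tr -d ' ')
echo "finished: $finished"
rm -rf "$results_dir"
```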

heinrichhartman 2 years ago |

As the answer to the question was not actually given in the post:

    /usr/bin/parallel

aquir 2 years ago |

"Do one thing and do it well"

nullc 2 years ago |

parallel is great but its default behaviors never quite seem to match my needs, so every time I use it I have to spend some time consulting the man page. Fortunately, the man page is more than up to the task.

But because of the mini learning curve on each use and because I find I need a little more boiler plate to use parallel, I use xargs -P more often, only using parallel when I need its special features (e.g. multiple hosts or collating the output streams).

Oh also, parallel itself can be a bit of a resource hog. (Obviously that depends a lot on how you're using it-- but I mean in cases where xargs' usage is unnoticeable I sometimes have to change the size of my jobs to get parallel out of the way).

herrkanin 2 years ago |

I have wanted to parallelize my .zshrc file for a while – all those environment setup scripts for nvm, pyenv, starship, etc. really make the startup time noticeably slow. Does anyone know how to do this?

dahart 2 years ago | |

Ooh nice thought. I’m not certain, but I kinda doubt it’s possible, because those startup scripts need to modify the current shell environment. I believe GNU parallel runs in a subshell and launches new tasks in separate processes, so it fundamentally doesn’t operate the same way that e.g. sourcing the nvm script does, unfortunately. Even if there was some way to hack it, I’d be nervous about changing environment variables in parallel; to me that sounds like asking for really nasty race condition bugs.
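
A small illustration of the underlying limitation, independent of GNU parallel: anything run in a background job or subshell is a child process, so its variable assignments never reach the shell that launched it. `DEMO_TOOL_HOME` is a made-up variable name.

```shell
unset DEMO_TOOL_HOME

# A background job runs in a child process, so its assignment is lost.
( DEMO_TOOL_HOME=/opt/demo-tool ) &
wait
after_background=${DEMO_TOOL_HOME:-unset}

# The same assignment made in the current shell (which is what sourcing a
# setup script does) sticks.
DEMO_TOOL_HOME=/opt/demo-tool
after_foreground=${DEMO_TOOL_HOME:-unset}

echo "background: $after_background, foreground: $after_foreground"
```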

jp57 2 years ago |

Seems like you could accomplish the same thing more cleanly (IMO) with make. You can create a target for each test, which can be done with patterns, and then use `make -j` to run them in parallel.

morbidious 2 years ago |

Looks like a great tool!

Thanks for the link to the book: https://zenodo.org/record/1146014

michaelcampbell 2 years ago |

parallel is one of those tools like jq, to me. It's great, but by the time I've grokked the syntax, AGAIN, I'd've been quicker to write a quick shell/ruby/python script to do it that's almost readable.

b0afc375b5 2 years ago |

What about & and wait? Could it have been an adequate alternative?

capableweb 2 years ago | |

Probably for very simple use cases, but the real power in parallel really comes from the myriad of switches that enables so much more than what "&" and "wait" could do.

Here are a bunch of examples: https://www.gnu.org/software/parallel/parallel_examples.html

A fun one I end up using ~monthly or so for various things (usually with more switches added as needed):

    GNU Parallel as queue system/batch manager

    # start queue
    true >jobqueue; tail -n+0 -f jobqueue | parallel

    # add job
    echo my_command my_arg >> jobqueue

    # to start queue for remote execution
    true >jobqueue; tail -n+0 -f jobqueue | parallel -S ..

klyrs 2 years ago | |

When I'm using parallel, it's usually because I have thousands of jobs. Worse, they have nontrivial memory requirements. When you background processes with &, the system starts timeslicing. Each process gets to allocate its memory before being paused to make time for the next process. Your system will almost immediately crumple under load. Hopefully, the oom killer will target your backgrounded jobs... but the script spawning them will go untouched because it isn't the thing hogging memory.

Before I learned of parallel, I tried a hack where I'd manually assemble jobs into batches, and wait on each batch before starting the next. It achieved very low system utilization, because inevitably, one job in each batch takes much longer than the rest. A slight improvement (still not good) is to use `split` to chop your jobs file into $num_cores chunks, and background each chunk. But still, this gets low utilization. The problem is that you aren't using a thread/worker pool.

Parallel (or, TIL, xargs) can maintain 100% system utilization, until the very last $num_cores jobs.

eisbaw 2 years ago | |

No, that is more messy and can easily leave lingering processes.

But it can be done in pure BASH: https://gist.github.com/mped-oticon/b11dafa937e694ce4fa6fbf2...

GNU parallel supports expansion, which bash_parallel doesn't. However bash_parallel works with bash functions, which GNU parallel doesn't.

untilted 2 years ago | | |

GNU parallel supports bash functions, provided you "export -f" them beforehand
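
A sketch of what `export -f` does, using `xargs` plus `bash -c` as the consumer so that it runs without GNU parallel installed. This is bash-only, and `shout` is a made-up function. As far as I know, with GNU parallel the equivalent after the export would be `parallel shout ::: alpha beta`.

```shell
# Bash-only: export -f places the function definition in the environment,
# where child bash processes (including ones started by xargs) pick it up.
shout() { printf '%s!\n' "$1" | tr '[:lower:]' '[:upper:]'; }
export -f shout

shouted=$(printf '%s\n' alpha beta | xargs -n 1 -P 2 bash -c 'shout "$1"' _ | sort)
printf '%s\n' "$shouted"
```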

agumonkey 2 years ago | |

You just taught me something

toastal 2 years ago |

I use this with Nix all the time. Great utility.

tomberek 2 years ago | |

Especially with the remote SSH features one needs a way to ensure everything needed for your process is on the target machine; Nix makes this easy.

Nix + GDAL + GNUParallel + autoscaling groups === massive geospatial data processing pipeline

timtom39 2 years ago |

Love the tool. One of my favorite snippets adds parallel processing to jq

    #!/bin/bash

    cat - | parallel --line-buffer --pipe --roundrobin jq "$@"

pmarreck 2 years ago |

HIPS (Hiding In Plain Sight)!

lfconsult 2 years ago |

Wonderful... Thanks for sharing.

amelius 2 years ago |

Another reminder that you shouldn't use Bash to write scripts.

E.g. in Python this would all be very easy to do. Just start a bunch of threads and e.g. invoke subprocess.run() from them.

bloopernova 2 years ago | |

You don't have to reinvent the wheel for your script, all the parallel options are ready for you to use and are well documented. It's also packed with features that might take a long time to write into your Python script.

I am trying to use Python by default when writing scripts nowadays, but sometimes the best tool for the job isn't Python or writing your own Python.

imajoredinecon 2 years ago | | |

IMO, effective "scripting" just means the ability to solve ad hoc problems easily by writing task-specific glue that delegates the hard parts of the program to (1) an effective set of libraries you've written yourself and (2) external code or tools when it makes sense.

From this perspective, the languages of the glue, the libraries, and the external code all matter less than the ease of writing the glue; interfacing with the external code; and maintaining the libraries. The best language for this probably comes down to a combination of what you're comfortable writing (and reading, and maintaining) and what kinds of tasks you're trying to solve.

For me personally, using Python glue and libraries strikes a pretty good balance here. Writing a script "in Python" doesn't mean you need to reinvent the wheel. If you think `parallel` provides a better interface for map-reduce parallelism than `subprocess` (or than a library function you've written on top of `subprocess`), no problem: you can just call `parallel` from Python (and you'll probably find yourself writing a library function on top of it to abstract away the fact that it's a shell script).

But if you're much more effective working in Bash than Python, then writing your glue and developing your libraries in Bash could be the way to go.

amelius 2 years ago | | |

Well, writing such a script takes me only a few minutes maybe and gives me a lot of flexibility.

cusspvz 2 years ago |

You guys know that in bash you can use `&` to pass a foreground terminal process to the background and then use `wait` to wait for all the session's background process to end, right?

bduffany 2 years ago | |

Yes, and those work well for smaller workloads, but if you just run 1,000,000 commands with `&` in a `for` loop, it will grind your computer to a halt (if the tasks are modestly resource intensive). GNU parallel will let you run those same 1,000,000 tasks but make sure that only (e.g.) 16 of them are running at once. It's not easy to do that in bash.
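
For comparison, this is roughly what a hand-rolled limit looks like in plain bash. It is a sketch that assumes bash 4.3 or newer for `wait -n`; the limit of 2, the six placeholder jobs and the log file are arbitrary choices for the demo.

```shell
max_jobs=2
running=0
log_file=$(mktemp)

for n in 1 2 3 4 5 6; do
  # At the limit: block until any one background job exits, then reuse its slot.
  if [ "$running" -ge "$max_jobs" ]; then
    wait -n
    running=$((running - 1))
  fi
  ( sleep 1; echo "job $n" >> "$log_file" ) &
  running=$((running + 1))
done
wait  # drain the jobs still in flight

completed=$(wc -l < "$log_file" | tr -d ' ')
echo "completed: $completed"
rm -f "$log_file"
```

It works, but it has none of the output grouping, retries or remote execution discussed elsewhere in the thread.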

oniony 2 years ago | |

I've been using `&` to run stuff in the background for donkeys, but had no idea about `wait`.

seized 2 years ago | |

And that's not really at all comparable to what Parallel can do.... Bash can't do that across thousands of cores on separate machines for example.

da-x 2 years ago | |

It takes time to notice that if you do _several_ of these background jobs with `&`, you will only get the exit status of the last one when you do `wait`. Errors of the others will be swallowed.

Then you _have_ to resort to 'wait <pid>' with the 20 lines of bash code needed to manage all those PIDs. I have a large editor bash snippet just for that.
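
A stripped-down sketch of that PID bookkeeping, with `( exit N )` standing in for real tasks. A bare `wait` with no arguments reports success regardless of how the jobs ended, so each PID is waited on individually here.

```shell
pids=""
for code in 0 3 0; do
  ( exit "$code" ) &          # stand-in for a real task that may fail
  pids="$pids $!"             # $! is the PID of the job just started
done

failures=0
for pid in $pids; do
  # "wait PID" returns that job's own exit status.
  if ! wait "$pid"; then
    failures=$((failures + 1))
  fi
done
echo "failed jobs: $failures"
```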

remram 2 years ago | |

Strong "Dropbox is just rsync it'll never sell" vibes.

Alifatisk 2 years ago | |

Didn’t know about wait

quickthrower2 2 years ago |

It is sort of a shame that tools can’t figure out how to parallelize things without being herded like cattle to do so.

It might be a culture thing. In .NET code I see people running things in parallel a lot within code but maybe this is less so for linux tools.

Maybe functional programming style could lend to a parallel-first programming style, with heuristics to decide when it isn’t worth it.

pdimitar 2 years ago | |

You seem a bit behind or too invested in C# in particular. Elixir for example can run stuff in parallel with just 3-4 lines of code added to otherwise sequential code.

quickthrower2 2 years ago | | |

Yes of course other programming languages can do this. I was more referring to culture and idioms. The point is that tools don't support it or think about it. And that is because probably things work for most small use cases without it, and that it is a leaky abstraction - you need to change your code to support it.

Imagine a world where there were only GPUs for example - then everyone by default would be running parallel-first code, and in that imaginary world you would need to do nothing to run a series of bash commands piping into each other in parallel.

    Price    US$7.9 million in 1977 (equivalent to $38.2 million in 2022)
    Weight   5.5 tons (Cray-1A)
    Power    115 kW @ 208 V 400 Hz[1]
    CPU      64-bit processor @ 80 MHz[1]
    Memory   8.39 Megabytes (up to 1 048 576 words)[1]
    Storage  303 Megabytes (DD19 Unit)[1]
    FLOPS    160 MFLOPS

    - # *YOU* will be harming free software by removing the notice. You
    - # accept to be added to a public hall of shame by removing the
    - # line. That includes you, George and Andreas.

    [david@pc ~]$ echo foo | parallel echo
    Academic tradition requires you to cite works you base your article on.
    If you use programs that use GNU Parallel to process data for an article in a
    scientific publication, please cite:

      Tange, O. (2023, July 22). GNU Parallel 20230722 ('Приго́жин').
      Zenodo. https://doi.org/10.5281/zenodo.8175685

    This helps funding further development; AND IT WON'T COST YOU A CENT.
    If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

    More about funding GNU Parallel and the citation notice:
    https://www.gnu.org/software/parallel/parallel_design.html#citation-notice

    To silence this citation notice: run 'parallel --citation' once.

    foo
    [david@pc ~]$

If you use --will-cite in scripts to be run by others you are making it harder for others to see the citation notice. The development of GNU parallel is indirectly financed through citations, so if your users do not know they should cite then you are making it harder to finance development. However, if you pay 10000 EUR, you have done your part to finance future development and should feel free to use --will-cite in scripts. If you do not want to help financing future development by letting other users see the citation notice or by paying, then please consider using another tool instead of GNU parallel. You can find some of the alternatives in man parallel_alternatives.

== Is the citation notice compatible with GPLv3? ==

Yes. The wording has been cleared by Richard M. Stallman to be compatible with GPLv3. This is because the citation notice is not part of the license, but part of academic tradition. Therefore the notice is not adding a term that would require citation as mentioned on: https://www.gnu.org/licenses/gpl-faq.en.html#RequireCitation

The link only addresses the license and copyright law. It does not address academic tradition, and the citation notice only refers to academic tradition. [...]

Does the GPL allow me to add terms that would require citation or acknowledgment in research papers which use the GPL-covered software or its output? (#RequireCitation)

No, this is not permitted under the terms of the GPL. While we recognize that proper citation is an important part of academic publications, citation cannot be added as an additional requirement to the GPL. Requiring citation in research papers which made use of GPLed software goes beyond what would be an acceptable additional requirement under section 7(b) of GPLv3, and therefore would be considered an additional restriction under Section 7 of the GPL. And copyright law does not allow you to place such a requirement on the output of software, regardless of whether it is licensed under the terms of the GPL or some other license.