Design of GNU Parallel (2015) (gnu.org)

170 points by Havelock 3 years ago | 71 comments

ketzu 3 years ago |

This was quite interesting to look through!

Perl 5.8.0 is over 20 years old (https://dev.perl.org/perl5/news/2002/07/18/580ann/) while CentOS 3.9 was released in 2007! It manages to seem both not that old and ancient at the same time.

My personal anecdote with GNU Parallel was running into it while working in academia. It worked well and saved me some time, but I felt that it was unreasonable for a tool that parallelises a script to ask for a citation; by that logic matplotlib, Jupyter and co would need one as well. On the other hand, I decided not to use it, because I also feel that authors can ask for whatever they want.

ajsnigrutin 3 years ago | |

Yep, that's the great thing about perl... take a 20 year old script and it still works today. In comparison, if they used python, they'd be using python 2.2.

fmajid 3 years ago | | |

That's basically a side-effect of Perl being a dead language, frozen because Perl 6 will never happen. It's surprisingly hard to eradicate, however.

chubot 3 years ago | | |

Python 2.7 was released in 2010, and is even more frozen than Perl!

It still works, though you would have to archive/vendor dependencies

Ferret7446 3 years ago | |

It's a request, not a requirement. I see nothing wrong with the request, nor with an individual deciding not to cite it due to their principles/judgement.

ketzu 3 years ago | | |

As I said, I think it is okay for authors to make any request they want, it is their software after all.

But I still think asking for a citation for GNU Parallel is unreasonable. There is a huge body of software, of which GNU Parallel is probably the least important, that contributed to (at least my) research. Blowing up citation lists with all of it makes the citation list borderline useless.

It turns citations into advertising space for software. It's bad enough being coerced into making them an advertisement for reviewers' papers.

a2800276 3 years ago |

Wait what: `parallel` is a Perl script!? [1]

I would have thought it's black magic with assembler optimisations for MIPS and special considerations for HP-UX...

This is such a lovely and interesting writeup, it's wonderful that people take their time to share so generously!

[1]: an 11k LOC Perl script, you can read along here: https://github.com/gitGNU/gnu_parallel/blob/master/src/paral...

mhh__ 3 years ago | |

assembly optimizations for starting processes?

remram 3 years ago | | |

Maybe for reading the input, splitting it, and assembling the possibly-very-long argument lists passed to the processes.

NortySpock 3 years ago |

I found GNU parallel useful when I wanted to queue up transcoding of flac files to mp3 on my Raspberry Pi. A few ffmpeg flags plus a list of files meant I could easily just saturate one job per core with a one-line bash command.
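A one-liner of that kind might look roughly like the sketch below. It is not the command the commenter ran: the helper name `flac_to_mp3` and the quality setting are made up for illustration, and it assumes an ffmpeg build that includes the libmp3lame encoder.

```shell
# Hypothetical helper: transcode every FLAC file passed as an argument to MP3.
# GNU Parallel starts one job per CPU core by default.
# {}  is replaced by the input file name,
# {.} is replaced by the same name with its extension removed.
flac_to_mp3() {
  parallel ffmpeg -nostdin -loglevel error -i '{}' \
    -codec:a libmp3lame -qscale:a 2 '{.}.mp3' ::: "$@"
}

# Usage: flac_to_mp3 *.flac
```

Parallel quotes the substituted file names itself, so names with spaces should not need extra care here.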

krylon 3 years ago | |

I like to use ts(1) for that. http://vicerveza.homeunix.net/~viric/soft/ts/

hkt 3 years ago | |

I've used it to parallelise updating hundreds of helm releases whose CI pipelines had ceased to exist. It is a neat tool.

noloblo 3 years ago | | |

Can you please share the example code in GNU Parallel?

cricalix 3 years ago |

parallel is a tool I've reached for many times; the citation bit it prints is odd - it seems to assume that the general use case is research/academic - but easily squelched.

A sample use case would be having a file that has words in it, one per line, and you want to run a program that operates on each word (device name, dollar amount, whatever). Sure, you can use a loop, but if the words and actions are independent, parallel is one way to spin up N copies of your program and pass it a single word from the file. Can get around Python's GIL without having to use multiprocessing or threads (as a more concrete example).

Didn't realise that it busy waits, but I'm typically running it on a not very busy server with tens of cores.
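The one-word-per-line use case described above could be sketched like this. The helper name `per_word` and the script name in the usage line are hypothetical; only `--jobs` and `-a` are GNU Parallel options.

```shell
# Hypothetical helper: run a command once per line of a file, N jobs at a time.
# Usage: per_word JOBS FILE COMMAND [ARGS...]
per_word() {
  njobs=$1
  file=$2
  shift 2
  # -a FILE reads the arguments from FILE, one per line.
  # With no {} in the command, each line is appended as the last argument.
  parallel --jobs "$njobs" -a "$file" "$@"
}

# Example: per_word 8 words.txt python3 process_word.py
```

Each line gets its own process, which is why this sidesteps the GIL: the copies of the Python program never share an interpreter.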

chungy 3 years ago | |

Thankfully both Debian and Arch patch out the citation nonsense.

RhysU 3 years ago | | |

It is "nonsense" because...?

A) You don't understand. Please read the "Citation notice" section in the article.

B) You understand but don't use GNU Parallel.

C) You understand and use GNU Parallel in a non-academic setting and find the hassle of supplying --no-notice to be onerous vs the effort to write/maintain your own tool.

D) You understand and use GNU Parallel in an academic setting and have cited Ole or plan to cite Ole.

From the article, nearly 10 years ago Ole added the citation behavior after discussing it with his users: https://lists.gnu.org/archive/html/parallel/2013-11/msg00006...

Ole's citations took off roughly coincident with this behavior being added: https://scholar.google.com/citations?hl=en&user=D7I0K34AAAAJ... (click "Cited By" and notice the bar chart).

BooneJS 3 years ago |

Before GNU Parallel I used to use Ruby's workers and job queue to keep ${N} cores busy with work. It sorta worked like GNU parallel but was quite basic. I've since switched to using GNU Parallel. Stable code I don't have to write doesn't have to be maintained... not to mention it has more features than I normally supported.

Alifatisk 3 years ago | |

What did you use exactly? I am curious: Resque? Sidekiq?

BooneJS 3 years ago | | |

Ruby's Queue structure to push work, and Thread for spinning up workers based on the number of cores on the machine. Main thread would push all commands to run to the Queue, followed by ${N} shutdown hints, and ${N} Threads would pick them off in a while loop that would only stop when it saw a shutdown command. Once the last thread consumed the last shutdown hint, all threads were done and the script would exit. This was barely one step beyond a bash script that backgrounded all tasks at once and swamped a host until it slowly finished up.

docandrew 3 years ago |

I couldn’t make heads or tails of what this would be useful for from the OP (maybe it’s something I should already have known), but this from the official site was pretty helpful: https://www.gnu.org/software/parallel/parallel_cheat.pdf

psychphysic 3 years ago | |

That cheat sheet is super enlightening!

But quite useless as it'll print poorly and is overall a waste of resources to have that lovely beach scene in the background.

kakadzhun 3 years ago | | |

Try this resource instead. Although it is 100 pages, the introductory part is already useful in and of itself!

https://zenodo.org/record/1146014/files/GNU_Parallel_2018.pd...

RadiozRadioz 3 years ago | | |

The beach will certainly make the cheat sheet stick in my memory, I can tell you that much.

bloopernova 3 years ago | | |

I was able to remove the background using LibreOffice to open the PDF.

mianos 3 years ago |

I once replaced a 10 machine Hadoop cluster job with a python script and parallel on my laptop because I didn't want to wait for hours for it to finish.

The i7 on my laptop with quite a few CPUs/threads and a few optimisations got the job finished in 10 minutes.

(I later put the Hadoop use on my resume, not the GNU parallel. That's the joke of modern job hunting. There is no interest in what you did, just buzzwords and leetcode. Luckily there are still a few places that value real work or I'd be too old to get a job. :) )

ZoomZoomZoom 3 years ago |

If anyone needs a pretty basic alternative with Windows support, there's Rush:

https://github.com/shenwei356/rush

I use it pretty extensively with ffmpeg, imagemagick and the like.

I'd been using mmstick/parallel for a while, but it moved to the RedoxOS repos and then stopped being updated, while still having some issues not ironed out.


seized 3 years ago |

Parallel is a fun tool. I use it as a sort of simple slurm to distribute work over many VMs to process tens to hundreds of TBs of data. Sometimes across 2400+ cores.

michalc 3 years ago |

I've never been sure if it's too much of a hack, but I've used GNU parallel in Docker containers as a quick and easy way of getting multiple processes running for web applications.

With the `--halt now,done=1` option (which I think is relatively recent?), if any of the parallel processes exits, parallel itself exits, the whole container shuts down, and external orchestration starts another one if needed.
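A container entrypoint along those lines might look like the sketch below. It is an illustration, not the commenter's setup: the helper name `run_together` and the two example commands are placeholders.

```shell
# Hypothetical container entrypoint helper: run every command given as an
# argument side by side, and stop everything as soon as any one of them exits.
run_together() {
  # --jobs "$#"        one slot per command; the default is one job per CPU
  #                    core, which is too few in a single-core container
  # --line-buffer      pass output through line by line as it is produced
  # --halt now,done=1  when the first job finishes, for whatever reason,
  #                    kill the remaining jobs and exit
  parallel --jobs "$#" --line-buffer --halt now,done=1 ::: "$@"
}

# Example (placeholder commands):
# run_together 'nginx -g "daemon off;"' 'php-fpm --nodaemonize'
```

Setting `--jobs` explicitly is the part that is easy to miss: without it, the number of services that actually start depends on how many cores the container sees.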

imglorp 3 years ago |

Don't forget "make -j" is another option.

fmajid 3 years ago | |

Or `xargs -P`

fbdab103 3 years ago | |

I was just attempting to parallelize a makefile (~500 files, ~20 minutes per file), and I was not happy with the experience. Make syntax for globbing is not ideal. Doubly so as my files had spaces inside of them. All solvable of course, but I feel more comfortable leaning on a parallel/xargs/find workflow than esoteric make syntax to handle the realities of filenames in the wild.

Which is a shame - 95% of my make usage is PHONY targets where I have a task and not a generated artifact. My current use case would have greatly benefited from the native parallelism and the ability to restart only failed files.
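The find/xargs workflow mentioned above can be sketched as follows. The helper name `each_file` is made up, and `-print0`, `-0` and `-P` are widely supported by the GNU and BSD tools but may not be available everywhere.

```shell
# Hypothetical helper: run COMMAND once per regular file under DIR,
# up to JOBS at a time. Names are passed NUL-delimited, so spaces
# (and even newlines) in file names survive intact.
# Usage: each_file JOBS DIR COMMAND [ARGS...]
each_file() {
  njobs=$1
  dir=$2
  shift 2
  find "$dir" -type f -print0 | xargs -0 -n 1 -P "$njobs" "$@"
}

# Example: each_file 4 ./input gzip -9
```

This does not give you make's "rerun only what failed" behaviour. With GNU Parallel, `--joblog` combined with `--resume-failed` is the documented way to get something similar.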

anthk 3 years ago |

Parallel, vidir to edit directories with nvi/vim, moreutils, detox to scrub out any non-typeable chars...

These are a must have today.

InfamousRece 3 years ago | |

moreutils has its own parallel utility that I actually prefer to GNU Parallel.

anthk 3 years ago | | |

No problem, they work almost the same, I think. Oh, another bunch of small tools to help yourself:

    - entr. Runs a command on file/directory changes.
    - spt. Simple pomodoro technique. A good timer to help you work and take rests.
    - herbe. Works great as a notifier for spt. Add "play" from sox and you can write a script that notifies and plays a sound in parallel.
    - sox/ffmpeg/imagemagick. Audio, video and image production and conversion on the CLI. A must have.
    - catdoc/antiword/odt2txt/wordgrinder+sc-im+gnuplot. Reading and editing Word/Excel/LibreOffice files in the terminal. Gnuplot will help with sc-im. This can be a beast over SSH. With Gnuplot compiled with sixel support (and XTerm) you can do magic.
    - iomenu. Pick from a list of items (one per line) and choose. I think it has fuzzy-finding matches. For example: cat bookmarks.txt | iomenu | xargs firefox

I have several more. Simple battery meter (sbm), grabc to grab a color from the screen, pointtools+catpoint to do "presentations" over a terminal, nncp-go+yggdrasil for ad-hoc networking and secure encrypted backups between devices...

andrewshadura 3 years ago | | |

There's also paexec

rurban 3 years ago |

I wrote down a small usage example here: https://savannah.gnu.org/forum/forum.php?forum_id=9197

No need for massive distributed clusters when you have a simple perl oneliner

rockwotj 3 years ago |

I recently used parallel to write a 1TB data file for testing using all cores

  seq 0 10000 | parallel dd if=/dev/urandom of=/mnt/foo/input bs=10M count=10 seek={}0

codetrotter 3 years ago | |

Was it noticeably different from

    dd if=/dev/urandom of=/mnt/foo/input bs=10M count=100000

in the amount of time that it took?

rockwotj 3 years ago | | |

Yes, I had 16 cores and I gave up on this version after several minutes. I don't remember the disk throughput difference but it was significant.
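For anyone puzzling over the `seek={}0` in the parallel version: dd counts `seek` in units of `bs`, so job N starts at block N×10 of 10M each and writes 10 blocks, which gives every job its own 100M slice of the file. A scaled-down sketch of the same idea, with one 1 MiB block per job, follows. The helper name `fill_random` is hypothetical, and `conv=notrunc` and `status=none` are additions that were not in the original command (`status=none` is specific to GNU dd).

```shell
# Hypothetical scaled-down version: fill FILE with CHUNKS chunks of 1 MiB each.
# Usage: fill_random FILE CHUNKS
fill_random() {
  file=$1
  chunks=$2
  # Job N writes exactly one 1 MiB block at block offset N.
  # Without conv=notrunc, dd truncates the output file at its seek offset
  # when it starts; with it, the result does not depend on the order in
  # which the jobs happen to start.
  seq 0 "$((chunks - 1))" |
    parallel dd if=/dev/urandom of="$file" bs=1M count=1 seek={} conv=notrunc status=none
}

# Example: fill_random /tmp/random.bin 64    # about 64 MiB
```

This is only meant to show the offset arithmetic; how much faster it is than a single dd depends on where the bottleneck sits, as the reply above suggests.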

globalreset 3 years ago |

What's the best rewrite of GNU Parallel in Rust? That citation thing is so annoying.

    [supervisord]
    nodaemon=true

    [program:php-fpm]
    command=/usr/sbin/php-fpm8.0 -c /etc/php/8.0/fpm/php-fpm.conf --nodaemonize
    stdout_logfile=/dev/stdout
    stdout_logfile_maxbytes=0
    stderr_logfile=/dev/stderr
    stderr_logfile_maxbytes=0

    [program:nginx]
    command=/usr/sbin/nginx
    stdout_logfile=/dev/stdout
    stdout_logfile_maxbytes=0
    stderr_logfile=/dev/stderr
    stderr_logfile_maxbytes=0

    The following NEW packages will be installed:
      libexpat1 libmpdec3 libpython3-stdlib libpython3.10-minimal libpython3.10-stdlib libreadline8 libsqlite3-0 media-types python3 python3-minimal python3-pkg-resources python3.10 python3.10-minimal readline-common supervisor
    0 upgraded, 15 newly installed, 0 to remove and 0 not upgraded.
    Need to get 6905 kB of archives.
    After this operation, 25.7 MB of additional disk space will be used.