Qsv: Efficient CSV CLI Toolkit (github.com)

100 points by s1291 2 years ago | 30 comments

astowaway 2 years ago |

GNU awk recently got CSV support built in, which is quite nice imo, though it is certainly less featureful than qsv appears to be.

snidane 2 years ago |

This looks great!

Please consider removing any implicit network calls like the initial "Checking GitHub for updates...". This alone will keep people from adopting it or even trying it any further. It is similar to GNU parallel's --citation, which, albeit a small thing, will scare many people off.

Consider adding pivot and unpivot operations. Mlr gets it quite right with syntax, but is unusable since it doesn't work in streaming mode and tries to load everything into memory, despite claiming otherwise.

Consider adding a basic summing command. Sum is the most common data operation, which could warrant its own special optimized command instead of offloading this to an external math processor like lua or python. Even better if this had a group by (-by) and window by (-over) capability. E.g. 'qsv sum col1,col2 -by col3,col4'. Brimdata's zq utility is the only one I know that does this quite right, but it is quite clunky to use.
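A minimal Python sketch of the streaming group-by sum described above. The function name `grouped_sum` and the sample columns are hypothetical, and this is not how qsv or zq implement it; it only shows that memory grows with the number of groups, not the number of rows.

```python
import csv
import io
from collections import defaultdict


def grouped_sum(rows, sum_cols, by_cols):
    """Stream over dict rows, accumulating sums of sum_cols per by_cols key."""
    totals = defaultdict(lambda: [0.0] * len(sum_cols))
    for row in rows:
        key = tuple(row[c] for c in by_cols)
        acc = totals[key]
        for i, col in enumerate(sum_cols):
            acc[i] += float(row[col] or 0)  # treat empty cells as 0
    return dict(totals)


# Hypothetical sample data; a real run would pass csv.DictReader(sys.stdin).
sample = io.StringIO("region,year,sales\nEU,2023,10\nEU,2023,5\nUS,2023,7\n")
result = grouped_sum(csv.DictReader(sample), ["sales"], ["region", "year"])
print(result)
```

Only one row is held at a time, so the input can be arbitrarily large as long as the set of distinct group keys fits in memory.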

Consider adding a laminate command. Essentially adding a new column with a constant. This could probably be achieved by a join with a file with a single row, but why not make this common operation easier to use?

Consider the option to concatenate csv files with mismatched headers. cat rows or cat columns complains about the mismatch. One of the most common problems with handling csvs is schema evolution. I and many others would appreciate it if we could merge similar csvs together easily.
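As an illustration of what merging csvs with mismatched headers involves, here is a minimal two-pass Python sketch. The name `cat_mismatched` and the sample data are hypothetical, and it assumes the inputs are seekable (files, not pipes); it says nothing about how any existing tool does this.

```python
import csv
import io


def cat_mismatched(sources, out):
    """Concatenate CSV sources whose headers differ, filling gaps with ''."""
    # Pass 1: build the union of column names, preserving first-seen order.
    fieldnames = []
    for src in sources:
        for name in csv.DictReader(src).fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        src.seek(0)  # rewind for the second pass
    # Pass 2: stream rows out; restval fills columns a source lacks.
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for src in sources:
        writer.writerows(csv.DictReader(src))


# Hypothetical "old schema" and "new schema" files.
old = io.StringIO("id,name\n1,ann\n")
new = io.StringIO("id,name,email\n2,bob,bob@example.com\n")
merged = io.StringIO()
cat_mismatched([old, new], merged)
print(merged.getvalue())
```

The first pass reads only the header lines, so the rows themselves are still streamed once and never held in memory together.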

Conversions to and from other standard formats would be appreciated (parquet, ion, fixed-width, avro, etc.). Other compression formats as well, especially zstd.

It would be nice if the tool enabled embedding outputs of external commands easily. Lua and python builtin support is nice, but probably not sufficient. I'd like to be able to run a jq command on a single column and merge the result back as another column, for example.

Inspiration:

  - csvquote: https://news.ycombinator.com/item?id=31351393
  - teip: https://github.com/greymd/teip

dima55 2 years ago | |

You can get quite far by piping to other tools and/or using DSLs. Pivoting can almost certainly be done by the luau support in qsv (or `vnl-filter`, for instance). Summing and grouping is something that `datamash` does well (or qsv luau probably, or `vnl-filter --eval`). Adding a column once again can be done with luau or `vnl-filter`.

Would you be more likely to use this tool if it had even more stuff in it requiring reading even more documentation? That's a genuine question.

jqnatividad 2 years ago | |

Thanks for the detailed feedback @snidane!

As maintainer of qsv, here's my reply:

- Given qsv's rapid release cycle (173 releases over three years), the auto-update check is essential at the moment. Once we reach 1.0, I'll turn it off. For now, given your feedback, I've only made it check 10% of the time.

- Pivot is in the backlog and I'll be sure to add unpivot when I implement it. (https://github.com/jqnatividad/qsv/issues/799)

- I'll add a dedicated summing command with the group by (-by) and window by (-over) capability (https://github.com/jqnatividad/qsv/issues/1514). Do note that `stats` has basic sum as @ezequiel-garzon pointed out.

- With the `enum` command, qsv can achieve what you proposed with `laminate`. E.g. qsv enum --new-column newcol --constant newconstant mydata.csv --output laminated-data.csv

- With the `cat rowskey` command, qsv can already concatenate files with mismatched headers.

- Other file formats: qsv supports parquet, csv, tsv, excel, ods, datapackage, sqlite and more (see https://github.com/jqnatividad/qsv/tree/master#file-formats). The fixed-width format, though, is not supported yet. It is quite interesting, and I have added it to the backlog (https://github.com/jqnatividad/qsv/issues/1515)

- As to "enable embedding outputs of commands", qsv is composable by design, so you can use standard stdin/stdout redirection/piping techniques to have it work with other CLI tools like jq, awk, etc.

Finally, I just released v0.120.0, which already incorporates the less aggressive self-update check. https://github.com/jqnatividad/qsv/releases/tag/0.120.0
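For readers wondering what "check 10% of the time" could look like, a minimal Python sketch follows. This is an assumption about the general technique (a random draw per invocation), not qsv's actual code; `should_check_for_updates` is a hypothetical name.

```python
import random


def should_check_for_updates(probability=0.10, rng=random):
    """Return True on roughly `probability` of invocations."""
    return rng.random() < probability


rng = random.Random(42)  # seeded only so the demo is reproducible
checks = sum(should_check_for_updates(rng=rng) for _ in range(10_000))
print(f"checked on {checks} of 10000 runs")
```

A draw like this needs no state on disk, at the cost of an uneven gap between checks for any single user.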

ezequiel-garzon 2 years ago | |

I know this is just one thing out of many, but sum is included in stats.

quasarj 2 years ago | |

Wait, who is scared off by parallel's --citation?

fbdab103 2 years ago | | |

I refuse to use parallel due to that obnoxiousness.

At minimum, it is not installed by default, so it is already a negative to just using xargs. That it then puts that barrier in my way makes it an easy tool to skip.

sweetgiorni 2 years ago | | |

I find it incredibly obnoxious and I refuse to use parallel because of it. To me, it violates the spirit of free software and tarnishes the GNU project. As someone who has released my source to the public for free, I couldn't fathom adding such a flag.

Bonus SO post to enhance your fury:

https://stackoverflow.com/questions/61762189/installing-gnu-...

alchemist1e9 2 years ago |

Wow! This looks like a really complete set of operations, and extremely useful.

foehrenwald 2 years ago |

I am wondering who really uses these tools, and for what, since R and Python data science tools are available?

dima55 2 years ago | |

For simple analyses (i.e. what most people do most of the time) doing this on the commandline gets you there faster. I use vnlog (https://github.com/dkogan/vnlog/). By the time you fired up your editor to write your Python code, I already have analyses and plots ready.

fbdab103 2 years ago | |

I write Python every day, but still use miller here and there. If I am doing a "simple" operation (eye of the beholder), being able to pipe it on the command line is great.

To do a comparable amount of manipulation in Python takes a lot more boilerplate (imports, command line arguments, deity-can-we-default-to-Int64-already?, etc), plus you have to ensure you have a virtual environment with correct dependencies. Which is more or less standard numpy+pandas, but a single executable tool to do some data workup is always appreciated.

I am never performance constrained. I have been told that miller is one of the slower tools in this space, but I still reach for it due to its wide format support.

snidane 2 years ago | |

Out-of-core computations. While your python and R script will choke after reading a few hundred megs, my compiled binary cli will keep streaming through many such files with memory usage sitting somewhere near zero.

mbreese 2 years ago | | |

That’s just the effect of streaming IO vs reading the file into memory all at once. That has nothing to do with the language you use, but with how you process the data.

I keep multiple little Python scripts around to do things like sum lists of numbers (think extracting a column with awk, then calculating a sum). Compiled vs an interpreted script really doesn’t matter. What matters is using the right algorithm for the job. R and Python data science libraries like to read in all of the data at once into one single data structure. That’s the anti-pattern to avoid if at all possible.

(But they are very handy for small datasets of complex calculations that require the entire dataset in memory.)
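A minimal sketch of the kind of small streaming script described above (extract a column with awk, then sum it). The function name `stream_sum` and the sample input are hypothetical; the point is only that a single running total is kept, so memory stays flat regardless of input size.

```python
import io


def stream_sum(lines):
    """Sum one number per line, holding only the running total in memory."""
    total = 0.0
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            total += float(line)
    return total


# In real use `lines` would be sys.stdin, fed by something like
#   awk -F, '{print $3}' data.csv | python sum.py
numbers = io.StringIO("1.5\n2.5\n\n4\n")
print(stream_sum(numbers))
```

The contrast is with loading the whole file into one in-memory structure first, where memory grows with the file.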

hermitcrab 2 years ago | |

Also: https://github.com/BurntSushi/xsv https://csvkit.readthedocs.io/en/latest/

dima55 2 years ago |

An incomplete list of other similar tools: https://github.com/dkogan/vnlog/#description

alchemist1e9 2 years ago | |

Here is a related but more obscure tool that can be surprisingly useful.

http://hopper.si.edu/wiki/mmti/Starbase

Their tbl format is so trivially close to standard csv that I just convert on the fly back and forth with tiny helper perl scripts.