The Linux Command Line(linuxcommand.org) |
The Linux Command Line(linuxcommand.org) |
Yep, the command line is what lets you solve "We have 178 CSV dumps of tables that has ~60 GB of data and we want them imported into a SQL database, there's no previous DB schema info, here's a zip file of questionably named CSVs, can you have this done in 2 days?". Meanwhile there's 8,000+ columns of data that are strings, booleans, datetimes, etc. and some of the files are 15 GB each.
It didn't take too much shell scripting to solve that problem in a way that you can run it against a directory of CSV files and have it produce SQL files with table schemas that can be created and then generate the SQL to efficiently import them from a CSV. Basically a little bit of shell scripting and using tools like find, head, sed, grep, wc and friends. It took 4 hours to solve the problem in a way that was testable.
One of my biggest moments was figuring out you could run SQL from the command line, using environment variables in the query string.
Got me started. Probably still useful, hardly available (unless your library has it.)
I have a collection of resources for Linux CLI and Shell Scripting here: https://learnbyexample.github.io/curated_resources/linux_cli...
/bin/true && echo "indeed" && /bin/false || echo "uh?"
It amuses me quite a lot to see how my code changes when I'm thinking more with my "dba hat" or my "unix hat".
The unix hat will make much more use of embedded environment variables and shell constructs for loops/repetition, the DBA hat leans towards tally tables, sometimes 2 phase sql generation, or stored procedures.
It does make my scripts inconsistent as all hell tho, which is irritating - but the shel, approaches usually flow more easily from my fingers, while the pure sql ones will usually have the edge in runtime performance - shifting back to shell if I need to chunk up a job for parallel processing..
A recent fun one for me was realising the I could lean on the the posix shells printf feature of repeatedly using the format string to consume parameters if there are more parameters that format placeholders.. I can actually pass thousands of lines of sql output to printf and have printf reformat it. Some relatively easy scripting with "split" to chunk up the db output leaves me with something that only runs about 2x slower that raw output from the db (in ksh ), but gives lots of flexibility on things like adding thousands separators to numbers etc, something our db platform is poor at.
Given that most of the line by line iteration is happening inside printf, not in a script for/while loop, it's really quite quick. Haven't benchmarked it against anything yet tho, it's still a toy I wrote while waiting on something else.. but it's got potential.
Judicious use of the old ascii separator codes rather than commas/tabs/pipes/whatever makes the field splitting much easier to deal with too.
Embedded newlines remain a pain tho. For now each chunk of lines gets passed through a perl fragment to turn them into literal \r or \n strings
Ha ha, nice.
You can do something similar to that with awk.
The classic book titled The Unix Programming Environment, by Kernighan and Pike, has some good examples of that.
https://en.m.wikipedia.org/wiki/The_Unix_Programming_Environ...
You have to quote and unquote parts of the awk command appropriately, so that some parts of the command are processed directly by awk, and others by the shell (first).