Tips for analyzing logs(jvns.ca) |
Tips for analyzing logs(jvns.ca) |
- grep (or ripgrep), sort, uniq, head, tail, cut, colrm, tr
- diff
- jq
- bat
- ministat
- less, more, percol
- Facebook Path Picker (fpp)
watch -d -n $INT
Watch for delta changes in interval.I have lots of bad habits on the cli I should correct, now I'm feeling guilty after reading all the comments.
It's not super cheap, but it's 10x cheaper than wasting dev time in the terminal. (Sumologic, splunk are the two I can vouch for)
It's also good to have histograms by hour or day. I've hacked up scripts to do this but I should really make something better!
... | perl -ne 'print "$1 $2 ...\n" if /some-regex-with-capture-groups/' | ...
I've been mulling this idea over in my head as well. I have a fleet of PCs out in the wild, all running the same software. It would be nice to have an easy way to detect strange behavior like processes that are continually respawning / segfaults / crashes / etc, without explicitly writing a bunch of search terms.
It works for simple explanations though.
cd ~/logs
cat * | grep <my string>
Yup, it's often a waste of resources to run an extra 'cat'. It really demonstrates that you don't have the usage of the command receiving the output completely memorized. You know, the thousand or so commands you might be piping it into.
But, if you're doing a 'useless' use of cat, you're probably just doing it in an interactive session. You're not writing a script. (Or maybe you are, but even still, I bet that script isn't running thousands of times per second. And if it is, ok, time to question it).
So you're wasting a few clock cycles. The computer is doing a few billion of these per second? By the time you explain the 'useless' use of cat to someone, the time you wasted explaining to them why they are wrong, is greater than the total time that their lifetime usage of cat was going to waste.
There's a set of people who correct the same three pairs of homophones that get used incorrectly, but don't know what the word 'homophone' is. (Har har, they're/their/there). I always liken the people who are so quick to chew someone out for using cat, in the same batch of people who do this: what if I just want to use cat because it makes my command easier to edit? I can click up, warp to the front of the line, and change it real quick.
Sorry. I did say, it is a pet peeve.
<f a | b | cWear it as a badge of honor! It marks you as a person who puts clarity, convenience and simplicity before raw performance. I can't think of a single case when that bit of performance matters.
Needless to say, I'm happily using cat (uselessly) myself and have no plans to convert.
This. As noted, even in scripts it usually makes more sense since the result is a pipeline that's easier to read, annotate and modify.
Case in point:
cat file.txt \
| sed '1s/^\xEF\xBB\xBF//' `# Strip UTF-8 BOM at the beginning of the file` \
| ...
Specifying a file name would only make the "black-magic-line" of `sed` more complicated while also making it more complicated to modify the pipeline itself. Now, if I want to skip that step to test something, I don't have to figure out if/how the next command takes an input file (or, ironically, replace `sed` with `cat`).No, it demonstrates that you don't have redirection memorized, and don't know that you can place it anywhere in the command line, including on the left.
> So you're wasting a few clock cycles
Keystrokes too:
cat x | cmd
< x cmd
It's also possible that cmd may detect that its standard input is connected to a real file, and take advantage of being able to lseek() the file descriptor. For instance say that x is an archive and cmd is an extractor. If cmd needs to skip some indicated offset to get to the desired payload material, it may be able to do it efficiently with an lseek, whereas under cat, it has to read all the bytes and discard them to get to the desired offset.If you look for situations like this they are surprisingly common.
> pet peeve
But it's well worth wasting a process to have a nice pipeline where each command does a single thing so you can easily reason about them.
You're not alone. I disable the warning in shellcheck. I've been uselessly cat'ing since the 90s, like you.
This works:
cat file.log | tr 'a' 'b'
This doesn't: tr 'a' 'b' file.log tr a b < file.log
< file.log tr a b
The pattern "cat file | command" introduces an unnecessary cat process to achieve the effect of "< file command".It is not called "useless use of cat" (UUoC) without justification.
> pet peeve
nice
cat file.log | sed 's/[0-9]//g' | sort | uniq -c | sort -nr
This has been incredibly helpful in quickly resolving outages more than once. alias numberless="pbpaste | sed 's/[0-9]//g' | pbcopy"1) Fuck grep, use ripgrep, especially if you have to scour over an entire directory.
2) Get good with regex, seriously, it will shave hours off your searching.
3) For whatever application you are using, get to know how the logging is created. Find the methods used where said logs are made, and understand why such a log line exists.
4) Get good with piping into awk if needed if you need some nice readable output.
Usually it's something like rg > ag > ack > grep and that's overall testimonial, not just speed.
What am I missing?
The way you do this is by aggregating logs by fingerprints. Removing everything but punctuation is a generic approach to fingerprinting, but is not exactly human friendly. For Java, log4j can use class in your logging pattern, and that plus log level is usually pretty specific.
Once you have a fingerprint, the rest is just counting and division. Over a specific time window, count the number of log events, for every finger print, for both good and bad systems. Then score every fingerprint as (1+ # of bad events) / (1 + # of good events) and everything at the top is most strongly bad. And the more often its logged, the further up it will be. No more lecturing people about "correct" interpretations of ERROR vs INFO vs DEBUG. No more "this ERROR is always logged, even during normal operations".
Since that number should be the same (or at least close!) in both good and bad datasets, I'm not sure the extra math matters much.
grep regex /var/log/logfile -A5 #To view the next 5 lines
grep regex /var/log/logfile -B5 #To view the previous 5 lines
grep regex /var/log/logfile -05 #To view the 5 lines before *and* after the match
This is super handy to find out what happened just before a service crashed, for example. grep regex /var/log/logfile -05 #To view the 5 lines before *and* after the match
This appears to work with GNU grep, at least, but YSK that the POSIX grep option for this is `-C`.All it does is consistently colors the first field in a line from stdin so you can quickly see which log lines have the same first field.
I used it in combination with the parallel[0] command to prefix log lines by replica name when tailing logs across machines: https://github.com/jasisk/color-prefix-pipe
And its not just log searching and correlation value. At work, the entire discussion "oh but we need access to all servers because of logs" just died when all logs were accessible via one web interface. I added a log aggregation and suddenly only ops needed access to servers.
Designing that thing with accessibility and discoverability in mind is a whone nother topic though.
Throw Loki+Grafana on a single VM somewhere (or run it on your Kubernetes/Nomad/ECS/etc. cluster) and it will get you very far, as long as you plan ahead a bit (most notably, indexing happens at ingestion, so you need to have an idea of what you want from your logs or your queries will be slower).
Or use a SaaS like logz.io, AWS OpenSearch, Datadog, etc. Most support OpenTelemetry now, so switching data ingestion is quite easy (unlike dashboards and alerts).
It makes sense to have centralised logs pretty much as soon as you outgrow the "everything runs on this one box and my DR plan is a prayer" stage, IMO.
cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | grep -v THING4
to this: egrep -v 'THING1|THING2|THING3|THING4' file
That gets rid of the cat and three greps. Both POSIX and GNU encourage grep -E to be used in preference to egrep.A pcregrep utility also used to exist, if you want expansive perl-compatible regular expressions. This has been absorbed into GNU grep with the -P option.
I know this happens, but I think it's because programmers are abusing INFO. In principle it's reserved for messages that are informative at a level sys admins and a few others can make sense of and use. Unfortunately abuse often leads to "We turned INFO off" making it much harder to diagnose things after the fact.
Stuff like, if you are exposing an HTTP endpoint, you should log the request URL and the time it took to serve it. Or if you are invoking an extenral service, you should log the response time of that service.
Corollary is that good day logs should be minimal and "clean", e.g not logging a lot, or, logging nice and predictably (which makes them easy to strip out via grep -v, etc.)
Yes, always include a request id in every request structure you create and include it also in the response and print it. It would seem something obvious that everyone does by default but instead, no, it's not so obvious it seems.
One simple example - Serilog enrichment https://github.com/serilog/serilog/wiki/Enrichment
This does however assume your entire application uses the same logger. But this is generally a requirement in services that require tracing like this.
In short, you can add clickhouse-local to a shell pipeline and then run SQL queries on the data. An example from the docs:
$ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | clickhouse-local --structure "user String, mem Float64" --query "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"
(1) I wanted something where colleagues could easily share links in workplace chat with each other, so we could cooperatively investigate bugs.
(2) For LSP we're often concerned with responsiveness, and I thought the best way to indicate times when viewing a log is with whitespace gaps between log messages in proportion to their time gap.
(3) For LSP we have lots of interleaved activity going on, and I wanted to have visual "threads" connecting related logs.
(4) As the post and lnav say, interactivity is everything. I tried to take it a step further with (1) javascript, (2) playground-style updates as you type, (3) autocomplete which "learns" what fields are available from structured logs.
My tool runs all in the browser. (I spent effort figuring out how people can distribute it safely and use it for their own confidential logs too). It's fast enough up to about 10k lines of logs.
I tried the demo link, it rendered a lonely @ at the top left of the page, the rest was blank. The console said:
Uncaught TypeError: [].filter(...).reverse()[0] is undefined
onload https://ljw1004.github.io/seaoflogs/demo.html:14
This was in Firefox 98.0 in Windows. Keep access logs, both when a service receives a request and finishes a request.
Record request duration.
Always rotate logs.
Ingest logs into a central store if possible.
Ingest exceptions into a central store if possible.
Always use UTC everywhere in infra.
Make sure all (semantic) lines in a log file contain a timestamp.
Include thread ids if it makes sense to.
It's useful to log unix timestamp alongside human readable time because it is trivially sortable.
Use head/tail to test a command before running it on a large log file.
If you find yourself going to logs for time series data then it is definitely time to use a time series database. If you can't do that, at least write a `/private/stats` handler that displays in memory histograms/counters/gauges of relevant data.Know the difference between stderr and stdout and how to manipulate them on the command line (2>/dev/null is invaluable, 2>&1 is useful), use them appropriately for script output.
Use atop, it makes debugging machine level/resource problems 10 fold easier.
Have a general knowledge of log files (sometimes /var/log/syslog will tell you exactly your problem, often in red colored text).
If you keep around a list of relevant hostnames:
cat $hostname_list_file | xargs -P $parallelness -I XHOSTNAME ssh XHOSTNAME -- grep <request_id> <log_file>
This needs to be used carefully and deliberately. This is the style of command that can test your backups. This style command has caused multiple _major_ outages. With it, you can find a needle in a haystack across an entire fleet of machines quickly and trivially. If you need to do more complex things, `bash -c` can be the command sent to ssh.I've had an unreasonable amount of success opening up log files in vim and using vim to explore and operate on them. You can do command line actions one at a time (:!$bash_cmd), and you can trivially undo (or redo) anything to the logs. Searching and sorting, line jumping, pagedown/up, etc, diffing, jump to top of file or bottom, status bar telling you how far you are into a file or how many lines it has without having to wc -l, etc.
Lastly, it's great to think of the command line in terms of map and reduce. `sed` is a mapping command, `grep` is a reducing command. Awk is frequently used for either mapping or reducing.
`journalctl -u myservice -S "5 min ago"`
This is a great point and probably underrated.
You may also benefit by scrolling sideways really fast.
For instance if an API call is made use the phone emoji, when there is a timeout use a clock, when an order is dispatched used a package...
When you have to go through huge log file it is a life saver.
Copy good log into left panel. Copy bad log into right panel. Quickly show which lines are new, which are missing and which are out of order. Obviously ignore the timestamps ;)
Keyboard manufacturers HATE this simple trick:
grep -vE "THING1|THING2|THING3|THING4" file
1) systemd logs (binary data stored at /var/log/journal)
2) docker logs (from various containers running on the host)
Any hints on getting this working? I don't see this covered in the docs (https://docs.lnav.org/en/latest/formats.html).
journalctl -o json | lnavUntil now I thought logview.el[0] is the bee's knees, but now I can feel feature envy set in. There are some seriously powerful ideas listed on the lnav page, and it's also the first time I saw SQLite virtual tables used in the wild.
--
It works by uploading a stub that is an "actually portable executable" to the remote over ssh. lnav then talks to the stub to sync files back to the local host. I don't know the scale of the hosts or files, but if they're not too big, it might work fine.
Seems to be packaged quite widely https://repology.org/project/lnav/versions
$ multitail /var/log/install.log -I /var/log/system.log
You get a view with the tail from one file followed by the tail from the other, they are not collated by timestamp. In contrast, if you do the same thing in lnav: $ lnav /var/log/install.log /var/log/system.log
You will get a single view with all of the log messages from both files and they're sorted by their timestamps. Here is all of what lnav is doing: * Monitoring files/directories for updates
* Decompressed files/archives
* Detected the log format for each file
* Created SQLite vtables that provide access to log messages
* Built an index of all the log messages in all the files so
you can jump to a certain point in time or to the
next/previous error message.
* Display all log messages with syntax highlighting* powerful filtering (including with regex)
* smooth scrolling of millions of rows of data
* support for csv, text, xml, json, Excel and other formats
* available for Window and Mac
'pcregrep' still exists. But with PCRE2 supplanting PCRE, it is now spelled 'pcre2grep'.
I don't know the precise history of 'grep -P' and whether 'pcregrep' was actually absorbed into it, but 'pcregrep' is its own thing with its own features. For example, it has a -M/--multiline flag that no standard grep (that I'm aware of) has. (Although there are some work-arounds, e.g., by treating NUL as the line terminator via the -z/--null-data flag in GNU grep.)
GNU grep is also linked to pcre, not pcre2.
# pcre2grep
bash: pcre2grep: command not found...
# yum install pcre2grep
Last metadata expiration check: 1:58:58 ago on Tue 13 Dec 2022 11:45:44 AM CST.
No match for argument: pcre2grep
Error: Unable to find a match: pcre2grep
# yum whatprovides pcre2grep
Last metadata expiration check: 2:09:25 ago on Tue 13 Dec 2022 11:45:44 AM CST.
Error: No matches found.
# rpm -qa | grep pcre2 | sort
pcre2-10.40-2.0.2.el9.x86_64
pcre2-syntax-10.40-2.0.2.el9.noarch
pcre2-utf32-10.40-2.0.2.el9.x86_64
# which grep
/usr/bin/grep
# ldd /usr/bin/grep | grep pcre
libpcre.so.1 => /lib64/libpcre.so.1 (0x00007efc473c4000)PS I checked back and I used sed rather than grep. I think the result would hold for grep but the morale is that you should verify rather than assume.
I have around 50 seds in the pipe, running in parallel (which is what makes it faster), it would have been a half of that when I tried the rewrite.
$ fgrep -h
fgrep: warning: fgrep is obsolescent; using grep -F
$ egrep -h
egrep: warning: egrep is obsolescent; using grep -EOff the cuff, another way to do it is with awk's index function. I don't know what speed penalty this might impose.
$ cat ~/fmgrep
#!/bin/awk -f
BEGIN { split(ARGV[1], s, ","); ARGV[1]="" }
{ m = 1; for(c in s) if(index($0, s[c])) m = 0; if(m) print }
$ ~/fmgrep sshd,systemd,polkitd /var/log/securefgrep -v -e THING1 -e THING2 -e THING3 -e THING4
grep -v -e THING1 -e THING2 -e THING3 -e THING4 file
Can Amazon do this? They use UTC and your local browser’s time seemingly randomly depending on AWS service, and it drives me nuts. They usually (not always) put the timezone next to it, but why can’t they just have a mandate that it either is or is not UTC?! (The worst one is that the Lambda console is UTC but cloudwatch isn’t, so you think you haven’t received a request in hours but then you did)
From a personal taste perspective, I'm not a fan of either. Having a floating "<" at the start of a line just isn't my cup of tee. Not dealing with explicit stdin/stdout just makes my code easier to read. And especially considering the post's advice is about reading logs, a lot of the post is very likely built around outage resolution. Not the time I want to be thinking "oh yeah `tr` is special and I need to be explicit" -- nah, just use cat as a practice. And no, I'm not going to write `grep blah < filename` as a practice just because of commands like `tr` being weird.
But honestly, if it's such a big deal to have a cat process floating around, there are probably other things you should be concerned about. "Adds extra load to the server" points to other problems. If perf matters, CPU shielding should be used. Or if that's not an option, then sure, there's some room for trifling around, but if you're at a point where you're already running a series of pipes, a single cat command is beans compared to the greps and seds that come after it.
$ grep --version | head -n2
grep (GNU grep) 3.8
Copyright (C) 2022 Free Software Foundation, Inc.
$ ldd /usr/bin/grep
linux-vdso.so.1 (0x00007ffd2ddd5000)
libpcre2-8.so.0 => /usr/lib/libpcre2-8.so.0 (0x00007f4f88b81000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f4f8899a000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f4f88c6f000)
As for pcre2grep, Archlinux includes it as part of the pcre2 package: $ pacman -Qo $(which pcre2grep)
/usr/bin/pcre2grep is owned by pcre2 10.40-3
So this is a distro packaging thing. But what I said is true: pcregrep, pcre2grep and grep -P are all distinct things.I second the above, just one pain point with multitail to add. I often page/search/filter in the scrollback buffer (I typoed "bugger" - Freudian slip?) and in multitail the scrollback is a separate window with a frame and everything, which is a pain (copying whole lines using mouse includes the frame, ugh). The filtering/searching being a separate pain.
One thing I used in multitail and not sure if I migrated wholly to lnav was log file syntax highlighting using regexes.
For log colorizing, I make use of the --color-match and --passthru arguments, like https://powdahound.com/2009/10/colorize-log-output-with-ack/
cat UNLICENSE
| rg --passthru --color always --colors 'match:fg:red' public
| rg --passthru --color always --colors 'match:fg:green' heirs
| rg --passthru --color always --colors 'match:fg:cyan' benefit