Tips for analyzing logs

300 points by historynops 3 years ago | 134 comments

One of my pet peeves is "The Useless Use of cat Award". Someone awarded it to me as a teenager in the late 90s and I've been sore ever since.

Yup, it's often a waste of resources to run an extra 'cat'. It really demonstrates that you don't have the usage of the command receiving the output completely memorized. You know, the thousand or so commands you might be piping it into.

But, if you're doing a 'useless' use of cat, you're probably just doing it in an interactive session. You're not writing a script. (Or maybe you are, but even still, I bet that script isn't running thousands of times per second. And if it is, ok, time to question it).

So you're wasting a few clock cycles. The computer is doing a few billion of these per second? By the time you explain the 'useless' use of cat to someone, the time you wasted explaining to them why they are wrong, is greater than the total time that their lifetime usage of cat was going to waste.

There's a set of people who correct the same three pairs of homophones that get used incorrectly, but don't know what the word 'homophone' is. (Har har, they're/their/there). I always liken the people who are so quick to chew someone out for using cat, in the same batch of people who do this: what if I just want to use cat because it makes my command easier to edit? I can click up, warp to the front of the line, and change it real quick.

Sorry. I did say, it is a pet peeve.

amarshall 3 years ago | |

I do “useless” use of cat quite often because, in my brain, the pipeline naturally starts with “given this file”, so it makes the pipeline more consistent e.g. `cat f | a | b | c` rather than `a < f | b | c` where one must start with the first “function” rather than with the data. I see starting with `cat` analogous to the `->` thread macro in Clojure, `|>` pipe in Elixir, and `&` reverse application operator in Haskell. If bash permitted putting the filename first, I’d stop using `cat`; alas, it does not.

aendruk 3 years ago | | |

There’s this. Though it would look nicer with another |.

  <f a | b | c

bornfreddy 3 years ago | |

> One of my pet peeves is "The Useless Use of cat Award". Someone awarded it to me as a teenager in the late 90s and I've been sore ever since.

Wear it as a badge of honor! It marks you as a person who puts clarity, convenience and simplicity before raw performance. I can't think of a single case when that bit of performance matters.

Needless to say, I'm happily using cat (uselessly) myself and have no plans to convert.

btschaegg 3 years ago | | |

> It marks you as a person who puts clarity, convenience and simplicity before raw performance.

This. As noted, even in scripts it usually makes more sense since the result is a pipeline that's easier to read, annotate and modify.

Case in point:

  cat file.txt \
  | sed '1s/^\xEF\xBB\xBF//' `# Strip UTF-8 BOM at the beginning of the file` \
  | ...

Specifying a file name would only make the "black-magic-line" of `sed` more complicated while also making it more complicated to modify the pipeline itself. Now, if I want to skip that step to test something, I don't have to figure out if/how the next command takes an input file (or, ironically, replace `sed` with `cat`).

kazinator 3 years ago | |

> It really demonstrates that you don't have the usage of the command receiving the output completely memorized.

No, it demonstrates that you don't have redirection memorized, and don't know that you can place it anywhere in the command line, including on the left.

> So you're wasting a few clock cycles

Keystrokes too:

   cat x | cmd
   < x cmd

It's also possible that cmd may detect that its standard input is connected to a real file, and take advantage of being able to lseek() the file descriptor. For instance say that x is an archive and cmd is an extractor. If cmd needs to skip some indicated offset to get to the desired payload material, it may be able to do it efficiently with an lseek, whereas under cat, it has to read all the bytes and discard them to get to the desired offset.

foobarian 3 years ago | | |

I still prefer to have cat there because it is interchangeable with other output-producing commands and it can handle globs. In an interactive session I iterate on the last command many times, and if I decide to filter stuff can just replace cat with grep or if I decide to pull from a directory of files can add a glob, if compressed it turns to zgrep or zcat etc. With redirects I'd have to change the structure of the pipeline which wastes mental effort. IMO.

infogulch 3 years ago | |

> the time [used to] explain the 'useless[ness]' .. of cat to someone .. is greater than the total time that their lifetime usage of cat was going to waste

If you look for situations like this they are surprisingly common.

> pet peeve

tksb 3 years ago | |

I don't often (err, ever…) reply without reading further but this time I must, because: I've never heard this turn of phrase "useless use of cat" and it turned my brain upside-down for a moment, because: "interactive" is precisely how I learn and do and I suppose it was a nice reminder that sometimes really big (read: useless) things are actually kinda small (useful) and vice versa.

marcosdumay 3 years ago | |

Well, you waste an entire fork and exec, so I believe you are underestimating the time by a few orders of magnitude. Also, it's almost always grep following the cat, so it's not much to memorize.

But it's well worth wasting a process to have a nice pipeline where each command does a single thing so you can easily reason about them.

dllthomas 3 years ago | |

It's a lot more than the few extra cycles to spin up the process - it's also an extra copy of all the data. Usually that's also not much, but occasionally it's everything, as the consuming program can seek in a file, but not in a pipe, so might otherwise only need a tiny bit of the data.

nikau 3 years ago | |

It totally makes sense, after all the electron apps/docker containers/statically linked utilities/microservices running on my machine, needlessly running cat might be the straw that breaks the camels back.

pantulis 3 years ago | | |

not counting the subpixel aliasing for your terminal font ;)

TacticalCoder 3 years ago | |

> Sorry. I did say, it is a pet peeve.

You're not alone. I disable the warning in shellcheck. I've been uselessly cat'ing since the 90s, like you.

civopsec 3 years ago | |

Doesn’t people like Rob Pike argue for `cat |` since it fits so orthogonally with the rest of the pipeline?

chaps 3 years ago | |

It's a perfectly valid pet peeve, especially with things like `tr` being weird about accepting files as input.

This works:

  cat file.log | tr 'a' 'b'

This doesn't:

  tr 'a' 'b' file.log

kazinator 3 years ago | | |

   tr a b < file.log

   < file.log tr a b

The pattern "cat file | command" introduces an unnecessary cat process to achieve the effect of "< file command".

It is not called "useless use of cat" (UUoC) without justification.

psd1 3 years ago | |

> cat

> pet peeve

nice

heliostatic 3 years ago |

My biggest quality of life improvement for understanding logs has been lnav (https://lnav.org/) -- does everything mentioned in this post in a single tool with interactive filtering and quick logical and time based navigation.

chaps 3 years ago |

One thing I've done to identify infrequent log entries within a log file is to remove all numbers from a file and print out a frequency of each. Basically just helps to disregard timestamps (not just at the beginning of the line), line numbers, etc.

  cat file.log | sed 's/[0-9]//g' | sort | uniq -c | sort -nr

This has been incredibly helpful in quickly resolving outages more than once.

spydum 3 years ago | |

Brilliant hack. I've used just about all the tricks from the blog and many of the comments here, but never this one. I've stripped timestamps for sure, but never considered all numerics. Nice one !

Noumenon72 3 years ago | |

I love this tip for all kinds of diffing. I'm so sure I'm going to use it that I already assigned it an alias to strip numbers from whatever's on my Mac clipboard:

    alias numberless="pbpaste | sed 's/[0-9]//g' | pbcopy"

defrost 3 years ago | |

It's probably worth exploring making that a frequency bucketing pipe that dashboards N minute intervals .. so that operators can see any abrupt changes.

Severian 3 years ago |

My tips:

1) Fuck grep, use ripgrep, especially if you have to scour over an entire directory.

2) Get good with regex, seriously, it will shave hours off your searching.

3) For whatever application you are using, get to know how the logging is created. Find the methods used where said logs are made, and understand why such a log line exists.

4) Get good with piping into awk if needed if you need some nice readable output.

nerdponx 3 years ago | |

Piping into AWK feels like a misuse of the tool in all but the simplest of cases. Don't forget that you can write complete AWK scripts and invoke them from a file!

jpitz 3 years ago | |

I use ack for this purpose, because the color filtering options are so good.

https://beyondgrep.com/

hayst4ck 3 years ago | | |

Every time I've done research (twice) ripgrep comes out on top.

Usually it's something like rg > ag > ack > grep and that's overall testimonial, not just speed.

What am I missing?

jldugger 3 years ago |

Honestly, the most amazing thing I did with logs was learn how to do subtraction. Any time you have multiple instances of a thing and only some of them are bad, you can easily find the problem (if anyone bothered to log it) by performing bad - good.

The way you do this is by aggregating logs by fingerprints. Removing everything but punctuation is a generic approach to fingerprinting, but is not exactly human friendly. For Java, log4j can use class in your logging pattern, and that plus log level is usually pretty specific.

Once you have a fingerprint, the rest is just counting and division. Over a specific time window, count the number of log events, for every finger print, for both good and bad systems. Then score every fingerprint as (1+ # of bad events) / (1 + # of good events) and everything at the top is most strongly bad. And the more often its logged, the further up it will be. No more lecturing people about "correct" interpretations of ERROR vs INFO vs DEBUG. No more "this ERROR is always logged, even during normal operations".

10000truths 3 years ago | |

Isn’t this basically what structured binary logs are? Instead of writing a string like `Error: foo timeout after ${elapsed_amt} ms` to the log, you write a 4-byte error code and a 4 byte integer for elapsed_amt. I know there are libraries like C++’s nanolog that do this for you, under the hood.

nerdponx 3 years ago | | |

Maybe this is a silly question, but is there much value in a 4-byte binary code compared to a human-readable log with human-readable codes? Maybe size, but logfmt especially is not much less compact than binary data.

jldugger 3 years ago | | |

That solves the fingerprint, but you still need to count and score.

nerdponx 3 years ago | |

You might be interested in the TF-IDF algorithm used in information retrieval and text classification.

jldugger 3 years ago | | |

Yes, this is pretty much TF-IDF for people too lazy to count the number of unique items in the corpus.

Since that number should be the same (or at least close!) in both good and bad datasets, I'm not sure the extra math matters much.

geocrasher 3 years ago |

One thing I didn't see was how to use GREP to view the lines before and after a match:

   grep regex /var/log/logfile -A5 #To view the next 5 lines
   grep regex /var/log/logfile -B5 #To view the previous 5 lines
   grep regex /var/log/logfile -05 #To view the 5 lines before *and* after the match

This is super handy to find out what happened just before a service crashed, for example.

oftenwrong 3 years ago | |

    grep regex /var/log/logfile -05 #To view the 5 lines before *and* after the match

This appears to work with GNU grep, at least, but YSK that the POSIX grep option for this is `-C`.

sisk 3 years ago |

Loosely related: a few years ago I wanted a simpler alternative to some of the more feature-full log viewers out there so I threw together a tiny (50kb) app that might be useful to some folks in here.

All it does is consistently colors the first field in a line from stdin so you can quickly see which log lines have the same first field.

I used it in combination with the parallel[0] command to prefix log lines by replica name when tailing logs across machines: https://github.com/jasisk/color-prefix-pipe

[0]: https://www.gnu.org/software/parallel/

hosteur 3 years ago | |

Nice tool. You should really consider including a screenshot showing the actual coloured output on the GitHub README.

tetha 3 years ago |

As much as I approve of a skillset to analyze local logs, but after a relatively small scale (10-20 systems), a central decent log aggregation like opensearch or ELK just brings so much value even on 1-3 nodes. It'd be one of the first changes I make to an infrastructure because it's so powerful.

And its not just log searching and correlation value. At work, the entire discussion "oh but we need access to all servers because of logs" just died when all logs were accessible via one web interface. I added a log aggregation and suddenly only ops needed access to servers.

Designing that thing with accessibility and discoverability in mind is a whone nother topic though.

sofixa 3 years ago | |

Just to expand on this more - you don't have to use ELK which is pretty resource and maintenance heavy, there are much easier and faster alternatives that don't need attentive care.

Throw Loki+Grafana on a single VM somewhere (or run it on your Kubernetes/Nomad/ECS/etc. cluster) and it will get you very far, as long as you plan ahead a bit (most notably, indexing happens at ingestion, so you need to have an idea of what you want from your logs or your queries will be slower).

Or use a SaaS like logz.io, AWS OpenSearch, Datadog, etc. Most support OpenTelemetry now, so switching data ingestion is quite easy (unlike dashboards and alerts).

It makes sense to have centralised logs pretty much as soon as you outgrow the "everything runs on this one box and my DR plan is a prayer" stage, IMO.

rsdbdr203 3 years ago | |

Agreed! Centralizing logs is so helpful, but you don't know it until you've done it. Too many people rely upon grep, when a handy tool is just a download away. Plug for said too: https://log-store.com

Noumenon72 3 years ago | |

We went to the trouble of setting up ELK and I was excited, but I never use it any more. You can't scroll/search through the logs as fast as vi and grep. You have to click to see more log lines. And I can write an alias to tail a log so I don't have to log into a UI.

bombolo 3 years ago | |

syslog has allowed forwarding logs long before elk

chrisweekly 3 years ago |

Glad to see lnav (https://lnav.org) already getting some love in the comments. Hands-down the most reliable source of "thank you! I wish I'd known about this tool sooner!" responses, even from v experienced sysadmins / SREs.

chasil 3 years ago |

A minor optimization is collapsing the grep -v, from this:

  cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | grep -v THING4

to this:

  egrep -v 'THING1|THING2|THING3|THING4' file

That gets rid of the cat and three greps. Both POSIX and GNU encourage grep -E to be used in preference to egrep.

A pcregrep utility also used to exist, if you want expansive perl-compatible regular expressions. This has been absorbed into GNU grep with the -P option.

kerblang 3 years ago |

> you’ll get overwhelmed by a million irrelevant messages because the log level is set to INFO

I know this happens, but I think it's because programmers are abusing INFO. In principle it's reserved for messages that are informative at a level sys admins and a few others can make sense of and use. Unfortunately abuse often leads to "We turned INFO off" making it much harder to diagnose things after the fact.

pantulis 3 years ago | |

I think there should be an counterpart to "log analysis" which is "logging strategies for your app". Which should be WHAT to log and WHEN.

Stuff like, if you are exposing an HTTP endpoint, you should log the request URL and the time it took to serve it. Or if you are invoking an extenral service, you should log the response time of that service.

lowbloodsugar 3 years ago | | |

And you should produce a single line of output for each request that identifies all of the pertinent information. You can have more than one, for e.g. a thread dump, but there should be one that provides a complete summary. I've lived with apps that logged at each stage of the process as separate lines, and that's just not useful data when grepping for anomalies.

kjellsbells 3 years ago |

When I was training sysadmins back in the dark ages, one of the rules I taught was: know what good looks like in your logs. If you are scanning hundreds of lines of logging under duress to find a smoking gun, and you don't know the difference between what the logs normally show, and what you are seeing, you'll waste a lot of time.

Corollary is that good day logs should be minimal and "clean", e.g not logging a lot, or, logging nice and predictably (which makes them easy to strip out via grep -v, etc.)

Manjuuu 3 years ago |

> Often log lines will include a request ID.

Yes, always include a request id in every request structure you create and include it also in the response and print it. It would seem something obvious that everyone does by default but instead, no, it's not so obvious it seems.

janci 3 years ago | |

Not so obvious. How to implement it without passing request id to all the functions, when they are unrelated to request/http ? Especially in languages without thread-locals such as javascript?

gh123man 3 years ago | | |

Many frameworks solve this with logger context. Add the properties you want to the logging context and all future logs in that context will have that property.

One simple example - Serilog enrichment https://github.com/serilog/serilog/wiki/Enrichment

This does however assume your entire application uses the same logger. But this is generally a requirement in services that require tracing like this.

mrloba 3 years ago | | |

Nodejs now has async local storage built in, which makes this fairly easy.

veyh 3 years ago | | |

For example: https://github.com/puzpuzpuz/cls-rtracer

yakshaving_jgt 3 years ago | | |

You can stick it in a middleware which wraps the request lifecycle.

linsomniac 3 years ago |

I recently used clickhouse-local to do some log analysis on a lot of elastic load balancer logs (~10s of GBs) and it was spectacular.

In short, you can add clickhouse-local to a shell pipeline and then run SQL queries on the data. An example from the docs:

$ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' | clickhouse-local --structure "user String, mem Float64" --query "SELECT user, round(sum(mem), 2) as memTotal FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"

ljw1004 3 years ago |

I wrote https://github.com/ljw1004/seaoflogs - an interactive filtering tool, for similar ends to what's described here. I wrote it because my team was struggling to analyze LSP logs (that's the protocol used by VSCode to communicate with language servers). But I made it general-purpose able to analyze more log formats too - for instance, we want to correlate LSP logs with server logs and other traffic logs.

(1) I wanted something where colleagues could easily share links in workplace chat with each other, so we could cooperatively investigate bugs.

(2) For LSP we're often concerned with responsiveness, and I thought the best way to indicate times when viewing a log is with whitespace gaps between log messages in proportion to their time gap.

(3) For LSP we have lots of interleaved activity going on, and I wanted to have visual "threads" connecting related logs.

(4) As the post and lnav say, interactivity is everything. I tried to take it a step further with (1) javascript, (2) playground-style updates as you type, (3) autocomplete which "learns" what fields are available from structured logs.

My tool runs all in the browser. (I spent effort figuring out how people can distribute it safely and use it for their own confidential logs too). It's fast enough up to about 10k lines of logs.

unwind 3 years ago | |

That sounds really interesting, although too limited for my current work focus which deals with Really Big logs.

I tried the demo link, it rendered a lonely @ at the top left of the page, the rest was blank. The console said:

    Uncaught TypeError: [].filter(...).reverse()[0] is undefined
        onload https://ljw1004.github.io/seaoflogs/demo.html:14

This was in Firefox 98.0 in Windows.

hayst4ck 3 years ago |

Some extra tips:

  Keep access logs, both when a service receives a request and finishes a request.
  Record request duration.
  Always rotate logs.
  Ingest logs into a central store if possible.
  Ingest exceptions into a central store if possible.
  Always use UTC everywhere in infra.
  Make sure all (semantic) lines in a log file contain a timestamp.
  Include thread ids if it makes sense to.
  It's useful to log unix timestamp alongside human readable time because it is trivially sortable.
  Use head/tail to test a command before running it on a large log file.

If you find yourself going to logs for time series data then it is definitely time to use a time series database. If you can't do that, at least write a `/private/stats` handler that displays in memory histograms/counters/gauges of relevant data.

Know the difference between stderr and stdout and how to manipulate them on the command line (2>/dev/null is invaluable, 2>&1 is useful), use them appropriately for script output.

Use atop, it makes debugging machine level/resource problems 10 fold easier.

Have a general knowledge of log files (sometimes /var/log/syslog will tell you exactly your problem, often in red colored text).

If you keep around a list of relevant hostnames:

  cat $hostname_list_file | xargs -P $parallelness -I XHOSTNAME ssh XHOSTNAME -- grep <request_id> <log_file>

This needs to be used carefully and deliberately. This is the style of command that can test your backups. This style command has caused multiple _major_ outages. With it, you can find a needle in a haystack across an entire fleet of machines quickly and trivially. If you need to do more complex things, `bash -c` can be the command sent to ssh.

I've had an unreasonable amount of success opening up log files in vim and using vim to explore and operate on them. You can do command line actions one at a time (:!$bash_cmd), and you can trivially undo (or redo) anything to the logs. Searching and sorting, line jumping, pagedown/up, etc, diffing, jump to top of file or bottom, status bar telling you how far you are into a file or how many lines it has without having to wc -l, etc.

Lastly, it's great to think of the command line in terms of map and reduce. `sed` is a mapping command, `grep` is a reducing command. Awk is frequently used for either mapping or reducing.

jeppesen-io 3 years ago |

also, journalctl. Very nice to have all logs in one place in a semi-structured nature

`journalctl -u myservice -S "5 min ago"`

phillipcarter 3 years ago |

Surprised to see that under the section "correlate between different systems" tracing isn't mentioned as an alternative approach. That's what tracing is: logging across different systems and getting that structure all stitched together for you.

2devnull 3 years ago |

“scroll through the log really fast”

This is a great point and probably underrated.

You may also benefit by scrolling sideways really fast.

https://en.m.wikipedia.org/wiki/Superior_colliculus

edweis 3 years ago |

Best tips I discovered: use emojis. They have colors and they are easy to spot.

For instance if an API call is made use the phone emoji, when there is a timeout use a clock, when an order is dispatched used a package...

When you have to go through huge log file it is a life saver.

jldugger 3 years ago | |

This seems like a prime candidate for a "7 tips for app devs SREs will hate" article, that only works because you're the only one doing it.

baud147258 3 years ago |

A few weeks ago I had a windows installer that was silently failing when upgrading from an older version (installation from scratch was working without issues). And as windows install logs aren't exactly easy to read, I was stumped, until I took an upgrade log from an older, working build, strip all date information from both files and compare them, checking all the sections which were different, until I found a line indicating that a colleague had forgotten about a limitation when dealing with msp (don't delete components on minor upgrades, but I didn't throw any stones as I've done the same mistake, twice, one and two years ago...)

Too 3 years ago |

One of my favorite tricks is to use a visual difftool.

Copy good log into left panel. Copy bad log into right panel. Quickly show which lines are new, which are missing and which are out of order. Obviously ignore the timestamps ;)

Noumenon72 3 years ago | |

I do this all the time (PyCharm has an amazing diff tool hidden under Cmd-Shift-A, Show Diff). With the command from above that removes all the numeric characters it's going to be even easier to compare, no more timestamps!

oxff 3 years ago |

Just use structured logs, its 2022 ffs.

julian_sark 3 years ago |

> cat file | grep -v THING1 | grep -v THING2 | grep -v THING3 | grep -v THING4

Keyboard manufacturers HATE this simple trick:

grep -vE "THING1|THING2|THING3|THING4" file

spc476 3 years ago |

One thing we did at my previous job was to add a "trace flag" to each account. Normally, they log nothing about a transaction (other than it happened), but if the trace flag is set, then a lot of information is logged about the transaction. Also, this trace flag is propagated throughout the distributed system they have, so we can trace the action across the network.

kqr 3 years ago |

One thing that's improved my log analysis is learning awk (well actually Perl but I think 95 % of what I do with Perl I could also do with just awk). Often the most useful way to look at logs is statistically, and awk can quickly let you aggregate things like time-between-statements, counts, rates, state transition probabilities, etc for arbitrary patterns.

trashcan 3 years ago |

Also, sometimes logs stretch across multiple lines, and the other lines won't have the identifier you are searching for. For example, Java stack traces. In that case if you are stuck parsing unstructured logs, the simplest thing to do is to look at the entire file and search for the timestamp that found the first line.

29athrowaway 3 years ago |

Your friends

- grep (or ripgrep), sort, uniq, head, tail, cut, colrm, tr

- diff

- jq

- bat

- ministat

- less, more, percol

- Facebook Path Picker (fpp)

arminiusreturns 3 years ago | |

One of my favorites that I don't see used enough is:

    watch -d -n $INT

Watch for delta changes in interval.

I have lots of bad habits on the cli I should correct, now I'm feeling guilty after reading all the comments.

pdimitar 3 years ago | |

jless for JSON is very good.

zug_zug 3 years ago |

I feel like this is a huge anti-pattern. Use a hosted service that does all of this for you, and then have a whole query language, build alerts, graphs, etc based on these results.

It's not super cheap, but it's 10x cheaper than wasting dev time in the terminal. (Sumologic, splunk are the two I can vouch for)

emmelaich 3 years ago |

I found the histogram technique to be really helpful. Slight mod - I tend to sort reverse at the end of the pipeline (sort -rn); then |head is often more useful.

It's also good to have histograms by hour or day. I've hacked up scripts to do this but I should really make something better!

AtNightWeCode 3 years ago |

Usage of correlation ids and structured logs is pretty much standard. Don't go down this path.

terom 3 years ago |

Re timing, logs like nginx access logs have their timestamp from when the request completed, not when the request came in. That's a significant difference for long duration (~10s+) requests, and matters when trying to correlate logs or metrics to a request.

_8j50 3 years ago |

I saw a tip a whilr back about not needing to keep adding "| grep -v stuff" but instead "grep -e -v stuff -v stuff2" i remember getting it to work on Linux but last i tried that on macos I didn't have much luck

danjc 3 years ago |

We've added a log tailing feature into our product UI which also has a basic find/filter. It's been enormously useful for cases where something weird happens as you can immediately access the last few mins of logs.

andrewgilmartin 3 years ago |

If I had a nickel for every time I have used this pattern

... | perl -ne 'print "$1 $2 ...\n" if /some-regex-with-capture-groups/' | ...

didip 3 years ago |

If you are using Kubernetes, I highly recommend using https://github.com/stern/stern

dopidopHN 3 years ago | |

I second that. It's fast and conveniant, and you can visualize logs accross several instances easily,

febed 3 years ago |

No mention of klogg? It’s a lifesaver when opening huge log files

kamma4434 3 years ago |

Jeez, this stuff is frontpage on HN? Sounds… pretty basic. I’m sure our AI overlords could produce deeper content.

hex4def6 3 years ago | |

I'm interested in this idea as well. Seems like it would be useful for detecting unusual issues assuming you have a large enough data set of "normal" runs.

I've been mulling this idea over in my head as well. I have a fleet of PCs out in the wild, all running the same software. It would be nice to have an easy way to detect strange behavior like processes that are continually respawning / segfaults / crashes / etc, without explicitly writing a bunch of search terms.

phillipcarter 3 years ago | |

Yours is an unhelpful and non-constructive comment. Clearly a lot of people have been getting something out of the content in this post, as it's started several discussions.

viraptor 3 years ago | |

People on HN have varied level of skills, and this is a well structured introduction to diving into logs. It already started conversation about better tooling. Let's celebrate today's lucky 10000 https://xkcd.com/1053/ rather than talk something down for being basic.

DesiLurker 3 years ago |

in addition to these one of my favorite is a perl one liner that generates time delta from a regex pattern of interest. And then I plot it using gnuplot. It seriously helps to 'see' the events with timing in a chart & allows you to do a quick visual search for problem areas.

alexpetralia 3 years ago |

Has anyone tried passing logs into chatGPT? I had thought it would be especially effective here.

viraptor 3 years ago | |

It may work for a few lines, but you can't really pipe a GB of logs into it, which is what we typically deal with.

It works for simple explanations though.

par 3 years ago |

My tips for analyzing logs:

cd ~/logs

cat * | grep <my string>

mxuribe 3 years ago | |

I guess it depends, but would you not encounter issues like permission blocks, or there being so much content in the logs that it is slow to find what you need?

bombolo 3 years ago |

journalctl supports grep, queries by priority, time, daemon, and custom defined fields.

the_arun 3 years ago |

Do we still use utilities like grep for searching logs? Are these when we cannot stream logs to tools like Splunk & Loggly and use their search services?

slyall 3 years ago | |

One problem is that you start using these services and then the bill arrives and you then spend all your time removing stuff from the logs to keep the bill down.

ranger207 3 years ago | |

All of our logs are in Kibana, but sometimes I'll `kubectl logs pod > tmp; grep pattern tmp` because Kibana's search and filtering is often annoying. Actually, I'll usually open the logs in vim and use its search, which also get me the ability to eg delete every line not containing the pattern and then search only things matching the first pattern. I'm going to try lnav as mentioned in this thread but I've gotten by fine with judicious use of grep

* Monitoring files/directories for updates * Decompressed files/archives * Detected the log format for each file * Created SQLite vtables that provide access to log messages * Built an index of all the log messages in all the files so you can jump to a certain point in time or to the next/previous error message. * Display all log messages with syntax highlighting

# pcre2grep bash: pcre2grep: command not found... # yum install pcre2grep Last metadata expiration check: 1:58:58 ago on Tue 13 Dec 2022 11:45:44 AM CST. No match for argument: pcre2grep Error: Unable to find a match: pcre2grep # yum whatprovides pcre2grep Last metadata expiration check: 2:09:25 ago on Tue 13 Dec 2022 11:45:44 AM CST. Error: No matches found. # rpm -qa | grep pcre2 | sort pcre2-10.40-2.0.2.el9.x86_64 pcre2-syntax-10.40-2.0.2.el9.noarch pcre2-utf32-10.40-2.0.2.el9.x86_64 # which grep /usr/bin/grep # ldd /usr/bin/grep | grep pcre libpcre.so.1 => /lib64/libpcre.so.1 (0x00007efc473c4000)

$ grep --version | head -n2 grep (GNU grep) 3.8 Copyright (C) 2022 Free Software Foundation, Inc. $ ldd /usr/bin/grep linux-vdso.so.1 (0x00007ffd2ddd5000) libpcre2-8.so.0 => /usr/lib/libpcre2-8.so.0 (0x00007f4f88b81000) libc.so.6 => /usr/lib/libc.so.6 (0x00007f4f8899a000) /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f4f88c6f000)