I've found that, over time, crawlers drown out the numbers of actual visitors, and I find GoAccess hard to use for getting any meaningful data when interesting things do happen.
Can anyone suggest a way I can do something similar to this without relying on a service I don't host (and without having to write parsers into a SQL or similar DB by hand)?
[0] - https://lnav.org
I host a static site for my blog using Hugo, so "no server" etc. is exactly what I need, and writing the filters/queries myself leaves me in control of getting what I need out of them.
[Edit] Sorry, I read "which I don't host" instead of the other way around. You can check out the open-source core library; that might work for you if you put in some work.
Another thing to do on the cheap, if you want more usable logs, is JSON logging [1] (one object per line). This is trivial to import into PostgreSQL and also trivial to query as-is with tools like jq.
[1] Example: https://stackoverflow.com/questions/25049667/how-to-generate...
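To make the "one object per line" idea concrete, here is a minimal Python sketch of querying such a log without any database. The field names (ip, path, status, user_agent) and the sample lines are made up for illustration; your server's JSON log format will likely differ.

```python
import json
from collections import Counter

# Hypothetical sample log: one JSON object per line. Field names are assumptions.
SAMPLE_LOG = """\
{"ip": "203.0.113.5", "path": "/blog/post-1", "status": 200, "user_agent": "Mozilla/5.0"}
{"ip": "198.51.100.7", "path": "/blog/post-1", "status": 200, "user_agent": "ExampleBot/1.0"}
{"ip": "203.0.113.9", "path": "/about", "status": 200, "user_agent": "Mozilla/5.0"}
{"ip": "203.0.113.5", "path": "/missing", "status": 404, "user_agent": "Mozilla/5.0"}
"""


def parse_json_lines(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def hits_per_path(records, skip_agents=("bot",)):
    """Count status-200 requests per path, skipping user agents containing a marker."""
    counts = Counter()
    for rec in records:
        agent = rec.get("user_agent", "").lower()
        if any(marker in agent for marker in skip_agents):
            continue  # crude crawler filter, for illustration only
        if rec.get("status") == 200:
            counts[rec["path"]] += 1
    return counts


records = parse_json_lines(SAMPLE_LOG)
print(hits_per_path(records))
```

The substring check on the user agent is deliberately naive; real crawler detection needs more than that, but the point is that each line is already structured data.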
I would love to hear feedback, as we plan to fully release it soon :)
[0] https://github.com/pirsch-analytics/pirsch
[2] https://docs.pirsch.io/get-started/backend-integration/
[Edit]
I forgot to mention my website, which I initially created Pirsch for. The article I wrote about the issue and my solution is here: https://marvinblum.de/blog/server-side-tracking-without-cook...
It was useful for me when tweaking spam/bot detection rules a while ago; if I could roughly describe a rule in a query, I could back-test it on old traffic and follow up on questionable-looking results (e.g. what other requests did this IP make around the time of the suspicious ones?). We also used Athena on a project looking into performance, and on network flow logs. The lack of recurring charges for an always-on cluster makes it great for occasional use like that.
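The back-testing workflow above was done with SQL in Athena; the following is only a rough in-memory Python sketch of the same idea, not the actual queries. The records, field names, and the rule itself are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical, already-parsed request records; field names are assumptions.
REQUESTS = [
    {"ip": "203.0.113.5", "time": datetime(2021, 1, 10, 12, 0, 0), "path": "/wp-login.php"},
    {"ip": "203.0.113.5", "time": datetime(2021, 1, 10, 12, 0, 30), "path": "/blog/post-1"},
    {"ip": "203.0.113.5", "time": datetime(2021, 1, 10, 18, 0, 0), "path": "/about"},
    {"ip": "198.51.100.7", "time": datetime(2021, 1, 10, 12, 0, 10), "path": "/blog/post-1"},
]


def looks_suspicious(req):
    """Illustrative rule only: flag probes for a login page the site does not have."""
    return req["path"].endswith("/wp-login.php")


def backtest(requests, rule):
    """Return every historical request the candidate rule would have flagged."""
    return [r for r in requests if rule(r)]


def nearby_requests(requests, hit, window=timedelta(minutes=5)):
    """Other requests from the same IP within `window` of a flagged request."""
    return [
        r for r in requests
        if r is not hit
        and r["ip"] == hit["ip"]
        and abs(r["time"] - hit["time"]) <= window
    ]


flagged = backtest(REQUESTS, looks_suspicious)
for hit in flagged:
    for other in nearby_requests(REQUESTS, hit):
        print(hit["ip"], other["time"].isoformat(), other["path"])
```

The second function is the "what else did this IP do around then?" follow-up: it helps judge whether a flagged request came from a real visitor or a scanner before the rule goes live.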
You can use what the docs call "partition projection" to efficiently limit the date range of logs to look at (https://docs.aws.amazon.com/athena/latest/ug/partition-proje...), so it was free-ish to experiment with a query on the last couple days of data before looking further back.
More generally, Athena/Presto/Hive support various data sources and formats (including applying regexps to text). Compressed plain-text formats like ALB logs can already be surprisingly cheap to store/scan. If you're producing/exporting data, it's worth looking into how these tools "like" to receive it--you may be able to use a more compact columnar format (Parquet or ORC) or take advantage of partitioning/bucketing (https://docs.aws.amazon.com/athena/latest/ug/partitions.html, https://trino.io/blog/2019/05/29/improved-hive-bucketing.htm...) for more efficient querying later.
As the blog post notes, usability was...imperfect, especially during initial setup. Error messages sometimes point at one of the first few tokens of the SQL, nowhere near the mistake, and there are lots of knobs to tweak, some controlled by 'magical.dotted.names.in.strings'. CLIs were sometimes easier than the GUI. But you can get a lot out of it once you've got it working!
"Both Google Analytics and Goatcounter agreed that I got ~13k unique visitors across the couple days where it spiked. GoAccess and my own custom Athena queries agreed that it was more like ~33k unique visitors, giving me a rough ratio of 2.5x more visitors than reported by analytics, and meaning that about 60% of my readers are using an adblocker."
Then I used... OctoSQL to analyze it!
Nit: The project may have seemed dead for a few months, but I'm just in the midst of a rewrite (on a branch) that gets rid of wrong decisions and makes it easier to embed in existing applications.
To really jump into this, take a look at https://trino.io/blog/2020/12/27/announcing-trino.html.
A few more stats and info:
Trino commits: 22,383
Presto commits: 18,582
Trino Slack members: 3,603
Presto Slack members: 1,575
Trino supports Iceberg: https://trino.io/docs/current/connector/iceberg.html
HDP3 support: https://github.com/trinodb/trino/issues/1218
Trino has addressed a critical security vulnerability that still exists in Presto: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-1508...
Give our repo a star if you have a sec: https://github.com/trinodb/trino/blob/master/.github/star.pn...
Yeah, I mean, I'm just running a Django site, so I imagine I could add a custom middleware that makes an API request on every page load. I guess it would have to check whether the access token has expired first and, if so, grab a new one before making the hit? Is that the recommended setup?
Would I be able to pass extra information to be included in the logs, like e.g. username?
Also, I know you have good privacy policies, but sending this information through a request still makes me nervous, even though it's of course miles better than JS-based solutions. But what are your thoughts on how possible is it for these requests to be intercepted and this logged data siphoned off by someone else?
Exactly. The token expires after 15 minutes, so you need to check the response and issue a new token should it have expired. You can read our docs on how to do that or take a look at our Go SDK [0] and re-implement it in Python. Unfortunately, I don't have enough time to provide one right now.
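As a rough idea of what that refresh-on-expiry flow could look like in Python (for example, called from a Django middleware), here is a minimal sketch. It is not the Pirsch API or SDK: `fetch_token` and `send_hit` are hypothetical callables standing in for the real HTTP requests, and treating a 401 status as "token expired" is an assumption.

```python
import time


class HitClient:
    """Sends page hits, refreshing the access token when it expires.

    `fetch_token` and `send_hit` are hypothetical callables standing in for
    the real HTTP calls; they are not part of any actual SDK.
    """

    def __init__(self, fetch_token, send_hit, token_ttl=15 * 60, clock=time.monotonic):
        self._fetch_token = fetch_token
        self._send_hit = send_hit
        self._token_ttl = token_ttl  # 15 minutes, per the comment above
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def _refresh(self):
        self._token = self._fetch_token()
        self._expires_at = self._clock() + self._token_ttl

    def hit(self, payload):
        # Refresh proactively if there is no token yet or its lifetime has passed.
        if self._token is None or self._clock() >= self._expires_at:
            self._refresh()
        status = self._send_hit(self._token, payload)
        # Assumption: an expired token is reported as HTTP 401; retry once.
        if status == 401:
            self._refresh()
            status = self._send_hit(self._token, payload)
        return status


# Demo with in-memory fakes instead of real HTTP requests.
issued = []


def fake_fetch_token():
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1]


def fake_send_hit(token, payload):
    # Pretend the server only accepts the most recently issued token.
    return 200 if token == issued[-1] else 401


now = [0.0]  # fake clock, in seconds
client = HitClient(fake_fetch_token, fake_send_hit, clock=lambda: now[0])
first_status = client.hit({"url": "/blog/post-1"})
now[0] += 16 * 60  # jump past the 15-minute lifetime
second_status = client.hit({"url": "/about"})
print(first_status, second_status, issued)
```

Checking the clock avoids a wasted request in the common case, while the single retry covers a token that was rejected earlier than expected. In a real middleware you would probably also send the hit off the request path so a slow API call does not delay the page.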
> Would I be able to pass extra information to be included in the logs, like e.g. username?
That's not possible right now, but you will be able to send custom events in the future.
> But what are your thoughts on how possible is it for these requests to be intercepted and this logged data siphoned off by someone else?
Highly unlikely. All traffic is SSL encrypted, the internal communication of our server cluster is encrypted, the database, ... I mean, software can always be hacked, but I spend a lot of my time on infrastructure and security.
[0] https://github.com/pirsch-analytics/pirsch-go-sdk/blob/maste...
We're trying to improve the docs and blog about confusing topics. We also started a twitchcast to dig into various technical topics around Trino: https://www.twitch.tv/trinodb. You can catch old episodes here: https://trino.io/broadcast/episodes.html.
My comment was not directly related to yours. I agree that DNS analytics are probably worse; I was just wondering whether it is theoretically possible to produce high-accuracy analytics when everything can be spoofed, cached, etc.