- Designing Data-Intensive Applications. Great overview of... basically everything, and every chapter has dozens of references. Can't recommend it enough.
- Read papers. I've had lots of a-ha moments going to Wikipedia and looking up the oldest paper on a topic (wtf was in the water in Massachusetts in the 70s..). Yes they're challenging, no they're not impossible if you have a compsci-undergrad-equivalent level of knowledge.
- Try and build toy systems. I built out some small and trivial implementations of CRDTs here https://lewiscampbell.tech/sync.html, mainly by reading the papers. They're subtle but they're not rocket science - mere mortals can do this if they apply themselves!
- Follow cool people in the field. Tigerbeetle stands out to me despite sitting at the opposite end of the consistency/availability spectrum from where I've made my nest. They really are poring over applied dist sys papers and implementing them. I joke that Joran is a dangerous man to listen to because his talks can send you down rabbit-holes and you begin to think maybe he isn't insane for writing his own storage layer..
- Did I mention read papers? Seriously, the research of the smartest people on planet earth is on the internet, available for your consumption, for free. Take a moment to reflect on how incredible that is. Anyone anywhere on planet earth can git gud if they apply themselves.
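For anyone wondering what a "toy CRDT" from the list above might look like, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs in the literature. It is not the code from the linked page; the class and method names are made up for illustration.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> total increments made by that replica

    def increment(self, n=1):
        if n < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        # The counter's value is the sum of every replica's slot.
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing
```

After the merges both replicas report the same total, which is the whole point: no coordination is needed at write time.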
There is a flood of papers out there with unrepeatable processes. Where can you find quality papers to read?
A year or two ago I read Ralph Kimball’s seminal The Data Warehouse Toolkit. While I could see why it’s still often recommended, it was showing its age in many ways (though it is a fair bit older than DDIA). It felt like a mix of best practices and dated advice, but it was hard to tell for certain what was what.
It's not really a book about 'best practices', despite the name. It's more like an encyclopaedia, covering every approach out there, putting them in context, linking to copious reference papers, and talking about their properties on a very conceptual and practical level. It's not really like 'hey use this database vendor!!!'.
Hmm. The techniques used today were invented in the 1970s-80s.
For example in the world of engines, Heywood is basically god: (https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=John...) has 29k citations!
"Network partitioning can completely destroy mutual consistency in the worst case, and this fact has led to a certain amount of restrictiveness, vagueness, and even nervousness in past discussions, of how it may be handled"
https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2015/Papers...
But as a general starting point, all roads seem to lead to Lamport 78 (Time, Clocks). If you have a specific area of interest I or others might be able to point you in the right direction.
Any advice on how to approach this?
https://twitter.com/DominikTornow
https://twitter.com/jorandirkgreef
https://twitter.com/JungleSilicon
You can also follow me. Not saying I'm cool but I do re-tweet cool people:
How did you get started, and what would you recommend for pivoting into this space?
RocksDB is an example of that.
I am playing around with SIMD, multithreaded queues and barriers. (Not on the same problem)
I haven't read the DDIA book.
I used Michael Nielsen's consistent hashing code for distributing SQL database rows between shards.
I have an eventually consistent protocol that is not linearizable.
I am currently investigating how to efficiently schedule system events (such as a TCP socket becoming ready for reading, EPOLLIN, or ready for writing, EPOLLOUT) rather than data events.
I want super flexible scheduling styles of control flow. I'm looking at barriers right now.
I am thinking how to respond to events with low latency and across threads.
I'm playing with some coroutines in assembly by Marce Coll and looking at algebraic effects
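On the consistent hashing mentioned above: a minimal sketch of a hash ring that maps row keys to shards. This is not Michael Nielsen's code; the `HashRing` class, the shard names, and the choice of 64 virtual nodes per shard are assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, shards, vnodes=64):
        # Each shard is placed at `vnodes` pseudo-random points on the ring.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Use the first 8 bytes of SHA-256 as a 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def shard_for(self, row_key):
        # A key belongs to the first ring point clockwise from its hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._points, self._hash(row_key))
        return self._ring[idx % len(self._ring)][1]


keys = [f"row-{i}" for i in range(1000)]
before = HashRing(["s0", "s1", "s2"])
after = HashRing(["s0", "s1", "s2", "s3"])
moved = [k for k in keys if before.shard_for(k) != after.shard_for(k)]
```

The property that makes this attractive for sharding is visible in `moved`: adding `s3` only relocates keys onto `s3`, and every other key stays where it was.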
> Another example is figuring out the right tradeoffs between using local SSD disks and block-storage services (AWS EBS and others).
Local disks on AWS are not appropriate for long term storage, because when an instance reboots the data will be lost. AWS also doesn't offer huge amounts of local storage.
There are AWS instance types (I3en) with large and very fast SSDs (many times higher IOPS than EBS).
Amazon, Google, MS, these companies print money, have built up massive engineering cultures to run reliable storage. I just don't see what the value is in trusting data to some VC funded group over proven engineering work.
I worked on one of these in-house storage systems, all we did was look at how the cloud providers did things already for inspiration. Might as well just use those. IDK maybe someone can convince me of the value?
And some of the people in those VC-funded groups were alumni of those providers too. :)
-L
I'm given to understand Snowflake runs its own cloud platform, at least in part.
Can someone please elaborate on that? What does it mean in conjunction with S3 and a DB? I know how traditional DBs work (PostgreSQL and MySQL). I know how S3 works (open-source implementations like minio). But S3 is not a random access file on block storage, which is a prerequisite for PostgreSQL and MySQL. How is that solved for S3-based DBs? Can someone point me to the docs, or even better an open-source implementation?
...but there is a lot of noise in those software papers, too - you are often disappointed by the fine print unless you have good curators/thought-leaders [2] - we all should share names ;)
enjoying the discussion though - very timely if you ask me.
-L, author of [1] below.
[0] - The original Calvin paper -
https://cs.yale.edu/homes/thomson/publications/calvin-sigmod...
[1] - How Fauna implements a variation of Calvin -
https://fauna.com/blog/inside-faunas-distributed-transaction...
[2] - A great article about Calvin by Mohammad Roohitavaf - https://www.mydistributed.systems/2020/08/calvin.html?m=1#:~....
Pure speculation on my behalf though.
It works well if the data is stored as immutable files (e.g., a log-structured merge tree) or is not indexed at all (classical columnstores). S3 doesn't provide an efficient way to update a file.
[1] https://dl.acm.org/doi/10.1145/2882903.2903741 (Snowflake SIGMOD paper)
[2] https://dl.acm.org/doi/10.1145/3514221.3526055 (SingleStore SIGMOD paper)
1. S3 as main storage with a write-through cache.
2. S3 as a cold tier in tiered storage.
It works well because the data is organized by a set of immutable parts called MergeTree. These data parts are atomically created, merged, and deleted, but never modified.
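To make the "created, merged, and deleted, but never modified" idea concrete, here is a toy sketch of a store built from immutable sorted parts. It only illustrates the general pattern, not how any real MergeTree engine is implemented; `PartStore` and its methods are hypothetical names.

```python
import heapq


class PartStore:
    """Toy store of immutable sorted parts (illustrative sketch)."""

    def __init__(self):
        # A tuple of parts; each part is a sorted tuple of (key, value) rows.
        self.parts = ()

    def insert(self, rows):
        # Every insert creates a brand-new part; existing parts are untouched.
        part = tuple(sorted(rows))
        self.parts = self.parts + (part,)

    def merge_all(self):
        # Build the merged part first, then publish it with one assignment.
        # The old parts are simply dropped, never edited in place.
        merged = tuple(heapq.merge(*self.parts))
        self.parts = (merged,) if merged else ()

    def scan(self):
        # Reads merge the sorted parts on the fly.
        return list(heapq.merge(*self.parts))


store = PartStore()
store.insert([(3, "c"), (1, "a")])
store.insert([(2, "b")])
old_parts = store.parts
rows_before = store.scan()
store.merge_all()
rows_after = store.scan()
```

Because a part is only ever written once, each one maps naturally onto a single object-store upload, and a merge is just "upload the new object, then forget the old ones".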
S3 does not work well with random access... but neither is it needed.
Present in buffer pool? -> Present on local disk? -> Retrieve from S3/Azure/GCP.
The challenge becomes optimizing this -- speculatively pulling pages in, background evictions, etc.
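The lookup chain above (buffer pool, then local disk, then object storage) can be sketched roughly as follows. Plain dicts stand in for the local disk and for S3/Azure/GCP, and the LRU policy and tiny pool capacity are assumptions chosen to keep the example short; prefetching and background eviction are left out.

```python
from collections import OrderedDict


class TieredPageStore:
    """Read path: buffer pool -> local disk -> object store (sketch)."""

    def __init__(self, object_store, pool_capacity=2):
        self.object_store = object_store   # stand-in for S3/Azure/GCP
        self.local_disk = {}               # stand-in for a local SSD cache
        self.buffer_pool = OrderedDict()   # in-memory pages in LRU order
        self.pool_capacity = pool_capacity

    def read_page(self, page_id):
        """Return (page bytes, name of the tier that served the read)."""
        if page_id in self.buffer_pool:
            self.buffer_pool.move_to_end(page_id)  # mark as recently used
            return self.buffer_pool[page_id], "buffer_pool"
        if page_id in self.local_disk:
            data, tier = self.local_disk[page_id], "local_disk"
        else:
            data, tier = self.object_store[page_id], "object_store"
            self.local_disk[page_id] = data  # keep a local copy for next time
        self._admit(page_id, data)
        return data, tier

    def _admit(self, page_id, data):
        self.buffer_pool[page_id] = data
        if len(self.buffer_pool) > self.pool_capacity:
            self.buffer_pool.popitem(last=False)  # evict least recently used


pages = TieredPageStore({"p1": b"a", "p2": b"b", "p3": b"c"}, pool_capacity=2)
tiers = [pages.read_page(p)[1] for p in ["p1", "p1", "p2", "p3", "p1"]]
```

The last read of `p1` is served from the local disk tier: the page was evicted from the two-slot buffer pool when `p3` came in, but the copy fetched earlier from the object store is still on disk.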
Garbage collecting old pages also turns out to be complicated. Doing a full trace for expired versions in secondary storage on disk is slow but conceivable. Doing it across petabytes in the cloud, with all the problematic latencies and reliability issues that come with network access... limits the approaches you can take.
They are not new problems -- DBMS development has always been about juggling the trade-offs in performance of different levels in the memory hierarchy. But it permits higher scale.
Most of that research is decades old. I specifically remember Lamport timestamps. Not only has that held up, it's unlikely to go anywhere anytime soon. Most topics covered are as fundamental as the two generals problem; almost philosophical.
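For reference, the Lamport timestamp rule fits in a few lines. This is a minimal sketch of the logical clock from the 1978 paper; the class and method names are made up for illustration.

```python
class LamportClock:
    """Logical clock: a counter that respects the happens-before relation."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: increment before every local event.
        self.time += 1
        return self.time

    def send(self):
        # Sending is a local event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, message_time):
        # Rule 2: jump past the sender's timestamp, then count the receive event.
        self.time = max(self.time, message_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()              # local event on a
stamp = a.send()      # a sends a message
received_at = b.receive(stamp)
```

The guarantee is one-directional: if one event happened before another, its timestamp is smaller, so `received_at` is always greater than `stamp`. The converse does not hold, which is why vector clocks exist.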
No database vendor can solve the issues around two concurrently existing write masters. Sync will be necessary, conflicts will occur. A concrete vendor could only hope to make that less painful (CRDTs for automatic conflict resolution, ...). That's kind of the level that book operates at.
That only applies if the clocks are within 1ms of each other, so around 100 miles (or equivalently: within a single cloud region), and only came into force in 2014.
The bound that Spanner-likes keep is ~3ms for datacenters across continents, and that was in 2012.
Most of the stuff I read isn't "your SAAS app should have atomic clocks" and more "here's some maths on why it works, here's some explanation of what we were going for, here's some pseudocode".
I’m familiar with the concept wrt mathematics (in particular in the context of ulter-a-filters as my favorite professor would say it), but I don’t see the necessity in most CS research.
Or have I completely missed the point of your question..?
ChatGPT likely won't help - but you could look into the fact that the earth isn't round, being an oblate spheroid with 20km less radius at the poles than the equator.
Of course the fact that the ideal WGS84 ellipsoid, the official mean global sea level, and the geoid (gravitational surface of equipotential) don't all align must surely come into play here - and that bloody great "gravitational hole" somewhere south of Ceylon.
https://www.e-education.psu.edu/geog862/node/1820