Things to know about databases(architecturenotes.co) |
Things to know about databases(architecturenotes.co) |
Show me either flowcharts and/or tables, it doesn’t matter, I’ll continue to be mystified.
I love this statement. It's true too, having seen a decades-old database that needed to be converted to Postgres. The old application was going to be thrown away, but the data was still relevant :).
Databases in heavy use will not just outlast your application, they have a strong chance of outlasting your career and they very well may outlast you as a person.
It's going to be interesting when this same problem occurs years from now when people are trying to reverse schemas from NoSQL databases or if they become difficult to extract.
The only sticking point is when business logic is put into stored procedures. On one hand if you're building an app on top of it, there's a temptation to extract and optimize that logic in your new back-end. On the other hand, it is kind of nice to even have it at all should the legacy app go poof.
Also, in my experience, the database is almost always the main cause of any performance issues. I would much rather hire someone who is very good at making the database perform well than making the front end perform well. If you are seeking to be a full stack developer, devote much more time to the database layer than anything else.
I would be careful with the term "cause". There is a symbiotic relationship between the application and the database. Or, if talking to a DBA...a database and its applications. Most databases can store any sets of arbitrary information...but how they are stored (read: structure) must take into account how the data is to be used. When the database designer can be told up-front (by the app dev team) considerations can be made to optimize performance along whatever vector is most desired (e.g. read speed, write speed, consistency, concurrency, etc). Most database performance issues result when these considerations are left out. Related: Just because a query works (ie. returns the right data) does not mean it's the best query.
If your database is great, at least you have the option of a fast backend.
More generically, state stores are almost always bottlenecks (they tend to be harder to scale without some tradeoff)
I’ve been working with and on databases for a long, long time, and I’ve even written about things I think people should know about if they want to do this, yet I never came up with such great insight. This is so true it should be engraved somewhere. Hats off!
1. If you store data in rows it's quite fast to insert/update/delete individual rows. Moreover, it's easy to do it concurrently. However reads can be very slow because you read the entire table if you scan a single column. That's why OLAP databases use column storage.
2. If you sort insert data in the table, reading ranges based on the sort key(s) is very fast. On the other hand inserts may spray data over over the entire table, (eventually) forcing writes to all blocks, which is very slow. That's why many OLTP databases use heap (unsorted) row organization.
In small databases you don't notice the differences, but they become dominant as volume increases. I believe this fact alone explains a lot of the proliferation of DBMS types as enterprise datasest have grown larger.
Edit: minor clarification
I’ve migrated ORM several times, and the only thing that changes is the entity definition. The database remains the same.
I don't know if there is a single soul who believes this. If you are designing a database, it is much more cooler than front end apps.
https://users.ece.utexas.edu/~adnan/pike.html
Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
I'm curious: what's the alternative to NULL? I'm struggling to think of a database where NULL wouldn't be super useful. It feels like NULL as a concept is almost required, but I think you're suggesting that's a faulty assumption.
Would love to hear more about this.
But if a columns truly has unknown values, NULL's are the best ways to represent it. It is sometimes suggested to use "sentinel values" like empty string or -1 to represent missing values, but IMHO this is much worse than NULL's, since these will be treated as regular values by operators. When you have missing values, you want three-valued logic.
Fixing up a database step by step is a painful process.
I really like how he (Martin Kelppman) in the book starts with a primitive data structure for constructing a database design, and then evolves the system slowly and describes the various trade offs with building a database from the ground up.
If you're choosing something other than an RDBMS - you should rethink why.
Because unless you're at massive scale (which still doesn't justify it), choosing something else is rarely the right decision.
It's even worse than this with MS SQL Server. When using the READ UNCOMMITTED isolation level it's actually possible to read corrupted data, e.g. you might read a string while it's being updated, so the result row you get contains a mix of the old value and new value of the column. SQL Server essentially does the "we got a badass over here" Neil deGrasse Tyson meme and throws data at you as fast as it can. Unfortunately I've worked on several projects where someone apparently thought that READ UNCOMMITTED was a magic "go fast" button for SQL and used it all throughout the app.
Dirty reads incidentally weren't supported for quite some time in the Sybase architecture (which forked to MS SQL Server in 1992). There was a Sybase effort to add dirty read support around 1995 or so. The project name was "Lolita."
Indexes are a nightmare to get right. Often performance optimizations of SQL databases include removing indexes as much as adding indexes.
If you are seeing performance gains from removing indexes, then I'm assuming your workload is very heavy on writes/updates compared to reads.
But yes, the issue with too many indexes is more often that they harm write performance.
A related issue is indexes that are too wide, either covering many columns or “including” them. As well as eating disk space they also eat extra memory (and potentially cause extra IO load) when used (less rows per page, so more pages loaded into RAM for the same query).
Both problems together, too many indexes many of which are too wide, usually comes from blindly accepting recommendations from automated tools (particularly when they are right that there is a problem, and it is a problem that a given index may solve, but fixing the queries so existing indexes are useful could have a much greater effect than adding the indexes).
One of my most popular StackOverflow questions to this day is about how to handle one million rows in a single MySQL table (shudder).
The product I work on now collects more rows than that a day in a number of tables.
> Therefore, if the price isn’t an issue, SSDs are a better option — especially since modern SSDs are just about as reliable as HDDs
This needs a tiny extra bit of detail: if you're buying random IO (IOPS) or throughput (MB/s), SSDs are significantly (orders of magnitude!) cheaper than HDDs. HDDs are only cheaper on space, and only if your need for throughput or IO doesn't cause you to "strand" space.
> Consistency can be understood after a successful write, update, or delete of a row. Any read request immediately receives the latest value of the row.
This isn't the ACID definition of C, and is closer to the distributed systems (CAP) one. I can't fault the article for getting this wrong, though - it's super confusing!
I have a post in draft to discuss disk trade offs which digs into this aspect, its impossible to dig into everything in this level of a post.
I know they’re trying to simplify, but this is confusing. If the first part is true, the second part can’t be. In reality the database does execute the queries concurrently, but will try to make it seem like they were done one by one. If it can’t manage that, a query will fail and have to be retried by the application.
I do appreciate the feedback and will look to add some more color here! Thank you!
I send those 2 links to coworkers all the time
I highly recommend reading https://jepsen.io/consistency and clicking on each model on the map. This is the best resource I found so far for understanding databases, especially distributed ones.
I am an expert on the subject matter, and I don't think that the overall approach is questionable. The approach that the author took seems fine to me.
The definition of certain basic concepts like 'consistency' is even confusing to experts at times. This is made all the more confusing by introducing concepts from the distributed systems world, where consistency is often understood to mean something else.
Here's an example of that that I'm familiar with, where an expert admits to confusion about the basic definition of consistency in the sense that it appears in ACID:
https://queue.acm.org/detail.cfm?id=3469647
This is a person that is a longtime peer of the people that invented the concepts!
Not trying to rigorously define these things makes a great deal of sense in the context of a high level overview. Getting the general idea across is far more important.
I think your level of abstraction is quite good for the absolute "what on earth are people talking about when they use that 'database' word?". With an extremely high level understanding, when they encounter more detail they'll have a "place to put it".
There are at least two ways (that I'm aware of) that this can be violated. For example, if you run an update statement like this:
UPDATE foo SET bar = bar + 1
Then the read of "bar" will always use the latest value, which may be different from the value other statements in the same transaction saw." Unlike SQL, it forms a logical pipeline of transformations, and supports abstractions such as variables and functions. It can be used with any database that uses SQL, since it transpiles to SQL. "
An ironic caveat to this is that balanced trees don't scale well, only offering good performance across a relatively narrow range of data size. This is a side-effect of being "balanced", which necessarily limits both compactness and concurrency.
That said, concurrent B+trees are an absolute classic and provide important historical context for the tradeoffs inherent in indexing. Modern hardware has evolved to the point where B+trees will often offer disappointing results, so their use in indexing has dwindled with time.
Which is to say they're frequently good enough such that the human working with them on whatever level can safely not know a lot of these details and get a LOT done. Kudos to whoever deserves them here.
As opposed to aspirationally discrete classifications that end up being porous, e.g. MVC, "Object Oriented" etc.
Except the most important problem: A pleasant API. Which is, no doubt, why 95% of those considering something other than an RDBMS are making such considerations.
RDBMS can have pleasant APIs. It is not a fundamental limitation. We have built layers upon layers upon layers of abstraction over popular RDBMSes to provide nice APIs and they work well enough. But those additional layers come with a lot of added complexity and undesirable dependencies that most would prefer to see live in the DBMS itself instead.
At least among the RDBMSes we've heard of, there does not seem to be much interest in improving the APIs at the service level to make them more compelling to use natively like alternative offerings outside of the relational space have done.
You either model things in a very domain specific and classic fashion. Here you get the benefit of being quite declarative and ad-hoc queries are natural. Also your schema is stronger, as in it can catch more misuse by default. But this kind of schema tends to have _logical_ repetition and is so specific that change/evolution is quite painful, because every new addition or use-case needs a migration.
Or you model things very generically, more data driven than schema driven. You lose schema strength and you definitely lose sensible ad-hoc queries. But you gain flexibility and generality and can cover much more ground.
You can kind of get around this dichotomy with views, perhaps triggers and such. In an ideal world you'd want the former to be your views and the latter to be your foundational schema.
But now you get into another problem, which is that homogeneous tables are just _super_ rigid as result sets. There are plenty of very common types you cannot cover. For example tagged unions, or any kind of even shallowly nested result (extremely common use case), or multiple result groups in one query. All of these things usually mean you want multiple queries (read transaction) or you use non-SQL stuff like building JSONs (super awkward).
If you can afford to use something like SQLite, then some of the concerns go away. The DB is right there so it's fine to query it repeatedly in small chunks.
I wonder if we're generally doing it wrong though, especially in web development. Shouldn't the backend code quite literally live on the database? I wish my backend language would be a data base query language first and a general purpose language second, so to speak. Clojure and its datalog flavors come close. But I'm thinking of something even more integrated and purpose built.
A pleasant API is clearly not the most important business problem a database is there to solve.
The data in it is presumably the life and blood of the business, whereas the API is something only developers need to deal with.
But that aside, the interface will be SQL which is quite powerful, long-lived (most important) and, fortunately, very pleasant.
For that, there are stored procedures.
I'm with you on using an RDBMS for almost everything, but worked on quite a few projects where alternatives were needed.
One involved a lot of analytics queries (aggregations, filters, grouping etc.) on ~100-200GB of data. No matter what we tried, we couldn't get enough performance from Postgres (column-based DBs / Parquet alternatives gave us 100x speedups for many queries).
Another was for storing ~100M rows of data in a table with ~70 columns or so of largely text based data. Workload was predominantly random reads of subsets of 1M rows and ~20 columns at a time. Performance was also very poor in Postgres/MySQL. We ended up using a key/value store, heavily compressing everything before storing, and got a 30x speedup compared to using an RDBMS using a far smaller instance host size.
I wouldn't call either of them massive scale, more just data with very specific query needs.
Kimball's dimensional modelling helps a lot in cases like this, since probably there is a lot of repeated data in these columns.
Every day, I see people struggling with problems that would be easy to understand if you had one. You don't even need to have an RDBMs. They are good just to model how things are related to each other.
> choosing something else is rarely the right decision
I think this is a little bit of a 'We always did it this way' statement.
More often than not it is worth sometime thinking and planning to work out at least the core requirements in that area, to save yourself a lot of refactoring (or throwing away and restarting) later, and potentially hitting bugs in production that a relational DB with well-defined constraints could have saved you from while still in dev.
Programming is brilliant. Many weeks of it sometimes save you whole hours of up-front design work.
Contrary to what people seem to assume, you actually can change the schema of a database and migrate the existing data to the new schema. There's a learning curve, but it's doable.
If you go schema-less, you run into another problem: not knowing the past shape of your data. When you try to load old records (from previous years), you may find that they don't look like the ones you wrote recently. And, if your code was changed, it may fail to handle them.
This makes it hard to safely change code that handles stored data. You can avoid changing that code, you can accept breakage, or you can do a deep-dive research project before making a change.
If you have a schema, you have a contract about what the data looks like, and you can have guarantees that it follows that contract.
I think a valid reason for not choosing a relational database is if your business plan requires that you grow to be a $100B+ company with hundreds of millions of users. Otherwise, you will probably be fine with RDBMS, even if it will require some optimizing in the future.
and that you ruled out using a JSON string column(s) as a dump for the uncertain parts, de-normalization and indexing, and the EAV schema as potential solutions to your problems.
the point is noting is free, and you have to be sure it's a price your willing to pay.
are you ready to give up joins ?, have your data be modeled after the exact queries your going to make ?, for you data to be duplicated across many places ? etc ...
That's a very good reason for going with a RDBMS even if looks like it's not the clearest winner for your use case.
If you invert any of those conditions, it may become interesting to study alternatives.
I’m curious how you handle this with less engineering effort without using an RDBMS.
What's interesting is query performance, and a RDBMS supports explicit control over indexing (usually including analyzing execution plans to find out which queries are going to work well). Where do you see "a terrible waste of engineering effort"?
If that assumption is true, then it follows that the same argument used in the last statement also applies— that if you're not at massive scale, then its likely the aforementioned tradeoff of not using RDBMS is likely de minimis.
(This assumes that the tradeoffs are of the magnitude that they only manifest impact at scale, hard to address that without concrete examples though)
The tradeoff is usually flexibility. You run into flexibility problems anytime requirements change. Scale doesn't factor in.
I have used both and have never regretted NOT using an RDBMs. Maybe its a taste thing but I'd rather use a simple K/V database than a relational database any day.
I want to be able to treat the servers in my database tier as cattle instead of pets and RDBMSs don't fit this paradigm well. Either NoSQL or NewSQL databases are, in my opinion, a much better place to start.
I feel like RDBMSs being the "default" option is because most people have worked with them in the past and already understand them. It doesn't mean they are the best tool for the job or even the tool most likely to solve the unknown problems you'll encounter in the future.
This is pure nonsense. B+Trees are used extensively and by default by 5 out of 5 of the top database systems, according to db-engines.com.
If your database engine is an old design or your data is small by modern standards, then a B+tree will be one of the few indexing algorithms available and if the data is small it will probably work. Modern database kernels targeting modern hardware and storage densities typically aren't using B+trees and the reasons why are well-understood. No one with any sense is using a B+tree to index e.g. a trillion records, which is a pretty ordinary thing to do on a single server in 2022.
You can't just swap out indexing architectures due to their dependency on storage engine and scheduling behavior, so older databases like PostgreSQL will be using B+trees for the indefinite future even if suboptimal.
The transition away from B+tree based architectures in new databases engines started about 10-15 years ago. Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.
The maximum you've told is "yeah do what scylladb does but you will still suck".
It just feels as advertisement and doesn't really add anything to the discussion I believe. All your comments are the same in all database threads.
endrant
Radically improving index compactness is achieved by loosening design constraints on B+trees: the indexes represent a partial order which only converges on a total order at the limit and the search structure is unbalanced. In the abstract these appear slightly less efficient but it enables the use of selectivity-maximizing succinct representations of the key space that can get pretty close to the information theoretic limits. Scalability gains result from the radical reduction in cache footprint when represented this way.
Optimal compressive indexes are not computable (being equivalent to AI), so the efficient approximation strategies people come up with tend to be diverse, colorful, and sometimes impractical. Tangentially, some flavors have excellent write performance. It is not a trivial algorithm problem but there are a few design families that generalize well to real databases engines. I wouldn't describe this as a fully solved problem but many ordinary cases are covered.
There isn't much incentive to design a relational database engine that can use these types of indexes, since the types of workloads and data models that recommend them usually aren't relational. Someone could, there just isn't much incentive. It is more de rigueur for graph, spatiotemporal, and some types of analytical databases, where there is no other practical option if scalability matters at all.
Big query would be one example.
All useful tools can be used incorrectly - and I agree SQL is one of the more frequently misused ones. I think a lot of that is that it's one of the more powerful tools.
With rollback, you always risk losing data - remember, you're doing it when something didn't go as you expected. What are the odds, the rollback will break something further?
These were all Postgres shops, so schema changes within a transaction - that can be rolled back safely if it fails in the middle.
Having schema is much better for ad-hoc queries though, doubly so if your schemaless types aren't JSON (e.g. protobufs).
E.g. our customers are stored in Postgres, so let's also log their actions there linked to the user table.
5 years on someone decides we need to run analytical queries across years of 200M logged actions, joined with other data in the DB.
So now we either have to live with horrible performance, migrate the logs to something suitable for OLAP (and lose all the benefits of a solid RDBMS), or have some syncing/export process to duplicate somewhere suitable for querying.
Client/front end guys: "you need to fix this query, it takes 10 minutes to run and the front end is timing out"
DBA: "err, this query wants to return 50,000 rows"
C/FEG: "yes, and?"
DBA: "what in blue blazes is your UI, or a user, going to do with 50,000 rows?"
C/FEG: "oh - we hadn't thought about that ..."
Repeatable read isolation creates read locks so that other transactions cannot write to those records. Of course our own transaction has to first wait for outstanding writes to those records to commit before starting.
Best as I know the goal is not to prevent one's own transaction from updating the records we read; the read locks will just get upgraded to write locks.
No it doesn't, that's just one possible implementation strategy. Postgres for example does not do this.
> Best as I know the goal is not to prevent one's own transaction from updating the records we read;
I'm talking about updates from other transactions. In postgres with REPEATABLE READ, the following transaction can be executed concurrently by two clients:
BEGIN
SELECT bar FROM foo WHERE id = 1; -- Returns 0
UPDATE foo SET bar = bar + 1 WHERE id = 1;
COMMIT
Both clients can see a value of "0" from the first SELECT, but after both COMMIT, the value of "bar" will be "2". ie. the "read" of "bar" in "bar = bar + 1" for one of the transactions does not use snapshot isolation.If possible, would be great to get Mark Brooker (Principal at AWS) to provide some notes on bridging the gap between CAP theorem and how AWS relaxes constraints for building out Elastic Block Storage (EBS).
I'm dealing with some pretty involved ETL at work these days, and it's really hard and ugly to do it in SQL to "stay in the DB" and keep things fast. It's tempting to read, transform in a high-level language (think pandas) then write back to the DB, but then you give up so much perf.
I would love another API than SQL like what Sparq is doing. You can keep a query optimizer, an engine to execute queries efficiently physically, etc, but it could be controlled with a nice general language API.
Another way to put it is that SQL became the standard and alternative haven't been developed much in comparison. Imagine like the ML world being stuck with Prolog or some DSL. It would be annoying to say the least. Instead people have a variety of APIs and the popular and most developed ones are actually using high-level languages to drive optimized ML engines. These convert function calls and chaining into efficient transformation that use memory, computing units (CPU, GPU, etc) efficiently. The same for DB would be wonderful.
Postgree isn’t a thing.
Postgres or PostgreSQL are acceptable. Postgre or postgree are not.
The fact that MyRocks doesn't use B+Trees seems like half the story. Less than half, even. The really important difference between MyRocks and InnoDB is that MyRocks uses log-structured storage (one LSM tree for everything), while InnoDB uses a traditional write-ahead log with checkpoints, and with logical UNDO. There are multiple dimensions to optimize here, not just a single dimension. Focusing only on time/speed is much too reductive. In fact, Facebook themselves have said that they didn't set out to improve performance as such by adopting MyRocks. The actual goal was price/performance, particularly better write amplification and space amplification.
NounFunction
NounFunctionTemporal
NounCollectionFunction
FunctionNoun_Function
FunctionNoun_FunctionItem
Noun_Collection
Noun_Collection_Function
Noun_Collection_FunctionItem
i've taken out duplicating combinations where different teams wanted same things over time, but never looked up what others had done before and decided to spin their own...Noun_Collection_Function_20200406 Noun_Collection_Function_20200320_Backup
And
Noun_Collection_Function_ForJim
And you can lock down your critical production databases from arbitrary sql
Oh wait, we do have Prisma. And it suffers from those same issues.
And that's how the scratchpad schema became mission-critical ;)
This is a good point and probably correct often enough, but I also think not understanding the entire problem you are solving is not only common, but in fact necessary to most early-stage velocity. There is need to iterate and adapt frequently, sometimes as part of your go-to-market strategy, in order to fully understand the problem space.
> a relational DB with well-defined constraints could have saved you from while still in dev
This presumes that systems built on top of non-RDBMS are incapable of enforcing similar constraints. This has not been my experience personally. But its possible I don't understand your meaning of constraints in this context. I assumed it to mean, for instance, something like schemas which are fairly common now in the nosql world. Curious what other constraints were you referencing?
If you're pivoting so hard that your SQL schema breaks, how is a schemaless system going to help you? You'll still have to either throw out your old data (easy in both cases) or figure out a way to map old records onto new semantics (hard in both cases).
I agree with GP that this is a central problem to solve, not something to figure out _after_ you write software. Build your house on stone.
I agree with your comment that it's a central problem to solve and that both options, throwing out data or map old records onto new semantics, is an endemic choice both stacks need to make. I don't agree that it's always possible to solve entirely up front though.
In my experience, it has been less so about whether the storage engine is schemaless or not, even many modern nosql stacks now ship with schemas (e.g. MongoDB). I think the differentiation I make between these platforms is mostly around APIs. Expressive, flexible semantics that (in theory) let you move quickly.
As an aside, I also think the differentiation between all these systems is largely unimpactful for most software engineers. And the choice often made is one of qualitative/subjective analysis of dev ergonomics etc. At scale there are certainly implementation details that begin to disproportionately impact the way you write software, sometimes prohibitively so, but most folks aren't in that territory.
The confusion there is due to the fact that non-R-DBMS (particular when referred to as noSQL) can mean several different things.
In this context I was replying to a comment about not knowing the shape of your data which implies that person was thinking about solutions that are specifically described as schemaless, which is what a lot of people assume (in my experience) if you say non-relational or noSQL.
That is the sort of constraints I was meaning: primary & unique keys and foreign keys for enforcing referential integrity and other validity rules enforced at the storage level. There are times when you can't enforce these things immediately with good performance (significantly distributed data stores that need concurrent distributed writes for instance - but the need for those is less common for most developers than the big data hype salespeople might have you believe) so then you have to consider letting go of them (I would advise considering it very carefully).
Are you kidding? They never can.
The entire point of ditching the relational model is discarding data constraints and normalization.
Mongo has always had semantics to support normalized data modeling, has schema support, and has had distributed multi-document ACID transactions since 2019 [1]. You don't have to use those features as they're opt-in, but they're there.
I know that full parity between the two isn't feasible, but to say they never can is a mischaracterization.
[1] Small addendum on this: Jepsen highlighted some issues with their implementation of snapshot isolation and some rightful gripes about poor default config settings and wonky API (you need to specify snapshot read concern on all queries in conjunction with majority write concern, which isnt highlighted in some docs). But with the right config, their only throat clearing was whether snapshot isolation was "full ACID", which would apply to postgres as well given they use the same model.
This is a double-edged sword. I have seen massive business logic baked into stored procedures...so much so, that the applications themselves are rather slim. If this stored procedure code is properly versioned and otherwise managed, this is not entirely bad. If the data model is sound, I don't worry that much...stored procs vs 100KLOC of Java code? I can tell you what is easier to migrate. The other side of it is that stored procedures can also serve as an API into the actual database. I built a system once (90's) where the developers never accessed the actual database tables directly. They always called stored procedures. The advantage here was that we could tune the snot out of the database without a code change/build/deploy. It also allowed the DBA some control over poorly written, runaway queries. YMMV. I think today I probably would try to keep the database as vanilla as possible.
Sure, YMMV.
In any non trivial dataset a lot of fields are effectively computed. For example merged entries: in order to get correct and whole data, one needs to consult some merge mapping, which can be easy to forget and tricky to get right - you have more than one relational identifier. This is not strictly business logic, but rather data assembly logic.
Similarly, a value can easily be spread over multiple fields (and tables) and it is crucial from data integrity standpoint to always update them in tandem if applicable. Again, this is very easy to screw up in client code, because the hidden relationship can be non-obvious. On one hand this is business logic, on the other hand violating this implicit relationship will result in non-agreeing data. This occurs for example when data represents parallel states.
Effectively, stored code acts as some kind of gateway API stored concurrently with the data. Sure, some peculiarities can be implemented with functions and triggers, but IMO those are just different sides of the same coin. In the end really depends on the dataset and what it represents.
Database stored code is a tool. Used appropriately it solves problems, used inappropriately it causes problems. If I were to design a database today, I would too try and make do without stored code, but would not try to twist data model so that it fits the relational model of RDBMSes.
Except for eg infinitely scalable cloud data stores like Google’s, or for ML where it’s just massive data and you need a dumb-ish store of many GB of parquet.
I share this sentiment. The apps will come and go, the real value is in the data. If the database can cover its ass, I am less concerned about ham-fisted developers randomly messing it up with ill-conceived DML. It's not that they are malicious...it just happens. I have seen devs that code in Erlang, Haskell and even Assembly...run in terror at doing some SQL. It's weird. Trust but verify. And hire persnickity, passionate DBAs.
From my perspective this is a negative. You don’t want your queries to change after deployment.
Procedures are also quite hard to type.
If you have DBAs with organizational power and their shit together, they tend to ask awkward questions and tell you the baby is ugly. It’s easy for developers to do stupid shit and blame solar flares or whatever, but stored procedures are the DBAs realm, and the DBA knows who gets blamed when it blows up.
Like anything, there’s ups and downs. But if you can commit to a DBMS platform for the life of the app, there are compelling reasons to use it, even if it pisses off the devs.
But I have also seen organizations with the policy that any operation touching base tables should be encapsulated in a stored procedure. This makes development extremely cumbersome, especially if some DBA is gatekeeper for the stored procedures. Something like this might have burned your colleague.
Stored procedures have the downside that they often work only in one vendor's database.
The advantages of GraphQL are its integration of backend and frontend workflows, and that won’t happen with PostGraphile: a frontend needing a schema change needs to go through the backend instead of the backend extending to meet the frontend in the middle.
PostgreSQL 14 comes with 6 builtin index types[1]: B-tree, Gist, SP-Gist, Gin, Brin, and Hash. More can be plugged in as extensions.
[1]: Chapters 63-69 of https://www.postgresql.org/docs/14/internals.html
Edited: Fixed the link to version 14.
You said that B-Trees "use in indexing has dwindled with time". This is demonstrably false.
> Back then I used them ubiquitously but I honestly don't remember the last time I've seen one in a new design.
Even if that was true (which it definitely isn't), why would anybody judge the commercial or scientific relevance of B-Trees by looking at what new systems do? There are very few new systems that are intended to be competitive as general purpose systems, which is where most of the market is.
You still haven't actually named a single example of a "modern database kernel" that exemplifies what you're talking about.
Use of B+trees in new database kernels has definitely diminished. I'm not counting the installed base of SQLite etc. Ubiquity doesn't make something the pinnacle of technology -- just as often it means "legacy installed base". I still use PostgreSQL a lot and mod it when I need to but I am not under any illusions about its limitations.
A "modern" database kernel that can efficiently use modern hardware is going to be a thread-per-core architecture with all I/O and execution scheduling done in user space, and the ability to operate on modern storage densities found on database servers, which can exceed a petabyte of direct-attached storage. The implications of storage density and its interaction with indexing drive most of the real changes in the way database kernels are designed. You can find elements of this in open source, but mostly in big data platforms rather than proper database engines.
That said, no one builds new high-end databases for retail anymore, the economics don't make sense. All the money moved to more specialized implementations that cater to smaller audiences where you don't need to advertise. The kernels are fully general, and widely reused, but the interfaces and surrounding bits are purpose-built for particular workloads. Hell, my old storage engines are still used under license by that lot. The days of database billboards on the 101 are an anachronism.
Other than in-memory hash indexing as used by SAP HANA, I’m not aware of any other data structures anywhere near as popular for database engines.
Can you name the data structure(s) that have superseded these?
I never said anything about workloads. All I said was that your statements about B+Trees having dwindling usage are clearly false.
If you make a claim that is self-evidently bogus, then you shouldn't expect anything else that you may have said at the same time to be taken seriously.
And I would never trust a database that has such a bad track record, regarding durability as MongoDB, although I admit that PostgreSQL had theoretical problems there as well in the past.
> And I would never trust a database that has such a bad track record, regarding durability as MongoDB
I can't comment on your own experience obviously, but I've been using mongo since 2011 in high throughput distributed systems and it's been mostly great (one of my current systems averages ~38 million docs per minute, operating currently at 5 9s of uptime).
Definitely some pain points initially in the transition to WiredTiger, but that largely was a positive move for the stack as a whole. Durability fires have not plagued my experience thus far, not to say they won't in the future of course.
As you noted, Postgres has had its own headaches as well. Finding out that all their literature claiming their transactions were serializable when they were in fact not serializable could be considered a mar on their record. But much like mongo, they have been quick to address implementation bugs as they are found.
> Definitely some pain points initially in the transition to WiredTiger, but that largely was a positive move for the stack as a whole. Durability fires have not plagued my experience thus far, not to say they won't in the future of course.
Good to read that some are actually using MongoDB to their benefit. Indeed I have encountered problems with durability in the wild. Nothing, that I would like to repeat. But as always for a specific use case the answer is: it depends. For a general advice with what to start a new project I would select PostgreSQL in 10 out of 10 cases, if a database server is actually required.
https://storage.googleapis.com/pub-tools-public-publication-...
What about relational databases? Most are best suited for OLTP workloads.
Doesn't really matter, because no one is writing database-agnostic SQL (unless that's part of your product). Any non-trivial implementation is going to require the use of proprietary SQL.
Once you control the deployment environment it largely makes very little sense to spend much time on db agnosticism. But that’s also true of many other things that have left vestigial forms in our software today such as IOC containers, pluggable log libraries, configuration files instead of configuration code, etc.
It's a bit like if you wrote C and could only compile your program with Oracle's C-compiler. Where is progress in SQL standardization going?
In theory we can use a tool which tells if you SQL is ANSI compliant, but this would not tell if your code expect only standard compliant behavior from the database. E. g. some RDBMs are flexible in which formats they accept datetime strings and some are not. SQL can be standards-compliant but code still can fail with one DB and work with another.
It is not to say that standards are useless. It much easier to port an application from one DB to another if a developer was trying to use only ANSI SQL. Or you can write an application which can work with multiple RDBMs without modification but like with almost everything else - if you haven't tested it you cannot be sure that it works.
Examples: GRANT TO, REVOKE FROM, DENY TO (oh yes, what's the difference between REVOKE and DENY? and did you know that you can REVOKE a DENY?), arbitrary requirements for punctuation (e.g., = in BACKUP DATABASE AdventureWorks2012 TO DISK = 'X:\SQLServerBackups\AdventureWorks1.bak'), dubious context-sensitive "magical identifiers" (e.g., SELECT DATEPART(year, @t)), non-composability (how do you dynamically give a part specification to DATEPART?), etc, etc, etc.
It's possible to write a novel about just how unstructured and irregular "structured" query language is.
In my experience <20% of developers are good enough to be dangerous with SQL, and maybe 5% what I'd consider adept.
The rest range from "SELECT * is as far as I'll go; where's the ORM" to "why the hell are you using cursors to aggregate"
SQL is powerful, elegant, and reliable. With modern DB support of JSON, ARRAY, statistical functions, and much more, SQL is probably the #1 most underutilized/improperly leveraged tool today. SQL-killers have been coming out for 40 years now and (for its domain!) SQL's lead is pulling farther away if anything.
*yes there are some questionable implementations, so please replace "SQL" with "PostgreSQL" if nonstandard SQL implementations have caused nightmares for you in the past.
Even for use-cases like graph based models you still find Twitter and Facebook using MySQL to build the graph on top of. It’s simply heavily abstracted for the majority of engineers at those companies (where I'd wager the <20% proficiency in set-based thinking holds true) but it still fundamentally relies on SQL.
SQL does implement the set operations of UNION, INTERSECT, and EXCEPT, but I meant "thinking in sets" more colloquially.
edit: this is more practical than theoretical but the author here actually does a nice job of discussing "set-based" thinking vs. procedural https://www.itprotoday.com/open-source-sql/fangraphs-tags-cl...
Fundamentals of Database Systems by Elmasri & Navathe. Part 2.
When it comes to prototyping, I'm not going to fuck with something like Java-- I'm going to reach for Python. If I don't know what I'm doing to begin with, I don't need something telling me all the ways in which I'm doing it wrong.
Same goes for SQL/NoSQL. I loosely know what I need to model and may revise it arbitrarily. SQL does not lend itself to this. NoSQL was designed for it.
NoSQL is the "fuck you dad you can't tell me what to do" of the database world. SQL is your annoying dad that tries to institute curfews and won't let you go to underage drinking parties. In the end, it's the latter you're going to be calling from a holding cell, but there's a lot more fun you can have with the former.
Additionally, I think drafting a DDL schema is a great way to learn about a new application domain. It forces me to ask hard questions. And that improves my understanding of the domain. I guess that is similar to some people who like to prototype using Haskell type signatures.
[1] Unless you mean specifically a document store without schemas.
This is unrelated to this conversation, but this is my main beef with Rust. I love Rust (like a lot), but it's just not good for prototyping. End of non-pertinent rant.
But then I'm trying to do relatively simple things in Pandas or R, like changing column values based on multiple field value conditions, and it is a struggle requiring at least 17 brackets, while in SQL it would be a simple statement that anyone can understand after reading it once.
For example, one of the biggest gripes I have with SQL is the fact that the table is the only output format, it doesn't allow outputting data with different cardinalities in a single query. The internal relational model of a RDBMS is extremely rich and powerful, but when extracting that data using SQL you have to leave all the richness behind and flatten the data into a simple dumb table. So as result we have to do much back and forth with separate queries based on results of previous ones (or even ugly hacks like concatenate higher cardinality data into a single field to then re-separate it in the application). A suitable successor for SQL should have a way to output a subset of the relational model with the relational part intact.
SQL is essentially the assembly language of the database. That makes it very powerful, but sometimes you just want a language that gives you a "garbage collector" for free, while still being able to drop down to assembly when you need additional power.
There is no technical reason why an RDBMS can't support different modes for accessing data. We are just so used to putting that work into the application that it has become a hard sell to want to move it to the right place.
With MSSQL at least you can return multiple result sets. Not sure about other database vendors.
Our layers of abstractions atop SQL along with approaches taken by DBMSes outside of the common relational names have shown that there is room for improvement within those tasks, able to be done natively by the DBMS, and it does not have to come at the expense of losing SQL when you need to be able to describe lower level questions. Different tools for different jobs, but little will within the RDMBS space to explore those other tools.
Doing regex is horrendous, but doing it on SQL in a modern database is no more difficult than in a full-fledged programming language. Most modern DBs have strong JSON support and built-in regex functions
1. JSON_CAST/JSON_PARSE your data
2. REGEXP_EXTRACT() on the result, here's several date validator regex from a SO post (https://stackoverflow.com/questions/15491894/regex-to-valida...)
And that's it. In fact in many cases it's probably faster to do it natively in SQL than to export it to python or R and parse there.
With dplyr, wouldn’t this be mutate(x = case_when(… ~ y, TRUE ~ x)) or the same but with ifelse?
CASE WHEN
SUM(daily_revenue) OVER (PARTITION BY department, TRIM(SUBSTR(region, 5)) IN ('North','West','Misc')) >
AVG(revenue) OVER (ORDER BY sale_time ASC rows BETWEEN 28 PRECEDING AND CURRENT ROW)
AND NOT COALSECE(had_prev_month_party, FALSE)
THEN pizza_party_points + 5
WHEN <above> AND had_prev_month_party THEN pizza_party_points + 3
WHEN MIN(sale_time) over (PARTITION BY department) = DATE_TRUNC('month', current_date) then 5
ELSE GREATEST(pizza_party_points - 1, 0)
END as pizza_party_performance_points_current
this example may be a bit esoteric but it actually draws from lots of real-world cases (comparing a partitioned sum with an overall average, checking conditions on other columns, messy data parsing and manipulation, implicit waterfall nature of CASE WHEN vs. explicit elseif, etc) SELECT json_agg(row_to_json(t))
FROM information_schema.tables as t;It just isn’t feasible to write everything out when the schema has >30 entities, some with N:M relations, and when queries routinely go several levels deep to fetch dozens of joined fields at every level. The boilerplate overhead is too much.
A natively GraphQL database makes such queries a magnitude less verbose to write out, and all the queries can stay in the frontend (or can become persisted queries on a GraphQL ”middle-end” server).