SuperSQLite: SQLite library for Python (2018)(github.com) |
SuperSQLite: SQLite library for Python (2018)(github.com) |
"SQLite has fantastic write performance as well. By default SQLite uses database-level locking (minimal concurrency), and there is an “out of the box” option to enable WAL mode to get fantastic read concurrency — as shown by this test. But lesser known is that there is a branch of SQLite that has page locking, which enables for fantastic concurrent write performance."
https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...
It's easy to get really stellar concurrent performance out of SQLite using a many reader, single writer model (ie many threads, single process). In testing we did it easily surpassed Postgres.
My experience pretty much matched what you describe and it was such a great opportunity to really lean on a fabulous piece of software. Given this was many years ago, before ES was stable and SOLR wasn't working well for us, I think it was the right choice. These days, ES is good enough that if I had to do it all over I'd go straight to it. None the less, I don't think SQLite should be ignored as an option when you have high levels of control over data access patterns.
Does anyone use this seriously in production on a typical web service? I wonder about how eg backups and stuff like that work out in real scenarios.
I've always been a big fan of SQLite and this is the one challenge I've always faced.
Can you give some more insights as to how you achieved that?
SQLite has progressed a lot in the last few years. It is no longer advisable to mock it over SQLITE_BUSY et al; you will come unstuck on the Internet very quickly ;-)
Unfortunately, I use SQLite through Python so I'm stuck with the system version :-(
Fair enough if you want to compile your own version, but there's a lot of people and Python devs who don't know how / won't go through the effort and that's what this project is for.
It's also a work in progress and not meant to be released yet. Shoot me an e-mail ajay@plasticityai.com if you have suggestions for how to improve it.
http://charlesleifer.com/blog/compiling-sqlite-for-use-with-...
Describes several handy ways to compile sqlite for use with python (2 or 3), as a system-wide shared library or compiled directly into the python module. This can also be applied to sqlcipher, etc.
If you don't know how to compile sqlite, I'd argue that you have no business trying to use it's more advanced features. How can you tune or optimize something you don't understand?
Furthermore, your library is stale. Anyone using it is dependent on you to merge in upstream changes constantly. And based on what I saw, it's already well out of date.
Because anyone can submit links?
Not sure if it's in this library, but I'd want the ability to open a database that's already loaded in memory as a byte array.
But if someone published an alternative APSW wheel with JSON1, ICU, and FTS5 enabled, I'd be happy.
I keep seeing this statement. Why is it considered one of the most well-engineered software?
[1]:https://github.com/plasticityai/supersqlite/blob/01e54bbb829...
This has not been our experience. Our experience is that it sort of bogs down around 12GB in file size. linux box, ext3/4, spinning disk (not SSD)
However, pointing this or any of the other practical shortcomings of SQLite out on hacker news is blasphemy and will invariably get you downvoted into oblivion by people who (apparently) never ran into them.
Either way, the fact that it is DO-178B certified for use on aircraft should tell you something about the reliability of it. It is certainly more reliable than most other software, OSS or otherwise.
I am curious what practical problems you encountered with SQLite. Could you perhaps expand on that?
With the default settings which I semi-affectionately refer to as paranoid mode, an untuned database can start to have worse performance after getting 500,000+ records going. Then things like indexes, RediSQL, and WAL mode start being more necessary rather than just best-practices.
But if you set your pragmas correctly and so on, SQLite scales up just fine. I haven't done a large scale Sqlite base simply due to caution around needing to use a 'real' database in production like SQL server or Postgres, Maria etc. Sqlite is excellent for ephemeral databases to be created, seeded with test data, run tests against, and deleted in repeatable automated testing.
Based on the link to the Expensify article, it sounds like Sqlite can scale up even better than Sql server under some circumstances.
But I have barely tried using it in production because of that aforementioned caution. What are some pitfalls to watch out for?
See: https://www.sqlite.org/testing.html https://www.sqlite.org/hirely.html
The bit about testing.
They intend to support it till 2050 - https://sqlite.org/lts.html - which includes keeping file format and API backwards compatibility. That accounts for the more careful decision making since things added now have to be supported for another 30 years.
>> If you don't know how to compile sqlite, I'd argue that you have no business trying to use it's more advanced features. How can you tune or optimize something you don't understand?
I'm not sure I agree with that, some people understand the advanced features but have a miserable time with compilers and compiling something like SQLite. There's a lot of great engineers with Python and SQL expertise that just don't know how to compile a C project. That might sound contradictory, but it's just something a lot of engineers still don't do day to day since people are using a lot of scripted languages (Node/Python/Ruby).
For example, there's probably a lot of people who want to use the JSON1 or FTS5 modules with SQLite in Python. That doesn't require advanced knowledge, but requires them to re-compile SQLite!
>> Furthermore, your library is stale. Anyone using it is dependent on you to merge in upstream changes constantly. And based on what I saw, it's already well out of date.
It is stale, I agree. But it's still a WIP. This was posted here early. My plan is to pull in changes from the upstream sqlite module with the latest CPython 2.7 tag and latest CPython 3 tag in the source.
Note that I'm not saying all of the test suite should be open source. It's clearly a valid/cool business model and SQLite is, for a lot of use cases, an excellent piece of software that I have often used myself. The criticism was honestly more directed towards the cherry-picking of facts that sometimes happens in discussions on hacker news, especially those related to SQLite.
So, if you're were getting unexpected "BUSY" erorrs than, yes, you would be using it incorrectly. However, AIUI, you are always expected to see some amount of BUSY errors during normal, concurrent operation and have to deal with them explicitly. So the fact that you're seeing BUSY errors alone doesn't mean you're doing anything wrong...
To use SQLite correctly from multiple processes, you have to do one of two things:
- Add explicit code to retry on BUSY errors everywhere you do SQL queries
- Serialize all access the database, e.g. by using a mutex
GP appears to have chosen the second option.Their rationalization for this is that the way SQLite stores data, this is more efficiently performed by simply creating a new table and copying all the rows from the old table into the new one, and that one would want to batch all the table changes together rather than emitting individual ALTER statemnts.
from the POV of people who make tools like my own Alembic Migrations, this is an annoyingly insufficient answer because the logistics of recreating table schemas and copying data over is much more complicated than just emitting an ALTER directive. I'd like if SQLite had at least the ability to be extended to support a third party "ALTER" plugin that would under the hood run the intricacies of copying the tables around, rather than pushing this out to the tool creators. It doesn't really matter in most cases, for the use cases used by SQLite, that ALTER would be inefficient.
instead, my users bugged me for years to solve this problem and I have to maintain this thing https://alembic.sqlalchemy.org/en/latest/batch.html which I mostly hate completely.
SQLite's typing model is also very idiosyncratic and is based on a naming convention approach, which I think most users of SQLite don't understand very well, because it works in a completely strange way based on looking for substrings inside of the completely arbitrary names you can assign to types. I can create a column with the datatype ELEPHINT and that is a legitimate datatype which will store integers. there's also the "INTEGER PRIMARY KEY " / "INTEGER PRIMARY KEY AUTOINCREMENT" silliness but that's a relatively mild poor API compared to things MySQL does all over the place, I suppose.
With the way transactions and durability work in SQLite, a normal setup will not do more than on the order of ~100 transactions per second without adding additional layers of complexity. Which is completely fine for it's intended usecase of being an embedded, lightweight database and of course is not due to some bad engineering decisions, but due to inherent tradeoffs in how transactions are handled.
So basically it's all fine until you try to use SQLite for something for which it is not a good fit, like a large volume of inserts. Other solutions exist that are conceptually better equipped for this usecase out of the box (Postgres/MySQL/Elastic/etc). Not a shortcoming in SQLite per-se, but rather a practical shortcoming that comes up when you try to use it for the wrong job.
Case in point is TFA which takes the theoretical SQLite database size (140TB) limit and runs with claiming that is the useful limit on how much data you can store in SQLite. LOL! That value is when SQLite page IDs start overflowing and not an estimate on how much data can be usefully handled in a single SQLite database.
SQLite is, by the nature of what it is, not a good fit for big data volumes. Try loading 100TB of data into SQLite and try to run even a single query. Even assuming that you have a hyper-fast SSD, a single query will take days to complete! But it still get's mentioned as a potential solution for that problem every so often. In this post even...
So you could consider those practical shortcomings or ill-advised usage. It depends on the definition/perspective I guess. At any rate, SQLite, even if generally being excellent, is not the panacea as which it sometimes gets sold here on HN. No database is, there are just too many tradeoffs involved.
Note that you can "batch" up many inserts into a transaction, which gives you a high "nominal" insert rate but still only ~100 actual transactions or so per second. To see why this is not the most useful number when comparing to a database like Postgres, consider that the limiting factor in a SQLite/Posgres design are cache flushes which outweight the costs of actually writing the data, so the number of rows per batch is mostly arbitrary; using this metric you can always claim a huge insert performance by choosing a suitable N. Also, if you do the batching, you of course loose some of SQLite's consistency/durability guarantees for your writes, which is probably fine if you didn't need them in the first place, but begets the question if an embedded ACID database is the best tool for the job at hand.
Motherfucker
Would Cython help, perhaps?
I already do Windows builds[2] - 21, becoming 23 in next release so that isn't a difficulty.
The problem is tying all this together in a way that is useful, with sensible defaults and appropriate tools. APSW supports python versions going back 16 years! I welcome discussion at python-sqlite[3]
[1] https://rogerbinns.github.io/apsw/download.html#easy-install...
[2] https://rogerbinns.github.io/apsw/download.html#source-and-b...
The problem for my case is that if I want to distribute anything with a Sqlite+JSON1 workflow I also have to host and distribute a version of sqlite3 or apsw with those flags enabled. If that's already done on Windows then great, but many people I'm sure (myself included) would like a cross-platform solution.
It would be great if we could use the Setuptools "extras" feature to select which flags we want enabled. Not sure what the technical requirements for that are.
Alternatively just having an "everything enabled" version of the package would i think be good for most people who dont need such fine control over their project dependencies. If anything, someone who cares so much about keeping the library minimal is probably in the minority and should be free to disable extensions as needed.
It seems unlikely that either of these are more widely deployed than SQLite, especially the Linux kernel.
Probably the vast majority of deployments of the Linux kernel include SQLite. Add in all the non-Linux devices (e.g. the billions of iOS and Windows devices) and that's probably far more than the number of Linux installations without SQLite.
I don't know much about ICU, and perhaps you're right that it's one of the most used bits of software. But I can certainly imagine embedded devices that never communicate with users and wouldn't need ICU but do need to store data.
Most Widely Deployed and Used Database Engine
That last part is rather important.> MOST WIDELY DEPLOYED SOFTWARE MODULE OF ANY KIND? -- SQLite is probably one of the top five most deployed software modules of any description. Other libraries with similar reach include: zlib... blah blah blah
It stops just short of claiming it's the most deployed software in the world...