What if OpenDocument used SQLite? (2014)(sqlite.org) |
What if OpenDocument used SQLite? (2014)(sqlite.org) |
Complete version history: https://sqlite.org/docsrc/finfo/pages/fileformat2.in
Note that there have been no breaking changes since the file format was designed in 2004. The changes shows in the version history above have all be one of (1) typo fixes, (2) clarifications, or (3) filling in the "reserved for future extensions" bits with descriptions of those extensions as they occurred.
I feel like I have considerable disagreement with the author of these sentences.
But sqlite files aren't small. One thing I don't understand is why they don't do string deduplication in sqlite (as in you only store a string once and every other occurence is just a pointer to that string). It seems such an obvious and easy way to reduce file size, memory consumption and therefore increase performance (less I/O). Is there a technical reason why this would not be desirable?
This sounds like something you could do at the schema and application level.
There comes a point where ISOing things doesn't help. The SQLite format belongs to SQLite, and an ISO standard would result in that standard being rendered irrelevant by the SQLite team, should they wish to make a change for any reason. Also, people would have to pay ISO for access to the specifications. SQLite should be treated as a defacto standard defined by the SQLite project.
MongoDB also saves document db type of store space just FYI.
SQLite files require opening in a DB terminal or using special software to even get to the point where one can see what’s there at all. Further the entire internet basically natively supports XML and JSON.
(Json has a higher threshold of complexity before it succumbs)
That said there is some nuance and it depends what the user expects. Is you app more of an MSWord where people expect a format that is decades backward compatible and only changes on explicit save, or is it more like a live app with a db back end. If the latter there should be no save concept around the DB file but perhaps a backup and restore function that exports to a controlled format.
All that matters is that you should be able to issue sql to the sqlite embedded library and get back the results.
Freeing you from the overhead of owning (thus inventing and then maintaining) your own file format is almost the entire point of using sqlite in this manner.
An app can go bananas with serialization and use, I dunno, binary JSON or Matroska / ebml or .mp4 containers or whatever, and still serialize any way it wants.
This has nothing to do with sqlite. You can have (or not have) gradual saves in any file format. It's a choice that the developers of that app made.
> file format may have braking changes
The sqlite file format is unchanged for 19 years now. A world of features and capabilities have been added since. Don't hold your breath waiting for the sqlite format to change.
Fair enough about the history of it not changing and you can always embed a frozen copy if it does. But this is a pragmatic assumption not a guarantee.
So to open a file I copy it into the :memory: database [1], then the user can do whatever manipulation they want and I can directly make the change in the database I don't need to have a model of the document other than its database format. And to save the document I VACUUM [2] it back to the database file. It works quite well, at least for reasonably sized file (which is always the case for my app) :).
A proposal: write directly, and automatically in the database. No more Save button. There are multiple advantages:
- the system is crash-resistant. I like taking the approach of CouchDB where the only correct way to close the system is to crash it. That way a crash is an expected situation that you actually account for, not a special case that you might forget
- there is only one database. Less code, fewer bugs.
- it is safe. A write to SQLite works or doesn't work, there is no in-between. As said in the VACUUM doc you point to: "However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt"
- it is how SQLite was intended to work. And because of that, you won't have to think about it for the lifetime of SQLite
For example in many regulated domains such as human subjects research files must be approved and only approved files may be used. "Is this version of the consent document the version that the IRB approved?" Well let's see... (1) file modification date is after the approval date and (2) checksums do not match.
Not to mention that writing a single byte of content to a filesystems marks the entire blob as needing backup.
The fact is the filesystem is the user's database, save is commit, and it should be under the users control because application developers do not have the faintest idea about user context.
For the exact reason I gave in the comment you are replying to: I want to keep a usual workflow for users. Principle of least surprise.
Users are okay with change being autosaved when there is a single "thing" that can be edited to the point that you don't even have to open it, it's just there, it can be seen as a property (as in ownership) of the application more than of the user. For example, your music library in your jukebox application.
On the contrary, when the user have to open the "thing" with your application and can choose between many of their files that can be edited with your application, users do not expect their files to be automatically modified at all. For example users may start doing some heavy editing and then at the moment of saving their work, they might make a backup of the previous state file before saving, or choose to "save as…" in order to keep the old version just in case.
Crashes are not something that happen that often. It can become an actual problem when you have tens of thousands of users and rare-events do happen, but in the particular case of the application I working on, I do not actually have to worry about that (on the contrary, any solution would have downsides that are worst in the particular case of this application than having to do some work again because of a crash if it ever happens).
but the save button could simply tag specific save points in a larger table.
if the format can roll up changes to compress them, they also indicate where which variants need to be kept indefinitely.
The term you're looking for is (aptly named) crash-only software.[0]
It's much better to save after each operation in a temporary place (probably in ~/.local/share/application/yourapp, using XDG directories), and when the user clicks save, just copy the file into the desired location. That way, if there is a power loss and you reopen the app, it opens right back where it was doing (losing maybe he last few seconds of changes, but not all unsaved data)
To be perfectly safe, you want to rename it, not copy it. If there’s a power loss during copying, you may endcup with corrupted data.
Renaming is, to coin a phrase, “more atomic” than copying (on Linux, the OS says it is atomic. ISO C says it, too, but POSIX doesn’t (https://pubs.opengroup.org/onlinepubs/000095399/functions/re...: “This rename() function is equivalent for regular files to that defined by the ISO C standard. Its inclusion here expands that definition to include actions on directories and specifies behavior when the new parameter names a file that already exists. That specification requires that the action of the function be atomic”)
Also, filesystems may have bugs, hardware may lie about syncing to disk, and network shares can be finicky.
Doing this properly isn’t as easy as one would think. You’ve to make sure to sync the file to be written and you’ll have to handle the case where the save location is on a different file system than your temporary file. If so, you’ll have to create a copy on that file system first.
I think many tools do not check whether they need to work cross filesystem and just write their scratch files to the save directory with a different name and then rename them.
Of course, that means you always need twice the disk space on the target disk to do a save. That used to be a problem almost everywhere, but nowadays mostly is restricted to embedded systems and USB sticks.
In this case, however, SQLite will do a lot for you, and probably better than you would do it. It claims (https://www.sqlite.org/atomiccommit.html#_multi_file_commit):
“SQLite allows a single database connection to talk to two or more database files simultaneously through the use of the ATTACH DATABASE command. When multiple database files are modified within a single transaction, all files are updated atomically. In other words, either all of the database files are updated or else none of them are. Achieving an atomic commit across multiple database files is more complex that doing so for a single file. This section describes how SQLite works that bit of magic.”
However, about VACUUM INTO, it says (https://www.sqlite.org/lang_vacuum.html):
“The VACUUM INTO command is transactional in the sense that the generated output database is a consistent snapshot of the original database. However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt. Also, SQLite does not invoke fsync() or FlushFileBuffers() on the generated database to ensure that it has reached non-volatile storage before completing.”
So, I don’t think doing “VACUUM INTO” is sufficient to guarantee that you get a good copy of your data on disk.
What I have trouble imagining is people working with documents on computers for more than a few years yet somehow failing to develop the Always Save Instinct. I regularly catch myself saving unreasonably often.
When user saves, you just checkpointing the file, merging it back into the main database.
> The VACUUM INTO command works the same way except that it uses the file named on the INTO clause in place of the temporary database and omits the step of copying the vacuumed database back over top of the original database.
Do you use VACUUM (uses a write-ahead log to survive power-off) or VACUUM INTO (as far as I can tell, it doesn't survive power-off during writing, and might corrupt the existing file contents if the filename already exists)?
EDIT: there is no difference between VACUUM/VACUUM INTO - they both write to a new file (COW) it's just VACUUM [NOT INTO] does mv temp.sqlite originalfile.sqlite after that, while VACUUM INTO does not.
I don't get your point. Are you saying that you don't need to have a model of the document other than the model of the document? What's the nuance I'm missing?
Programmers often have to make software design decisions around how to represent a file in memory in order to manipulate it. For example, if I'm writing an HTML editor, should I mostly treat it like a text file (maybe a gap buffer) with syntax highlighting and auto indentation as an afterthought? Or should I maybe load the whole thing into a tree? What are the robustness and performance characteristics of each?
The commenter above was saying that using SQLite made that decision easy. He could keep traditional (or "atavistic" per the commenter upthread, depending on your perspective) load/save semantics while also making the data model easy to work with.
SQL <-> In-memory representation <-> Disk format.
With SQL on the client, in theory you only now need to maintain SQL <-> In-memory representation
Obviously I'm skirting over the format you would use to send either entire documents or partial updates of documents over the wire.Then, saving your document means serializing the data in some format (which could be JSON, XML, CSV, an SQLite database, …) and writing that to disk, and opening a document means reading the file from disk and unserializing it to your internal model.
What I'm saying is that my approach is to use an in-memory SQLite database as the internal model of the data in the applications. I presented an upside (opening and saving are easy), but is also has downsides: I have to do SQL queries to manipulate the data rather than manipulating objects directly (which could be mitigated using an ORM but that's outside my point). In Python-like pseudo-code you can imagining something like:
self.todos[42].status = 'DONE'
vs self._db.query("UPDATE todos SET status='DONE' WHERE id=42")
(Of course there is the possibility of using ORMs or other approach in between the two.)This isn't a concern for most software. If you're writing a domain-specific, closed-source application where interoperability with other apps or ISO standardization isn't a concern, SQLite is a perfectly fine file format, but as far as I understand the situation, those concerns did exist for Open Office.
- On Linux, saving into a new file onto a root-owned but world-writable NTFS mount created in /etc/fstab, fails due to permission errors or something. Saving into an existing file works as usual.
- Files are modified on disk when you edit the project in the program, creating spurious Git diffs if you check Audacity projects into Git as binary blobs. And when you save the file, old and deleted data is left in the SQLite file until you close the project's window (unlike saving a file in a text editor), and you can accidentally commit that into a Git repo if you don't close the window before committing. (I recall at one point that you had to manually vacuum the .aup3 file, but now closing the window is sufficient.) I'm getting Word 2003 Fast Save vibes.
I have a use case where users of a web service want to use data exported as a bunch of rows in a table in a variety of tools. Now, CSV with UTF-8 encoding is of course, totally open, conventional, and workable, but anyone who has ever offered CSV files to end users will know the pain of these users getting stuck when they want to use these files in a spreadsheet application¹. So I saved a sample spreadsheet in OpenDocument's ODS and another in that Microsoft XML abomination called OOXML as XLSX, and just figured out the basics of those XML formats. I trimmed the ZIP archives down to the essentials, marked the places where content goes, and just build a new spreadsheet file whenever data is requested in that format. Now I can output CSV, ODS, and XLSX (and JSON thrown in for good measure) of the same data.
Doing this with SQLite would be possible of course, just a tad more complex and with a lower development speed. Being able to fire up the office suite, create a template document, and just dig into its XML files in the saved file is a nice feature (although admittedly of niche interest).
1: More specifically, users who use Excel in a locale like nl_NL, where CSV files are, hardcoded, assumed to have their columns separated by semicolons, because Microsoft once notoriously decided that the Dutch did not use comma's in a comma separated values file.
This used to be configurable when opening CSV/TXT file in excel (and still is in LibreOffice) but as a part of the overall UI dumbification was moved somewhere under the “Data” menu/ribbon tab (so you have to open new workbook and find the right option, or well, use LibreOffice if you value your time).
Are you sure that affects it? The decimal point parameter sounds like it decides how to write out 5½ (i.e., 5.5 (English style) or 5,5 (Dutch style)) surely? Although on the topic of this particular bête noire I would not be surprised.
> SELECT manifest, versionId, max(checkinTime) FROM version;
> "Aside: Yes, that second query above that uses "max(checkinTime)" really does work and really does return a well-defined answer in SQLite. Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect: it returns the manifest and versionId of the entry that has the maximum checkinTime.)"
One of the improvements that I made was moving a few tables that contained small amounts of data to xml files. Because these files were small and rarely written; it simplified the data access layer, and simplified diagnostics. (I made sure the files were multi-line tabbed xml.)
For "technical" people who needed to diagnose the product, asking them to crack open a SQLite database was a huge ask; but for the major part of the product that used SQLite, it was hands-down better than XML files. (An older version of the product used XML files. It had scalability problems because there's no good way to make an incremental update to an XML file.)
The advantages of XML, specifically, a human-readable format; really only work for small files when the design of the schema is optimized for readable XML. Unfortunately, the need to always rewrite the entire XML file, and the "complexities" that come with lots and lots of features will quickly erode XML's biggest advantages.
IMO: A "lay" person needing to muck around with the internals of an office document is fringe enough that learning to use a SQLite reader is an acceptable speed bump. The limitations of XML + Zip, when it comes to random writes in the middle of a file, just can't be overcome by Moore's law.
Engineering is all about tradeoffs: SQLite is optimized for quick incremental updates where you don't need to rewrite the whole file. Zip & xml aren't. (IE, if you decide to add a letter to a word at the beginning of a document, with zip & XML you have to rewrite the whole document. SQLite can make a minor change without the whole rewrite.)
In our case, file size was not a factor in choosing between SQLite and XML.
But, remember that file size is deceptive: Disks are block devices; the 30 byte and 1k file take up the same space if you block size is 2k. (I've shipped a filesystem driver.) HTTP servers gzip on download. It's more important to know your needs than to get hung up on a single metric like file size.
> I'm unclear on how SQLite (native format, no zip) is achieving sizes similar to XML + Zip. Are SQLite TEXT or BLOB fields compressed? Or are they assuming the caller is compressing BLOBs before writing?
Remember, XML writes each tag name 1 time if there's no content and twice if there is. Each attribute has it's name written every time. I doubt SQLite writes all the metadata in each row.
Without being able to call out to existing standards, the ODT spec itself would suddenly become massive. The effort to update the standards appears to be significant and hasn't progressed much in recent years already :/
I think realistically, an Sqlite format could be offered as an option, but the office doc ship has really sailed.
Good argument to formalise the spec of Sqlite as a standard though...
On the other hand if one ignores few warts (explosion of local styles and text spans due to ooo:rsid attribute, non-sparse spreedsheets and weird mechanism for styling tables as a few examples) it is really well designed markup for this kind of document data that strikes right balance between it being semantic markup and representing the kinds of stuff users want to do. Compare that with Office OpenXML with stateful formatting empty tags (yes, really, in DOCX <b/> _TOGGLES_ whether following text is bold).
SQLite is good, but it is also fairly unique in this space. Why? Because it’s hard to replicate everything it does, because it does a lot.
But… for this case, do we need it do a lot? No, not really. We don’t need the full SQL standard, a query optimiser, etc etc for basic (+ safe) transaction semantics and the ability to store data in a basic table structure.
Perhaps there is a better file format we can use, but it would be better if it was decoupled from SQLite.
Zip vs tar vs filesystem vs sqlite. Tested all these scenarios, and sqlite was the fastest and the smallest, even beating plain archives with no overhead
SQLite has a major draw back (and yes, I love SQLite and built a lot of things around it over the years): the blob you get from the DB cannot be mmap and you have to copy it to somewhere else. For zip files, as long as the file is not compressed, you can mmap it (or it is compressed using some exotic encoding such as PVRTC) just fine.
All benefits SQLite's article lists (and I love SQLite to death by the way) can be implemented by having SQLite be the runtime model of the document. On disk and in memory. But SQLite doesn't need to be the transport format. In fact SQLite can easily get bigger than the current format, SQLite is full of unused space when you mutate it around, it can get fragmented and sparse. And if you need to optimize it every time, then the "fast save" etc. benefit goes away.
There are formats which do need delta updates and quick indexed look-ups without fully loading the file in RAM, and this is why so many apps do use SQLite as a file format. I just feel OpenDocument was a bad pick to use SQLite for in this hypothetical scenario.
If you then check the file into git, you are storing one source control system into another one, and older versions appear in two different histories. To be git friendly, you don’t want to save anything other than the current version, and then let git do its thing.
Possibly the answer is “don’t use git, we have it covered,” but then the app developer should realize that they are implementing something like a source control system. How do people share drafts, review them, and merge changes? How do you publish a release that only includes the version you wanted to release?
And it does seem relevant that the developer of Sqlite actually did implement their own source control system [1]. Maybe they could have warned people about what they’re getting themselves into if they go down this route?
I wonder how terrible it would be to either use a git repo as your file format, or to build in git compatibility into your app somehow so you could push and pull?
https://superuser.com/questions/1562130/can-people-see-the-c...
https://foiassist.ca/2019/04/04/i-thought-we-deleted-that-me...
This shocked me. Impressive how far SQLite's come in such a short space of time.
> OpenDocument - Initial release: 1 May 2005; 18 years ago
> SQLite - Initial release: 17 August 2000; 23 years ago
Wonder what gives.
Over the past 1.5 yrs I've build a computer vision tool from recording hardware/software, to derp learning pipelines, to front-end; we had some requirements on the recording side that were difficult to solve with existing solutions (storing exactly timestamped camera frames, gps data, car telemetry and other metadata).
Using a SQLite-backed data format for the video recordings made implementing things by ourselves super straightforward.
This accurately describes the majority of my efforts, too.
Past version and undo history should be stored separately from the document. They should be stored out of tree where they wont be commited into some git repository or be automatically synced or anything like that.
1) Vulnerabilities: not only in SQLite, but also in wrappers like https://nvd.nist.gov/vuln/detail/CVE-2023-32697
2) Lack of transparency: zip with xml's contains only xml's; meanwhile SQLite contains by design all kinds of traces with sensitive information or empty blocks. Attempts to fix these issues removes benefits that were mentioned.
3) Lack of implementer support. It was one of the reasons for WebSQL deprecation many years ago.
4) Lack of standardization for file format. SQLite does not even promise forward compatibility, only backward one. Which means that new documents might not open in old software, or vendor should fork SQLite and only backport security patches.
Why habit? Well, I can imagine back at the time OpenOffice was a fresh project, it went like this: "XML is going to stay forever and everybody uses XML, so ofc we use one... oh, it is so big! And there are many files, so we just zip'em"...
To be fair, the author of this excellent article doesn't even say about getting rid of XML in this format- but that could also be achieved by storing stuff in a SQLite file. Usage of XML was habitual thinking there- and not very visionary, as the format is dead now...
OpenOffice was born when Sun bought StarOffice, which was initially released in 1985 (on Z80 and certainly without any XML). So the project itself was far from fresh. OpenDocument was developed from OpenOffice.org XML format which was developed after Sun bought StarOffice in 1999. At the time XML was not used everywhere, but it was very much in vogue, certainly at Sun where the official line was that Java (created at Sun) and XML are going to conquer the world.
What If OpenDocument Used SQLite? (2014) - https://news.ycombinator.com/item?id=25462814 - Dec 2020 (194 comments)
What If OpenDocument Used SQLite? - https://news.ycombinator.com/item?id=15607316 - Nov 2017 (190 comments)
There are some more or less general hierarchical formats with support for random access, but most of them are tightly related with particular technology stack (ie. MS's COM Compound Document) or with particular usage area (there is HDF5 for scientific data and many multimedia containers are in fact a hierarchical databases, with both the various IFF variants and EBML being explicitly designed as reusable formats for arbitrary data). And then there are formats that implicitly contain some kind of hierarchical container mechanism (PDF, TIFF, DICOM, FPS game map files…).
https://github.com/uktrade/stream-read-ods https://github.com/uktrade/stream-write-ods
a) SQLite the file format - which is Public Domain and so well documented that parsers for it exist in numerous other languages even though it's almost pointless because...
b) SQLite, the Public Domain (and thus entirely source available) C implementation of the library that can operate on the file format -- and is documented to a level well above what most ISO standards shoot for. It's designed to be used in other software and has bindings for pretty much every major language.
c) Some notional OpenDocument stored in a SQLite file that's really just waiting for somebody to make and document.
ISO standards are great, but if we had to wait for ISO to define a file format we'd have pitifully little to work with.
https://www.sqlite.org/mostdeployed.html
Therefore I consider it a better supported format than most standardized formats.
> The [WebSQL] specification reached an impasse: all interested implementors have used the same SQL backend (Sqlite), but we need multiple independent implementations to proceed along a standardisation path.
Just to clarify: You don't actually need to implement all that for it to be a standardized file format, any more than you need to implement all the spreadsheet functionality to be able to read a LibreOffice spreadsheet. All you need to do is to be able to reconstruct the tables. There's no reason, having reconstructed the tables, you couldn't write your own imperative code in the language of your choice to go over them and get whatever information you wanted.
It's not even a concern for the US Library of Congress, which defined SQLite as a recommended storage format for datasets alongside CSV, XML, and JSON.
Library of congress considers if someone a 100 years from now could write a new importer in whatever langauge/AI they might use by then.
Office documents are something you send in email attachments to people you often barely know, and expect them to read it in whatever office system they have. And if the recipient uses e.g., Microsoft Word, OFD/Sqlite might not work.
The file format itself is pretty straightforward.
I am still salty about this, as WebSQL would have made it much easier to build a certain class of web apps.
SQLite is kind of its own standard. It's public domain and they don't do breaking changes all day, and it's in C. As long as C is still viable, SQLite is usable on basically all non embedded platforms, and nobody really needs to reimplement it, unless they want to port it to Rust or something.
Not that you'd need to, since it's already very reliable.
There wouldn't be another engine.
It would be SQLite. Period.
Hmm… I think I vaguely recall that Audacity 2 had the concept of a temporary working space, whereas it seems that Audacity 3 just uses the .aup3 file as its working space? Some advantages, some disadvantages.
Mildly less on-topic: I looked into Audacity 3’s format, and was utterly baffled by what they’ve done with the project data (what used to be the .aup file). They still encode it as XML, storing it in a single-row table, but instead of just writing it as text, they use a simplistic dictionary coder on it. Just… why? Why did someone go to all the trouble of writing that code? It makes interoperability and inspection much harder, surely harms performance (even if by a trivial amount), and the space saving will be rounding error in every plausible case (like, maybe as much as a few kilobytes out of hundreds of megabytes of audio files).
As for Git, it would benefit from using text format specifically aimed for easy diffing/merging. No idea how easy the sqlite dump is in this regard.
The problem I'd predict here is that then people would expect to be able to do three-way merges. It might even work correctly a lot of the time, depending on the exact pattern of changes. But my gut feel is that unless the schema were designed just right, there would be possible merges that would result in a database that was valid from SQLite's point of view but insane from the application's point of view (broke expected variants, etc).
I don’t expect to be able to merge though.
I've also been told that they don't understand or even want to understand folders...
Do users really expect this now a days? Most users use cloud apps, and almost all of those save after every operation automatically.
It may be a useful functionality, but it is NOT what I would expect such a query to return, to be frank.
Also you don't need a nested query in this specific, you can order by checkinTime and limit the result to one.
> select manifest, versionId, checkinTime from version order by checkinTime desc limit 1
or something like that. This should work in SQlite and Postgresql at the minimum. I think to remember that in Oracle you have to use "where rownum=1" so indeed you have to use a nested query. I don't know about other databases.
See: https://www.sqlite.org/lang_select.html#bare_columns_in_an_a...
That query isn't guaranteed to produce a well-defined result in most SQL engines. (For pretty much the same reason the original doesn't/can't/shouldn't…) In the simple case of two rows with the same `checkinTime`, many engines permit the results to be ordered arbitrarily.
SELECT manifest, versionId, max(checkinTime)
FROM version
GROUP BY manifest, versionId
ORDER BY 3 DESC LIMIT 1;
or WITH m AS (SELECT max(checkinTime) AS checkinTime FROM version)
SELECT v.manifest, v.versionId, v.checkinTime
FROM version v
JOIN m m USING (checkinTime)
LIMIT 1;
It's a bit of a footgun though because there is some randomness here if multiple rows have the same max checkinTime, so I try not to use this SQLite3-ism. You want to also do something to deterministically pick a "best" row, but for that you need to do something like the above.In Postgres you can do similar things with a DISTINCT ON query.
I always found this one of the hardest simple things to do in SQL.
> SELECT manifest, versionId, checkinTime FROM version ORDER BY checkinTime DESC LIMIT 1
The problem with putting aggregation functions in the select output is that you're being unclear about what is aggregating; that pattern begins to break down once you have e.g. multiple documents in the same schema. Or if versionId somehow wasn't linear (e.g. branches of changes).
¹e.g., there could be two rows with the same checkinTime, whose value happens to then be the max such.
From https://www.sqlite.org/lang_select.html#generation_of_the_se...
> Each non-aggregate expression in the result-set is evaluated once for an arbitrarily selected row of the dataset. The same arbitrarily selected row is used for each non-aggregate expression.
Does `max` somehow only affect the selected rows? Or is this relying on a side affect of the query planner sorting the table to optimize max?
However then I found https://www.sqlite.org/lang_select.html#bare_columns_in_an_a...
> If there is exactly one min() or max() aggregate in the query, then all bare columns in the result set take values from an input row which also contains the minimum or maximum.
In all of the nearby examples contain an explicit "GROUP BY" clause but I don't think that this section says that one is required for this behaviour. So I guess this is the behaviour that is being described.
However I found this rule as well:
> If the same minimum or maximum value occurs on two or more rows, then bare values might be selected from any of those rows. [...] The choice might be different for different bare columns within the same query.
Which is in conflict with the earlier rule which says that the row is consistent. Or is this consistent row rule only provided for the implicit grouping. AKA is a and b guaranteed to be from the same row for the first query but not the second? That would be very surprising, maybe the docs just promise too little?
SELECT a, b, MAX(c)
FROM t
SELECT 1 as group, a, b, MAX(c)
FROM t
GROUP BY 1
There are also more not-well sepcified results if multiple of MIN or MAX are used or if these functions or customized. So overall it is probably best to avoid this in "production" use. But can be convenient for some quick exportation if you are careful.- Its size is less than a megabyte: https://sqlite.org/footprint.html
- 750KB if all features are enabled: https://www.sqlite.org/about.html
- Looks like fair amount of functionality can be left out when compiling sqlite and with options to influence/strip down query planner: https://www.sqlite.org/compile.html
- And "SQLite does not compete with client/server databases. SQLite competes with fopen()": https://www.sqlite.org/whentouse.html
In the end, you don't need a database, but a library that gives you database API and behavior.
Why do you need a single library that gives you a database API and behaviour?
Wouldn't it be better to decouple those: provide an open, standard format that enables compact, fast, structured storage that is built to allow transaction/atomic updates.
If that exists then you can plug sqlite on top of that, or something else. Because you don't _need_ any of SQL, or really sqlite to improve the OpenDocument format. You need the storage format.
OpenDocument is very different from the pretty scientific/niche/highly-vendor-locked examples given in replies by others here. Locking this into a format developed by essentially a single person with a single implementation is absolutely mad.
But... it's less mad if the file format wasn't coupled to sqlite.
Part of the problem is that sqlite3_stmt combines aspects of both prepared statements and result sets. There is a tendency to keep them around to cache the compiled bytecode (prepared statement), but your might code might stop mid-iteration (result set), maybe holding a lock at this point. This can lead to surprising lock-upgrade failures. In the end, I wrote extensive error reporting using sqlite3_next_stmt, sqlite3_stmt_busy, sqlite3_sql, just to weed out those issues. The entire transaction retry code I wrote is full of optional logging and many comments, even though it was just for my own personal use. Before that, I wrote transaction retry logic for PostgreSQL, and that was so much easier (but it was before fully SERIALIZABLE transactions arrived).
The other surprise is that “ A transaction committed in WAL mode with synchronous=NORMAL might roll back following a power loss or system crash.” (https://sqlite.org/pragma.html#pragma_synchronous), but that wasn't relevant to my application.
Who would attempt to make concurrent writes to an application document format? And how wouldn't such an attempt be a user mistake? Failing to write is the solution, not the problem.
Show me a formalised ISO / IEC / ANSI / ETSI SQLite standard that the Richard Hipp and his company never deviates from, and the full legal search to ensure there are no patents that might affect it, and show me the multiple compatible implementations of SQLite that _all_ have these touted advantages, and _then_ we can talk about prosletizing it as a file format. If they don't, they're saying "take a hard dependency on a single-source implementation, and make all your users take it too".
XML is a formal standard. ASN.1 is a formal standard. JFIF is a formal standard. Even ZIP is a formal standard (adopted as part of standardising OpenDocument: ISO/IEC 21320-1:2015)
The most important thing about a document is that everyone _else_ can read it. Saving time on writing updates to disk is an irrelevant sideshow. Did we learn nothing from Microsoft perverting the standards bodies to try and keep its lock-in?
https://arstechnica.com/uncategorized/2008/10/norwegian-stan...
> A letter of resignation written by the departing members and made public by The Inquirer accuses the standards body of folding to pressure from Microsoft, violating its own procedural rules, and ignoring the analysis of the technical committee tasked with evaluating OOXML.
You are right to point out the folly of deeper implementations like having and needing to understand table structures for things like slides. However, the current status quo involves Microsoft implementing a fairly esoteric "update the XML file's bytes as they would be encoded in a ZIP file" in their proprietary tool (where they have enough money to invest the engineering time) and all other tools use the slower "whole file in memory" approach.
User visible features like incremental fast saves (and shared editing) keep people on closed systems and give Microsoft the leverage to do the things you warn against. SQLite as a container format could have prevented that by giving everyone a shot at a lower cost but still fast implementation.
From TFA:
Note that this is only a thought experiment. We are not suggesting that OpenDocument be changed.
Same with Docker. Can’t be that wrong?
Using it for a open specification which target is cross implementation compatibility makes the move way more hazardous. Meaning, every implementation has to run on environment targetable and compatible wit sqlite or has to re-implement a compatibility layer on something complex enough that you only reliable definitive source of truth is the very famous sqlite test suite.
It the same reason why Web SQL has being abandoned: if sqlite is the sole api implementor, it takes precedence on any others specs, and you have no control on your standard.
I would be 100% for a specification on how to map open docs files to an relational structure, though, with a well know sqlite-backed implementation.
You need key/value lookup, a way to list/paginate, and transaction semantics for updates.
AKA: a zip file with entries as keys, and XML documents or attachment blobs for values. What’s lacking and causes issues is the update semantics.
You can wack a SQL query language over those 3 operations if you’d like. Or don’t. Up to you, because the format is defined and can be reimplemented rather than the large, complex library api.
I think you can achieve the same thing with xml/zip and some rename shenanigans, but sqlite lets you get that in a single file on disk.
Also if you are using sqlite as the memory model, why not use it as the disk/transport format? It's basically free at that point.
The file size issue can be dealt with VACUUM (I believe, haven't personally dealt with sqlite-as-file-format).
The claim "it's basically free" isn't right, as for transport you need to VACUUM. And possibly COMPRESS too. And if you do that... might as well use the existing format. VACUUM completely rewrites the file from scratch. You can't do incremental updates in a VACUUMed file as it stops being VACUUMed, so you need to VACUUM it again to ensure minimal file size. Nothing is free.
ZIP also can be incrementally updated (file by file) by the way, I think MS Word uses this feature in some saves. But that's beside the point. You simply do not need incremental updates in a transport format.
I'm not sure what "hiccups mid-write" you're referring to. Any such hiccup that would damage an XML or ZIP file would also damage an SQLite file.
The distinction between a working disk file and a transport format are important. The working disk file is large, binary, messy, complex, optimized for quick look-ups and quick partial updates. If your word processor crashes, it can restore state from the working disk format in no time.
But the transport format needs to be small, readable, debuggable, simple, stable. And SQLite simply doesn't offer anything significantly superior in that department compared to the existing format. Especially nothing to justify the additional effort of changing an already working solution.
There's a reason "serialization" is called that, it's just serial data. No random access structures, no indices, single representation, often text-based. Throughout the decades, we've learned this is the best way to transport data of any kind. The messy/partial/polymorphic/cryptic/hyperoptimized/indexed formats are not for transport. They're intended to do work in, locally.
1) Makes sense only if the average XML parsers and zip libraries in use have fewer vulnerabilities and are actively maintained as well.
2) You can store sensitive data in a SQLite database or XML file, there's no real difference. You can clean up a SQLite database pretty easily if you want and that doesn't take away all the benefits.
3) What does implementer support even mean? I believe they are open to custom work... WebSQL died because it doesn't make sense to pretend SQLite is some kind of standard -- that brings us back to 4), which is the valid reason to avoid SQLite.
Actually, your 4) is worded too strongly. They say they're committed to forward compatibility as long as you don't use the new features. That makes forward compatibility the decision of the app: an app can have forward compatibility and not use newer features OR lose forward compatibility and use newer features.
Yes, parsing encoded files tends to introduce vulnerabilities. ZIP parsers have had plenty of vulnerabilities. This is not exclusive to SQLite.
> Lack of transparency: zip with xml's contains only xml's
Both zips and sqlite cannot be read with a text editor. Both are open formats with widely available tools to read them. The sqlite binary might, in fact, be more widely available than unzipping tools.
> meanwhile SQLite contains by design all kinds of traces with sensitive information or empty blocks.
Elaborate?
> Lack of implementer support. It was one of the reasons for WebSQL deprecation many years ago.
I don't understand how this is relevant?
> SQLite does not even promise forward compatibility, only backward one. Which means that new documents might not open in old software
Neither does OpenDocument. SQLite is actually more solid in this regard – forwards compatibility is still a thing unless new features are used.
Well, that's why your archive format of choice should be cpio, which is almost a text file except that modern implementations tend to 0-terminate the filename!
Jokes aside, there are widely-distributed tools that can take in an almost-arbitrary zip file and account for every byte in it. The format is straightforward enough that, were you so inclined, you can do most of it (other than, like, decompression and crc-checking) manually in a text editor. The SQLite format is not like this. There is one implementation, and relatively easy to "hide" data in a database file that its tooling will not reveal.
> Elaborate?
When you delete something from a SQLite database, it isn't necessarily actually removed from the file unless you VACUUM or have the secure_delete PRAMGA turned on. Either of these should solve the problem.
VACUUM INTO is a good way to export sqlite databases from an application for this reason.
If we are talking about binary formats, now there are systematic solutions like https://github.com/google/wuffs that protect against vulnerabilities. But SQLite is not just a format - it's an evolving ecosystem with constantly added features. And the most prominent issue was not even in core, it was in FTS3. What will SQLite add next? More json-related functions? Maybe BSON? It is useful, but does not help in this situation.
Regarding traces, there are many forensics tools and even books about forensic analysis of SQLite databases. In well-designed format such tools should not exist in the first place. This is hard requirement: if it requires rewriting the whole file - then so be it.
This way "main" document file (which might be checked in to git, or shared via dropbox or read by some document) only contains nice, clean, saved version.
And yet if your computer crashes for whatever reason, the data is still not lost and can be trivially recovered.
This is an app feature (it doesn't have to be "Export As", it can be a "clean history" toggle in the same "Save As" dialog and a separate command), so not a reason to excise efficient history preservation from the file format
There is also this, which seems to work: https://github.com/jacobsa/fuse
and this: https://github.com/jilio/sqlitefs
You're not going to have a sensible text version of a btree that is reasonably editable by a text editor.
My guess would be zlib is still number 1 though, even accounting for Daniel's considerations.
[0] https://daniel.haxx.se/blog/2021/10/21/the-most-used-softwar...
> Less code, fewer bugs.
The whole point of my remark is that the domain model and the export document format are two entirely separate things.
"those are completely different uses"
It's not a hard concept to grasp. There is no riddle to decipher.
Eh, if they're not running the same office system, down to patches, you can't really expect much.
The prior comment of "model user interactions in the database" seems spot on -- just keep track of what the user's doing as unsaved data in the database and commit it (in the appropriate way) to the DB as it happens; save is just another user action.
Presumably sqlite has figured out how to write to the filesystem without corrupting itself even in a variety of adverse scenarios?
If not, commit to a temporary copy of the DB that gets renamed to the "main" name periodically or when the app closes. There's an xzzzbit meme there somewhere, yo.
The trouble is that often in the wild, the content of the file on the filesystem is a user-facing interface. Users will copy it around and attach the whole document onto emails. When they do, they do not expect the file to contain data that they didn't want to save.
(Yes, they sometimes also expect the file to contain data that they wanted to but didn't explicitly save. This is not a contradiction.)
For example there's the story of the academic studies with falsified data [1] where forensics on the included excel documents showed they'd gone through and replaced data with "randomized" data (if I remember correctly); there are tons of examples of "redacted" data in pdf docs being visible under the blacked-out rectangles.
I'm not disagreeing with you, btw, such actions are certainly problematic, but hopefully kids will grow up knowing they need to run an export to sanitize their data if they don't want to show the whole world their transaction logs...
I think that's fine. Just remove "save" from the UI, and save after every keystroke. This may sound crazy in 1970, but it's how nearly everything works today. It's really only us weirdos that started using computers before "the cloud" that think "save" is an operation that does something, and we're dwindling in numbers!
nitpick: At least on some filesystems if you rename a.txt to overwrite b.txt and the machine crashes, you might end up both a.txt and b.txt hardlinked so they contain the same data.
Of course this is no big deal since b.txt is still updated atomically so it contains the new data (assuming a.txt was fsynced) or the old data. I assume nobody depends on a.txt being deleted simultaneously.
You don’t need sql for any of it.
You do need to parse DDL to get the column names, they're stored as a "CREATE TABLE" string. But you don't have to if you want to dump the file without names.
do you have any links?
Like, if someone changes some text in a paragraph, you can't just model that as "this paragraph now contains this new text". You have to model it as "this paragraph used to contain this text, but an uncommitted change changed it to this other text". User deletes an image? You have to still store the image and all the references to it, but with an uncommitted change to delete the image and remove the references to it.
And maybe that's a good thing, maybe a git-like system where the history of every change is tracked is what you want. But it certainly doesn't feel like it'd be appropriate for every application and file format.
Also binary documents are a lousy fit with git, smashing square peg into round hole makes little sense.
sqlite> create table x(c1, c2);
sqlite> insert into x values ("a", 1);
sqlite> insert into x values ("b", 2);
sqlite> insert into x values ("c", 3);
sqlite> select c1, max(c2) from x;
c|3
sqlite> select c1, max(c2), min(c2) from x;
a|3|1
sqlite> select c1, min(c2), max(c2) from x;
c|1|3
(note: since SQLite is dynamically typed, no need to specify column types for simple examples like this).Configuration of a repo indeed isn't defined as an artifact but as a SQLite table. One may wonder if this should be part of a repo, and I would say it should, so it actually is surprising that it's not also stored as artifacts
I have a love-hate relationship with this approach.
Using the C source code still leaves room for ambiguities / under-specification, no? After all, the semantics rely on the particular gcc release used for compiling the code.
By the way, concurrent read/writes on locally stored documents happen, even on single-user machines. If the reading process uses the SQLite structure (say a document indexer that knows about the format), it has to take some locks and may also need to flush data from the WAL log (depending on implementation details). At this point you have to deal with concurrency issues in the application, too. Unless you rewrite the entire document from scratch on every save and put it in place with an atomic rename (which I ended up doing for a different application, not the transaction-processing one). But that loses some of the advantages of SQLite.
Setting aside technological details, multiple clients operating on independent rows of the same table can only, at worst, waste time by retrying a transaction, while multiple concurrent users attempting to modify the same document are asking for trouble, and if they succeed they probably succeed at corrupting the document.
Even without lock contention the aggregate document state can be incoherent (for example, Alice and Bruno edit a text, but they accidentally modify the same section and the latest save prevails and nobody notices).
Well i get what you are saying, sqlite has been ported all over the place. It probably wouldn't be the limiting factor portability wise.
Doing periodic and automatic saves is good. Doing so on a document "in production" is majorly stupid. Not that I want to accidentally validate the busy work dev ops puts us through.
I would say the traditional model is a compromise from back when disks were unacceptably slow to be saving constantly.
It was unusable. The only thing that I was able to do is to tar it up and move to a machine with xfs, where I was able to sort it up into more balanced subdirs and then move it back (for processing using windows-only tool). Just tarring that single directory up took several days.
btrfs was unusable (not only that particular directory, but the whole filesystem became noticeably slower).
ext4 was ok. xfs didn't break a sweat. I don't recall any practical difference when compared against a nested tree like
├── aa
│ ├── aa
│ │ └── aaaaf3ee5e6b4b0d3255bfef95601890afd80709
│ ├── ab
│ └── ac
└── ab
├── aa
├── ab
└── acWe tried to use an image of traditional filesystems (ext4 and fat32), but with most files being under 1 KiB, it was super wasteful.
Our use case was a Maven mirror for disconnected environment that only contains metadata (i.e. lots of small XML files, without the actual jar). We already had a MongoDB service for some other JSON data, so here we are.
> The Web SQL specification cannot be implemented sustainably, which limits innovation and new functionality. The last version of the standard literally states "User agents must implement the SQL dialect supported by Sqlite 3.6.19". SQLite was not initially designed to run malicious SQL statements, yet implementing Web SQL means browsers have to do exactly this. The need to keep up with security and stability fixes dictates updating SQLite in Chromium. This comes in direct conflict with Web SQL's requirement of behaving exactly as SQLite 3.6.19.
There’s very good reason for that not to be a standard. (Now, assuming the SQLite documentation is licensed in a way which supports this, copying the documentation of SQLite’s supported SQL as of that version into the standard might have been viable, but no one interested in having WebSQL proposed that or any other resolution.
That relates to the cited issue of absence of independent implementations, which would have been a problem even with a spec that supported independent implementations and verification of their compliance independent of a particular reference implementation. though I personally think the spec problem is a bigger real problem (even if not the decisive policy problem) than the “everyone is using the same underlying software to implement the spec” problem is in this case, where the shared implementation is a permissively licensed open source implementation sponsored by several of the browser vendors, among others.
For the latter, because the stored data has such a simple format and the implementation has so few dependencies, I expect it will be very easy to get your data out for a long time to come. It's going to be tougher if you have business logic in views or other SQL expressions, of course, and if you rely on SQLite's particular approach to data types (as in “values have types“, but not much more).
I could agree about the single implementation, but if the alternative is making something new up I am not sure in what way that would be better.
I haven't studied the spec in detail but it seems comprehensive.
The fact that there also exists a high-quality, stable, public domain reference implementation can't really be counted against the format, can it?
The high-level software abstraction approach doesn't hold up when it comes to databases. This is such a wide and performance-critical interface that any abstractions are gonna leak badly. Even the SQL standard has all these impl-specific flavors. Many have tried to build layers on top that'll work with multiple DBMSes, and it's never worth. Anyone writing an app backend is just gonna marry a particular DBMS for the performance benefits (puns intended).
If for some reason an alternative implementation really needs to exist, SQLite is simple and open enough that someone can do it.
You need fast listing/pagination, key value get/set, and transactional updates. Basically DynamoDB, but for a single file. Build a query layer on top of that, sure. Use those primitive to build persistent indexes if you want.
Or just iterate through the keys in a for loop. It fits in memory anyway.
You don’t need a fully fledged DBMS for a word document. And if you’re shuffling around lots of data in a structured format with no updates needed, you probably want arrow/parquet rather than sqlite because the read performance is going to crush SQLite.
What's the issue with the update semantics, though?
You can’t squeeze it in without moving everything else about, which disrupts other readers. You could append it to the end maybe, but you need to handle concurrent writers. Compression also is an issue here - I expect zip compression is applied to multiple files at once, rather than per file? So now you might need to update multiple seemingly unrelated files.
You need to step down from the concept of a whole file as a unit and move towards pages of data that can be incrementally updated/reused/freed, where each page might contain one, many or even only a part of a “unit” (file/row/whatever)
This makes things more complex for sure
Many times have I encountered large docs, often spreadsheets, that push the limits here and become noticeably slow. If you want to get more sophisticated with the indexing and paging, SQLite is a very natural path. Anything else would be reinventing the same wheels SQLite has spent decades refining.
However, in their brilliance, AWS RDS defaults to only NO_ENGINE_SUBSTITUTION for SQL_MODE, thus merrily allowing partial aggregates with non-deterministic results. Wheee!
Version 5.7 introduced ONLY_FULL_GROUP_BY, but since that change broke lots of code that depended on this historical behavior, many people disabled it.
That'd be fairly non-optimal behaviour. ;)
When the application tries to add wrong data (eg duplicate key violating uniqueness constraint), SQLite will return an error.
The application should handle things better than by crashing. In theory anyway. :)
Makes me think of the "Voting software" xkcd: https://xkcd.com/2030
"I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die."
I'm pretty sure that what's actually going on is that everything we think of as a 'profession' is that way, and that people are in fact dying because of it.
The difference is that devs are honest about it.
If you make saving the default you have to manually not save. It's a trade off versus default no saves with manual saves
Which is exactly the problem. They (I.e one dude?), and they alone have spent decades refining a single implementation.
Before we go and lock the entirety of the worlds documents into what’s essentially a proprietary format specific to a single implementation of a single library written by a single dude… we should double check if that’s a good idea or not, and if we can, collectively, solve some of these issues without reimplementing the whole of SQLite.
Because that’s complex. Perhaps more complex than it needs to be for most applications, which would benefit from the storage part more than the query part. And then we are back at the start of our discussion?
But hopefully the choices are a bit better than just those two. :)
In the case of checksums in a database, that is why read-only modes should be used and I don't see what automatically saving would change. If anything, when the user zooms on a document in read-only mode, either it shouldn't be stored or storing it should trigger the same flow as modifying the document
95% of the time I want changes persisted immediately, but it's nice to be able to turn it off when I don't.
But generally the way autosave works is to save a copy that can be recovered on a crash, and only overwrite the original if directed by the user. That works for both use cases. (Haven't used Word in years, so I'm not sure if they have a different behavior now.)
1. on opening a file clone it to a temporary folder
2. edit the temporary file there on disk
3. on save mv/cp the temporary file over the destination
I am probably missing a lot of use cases, but it migth be a good idea for a game like Factorio where you are expected to have multiple on disk saves of the same run at different times.
I'm just pushing back against the idea that its a good or helpful idea to "help the users" by taking the deliberate "save" action away from them.
As an aside, one of the things that has been learned from this class of MRI scanners is that users need to feel "in control" of the machines they're using. The "look how smart this machine is by doing all these magical things you used to do yourself!" attitude works well in sales but really does not go over well in the field because users encounter the fuckups and are held responsible for them. So they quickly start to distrust the machines.
If the size, mtime, and inode number stay the same (i.e. it writes into the file directly instead of replacing it), then most backup software will skip it. AFAIK to do otherwise you either need to read the whole file every time, or be live monitoring audit events to see what files have been opened for write, or be the filesystem (e.g. ZFS snapshots, which can be maximally efficient since it knows exactly which blocks it's modified)
Of course this has its own downsides. While those writes may have been "unimportant", the fact is that your backups are now flawed. And if the application has had the foresight to distinguish unimportant writes, and preserve the mtime, I'd rather they just not make those writes in the first place
Linux also has a ctime ("status change time"), which is automatically updated when the content or metadata (inode) are changed. It is not possible to change from userspace (you have to use tricks such as changing the computer clock or directly modifying the disk). This gives you the security benefits, but it is not commonly used e.g. in backup tools, precisely because they want to be able to set "fake" timestamps.
And by some applications I pretty much just mean browsers, but still.
It just requires you to be the owner of the file (https://serverfault.com/a/337810), no special permissions. On the other hand, editing the ctime is not easy -- it requires tricks (e.g. change the computer clock, which typically requires root privileges).
However this has worked well for twenty years, so PEBKAC is a reasonable conclusion.
No. It’s the on disk format that matters. Because it would be just as slow and scary if it used a sqlite file that was embedded in a zip file or something equally as mad.
It’s not the SQL, it’s the file format.
If you decouple the file format from the SQL engine, it becomes simpler to reimplement, more agnostic and less vendor locked.
> If you decouple the file format from the SQL engine
That alone would be a difficult project. If you really want to break things down, easier to say your document standard relies on SQLite's rather simple query language (https://www.sqlite.org/lang.html), and there, it's independent of SQLite's query planner and file format. Wouldn't be hard to make it work with Postgres or MySQL, for instance.
Document standards are file formats…
Or are you saying a document format should just be some DDL statements? What? How is that interoperable? It’s coupled to the database that is storing the data as an implementation detail, which is exactly the problem with using SQLite.
> That alone would be a difficult project
I’m not suggesting using the SQLite file format, I’m suggesting the pretty basic idea that the storage for a general purpose widely used and interoperable document format should be logically decoupled from anything else, and definitely not be tied to the implantation details of a single library or even a single version of that library.
The file format is the most important part. It’s the only part. Nothing else matters because there is nothing else.
It’s not rocket science.
Yes. How is it interoperable, because it's quite easy to make DDL for SQLite that also works for many other DBMSes, given that SQLite is kinda the lowest common denominator of those.
Maybe not as interoperable as ODF since it's easier to implement an ODF parser/writer than a SQLite clone, but probably more interoperable than some kind of advanced ODF designed for efficient updates. Just because you define a standard doesn't mean there are good portable implementations out there.
Ladies and gentlemen: he’s so close, he’s nearly there, but he just can’t make the final connection!