What if OpenDocument used SQLite? (2014) (sqlite.org)

445 points by weeber 2 years ago | 293 comments

p4bl0 2 years ago |

I'm currently working on an application where I use SQLite as the file format. I want to keep the usual workflow for users, where you can make edits to your document and the file only changes when you save it.

So to open a file I copy it into the :memory: database [1], then the user can do whatever manipulation they want and I can directly make the change in the database; I don't need to have a model of the document other than its database format. And to save the document I VACUUM [2] it back to the database file. It works quite well, at least for reasonably sized files (which is always the case for my app) :).

[1] https://www.sqlite.org/inmemorydb.html

[2] https://www.sqlite.org/lang_vacuum.html
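
For readers who want to try this open-into-memory, save-with-VACUUM workflow, here is a minimal sketch in Python. It is not p4bl0's actual code: the function names `open_document` and `save_document` are made up for illustration, loading uses the `sqlite3` module's backup API, and saving assumes an SQLite new enough to support `VACUUM INTO` (3.27 or later).

```python
import os
import sqlite3
import tempfile


def open_document(path: str) -> sqlite3.Connection:
    """Copy the on-disk database into a private in-memory database."""
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    try:
        disk.backup(mem)  # online backup API: disk -> memory
    finally:
        disk.close()
    return mem


def save_document(mem: sqlite3.Connection, path: str) -> None:
    """Write the in-memory database out with VACUUM INTO, then swap it in."""
    mem.commit()  # VACUUM cannot run inside an open transaction
    # VACUUM INTO refuses to overwrite a non-empty file, so write an empty
    # temporary file next to the target and rename it over the old document.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    mem.execute("VACUUM INTO ?", (tmp,))
    os.replace(tmp, path)
```

Nothing touches the file on disk between `open_document` and `save_document`. Note that `VACUUM INTO` does not sync the output to disk, so this sketch is not power-loss safe on its own.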

rakoo 2 years ago | |

Why do you use a secondary, volatile database? Performance-wise you won't gain much (we're talking about a user editing a file, so not even 1 write per second).

A proposal: write directly and automatically to the database. No more Save button. There are multiple advantages:

- the system is crash-resistant. I like taking the approach of CouchDB where the only correct way to close the system is to crash it. That way a crash is an expected situation that you actually account for, not a special case that you might forget

- there is only one database. Less code, fewer bugs.

- it is safe. A write to SQLite works or doesn't work, there is no in-between. As said in the VACUUM doc you point to: "However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt"

- it is how SQLite was intended to work. And because of that, you won't have to think about it for the lifetime of SQLite

fluidcruft 2 years ago | | |

There is nothing I hate more than an app that modifies files secretly when I open them. Then I have to get all defensive to copy files before I open them to keep them intact. You may not see the problem with changing the checksum or hash of a file, but silently tampering with files is a nightmare in many domains. You can open a file and accidentally change something trivial (some apps like to store things like presentation state, i.e. window positions, last page viewed, zoom level, ...).

For example in many regulated domains such as human subjects research files must be approved and only approved files may be used. "Is this version of the consent document the version that the IRB approved?" Well let's see... (1) file modification date is after the approval date and (2) checksums do not match.

Not to mention that writing a single byte of content to a filesystem marks the entire blob as needing backup.

The fact is the filesystem is the user's database, save is commit, and it should be under the user's control because application developers do not have the faintest idea about user context.

p4bl0 2 years ago | | |

> Why do you use a secondary, volatile database?

For the exact reason I gave in the comment you are replying to: I want to keep a usual workflow for users. Principle of least surprise.

Users are okay with changes being autosaved when there is a single "thing" that can be edited, to the point that you don't even have to open it; it's just there, and it can be seen as a property (as in ownership) of the application more than of the user. For example, your music library in your jukebox application.

On the contrary, when the user has to open the "thing" with your application and can choose between many of their files that can be edited with your application, users do not expect their files to be automatically modified at all. For example, users may start doing some heavy editing and then, at the moment of saving their work, make a backup of the previous state of the file before saving, or choose "save as…" in order to keep the old version just in case.

Crashes are not something that happens that often. It can become an actual problem when you have tens of thousands of users and rare events do happen, but in the particular case of the application I am working on, I do not actually have to worry about that (on the contrary, any solution would have downsides that are worse, in the particular case of this application, than having to redo some work because of a crash, if it ever happens).

knome 2 years ago | | |

a save button is still good, as it allows you to keep specific checkpoints.

but the save button could simply tag specific save points in a larger table.

if the format can roll up changes to compress them, they also indicate which variants need to be kept indefinitely.

asalahli 2 years ago | | |

> I like taking the approach of CouchDB where the only correct way to close the system is to crash it.

The term you're looking for is (aptly named) crash-only software.[0]

0. https://en.m.wikipedia.org/wiki/Crash-only_software

nextaccountic 2 years ago | |

This means that like a regular app, you lose data if the app crashes or there is a power loss.

It's much better to save after each operation in a temporary place (probably in ~/.local/share/application/yourapp, using XDG directories), and when the user clicks save, just copy the file into the desired location. That way, if there is a power loss and you reopen the app, it opens right back where it was (losing maybe the last few seconds of changes, but not all unsaved data).

scherlock 2 years ago | | |

If you have a db, why not just model it as unsaved data? I.e. all changes get stored to the db, but have a flag of unsaved. If you open up a file and there are unsaved changes, you can prompt the user to either make them saved or discard them.

Someone 2 years ago | | |

> and when the user clicks save, just copy the file into the desired location.

To be perfectly safe, you want to rename it, not copy it. If there’s a power loss during copying, you may end up with corrupted data.

Renaming is, to coin a phrase, “more atomic” than copying (on Linux, the OS says it is atomic. ISO C says it, too, but POSIX doesn’t (https://pubs.opengroup.org/onlinepubs/000095399/functions/re...: “This rename() function is equivalent for regular files to that defined by the ISO C standard. Its inclusion here expands that definition to include actions on directories and specifies behavior when the new parameter names a file that already exists. That specification requires that the action of the function be atomic”)).

Also, filesystems may have bugs, hardware may lie about syncing to disk, and network shares can be finicky.

Doing this properly isn’t as easy as one would think. You have to make sure to sync the file being written, and you have to handle the case where the save location is on a different file system than your temporary file. If so, you’ll have to create a copy on that file system first.

I think many tools do not check whether they need to work cross filesystem and just write their scratch files to the save directory with a different name and then rename them.

Of course, that means you always need twice the disk space on the target disk to do a save. That used to be a problem almost everywhere, but nowadays mostly is restricted to embedded systems and USB sticks.

In this case, however, SQLite will do a lot for you, and probably better than you would do it. It claims (https://www.sqlite.org/atomiccommit.html#_multi_file_commit):

“SQLite allows a single database connection to talk to two or more database files simultaneously through the use of the ATTACH DATABASE command. When multiple database files are modified within a single transaction, all files are updated atomically. In other words, either all of the database files are updated or else none of them are. Achieving an atomic commit across multiple database files is more complex that doing so for a single file. This section describes how SQLite works that bit of magic.”

However, about VACUUM INTO, it says (https://www.sqlite.org/lang_vacuum.html):

“The VACUUM INTO command is transactional in the sense that the generated output database is a consistent snapshot of the original database. However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt. Also, SQLite does not invoke fsync() or FlushFileBuffers() on the generated database to ensure that it has reached non-volatile storage before completing.”

So, I don’t think doing “VACUUM INTO” is sufficient to guarantee that you get a good copy of your data on disk.
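
The write-to-a-scratch-file-then-rename approach described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not a complete solution: `atomic_write` is a made-up helper name, the scratch file is placed in the target directory so the rename stays on one filesystem, and (as noted above) lying hardware or network shares can still defeat it.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory means same filesystem, so the rename never becomes a copy.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the contents to storage
        os.replace(tmp, path)  # rename over the target
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # do not leave a stray scratch file behind
        raise
    if hasattr(os, "O_DIRECTORY"):  # POSIX only: also sync the directory entry
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The cost is the one mentioned above: while saving, the old and the new copy both exist, so the target disk needs room for both.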

avereveard 2 years ago | | |

Good ole .filename.swp

p4bl0 2 years ago | | |

Yes, I am aware of that, and you are right about this in general. In my particular case however, it is preferable to lose some work in the rare cases where a crash occurs than to have a copy of the file in some place that the users are not aware of. Of course if crashes were frequent the trade-off would be different.

SanderNL 2 years ago | | |

You are right, and as you explained, this is trivially fixed by autosaving regularly.

What I have trouble imagining is people working with documents on computers for more than a few years yet somehow failing to develop the Always Save Instinct. I regularly catch myself saving unreasonably often.

liuliu 2 years ago | |

Maybe simpler? When you open the DB, switch it to WAL mode and turn off automatic checkpointing: https://www.sqlite.org/pragma.html#pragma_wal_autocheckpoint

When the user saves, you just checkpoint the WAL, merging it back into the main database.
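
A rough sketch of this WAL idea in Python, with hypothetical function names (`open_document`, `save_document`). Two caveats worth knowing before relying on it: switching a database to WAL mode itself writes to the file's header, and by default SQLite runs a checkpoint when the last connection closes, so "discard changes on close" needs extra work that is not shown here.

```python
import sqlite3


def open_document(path: str) -> sqlite3.Connection:
    """Open the document so that edits accumulate in the -wal file."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL").fetchall()
    # A value of 0 turns automatic checkpoints off for this connection.
    conn.execute("PRAGMA wal_autocheckpoint=0").fetchall()
    return conn


def save_document(conn: sqlite3.Connection) -> None:
    """Merge everything in the WAL back into the main database file."""
    conn.commit()
    # TRUNCATE also resets the -wal file to zero bytes after checkpointing.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
```

Until `save_document` runs, committed edits live in the `-wal` file next to the document rather than in the main file.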

nyanpasu64 2 years ago | |

> The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction. This means that when VACUUMing a database, as much as twice the size of the original database file is required in free disk space.

> The VACUUM INTO command works the same way except that it uses the file named on the INTO clause in place of the temporary database and omits the step of copying the vacuumed database back over top of the original database.

Do you use VACUUM (uses a write-ahead log to survive power-off) or VACUUM INTO (as far as I can tell, it doesn't survive power-off during writing, and might corrupt the existing file contents if the filename already exists)?

justsomehnguy 2 years ago | | |

>> The file named by the INTO clause must not previously exist, or else it must be an empty file, or the VACUUM INTO command will fail with an error.

EDIT: there is no difference between VACUUM/VACUUM INTO - they both write to a new file (COW) it's just VACUUM [NOT INTO] does mv temp.sqlite originalfile.sqlite after that, while VACUUM INTO does not.

ilyt 2 years ago | |

We used something similar (a caching DB run in memory but saved periodically to disk), but with the backup API:

https://www.sqlite.org/backup.html
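
In Python the backup API is exposed as `Connection.backup`, so a periodic memory-to-disk snapshot could look something like this. The helper name `snapshot_to_disk` is made up, and this is only a sketch of the idea, not ilyt's setup.

```python
import sqlite3


def snapshot_to_disk(mem: sqlite3.Connection, path: str) -> None:
    """Copy the current contents of an in-memory database to a file."""
    mem.commit()  # make sure pending changes are part of the snapshot
    dest = sqlite3.connect(path)
    try:
        mem.backup(dest)  # overwrites whatever the destination held before
    finally:
        dest.close()
```

Calling it from a timer gives the "saved periodically" behaviour; calling it from a Save action gives explicit saves.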

remram 2 years ago | |

Why not use a transaction?

p4bl0 2 years ago | | |

A single transaction for the whole user session? That seems like a bad idea. Also I'm not sure you can do transactions during another transaction, and I need them for other purposes, i.e., for what they were designed to do (making changes in multiple tables that need to stay consistent).
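
On the question of transactions inside transactions: SQLite rejects a second BEGIN, but it does support SAVEPOINT, which nests. The following is a small illustration only (the `todos` table is invented for the example) and takes no position on whether a session-long transaction is a good idea.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode, so the
# BEGIN / SAVEPOINT statements below are exactly what SQLite sees.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE todos (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO todos VALUES (42, 'TODO')")


def status(todo_id: int) -> str:
    rows = conn.execute(
        "SELECT status FROM todos WHERE id = ?", (todo_id,)
    ).fetchall()
    return rows[0][0]


conn.execute("BEGIN")  # outer transaction, e.g. the whole editing session

try:
    conn.execute("BEGIN")  # a second BEGIN is an error
    nested_begin_allowed = True
except sqlite3.OperationalError:
    nested_begin_allowed = False

conn.execute("SAVEPOINT edit")  # inner unit of work
conn.execute("UPDATE todos SET status = 'DONE' WHERE id = 42")
conn.execute("ROLLBACK TO edit")  # undo only the inner work
conn.execute("RELEASE edit")
after_inner_rollback = status(42)

conn.execute("SAVEPOINT edit")
conn.execute("UPDATE todos SET status = 'DONE' WHERE id = 42")
conn.execute("RELEASE edit")  # keep the inner work; the outer one stays open
after_inner_release = status(42)

conn.execute("ROLLBACK")  # abandon the whole session
after_outer_rollback = status(42)
```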

rewmie 2 years ago | |

> I can directly make the change in the database; I don't need to have a model of the document other than its database format.

I don't get your point. Are you saying that you don't need to have a model of the document other than the model of the document? What's the nuance I'm missing?

torstenvl 2 years ago | | |

An in-memory data model often differs from the serialized data as it exists on disk. For example, emacs uses a gap buffer for text files; but it outputs plain linear text to disk.

Programmers often have to make software design decisions around how to represent a file in memory in order to manipulate it. For example, if I'm writing an HTML editor, should I mostly treat it like a text file (maybe a gap buffer) with syntax highlighting and auto indentation as an afterthought? Or should I maybe load the whole thing into a tree? What are the robustness and performance characteristics of each?

The commenter above was saying that using SQLite made that decision easy. He could keep traditional (or "atavistic" per the commenter upthread, depending on your perspective) load/save semantics while also making the data model easy to work with.

socksy 2 years ago | | |

I suppose this is in the context where you will be syncing up the changes to a backend server which will also be storing the document in an SQL database. Normally, you might expect the data format on the client to be JSON/XML/something else, and you'd need to maintain logic that marshals the document representation:

    SQL <-> In-memory representation <-> Disk format.

With SQL on the client, in theory you now only need to maintain

    SQL <-> In-memory representation

Obviously I'm skirting over the format you would use to send either entire documents or partial updates of documents over the wire.

p4bl0 2 years ago | | |

When an application loads a document, for example if the document is formally a list of things (imagine a very simple TODO app), the usual approach is to have this data represented (modeled) as an actual list in your program, like a Python list of objects, because it's what is easy to manipulate programmatically.

Then, saving your document means serializing the data in some format (which could be JSON, XML, CSV, an SQLite database, …) and writing that to disk, and opening a document means reading the file from disk and unserializing it to your internal model.

What I'm saying is that my approach is to use an in-memory SQLite database as the internal model of the data in the application. I presented an upside (opening and saving are easy), but it also has downsides: I have to do SQL queries to manipulate the data rather than manipulating objects directly (which could be mitigated using an ORM but that's outside my point). In Python-like pseudo-code you can imagine something like:

    # usual approach: manipulate objects directly
    self.todos[42].status = 'DONE'

    # my approach: query the in-memory database
    self._db.query("UPDATE todos SET status='DONE' WHERE id=42")

(Of course there is the possibility of using ORMs or other approach in between the two.)
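
A runnable take on the second style, for anyone who wants to play with it. Everything here (the `TodoDocument` class, its methods, the `todos` schema) is invented for illustration and is not p4bl0's application code; it only shows the in-memory database acting as the document model.

```python
import sqlite3


class TodoDocument:
    """A document whose only in-memory model is an SQLite database."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE todos ("
            "id INTEGER PRIMARY KEY, title TEXT NOT NULL, status TEXT NOT NULL)"
        )

    def add(self, title: str) -> int:
        cur = self._db.execute(
            "INSERT INTO todos (title, status) VALUES (?, 'TODO')", (title,)
        )
        return cur.lastrowid

    def mark_done(self, todo_id: int) -> None:
        self._db.execute(
            "UPDATE todos SET status = 'DONE' WHERE id = ?", (todo_id,)
        )

    def pending(self) -> list:
        rows = self._db.execute(
            "SELECT title FROM todos WHERE status != 'DONE' ORDER BY id"
        )
        return [title for (title,) in rows]
```

There is no separate list of objects to keep in sync: every read and write goes through SQL, which is exactly the trade-off described above.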

hot_gril 2 years ago | |

Btw, Apple's CoreData, commonly used by iPhone and Mac apps, uses SQLite by default. That part works fine, so you can study it if you'd like and ignore all the bad parts built on top (ORM, MVC framework, etc).

miki123211 2 years ago |

The problem with SQLite is that it's not a standardized file format. It's well-documented and pretty well understood for sure, but there's no ISO standard defining how to interpret an SQLite file in excruciating detail. The same goes for competing implementations: Zip and XML have a much smaller API surface than SQLite, whose API, apart from a bunch of C functions, is the SQL language itself. Writing an XML parser is not a trivial task, but it's still simpler than writing an SQL parser, query optimizer, compiler, bytecode VM, full-text search engine, and whatever else SQLite offers, without any data corruption in the process. If Open Office used SQLite, its programmers would inevitably start using its more esoteric features and writing queries that a less-capable engine wouldn't be able to optimize too well.

This isn't a concern for most software. If you're writing a domain-specific, closed-source application where interoperability with other apps or ISO standardization isn't a concern, SQLite is a perfectly fine file format, but as far as I understand the situation, those concerns did exist for Open Office.

nyanpasu64 2 years ago |

I was optimistic that Audacity adopting SQLite would be a substantial improvement in its file saving capabilities. In practice I encountered many gotchas:

- On Linux, saving into a new file onto a root-owned but world-writable NTFS mount created in /etc/fstab, fails due to permission errors or something. Saving into an existing file works as usual.

- Files are modified on disk when you edit the project in the program, creating spurious Git diffs if you check Audacity projects into Git as binary blobs. And when you save the file, old and deleted data is left in the SQLite file until you close the project's window (unlike saving a file in a text editor), and you can accidentally commit that into a Git repo if you don't close the window before committing. (I recall at one point that you had to manually vacuum the .aup3 file, but now closing the window is sufficient.) I'm getting Word 2003 Fast Save vibes.

Freak_NL 2 years ago |

Good article. Although one thing I do like about OpenDocument being just a bunch of XML files in a ZIP archive is that it is fairly easy to generate documents like spreadsheets without using a (potentially hefty) library which knows about the document format.

I have a use case where users of a web service want to use data exported as a bunch of rows in a table in a variety of tools. Now, CSV with UTF-8 encoding is of course, totally open, conventional, and workable, but anyone who has ever offered CSV files to end users will know the pain of these users getting stuck when they want to use these files in a spreadsheet application¹. So I saved a sample spreadsheet in OpenDocument's ODS and another in that Microsoft XML abomination called OOXML as XLSX, and just figured out the basics of those XML formats. I trimmed the ZIP archives down to the essentials, marked the places where content goes, and just build a new spreadsheet file whenever data is requested in that format. Now I can output CSV, ODS, and XLSX (and JSON thrown in for good measure) of the same data.

Doing this with SQLite would be possible of course, just a tad more complex and with a lower development speed. Being able to fire up the office suite, create a template document, and just dig into its XML files in the saved file is a nice feature (although admittedly of niche interest).

1: More specifically, users who use Excel in a locale like nl_NL, where CSV files are assumed (hardcoded) to have their columns separated by semicolons, because Microsoft once notoriously decided that the Dutch did not use commas in a comma-separated values file.

dfox 2 years ago | |

As for [1], it is not really hardcoded, but depends on the value of localeconv()->decimal_point: if it is “,”, Excel uses semicolons both in CSV files and in the formula expression language.

This used to be configurable when opening a CSV/TXT file in Excel (and still is in LibreOffice), but as part of the overall UI dumbification it was moved somewhere under the “Data” menu/ribbon tab (so you have to open a new workbook and find the right option, or, well, use LibreOffice if you value your time).

Freak_NL 2 years ago | | |

> decimal_point

Are you sure that affects it? The decimal point parameter sounds like it decides how to write out 5½ (i.e., 5.5 (English style) or 5,5 (Dutch style)) surely? Although on the topic of this particular bête noire I would not be surprised.

stareatgoats 2 years ago |

As an aside, this blew me away. I can hardly believe it. No nested query required?

> SELECT manifest, versionId, max(checkinTime) FROM version;

> "Aside: Yes, that second query above that uses "max(checkinTime)" really does work and really does return a well-defined answer in SQLite. Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect: it returns the manifest and versionId of the entry that has the maximum checkinTime.)"
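
This is easy to check from Python. The table below is a stand-in with the three column names from the quoted query (the article's real schema is not reproduced here), and the second query is the more portable way to ask the same question in other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE version ("
    "versionId INTEGER PRIMARY KEY, manifest TEXT, checkinTime INTEGER)"
)
conn.executemany(
    "INSERT INTO version VALUES (?, ?, ?)",
    [(1, "m1", 100), (2, "m2", 300), (3, "m3", 200)],
)

# SQLite-specific: with a single max() aggregate, the bare columns take their
# values from the row that holds the maximum.
bare = conn.execute(
    "SELECT manifest, versionId, max(checkinTime) FROM version"
).fetchone()

# Portable equivalent for engines that reject bare columns.
portable = conn.execute(
    "SELECT manifest, versionId, checkinTime FROM version "
    "ORDER BY checkinTime DESC LIMIT 1"
).fetchone()

print(bare)      # ('m2', 2, 300)
print(portable)  # ('m2', 2, 300)
```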

gwbas1c 2 years ago |

I shipped a product that used both SQLite and XML files.

One of the improvements that I made was moving a few tables that contained small amounts of data to XML files. Because these files were small and rarely written, it simplified the data access layer and simplified diagnostics. (I made sure the files were multi-line tabbed XML.)

For "technical" people who needed to diagnose the product, asking them to crack open a SQLite database was a huge ask; but for the major part of the product that used SQLite, it was hands-down better than XML files. (An older version of the product used XML files. It had scalability problems because there's no good way to make an incremental update to an XML file.)

The advantages of XML, specifically being a human-readable format, really only hold for small files where the schema is designed for readable XML. Unfortunately, the need to always rewrite the entire XML file, and the "complexities" that come with lots and lots of features, will quickly erode XML's biggest advantages.

IMO: A "lay" person needing to muck around with the internals of an office document is fringe enough that learning to use a SQLite reader is an acceptable speed bump. The limitations of XML + Zip, when it comes to random writes in the middle of a file, just can't be overcome by Moore's law.

Tempest1981 2 years ago | |

I'm unclear on how SQLite (native format, no zip) is achieving sizes similar to XML + Zip. Are SQLite TEXT or BLOB fields compressed? Or are they assuming the caller is compressing BLOBs before writing?

gwbas1c 2 years ago | | |

SQLite does not compress, as far as I know.

Engineering is all about tradeoffs: SQLite is optimized for quick incremental updates where you don't need to rewrite the whole file. Zip & XML aren't. (I.e., if you decide to add a letter to a word at the beginning of a document, with Zip & XML you have to rewrite the whole document. SQLite can make a minor change without the whole rewrite.)

In our case, file size was not a factor in choosing between SQLite and XML.

But remember that file size is deceptive: disks are block devices, so a 30-byte file and a 1k file take up the same space if your block size is 2k. (I've shipped a filesystem driver.) HTTP servers gzip on download. It's more important to know your needs than to get hung up on a single metric like file size.

> I'm unclear on how SQLite (native format, no zip) is achieving sizes similar to XML + Zip. Are SQLite TEXT or BLOB fields compressed? Or are they assuming the caller is compressing BLOBs before writing?

Remember, XML writes each tag name once if there's no content and twice if there is. Each attribute has its name written every time. I doubt SQLite writes all the metadata in each row.
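
To make the block-size point from this comment concrete, here is the arithmetic as a tiny Python function. It is a simplification: the name `allocated_size` is made up, and real filesystems add metadata and sometimes store very small files inline, which this ignores.

```python
def allocated_size(file_size: int, block_size: int) -> int:
    """Bytes occupied on disk when space is handed out in whole blocks."""
    if file_size == 0:
        return 0
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


# With 2k blocks, a 30-byte file and a 1k file occupy the same space.
small = allocated_size(30, 2048)
medium = allocated_size(1024, 2048)
```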

ealexhudson 2 years ago |

ODT was designed to be standardised: while the predecessor format was very similar too, it relies very heavily on XHTML, SVG, and CSS, to name but three (there's a lot more).

Without being able to call out to existing standards, the ODT spec itself would suddenly become massive. The effort to update the standards appears to be significant and hasn't progressed much in recent years already :/

I think realistically, an SQLite format could be offered as an option, but the office doc ship has really sailed.

Good argument to formalise the spec of SQLite as a standard though...

dfox 2 years ago | |

The specification is massive (840 pages) even though it is written in a very terse way that does not really specify the effects and behavior, only the syntax.

On the other hand, if one ignores a few warts (explosion of local styles and text spans due to the ooo:rsid attribute, non-sparse spreadsheets, and a weird mechanism for styling tables, as a few examples), it is really well-designed markup for this kind of document data that strikes the right balance between being semantic markup and representing the kinds of stuff users want to do. Compare that with Office Open XML and its stateful formatting via empty tags (yes, really, in DOCX <b/> _TOGGLES_ whether following text is bold).

orf 2 years ago |

Coupling a file format to SQLite smells wrong.

SQLite is good, but it is also fairly unique in this space. Why? Because it’s hard to replicate everything it does, because it does a lot.

But… for this case, do we need it to do a lot? No, not really. We don’t need the full SQL standard, a query optimiser, etc etc for basic (+ safe) transaction semantics and the ability to store data in a basic table structure.

Perhaps there is a better file format we can use, but it would be better if it was decoupled from SQLite.

out_of_protocol 2 years ago |

Other example: raster map tiles (basically up to millions of tiny square pictures)

Zip vs tar vs filesystem vs sqlite. Tested all these scenarios, and sqlite was the fastest and the smallest, even beating plain archives with no overhead
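
For anyone curious what tiles-in-SQLite looks like in practice, here is a minimal sketch. The schema and function names are assumptions for illustration (similar in spirit to common tile-store layouts, but not the setup that was benchmarked above).

```python
import sqlite3
from typing import Optional


def open_tile_store(path: str) -> sqlite3.Connection:
    """Open (or create) a database holding one row per map tile."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tiles ("
        "zoom INTEGER NOT NULL, x INTEGER NOT NULL, y INTEGER NOT NULL, "
        "data BLOB NOT NULL, PRIMARY KEY (zoom, x, y))"
    )
    return conn


def put_tile(conn: sqlite3.Connection, zoom: int, x: int, y: int, data: bytes) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO tiles VALUES (?, ?, ?, ?)", (zoom, x, y, data)
    )


def get_tile(conn: sqlite3.Connection, zoom: int, x: int, y: int) -> Optional[bytes]:
    row = conn.execute(
        "SELECT data FROM tiles WHERE zoom = ? AND x = ? AND y = ?", (zoom, x, y)
    ).fetchone()
    return row[0] if row else None
```

Millions of tiles then live in one file, and the primary key index replaces the directory lookup.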

vetinari 2 years ago | |

Many filesystems have an issue with tens of thousands or more files in a single directory, which is exactly what you can get with map tiles. No wonder sqlite is faster.

m4rtink 2 years ago | | |

Yeah, that's why sqlite was adopted for this back then - many devices still used FAT32 on the storage volumes where tiles were often stored/cached, and that had horrendous small-file performance - a plain white 130-byte PNG tile could result in 64 kB being used.

liuliu 2 years ago | |

If SQLite is faster, the problem is the zip library you use.

SQLite has a major drawback (and yes, I love SQLite and built a lot of things around it over the years): the blob you get from the DB cannot be mmapped, so you have to copy it somewhere else. For zip files, as long as the file is not compressed (or is compressed using some exotic encoding such as PVRTC), you can mmap it just fine.

3cats-in-a-coat 2 years ago |

OpenDocument is zipped images and XML. Implying you parse the entire format and put it in RAM. And frankly I don't see how SQLite can improve this. Well XML isn't ideal, but it's zipped, so there's no huge penalty in size here.

All benefits SQLite's article lists (and I love SQLite to death by the way) can be implemented by having SQLite be the runtime model of the document. On disk and in memory. But SQLite doesn't need to be the transport format. In fact SQLite can easily get bigger than the current format, SQLite is full of unused space when you mutate it around, it can get fragmented and sparse. And if you need to optimize it every time, then the "fast save" etc. benefit goes away.

There are formats which do need delta updates and quick indexed look-ups without fully loading the file in RAM, and this is why so many apps do use SQLite as a file format. I just feel OpenDocument was a bad pick to use SQLite for in this hypothetical scenario.

skybrian 2 years ago |

Implementing versioning in the file format conflicts with git, because each document is essentially its own little source control system. This can be surprising to users who copy the file and don’t realize that they’ve effectively copied the entire repo. Copying a file will sometimes include drafts they didn’t want to share. It can mean you lose control over when things are committed, and so you don’t end up with a useful history.

If you then check the file into git, you are storing one source control system into another one, and older versions appear in two different histories. To be git friendly, you don’t want to save anything other than the current version, and then let git do its thing.

Possibly the answer is “don’t use git, we have it covered,” but then the app developer should realize that they are implementing something like a source control system. How do people share drafts, review them, and merge changes? How do you publish a release that only includes the version you wanted to release?

And it does seem relevant that the developer of Sqlite actually did implement their own source control system [1]. Maybe they could have warned people about what they’re getting themselves into if they go down this route?

I wonder how terrible it would be to either use a git repo as your file format, or to build in git compatibility into your app somehow so you could push and pull?

[1] https://en.m.wikipedia.org/wiki/Fossil_(software)

mixmastamyk 2 years ago | |

It's pretty rare to put office docs into version control, as they are typically binary instead of text. So, doesn't work well. Perhaps there is a version of open-doc that doesn't use the zip file but a folder of XML instead? Also the XML might need to be optimized to prefer line-oriented operations.

EricRiese 2 years ago | | |

Yes, in LibreOffice you can save as FODT: flat ODT, which is a single unzipped XML. That's what I use to store my resume in git.

skybrian 2 years ago | | |

Yes, it's rare to use git, but it's also pretty well-known that people can share more than they intended in a Word document. Perhaps true of Open Office as well? See:

https://superuser.com/questions/1562130/can-people-see-the-c...

https://foiassist.ca/2019/04/04/i-thought-we-deleted-that-me...

robertlagrant 2 years ago |

> since OpenDocument predates SQLite

This shocked me. Impressive how far SQLite's come in such a short space of time.

capableweb 2 years ago | |

Hmm, me too, and Wikipedia says:

> OpenDocument - Initial release: 1 May 2005; 18 years ago

> SQLite - Initial release: 17 August 2000; 23 years ago

Wonder what gives.

paradox460 2 years ago | | |

OpenDocument traces its ancestry to the OpenOffice XML format, which traces its ancestry to StarOffice, which was xmlized around the time Sun bought it in 1999.

robertlagrant 2 years ago | | |

Hah - that tallies with my instinct on ODF at least. I'm confused too, then.

isoprophlex 2 years ago |

Man do I love SQLite.

Over the past 1.5 yrs I've built a computer vision tool, from recording hardware/software, to derp learning pipelines, to front-end; we had some requirements on the recording side that were difficult to solve with existing solutions (storing exactly timestamped camera frames, gps data, car telemetry and other metadata).

Using a SQLite-backed data format for the video recordings made implementing things by ourselves super straightforward.

regularfry 2 years ago | |

> derp learning pipelines

This accurately describes the majority of my efforts, too.

isoprophlex 2 years ago | | |

Honest to god this was an unintentional typo, but I decided to leave it in as it was just too juicy

sgu999 2 years ago | |

I'm working on a similar problem and I've been struggling to convince all my colleagues that we should sqlite most things. By any chance do you have some public code, or blog posts to share?

isoprophlex 2 years ago | | |

Not in public repos, but sure. Drop me a line, hn at rombouts dot email.

im3w1l 2 years ago |

I don't want people to read my drafts. That could be highly embarrassing, and they should not make it into the final saved document.

Past versions and undo history should be stored separately from the document. They should be stored out of tree where they won't be committed into some git repository or be automatically synced or anything like that.

regularfry 2 years ago | |

I want to be able to read my drafts, until I decide to bake a publication version.

im3w1l 2 years ago | | |

Did you read the other part of my comment? Where I said to store the draft, but not in the document itself?

eviks 2 years ago | |

Then don't give people access to your drafts but exported versions without history? Why put the limits on the efficiency of a format by forcing it to store changes elsewhere?

pornel 2 years ago | | |

It's better if such gotchas don't exist. Otherwise you'll have every user get burned by it at least once, and blaming them for not knowing the subtle consequences of using "Save As" instead of "Export As" is not going to help anyone.

Lockal 2 years ago |

Sadly they did not include bad sides:

1) Vulnerabilities: not only in SQLite, but also in wrappers like https://nvd.nist.gov/vuln/detail/CVE-2023-32697

2) Lack of transparency: a zip of XML files contains only XML files; meanwhile SQLite by design contains all kinds of traces with sensitive information or empty blocks. Attempts to fix these issues remove the benefits that were mentioned.

3) Lack of implementer support. It was one of the reasons for WebSQL deprecation many years ago.

4) Lack of standardization for file format. SQLite does not even promise forward compatibility, only backward one. Which means that new documents might not open in old software, or vendor should fork SQLite and only backport security patches.

kunley 2 years ago |

Love the vibe of articles that present, let's say, reason-driven development vs. habit-driven.

Why habit? Well, I can imagine back at the time OpenOffice was a fresh project, it went like this: "XML is going to stay forever and everybody uses XML, so ofc we use one... oh, it is so big! And there are many files, so we just zip'em"...

To be fair, the author of this excellent article doesn't even mention getting rid of XML in this format, but that could also be achieved by storing stuff in a SQLite file. Usage of XML was habitual thinking there, and not very visionary, as the format is dead now...

tpm 2 years ago | |

> Well, I can imagine back at the time OpenOffice was a fresh project

OpenOffice was born when Sun bought StarOffice, which was initially released in 1985 (on Z80 and certainly without any XML). So the project itself was far from fresh. OpenDocument was developed from OpenOffice.org XML format which was developed after Sun bought StarOffice in 1999. At the time XML was not used everywhere, but it was very much in vogue, certainly at Sun where the official line was that Java (created at Sun) and XML are going to conquer the world.

galangalalgol 2 years ago | |

Could you clarify the "XML is dead" comment? Don't all the major document formats still use zipped xml? I had to interface with an xml format recently, and that isn't something I ever did, and when I went looking for a crate that parses an xml schema I kept running across this whole xml is dead thing. But it still seems to be everywhere.

kortex 2 years ago | | |

Not GP, but I believe the "XML is dead" sentiment stems from the observation that very few greenfield applications are deliberately choosing xml. Sure you have legacy giants like (X)HTML, SVG, office formats, etc, but you'd be hard-pressed to convince developers (especially a younger crowd) to select it as a data format. It's seen as warty, cumbersome, unwieldy, verbose.

tannhaeuser 2 years ago |

Yeah what if? Then they haven't really understood the purpose of markup languages as plain text files for viewing/editing using generic text editors. There was no lack of proprietary formats such as MS Structured Format (used by MSO) and it was considered a big success when customers demanded open formats such as SGML/XML-based ones in the late 90s/00s. The alternatives aren't even sequential (they have fragments and cross pointers, etc). Yes, they might be faster because they're closer to the in-memory representations used by the original/historic app, or are even primitive memory dumps; marginal speed or size improvements were never a consideration though. And if anything, SQL (almost as old as SGML btw) is a joke as a document query language compared to basically any alternative specifically designed for the job (ISO topic maps query language i.e. Datalog, XPath and co, SPARQL, DSSSL/Scheme, ...) because of SQL's COBOLness, non-schemalessness, lock semantics/granularity being a really bad fit, etc.

dang 2 years ago |

What If OpenDocument Used SQLite? (2014) - https://news.ycombinator.com/item?id=25462814 - Dec 2020 (194 comments)

What If OpenDocument Used SQLite? - https://news.ycombinator.com/item?id=15607316 - Nov 2017 (190 comments)

kgeist 2 years ago |

Sqlite-based file formats are also very easy to debug, which saves a lot of dev time. After my app writes to a file and loading back doesn't work, I can just open it in Sqlite and inspect it in any way I wish because I have the full power of SQL at my fingertips.
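For anyone who hasn't tried this, a minimal sketch of that kind of inspection using Python's built-in sqlite3 module. The `paragraphs` table and the `dump_schema` helper are made up for illustration; a real session would connect to the file the app wrote instead of `:memory:`.

```python
import sqlite3


def dump_schema(conn):
    """Return {table_name: [column names]} for every user table."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return {
        t: [col[1] for col in conn.execute(f'PRAGMA table_info("{t}")')]
        for t in tables
    }


# Stand-in for a document file the app just wrote (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paragraphs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO paragraphs (body) VALUES ('hello'), ('world')")

schema = dump_schema(conn)
row_count = conn.execute("SELECT count(*) FROM paragraphs").fetchone()[0]
print(schema, row_count)
```

The same queries work unchanged in the sqlite3 command-line shell, which is usually the quicker route when debugging by hand.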

nuc1e0n 2 years ago |

It's somewhat off topic, I know, but is there something like SQLite but tailored for hierarchical data? Like an XML document store, rather than for relational data like SQLite is.

layer8 2 years ago | |

There’s ASN.1 for hierarchical data with a schema. It doesn’t provide a query language though.

dfox 2 years ago | | |

ASN.1 in itself is a schema syntax. That schema can be serialized into various related forms, but all of them are more or less transport formats that cannot reasonably be used for random access.

There are some more or less general hierarchical formats with support for random access, but most of them are tightly tied to a particular technology stack (e.g. MS's COM Compound Document) or to a particular usage area (there is HDF5 for scientific data, and many multimedia containers are in fact hierarchical databases, with both the various IFF variants and EBML being explicitly designed as reusable formats for arbitrary data). And then there are formats that implicitly contain some kind of hierarchical container mechanism (PDF, TIFF, DICOM, FPS game map files…).

MrResearcher 2 years ago |

BLOBs in sqlite can be up to 2GB or less, depending on the compilation flags. If you store 2GB and the other application uses sqlite compiled with support for less than 2GB BLOB size, good luck on getting them to work... If you want to store content larger than 2GB in sqlite, you have to chunk them, manage the chunk sequences, etc. And you can't overwrite a fixed size 2KB portion at the specified offset, you'll have to rewrite the entire 2GB chunk.
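A rough sketch of what the chunking could look like with Python's sqlite3 module, so that a small overwrite only rewrites the chunks it touches. The `chunk` table, the helper names and the tiny `CHUNK` size are all assumptions for demonstration, not anything SQLite prescribes.

```python
import sqlite3

CHUNK = 4  # deliberately tiny for the demo; a real app would use a much larger size

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunk ("
    " blob_id INTEGER, seq INTEGER, data BLOB NOT NULL,"
    " PRIMARY KEY (blob_id, seq))"
)


def store(conn, blob_id, payload):
    """Split payload into CHUNK-sized rows, replacing any previous content."""
    conn.execute("DELETE FROM chunk WHERE blob_id = ?", (blob_id,))
    conn.executemany(
        "INSERT INTO chunk VALUES (?, ?, ?)",
        ((blob_id, i // CHUNK, payload[i:i + CHUNK])
         for i in range(0, len(payload), CHUNK)),
    )


def load(conn, blob_id):
    """Reassemble the payload by concatenating chunks in sequence order."""
    rows = conn.execute(
        "SELECT data FROM chunk WHERE blob_id = ? ORDER BY seq", (blob_id,)
    )
    return b"".join(r[0] for r in rows)


def overwrite(conn, blob_id, offset, patch):
    """Overwrite bytes in place; assumes the range lies inside existing data."""
    end = offset + len(patch)
    for seq in range(offset // CHUNK, (end - 1) // CHUNK + 1):
        (data,) = conn.execute(
            "SELECT data FROM chunk WHERE blob_id = ? AND seq = ?",
            (blob_id, seq),
        ).fetchone()
        start = seq * CHUNK
        lo = max(offset, start)            # first patched byte in this chunk
        hi = min(end, start + len(data))   # one past the last patched byte
        new = data[:lo - start] + patch[lo - offset:hi - offset] + data[hi - start:]
        conn.execute(
            "UPDATE chunk SET data = ? WHERE blob_id = ? AND seq = ?",
            (new, blob_id, seq),
        )


store(conn, 1, b"abcdefghij")
overwrite(conn, 1, 3, b"XYZ")
result = load(conn, 1)
```

It works, but it is exactly the kind of bookkeeping you end up owning yourself.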

michalc 2 years ago |

Shameless plug of a couple of Python libraries I’ve been involved with that work around memory issues of ODS files (for very specific use cases):

https://github.com/uktrade/stream-read-ods https://github.com/uktrade/stream-write-ods

CodeCompost 2 years ago |

There really should be a "NoSQLite" or something equivalent to store hierarchical data instead of normalized data.

remram 2 years ago | |

You can probably use SQLite for that, with a single key-value table.
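Something along these lines, as a sketch with Python's sqlite3 module. The `kv` table, the slash-separated key convention and the helper names are my own assumptions; the subtree lookup relies on the default binary collation, where "0" is the character right after "/".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")


def put(conn, key, value):
    """Insert or overwrite a single path -> value pair."""
    conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))


def subtree(conn, prefix):
    """Return everything below `prefix` with a range scan on the primary key."""
    return conn.execute(
        "SELECT key, value FROM kv WHERE key >= ? AND key < ? ORDER BY key",
        (prefix + "/", prefix + "0"),
    ).fetchall()


put(conn, "doc/title", "Hello")
put(conn, "doc/body/0", "First paragraph")
put(conn, "doc/body/1", "Second paragraph")
put(conn, "meta/author", "someone")

body = subtree(conn, "doc/body")
```

The hierarchy lives entirely in the key naming here, so SQLite itself knows nothing about it.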

ttyprintk 2 years ago | | |

The json* family of tree and table functions are nowadays built in.

OliverJones 2 years ago | |

It's trivial to implement hierarchical data with recursive common table expressions. https://www.sqlite.org/lang_with.html
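A small illustration using Python's sqlite3 module: an adjacency-list table walked with WITH RECURSIVE. The `node` table and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        name      TEXT NOT NULL
    );
    INSERT INTO node VALUES
        (1, NULL, 'document'),
        (2, 1,    'section'),
        (3, 2,    'paragraph'),
        (4, 1,    'appendix');
    """
)

# Start at the root (no parent) and repeatedly join children onto the rows found so far.
rows = conn.execute(
    """
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM node WHERE parent_id IS NULL
        UNION ALL
        SELECT node.id, node.name, tree.depth + 1
        FROM node JOIN tree ON node.parent_id = tree.id
    )
    SELECT name, depth FROM tree ORDER BY depth, id
    """
).fetchall()

for name, depth in rows:
    print("  " * depth + name)
```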

iefbr14 2 years ago |

Why only documents? How about a SQLitefs?

pgeorgi 2 years ago | |

WinFS (https://en.wikipedia.org/wiki/WinFS) without the mssql Engine?

iefbr14 2 years ago | | |

Or this: https://github.com/narumatt/sqlitefs

throwaway894345 2 years ago |

Is SQLite’s disk format an open, versioned standard? Or is it just “however SQLite saves data to disk”?

SQLite 2 years ago | |

SQLite file format spec: https://www.sqlite.org/fileformat2.html

Complete version history: https://sqlite.org/docsrc/finfo/pages/fileformat2.in

Note that there have been no breaking changes since the file format was designed in 2004. The changes shown in the version history above have all been one of (1) typo fixes, (2) clarifications, or (3) filling in the "reserved for future extensions" bits with descriptions of those extensions as they occurred.
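As a small illustration of how approachable the documented header is, here is a sketch in Python that reads three fields from the first 100 bytes without going through the SQLite library: the magic string, the page size at offset 16, and the application ID at offset 68. The `read_header` helper, the temporary file and the value 1234 are assumptions made for the demo.

```python
import os
import sqlite3
import struct
import tempfile


def read_header(path):
    """Parse a few fields from the 100-byte SQLite database header."""
    with open(path, "rb") as f:
        header = f.read(100)
    magic = header[:16]
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:  # the value 1 stands for a 65536-byte page
        page_size = 65536
    (application_id,) = struct.unpack(">I", header[68:72])
    return magic, page_size, application_id


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.db")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA application_id = 1234")  # arbitrary demo value
    conn.execute("CREATE TABLE t (x)")
    conn.commit()
    conn.close()
    magic, page_size, application_id = read_header(path)

print(magic, page_size, application_id)
```

An application file format built on SQLite can set its own application ID so that tools can tell its files apart from generic databases.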

throwaway894345 2 years ago | | |

Thanks for elaborating so thoroughly. I didn’t even realize you were on this platform!

swiftcoder 2 years ago |

> The use of a ZIP archive to encapsulate XML files plus resources is an elegant approach to an application file format. It is clearly superior to a custom binary file format.

I feel like I have considerable disagreement with the author of these sentences.

simonw 2 years ago | |

Why do you disagree?

vxNsr 2 years ago |

I’m curious to know what a gsheet/doc/slide file actually is under the hood. I as the user am only ever presented with a link, there’s no way to download a gsheet in its native format.

cm2187 2 years ago |

Sqlite format is smaller than the original format only because xml is super verbose, so any uncompressed binary format ends up being less than lightly zipped xml.

But sqlite files aren't small. One thing I don't understand is why they don't do string deduplication in sqlite (as in you only store a string once and every other occurrence is just a pointer to that string). It seems such an obvious and easy way to reduce file size and memory consumption, and therefore increase performance (less I/O). Is there a technical reason why this would not be desirable?

Etheryte 2 years ago | |

My first guess is that if you always store the full string you don't need to scan the database to see if you already have the same string. Essentially you choose to use more space but reduce load. Regardless of whether you do the string deduping on inserts or async later on, you have to do it at some point and the unpredictable performance overhead might be undesirable.

cm2187 2 years ago | | |

Well it should be a dictionary lookup, it should be pretty fast and predictable. And for freeing it up, it should be a good candidate for reference counting.

The_Colonel 2 years ago | |

If you have the same (long-ish) string repeating many times in a database, it points to a DB schema needing normalization.

cm2187 2 years ago | | |

I guess it depends on the use case. If you load a csv file into a sqlite database, normalisation isn't the first thing you do.

kortex 2 years ago | |

There is nonzero overhead for doing so: optimizing for duplicate strings invariably adds cost to handling unique strings.

This sounds like something you could do at the schema and application level.
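A sketch of the schema-level version using Python's sqlite3 module: each distinct string is stored once in a lookup table and referenced by integer ID. The table names and the `intern_string` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE strings (
        id    INTEGER PRIMARY KEY,
        value TEXT NOT NULL UNIQUE      -- the unique index doubles as the dictionary
    );
    CREATE TABLE cells (
        row_no    INTEGER NOT NULL,
        col_no    INTEGER NOT NULL,
        string_id INTEGER NOT NULL REFERENCES strings(id)
    );
    """
)


def intern_string(conn, value):
    """Return the ID for `value`, inserting it only if it is not stored yet."""
    conn.execute("INSERT OR IGNORE INTO strings (value) VALUES (?)", (value,))
    return conn.execute(
        "SELECT id FROM strings WHERE value = ?", (value,)
    ).fetchone()[0]


for row_no, status in enumerate(["pending", "pending", "done"]):
    conn.execute(
        "INSERT INTO cells VALUES (?, ?, ?)", (row_no, 0, intern_string(conn, status))
    )

distinct = conn.execute("SELECT count(*) FROM strings").fetchone()[0]
values = [
    r[0]
    for r in conn.execute(
        "SELECT s.value FROM cells c JOIN strings s ON s.id = c.string_id "
        "ORDER BY c.row_no"
    )
]
```

Every insert now pays for an index lookup, which is the overhead on unique strings mentioned above.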

ongytenes 2 years ago |

Would be interesting to see a fork implementing SQLite. Time would tell how well it would compete with the standard.

chadcmulligan 2 years ago |

AutoCAD uses a database as its file format, it is fairly slow.

roywashere 2 years ago |

(2014)

littlecranky67 2 years ago |

deleted.

littlestymaar 2 years ago | |

> Nobody really believes that OpenDocument should be changed to use SQLite as its container instead of ZIP. […] Rather, the point of this article is to use OpenDocument as a concrete example of how SQLite can be used to build better application file formats for future projects.

indymike 2 years ago |

> there's no ISO standard defining how to interpret an SQLite file in excruciating detail.

There comes a point where ISOing things doesn't help. The SQLite format belongs to SQLite, and an ISO standard would result in that standard being rendered irrelevant by the SQLite team, should they wish to make a change for any reason. Also, people would have to pay ISO for access to the specifications. SQLite should be treated as a de facto standard defined by the SQLite project.

didntcheck 2 years ago | |

Just a heads up that it looks like you meant to reply to miki123211, but you've posted a top-level comment instead :)

vmfunction 2 years ago |

At this point, why are we still using JSON/XML for new projects when there is SQLite? Stop the nonsense of JSON/XML. SQLite is like JSON, but very queryable. Just send SQLite files around.

MongoDB also saves document db type of store space just FYI.

constantly 2 years ago | |

Any text editor in the world, even the ones that ship with the most barebones shells, can open json and xml and present their data to the user.

SQLite files require opening in a DB terminal or using special software to even get to the point where one can see what’s there at all. Further the entire internet basically natively supports XML and JSON.

vmfunction 2 years ago | | |

That is a good argument; however, many people, like some big game development companies, start to ship GBs of JSON files, and at that point just use SQLite. It will be faster to query and load. Also, look at DBs such as Mongo (not promoting them in any way): when Maildir is compared against Mongo for file storage, Mongo saves a lot of disk space. Again, it is about how we want to store files. NixOS is quite a way to think about having a file system or db/store.

eviks 2 years ago | | |

Outside of simple cases xml is too verbose and ugly (and in these cases usually zipped), so it's not suitable for a poor human with a plain text editor, so that doesn't give you much of a leg.

(Json has a higher threshold of complexity before it succumbs)

quickthrower2 2 years ago | |

With JSON/XML the app owner decides the schema of the saved file, as they should. One day Sqlite will do some perfectly fine change that’ll break people who outsource their file format to it. Own your file format!

That said, there is some nuance and it depends on what the user expects. Is your app more of an MSWord, where people expect a format that is decades backward compatible and only changes on explicit save, or is it more like a live app with a db back end? If the latter, there should be no save concept around the DB file, but perhaps a backup and restore function that exports to a controlled format.

eastern 2 years ago | | |

In sqlite the on-disk file format does not matter.

All that matters is that you should be able to issue sql to the sqlite embedded library and get back the results.

Freeing you from the overhead of owning (thus inventing and then maintaining) your own file format is almost the entire point of using sqlite in this manner.

    sqlite> create table x(c1, c2);
    sqlite> insert into x values ("a", 1);
    sqlite> insert into x values ("b", 2);
    sqlite> insert into x values ("c", 3);
    sqlite> select c1, max(c2) from x;
    c|3
    sqlite> select c1, max(c2), min(c2) from x;
    a|3|1
    sqlite> select c1, min(c2), max(c2) from x;
    c|1|3