Linus Torvalds: “I'm happily hacking on a new save format using ‘libgit2’” (plus.google.com)

303 points by hebz0rl 12 years ago | 259 comments

mrcharles 12 years ago |

The game I'm currently working on is built very heavily around Lua. So for the save system, we simply fill a large Lua table and then write that to disk as Lua code. The 'save' file then simply becomes a Lua file that can be read directly into Lua.

This is absolutely amazing for debugging purposes. Also you never have to worry about corrupt save files or anything of its ilk. Development is easier, diagnosing problems is easier, and using a programmatic data structure on the backend means that you can pretty much keep things clean and forward compatible with ease.

(Oh, also being able to debug by altering the save file in any way you want is a godsend).
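
For readers outside Lua, here is a rough Python analogue of the idea: a minimal sketch, assuming the save state consists only of plain dicts, lists, tuples, strings and numbers. The function names and the sample state are made up for illustration and do not come from the game described above.

```python
import ast
import pprint


def dump_save(state: dict) -> str:
    """Render the state as a Python literal that is readable and hand-editable."""
    return pprint.pformat(state) + "\n"


def load_save(text: str) -> dict:
    """Parse the literal back into a dict.

    ast.literal_eval only accepts literals (dicts, lists, strings,
    numbers, ...), so a save file cannot smuggle in function calls.
    """
    state = ast.literal_eval(text)
    if not isinstance(state, dict):
        raise ValueError("save file must contain a single dict")
    return state


# Hypothetical game state, invented for this example.
state = {
    "player": {"name": "hero", "hp": 42, "pos": (3, 7)},
    "inventory": ["sword", "potion"],
}
text = dump_save(state)
restored = load_save(text)
```

Because the file is just a literal, it can be opened and edited in any text editor, which is the debugging benefit described above.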

JackC 12 years ago | |

You probably know this, but remember that storing user data as code is a place where you (general "you") have to think very carefully about security.

Is there any way that arbitrary code in the file could compromise the user's system? If so, does the user know to treat these data files as executables? Is there any way someone untrusted could ever edit the file without the user's knowledge? Even in combination with other programs the user might be running? Are you sure about all of that?

Maybe Lua in particular is sandboxed so that's not a problem (beats me), but in general this is an area where safe high-level languages can all of a sudden turn dangerous. Personally I would rarely find it worth it.

TelmoMenezes 12 years ago | | |

This is a good point, but I feel that discouraging this type of approach is not the way to go.

I apologise in advance for ranting... I hope this is not too off-topic, but instead a "zoom out" on the issue.

This touches on something deep and wrong about how we use computers these days. Computers are really good at being computers, and the amplification of intellectual capabilities they afford is tremendous, but this is reserved for a limited few that were persistent enough and learned enough to rediscover the raw computer buried underneath, and what it can do.

For example, I dream of a world where everything communicates through s-expressions, all code is data and all data is code. Everything understandable all the way down. Imagine what people from all fields could create with this level of plug-ability and inter-operability. We had a whiff of that with the web so far, but it could be so much more powerful, so much simpler, so much more elegant. All the computer science is there, it's just a social problem.

I understand the security issues, but surely limiting the potential of computers is not the solution. There has to be a better way.

ufo 12 years ago | | |

Lua sandboxing is relatively straightforward. You can choose what functions from the standard library the script you are evaluating will see in its global scope. By passing an empty scope the only thing the evaluated script can do is build tables, concatenate strings, do arithmetic, etc. You only need to worry about DoS due to infinite loops (but there are also workarounds for that).

In Lua 5.1 you can use setfenv http://www.lua.org/manual/5.1/manual.html#pdf-setfenv

And in Lua 5.2 the functions that eval strings receive the global scope as an optional parameter. http://www.lua.org/manual/5.2/manual.html#pdf-loadfile

chalst 12 years ago | | |

I trust Lua sandboxing. See, e.g.:

1. http://stackoverflow.com/questions/1224708/how-can-i-create-...

2. http://stackoverflow.com/questions/4134114/capabilities-for-...

I find it easier to trust Lua than similar facilities in other programming languages because the kernel of the language has a relatively simple semantics, so the TCB of a sandbox is lower, and the source is easier to understand than most other languages.

Note that sandboxing in Lua 5.2 has a still simpler semantics than for Lua 5.1 - few other languages evolve in a way that makes the language easier to trust.

unsigner 12 years ago | | |

Lua can be sandboxed so your data file can't call arbitrary functions (but can still call a controlled subset, e.g. a function called RGB that does r*255*255+g*255+b so your colors are somewhat human-readable in the file, yet 24-bit integers in memory).

But it's still code, so you can e.g. inject an infinite loop and the loader will hang. (You can protect against this: install a debug hook that gets called after every N instructions executed, and have it kill the loader.)
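
The "count steps, then kill the loader" idea can be sketched in Python with `sys.settrace`, which plays roughly the role of Lua's debug hook. This is only an illustration under stated assumptions: the hook counts executed source lines rather than VM instructions, the names `run_with_budget` and `BudgetExceeded` are invented here, and emptying the builtins is not a real sandbox in Python the way a restricted environment is in Lua.

```python
import sys


class BudgetExceeded(Exception):
    """Raised when the loaded script executes too many traced lines."""


def run_with_budget(source: str, max_lines: int = 10_000) -> dict:
    """Execute `source` and abort once more than `max_lines` lines have run.

    Note: exec() with stripped builtins is NOT a secure sandbox in Python.
    This only demonstrates the step-counting hook.
    """
    executed = 0

    def tracer(frame, event, arg):
        nonlocal executed
        if event == "line":
            executed += 1
            if executed > max_lines:
                raise BudgetExceeded(f"more than {max_lines} lines executed")
        return tracer  # keep tracing this frame and any frames it calls

    env = {"__builtins__": {}}  # the script sees no built-in functions
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(source, env)
    finally:
        sys.settrace(previous)  # always restore the previous hook
    env.pop("__builtins__", None)
    return env
```

A well-behaved save such as `save = {'hp': 10}` loads normally, while a file containing an endless loop raises `BudgetExceeded` instead of hanging the loader.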

Guvante 12 years ago | | |

Typically sandboxing is stage one of any lua implementation. You don't need raw IO access and rarely need to print to the screen for instance.

mrcharles 12 years ago | | |

Aware of all the issues and already have a plan. But Lua generally only has access to the APIs you give it; from game code our Lua VM has no access to the OS at all, just game functions, and those game functions are never system related.

The biggest 'concern' would be save hacking, but at the end of the day that will happen no matter what so it doesn't bother me much.

mattgreenrocks 12 years ago | |

Preach it. My favorite persistence code: stuff that has nothing to do with SQL/NoSQL.

I leaned heavily on Python's pickle module for serializing a few thousand entities to disk a few years ago. By streaming them to the application at startup time, it remained plenty fast for all datasets it'd encounter. I intended to replace it with SQLite one day, but I never had to. I could just keep them all in memory.

I'd probably choose something a bit safer now, but it was hard to beat the simplicity.

3pt14159 12 years ago | | |

I used to do that, but pickle bit me once. I think the format changes between versions or something. I had to start the statistical model from scratch.

dbaupp 12 years ago | |

Why does using a Lua-based format stop the files being corrupted?

politician 12 years ago | | |

Maybe he means that he doesn't have to deal with bugs in a custom binary serializer.

mrcharles 12 years ago | | |

It can only become corrupted by external factors; a lot of games I've worked on, in-game bugs could lead to corrupted saves being written out to disk. Since in this case we are just serializing lua data, unless the serializer itself has a bug, it will always write out correctly, and any issues become issues of game logic rather than anything else.

Igglyboo 12 years ago | | |

I don't think it does. I think he meant that if a save became corrupted it wouldn't do so silently, it would violently crash the game because of a syntax error.

Zecc 12 years ago | | |

It doesn't. But it makes them much easier to fix.

Edit: Igglyboo has a point too.

TheEzEzz 12 years ago | |

I did this with C# in my last game. All the map/object editors output C# code on save, which was then included in the compiled code on the next build. The beauty is that your "data" files get automatically updated when you refactor your regular code! On top of that loading is faster, because you don't need to worry about fetching a file and parsing it, the whole thing is just compiled code embedded in your executable.

Tanner 12 years ago | |

That Lua was originally designed as a configuration language becomes really clear when you start doing things like this. Having my code and configuration being separate but equal was really a paradigm shift for me.

Also, the Tiled Map Editor exports directly to Lua.

agumonkey 12 years ago | |

IIRC that's how the Office and Photoshop file formats started. I think it's a nightmare for compatibility in the end.

frik 12 years ago | |

So, it's similar to JSON (JavaScript), but valid Lua syntax.

  local t = {}
  t = {["foo"] = "bar", [123] = 456}
  t.foo2 = "bar2"

ilovecookies 12 years ago | |

While one could say that this is about savefiles for games, I would say its implications could be more about savefiles for software projects. If you are building the game in Lua, of course Lua is going to be the preferred way to save your game, since you are already using Lua objects and interpreting files in that language will be easy to integrate.

If you ever used Maven XML configs, Java object marshalling or C# XML you would understand the pains of using XML as a file format for software projects and data representation. You have to find a solution that is language agnostic; neither Lua nor JSON is.

balls187 12 years ago | |

I did something similar, but used JSON instead (it's pretty trivial to (de)serialize Lua tables to JSON). This made it easy to send data to the server, and inspect with standard tools as well.

seivan 12 years ago | |

This sounds a lot like NSCoding for Objective-C (Cocoa). Though you'd still have to define the types/classes and name for each property you want to save. But you could technically save it in a big blob, and then read it into memory as you resume.

Could persist to disk as a binary, sql or a plist (xml).

I guess the only downside is, that if you got a lot of composite classes all with their own properties and associations (say a graph), there's a lot of manual work to be done.

hootener 12 years ago | |

I've had to write output save file formats for various projects on several occasions, and it never occurred to me to take this approach.

Thanks for sharing this, it's one of those ideas that (to me) seems so brilliant in its simplicity that I probably would've never thought of it.

Any hiccups in the day-to-day work using this approach? I'm just trying to get a better idea of the workflow since I'm very seriously considering applying it to my next project.

mrcharles 12 years ago | | |

The biggest hiccup is almost a literal one; serializing large lua structures and then writing them to disk can take a lot of time. But this can largely be mitigated by just saving compiled lua instead of text lua.

Touche 12 years ago | |

That's how people are going to cheat at your game.

roryokane 12 years ago | | |

I had a lot more fun recently playing the free game Boson X for PC (http://www.boson-x.com/) than I would have otherwise because I discovered that the game folder contains editable Lua scripts. The scripts control the game physics, scoring system, controls, level data, and more.

I’ve created mods of the game where you run faster but gravity is stronger, and where all levels are randomly mixed into one level, and where the dangerous falling platforms also give you energy while you’re on them, and where the sound effects give the player clearer feedback on what they’re doing. And though I could cheat by multiplying my score by 1000 and submitting it online, I actually have been careful to always comment out the high-score saving and submission code in each of my mods.

I like the game much more than if the developers had obfuscated the Lua files so I couldn’t read and edit them.

outworlder 12 years ago | | |

The save format does not matter at all. It wouldn't matter even if it were an obscure, made-up format. All it would do is slow down 'cheaters' by half an hour.

The only argument against human-editable text files is parsing speed, not security.

phn 12 years ago | | |

Cheating is good, I remember having tons of fun with age of empires and sim city because I used cheat codes.

If the player has fun, it's a nice feature! :D

jethro_tell 12 years ago | | |

Plot twist: it's actually a 'teach yourself Lua' game.

jiggy2011 12 years ago | | |

Does it matter unless the game is multiplayer? In that case you should assume that client files are untrustworthy anyway.

mrcharles 12 years ago | | |

There are ways around it but if people want to cheat their own SP experience who am I to stop them? We'll obfuscate a bit to dissuade casual users but I don't know that I've ever encountered a game that didn't have some level of save hacking available.

Hell, I've used it myself more than a few times.

6d0debc071 12 years ago | | |

Hash the information and include the hash in the file. If the hash and the contents don't match when you try to load it, you can refuse it.

If not loading things is important to you, mind.
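
A minimal sketch of that scheme in Python, assuming the save state is JSON-serializable and that the digest is stored on the first line of the file (both are choices made for this example, not something the comment specifies; `seal` and `unseal` are hypothetical names):

```python
import hashlib
import json


def seal(state: dict) -> str:
    """Serialize the state and prepend a SHA-256 digest of the payload."""
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return digest + "\n" + payload


def unseal(text: str) -> dict:
    """Recompute the digest and refuse the file if it does not match."""
    digest, _, payload = text.partition("\n")
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if digest != actual:
        raise ValueError("save file hash mismatch")
    return json.loads(payload)


sealed = seal({"hp": 10, "gold": 250})
state = unseal(sealed)
```

A bare hash like this catches accidental corruption and casual edits, but anyone who reads the loader can recompute it after editing. Keying the digest with `hmac` and a secret raises the bar somewhat, although a secret shipped inside the game can still be extracted.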

saucetenuto 12 years ago | | |

It's only cheating if the developer disapproves.

bhaak 12 years ago |

What's with all the XML hate? Of course, doing everything in XML is a stupid idea (e.g. XSLT and Ant) and thank heaven that hype is over.

But if I want something that is able to express data structures customized by myself, usually with hierarchical data that can be verified for validity and syntax (XML Schemas or old-school DTD), what other options are there?

Doing hierarchical data in SQL is a bitch and if you want to transfer it, well good luck with a SQL dump. JSON and other lightweight markup languages fail the verification requirement.

bananas 12 years ago |

I think this title is wrong.

Firstly some clarification - this appears to just be about the persistence format for his dive log. It was XML, now it's git based with plain text.

As someone who had to manage a system which worked with plain text files structured in a filesystem for a number of years in the 1990s, this is done to death already.

You now end up with the following problems: locking, synchronising filesystem state with the program, inode usage, file handles to manage galore and concurrency. All sorts.

Basically this is a "look I've discovered maildir and stuffed it in a git repo".

Not saying there is a better solution but this isn't a magic bullet. It's just a different set of pain.

WalterBright 12 years ago |

Back in the bad old DOS days, instead of creating a file format for saving/loading the configuration of the text editor, I simply wrote out the image in memory of the executable to the executable file. (The configuration was written to static global variables.)

Running the new executable then loaded the new configuration. This worked like a champ, up until the Age of Antivirus Software, which always had much grief over writing to executable files.

It's a trick I learned from the original Fortran version of ADVENT.

jmnicolas 12 years ago |

From the comments (Tristan Colgate) :

"XML is what you do to a sysadmin if waterboarding him would get you fired."

Made my day :-)

Ygg2 12 years ago | |

That's just mean. Waterboarding isn't that bad...

jmnicolas 12 years ago | | |

But it gets you fired ... on the other end, nobody has ever been fired for using XML.

nzp 12 years ago | |

With my occasional sysadmin hat on, until a few weeks ago I had the luck to never have had to deal with XML configuration files. Then came Solr and now I know what horror is. (To be clear, Solr itself is great, but those god damn config files...)

lifeisstillgood 12 years ago |

What I like is the "I don't start prototyping till I have a good mental picture"

I am currently stuck on a project I want to start because I cannot get it to fit right in my (future) head. And I am glad I am not an idiot for not being able to knock out my next great project in between lattes.

(Ok, in direct comparison terms I am an idiot, but at least it's not compounded)

specialist 12 years ago | |

  "A change in perspective is worth 80 IQ points."
  
  -- Alan Kay

My biggest hurdle solving new problems is divining a unifying, simplifying metaphor. Once you have the right notion, that Eureka! moment, everything falls into place, like magic.

Like how Kepler was able to fully explain Brahe's astronomical data once he realized the planets orbit the sun in ellipses.

Personal example: I used to write print production software. Placing pages onto much larger sheets of paper that get folded and bound into a book. A task called image positioning aka imposition. It took me years to figure out how to model the problem. Key insight was simulating the work backwards, from binding back to the press. Then when I showed the new solution to my coworkers, the response was "Well, duh."

tim333 12 years ago | |

Yeah, I noted that too, also that it took him months to get his good mental picture. It makes me feel not so bad about spending months trying to get clear on some of my stuff.

tzury 12 years ago |

I just realized that Linus' posts are the only reason I ever go to Google Plus.

cbsmith 12 years ago | |

The question nobody is asking, but actually should be, is: I wonder what other good G+ content you are missing?

G+ is largely misunderstood. It is a lousy tool for interaction with people connected to you purely socially. It's a very good way to find and interact with people connected to you by interest.

icefox 12 years ago | | |

The really sad thing is that I have tried several times to search for content that I know exists on G+, but I can't find it, even when I knew the author. After the third time failing at this my usage of G+ dropped significantly. Of all of the things that you would think would work, search would be at the top... :|

ChikkaChiChi 12 years ago | | |

This is exactly how I explain Google+ to folks. It's built for communities, not cliques.

jan_g 12 years ago | |

For me it's not just G+, but also Facebook and Twitter. Only reason I ever visit those sites is indirectly through HN posts and similar.

npsimons 12 years ago | |

I know this is completely off-topic, and I'll happily be downvoted for it, but why in the world does Google+ capture keyboard shortcuts that are already bound to other well known browser functions? (C-PgUp, C-PgDn, C-w, etc).

unsigner 12 years ago | |

Linus : G+ :: notch : Java

xentronium 12 years ago | | |

That's unfair. Lots of infrastructural projects are done in java. E.g. my personal favorite: lucene (+ solr, elasticsearch).

kurrent 12 years ago | |

I wonder if Linus ever reads Hacker News.....

oneeyedpigeon 12 years ago |

I don't quite get Linus' problem with XML for document markup (for anything else - config files, build scripts - sure, XML is horrible). Does anyone know any more details about what his specific gripe is? For me, asciidoc (which looks very similar, conceptually, to markdown) suffers from one huge problem: it's incomplete. Substituting symbols for words results in a more limited vocabulary, if that vocabulary is to remain at all memorable.

Sure, XML can be nasty, but that's very much a function of the care taken to a) format the file sensibly b) use appropriate structure (i.e. be as specific as necessary, and no more).

josephlord 12 years ago |

https://github.com/torvalds/subsurface

I didn't really know what he was talking about but I think this is it.

The title does need changing though as it is definitely file formats under discussion not file systems.

vfclists 12 years ago |

What is it with HN commenters and their demented ability to send topics completely off track? I would have thought someone might have examined the code or what Linus is trying to implement and commented about it.

But here we have threads about Lua, why people hate XML and love JSON and all kinds of irrelevant issues which have been well hashed elsewhere ad nauseam. Why not restrict to an analysis of whatever it is Linus is developing?

HN is getting truly annoying and sucky, if it isn't so already.

fuzzix 12 years ago |

> "I actually want to have a good mental picture of what I'm doing before I start prototyping. And while I had a high-level notion of what I wanted, I didn't have enough of a idea of the details to really start coding."

This I like. The race away from the waterfall straw man has also stripped us of the advantages of BDUF.

While rigid phase-driven project management helps nobody, I think there's still room for speccing as much as we can upfront within iterative processes.

Or you could run to the IDE and start ramming design pattern boilerplate down its throat the second you're out of the first meeting ;)

hvidgaard 12 years ago | |

You should be speccing what you want to achieve: the goals, the why, the impact, the external limitations, measures of success and so forth. This also allows you to describe and plan testing up front. The "how" is best handled in an iterative manner.

A lot of people use AGILE to avoid planning at all, which is a particularly destructive anti-pattern, and the exact opposite of what you need.

fuzzix 12 years ago | | |

> "A lot of people use AGILE to avoid planning at all"

Yup, I've seen this a lot.

In one instance "Agile" meant I could finish a major task using an unfamiliar language, framework and code base in short order.

Genuinely, the customer was told "Of course, fuzzix here is familiar with Agile processes so you should have this in 3 weeks".

Edit: of course this also meant there was no formal spec for the task, though I did have a photo of the whiteboard.

pessimizer 12 years ago | | |

>The "how" is best handled in an iterative manner.

I think that the first "how" should be planned as much as anything else. I understand how you refactor from v0.0.1 to v5.34.2 iteratively, but I think that getting from vNothing to v0.0.1 is qualitatively different.

If I don't have a complete idea of how my minimally functional thing will work that is small enough that I can completely hold it in my head, and instead just architect by agglutination and test writing, 1) my results are going to be hacky garbage, 2) my first 50 iterations are going to be devoted to replacing it all haphazardly to fix bugs, and 3) the code and interface will become increasingly more complex, harder to work with, and strewn with special cases.

When v0.0.1 is well planned, v2.5.2 may not look anything like the plan anymore, but in my experience it becomes shorter, cleaner, and more correct rather than a giant ball of band-aids propped up with tests.

splitbrain 12 years ago |

he talks about a save file format, not a file system. or do we have different concepts of "file system"?

sp332 12 years ago | |

I agree it's confusing, I think the submitter just meant "system for files" or something.

Pxtl 12 years ago | | |

That would be excusable if we were talking about somebody who writes higher-level programs, but not for a kernel developer.

k2enemy 12 years ago |

I don't really understand what he's talking about here (my ignorance, not his fault.) Is it something like https://camlistore.org/ that is a content addressable (the git part) datastore?
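
For anyone unsure what "content addressable" means in git terms, here is a small sketch of how git derives a blob's ID from its bytes, wrapped in a toy in-memory store. The `ContentStore` class is invented for illustration; nothing here is taken from Linus's actual save format, and whether it stores data this way is not stated in the post.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID git assigns to a blob: SHA-1 over a header plus the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ContentStore:
    """Toy in-memory store where the key is derived from the content itself."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self._objects[oid] = content  # identical content is stored only once
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"dive 1: 30m, 45min\n")
second = store.put(b"dive 1: 30m, 45min\n")  # same bytes, same ID
```

Since the ID depends only on the bytes, storing the same content twice yields the same key, and any change to the content yields a different one.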

saljam 12 years ago | |

Yep, I thought it sounded like Camlistore, but as a library.

pcj 12 years ago |

>>So I've been thinking about this for basically months, but the way I work, I actually want to have a good mental picture of what I'm doing before I start prototyping. And while I had a high-level notion of what I wanted, I didn't have enough of a idea of the details to really start coding.

This might be a tangential discussion. Earlier, I used to have a similar approach. Can't code until I have the complete picture. But it's tough to do in a commercial world where you have deliverables. So, nowadays, I start with what I know and scramble my way until I get a better picture. There are times when that approach works. But, there have been days where I was like - "wish I had spent some more time thinking about this".

I am curious how folks on HN handle this "coding block".

tonyarkles 12 years ago | |

I've got a few strategies that might help, depending on the circumstances.

A notebook: I'll write down some notes and just kind of free write whatever thoughts come to mind. If there's something that I think is important to come back to, I'll draw an empty box in the left margin (to be filled with a check mark later)

Readme: start writing the Readme for the project, even if you're not entirely sure of the details. Include code examples. If you don't like how the API is coming together, change it. It's way less work to modify the API now than it will be later.

Write a test: I don't always unit test, but when I do I test first :). This works well on projects that already have a decent test suite. It's kind of an executable version of the Readme.

Branch and Hack: branches are cheap. Make one and start playing. Don't like how it's turning out? Make a new branch and try again!

Ctrl-Z: maybe the answer won't come to you right away. Let it sit and run in the background for a while and come back to it. If I'm worried about forgetting details, I'll write it down in a notebook first.

aashishkoirala 12 years ago |

This is what Linus does. He has strong opinions and he throws them around. You can't let that get to you. Both XML and JSON are just fine if used properly.

icebraining 12 years ago | |

http://harmful.cat-v.org/software/xml/

theandrewbailey 12 years ago | |

This is the first profanity-free Linus rant that I've read in a long time.

vacri 12 years ago | | |

Almost all of Torvalds' "profanity rants" that get passed around are the result of frustration at an existing conversation, and you can find profanity-free comments by him simply by checking out a slightly earlier one.

aashishkoirala 12 years ago | | |

Haha, right. I play in the .NET space, so it's never going to happen, but God help us both if I ever have to end up working for this guy.

beagle3 12 years ago |

And the actual description is here: http://lists.hohndel.org/pipermail/subsurface/2014-March/010...

tedchs 12 years ago |

Why reinvent on-disk data formats when you can just make a file of protocol buffers? https://code.google.com/p/protobuf/

sparkie 12 years ago | |

Why reinvent binary serialization when you could use ASN.1, or any of the thousand binary serialization formats that pre-date protobufs?

lern_too_spel 12 years ago | | |

For that specific example, you can find a good discussion here: https://groups.google.com/forum/m/#!topic/protobuf/eNAZlnPKV...

McP 12 years ago | |

Ironically that has already been reinvented in the form of Cap'n Proto: http://kentonv.github.io/capnproto/

(other than that I agree it's a good solution)

Gonzih 12 years ago |

Current title that I see "Linus Torvalds on implementation of human-readable file system" is off. It's about file formats, not file systems.

senthilnayagam 12 years ago |

Why do you need to view the filesystem and make it readable for humans? You would interact with it via commands like "ls" or some GUI.

git as the basis of filesystem is interesting, hope we don't need to manually make branches and commits to use it

oneeyedpigeon 12 years ago | |

Did you read the article? It's not really about the filesystem. 1 part your fault for seemingly not reading the article you're commenting about, 1 part the submitter's fault for choosing such a misleading title.

joelhaasnoot 12 years ago |

Worked on a project a few years ago where we needed distributed sync capability. Using git (or bazaar or mercurial) was one of the options - store everything in it versus a database. Interesting to see the same thought "coming back".

fit2rule 12 years ago | |

I've also used libgit as a means to a similar end - providing versioned data across a local filesystem. It's an idea whose time has come ..

hardwaresofton 12 years ago |

Why not sqlite or s-expressions? Linus states that databases can't hold previous state but that's not really true...

I'm not sure why git is the best tool for the job in this case, even after reading the post & some of the contents.

tmzt 12 years ago | |

They can, if you recreate the primary feature of Git on top of them.

signa11 12 years ago |

erik-naggum's most excellent xml rant: http://www.schnada.de/grapt/eriknaggum-xmlrant.html

sam_bwut 12 years ago |

At work we have a git backed document store that just saves as json - versioning makes keeping track of audit points nice and easy.
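
One detail that tends to matter for a setup like this is writing the JSON deterministically, so that version-control diffs show only real changes. A minimal sketch, assuming documents are plain dicts (the function name is hypothetical and this is not a description of the store mentioned above):

```python
import json


def to_versionable_json(doc: dict) -> str:
    """Serialize with sorted keys and fixed indentation for stable diffs."""
    return json.dumps(doc, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Two dicts with the same content but different key order.
a = {"title": "audit", "tags": ["q1", "finance"], "version": 3}
b = {"version": 3, "tags": ["q1", "finance"], "title": "audit"}
text_a = to_versionable_json(a)
text_b = to_versionable_json(b)
```

With sorted keys and one item per line, both dicts produce byte-identical files, and a later edit to a single field shows up as a one-line diff.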

twic 12 years ago |

Title is entirely misleading. Tech support! TECH SUPPORT!!

anon4 12 years ago | |

Have you tried turning it off and on again?

theandrewbailey 12 years ago | |

Is your title plugged in?

meapix 12 years ago |

xml haters!!! using other formats how can I define DTDs?

1ris 12 years ago | |

https://news.ycombinator.com/item?id=7333354

  <customer>
    <custid>496F3AB</custid>
    <accounts>
      <account>
        <type>Personal</type>
        ...
      </account>
      <account>
        <type>Business</type>
        ...
      </account>
    </accounts>
  </customer>