Rethinking Files

110 points by UkiahSmith 7 years ago | 68 comments

erichanson 7 years ago |

Nice to see folks rethinking files, as they're a scourge on the planet and an antiquated anti-pattern that has been holding back the industry pretty much since its inception. I don't know how anyone could take a look at /etc for example, and consider it anything but archaic. The adduser command is some 1130 lines long, and all it does is do CRUD on files, to name just one example. Then there are countless config files that just have to be edited by hand and happily accept syntax errors and logical errors. No modern system would tolerate this.

The root of the problem with files is that they lack an information model, beyond just a sequence of bytes. They are unopinionated to a fault. All files have structure. Even if that structure is a "non-structure" like "all these files are just a random sequence of meaningless bytes", then that is their structure. But this information isn't present in the system, nor can it be enforced or constrained when that is desirable.

To me, the obvious alternative is the database, aka "everything is a row". We have used the database (relational or otherwise, but mostly relational) to successfully model many many domains, and bring coherence and clarity to them. The cool thing about the relational database is that it's based on an underlying relational algebra. The syntax of data in an RDBMS is really just one manifestation of a deeper layer of structure that is syntax-free, and these abstract structures can be (and are) manifested in multiple coexisting syntaxes.
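
To make "everything is a row" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table and the `add_user` helper are hypothetical stand-ins for /etc/passwd and adduser, not anything from Aquameta; the point is only that the constraints live in the system and bad data is refused at write time.

```python
import sqlite3

# Hypothetical schema: a "users" table standing in for /etc/passwd.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        name  TEXT    PRIMARY KEY,
        uid   INTEGER NOT NULL UNIQUE CHECK (uid >= 0),
        shell TEXT    NOT NULL DEFAULT '/bin/sh'
    )
    """
)

def add_user(name, uid):
    """Insert a user; return True on success, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (name, uid) VALUES (?, ?)", (name, uid))
        return True
    except sqlite3.IntegrityError:
        return False

ok = add_user("alice", 1000)
duplicate_uid = add_user("bob", 1000)   # rejected: uid must be unique
negative_uid = add_user("carol", -5)    # rejected: CHECK (uid >= 0)
names = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY name")]
```

A hand-edited text file would have accepted all three lines without complaint.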

I'm exploring this pattern ("datafication", headshake) with Aquameta (http://aquameta.org/) and have written a lot more about why file-centric is holding us back (http://blog.aquameta.com/intro-chpater2-filesystem/). Boot to PostgreSQL! :)

agumonkey 7 years ago | |

It's something that felt right in COBOL (and COBOL rarely feels right in this day and age). The file IO is record based at the core, so opening a file is basically a crude (npi) sql statement. It makes a lot of tiny things simpler IMO.
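
A rough Python analogue of record-based file I/O, for anyone who hasn't seen it. This is not COBOL and the record layout is made up for illustration; it only shows how fixed-width records let you address "record n" directly instead of parsing bytes.

```python
import io
import struct

# Hypothetical record layout: 10-byte name, 4-byte unsigned id, fixed width.
RECORD = struct.Struct("<10sI")

def write_record(f, name, ident):
    """Append one fixed-width record at the current position."""
    f.write(RECORD.pack(name.encode("ascii"), ident))

def read_record(f, index):
    """Seek straight to record `index` and decode it."""
    f.seek(index * RECORD.size)
    raw_name, ident = RECORD.unpack(f.read(RECORD.size))
    return raw_name.rstrip(b"\x00").decode("ascii"), ident

f = io.BytesIO()
for i, name in enumerate(["ada", "grace", "edsger"]):
    write_record(f, name, 100 + i)

second = read_record(f, 1)
```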

gjs278 7 years ago | |

Files in /etc can be modified by any text editor. It sounds like you are advocating for a system similar to the Windows registry. It can easily corrupt and can't be fixed by live CDs or other operating systems that have the filesystem driver. It would be a massive step backwards.

erichanson 7 years ago | | |

I agree the system you describe would be a massive step backwards. :)

No, I'm saying the OS needs an information model as a first-class citizen. But since you brought up data corruption: This hypothetical OS could also benefit from having transaction support up in application space -- to avoid data corruption -- something most "modern" programs don't even have, even though most file systems do.
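
As a small illustration of transactions in application space, here is a sketch with Python's sqlite3 module. The `config` table and the `update_both` helper are assumptions made up for the example; the idea is just that two related changes either both land or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE config (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
conn.execute("INSERT INTO config VALUES ('hostname', 'old-host')")
conn.execute("INSERT INTO config VALUES ('domain', 'old.example')")
conn.commit()

def update_both(hostname, domain):
    """Apply both changes or neither."""
    try:
        with conn:  # one transaction: commit on success, rollback on exception
            conn.execute("UPDATE config SET value = ? WHERE key = 'hostname'", (hostname,))
            conn.execute("UPDATE config SET value = ? WHERE key = 'domain'", (domain,))
        return True
    except sqlite3.IntegrityError:
        return False

failed = update_both("new-host", None)   # NULL violates NOT NULL, so nothing changes
after_failure = dict(conn.execute("SELECT key, value FROM config"))
```

The first UPDATE succeeded on its own, but the rollback means the half-finished state is never visible.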

To be fair, I do think the database needs a way to edit the contents of a field using your favorite text editor. But we've got a pgfs plugin that uses FUSE to make the database accessible as a filesystem as well.

zokier 7 years ago |

> People like Unix's “everything is a file” approach because what it really means is “everything is exposed to the same nexus”. It means you need only ssh to a system and you have all the power to reshape all aspects of that system with a single interface, the command line, using a common set of highly composable tools

But at least in Linux there are a ton of files that are not exposed to the "same nexus", i.e. the filesystem. The most common example would be network sockets. They are files, but do not exist anywhere in the filesystem. In Linux a file is more of an object handle.
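
A quick way to see this from Python, assuming a Unix-like system: a socket has an ordinary file descriptor that fstat and write accept, yet it has no path you could open.

```python
import os
import socket
import stat

# A connected socket pair: no network access needed.
a, b = socket.socketpair()
try:
    fd = a.fileno()                      # an ordinary integer file descriptor
    is_socket = stat.S_ISSOCK(os.fstat(fd).st_mode)
    os.write(fd, b"ping")                # generic fd-level write works on it
    received = b.recv(4)
finally:
    a.close()
    b.close()
```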

https://yarchive.net/comp/linux/everything_is_file.html

http://events17.linuxfoundation.org/sites/events/files/slide...

Lowkeyloki 7 years ago |

I found the URL addressing scheme of Redox to be fascinating, if perhaps slightly less user-friendly compared to files and file paths.

https://doc.redox-os.org/book/design/url/urls.html

hlandau 7 years ago | |

Personally, Redox's use of URLs seemed like really bad design to me. It doesn't get simpler than the Unix path syntax.

Having a scheme:// makes sense for URLs because you don't otherwise have any contextual information indicating how to access a resource. But this isn't the case for something like a virtual filesystem, where the total set of filesystems mounted under it - and their types - are all known to the system. There's no need for disk://foo when you can just have /dev/disk/foo.

mpweiher 7 years ago | | |

That's true when the namespace covers objects that are very similar to access, ideally identical.

If that's not the case, I have found the scheme to be helpful to indicate what's going on.
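
To show what the scheme actually carries, here is a tiny sketch with urllib.parse. The `describe` helper and the default of "file" for plain paths are my assumptions for illustration, not Redox's or anyone else's resolution rules.

```python
from urllib.parse import urlsplit

def describe(identifier):
    """Return (scheme, rest); plain paths get a hypothetical default of 'file'."""
    parts = urlsplit(identifier)
    scheme = parts.scheme or "file"
    rest = parts.netloc + parts.path
    return scheme, rest

with_scheme = describe("disk://foo")
plain_path = describe("/dev/disk/foo")
```

With the scheme, the access method is stated in the identifier; with the plain path, it has to come from what is mounted at that prefix.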

EmilStenstrom 7 years ago |

I recommend opening up developer tools and adding this before reading this article:

  body {
      width: 40em;
      margin: 0 auto;
      font-size: 1.4em;
      line-height: 1.4em;
  }

saadat 7 years ago | |

Or use the Reader View/Reader mode in Firefox/Safari.

OJFord 7 years ago |

What if the interaction were more like OOP - the File class wouldn't necessarily make sense as the top parent.

Would be kind of interesting to call methods on objects rather than read/write files, but it's not immediately obvious to me that that really gains anything over the status quo.

And now that I've written that, I wonder is that what powershell's verb-object does anyway? I've never come close to proficient enough (nor wanted to!) to know.

mpweiher 7 years ago |

That's kind of the point of stores:

https://github.com/mpw/MPWFoundation/blob/master/Documentati...

and Polymorphic Identifiers:

http://objective.st/URIs/

Hierarchical paths were a good idea, let's use them. Objects were also a good idea, let's use those. A small set of verbs (GET, PUT, POST, DELETE) was also a good idea. Let's combine these!

Abstract from:

   Path    + File       + POSIX I/O
   URI     + Resource   + REST Verbs

Get:

1. Polymorphic Identifiers, which subsume paths, URIs, variables, dictionary keys etc.

2. Stores, which resolve URIs, subsume filesystems, HTTP servers, dictionaries, etc.

3. A small protocol that essentially mirrors REST verbs in-process
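
The three pieces above can be sketched in a few lines of Python. To be clear, this is not the MPWFoundation or Objective-Smalltalk API; `DictStore`, `SchemeStore`, and the `var:`/`env:` schemes are hypothetical names chosen only to show the shape: identifiers with a scheme, stores that resolve them, and a small get/put/delete protocol.

```python
from urllib.parse import urlsplit

class DictStore:
    """Store backed by a dictionary; paths are keys."""
    def __init__(self):
        self._data = {}
    def get(self, path):
        return self._data.get(path)
    def put(self, path, value):
        self._data[path] = value
    def delete(self, path):
        self._data.pop(path, None)

class SchemeStore:
    """Store that routes each identifier to a sub-store by its scheme."""
    def __init__(self, **stores):
        self._stores = stores
    def _resolve(self, identifier):
        parts = urlsplit(identifier)
        return self._stores[parts.scheme], parts.netloc + parts.path
    def get(self, identifier):
        store, path = self._resolve(identifier)
        return store.get(path)
    def put(self, identifier, value):
        store, path = self._resolve(identifier)
        store.put(path, value)
    def delete(self, identifier):
        store, path = self._resolve(identifier)
        store.delete(path)

root = SchemeStore(var=DictStore(), env=DictStore())
root.put("var:greeting", "hello")
root.put("env:HOME", "/home/alice")
greeting = root.get("var:greeting")
root.delete("var:greeting")
gone = root.get("var:greeting")
```

Because `SchemeStore` speaks the same protocol as the stores it wraps, stores compose: a filesystem or HTTP store would slot in next to `DictStore` without the caller changing.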

syn0byte 7 years ago |

You're not solving anything; at best you are getting maybe one extra level of abstraction by shifting potential complexity in the application (which may or may not care about internal file "schema", and if not, has no code for it) to concrete complexity in the system: your application doesn't utilize file schema, but some applications might, so everyone gets a schema field and there is a bunch of extra code and complexity to support it.

From a security/reliability standpoint it sounds like a nightmare combining the worst of things like NTFS alternate data streams and shared library loading into one.

leoc 7 years ago |

See my earlier comment, https://news.ycombinator.com/item?id=14542595 .

Lotus Agenda/Chandler https://en.wikipedia.org/wiki/Chandler_(software) is another part of this long Grail quest.

bayareanative 7 years ago |

Files are too finite and low-level, and they lose the generating/parsing knowledge, which then gets implemented N times in N places. OSes should read and write message-oriented streams of records (pb, capnp or similar) that are invisible to the user, while tools and code see data and data structures. This solves many problems of unnecessary and repeated effort parsing log files, log file rotation, proprietary file formats, portability, compatibility and extensibility.
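
A minimal sketch of what a message-oriented stream of records could look like, using length-prefixed JSON as a stand-in for pb or capnp. The 4-byte big-endian prefix and the field names are assumptions for the example, not any real log format.

```python
import io
import json
import struct

LENGTH = struct.Struct(">I")  # 4-byte big-endian length prefix (an assumption)

def write_record(stream, record):
    """Serialize one record and frame it with its length."""
    payload = json.dumps(record).encode("utf-8")
    stream.write(LENGTH.pack(len(payload)))
    stream.write(payload)

def read_records(stream):
    """Yield records until the stream runs out of complete headers."""
    while True:
        header = stream.read(LENGTH.size)
        if len(header) < LENGTH.size:
            return
        (size,) = LENGTH.unpack(header)
        yield json.loads(stream.read(size).decode("utf-8"))

log = io.BytesIO()
write_record(log, {"level": "info", "msg": "started"})
write_record(log, {"level": "error", "msg": "disk full", "code": 28})
log.seek(0)
errors = [r for r in read_records(log) if r["level"] == "error"]
```

The consumer filters on a field rather than writing yet another regex over log lines.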

Also, programs should be able to dynamically-serve the contents of "files" as well with an "activation symlink", i.e.,

    /etc/resolv ->* resolvconf

The "everything must be plain text" refrain is obsolete and unnecessary because it's trivial to serialize anything to any format since it would already be a universally supported data structure both in tools and code.

It's not 1978 anymore.

RcouF1uZ4gsC 7 years ago | |

Sounds a lot like WinFS https://en.m.wikipedia.org/wiki/WinFS

O_H_E 7 years ago |

Two sic projects that can help manage files until we get another system

TMSU - tags your files and then access them through a virtual filesystem from any other application

https://tmsu.org -- https://github.com/oniony/TMSU

Tagsistant - Semantic filesystem for Linux, with relation reasoner, autotagging plugins and deduplication

https://www.tagsistant.net -- https://github.com/StrumentiResistenti/Tagsistant

solidsnack9000 7 years ago |

The examples given at the end, where verbs are commands at certain paths, look a lot like a special file system. All the printers are under `/print` and all the print commands are under `/print`. One could imagine all the database tables being under `/db` and all the commands being under `/db/bin`.

ubrpwnzr 7 years ago |

Another site, can we please just add something like this:

</style>

tgbugs 7 years ago |

I've done some silly things [0] with python's pathlib recently that seem related to the issues discussed here. Given that smalltalk message passing finally clicked for me during the process, I am attracted to an object-like solution for everything (or a file-object-like solution for everything, since the practical performance advantages are undeniable). That said there are some considerations both for the low level implementation, and for high level things like affordances for 'file' operations.

In direct response to the suggestion about file paths for verbs. Alan Kay says in one (possibly many) of his talks something along the lines of 'every function should have a url.' One of surely many challenges is how to ensure that the mechanism used to populate file system paths with nested functionality (e.g. /usr/bin/ls/all to `ls -a`) doesn't trigger malicious behavior during service/capability discovery. Being able to more deeply introspect file data and metadata as if the file were a folder could potentially be implemented as a plugin, and I worry about the complexity of requiring a file system to know about the contents of the files that it hosts, or that the files themselves be required to know about how to tell the file system about themselves. Existing file systems adhere to a fairly strict separation of concerns, since who knows what new file format or language will appear, and who knows what file system the file will need to exist on.

Said another way I think that the primary issue with the suggested approach is that it is hard to extend. The file system itself needs to know about the new type of object that it is going to represent, rather than simply acting as an index of paths to all objects. If there is a type of object that is opaque to the current version of the file system that object either has to implement a file-system-specific discovery protocol (which surely would have fun security considerations if it were anything other than a static manifest) or the user has to wait for a new version of the file system that knows what to do with that file type.

Some thoughts from my own work. (partially in the context of OJFord's comment below)

Treating files and urls as objects that have identifiers, metadata, and data portions and where the data portion is treated as a generator is very powerful, but the affordances around the expression local_file.data = remote_file.data make me hesitate. When assignment can trigger a network saturating read operation, or when the setter doesn't know anything about how much space is on a disk, etc. then there are significant footguns waiting to be fired and I have already shot myself a couple of times.
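
A stripped-down sketch of that footgun, to show why the assignment is deceptive. This is not the code from [0]; `FileLike` and `bytes_written` are hypothetical, and the "remote" side is just an in-memory list of chunks.

```python
class FileLike:
    """Hypothetical object whose `data` is a generator of chunks."""
    def __init__(self, chunks=()):
        self._chunks = list(chunks)
        self.bytes_written = 0

    @property
    def data(self):
        # Reading yields chunks lazily, as a remote fetch would.
        yield from self._chunks

    @data.setter
    def data(self, chunks):
        # The footgun: a plain assignment drains the whole source here.
        self._chunks = []
        for chunk in chunks:
            self._chunks.append(chunk)
            self.bytes_written += len(chunk)

remote_file = FileLike([b"a" * 1024] * 4)
local_file = FileLike()
local_file.data = remote_file.data   # looks like a cheap assignment, copies everything
```

Nothing at the call site hints that the last line did 4 KiB of I/O; scale that to a multi-gigabyte remote object and the problem is obvious.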

The more homogeneous the object interface the better. However, this comes with a major risk. If the underlying systems you are wrapping have different operational semantics (think file system vs database transactions) and there is no way to distinguish between them based solely on the interface (because it is homogeneous) then disaster will strike at some point due to a mismatch. To avoid this everything built on top of the object representation has to be implemented under the assumption of the worst case possible behavior, making it difficult to leverage the features of more advanced systems. As with the affordances around local.data = remote.data, if I have a socket, a local file, a remote web page that I own, a handle to an LED, a handle to a stop light, a database row in a table that has triggers set, the stdin to an open ssh session, and a network ring buffer all represented in the same object system, I have as many meanings for file_object.write('something') as I have types of objects, and the consequences and side effects of calling write are so diverse (from flipping bits on a hard drive to triggering arbitrary code execution) that it is all but guaranteed that something will go horribly wrong. At the very least there would need to be a distinction between operations where all side effects could be accounted for beforehand (e.g. writing a file of known length to disk has the side effect of reducing free disk space, but that is known before the operation starts), and operations where the consequences will depend on the contents of the message (e.g. DROP TABLES), with perhaps a middle ground for cases with static side effects (e.g. the database trigger) but that would not be immediately visible to the caller and that might change from time to time.
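
One way that three-way distinction could be made visible in a homogeneous interface, sketched in Python. Everything here (`SideEffects`, `Target`, the `allow` parameter) is hypothetical; it is only meant to show the caller opting in to a class of side effects rather than discovering it after the fact.

```python
from enum import Enum

class SideEffects(Enum):
    KNOWN_BEFOREHAND = 1    # e.g. writing a file of known length
    STATIC_HIDDEN = 2       # e.g. a database trigger the caller cannot see
    MESSAGE_DEPENDENT = 3   # e.g. a SQL connection or an ssh session's stdin

class Target:
    """Hypothetical writable object that declares its side-effect class."""
    def __init__(self, name, effects):
        self.name = name
        self.effects = effects
        self.written = []

def write(target, message, allow=frozenset({SideEffects.KNOWN_BEFOREHAND})):
    """Write only if the caller has opted in to the target's side-effect class."""
    if target.effects not in allow:
        raise PermissionError(f"{target.name}: {target.effects.name} not allowed")
    target.written.append(message)

disk_file = Target("notes.txt", SideEffects.KNOWN_BEFOREHAND)
sql_conn = Target("db", SideEffects.MESSAGE_DEPENDENT)

write(disk_file, "something")
try:
    write(sql_conn, "something")
    refused = False
except PermissionError:
    refused = True
```

The same `write` call works on both objects, but the dangerous one has to be asked for explicitly with a wider `allow` set.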

The distinction between files and folders is quite annoying (non-homogeneous), especially if you want to require that certain pieces of metadata always 'follow' a file. This is from working with xattrs that are extremely easy to lose if you aren't careful. Xattrs are a great engineering optimization to make use of dead space in the file system data structure, but they aren't quite the full abstraction one would want. It is also not entirely clear what patterns to use when you have a file that is also a folder -- do you make the metadata the outer file and the data the inner file? Or the other way around? Having the metadata as the outer file means that you can change the metadata without changing the data, but that the metadata will always 'update' when its contents (the data) changes. However, when I first thought about using such a system, I had it the other way around, and a system with that much flexibility I suspect would have even more footguns than the current system.

Another issue is the long standing question around what constitutes an atomic operation. Everything is simple if only a single well behaved program is ever going to touch the files, but trying to build a full object-like system on top of existing systems is a recipe for leaky abstraction nightmares.

While I was working on this I came across debates from before I was born. For example hardlinks vs symlinks. There are real practical engineering tradeoffs that I can't even begin to comment on because I don't understand the use cases for hardlinks well enough to say why we didn't just get rid of them entirely.
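
For reference, the behavioral difference between the two is easy to demonstrate, assuming a POSIX filesystem that supports both link types. The file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")

    with open(original, "w") as f:
        f.write("data")
    os.link(original, hard)      # second name for the same inode
    os.symlink(original, soft)   # a small file that stores the path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    os.remove(original)

    with open(hard) as f:
        hard_contents = f.read()         # still readable after the original name is gone
    soft_dangles = os.path.islink(soft) and not os.path.exists(soft)
```

The hardlink keeps the data alive because it is the data's name; the symlink only remembers a path, so it dangles once that path is removed.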

0. https://github.com/SciCrunch/sparc-curation/blob/master/spar...