Rethinking Files (devever.net)
The root of the problem with files is that they lack an information model, beyond just a sequence of bytes. They are unopinionated to a fault. All files have structure. Even if that structure is a "non-structure" like "all these files are just a random sequence of meaningless bytes", then that is their structure. But this information isn't present in the system, nor can it be enforced or constrained when that is desirable.
To me, the obvious alternative is the database, aka "everything is a row". We have used the database (relational or otherwise, but mostly relational) to successfully model many, many domains, and bring coherence and clarity to them. The cool thing about the relational database is that it's based on an underlying relational algebra. The syntax of data in an RDBMS is really just one manifestation of a deeper layer of structure that is syntax-free, and these abstract structures can be (and are) manifested in multiple coexisting syntaxes.
I'm exploring this pattern ("datafication", headshake) with Aquameta (http://aquameta.org/) and have written a lot more about why the file-centric approach is holding us back (http://blog.aquameta.com/intro-chpater2-filesystem/). Boot to PostgreSQL! :)
No, I'm saying the OS needs an information model as a first-class citizen. But since you brought up data corruption: This hypothetical OS could also benefit from having transaction support up in application space -- to avoid data corruption -- something most "modern" programs don't even have, even though most file systems do.
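As a rough illustration of what transactions and an enforced information model give an application, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the hypothetical OS-level store. The accounts table, the CHECK constraint, and the transfer function are assumptions made up for the example, not anything from Aquameta.

```python
import sqlite3

def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    """Move `amount` from src to dst: both updates land, or neither does."""
    with conn:  # commits on success, rolls back if the block raises
        # Credit first, so a failure on the debit would leave a half-done
        # state behind if there were no transaction around the two writes.
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )

conn = sqlite3.connect(":memory:")
# The schema is the information model: the store itself rejects negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

transfer(conn, "alice", "bob", 30)       # succeeds
try:
    transfer(conn, "alice", "bob", 500)  # violates the CHECK on alice's row
except sqlite3.IntegrityError:
    pass                                 # the credit to bob was rolled back too

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer fails on the debit, and the rollback also undoes the credit that had already been applied, which is the property a plain sequence of file writes does not give you.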
To be fair, I do think the database needs a way to edit the contents of a field using your favorite text editor. But we've got a pgfs plugin that uses FUSE to make the database accessible as a filesystem as well.
But at least in Linux there are a ton of files that are not exposed to the "same nexus", i.e. the filesystem. The most common example would be network sockets. They are files, but do not exist anywhere in the filesystem. In Linux, a file is more of an object handle.
https://yarchive.net/comp/linux/everything_is_file.html
http://events17.linuxfoundation.org/sites/events/files/slide...
Having a scheme:// makes sense for URLs because you don't otherwise have any contextual information indicating how to access a resource. But this isn't the case for something like a virtual filesystem, where the total set of filesystems mounted under it - and their types - are all known to the system. There's no need for disk://foo when you can just have /dev/disk/foo.
If that's not the case, I have found the scheme to be helpful to indicate what's going on.
Would be kind of interesting to call methods on objects rather than read/write files, but it's not immediately obvious to me that that really gains anything over the status quo.
And now that I've written that, I wonder is that what powershell's verb-object does anyway? I've never come close to proficient enough (nor wanted to!) to know.
https://github.com/mpw/MPWFoundation/blob/master/Documentati...
and Polymorphic Identifiers:
Hierarchical paths were a good idea, let's use them. Objects were also a good idea, let's use those. A small set of verbs (GET, PUT, POST, DELETE) was also a good idea. Let's combine these!
Abstract from:
Path + File + POSIX I/O
URI + Resource + REST Verbs
Get:
1. Polymorphic Identifiers, which subsume paths, URIs, variables, dictionary keys etc.
2. Stores, which resolve URIs, subsume filesystems, HTTP servers, dictionaries, etc.
3. A small protocol that essentially mirrors REST verbs in-process
See also: In-process REST, https://link.springer.com/chapter/10.1007/978-1-4614-9299-3_...
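A minimal sketch of how the three pieces could fit together, written in Python rather than the commenter's own language. Every name here (Store, DictStore, DirStore, SchemeResolver) is hypothetical and chosen for the example; it is not the MPWFoundation API.

```python
import os
from typing import Dict, Optional, Protocol, Tuple

class Store(Protocol):
    """The small in-process protocol mirroring REST verbs."""
    def get(self, path: str) -> Optional[bytes]: ...
    def put(self, path: str, value: bytes) -> None: ...
    def delete(self, path: str) -> None: ...

class DictStore:
    """A store backed by a dictionary (subsumes variables and dict keys)."""
    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}
    def get(self, path: str) -> Optional[bytes]:
        return self._items.get(path)
    def put(self, path: str, value: bytes) -> None:
        self._items[path] = value
    def delete(self, path: str) -> None:
        self._items.pop(path, None)

class DirStore:
    """A store backed by a directory. No protection against '..' in paths."""
    def __init__(self, root: str) -> None:
        self._root = root
    def _full(self, path: str) -> str:
        return os.path.join(self._root, path.lstrip("/"))
    def get(self, path: str) -> Optional[bytes]:
        try:
            with open(self._full(path), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None
    def put(self, path: str, value: bytes) -> None:
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(value)
    def delete(self, path: str) -> None:
        try:
            os.remove(self._full(path))
        except FileNotFoundError:
            pass

class SchemeResolver:
    """Resolves 'scheme:path' identifiers by handing the path to a store."""
    def __init__(self) -> None:
        self._stores: Dict[str, Store] = {}
    def register(self, scheme: str, store: Store) -> None:
        self._stores[scheme] = store
    def _split(self, ref: str) -> Tuple[Store, str]:
        scheme, _, path = ref.partition(":")
        return self._stores[scheme], path
    def get(self, ref: str) -> Optional[bytes]:
        store, path = self._split(ref)
        return store.get(path)
    def put(self, ref: str, value: bytes) -> None:
        store, path = self._split(ref)
        store.put(path, value)
    def delete(self, ref: str) -> None:
        store, path = self._split(ref)
        store.delete(path)
```

With a DictStore registered under "var" and a DirStore under "file", the same get/put/delete calls work on "var:/myhome/doorbell/ringing" and "file:/notes/todo.txt" without the caller knowing which kind of store sits behind the identifier.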
From a security/reliability standpoint it sounds like a nightmare, combining the worst of things like NTFS alternate data streams and shared library loading into one.
Lotus Agenda/Chandler https://en.wikipedia.org/wiki/Chandler_(software) is another part of this long Grail quest.
Also, programs should be able to dynamically-serve the contents of "files" as well with an "activation symlink", i.e.,
/etc/resolv ->* resolvconf
The "everything must be plain text" refrain is obsolete and unnecessary because it's trivial to serialize anything to any format, since it would already be a universally supported data structure both in tools and code. It's not 1978 anymore.
TMSU - tags your files and then access them through a virtual filesystem from any other application
https://tmsu.org -- https://github.com/oniony/TMSU
Tagsistant - Semantic filesystem for Linux, with relation reasoner, autotagging plugins and deduplication
https://www.tagsistant.net -- https://github.com/StrumentiResistenti/Tagsistant
In direct response to the suggestion about file paths for verbs: Alan Kay says in one (possibly many) of his talks something along the lines of 'every function should have a url.' One of surely many challenges is how to ensure that the mechanism used to populate file system paths with nested functionality (e.g. /usr/bin/ls/all to `ls -a`) doesn't trigger malicious behavior during service/capability discovery. Being able to more deeply introspect file data and metadata as if the file were a folder could potentially be implemented as a plugin, and I worry about the complexity of requiring a file system to know about the contents of the files that it hosts, or that the files themselves be required to know about how to tell the file system about themselves. Existing file systems adhere to a fairly strict separation of concerns, since who knows what new file format or language will appear, and who knows what file system the file will need to exist on.
Said another way I think that the primary issue with the suggested approach is that it is hard to extend. The file system itself needs to know about the new type of object that it is going to represent, rather than simply acting as an index of paths to all objects. If there is a type of object that is opaque to the current version of the file system that object either has to implement a file-system-specific discovery protocol (which surely would have fun security considerations if it were anything other than a static manifest) or the user has to wait for a new version of the file system that knows what to do with that file type.
Some thoughts from my own work. (partially in the context of OJFord's comment below)
Treating files and urls as objects that have identifiers, metadata, and data portions and where the data portion is treated as a generator is very powerful, but the affordances around the expression local_file.data = remote_file.data make me hesitate. When assignment can trigger a network saturating read operation, or when setter doesn't know anything about how much space is on a disk, etc. then there are significant footguns waiting to be fired and I have already shot myself a couple of times.
The more homogeneous the object interface the better. However, this comes with a major risk. If the underlying systems you are wrapping have different operational semantics (think file system vs database transactions) and there is no way to distinguish between them based solely on the interface (because it is homogeneous) then disaster will strike at some point due to a mismatch. To avoid this everything built on top of the object representation has to be implemented under the assumption of the worst case possible behavior, making it difficult to leverage the features of more advanced systems. As with the affordances around local.data = remote.data, if I have a socket, a local file, a remote web page that I own, a handle to an LED, a handle to a stop light, a database row in a table that has triggers set, the stdin to an open ssh session, and a network ring buffer all represented in the same object system, I have as many meanings for file_object.write('something') as I have types of objects, and the consequences and side effects of calling write are so diverse (from flipping bits on a hard drive to triggering arbitrary code execution) that it is all but guaranteed that something will go horribly wrong. At the very least there would need to be a distinction between operations where all side effects could be accounted for beforehand (e.g. writing a file of known length to disk has the side effect of reducing free disk space, but that is known before the operation starts), and operations where the consequences will depend on the contents of the message (e.g. DROP TABLES), with perhaps a middle ground for cases with static side effects (e.g. the database trigger) that would not be immediately visible to the caller and that might change from time to time.
The distinction between files and folders is quite annoying (non-homogeneous), especially if you want to require that certain pieces of metadata always 'follow' a file. This is from working with xattrs, which are extremely easy to lose if you aren't careful. Xattrs are a great engineering optimization to make use of dead space in the file system data structure, but they aren't quite the full abstraction one would want. It is also not entirely clear what patterns to use when you have a file that is also a folder -- do you make the metadata the outer file and the data the inner file? Or the other way around? Having the metadata as the outer file means that you can change the metadata without changing the data, but that the metadata will always 'update' when its contents (the data) changes. However, when I first thought about using such a system, I had it the other way around, and a system with that much flexibility I suspect would have even more footguns than the current system.
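One way to picture the three-way distinction described above is to make each handle declare its side-effect class and have the shared write call refuse anything the caller has not explicitly accepted. This is only a sketch under that assumption; the Effects enum, Handle class, and accept parameter are invented for illustration and do not come from any existing system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Effects(Enum):
    KNOWN_UPFRONT = 1      # e.g. a file of known length reduces free disk space
    STATIC_HIDDEN = 2      # e.g. a database trigger the caller cannot see
    CONTENT_DEPENDENT = 3  # e.g. a channel where the payload itself is a command

@dataclass
class Handle:
    name: str
    effects: Effects
    sink: Callable[[bytes], None]  # whatever actually performs the write

class UnacknowledgedEffects(Exception):
    pass

def write(handle: Handle, data: bytes, accept: Effects = Effects.KNOWN_UPFRONT) -> None:
    """Homogeneous write, but the caller must opt in to riskier effect classes."""
    if handle.effects.value > accept.value:
        raise UnacknowledgedEffects(
            f"{handle.name} has {handle.effects.name} side effects; "
            f"caller only accepted {accept.name}"
        )
    handle.sink(data)
```

A caller that wants to send bytes to a content-dependent handle has to pass accept=Effects.CONTENT_DEPENDENT, so the interface stays uniform while the difference in consequences is visible at the call site.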
Another issue is the long standing question around what constitutes an atomic operation. Everything is simple if only a single well behaved program is ever going to touch the files, but trying to build a full object-like system on top of existing systems is a recipe for leaky abstraction nightmares.
While I was working on this I came across debates from before I was born. For example hardlinks vs symlinks. There are real practical engineering tradeoffs that I can't even begin to comment on because I don't understand the use cases for hardlinks well enough to say why we didn't just get rid of them entirely.
0. https://github.com/SciCrunch/sparc-curation/blob/master/spar...
One generation throws some shit at the wall, some of it sticks. Time passes and a few elders talk up their achievements in grandiose terms. With time, people begin to forget the truth and view the artifacts from the past as products of pure enlightenment. Shit-throwers are retconned into being master architects. The just-world fallacy kicks in and people mistake 'passing the test of time' for proof of quality, then they spin legends to fill that narrative.
"PipesFS replaces socket-specific data access calls (like send and recvmsg) with basic reading from and writing to pipes, whereby the location of the pipe identifies the socket. The actual path for sockets is long, consisting of 8 filters for the reception end, but users are easily shielded from this complexity through symbolic links"
[1] https://ts.data61.csiro.au/publications/papers/deBruijn_Bos_...
[edited for clarity]
The file metaphor is soooo flexible, so it’s hard for me to think of examples where it breaks down. So, what are some good examples where the file metaphor breaks down? Maybe that’s helpful?
Coffee on the command line sounds interesting.
This could also make GUIs far easier to spin up. An operation on the computer could easily spin up a 'new' GUI that depends on system / operation state using GUI objects available to the entire operating system.
Theoretically you can just make up your own verbs for HTTP and use those. In practice people stick to the common ones because they're well supported. This leads to people massaging a problem domain into the straightjacket of GET/PUT/POST/PATCH/DELETE, regardless of how well it fundamentally fits that set of verbs. (I'm also convinced nobody actually knows what "REST" means, but that's another rant for another time.)
The other thing a common set of verbs gets you is generic endpoints and, even more interesting, generic intermediaries.
My approach is to let resource-y things be resource-y, and let verb-y things be verb-y. After all, language has nouns and adjectives and verbs, maybe there is a good reason for this diversity?
So
var:myhome/doorbell ring.
(Although I am highly skeptical of IoT, and so somewhat wary of such an example.)
If you wanted to model this in a more resource-y way, you could do:
var:/myhome/doorbell/ringing := true.
// delay
var:/myhome/doorbell/ringing := false.
That would also get you the ability to read the status of the doorbell.
> I'm also convinced nobody actually knows what "REST" means
Considering REST is the basis of the WWW, the largest and arguably most successful information system of all time, I would say (a) most people understand it "well enough" to work with it and (b) if we don't understand it, it behooves us to make an effort to do so.
Because it's not like there haven't been other attempts to build something like the WWW, they just failed miserably.
This is why I suggest we really need the ability to dynamically add new verbs. POSIX has one write() but in terms of semantics it's really a whole family of verbs as one overloaded method.
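A small sketch of what dynamically added verbs might look like, assuming a registry attached to each object rather than a fixed set of methods. Resource, add_verb, and the verb names append_bytes and send_message are hypothetical, picked only to show two different verbs where POSIX offers a single write().

```python
from typing import Any, Callable, Dict, List

class UnsupportedVerb(Exception):
    pass

class Resource:
    """An object whose set of verbs can be extended and inspected at runtime."""
    def __init__(self, kind: str) -> None:
        self.kind = kind
        self._verbs: Dict[str, Callable[..., Any]] = {}

    def add_verb(self, name: str, fn: Callable[..., Any]) -> None:
        self._verbs[name] = fn

    def verbs(self) -> List[str]:
        return sorted(self._verbs)

    def invoke(self, name: str, *args: Any) -> Any:
        if name not in self._verbs:
            raise UnsupportedVerb(f"{self.kind} does not support {name!r}")
        return self._verbs[name](*args)

# Two things that would both be "write()" under POSIX get distinct verbs.
file_contents = bytearray()
sent_messages: List[bytes] = []

def append_bytes(data: bytes) -> int:
    file_contents.extend(data)     # byte stream: boundaries are not preserved
    return len(data)

def send_message(data: bytes) -> int:
    sent_messages.append(data)     # datagram: each call is one whole message
    return len(data)

regular_file = Resource("regular file")
regular_file.add_verb("append_bytes", append_bytes)

datagram_socket = Resource("datagram socket")
datagram_socket.add_verb("send_message", send_message)
```

Calling datagram_socket.invoke("append_bytes", ...) fails loudly with UnsupportedVerb instead of silently doing something with different semantics, and verbs() gives callers a way to discover what an object supports.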
>The file system itself needs to know about the new type of object that it is going to represent, rather than simply acting as an index of paths to all objects.
What I had in mind was that a given filesystem driver (e.g. a userspace FUSE process) would provide the object types it supports. So for example, a "printer FS" process, printerfsd, would provide printer objects under e.g. /printer/. But the VFS, the layer that does prefix matching on mount()ed filesystems, wouldn't need to know about new object types, as it's just a dispatcher.
One shortcoming of this is that you can't mv /printer/foo to another filesystem. That's also a shortcoming of e.g. today's /proc or /sys, but there still seems to be enough that's worthwhile about this approach.
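The dispatcher idea can be sketched in a few lines: the VFS only does longest-prefix matching on mount points and forwards the verb, while the driver alone knows what a printer is. This is a toy in-process model, not FUSE; Vfs, PrinterFs, and the print_page verb are names assumed for the example.

```python
from typing import Any, Dict, List, Tuple

class Vfs:
    """Knows mount prefixes only; has no idea what object types drivers expose."""
    def __init__(self) -> None:
        self._mounts: Dict[str, Any] = {}

    def mount(self, prefix: str, driver: Any) -> None:
        self._mounts[prefix.rstrip("/") or "/"] = driver

    def resolve(self, path: str) -> Tuple[Any, str]:
        best = None
        for prefix in self._mounts:
            # Match whole path components, so /printer does not claim /printerx.
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise FileNotFoundError(path)
        return self._mounts[best], path[len(best):].lstrip("/")

    def invoke(self, path: str, verb: str, *args: Any) -> Any:
        driver, rest = self.resolve(path)
        return driver.invoke(rest, verb, *args)  # pure dispatch

class PrinterFs:
    """Hypothetical driver that provides printer objects and their verbs."""
    def __init__(self, names: List[str]) -> None:
        self.jobs: Dict[str, List[bytes]] = {name: [] for name in names}

    def invoke(self, name: str, verb: str, *args: Any) -> Any:
        if name not in self.jobs:
            raise FileNotFoundError(name)
        if verb == "print_page":
            self.jobs[name].append(args[0])
            return len(self.jobs[name])       # number of queued pages
        raise NotImplementedError(verb)
```

Adding a new object type means writing a new driver and mounting it; the Vfs class does not change, which is the extensibility argument made above.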
Personally I'd rather just stick to the existing analogue of verbs, which we call executables.
I think open/close() are probably the minimal interface.
Of course seek(n, SEEK_SET) could be implemented anyway, in a very un-performant (and tape-wearing) manner: by rewinding, and then reading forward n bytes. There's a question of whether the utility of this is desirable when weighed against how surprising it may be to people who don't realise just how bad the performance will be, especially when a tape drive which only supports seek(0, SEEK_SET) can easily have this behaviour emulated on top in userspace by seek(0, SEEK_SET) followed by dummy reads, if you really want it.
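The userspace emulation described here is short enough to sketch. RewindOnlyTape below is a hypothetical in-memory stand-in for a device that supports only rewind and forward reads; emulated_seek is the "rewind, then dummy reads" trick, and its cost grows linearly with the target offset.

```python
class RewindOnlyTape:
    """Stand-in for a drive that supports only seek(0, SEEK_SET) and read()."""
    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    def rewind(self) -> None:
        self._pos = 0

    def read(self, n: int) -> bytes:
        chunk = self._data[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk

def emulated_seek(tape: RewindOnlyTape, offset: int, chunk_size: int = 4096) -> int:
    """Emulate seek(offset, SEEK_SET) by rewinding and discarding `offset` bytes.

    Returns the position actually reached, which is smaller than `offset`
    if the tape ends first. Every call re-reads from the start.
    """
    tape.rewind()
    remaining = offset
    while remaining > 0:
        chunk = tape.read(min(chunk_size, remaining))
        if not chunk:          # hit end of tape before reaching the offset
            break
        remaining -= len(chunk)
    return offset - remaining
```

Keeping this as an explicit helper, rather than hiding it behind seek(), makes the linear cost visible to whoever calls it.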
read() and write() and seek() prove remarkably versatile, but the niggles come with the fact that different types of file/device on POSIX can have subtly different gotchas with these verbs which, on the face of it, appear to be the same verb. Essentially, I might argue they're not the same verb at all - they just seem similar.
For example, read() from a UDP socket and read() from a normal file have extremely different semantics. If you read() with a 64 byte buffer from a UDP socket, the message is truncated and the remainder of it is lost. This is a very different semantic to reading from a file, where you can read in whatever chunks you like.
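The difference is easy to reproduce. The sketch below sends one 100-byte datagram over loopback and tries to read it back 64 bytes at a time, then does the same with a temporary file. It assumes Linux-style behaviour, where the excess bytes of a datagram are silently discarded (other platforms may report an error instead); the function names and the 0.5 second timeout are arbitrary choices for the example.

```python
import socket
import tempfile
from typing import List

def read_udp_in_chunks(payload: bytes, chunk: int) -> List[bytes]:
    """Send one datagram over loopback and read it back `chunk` bytes at a time."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(0.5)          # so the second recv() gives up
        sender.sendto(payload, receiver.getsockname())
        chunks: List[bytes] = []
        while True:
            try:
                chunks.append(receiver.recv(chunk))
            except socket.timeout:
                return chunks             # nothing left: the tail was dropped
    finally:
        sender.close()
        receiver.close()

def read_file_in_chunks(payload: bytes, chunk: int) -> List[bytes]:
    """Write the payload to a temporary file and read it back in chunks."""
    with tempfile.TemporaryFile() as f:
        f.write(payload)
        f.seek(0)
        chunks: List[bytes] = []
        while True:
            piece = f.read(chunk)
            if not piece:
                return chunks
            chunks.append(piece)
```

On Linux, the file read returns two chunks that join back into the original payload, while the socket read returns a single 64-byte chunk and the remaining 36 bytes are gone.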
I wrote the article upon reflection of precisely this attempt to force everything into the straightjacket of everything-is-a-file that we've had for decades with UNIX. How much code correctly deals with short write()s? "Everything can be expressed as an object on which you can perform read()/write()" can only be true if you ignore the details of a verb's precise semantics, but the precise semantics are important. I think it's fair to argue that write() isn't one verb at this point, but an overloaded verb referring to a set of verbs. Which verb in that set you're invoking is dependent on the type of "file".
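For reference, dealing with short writes correctly means looping until everything has been accepted. A minimal sketch in Python, where os.write returns the number of bytes actually written; the write parameter is an addition for this example so the short-write path can be exercised with a fake writer.

```python
import os
from typing import Callable

def write_all(fd: int, data: bytes, write: Callable[[int, memoryview], int] = os.write) -> int:
    """Keep calling write() until all of `data` has been accepted.

    A single write() may accept fewer bytes than requested (for example on
    pipes or sockets), so the unwritten tail has to be retried.
    """
    view = memoryview(data)   # slicing a memoryview avoids copying the tail
    total = 0
    while total < len(view):
        written = write(fd, view[total:])
        if written == 0:
            raise OSError("write() accepted no bytes")
        total += written
    return total
```

Code that calls write() once and ignores the return value works on regular files most of the time, which is exactly why the bug survives until the same code is pointed at a pipe or a socket.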
For example, a GPU device doesn't behave like a file. You cannot effectively control a GPU via read/write. read/write are excessively slow for anything you'd want to do, including a simple VGA buffer. Almost all operations on a GPU in Linux involve mmap'ing it and then applying ioctl() liberally.
You can do almost everything using the file metaphor, Plan 9 proves it. But it's at times going to be a very poor metaphor that is better at working at all than working well.
There's no RING HTTP method, and I could invent one, but heaven knows if various HTTP middleware would be happy with that. In practice, people do something like
POST http://example.com/doorbell/ring
The problem with this is that you now have a hierarchy of verbs; you have first class verbs (GET, PUT, POST, PATCH, DELETE), and second class verbs which have to be represented as distinct resources. This feels like a hack to me.
But isn't this basically what RPC vs REST boils down to?
As far as I know people tried the RPC way for years then gave up on it and started doing REST. Seemingly because inventing a whole bunch of methods was inherently flawed.
What the use of schemes does is make things needlessly inflexible, and embeds a dependency on the name of a filesystem provider inside consumers of that filesystem. It's akin to a Unix where filesystems can only be mounted in top-level directories /mnt, but not /mnt/foo, etc.; I don't see the appeal.
Why?
Not so. See file: :-)
In any case, the article has 17 occurrences for "file(s)" versus 28 for "object(s)", so the author seems to agree with me :)
Perhaps a file never meant just something you have in a directory, rather "a stream of bytes", and the bigger thing was always unifying the difference whether those bytes are read from a magnetic tape or received wirelessly over the internet.
mount -t type device destination_dir
Unless I am missing something in your use case.
I wouldn't go that far :-). But you are right. They did good work with the cmdlets but they put a terrible language on top of them.
It would have been much better if they had put an interpreted version of C# (maybe with a few extensions) on top of it.
Edit: The side benefit of the verbosity is the discoverability of less used commands. Is it groupadd or addgroup? No question in Powershell, it would be Add-Group because of the Verb-Noun standard. Bash has all sorts of inconsistencies that require look-up if you don't use those commands often.
Oh hell.
F# will even compile
let add a b = a + b
add 1 2;;
as
val add : a:int -> b:int -> int
val it : int = 3
or
let add a b = a + b
add 1ul 2ul;;
as
val add : a:uint32 -> b:uint32 -> uint32
val it : uint32 = 3u
So it will even infer the type from the first usage of the function.
get-command add*group*
Or for brevity:
gcm add*group*
For reference, the Bash version is this:
compgen -c | grep group
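Edit: Let's get crazier. You want custom tab completion to focus on the command's noun plus Bash style completion.
# Keep a reference to the existing handler under a new name
$Function:OriginalTabExpansion = $Function:TabExpansion
function TabExpansion($line, $lastWord) {
if ($line -match '^!(.*)') {
# A leading ! means: complete on the noun part of the command name
$lastWord = $lastWord.TrimStart('!')
Get-Command -Noun *$lastWord*
} else {
# Anything else falls through to the original handler
OriginalTabExpansion $line $lastWord
}
}
# Bash-style menu completion on Tab
Set-PSReadLineKeyHandler -Chord Tab -Function MenuComplete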
!group<TAB>