Can file systems / databases benefit from purely functional datastructures? --------------------------------------------------------------------------- by Andrew Clausen [ Me and Joe Thornber discussed this idea a bit... we were both inspired by haskell (in particular, Chris Okasaki's book "Purely Functional Data Structures") and Daniel Phillip's idea for tux2. Andrew Bromage has also been quite helpful for bouncing ideas off :) ] Idea: applying purely functional data structures to make a safe, low-write latency file system. Hopefully, it would be much simpler (and much faster) than journalling. Basically, WAFL (of netapp fame) / tux2 work by never doing destructive update. This way, checkpoints (snapshots) and therefore rollback are trivial to implement. Until recently, it was believed that only a good amortized time was possible without destructive update. (Actually, the story is more complicated than that...) Both WAFL and tux2 therefore have high-commit latencies (which WAFL implementations typically alieviate via non-volatile RAM). There has been much research into "persistent" data structures (Chris Okasaki, "Purely Functional Data Structures", 1998, etc.). Random-access data structures with O(log n) worst-case update time, and O(1) worst-case operations on one end (i.e. append) are possible via a similar system to ext2's inode layout of storing direct, indirect, doubly indirect, etc. blocks in the inode. If lazy evaluation [delaying computation] and memoization [remembering past computation] are added, then queues and heaps can be implemented, and even combined with this ext2 inode-like data structure. I'm not sure if this is a useful extension. To apply the full version (i.e. with lazy evaluation and memoization) of this this to file systems, there's a few problems to solve: * garbage collection (both tux2 and wafl already offer efficient solutions - I think wafl's is applicable) * deciding which data structures to use. I think this is the hardest problem. Okasaki describes a data structure with O(log n) random update [where n is the number of items stored], and O(1) modification to the start and end. Mapping this onto efficient file system operations seems non-trivial. Probably want some kind of working set theorem, like that of splay trees. Actually: the file system interface tells us what the working set is (via open(2))... so we could have a separate tree/list/whatever for files open for writing. * lazy evaluation: need to define a small set of operators... we don't want "live code" in the file system. * memoization: trivial... just replace "operators" with the data they describe how to construct after computation. The potential benefits of all of this: * I personally think journalling is a really complicated beast. This might be simpler [always a good thing: might be easier to "optimize" various aspects, etc.] * journalling seems to have a performance hit, because you're often writing data twice (to journal, then for-real), seeking lots, etc. Extra disks or NVRAM basically solve these problems, but perhaps this would be cheaper. NOTE: I'm having second thoughts about all of this. Since the "persistence" (i.e. snapshots) feature of file systems is rarely used, it probably isn't worth optimizing for. That was most of the motivation for this whole lazy-evaluation / memoization data structure idea. In any case, I still think file systems could benefit from the work on purely functional data structures. NOTE2: that data structure Okasaki described (above) are "skew binary random access lists". They don't actually need lazy-eval/memoization! (Unless you want to build a lazy real-time variant of Hood-Melville queues on top, which allow you to remove blocks from the start) I think I can get away with using these, and getting a fairly efficient file system, with these operations: (where w is the number of open-for-write files and n is the size of a file) * O(log w) file creation time where m is the number of open-for-write files * O(log w + log n) random reads/writes to/from files * O(log w) appends to files. In practise: 5 writes worst case: (for a single block write) - the metaroot (1 write) - the inode, containing the list of trees (1 write) - joining two trees into another tree (2 writes worst case) - the data (1 write) It is probably possible to merge these writes, by clever block allocation, etc. Maybe down to 2 writes. This doesn't include block allocation accounting / garbage collection (TODO). This compares to journalling: - the journal (1 sequential write of several blocks) - the inode, to update size (1 write) - recording the new block (up to log n writes - b+ tree balancing games. 1 write usually) - the data (1 write) Perhaps skew binary random access lists are good for journaling file systems too? * O(1) (amortized?) sequential reads from files (TODO: remove O notation, and talk in absolute reads/writes) Skew binary random access lists are really nice data structures. They're actually quite similar to ext2's inode layout, with direct, indirect, double-indirect and triple-indirect blocks being stored in the inode. With skew-lists, you do the same, but store in reverse order, so you always have the end of the file in the inode itself. This makes appends really fast (constant time worst case :) They also should map onto blocks of 2^k very well. They use a set of trees of size 2^k - 1, which means you can chop them up into blocks of size 2^k. The extra 1 can be used to store the size of the tree, (i.e. k) and a few extra bits for other purposes, such as garbage collection. (Actually, probably not so useful, since you only get this extra space at the root of each tree) I'm currently skeptical this will ever be faster than journalling: you've always got to update the metaroot, which is equivalent (performance-wise) to seeking to the journal. OTOH, this supports snapshots at zero cost, and journalling doesn't :)