
Thursday, July 14, 2005


Linux Desktop Distribution of the Future Follow-up Part 2

Category: Commentary

Part 1: Clearing Up Misconceptions
Part 2: Refining the Ideas

Some argue that it's easy to make suggestions when you aren't the one figuring out how these suggestions will work. The problem with this argument is that without communicating these raw ideas, the implementation can never be found. "Communication completes thought" is a wise saying that a former boss of mine drilled into my head. By communicating our ideas we often think through the implications and decide exactly how they might be implemented.

As such, this week's article will attempt to present some of the refinements and more detailed explanations that were produced through conversations over the original series. We'll start by going over the tools of the trade.


FUSE (Filesystem in USErspace)

FUSE is probably one of the coolest little inventions ever. By implementing a simple interface, a developer can quickly and easily build a custom file system that runs entirely in userland. Effectively, this makes Linux act like a micro-kernel, with only the important I/O running at the kernel/Ring 0 level.

The advantage to using FUSE (other than to speed the development of a new file system) is that all the libraries that are available to user programs are available to the file system. If you've never worked with Linux Kernel Modules before, allow me to assure you that this is a huge step. Software that runs inside the kernel can only access APIs running inside the kernel. This can limit a developer and make programming quite tricky.

The downside to using FUSE is the same downside that exists in nearly all micro-kernel designs: it's going to be a smidge slower than a kernel-level driver. How much slower depends upon the application, but generally not enough to be noticeable on modern hardware.

FUSE's advantages have led to its use in SSHFS, GMailFS, and several other interesting file systems.

Berkeley DB (BDB)

Berkeley DB is a very powerful embedded database library. Instead of taking the design to the level of an SQL server, BDB provides only data storage and file indexing. Concepts like a high level query engine and robust network support are left to the developer. This makes it a perfect base for a custom database. An example of this is the Berkeley DB table type in MySQL, which is built on the BDB database software.


AppDisks and Loopback Devices

One of the key concerns about the AppDisk scheme I proposed was the issue of limited loopback devices. A standard kernel is limited to a mere 8 loopback devices, which could cause a few problems if the user wants to run more than a few applications. While there actually exists a patch to allow for more than 256 loopback devices (!), it's important to understand why a loopback device is needed in the first place.

You see, there's nothing in the mount system that prevents an existing file from being mounted into the file system. As far as the mount call is concerned, a file in /dev is just the same as a file located in your home directory. The reason for the loopback device has to do with the file system drivers: most file systems will only mount a block device, so the kernel rejects attempts to mount a regular file directly as a new file system. The loopback device solves this by creating a "fake" device that makes a file look like a block device.

The loopback device can be eliminated entirely if a file system is used that doesn't require a block device. For example, a "simple" FS could be created for most applications that don't require write support. This FS could be as simple as a TAR or ZIP file, although access times would be quite slow with both of those options. A file system driver could then be written for that "simple" FS, allowing for a virtually unlimited number of mounts.

Of course, actually performing unlimited mounts might be a good way to bump into other kernel limitations that aren't quite as obvious as the number of loopback devices. A better solution is as follows:

Create a file system using FUSE that supports the desired AppDisk file system types, and mount it to the proper point in the system. (I believe I suggested /apps.) When you want to mount a new AppDisk, you simply call the API for the FUSE file system, and the contents of the AppDisk will magically appear inside a sub-directory of /apps.

This scheme is really just a VFS inside a VFS, but it has the advantage of limiting the resources allocated by the kernel. Instead, all the resources would be allocated in userland where they can be easily tracked, protected, and paged.

Implementing the DBFS

Of all the concepts presented in the original article, the DBFS is probably the least developed in the OSS community. Because of this, I have spent considerable time attempting to derive quick methods of implementing this system. The three schemes I've come up with are as follows:
  • Direct Implementation
  • MySQL & FUSE
  • Berkeley DB & FUSE

Direct Implementation

Directly implementing a custom disk layout is the most desirable method of achieving a DBFS. It would have the advantage of being supported by a true kernel module, it would be easy to tailor to emerging needs, could use block devices directly, and would be much easier to tune for performance. It would also be possible to make such a DBFS a bootable drive, since the developer has direct control over the file system's structure.

The disadvantage is that it would take quite a bit of time to implement. Not only would the data layout structure have to be developed, but a record layout structure and indexing scheme would also have to be worked out. These concepts are well known, but would take time to complete.


MySQL & FUSE

I realize that I just debunked the concept of using an SQL server for the DBFS, but I would be remiss in my duties if I didn't at least mention the possibility.

The absolute fastest method of getting the DBFS online is to use a loopback file system (i.e. a new view on an existing file structure) to attach meta-data to existing files. The concept is quite simple:

  1. Setup a MySQL database to store the meta-data.
  2. Create a FUSE file system that translates VFS calls to the appropriate file on an existing file system. (See the fusexmp example in the FUSE source code for an example of such a file system.)
  3. Present the user and programs with a DBFS view on the data.
  4. Profit!

It really is that simple. Unfortunately, there are quite a few drawbacks to this scheme. For one, the database server is vulnerable to crashing, corruption, rogue queries, security flaws, and a host of other issues. For another, the stability of the DBFS is vulnerable to the mapped file system being changed independently of the DBFS view. Finally, this scheme would be extremely wasteful of system resources and could potentially interfere with other MySQL installations.

Berkeley DB & FUSE

Probably the best compromise between the two concepts presented above is the use of a high performance embedded database for storing both the meta-data and the file's binary data. Such a database could handle all of the disk layout and logging operations, as well as the indexing and querying.

This is where Berkeley DB comes in. Berkeley DB is a simple yet powerful database designed to store strings of arbitrary binary data. It does this by allowing one table of keys and values in each database. A key is a binary string and a value is a binary string. That's it. There's no concept of columns, relations, data types, referential integrity, or any other fancy database concepts. If a structure is desired, it is expected that the program will impose it upon the value portion of the record. As a result, the value portion of a BDB record is often a C Struct.
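To make that concrete, here's a sketch (in plain C, with no actual BDB calls) of how a record might be packed into and out of an opaque value string: a fixed-size header followed by a variable-length name. The type and field names are just illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A BDB value is just an opaque byte string, so a record is whatever
 * bytes we pack into it: here, a fixed-size header followed by a
 * variable-length, NUL-terminated name. Field names are illustrative. */
typedef struct {
    int64_t created;  /* date */
    int64_t modified; /* date */
    int64_t accessed; /* date */
} file_header;

/* Pack header + name into buf; returns the total value length,
 * or 0 if the buffer is too small. */
size_t pack_file_record(const file_header *h, const char *name,
                        unsigned char *buf, size_t buflen)
{
    size_t need = sizeof(file_header) + strlen(name) + 1;
    if (need > buflen)
        return 0;
    memcpy(buf, h, sizeof(file_header));
    memcpy(buf + sizeof(file_header), name, strlen(name) + 1);
    return need;
}

/* Unpack: copies the header out and returns a pointer into buf
 * where the name bytes live. */
const char *unpack_file_record(const unsigned char *buf, file_header *h)
{
    memcpy(h, buf, sizeof(file_header));
    return (const char *)(buf + sizeof(file_header));
}
```

The same pack/unpack pattern applies to any record layout you might dream up for the value portion.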

Other features of interest are BDB's ability to store multiple databases in a single file, support for indexing on parts of a record's values, automatic memory mapping, support for sequences, transactional safety, support for large records (up to 4 gigabytes in most systems) and the ability to retrieve/store only parts of a record value. The latter two points are especially useful, as they allow for a Berkeley database to be used to store all the data for a file. In effect, BDB can become the file system!

The first feature is also of interest, because it means that BDB can be used directly on a block device if so desired. While it may be tempting to jump directly to using a block device, it is important to consider that placing the database file on the root file system allows disk space to be allocated to either the root or the DBFS as needed. If you make the DBFS a separate partition, then you become bound by the partition sizes you chose.

What Berkeley DB does not provide is a fancy network server, user management, a query language, or any other cruft that might get in the way of a DBFS implementation.

Implementing a DBFS under Berkeley DB

You didn't think I'd leave you with only a quick description of the concept, did you? Believe it or not, I've been hard at work attempting to work out the details of creating a DBFS in BDB. Precisely three databases would be required for basic file system support. (Keep in mind that a database in BDB is analogous to an SQL Table.)

The following are what the value structures might look like. Note that all keys are expected to be long integers.

struct {
    long created;   //Date
    long modified;  //Date
    long accessed;  //Date
    char* name;     //Variable length
} files; //The key is a sequential integer

struct {
    long modified;  //Date
    char type;      //0 for string, 1 for data, the rest reserved
    char* name;     //Variable length; indexed!
    char* value;    //file_data_id if type == 1, else variable string
} meta_data; //The key is a sequential integer

struct {
    char* data; //Whatever the heck we want to put in here.
} file_data; //The key is a sequential integer

In looking at the structure above, you may begin to wonder where exactly the file data would be stored. The answer is: in the file_data database! Despite the high importance placed on the data inside a file, the data is really nothing more than another attribute or piece of meta-data. Files would have a standardized meta_data record called ".data" or something similar.

Adding labels (which can then be presented as directories for backward compatibility) can be done with the following database:
struct {
    char* name;
    long file_ids[]; //A list of files attached to this label
} label; //The key is a sequential integer

Building a Query Engine

Obviously one of the biggest reasons for creating a DBFS is to allow for advanced queries. To some degree, the structure presented already improves searching. If you were looking for a file created within the last 30 days, the DBFS could do a table scan to find files with the requested date.

Assuming that the "files" database held 100,000 records, that its key was a "long" integer, and that the average file name was 50 bytes, the DBFS would have to search through roughly 7.8 megabytes of data to find all the files created within the last 30 days. Since the data is stored in a table instead of scattered across inodes on disk, a burst read from the hard drive could read through the database in a few seconds or less! And that is assuming that none of the data is already memory-mapped and cached in RAM (which is a very likely scenario).

Unfortunately, a database scan does nothing to help the issue of searching file contents. To fix this, we need an index. Thankfully, it is quite easy to create one. All we need is a database where the key is a string, and the value is an array of file ids. e.g.:

struct {
    long file_ids[];
} word_index; //Key is a string such as "hello\0"

Generally it makes the most sense to store only the lower-case variant of each word, to save space and eliminate issues with case sensitivity. This still presents a few issues with internationalization (the rules for lower-casing a word differ from language to language), but that's not an issue I'm quite ready to tackle.
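As a sketch of that normalization step, the indexer might run every candidate word through something like this before touching the index (assuming plain ASCII words for now; the internationalization caveat above still stands):

```c
#include <assert.h>
#include <ctype.h>

/* Normalize a word before it goes into the index: lower-case it in
 * place, and reject anything containing non-alphanumeric characters.
 * (This sketch assumes ASCII-only words.)
 * Returns 1 if the word is indexable, 0 otherwise. */
int normalize_index_word(char *w)
{
    for (char *p = w; *p; p++) {
        if (!isalnum((unsigned char)*p))
            return 0;
        *p = (char)tolower((unsigned char)*p);
    }
    return 1;
}
```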

The end result of this is that a query engine built on top of this index could quickly find a list of files for each keyword. By creating a combined list (or), subtracting one list from another (not), or keeping only the matches common to both lists (and), boolean searches can be achieved.
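Since each index entry is just a list of file ids, those boolean operations fall out of simple merge-style routines over sorted lists. A rough sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Merge-style set operations over sorted, duplicate-free lists of file
 * ids, the way a query engine might combine per-keyword id lists.
 * Each routine writes its result to out and returns the result count. */

/* and: keep only ids present in both lists */
size_t ids_and(const long *a, size_t na, const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else { out[k++] = a[i]; i++; j++; }
    }
    return k;
}

/* or: combine both lists, without duplicates */
size_t ids_or(const long *a, size_t na, const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na || j < nb) {
        if (j >= nb || (i < na && a[i] < b[j]))
            out[k++] = a[i++];
        else if (i >= na || b[j] < a[i])
            out[k++] = b[j++];
        else { out[k++] = a[i++]; j++; }
    }
    return k;
}

/* not: ids in the first list that do not appear in the second */
size_t ids_not(const long *a, size_t na, const long *b, size_t nb, long *out)
{
    size_t i = 0, j = 0, k = 0;
    for (; i < na; i++) {
        while (j < nb && b[j] < a[i]) j++;
        if (j >= nb || b[j] != a[i])
            out[k++] = a[i];
    }
    return k;
}
```

Because the lists come out of the index already sorted, each of these runs in a single linear pass.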

The Query API

After playing around with FUSE for a little while, I realized how a simple and backward-compatible query engine might be achieved. Consider a URL for a website. It usually consists of a protocol (e.g. http://), a host, a path (e.g. /myfile.cgi), and a query string (e.g. ?myfield=value). In the case of a file system, the protocol and host are both known, and the path is supplied by the user. However, no concept of a query string exists.

What if we were to virtually provide access to files beneath a given file path? For example, let's say we had a file named myemail.msg with the attributes "subject", "to", and "from". We could access the contents of each of these attributes by opening the following paths for read-only access:

myemail.msg#subject
myemail.msg#to
myemail.msg#from

Note that the # symbol is used to denote access to an attribute. Similarly, we could write the attributes by opening the file for writing. Furthermore, we could get a list of attribute names by opening a path such as myemail.msg# for read-only access.
These files would never actually appear in a file list because they are entirely virtual constructs. The file system driver would parse the path, find the file that terminates the path, then return a handle to a reader/writer for the attribute or list.
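The path parsing itself is trivial. A sketch of the split (the /apps/mail path is just a made-up example):

```c
#include <assert.h>
#include <string.h>

/* Split a virtual path like "/apps/mail/myemail.msg#subject" into the
 * real file path and the attribute name, per the '#' convention above.
 * (The /apps/mail path is a made-up example.) Returns 1 if an
 * attribute part was present; assumes the output buffers are large
 * enough to hold the input. */
int split_attr_path(const char *in, char *path, char *attr)
{
    const char *hash = strchr(in, '#');
    if (!hash) {
        strcpy(path, in);
        attr[0] = '\0';
        return 0;
    }
    size_t n = (size_t)(hash - in);
    memcpy(path, in, n);
    path[n] = '\0';
    strcpy(attr, hash + 1);
    return 1;
}
```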

This scheme can be further extended for general queries. For example, opening a path along the lines of myfolder#?batcave (the exact syntax is still up in the air) could find all files under that label with the word "batcave" in them.

Since special characters are not allowed in an indexed word, the standard GMail syntax of using [attribute name]:[keyword] can be used to search only specific attributes. Combine this with a feature for querying only specific labels, and you could craft a very advanced query, such as myfolder#?from:bob to look for email from Bob.

To obtain the information, the query engine would look up the label "myfolder", and find all files associated with it, as well as sub-labels and files. It could then search through the meta-data for each file looking for the attribute "from". When the "from" attribute is found, the engine would then attempt to perform a substring match against the value of the attribute. This process can be accelerated by adding the attribute values to the index database we discussed above.
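As a toy illustration of that last step, here's how the attribute scan with a substring match might look over an in-memory stand-in for the meta_data database:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A toy in-memory stand-in for the meta_data database:
 * (file id, attribute name, attribute value) rows. */
typedef struct {
    long file_id;
    const char *name;
    const char *value;
} meta_row;

/* Collect the ids of files whose given attribute contains needle as a
 * substring -- the "from:bob" style lookup described above. Returns
 * the number of matching ids written to out. */
size_t query_attr_substr(const meta_row *rows, size_t nrows,
                         const char *attr, const char *needle, long *out)
{
    size_t k = 0;
    for (size_t i = 0; i < nrows; i++) {
        if (strcmp(rows[i].name, attr) == 0 &&
            strstr(rows[i].value, needle) != NULL)
            out[k++] = rows[i].file_id;
    }
    return k;
}
```

A real engine would of course hit the index first and only fall back to a scan like this for unindexed attributes.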

Advancing the State of Applications

One of the seeming disadvantages to the AppDisk/DBFS scheme proposed in the original article was the idea that program settings would be lost when the application was deleted. While this is a valid concern, I don't think it is as big a concern as it might seem.

A DBFS not only advances the user's ability to interact with a file system; it should advance the abilities of applications as well. The primary reason why users don't want to lose program settings after removing an application is the data that application has stored. For example, my email program may have my entire email database in its settings. Or my web browser may be holding all of my web bookmarks. Why should these programs hold our data hostage? Isn't the Unix philosophy "everything is a file"?

The reason why programs hold our data hostage is that they need to store it in special database files for performance and usability reasons. For example, it would be far too slow to parse a directory full of emails just to show you a list of your inbox. Such an email program would easily be eclipsed by an email program utilizing an indexed database. Unless the former program had a DBFS backing it, that is!

The syntax still needs to be worked out, but imagine if you could open a query underneath a label that asked for a textual list of files with a given attribute and the value of that attribute! The email program could ask for the list of files with the MIME attribute of "text/email", and a list of the "to", "from", and "subject" attributes! The resulting file stream might look something like this:

myemail.msg, Bob, Me, How are things going\, Bob?

Note that the comma in the subject had to be escaped. Programs that read from these files will need to be aware of this issue, or perhaps would be allowed to switch the delimiter to a character of their choosing.
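A sketch of the escaping, with the delimiter left configurable per the idea of letting programs pick their own:

```c
#include <assert.h>
#include <stddef.h>

/* Escape one field of the delimited listing: backslash-escape the
 * delimiter (and the backslash itself) so a reader can split the line
 * unambiguously. Returns the escaped length; out is assumed large
 * enough (worst case, twice the input length plus one). */
size_t escape_field(const char *in, char *out, char delim)
{
    size_t k = 0;
    for (; *in; in++) {
        if (*in == delim || *in == '\\')
            out[k++] = '\\';
        out[k++] = *in;
    }
    out[k] = '\0';
    return k;
}
```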

As a bonus, this information can be processed by traditional command line tools! Nothing stops you from SSHing into your machine, requesting a list of emails, then using grep, sed, cut and other Unix tools to process the data and generate a report!

Long term it would be valuable if all programs adopted this scheme. Your Web Browser bookmarks would be more accessible than ever, your address book could be nothing more than a bunch of VCF files on disk, and your calendar could consist of a number of event files. Not to mention that programs would have an easier time sharing this information! I for one can't wait for the day when Mozilla, Opera, KHTML, and other web browsers share all my bookmarks.



