The Man Also Known As "I'm Batman!"

Hello Citizen! Would you like to be my sidekick? Would you like to ride with... Batman?

Friday, July 29, 2005

We've moved! Check out all these great articles and others at http://akaimbatman.intelligentblogger.com!

Free Books on the Internet

Category: Cool Stuff

Over the years, one of the things I've noticed is a constant decline in the quality of books. Where shelves were once filled with books on OS Design, Compiler Theory, 3D Graphics Algorithms, and Basic Data Structures, now they are filled with books on Learning Java in 24 Hours and the Idiot's Guide to the OS of your choice. While this decline is primarily due to the flood of poor programmers who entered the market during the boom, it has not abated during the bust. More books on 100 Things You Should Already Know about API XYZ continue to hit the shelves while the classic material gets more and more buried.

Thankfully, the Internet has taken the place of these once cherished hunks of dead tree. You may not be able to find a good book on Text Parsing in Barnes & Noble, but you can find all the same info plus someone to answer your questions on the Internet. Still, sometimes it's nice to have a well written book to read.

Well have no fear! Many of those books we once cherished are now available on the Internet, and I've compiled a list for you of some of the more interesting ones. Nearly everything below is available in PDF format. I actually skipped over some of the HTML-only books, because I very much like the idea of having a file that can travel with me. HTML is far too unwieldy. So download a book, curl up in front of the fire, and get a tan in the rays of your LCD screen!


Advanced Linux Programming - Pretty much everyone knows the standard C libraries, and the POSIX libraries are a requirement for any serious Unix developer. If you want a book that goes beyond such simple programming APIs, then this is the book for you. Processes, Threads, IPC, I/O, Assembly, it's all covered!

Bruce Perens' Programming Series - Bruce Perens is a very interesting (and probably quite busy) guy. So busy, in fact, that he's single-handedly written dozens of books on advanced programming topics! Ok, so maybe he didn't write them all. Don't let that deter you, though. This series of books covers everything from Java to Mozilla Platform Development, and everything in between!

Parsing Techniques - As a more junior developer, I noticed that nothing quite honed a developer's programming skills like trying to build a better text parser. This book attempts to break down many of the techniques that have developed over the years, and establish the theoretical foundations behind them. A must read for any serious developer!

Creating XPCOM Components - With WebApps developing into a more and more powerful solution for deploying applications, it's no wonder that interest in the Mozilla Platform is on the rise. Most people think of Mozilla as just a web browser, and take no note of the amazingly diverse set of APIs that are used under the hood. This book provides a wealth of info on the Mozilla Platform, and explains the XPCOM architecture in depth.

Practical File System Design with the Be File System - I'm only going to say this once: Download this book NOW. This book is the definitive guide to file system design, and was written by the guy who designed the BeFS Database File System. If you wonder why his musings on a dead Filesystem for a dead OS are of interest, just consider this: He works for Apple. Can you guess which components he developed?

Thinking in Postscript - As programmers, we like to think in pixels. Pixels are discrete, indivisible units that are easy to calculate and plan for. Yet we work with them day in and day out without ever realizing the limitations of such units. Nearly every programmer has one of those "Ah ha!" moments when he's first introduced to vector graphics. Make your "Ah ha!" moment happen today. Download this book!

Introduction to Machine Learning - Artificial Intelligence. Talk about an area that has had tons of money poured into it with very little practical return. Even the Spam filters in common usage today tend to use statistical models instead of AI! Yet the field is very interesting, and the constant march toward faster computers makes home research possible. So pull up a chair, and prepare to dissect your own brain!

A Practical Theory of Programming - Impress all your friends and neighbors! This book is crammed full of mathematical formulas that describe how programming works. It may not teach you how to write LISP, but it will at least give you the math to understand it! (As if anyone understands LISP. I kid, I kid!)

Information Theory, Inference, and Learning Algorithms - Information Theory has become a hot topic in the physics world of late. You see, it turns out that many of the mysteries of the Universe are tied to the ability or inability of particles and energy to carry information. For example, it is perfectly possible to violate the speed of light. Quantum Tunneling is a perfect example of this. Yet it is still impossible to transfer information faster than the speed of light! Is it a universal conspiracy? Who knows? What I do know is that having a good grounding in Information Theory is good for Computer Scientists, Physicists, and your average Developer alike.

Compilers and Compiler Generators - Ever written a compiler? No? Well, one day you may just find yourself in the position of doing exactly that. And when that time comes, you want to be ready. So read this book and understand the theories behind translating source code to machine code.

Linux Device Drivers, 2nd Edition - So you've got OS Design down pat. Plus you've got compiler design down. You've even taken AI and Information Theory! But what about those pesky hardware devices? How in the world do they get controlled? Well, pick up this book and find out!

Linux From Scratch - While abstract knowledge of OS Design is all well and good, sometimes you just need to dive into a working codebase to get a feel for things. While Minix is a great place to start, this book (and LiveCD!) will help you set up a custom Linux machine in no time flat! Move over Gentoo, I'm compiling my own system from scratch!


Now for Some Entertainment

If you managed to plow your way through all of the technical books above, then you deserve a break! Here are a few books and links to entertain your brain!


The UNIX-HATERS Handbook - Don't you just love how fast Unix boots? After all, things crash a-plenty, so it's a good thing it boots fast! Left over from a time now gone is the UNIX-HATERS Handbook. From the days when LISP machines stomped the Earth, and Macs were changing the face of computing (Wait, isn't that happening again?) comes a humorous look at all those little things that constantly go wrong in the OS we call Unix. You'll laugh because it's funny, but you may also learn a thing or two about what the world of computing once looked like.

Baen Free Library - Science Fiction and Fantasy. All free for the taking! Think I'm kidding? Click on the link and find out! My personal recommendations are: On Basilisk Station, 1632, and Mutineer's Moon. Odyssey is a great read as well. Enjoy!

Star Dragon - Star Dragon is a hard Sci-Fi novel from author Mike Brotherton. I can't tell you if it's any good, but I can tell you that it's free. Never hurts to scope out new authors!

Baen Book CDs - So you thought the Baen Free Library was cool, did you? Well so did Mr. Baen. He thought it was so cool that he'd try attaching a few CDROMs filled with even more books to his hardcover novels. The only catch? You can give the CDs to all your friends and neighbors. Wait, that's not a catch, that's a feature! Enjoy these hundreds of books offered at no charge, and with no DRM!

Audio Books for Free - Sometimes the pressures of life get in the way of reading. For those times, there are audio books. But who has a tape player anymore? This site is chock full of free MP3s that you can download to your player and listen to on the go! Sure, it's mostly classics. You do like classics, right?

Project Gutenberg - Speaking of classics, how can any discussion of free books fail to mention Project Gutenberg? With over 16,000 books transferred to electronic format, there's no better place to catch up on your H.G. Wells or Jules Verne!

 

Thursday, July 21, 2005


Linux Needs More Distros!

Category: Commentary


In a recent article, author Sal Cangeloso suggested that Linux needs to consolidate into fewer (perhaps even one!) distributions. His reasoning behind this recommendation is that the Open Source Community is unfocused in its goals and is wasting time on multiple distributions. He then cites the Mandrake and Conectiva merger as something he'd like to see more of.

While I agree with Mr. Cangeloso on the Mandriva merger being good for the Linux community, I disagree with his logic on why it is good.


Multiple Distros Are a Strength

In the mid to late 90's, the GCC project began to stagnate as tight controls were placed on code contributions. Many GCC developers believed that it needed to be reorganized around a more flexible design, and tried to get support from the FSF. The FSF unfortunately ignored these pleas, so a handful of developers used their GPL-granted rights to fork the project into a technology called EGCS (Experimental GNU Compiler System).

Thanks to optimized code generation, better cross-platform support, and the unification of the various language compilers, it wasn't long before EGCS began to displace the official GCC branch as the developer community's compiler of choice. The GCC developers eventually realized the power of EGCS and merged it back into the main branch.

The point I'm getting at here is that a successful project is going to tend to become risk averse. It's not that they don't want to move forward; it's more a matter that they don't want to fail and fall behind. This is a perfectly natural reaction. The loophole that exists to escape these sorts of politics is the ability to fork.

By forking, a developer can explore new paths without placing the primary project at risk. If the fork is successful, then it can be merged back into a more comprehensive project. If it fails, then far less was bet on its success.

As a result, the plethora of distributions is actually a strength of Linux (and the BSD community!), not a failing. If everything follows its natural course, distros will come and go, but a few main trunks will continue to benefit from the development. But why do we need multiple main trunks?


Multiple Product Lines are a Strength

If you've ever bought a car, you have probably realized that auto manufacturers sell cars under several different brands. GM, for example, sells inexpensive Chevy cars, high-performance Pontiacs, luxury Cadillacs, powerful GMC trucks, and experimental Oldsmobiles. While these lines each produce very different vehicles targeted at different markets, they are able to share chassis, parts, and research. This lowers the price while raising the quality of the vehicle.

A similar concept can be applied in software. For example, Apple produces a consumer version of Mac OS X and a server version of Mac OS X. The consumer version is tuned to multimedia, laptops, power management, and other features that end users expect, while the server is tuned to heavy networking, throughput, remote installation, and other features that admins have come to expect. The same is true of the Windows Operating System. (Although the differences between some versions were artificially added by Microsoft.)

In the Open Source world, even more flexibility is offered. Here's a quick rundown of distros, and where they have chosen to position themselves in the market. (No, this is not a comprehensive list. So don't complain about your favorite distro missing.)

RedHat: Long considered the leader in Linux development, RedHat has chosen to target the x86 server market with its product. Its spinoff (Fedora) is where the bleeding edge in Linux technology is tested before being integrated into RedHat's product line.

SuSE: SuSE has used its extensive experience in producing a high quality desktop distribution to position itself as the de-facto Corporate Desktop/Workstation distro.

Ubuntu, Mandriva, Xandros, Linspire: These distros are competing in the massive consumer desktop market. The differences between them can be extensive, not the least of which is their choice of desktop technology. Some choose to present users with GNOME, while others choose to present KDE. Each has its own method of easing the user's difficulties with installing software and interoperating with Windows machines and programs.

Gentoo: Gentoo has positioned itself as a powerful build system for constructing custom operating environments. While it's definitely not a choice for your average user, power-users such as technology workers, engineers, and scientists can customize the system to meet the precise needs of their work.

Yoper: Yoper has positioned itself as the "sports car" of Linux distributions. It's compiled to be faster than most distros, and is carefully tuned to meet the needs of most enthusiasts.

Knoppix: Designed to be the ultimate in demoing and portable desktop technology, Knoppix has evolved to allow users to take a familiar environment with them no matter where they go. Combined with removable media such as a USB pen drive, Knoppix can give users the freedom to roam with their files from computer to computer. All without changing a single configuration option on those computers!

As you can see, the Linux community isn't just "wasting its time".


Aren't You Being Hypocritical?

One of the questions that I realize will come up in response to this article is, "Aren't you being hypocritical? You just told us that we need a standard Linux Desktop across all distributions!"

Let me put your mind at ease. I stated no such thing, nor am I planning to state such a thing. The previous article was called The Linux Desktop Distribution of the Future for a reason. A new distribution to test the ideas I presented could only be good for the Linux community as a whole. After all, why force Linux into a box? Why must packages be the perfect solution? Why must we deal with traditional file systems? Why must we stop moving forward?

I realize that many readers are used to wholesale Linux bashing from writers such as myself, and have instinctively reacted to what they perceived to be more Linux bashing and "make it like Windows" solutions. Put your mind at ease. I don't want to make Linux like Windows. I want to take Unix systems into the future, hopefully before the competition gets there.

So, let me ask you this question: What do you want out of a Linux distro? Or more precisely, how would you like to fork today?


You can email the Batcave at the address akaimbatman@gmail.com.

Links:
EGCS History
General Motors Automobiles

RedHat
SuSE
Ubuntu
Mandriva
Xandros
Linspire
Gentoo
Yoper
Knoppix

 

Thursday, July 14, 2005


Linux Desktop Distribution of the Future Follow-up Part 2

Category: Commentary

Part 1: Clearing Up Misconceptions
Part 2: Refining the Ideas


Some argue that it's easy to make suggestions when you aren't the one figuring out how these suggestions will work. The problem with this argument is that without communicating these raw ideas, the implementation can never be found. "Communication completes thought" is a wise saying that a former boss of mine drilled into my head. By communicating our ideas we often think through the implications and decide exactly how they might be implemented.

As such, this week's article will attempt to present some of the refinements and more detailed explanations that were produced through conversations over the original series. We'll start by going over the tools of the trade.

FUSE

FUSE (Filesystem in USErspace) is probably one of the coolest little inventions ever. By implementing a simple interface, a developer can quickly and easily build a custom file system that runs entirely in user-land. Effectively, this makes Linux act like a Micro-Kernel with only the important I/O running at the Kernel/Ring 0 level.

The advantage to using FUSE (other than to speed the development of a new file system) is that all the libraries that are available to user programs are available to the file system. If you've never worked with Linux Kernel Modules before, allow me to assure you that this is a huge step. Software that runs inside the kernel can only access APIs running inside the kernel. This can limit a developer and make programming quite tricky.

The downside to using FUSE is the same downside that exists in nearly all micro-kernel designs: It's going to be a smidge slower than a kernel level driver. How much slower depends upon the application, but not so much that it would generally be noticeable with modern hardware.

FUSE's advantages have led to its use in SSHFS, GMailFS, and several other interesting file systems.
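
If you've never seen the FUSE interface, here's a minimal sketch of what it looks like in practice: a read-only file system that serves a single file. This is modeled on the classic FUSE hello-world example and assumes the FUSE 2.6 API; the file name and message are just placeholders.

/* hellofs.c -- a minimal read-only FUSE file system (assumes FUSE 2.6+).
   Build: gcc hellofs.c `pkg-config fuse --cflags --libs` -o hellofs */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

static const char *hello_str  = "Hello from userspace!\n";
static const char *hello_path = "/hello";

/* Report file attributes for "/" and "/hello". */
static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(hello_str);
        return 0;
    }
    return -ENOENT;
}

/* List the single file in the root directory. */
static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    (void)offset; (void)fi;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0); /* name without the leading '/' */
    return 0;
}

/* Copy the requested slice of the file's contents into buf. */
static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len;
    (void)fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    len = strlen(hello_str);
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return size;
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* e.g. ./hellofs /mnt/hello && cat /mnt/hello/hello */
    return fuse_main(argc, argv, &hello_ops, NULL);
}

That's the entire file system. Three callbacks and a main(), all running as a normal user process with access to any userland library you care to link in.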

Berkeley DB (BDB)

Berkeley DB is a very powerful embedded database library. Instead of taking the design to the level of an SQL server, BDB provides only data storage and file indexing. Concepts like a high level query engine and robust network support are left to the developer. This makes it a perfect base for a custom database. An example of this is the Berkeley DB table type in MySQL, which is built on the BDB database software.


AppDisks

One of the key concerns about the AppDisk scheme I proposed was the issue of limited loopback devices. A standard kernel is limited to a mere 8 loopback devices, which could cause a few problems if the user wants to run more than a few applications. While there actually exists a patch to allow for more than 256 loopback devices (!), it's important to understand why a loopback device is needed in the first place.

You see, there's nothing in the mount system that prevents an existing file from being mounted into the file system. As far as it is concerned, a file in /dev is just the same as a file located in your home directory. The reason for the loopback device has to do with the file system drivers. Most file systems specify that they will only mount a block device. As a result, the kernel rejects attempts to directly mount a regular file as a new file system. What the loopback device does is create a "fake" device that makes a file look like a block device.

The loopback device can be eliminated entirely if a file system is used that doesn't require a block device. For example, a "simple" FS could be created for most applications that don't require write support. This FS could be as simple as a TAR or ZIP file, although access times would be quite slow with both of those options. A file system driver could then be written for that "simple" FS, allowing for a virtually unlimited number of mounts.

Of course, actually performing unlimited mounts might be a good way to bump into other kernel limitations that aren't quite as obvious as the number of loopback devices. A better solution is as follows:

Create a file system using FUSE that supports the desired AppDisk file system types, and mount it to the proper point in the system. (I believe I suggested /apps.) When you want to mount a new AppDisk, you simply call the API for the FUSE file system, and the contents of the AppDisk will magically appear inside a sub-directory of /apps.

This scheme is really just a VFS inside a VFS, but it has the advantage of limiting the resources allocated by the kernel. Instead, all the resources would be allocated in userland where they can be easily tracked, protected, and paged.
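
To be clear, no such AppDisk manager exists today, so the following is purely a hypothetical sketch of the userland API such a FUSE-based daemon might expose. Both function names are my own invention:

/* appdisk.h -- HYPOTHETICAL interface; nothing here exists today. */

/* Mount the AppDisk image at image_path; on success its contents
 * appear under /apps/<name>. Returns 0 on success, -1 on error. */
int appdisk_mount(const char *image_path, const char *name);

/* Unmount a previously mounted AppDisk and release its resources. */
int appdisk_unmount(const char *name);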


Implementing the DBFS

Of all the concepts presented in the original article, the DBFS is probably the least developed in the OSS community. Because of this, I have spent considerable time attempting to derive quick methods of implementing this system. The three schemes I've come up with are as follows:
  • Direct Implementation
  • MySQL & FUSE
  • Berkeley DB & FUSE

Direct Implementation

Directly implementing a custom disk layout is the most desirable method of achieving a DBFS. It would have the advantage of being supported by a true kernel module; it would be easy to tailor to emerging needs, could use block devices directly, and would be much easier to tune for performance. It would also be possible to make such a DBFS a bootable drive, since the developer has direct control over the file system's structure.

The disadvantage is that it would take quite a bit of time to implement. Not only would the data layout structure have to be developed, but a record layout structure and indexing scheme would also have to be worked out. These concepts are well known, but would take time to complete.


MySQL & FUSE

I realize that I just debunked the concept of using an SQL Server for the DBFS, but I would be remiss in my duties if I didn't at least mention the possibility.

The absolute fastest method of getting the DBFS online is to use a loopback file system (i.e. a new view on an existing file structure) to attach meta-data to existing files. The concept is quite simple:

  1. Setup a MySQL database to store the meta-data.
  2. Create a FUSE file system that translates VFS calls to the appropriate file on an existing file system. (See the fusexmp example in the FUSE source code for an example of such a file system.)
  3. Present the user and programs with a DBFS view on the data.
  4. Profit!

It really is that simple. Unfortunately, there are quite a few drawbacks to this scheme. For one, the database server is vulnerable to crashing, corruption, rogue queries, security flaws, and a host of other issues. For another, the stability of the DBFS is vulnerable to the mapped file system being changed independently of the DBFS view. Finally, this scheme would be extremely wasteful of system resources and could potentially interfere with other MySQL installations.


Berkeley DB & FUSE

Probably the best compromise between the two concepts presented above is the use of a high performance embedded database for storing both the meta-data and the file's binary data. Such a database could handle all of the disk layout and logging operations, as well as the indexing and querying.

This is where Berkeley DB comes in. Berkeley DB is a simple yet powerful database designed to store strings of arbitrary binary data. It does this by allowing one table of keys and values in each database. A key is a binary string and a value is a binary string. That's it. There's no concept of columns, relations, data types, referential integrity, or any other fancy database concepts. If a structure is desired, it is expected that the program will impose it upon the value portion of the record. As a result, the value portion of a BDB record is often a C Struct.

Other features of interest are BDB's ability to store multiple databases in a single file, support for indexing on parts of a record's values, automatic memory mapping, support for sequences, transactional safety, support for large records (up to 4 gigabytes in most systems) and the ability to retrieve/store only parts of a record value. The latter two points are especially useful, as they allow for a Berkeley database to be used to store all the data for a file. In effect, BDB can become the file system!

The first feature is also of interest, because it means that BDB can be used directly on a block device if so desired. While it may be tempting to jump directly to using a block device, it is important to consider that placing the file on the root file system allows for disk space to be allocated to either the root or DBFS as needed. If you make the DBFS a separate partition, then you become bound by the disk sizes you chose.

What Berkeley DB does not provide is a fancy network server, user management, a query language, or any other cruft that might get in the way of a DBFS implementation.
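
To give a taste of just how bare-bones the BDB interface is, here's a hedged sketch of opening a database and storing one record from C. The file and database names are placeholders, and error handling is pared down to the minimum.

#include <stdio.h>
#include <string.h>
#include <db.h> /* Berkeley DB */

int main(void)
{
    DB *dbp;
    DBT key, data;
    long file_id = 1;
    const char *name = "myemail.msg";
    int ret;

    /* Create the handle and open (or create) one database named
       "files" inside the file dbfs.db. */
    if ((ret = db_create(&dbp, NULL, 0)) != 0) {
        fprintf(stderr, "db_create: %s\n", db_strerror(ret));
        return 1;
    }
    if ((ret = dbp->open(dbp, NULL, "dbfs.db", "files",
                         DB_BTREE, DB_CREATE, 0644)) != 0) {
        dbp->err(dbp, ret, "open");
        return 1;
    }

    /* Keys and values are just binary strings wrapped in DBTs. */
    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data = &file_id;
    key.size = sizeof(file_id);
    data.data = (void *)name;
    data.size = strlen(name) + 1;

    if ((ret = dbp->put(dbp, NULL, &key, &data, 0)) != 0)
        dbp->err(dbp, ret, "put");

    dbp->close(dbp, 0);
    return 0;
}

No schemas, no SQL, no server process. The structure of the value is entirely up to you, which is exactly what we want for a DBFS.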


Implementing a DBFS under Berkeley DB

You didn't think I'd leave you with only a quick description of the concept, did you? Believe it or not, I've been hard at work attempting to work out the details of creating a DBFS in BDB. Precisely three databases would be required for basic file system support. (Keep in mind that a database in BDB is analogous to an SQL Table.)

The following are what the value structures might look like. Note that all keys are expected to be long integers.
struct {
    long created;   // Date
    long modified;  // Date
    long accessed;  // Date
    char* name;     // Variable length
} files; // The key is a sequential integer

struct {
    long modified;  // Date
    char type;      // 0 for string, 1 for data, the rest reserved
    char* name;     // Variable length; indexed!
    char* value;    // file_data_id if type == 1, else variable string
} meta_data; // The key is a sequential integer

struct {
    char* data;     // Whatever the heck we want to put in here.
} file_data; // The key is a sequential integer

In looking at the structures above, you may begin to wonder where exactly the file data would be stored. The answer is: in the file_data database! Despite the high importance placed on the data inside a file, the data is really nothing more than another attribute or piece of meta-data. Files would have a standardized meta_data record called ".data" or something similar.

Adding labels (which can then be presented as directories for backward compatibility) can be done with the following database:
struct {
    char* name;
    long file_ids[]; // Variable length: the list of files attached to this label
} label; // The key is a sequential integer

Building a Query Engine

Obviously one of the biggest reasons for creating a DBFS is to allow for advanced queries. Even the basic structure presented above improves searching to some degree. If you were looking for a file created within the last 30 days, the DBFS could do a table scan to find files with the requested date.

Assuming 100,000 files on the system, a "long" integer key in the "files" database, and an average file name of 50 bytes, the DBFS would have to search through roughly 7.8 megabytes of data to find all the files created within the last 30 days. Since the data is stored in a table instead of scattered across INodes on disk, a burst read from the hard drive could read through the database in a few seconds or less! And that is assuming that none of the data is already memory mapped and cached in RAM! (Which is a very likely scenario.)

Unfortunately, a database scan does nothing to help the issue of searching file contents. To fix this, we need an index. Thankfully, it is quite easy to create one. All we need is a database where the key is a string, and the value is an array of file ids. e.g.:

struct {
    long file_ids[]; // Variable length: the list of files containing this word
} word_index; // The key is a string such as "hello\0"

Generally it makes the most sense to store only the lowercase variants of each word, to save space and eliminate issues with case sensitivity. This still presents a few issues with internationalization (the rules for lowercasing a word differ from language to language), but that's not an issue I'm quite ready to tackle.

The end result of this is that a query engine built on top of this index could quickly find a list of files for each keyword. By creating a combined list (or), subtracting one list from another (not), or creating a list of only duplicate matches (and), boolean searches can be achieved.
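
As a sketch of that last point: if the file-id lists are kept sorted, each boolean operator reduces to a simple merge. Here's what the (and) case might look like; the function name and signature are my own invention:

#include <stddef.h>

/* Intersect two sorted lists of file ids -- the (and) operation.
 * out must have room for at least min(an, bn) entries; the number
 * of matches is returned. (or) and (not) are analogous merges. */
size_t id_list_and(const long *a, size_t an,
                   const long *b, size_t bn, long *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < an && j < bn) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {
            out[n++] = a[i]; /* id present in both lists */
            i++;
            j++;
        }
    }
    return n;
}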


The Query API

After playing around with FUSE for a little while, I realized how a simple and backward compatible query engine might be achieved. Consider a URL for a website. It usually consists of a protocol (e.g. http://), a host (e.g. myplace.com), a path (e.g. /myfile.cgi), and a query string (e.g. ?myfield=value). In the case of a file system, the protocol and host are both known, and the path is supplied by the user. However, no concept of a query string exists.

What if we were to virtually provide access to files beneath a given file path? For example, let's say we had the following file:
/myfolder/myemail.msg
Let's say that the file had the attributes "subject", "to", and "from". We could access the contents of each of these attributes by opening the following paths for read-only access:
/myfolder/myemail.msg/#subject
/myfolder/myemail.msg/#to
/myfolder/myemail.msg/#from
Note that the # symbol is used to denote access to an attribute. Similarly, we could write the attributes by opening the file for writing. Furthermore, we could get a list of attribute names by opening the following file for read-only access:

/myfolder/myemail.msg/*list

These files would never actually appear in a file list because they are entirely virtual constructs. The file system driver would parse the path, find the file that terminates the path, then return a handle to a reader/writer for the attribute or list.
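
The parsing itself would be trivial. Here's a hypothetical helper that splits a path on the proposed '#' marker; remember, this convention is my own proposal, not something existing file systems recognize:

#include <string.h>

/* Split "/myfolder/myemail.msg/#subject" into the real file path and the
 * attribute name. Modifies path in place; on success attr points at the
 * attribute ("subject") and 1 is returned, otherwise attr is NULL. */
int split_attr_path(char *path, char **attr)
{
    char *last = strrchr(path, '/');
    if (last != NULL && last[1] == '#') {
        *last = '\0';     /* terminate the file path portion */
        *attr = last + 2; /* skip over "/#" */
        return 1;
    }
    *attr = NULL;
    return 0;
}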

This scheme can be further extended for general queries. The following query finds all files with the word "batcave" in them:

/?query=Batcave


Since special characters are not allowed in an indexed word, the standard GMail syntax of using [attribute name]:[keyword] can be used to search only specific attributes. Combine this with a feature for querying only specific labels, and you could craft a very advanced query such as one that looks for email from Bob:

/myfolder/?query=from:Bob


To obtain the information, the query engine would look up the label "myfolder", and find all files associated with it, as well as sub-labels and files. It could then search through the meta-data for each file looking for the attribute "from". When the "from" attribute is found, the engine would then attempt to perform a substring match against the value of the attribute. This process can be accelerated by adding the attribute values to the index database we discussed above.


Advancing the State of Applications

One of the seeming disadvantages to the AppDisk/DBFS scheme proposed in the original article was the idea that program settings would be lost when the application was deleted. While this is a valid concern, I don't think it is as big a concern as it might seem.

A DBFS not only advances the user's ability to interact with a file system, it should advance the ability of applications as well. The primary reason users don't want to lose program settings after removing an application is the databases that application has stored. For example, my email program may have my entire email database in its settings. Or my web browser is holding all of my web bookmarks. Why should these programs hold our data hostage? Isn't the Unix philosophy "everything is a file"?

The reason programs hold our data hostage is that they need to store it in special database files for performance and usability reasons. For example, it would be far too slow to parse out a directory full of emails just to show you a list of your inbox. Such an email program would easily be eclipsed by an email program utilizing an indexed database. Unless the former program had a DBFS backing it, that is!

The syntax still needs to be worked out, but imagine if you could open a query underneath a label that asked for a textual list of files with a given attribute and the value of that attribute! The email program could ask for the list of files with the MIME attribute of "text/email", and a list of the "to", "from", and "subject" attributes! The resulting file stream might look something like this:

myemail.msg, Bob, Me, How are things going\, Bob?


Note that the comma in the subject had to be escaped. Programs that read from these files will need to be aware of this issue, or perhaps would be allowed to switch the delimiter to a character of their choosing.
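
For illustration, here's a rough sketch of how a client program might split such a record on unescaped commas. Since the record format itself is only a proposal, this is strictly hypothetical:

#include <stdio.h>
#include <string.h>

/* Split one result line on unescaped commas, unescaping "\," in place.
 * Returns the number of fields found (up to max). */
int split_fields(char *line, char **fields, int max)
{
    int n = 0;
    char *src = line, *dst = line;
    fields[n++] = dst;
    while (*src) {
        if (src[0] == '\\' && src[1] == ',') {
            *dst++ = ','; /* unescape the delimiter */
            src += 2;
        } else if (*src == ',') {
            *dst++ = '\0'; /* terminate the current field */
            src++;
            while (*src == ' ') /* skip the space after the delimiter */
                src++;
            if (n < max)
                fields[n++] = dst;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
    return n;
}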

As a bonus, this information can be processed by traditional command line tools! Nothing stops you from SSHing into your machine, requesting a list of emails, then using grep, sed, cut and other Unix tools to process the data and generate a report!

Long term it would be valuable if all programs adopted this scheme. Your Web Browser bookmarks would be more accessible than ever, your address book could be nothing more than a bunch of VCF files on disk, and your calendar could consist of a number of event files. Not to mention that programs would have an easier time sharing this information! I for one can't wait for the day when Mozilla, Opera, KHTML, and other web browsers share all my bookmarks.


Links:

FUSE
Berkeley DB
Micro Kernel
SSHFS
GMailFS

The author can be reached at akaimbatman@gmail.com. Watch your step when entering the Batcave!

 

Thursday, July 07, 2005


Linux Desktop Distribution of the Future Follow-up Part 1

Category: Commentary

Part 1: Clearing Up Misconceptions
Part 2: Refining the Ideas

Originally this week was slated for an article unrelated to Linux, yet jam packed with useful and interesting information. Sadly, that article will have to wait. The response I received to my four part series "The Linux Desktop Distribution of the Future" was overwhelming, and demands a follow up. As a result, this article will focus on answering some of the common questions and criticisms leveled against the previous series.


Talkback

The following is a list of everyone I know who answered the series outside of the blog comments. Feel free to dig through these to get a feel for both the criticisms and the accolades that were put forth.

Slashdot.org Story
OSNews Story
Linux Today Story
Mark R. Hinkle's Blog (Editor of Linux World)
Brains Factory
Phil Crissman's Blog
Eric's "Extreme Boredom" Blog
Willisburg.org
House of Zeus

I do know of a few more, but my Russian is rather poor and I don't speak most Germanic and Arabic languages.

Direct Responses

There are two articles I am aware of that were published as direct responses to the series. The links and my responses are listed below:

EMerge Random

- or -

Does that Toaster Come With a Manual?

Dev/URandom's premise for responding is very odd at best. His idea is that Linux is ready for the Desktop today, just as long as the user reads the manual first. He then attempts to drill home the concept by using expanded text like this: "What I answer to my John Doe's that want to install Linux is: read documentation first. Please understand to the last word: f i r s t."

The problem is that his argument does not hold water. Users don't read the manual before they drive a car, microwave their food, use a refrigerator, or even use a Desktop computer. In fact, Mac OS X comes with no manual to speak of! If Linux requires that the user read the manual before using the interface, it has already failed.

Thankfully most respondents set him straight. He has now changed his tune and moved on to a slightly more intelligent premise.

Jeff Williams

Mr. Williams' rebuttal was much more even handed than Dev/URandom's, and was a pleasure to read. Unfortunately, Mr. Williams seems to misunderstand my post in a few places.

In Part 2 of his response he refers to "installation" software for the disk images, and pretends that the disk images are just a reformatting of existing package systems. What Mr. Williams does not understand is that the AppDisks are never installed. This is a very important thing to understand. The AppDisks are mounted in place, then run directly out of the image. When the application exits, the image is unmounted and thus leaves the system. This is worlds different than the packaging systems of today.

In Part 3 of his response he refers to the DBFS as "more conjecture than something really implementable." Unfortunately, he does not expand on this, so it's difficult to formulate a response. All I can say is: Yes, it is doable. Most everyone out there could do it with a standard database server. I'm suggesting doing it at the file system level. In this case, CompSci is on my side.

In the section entitled "Core system libraries," Mr. Williams attempts to agree and disagree at the same time. On one hand he says that I have identified the key issue with Linux (a lack of standardized APIs) then states that Fedora and Debian are standards. What I think Mr. Williams is missing is that the exact APIs can be modified at install time. The user can choose to install KDE, or he can choose not to. He can choose to install the SAMBA components, or he can choose not to. That choice is a powerful advantage for workstations and servers, but a very poor concept for a Desktop system where the user wishes to make his applications Just Work(TM).

In the section called "Documents", Mr. Williams argues that I have contradicted myself about special directories. However, he fails to grasp that the /Documents directory is not a special folder. It is, instead, a replacement for the user's Home directory. Inside it, the system is normalized to allow the user easy access to his documents, applications, and other functionality. This is different from GNOME and KDE which create various bits of files and folders (both hidden and visible) that have special meaning inside the system.

Finally, under "Part 4: Desktop", Mr. Williams fails to understand the concept of the saved query. His belief is that the results of the query would be saved. That was not my suggestion. My suggestion was to save the query itself, then rerun it on demand. Since the file browser would constantly be running queries, the saved query file would merely pass its contents to the file browser for execution. The results would then be displayed as if it were a "special" folder.

Now on to the more general points.


A DBFS is not an SQL Server!

One concept that seemed particularly difficult for people to accept was the idea that a file system could both serve files and be a database at the same time. From the responses, everyone seems to have gotten the idea that a heavyweight SQL Server would be required to support the file system, and that the linkage would be weak at best. If you're one of those people, place your mind at ease. I suggested no such idea.

To understand why a DBFS does not need a separate server, one must first understand that a file system is a database. As databases go, most are quite poor, but they are still databases. Despite the fact that most commercial databases maintain relationships and provide client/server abilities, all that is required to have a database is to store data in a structured fashion. By that definition, even a comma delimited file is a database!

Speaking of comma-delimited files, the original computer databases were not far from these. Data was stored in fixed-length records, and retrievals were done by scanning the file. Columns existed only inside the program's definitions. i.e. A program might say "Characters 0-10 are the name, while characters 11-25 are the address". This scheme worked well enough, but wasn't particularly fast if you needed only a few records from the data.

Indexes were created as a solution to this issue. Indexes are really nothing more than a data structure that is faster to navigate than a full database scan. When these indexes were added to a flat database, the resulting database was referred to as an "ISAM" file. ISAM files are still used in many systems today due to their inherent simplicity.

Indexes can come in many shapes, sizes, and forms. The only purpose of an Index is to improve access to a particular piece of data. Now think for a moment. When you access a file on your hard drive, does your machine spin through every file on the system looking for the correct one? No! It provides you with a directory structure to navigate that allows you to pinpoint the correct file. Believe it or not, by navigating the directory structure, you've just followed through an index.

Fast forwarding to today, the underlying database technology has seen almost no change whatsoever. Database servers build relational theory and client/server technology on top of indexes and data access, but at the lower level offer little more than the abandonment of fixed record sizes in favor of indexed record/column locations.

In the context of my original article, my suggestions were nothing more than reorganizing the data structures that already exist on disk drives today. This is to provide a query method that does not require a jaunt through the directory structure. There is nothing incompatible about this concept, and it can already be seen in the myriad of file systems that are available today. NTFS is probably the closest file system to what I am suggesting, but BeFS and HFS+ both provide highly advanced data structures and indexing in their designs. Isn't it interesting that Linux supports all of those file systems just fine?

Yes, disk creation and repair tools would be required for this new file system. However, the same is true for every file system ever added to Linux. When you run FSCK, you're not running a generic program, you're running a wrapper around a file system specific program.


Have you tried Distro XYZ?

This is quickly becoming the most annoying question in the history of mankind. It doesn't matter *which* distro you've tried or haven't tried. They all run on similar principles. No one distro magically solves all the issues facing Linux today. Sure, some have strengths over other distros, but they also have weaknesses. As a result, you find yourself in a conversation like this:

Me: I tried RedHat, but XYZ was giving me headaches.
FanBoy1: You should try Gentoo, it fixes all your problems!
Me: I tried Gentoo, but problem ABC was a show-stopper for me.
FanBoy2: Gentoo is crud! Use SuSE!
Me: I tried SuSE, but it didn't support 123.
FanBoy3: That's because you should have been using Debian all along!
Me: I tried Debian, but it broke.
FanBoy4: That's because Debian is out of date. Ubuntu is the future!
Me: Ubuntu still has issue Z.
FanBoy5: I understand all your issues with Ubuntu. I ran into them myself, so I switched distros. You should really try RedHat!

And so this crazy circle completes.

For the record, I have tried nearly all mainstream Linux distros. You can find some very nice reviews of some of them in the history of this blog. (Originally posted in my Slashdot Journal.) No, I have not tried Ubuntu yet. I'm still waiting for my CDs to arrive due to this blog keeping me too busy to download and burn CDs.


Installation != Ease of Use

An interesting fallacy I've seen as of late is the idea that an easy installation somehow equates to ease of use. Let me put this to rest. The only reason why the user cares about how easy it is to install Linux is because Linux doesn't come bundled with hardware. Windows and Mac OS X don't have to worry, because they are already installed for the user.

Most Linux distros today do an excellent job of providing a user friendly installer. Which is great if the user has to install his OS. For long term usage however, the user doesn't really care about the installer. He cares about things like using whatever application he wants, having document compatibility, playing games, keeping his work organized, and above all else having his machine work for him instead of against him.

Far too often Linux works against its users, because it wants someone more experienced at the helm. There's nothing wrong with that as long as Linux is targeted at the workstation and server markets. But if you really want to see Linux adopted on the home or work desktop, then you need to consider how to keep Linux from fighting its users.


Shortcuts vs. Labeling

Another item that seems to have confused readers was the issue of keeping shortcuts off the desktop. I understand the confusion, so allow me to clarify: The desktop I'm proposing has no shortcuts in the traditional sense. Instead, the labeling system is used to provide a desktop label. Anything linked to the desktop label will show up on the user's desktop. This works because files can be linked to multiple labels. Thus if I have a file under the "Text Documents" label and add it to the "Desktop" label, the file will appear in both places.


Dumbed Down for the "Smart" People

An interesting criticism of my article was that the concept was "dumbed down" for the idiot savant user. If you'll forgive me, I find this to be an offensive idea. "Dumbing Down" implies that something is reduced in functionality and features until only the absolute minimum is left. An example of this is the arcade games designed for three year olds that only have one button and no joystick.

The concepts I introduced were far more sophisticated than those used today. The current file systems, for example, are quite dumb. (Except for ReiserFS. I considered the possibility of Reiser solving the DBFS need, but I found its meta-data support to be incomplete.) The DBFS I proposed would actually be quite smart. The interface I suggested contains all existing user interface components, adds some functionality, and has far more potential for the future.

There is nothing "dumb" about the suggested changes. All they do is Keep It Simple Stupid.


AppDisks vs. AppFolders

A very important distinction that seems to have slipped past many is the fact that I suggested the use of AppDisks, not AppFolders. These AppDisks are *similar* to AppFolders, save for the fact that they wrap a complete Unix hierarchy into a mountable disk image.

This means that /bin, /lib, /share, and other Unix style directories still exist. For example, shared libs can be stashed inside the /lib directory, in direct opposition to the way that Mac OS X AppFolders work.


Linux Applications are so easy to install, they don't work

A common response to my article was to gloss over the issues with the packaging system and claim that there's nothing easier. Well, yes, it's quite easy to click and install from a manager like Synaptic just as long as the package is available. But many programs that users wish to run have no package or only out-of-date packages, or the repository has moved on to a newer version, thus making their Linux system useless. In the latter case, the solution is to upgrade your system. Yet a common complaint leveled against Microsoft is that Microsoft forces users to upgrade!

In addition, there is no way to fully test a package repository. Since every package modifies the base system, the only way to prove that a package will work is to test it against every possible package configuration available! In case you're wondering, the math for that is P * P, where P is the number of packages available. A mere 100 packages could potentially result in 10,000 available configurations! That's a lot of potential for breakage! Now consider that most distros today have thousands of packages under their care, and the number is not declining.

Minor Correction: Reader Bradley Momberger has correctly pointed out that my math was a little screwy on this one. The correct formula for the number of combinations is 2^P, which is actually quite a bit worse. 100 packages yields roughly 1.27e30 possible combinations!

In the AppDisk system I've proposed, the application is merely a visitor to the system. It lives in its own directories and its own environment, separate from the rest of the system. Thus there are precisely P possible configurations to test. And since a one-to-one relationship exists, it is quite feasible to place the difficulty of testing the application on the developer.


This is all Sci-Fi Technology!

Perhaps the most amusing criticism is the suggestion that the ideas presented are "Sci-Fi" technology that couldn't possibly exist without multi-million dollar backing and years of development.

This argument, I'm afraid, ignores the fact that I'm building on top of a great deal of existing work. i.e. Standing on the shoulders of giants as it were. Take the example of applications as disk images. While I invented the concept independently, I can't take credit for being the first to deal in the concept. That honor goes to the Klik project and their CMG file technology. Granted, their CMG technology is still very immature, but it is here and working today.

Another example is the concept of creating a running system with the applications outside the /usr directory. GoboLinux has gone to great pains to experiment with alternate file system layouts. Yet another example is Rox Filer/Rox Desktop, the pioneering project behind the use of AppFolders in Linux. (GNUStep had AppFolders first, I believe, but the applications are restricted by the OpenSTEP Desktop.)

Look at the work being done by the DBFS and Beagle projects for examples of Database File System technology.

The only thing I have attempted to do is make suggestions about how disparate technologies may be pulled together to make a cohesive whole. Make no mistake. I am not suggesting a Linux desktop of the future, I am suggesting the Linux desktop of the future. Whether I play a part in it or not is irrelevant. It is already happening. The question is: Will we beat the competition or will we continue to play follow-the-leader?


In the next article, I will attempt to refine some of the concepts presented based on the feedback I've gotten. So tune in next time, same Bat Time, same Bat Channel!


Go to Part 2: Refining the Ideas >>

Links:

NTFS
BeFS
HFS+
ReiserFS
CMG Files
GoboLinux
Rox Desktop
DBFS
Beagle

Questions? Comments? Business Ideas? Hate Mail? You can send it all to akaimbatman@gmail.com. (Except for the hate mail. That belongs in /dev/null.)

 

Friday, July 01, 2005


The Linux Desktop Distribution of the Future Part 4

Category: Conceptual Design

This article is part of a four part series intended to provide some thought into how a future Linux Desktop might work. It is not intended to be a comprehensive essay, although all the concepts presented here are considered "doable" by the author.

Part 1: Linux and the Desktop Today
Part 2: Applications
Part 3: File Management
Part 4: The Desktop Interface

Part 4: The Desktop Interface

In the final edition of this series, I'll tie together the technologies previously discussed, and explain how they might coexist in an easy to use desktop environment.

The Layout

Desktop

When a user attempts to be productive with his computer, there are usually only a few types of files he's interested in: Applications and Documents. The Applications are the programs he uses to work with his documents. Just about everything else that a user does is intended to support this functionality. For example, the user has no direct need for a trash can. Rather, the trash is an abstraction that allows him a second chance before permanently deleting a set of files.

As a result, the only things that should be on a user's desktop are:

1. An icon to access his applications.
2. An icon to access his documents.
3. An icon to access his trash.
4. Any special mounts such as CDROMs, Network Drives, Cameras, USB Storage Devices, etc.

Everything else should be kept off the desktop. In particular, it is rather important for the system to NOT have desktop shortcuts in order to prevent the common glut of special offers and installers.

For ease of access, it is in the user's interest to allow selected files to appear on the Desktop. In the proposed interface, the Desktop would be merely a label used by the system to identify which files should appear. As a result, the right-click menu and/or toolbars can provide the user with the option to add or remove the file from the Desktop. The key difference between how this would work in a DBFS vs. a regular system is that the file is never moved in a DBFS. If the file is already organized, it will not have to move to appear on the Desktop. Rather, the file simply has the Desktop label added. Removing that label would have no effect on the file other than to make it disappear from the desktop area.

The two other key items on the Desktop are the search box and the task bar.


Applications

Clicking on the Applications icon pops up a file browser displaying the contents of the Applications label. This label can theoretically contain anything, including sub-labels, but will hold all applications by default. In the filters section we'll discuss a method by which applications can automatically end up here.

Documents

The Documents icon shows a file browser of the root of the label tree. Uncategorized files show up in the root, while any file with at least one label can be found underneath the selected label(s). The Applications and Trash labels are automatically weeded out to keep the user from getting confused. Note that what is displayed is not the true root of the filesystem. Rather it is a virtual root consisting only of files and labels that have no parent. No true hierarchy exists in a database file system, so the label and file information is used to compute one.

A special "label" called "Users" should show up under here, with a set of sub-labels for all users of the system. Everything beneath the user's name is part of their DBFS files and meta-data.

Trash

As mentioned previously, files intended for deletion are simply labeled with the "Trash" label. Unlike other labels, however, any file with the Trash label should be automatically hidden from the file browser unless the Trash itself is being explicitly browsed. Should the user decide to rescue the file from the trash, the system merely needs to remove the Trash label from the file in order to restore its full set of labels and attributes. Should the user decide to empty the trash, only then is the file actually deleted.

Note that at no time is the file actually moved or modified. The "move" to the trash can is merely an illusion designed to provide a mechanism similar to the trash can in today's OSes, but far more robust thanks to the ability to leave the file in place.

Search Box

The search box in the upper right-hand corner is a search like Apple's Spotlight and a command line rolled into one. If the box detects that you have typed an absolute path, it will bring up a file browser window regardless of the fact that it's outside of the DBFS. This provides the user with a way to browse the system files if he so desires without resorting to a terminal program. The search box can also have a URL or URI entered to directly access an Internet site or shared network drive.

In all other circumstances, the search box will automatically search all files on the DBFS, sifting through indexes and meta-data to find any and all matches. Since the query can be pushed down to the DBFS and accelerated by its indexes, it should be tremendously fast.

Future enhancements to the search box might allow for more complex functionality such as quick commands to run programs, web searching, program interfaces, dictionary lookups, and other features similar to the way that Google provides a combined search and command line today.

Another important feature to include in the Search is a method by which the search can be saved as a pseudo-folder. In reality the search should be nothing more than a file on disk that opens to the search results window (thus allowing the file to accept regular labeling and meta-data additions), but the icon for the saved search would make it appear to the user that the search is a special type of folder.

Taskbar/Dock


It is generally agreed by all modern UI experts that there must be a sensible method for the user to view and access all open program windows. The most popular of these methods has long been the Windows Taskbar, which shows a button for each open Window. To date, nothing has been found wrong with this interface other than potential usability issues that result when the user has an overcrowded taskbar. Still, it's a familiar interface for most users.

Similarly, the Dock is very well known to users of Mac OS X. The Dock interface provides a solution to overcrowding by automatically scaling icons so that all the icons can be seen on the screen at once. The icons are never too small, because they regrow as the user moves the mouse over them. This allows the user to manage hundreds of programs with the flick of a mouse over the Dock.

So which is better? The answer is: whichever one you prefer. Some people prefer the Windows method and some prefer the Mac method. As a result, this is one of the few places where giving the user an option makes a big difference: let the user choose which one he wants. The two are similar enough to implement via the same methods, and this desktop has no extra baggage to differentiate either design at an API level.

Filters

Anyone who has ever used email can tell you that the ability to set up rules on incoming email is an invaluable method for organizing and managing your email. GMail users can tell you that when combined with labels instead of folders, those rules (called Filters in the GMail system) become even more valuable. For example, in GMail you can automatically organize your email under the proper labels, yet still see it in your inbox prior to archiving it. This saves a great deal of time, as the user does not have to check each email folder individually.

When applied to downloaded files, this raises an interesting question: why don't web browsers have rules to automatically organize downloaded files? Surprisingly, the answer is the same as with email rules for folders. When a rule fires, the email bypasses any sort of holding area (the inbox, in most cases) where the user can easily be informed of its existence. Users have enough trouble finding their downloaded files today without files automatically moving to God knows where.

Yet just as with email, the problem goes away when labels are introduced. A file can be automatically organized under a label without removing it from a "downloaded" label or the Desktop. Smart software could even auto-manage the user's downloads by automatically removing the "downloaded" label from any file that is linked to another label after a given period of time.

The best implementation is to push the rules down to the filesystem level. When a new file is created, the DBFS checks the filters and, if any apply, uses them to automatically apply labels to the file. For example, if I create a new SXW document, I could have a rule that automatically places the document under the "Word Processing" label. Nothing prevents me from later adding a project-specific label, but with the filter in place I know that I can always find all of my SXW documents in one place. Similarly, I can set a filter stating that all MPG files with the words "Star Trek" in the name should be placed under the "Videos" and "Star Trek" labels.
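
Both of those example rules reduce to a predicate plus a set of labels to apply. A minimal sketch of the check the DBFS might run at file-creation time (the rule format is invented for illustration):

    import fnmatch

    # Each filter: (predicate over the new file's name, labels to apply).
    FILTERS = [
        (lambda name: fnmatch.fnmatch(name, "*.sxw"),
         {"Word Processing"}),
        (lambda name: fnmatch.fnmatch(name, "*.mpg")
                      and "star trek" in name.lower(),
         {"Videos", "Star Trek"}),
    ]

    def labels_for_new_file(name):
        """Run at creation time; returns every label the filters apply."""
        labels = set()
        for matches, to_apply in FILTERS:
            if matches(name):
                labels |= to_apply
        return labels

    print(labels_for_new_file("report.sxw"))            # {'Word Processing'}
    print(labels_for_new_file("Star Trek - Ep1.mpg"))   # {'Videos', 'Star Trek'}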

The power that such a system gives the user cannot be overstated. While search can reduce the amount of time a user spends looking for a file, filters all but make manual organization unnecessary. And if a type of file isn't properly trapped by the system, the user can always add another filter.

File Save Window

Since this new Desktop and Filesystem paradigm does away with the previous concept of directories, it becomes important that the File Save dialogs be updated to reflect the shift. While the old dialogs will work fine for a time (they will see the labels as if they were directories, and the DBFS will automatically assign the selected label), a more robust solution is a save dialog that shows the labels in the system and lets the user build a list of them, in addition to giving the file a name and file type.
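
At the data level, such a dialog hands back little more than a name, a type, and a label set. A trivial sketch (all field names are hypothetical):

    # What a label-aware Save dialog might return to the application;
    # the DBFS would attach every label in one step.
    def save_request(name, filetype, labels):
        return {"name": name, "type": filetype, "labels": set(labels)}

    req = save_request("q3-report", "application/vnd.sun.xml.writer",
                       ["Word Processing", "Q3", "Reports"])
    print(req["labels"])   # {'Word Processing', 'Q3', 'Reports'}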

This could be implemented for many current applications by modifying the existing GTK and QT file choosers. Unfortunately, the remaining applications would need to be modified to fully support the new Labeling system. Not to worry, however! Even with applications that are unable to break out of the legacy file choosers, file system filters will still allow users to automatically organize their system despite a given program's failings.

Filter Manager/Control Panel/Network Browser

One of the greatest crimes against good UI design was the attempt to move "common" functionality directly into the Desktop metaphor. Such functionality only served to confuse users and complicate the interface. Yet integrating such functionality was almost required in order to keep the user from having to sort through dozens of menu choices before finding the actual system options.

Thanks to the Applications as Files scheme, there is no need for the user to have to put up with this any longer. Functionality such as the Control Panel, Network Browser, and other system features should show up as "just another application". If the user is savvy enough to know they need these apps, the user is savvy enough to look in the Applications label to find them.

Which isn't to say that other programs might not invoke them. The key is that the interface is only provided when it makes sense, and never at any other time. The upshot of this design is that these interfaces can be upgraded independently of system components. Which is a good thing when you consider that they are nothing more than GUIs that assist the user in managing system configuration. If a user really misses having such options available from the Desktop, he can simply add the Desktop label to the applications and watch happily as they integrate right into his desktop!

Notably Absent

There are a few things that are noticeably absent from the proposed Desktop concept. For one, there is no "Start Button" or equivalent. This is intentional. The Start Button was always a confusing metaphor that only existed to cover over the inability of the Desktop to function in a cohesive fashion.

Another thing that you'll note is missing is the concept of a shortcut. This is intentional. Shortcuts are dangerous metaphors because they don't stay in sync with their targets. They were created as a method of covering over missing functionality in the Desktop (e.g. listing application shortcuts under the Start Button) as well as the file system's missing ability to non-rigidly link a file. Consider, for example, what happens when a file open operation is performed on a shortcut. Should a handle to the target file be returned, or a handle to the contents of the shortcut file itself? If the answer is the former, then how does the system edit the shortcut? Is a special fcntl call needed?

As you can see, leaving these concepts out of the desktop only helps improve the situation. The only exception to the no-shortcuts rule is the Dock interface. If the user is using the Dock, then he will probably want quick access to commonly used programs there. A possible solution is to create a "Dock" label: any application carrying that label would automatically appear on the Dock regardless of whether it is currently running.
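
Populating the Dock then becomes a simple label query merged with the running-program list. A sketch over a hypothetical in-memory application store:

    # Everything labeled "Dock" appears, alongside whatever is running.
    apps = [
        {"name": "Browser", "labels": {"Applications", "Dock"}, "running": True},
        {"name": "Editor",  "labels": {"Applications", "Dock"}, "running": False},
        {"name": "Player",  "labels": {"Applications"},         "running": True},
    ]

    dock = [a["name"] for a in apps if "Dock" in a["labels"] or a["running"]]
    print(dock)   # ['Browser', 'Editor', 'Player']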

Issues Remaining

As with all high-level discussions, this article still leaves various issues unanswered. For example, how do users share applications? Are all applications available to everyone all of the time, or is there a method by which system-wide applications can exist independently of user-owned applications? Issues such as these are easily solvable given enough thought. The key is to be cognizant of the new paradigm that these changes bring to the table.

The author can be reached at akaimbatman@gmail.com.

 