The Intelligent File Format: Part 1
Category: Conceptual Design
One of the most frustrating things for any computer professional is the wide variety of file formats. Day in and day out we deal with documents, archives, images, multimedia, and other files that open only in a specific program. We try to make our lives easier by using multi-format programs like PowerArchiver, OpenOffice, GIMP, and VLC. However, these programs often fail to render file contents accurately. Or worse, we come across an old file in a format that's no longer supported and spend hours hunting for tools to open it.
Perhaps even more frustrating is that many standards have been proposed to solve these issues, only to become ineffective as new technologies render them obsolete. For example, who is going to encode their video in standard MPEG format when options such as DivX, Sorenson, and MPEG-4 exist?
The problem is even worse for software developers, who spend inordinate amounts of time reverse engineering formats and code just to gain compatibility with a single format! Reverse engineering a large number of formats becomes an impossible task, one that encourages developers to take less-than-ideal shortcuts. (For example, MPlayer redistributes the Windows Media codecs with a custom linker.)
Some operating environments (most notably the Apple Newton) have tried to solve this problem with "standard" interfaces between user programs. This often fails, however, as programs go out of date and the interfaces change. What's needed is a way to magically obtain access to the data inside any file: a way to learn the structure of the data without reverse engineering someone else's parser.
A Bit of History
I've long been impressed by the number of file systems supported by Linux and FreeBSD. Even with incomplete support, these operating systems have made life easy for those of us with dual-boot systems and multiple drives. Sadly, many operating systems don't reciprocate. Some fail because their developers don't want to support other OSes, but many fail because they can't: the filesystem format may be too new, or the OS may no longer be under active development. Either way, getting even a modicum of support seems difficult, if not impossible.
This led me to think about ways of improving the situation. One Linux project actually used the NTFS driver from Windows to provide write support. Could the concept work the other way around? Could the Linux driver be compiled for, or specially linked against, Windows? Possibly. Wouldn't it be nice if a standard existed for file system drivers?
Then an idea occurred to me. What if a file system embedded, at its very beginning, a driver for accessing the disk? If the driver were in some sort of neutral format (similar to the X Window System drivers), then any OS could access the file system! In fact, the precise format of a given file system would become irrelevant. You could format your disk with the best file system for your needs and feel confident that it would work in any OS.
Or so the theory goes. I wasn't quite starry-eyed enough to believe that Microsoft, Linus, the FreeBSD Foundation, and Apple would suddenly become agreeable just because I said so. Still, the concept had merit, so I began to work on a prototype for a file-system-in-a-file. My thought was that such a library could prove the concept as well as provide an excellent choice for developers looking to keep all their program data inside a single file (à la Access MDB and Outlook PST files) while still allowing them to plug in a more robust format at any time. Taken one step further, the concept could perhaps become useful on external flash drives, many of which are stuck with the sub-optimal FAT file system for compatibility reasons.
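As a minimal sketch of what such a container might look like on disk, imagine a header carrying a magic number, the length of the embedded driver, and then the driver bytecode itself, followed by the payload. All field names and the magic value here are invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical layout for a self-describing file:
// magic (4 bytes) | driverLength (4 bytes) | driver bytes | payload bytes
public class FormatHeader {
    static final int MAGIC = 0x49464621; // "IFF!" - an invented marker

    // Reads the embedded driver bytes; the host would hand these to a
    // class loader rather than interpret the payload itself.
    public static byte[] readDriver(DataInputStream in) throws IOException {
        if (in.readInt() != MAGIC)
            throw new IOException("Not an intelligent-format file");
        int driverLength = in.readInt();
        byte[] driver = new byte[driverLength];
        in.readFully(driver);
        return driver;
    }

    public static void main(String[] args) throws IOException {
        // A tiny in-memory "file": magic, length 3, then 3 driver bytes
        byte[] file = { 0x49, 0x46, 0x46, 0x21, 0, 0, 0, 3, 1, 2, 3 };
        byte[] driver = readDriver(new DataInputStream(new ByteArrayInputStream(file)));
        System.out.println(driver.length); // prints 3
    }
}
```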
The Logical Conclusion
While this concept was exciting in and of itself, it barely scratched the surface of what was possible. It wasn't long before I considered that a file system is nothing more than a hierarchical database. There's nothing inherently special about it, so why couldn't the file system payload be replaced with some other sort of data? As long as the embedded driver can read the format and produce a usable data structure, there's no reason the concept couldn't be extended to all types of data! Images, documents, multimedia, archives, and more could all be converted to self-describing formats.
Of course, like any technological innovation, the concept is not without its pitfalls. Issues that need to be addressed are:
- File Size - Embedding a driver will add overhead to the format.
- Upgradability - Files are tied to a specific version of the format.
- Interface - How do we link the APIs at runtime?
- Security - What's to stop embedded code from launching a virus?
- Portability - How do we embed code that can work on all platforms?
- Performance - How do we provide maximum I/O throughput to files that are performance sensitive?
Let's go over each of these items and investigate the issue in detail.
File Size
There's no denying that a driver in the header would mean an instant increase in file size. For small files, this can easily double the size. It's even conceivable that the embedded driver could be larger than the original file itself!
However, there are mitigating factors to consider:
- Disk space is cheap. Adding a few kilobytes per file is unlikely to produce any appreciable increase in storage requirements.
- Bandwidth in modern systems is far greater than it used to be. Adding a few kilobytes will not increase transfer times to any noticeable degree.
- The driver can be compressed using a standard compression algorithm. This may reduce its size considerably.
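On the last point, the standard `java.util.zip` deflate implementation is enough to show the idea; the sketch below compresses a stand-in driver blob before it would be written into the header. The driver contents here are dummy data:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Sketch: shrink an embedded driver with the standard deflate algorithm
// before writing it into the file header.
public class DriverCompressor {
    public static byte[] compress(byte[] driver) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(driver);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for driver bytecode; real bytecode also deflates well
        byte[] fakeDriver = new byte[8192];
        byte[] packed = compress(fakeDriver);
        System.out.println(packed.length < fakeDriver.length); // prints true
    }
}
```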
Upgradability
Since the driver is embedded with the file, files become tied to the specific version of software that they were written with. If the driver software has bugs, these bugs will continue to propagate as long as the file is in circulation. On the other hand, this also reduces the number of version incompatibilities by ensuring that the original software is always available to parse the file.
Many software packages rewrite files anyway, so this is generally not as big an issue as it may seem.
Interface
One of the more challenging aspects of this scheme is how to link the driver. One would assume that an image would have a very different interface from a file archive. Runtime linking tends to be hard enough without adding completely unknown interfaces to the mix. And how do we document the available APIs for anyone who wishes to load the file?
Thus we need a way to store sufficient metadata about the APIs to allow for proper runtime linking.
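One plausible shape for that metadata is a small manifest stored alongside the driver, naming its entry-point class, the interface it claims to implement, and an API version the host can check before linking. All keys and names below are invented for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch: a file could carry a tiny manifest describing its driver's API,
// so the host can decide whether it understands the interface before linking.
public class DriverManifest {
    public static void main(String[] args) throws IOException {
        // Hypothetical manifest text as it might appear in the file header
        String manifest =
            "entry-class=com.example.PngDriver\n" +
            "implements=ImageSource\n" +
            "api-version=1\n";
        Properties meta = new Properties();
        meta.load(new StringReader(manifest));
        // The host checks the declared interface and version before
        // loading any bytecode at all.
        System.out.println(meta.getProperty("implements")); // prints ImageSource
        System.out.println(meta.getProperty("api-version")); // prints 1
    }
}
```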
Security
The greatest hazard posed by this format is that it allows arbitrary code to run every time a file is loaded. This makes any file a potential virus, even if it isn't executable!
What we need is a secure environment in which to run this code. Such an environment must reliably prevent access to the file system, network resources, the GUI, and program memory. It can't allow buffer overflows, and it must guarantee that the file handle passed to it can't be used against the parent program.
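One piece of such an environment is namespace isolation: loading driver bytecode through its own class loader with no parent other than the bootstrap loader, so the driver cannot see or tamper with the host's classes. This is only one layer of a sandbox, not a complete one (a real host would also need a security policy); the class and driver names below are invented:

```java
// Sketch: load driver bytecode through an isolated ClassLoader so its
// classes cannot resolve against the host application's namespace.
// Note: this is namespace isolation only, not a full sandbox.
public class IsolatingLoader extends ClassLoader {
    private final String driverName;
    private final byte[] bytecode;

    public IsolatingLoader(String driverName, byte[] bytecode) {
        super(null); // bootstrap-only parent: driver can't see host classes
        this.driverName = driverName;
        this.bytecode = bytecode;
    }

    @Override
    protected Class<?> findClass(String requested) throws ClassNotFoundException {
        if (requested.equals(driverName))
            return defineClass(driverName, bytecode, 0, bytecode.length);
        throw new ClassNotFoundException(requested);
    }

    public static void main(String[] args) {
        IsolatingLoader loader = new IsolatingLoader("com.example.Driver", new byte[0]);
        try {
            // A host class is invisible from inside the sandboxed loader
            loader.loadClass("IsolatingLoader");
            System.out.println("visible");
        } catch (ClassNotFoundException e) {
            System.out.println("isolated"); // prints isolated
        }
    }
}
```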
Portability
Above all else, the embedded driver must be portable. It does no good to invent a universal file format if it can't leave the confines of the x86 platform. Thus the best solution is to use either a portable scripting language or a language capable of executing on a Virtual Machine.
Performance
With files growing considerably in size, performance has become a major concern. Multimedia files in particular tend to be sensitive to I/O performance, meaning that the scripting language or VM must be capable of using the maximum system throughput without compromising security or portability.
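To suggest that VM-hosted drivers need not mean per-byte overhead, here is a sketch using Java's NIO channels with a single reusable direct buffer to stream a payload in bulk. The class and method names are my own:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Sketch: a driver can stream payload data through NIO channels and a
// reusable buffer, avoiding per-byte call overhead inside the VM.
public class BulkReader {
    public static long drain(ReadableByteChannel ch) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
        long total = 0;
        int n;
        while ((n = ch.read(buf)) != -1) {
            total += n;
            buf.clear(); // reuse the same buffer for the next chunk
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[200_000]; // stand-in for multimedia data
        long bytes = drain(Channels.newChannel(new ByteArrayInputStream(payload)));
        System.out.println(bytes); // prints 200000
    }
}
```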
The Obvious Choice
With all these constraints and issues in mind, the choice becomes extremely narrow. Scripting languages like JavaScript and Perl are portable but tend toward lower performance. Virtual machines like Smalltalk and .NET offer performance, but not strong security. The only choice left to us is Java.
The reasons for using Java are:
- Security is a core feature, not an add-on. Any chunk of code can be perfectly firewalled off from the rest.
- Java is portable to all major platforms, and can be ported to many more.
- Java performance has increased considerably over the years, making it one of the fastest choices on the market. In simple algorithmic usage (e.g. decoders, cryptography, compression), many benchmarks have shown Java to be competitive with native code.
- Java Reflection makes it easy to load a dynamic library, no matter what its source.
- Java can interface with nearly all languages. If you want to use portable file functionality in your C program, for example, there is nothing stopping you from using a JNI interface to load the data.
- Java bytecodes are small and compress well. They are regularly much smaller than a comparable native program.
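The reflection point above can be made concrete: once driver bytecode has been turned into a `Class`, the host can construct it and call its methods with no compile-time knowledge of the driver at all. A standard library class stands in for a loaded driver here:

```java
import java.lang.reflect.Method;

// Sketch: calling into a dynamically loaded class purely via reflection,
// as a host would call a driver it knows only from the file's metadata.
public class ReflectiveCall {
    public static void main(String[] args) throws Exception {
        // Stand-in for a Class produced by loading embedded bytecode
        Class<?> driver = Class.forName("java.lang.StringBuilder");
        Object instance = driver.getDeclaredConstructor(String.class)
                                .newInstance("payload");
        // Invoke a method by name, exactly as a manifest might specify it
        Method reverse = driver.getMethod("reverse");
        reverse.invoke(instance);
        System.out.println(instance); // prints daolyap
    }
}
```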
Haven't I heard this before?
As many readers may note, this concept is not without precedent. Self-extracting ZIP files and installers have long used a similar technique to distribute their payloads. While those approaches are not as far-reaching as what is described here, they are certainly predecessors of the Intelligent File Format.
XML files have also led the way by encouraging file formats that are common and easy to share. While the idea of a central repository of all XML DTDs and Schemas never came to pass, the overall concept is still going strong and is the basis for many cross-platform protocols such as SOAP and XML-RPC.
Go to Part 2 ->