The Man Also Known As "I'm Batman!"

Hello Citizen! Would you like to be my sidekick? Would you like to ride with... Batman?

Monday, February 20, 2006

We've moved! Check out all these great articles and others at!

The Intelligent File Format: Part 1

Category: Conceptual Design

One of the most frustrating things to any computer professional is the wide variety of file formats. Day in and day out we deal with documents, archives, images, multimedia, and other files that only open in a specific program. We try to make our lives easier by using multi-format programs like PowerArchiver, OpenOffice, GIMP, and VLC. However, these programs often fail to render the file contents accurately. Or worse, we come across an old file in a format that's no longer supported and spend hours trying to find tools to open it.

Perhaps even more frustrating is that many standards have been proposed to solve these issues, but they quickly become ineffective as new technologies make the standards obsolete. For example, who is going to encode their video in standard MPEG format when options such as DivX, Sorenson, and MPEG-4 exist?

The problem is even worse for software developers, who spend inordinate amounts of time reverse engineering formats and code in order to gain compatibility with just one format! Having to reverse engineer a large number of formats becomes an impossible task that encourages developers to find less-than-ideal shortcuts. (For example, MPlayer redistributes the Windows Media codecs with a custom linker.)

Some Operating Environments (most notably the Apple Newton) have tried to solve this problem with "standard" interfaces between user programs. This often fails, however, as the programs go out of date, and the interfaces change. What's needed is a way to magically obtain access to the data inside any file. A way to obtain the structure of the data without reverse engineering someone else's parser.

A Bit of History

I've long been impressed by the number of file systems supported by Linux and FreeBSD. Even with incomplete support, these operating systems have made life easy for those of us with dual-boot systems and multiple drives. Sadly, many operating systems don't reciprocate. Some fail because the developers don't want to support other OSes, but many fail because they can't reciprocate: the filesystem format may be too new, or perhaps the OS is no longer being developed for. Either way, getting even a modicum of support seems difficult, if not impossible.

This led me to think about ways of improving the situation. One Linux project actually used the NTFS driver from Windows to provide write support. Could the concept work the other way around? Could the Linux driver be compiled for Windows, or specially linked against? Possibly. Wouldn't it be nice if a standard existed for file system drivers?

Then an idea occurred to me. What would happen if the beginning of file systems embedded a driver for accessing the disk? If the driver was in some sort of neutral format (similar to the X Windows drivers), then any OS could access the file system! In fact, the precise format of a given file system would become irrelevant. You could format your disk to the best file system for your need, and feel confident that it would work in any OS.

Or so the theory goes. I wasn't quite starry-eyed enough to believe that Microsoft, Linus, the FreeBSD Foundation, and Apple would suddenly all become agreeable just because I said so. Still, the concept had merit, so I began to work on a prototype for a File-System-in-a-file. My thought was that such a library could prove the concept as well as provide an excellent choice for developers looking to keep all their program data inside a file (à la Access MDB and Outlook PST files) while still allowing the developer to plug in a more robust format at any time. If the concept could be taken one step further, perhaps it could become useful on external flash drives, many of which are stuck with the sub-optimal FAT file system for compatibility reasons.

The Logical Conclusion

While this concept was exciting in and of itself, it didn't even begin to scratch the surface of what was possible. It wasn't long before I considered the fact that a file system is nothing more than a hierarchical database. There's nothing inherently special about it, so why can't the file system payload be replaced with some other sort of data? As long as the embedded driver can read the format and produce some sort of usable data structure, there's no reason why the concept couldn't be extended to all types of data! Images, documents, multimedia, archives, and more could all be converted to self-describing formats.

Of course, like any technological innovation, the concept is not without its pitfalls. The issues that need to be addressed are:

  • File Size - Embedding a driver will add overhead to the format.

  • Upgradability - Files are tied to a specific version of the format.

  • Interface - How do we link the APIs at runtime?

  • Security - What's to stop embedded code from launching a virus?

  • Portability - How do we embed code that can work on all platforms?

  • Performance - How do we provide maximum I/O throughput to files that are performance sensitive?

Let's go over each of these items and investigate the issue in detail.

File Size

There's no denying that a driver in the header would mean an instant increase in file size. For small files, this can easily double the size. It's even conceivable that the embedded driver could be larger than the original file itself!

However, there are mitigating factors to consider:

  1. Disk space is cheap. Adding a few kilobytes per file is unlikely to produce any appreciable increase in storage requirements.

  2. Bandwidth in modern systems is far greater than it used to be. Adding a few kilobytes will not increase transfer times to any noticeable degree.

  3. The driver can be compressed using a standard compression algorithm. This may reduce its size considerably.

Upgradability

Since the driver is embedded with the file, files become tied to the specific version of software that they were written with. If the driver software has bugs, these bugs will continue to propagate as long as the file is in circulation. On the other hand, this also reduces the number of version incompatibilities by ensuring that the original software is always available to parse the file.

Many software packages rewrite files anyway, so this is generally not as big of an issue as it may seem.

Interface

One of the more challenging aspects of this scheme is how to link the driver. One would assume that an Image would have a very different interface from a File Archive. Runtime linking tends to be hard enough without adding completely unknown interfaces to the mix. And how do we document the available APIs to anyone who wishes to load the file?

Thus we need a way to store sufficient meta-data about the APIs to allow for proper runtime linking.

Security

The greatest hazard posed by this format is that it allows arbitrary code to run every time a file is loaded. This makes any file a potential virus, even if it isn't executable!

What we need is a secure environment to run this code inside of. Such an environment would have to reliably block access to the file system, network resources, GUI, and program memory. It can't allow for buffer overflows, and it must be capable of guaranteeing that the file handle passed to it can't be used against the parent program.
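
One concrete (if dated) way to express "this code gets nothing" in Java is to define the driver's classes under a ProtectionDomain whose permission set is empty. Below is a minimal sketch under assumptions: the SandboxLoader name is invented, and the SecurityManager machinery this relies on is deprecated in recent JDKs, so treat it as an illustration of the principle rather than a hardened design.

```java
import java.io.FilePermission;
import java.security.CodeSource;
import java.security.Permissions;
import java.security.ProtectionDomain;
import java.security.SecureClassLoader;
import java.security.cert.Certificate;

// Sketch: classes defined through this loader receive an empty permission
// set, so with a security manager installed they cannot touch the file
// system, network, or GUI. SandboxLoader is a name invented here.
public class SandboxLoader extends SecureClassLoader {
    private final ProtectionDomain noPermissions = new ProtectionDomain(
            new CodeSource(null, (Certificate[]) null), new Permissions());

    public Class<?> defineSandboxed(String name, byte[] bytecode) {
        // defineClass ties every class in the driver JAR to the empty domain.
        return defineClass(name, bytecode, 0, bytecode.length, noPermissions);
    }

    public static void main(String[] args) {
        // An empty Permissions collection grants nothing at all.
        Permissions none = new Permissions();
        System.out.println(none.implies(
                new FilePermission("/etc/passwd", "read"))); // prints false
    }
}
```

The empty permission collection is the whole trick: the driver can compute over the stream it is handed, and nothing else.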

Portability

Above all else, the embedded driver must be portable. It does no good to invent a universal file format if it can't leave the confines of the x86 platform. Thus the best solution is to use either a portable scripting language or a language capable of executing on a Virtual Machine.

Performance

With files growing considerably in size, performance has become a major concern. Multimedia files in particular tend to be sensitive to I/O performance, meaning that the scripting language or VM must be capable of using the maximum system throughput without compromising security or portability.

The Obvious Choice

With all these constraints and issues in mind, the choice becomes extremely narrow. Scripting languages like JavaScript and Perl are portable, but tend toward lower performance. Virtual Machines like Smalltalk and .NET have performance, but not high security. The only choice left to us is Java.

The reasons for using Java are:

  • Security is a core feature, not an add-on. Any chunk of code can be perfectly firewalled off from the rest.

  • Java is portable to all major platforms, and can be ported to many more.

  • Java Performance has increased considerably over the years, making it one of the fastest choices on the market. In simple algorithmic usage (e.g. decoders, cryptography, compression, etc.) Java has been shown by many benchmarks to be faster than native code.

  • Java Reflection makes it easy to load a dynamic library, no matter what its source.

  • Java can interface with nearly all languages. If you want to use portable file functionality in your C program, for example, there is nothing stopping you from using a JNI interface to load the data.

  • Java bytecodes are small and compress well. They are regularly much smaller than a comparable native program.

Haven't I heard this before?

As many readers may note, this concept is not without precedent. Self-extracting Zip files and installers have commonly used a similar technique to distribute their payloads. While these previous efforts have not been quite as far-reaching as what is described here, they are certainly predecessors to the Intelligent File Format.

XML files have also led the way by encouraging file formats that are common and easy to share. While the idea of a central repository of all XML DTDs and Schemas never came to pass, the overall concept is still going strong and is the basis for many cross-platform protocols such as SOAP and XML-RPC.

Go to Part 2 ->


The Intelligent File Format: Part 2

Category: Conceptual Design

In the second part of this article, I'll attempt to address how the Intelligent File Format might be implemented.

Like most file format documents, we'll begin by describing the physical format of the file, then we'll delve into the contents and how to link them.

The Format Table

    +--------+----------+---------+
    | Header | JAR File | Payload |
    +--------+----------+---------+

As the table shows, the format is fairly simple: a header, then the JAR file containing the code, then the actual data used by the file. In theory, the code embedded in the JAR portion of the file would only ever see the Payload portion of the file.

The Header

The header would consist of a structure similar to the one below:

Field          Size           Value
File ID        4 bytes        0xCAFEFEED
JAR Length     4 bytes
Flags          1 byte
Loader Class   UTF-8 String
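
Given that layout, the header can be parsed with a plain DataInputStream. A minimal sketch, assuming the Loader Class string uses Java's length-prefixed modified UTF-8 encoding (readUTF/writeUTF); the IffHeader class name and the sample values are invented for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical reader for the header layout in the table above.
public class IffHeader {
    static final int MAGIC = 0xCAFEFEED;

    final int jarLength;
    final int flags;
    final String loaderClass;

    IffHeader(int jarLength, int flags, String loaderClass) {
        this.jarLength = jarLength;
        this.flags = flags;
        this.loaderClass = loaderClass;
    }

    // Bit 0 - Write, Bit 1 - Random Access (see the Flags section below).
    boolean isWritable()        { return (flags & 0x01) != 0; }
    boolean needsRandomAccess() { return (flags & 0x02) != 0; }

    static IffHeader read(DataInputStream in) throws IOException {
        if (in.readInt() != MAGIC)
            throw new IOException("File ID mismatch: not an Intelligent File");
        int jarLength = in.readInt();
        int flags = in.readUnsignedByte();
        String loaderClass = in.readUTF(); // assumption: length-prefixed UTF-8
        return new IffHeader(jarLength, flags, loaderClass);
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a sample header through an in-memory buffer.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(MAGIC);
        out.writeInt(2048);                       // JAR Length
        out.writeByte(0x03);                      // writable + random access
        out.writeUTF("com.example.VectorLoader"); // hypothetical loader class
        IffHeader h = IffHeader.read(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(h.jarLength + " " + h.loaderClass);
    }
}
```

Everything after these few fields is the JAR, and everything after the JAR is the payload, so a parser needs no other framing.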

File ID

The file ID is nothing more than a standard identifier used to auto-detect the file type and ensure that utilities know that this is a binary file.

JAR Length

Informs the loader exactly how much of the file is taken up by the embedded software. The loader can then use this information to find the beginning of the payload stream.

Flags

Bit 0 - Write
Bit 1 - Random Access

The write flag tells the loader if the file can be opened for writing, or if the data is considered read-only. This is important because if no software to write the file is included, the program using the data may labor under the assumption that changed data has been saved when in reality no data has made it to disk.

The random access flag tells the loader software that the file cannot be streamed. If the file is being loaded over the network or other serial location, the program requesting the file will be forced to move the data to random access storage (such as a temp file) before it can be opened.

Loader Class

The loader class is a string that tells the loader software which class to call inside the embedded JAR file.

JAR File and Supporting Classes

The JAR file needs to contain sufficient software to load the file data into a code structure and then optionally write that structure back out.

While streaming input/output can be handled by the standard InputStream and OutputStream classes provided by Java, random access will need a new API. The only API that the Java platform provides for random access is java.io.RandomAccessFile, which is far too file-specific to meet the needs of a modern, portable file format. The data could be randomly accessed on disk, in memory, or over the network. Thus a new interface must be developed to replicate the APIs of RandomAccessFile without tying the implementation to any particular storage system.

The following interface and abstract class accomplish that task:

public interface RandomAccess {
    public long getLength();
    public void setLength(long length);
    public void seek(long position);
    public long getFilePointer();
}

// InputStream and OutputStream are classes in java.io, not interfaces,
// so a concrete subclass would expose them through adapter methods.
public abstract class RandomAccessData
        implements DataInput, DataOutput, RandomAccess {
    public abstract boolean isWritable();
}
The "isWritable()" method returns true if the file can be modified. The rest of the methods are equivalent to the methods contained in the RandomAccessFile class.
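
To make the abstraction concrete, here is a minimal in-memory sketch of the RandomAccess contract described above. The ByteArrayRandomAccess name is invented, and a real implementation would also supply the stream and DataInput/DataOutput methods; this shows only the positioning logic.

```java
// Invented illustration: the RandomAccess contract backed by a growable
// byte array, so "disk" position 0 is simply index 0 of the array.
public class ByteArrayRandomAccess {
    private byte[] data = new byte[16];
    private int length = 0;
    private int pos = 0;

    public long getLength() { return length; }
    public long getFilePointer() { return pos; }

    public void seek(long position) {
        if (position < 0 || position > length)
            throw new IllegalArgumentException("seek outside data");
        pos = (int) position;
    }

    public void setLength(long newLength) {
        ensureCapacity((int) newLength);
        length = (int) newLength;
        if (pos > length) pos = length; // clamp the pointer on truncation
    }

    public int read() {
        return pos < length ? data[pos++] & 0xFF : -1; // -1 at end of data
    }

    public void write(int b) {
        ensureCapacity(pos + 1);
        data[pos++] = (byte) b;
        if (pos > length) length = pos; // writing past the end grows the data
    }

    private void ensureCapacity(int needed) {
        if (needed > data.length) {
            byte[] bigger = new byte[Math.max(needed, data.length * 2)];
            System.arraycopy(data, 0, bigger, 0, length);
            data = bigger;
        }
    }
}
```

The same contract could just as easily sit over a file region or a network fetch, which is the whole point of divorcing it from RandomAccessFile.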

The loader class referenced by the header is expected to conform to an interface such as the following:

public interface IntelligentFileFormat {
    public String[] getInputInterfaces();
    public String[] getOutputInterfaces();
    public String[] getFormatterInterfaces();

    public Object getInput(InputStream in, String type)
        throws UnsupportedOperationException, FileFormatException;

    public void getOutput(OutputStream out, Object output)
        throws UnsupportedOperationException, FileFormatException;

    public Object getFormatter(RandomAccess io, String type)
        throws UnsupportedOperationException, FileFormatException;
}

The first three methods inform the loading software what types of objects this code might return. While most implementations would only have a single type of object to return, some implementations may allow for the data to be loaded in a variety of ways. For example, a vector image may be returnable as a DOM of the vector data, or as a java.awt.Image object. Alternatively, a software vendor may choose to fully document a "simple" interface while leaving the more complex, feature-rich interface undocumented. (See part 3 for more information on this.)

The type of the object is expected to be a fully qualified class name. The types of objects that can be written are not required to be symmetrical with the types of objects that can be read. Using the vector image example, the vector DOM might be writable to disk while the java.awt.Image object may not.

The latter three methods are where the reading and writing occur. The getInput() method returns an object that may stream the data in, potentially allowing for partial renderings of the data to be shown while the file is still loading.

The getOutput() method uses the object passed in to rewrite the payload data in the file. As stated above, the type of objects that can be written out will not always match the types that can be read in.

The getFormatter() method returns an object that has random access to the payload data. Depending on the setting of the "write" flag and the mode under which the file is opened, the returned object may also be capable of changing the payload data.

If the payload data is not in the expected format or the loader passes an unknown type, the methods will throw a FileFormatException. If the class does not support a method (e.g. getInput() is called on a random access file, getFormatter() is called on a streaming file, or getOutput() is called with an object that cannot be written), an UnsupportedOperationException will be thrown.


An implementation of the above software would carry out the following steps:

  1. Open the file for reading and/or writing.

  2. Read in the header.

  3. Use the JAR size in the header to extract the JAR file.

  4. Load the JAR file into a secure ClassLoader.

  5. Ask the secure ClassLoader for an instance of the class listed in the header.

  6. Get a list of supported interfaces from the class.

  7. Open the payload for streaming or random access based on the flags in the header.

  8. Cast the object returned by the implementation to the expected object type -OR- use reflection to investigate the available APIs.
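
The steps above can be sketched in a few lines of Java. This is a hedged sketch, not a reference implementation: the IffLoader name and the temp-file detour are assumptions, and step 4 substitutes a plain URLClassLoader where a secure ClassLoader belongs.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

// Hedged sketch of steps 1-5 of the loading process described above.
public class IffLoader {
    static final int MAGIC = 0xCAFEFEED;

    public static Object loadDriver(DataInputStream file) throws Exception {
        // Step 2: read the header.
        if (file.readInt() != MAGIC)
            throw new IOException("File ID mismatch");
        int jarLength = file.readInt();
        int flags = file.readUnsignedByte();
        String loaderClass = file.readUTF();

        // Step 3: use the JAR size to extract the embedded JAR.
        byte[] jar = new byte[jarLength];
        file.readFully(jar);

        // Step 4: load the JAR. URLClassLoader wants a URL, so spill the
        // bytes to a temp file; a real implementation would use a secure
        // ClassLoader here instead.
        Path tmp = Files.createTempFile("iff", ".jar");
        Files.write(tmp, jar);
        URLClassLoader loader =
                new URLClassLoader(new URL[] { tmp.toUri().toURL() });

        // Step 5: instantiate the class named in the header. Whatever
        // bytes remain in `file` are the payload; steps 6-8 continue from
        // here by reflecting on the returned object.
        return loader.loadClass(loaderClass)
                     .getDeclaredConstructor().newInstance();
    }
}
```

Note that a stream that fails the File ID check is rejected before any embedded code is touched, which is why the magic number leads the format.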

Go to Part 3 ->


The Intelligent File Format: Part 3

Category: Conceptual Design


Can a producer "cheat" by not documenting their interfaces?

Yes and no. There's nothing in this design to prevent someone from failing to document their interfaces or providing documentation for a crippled interface while keeping the more robust one a secret. However, the matter of reverse engineering becomes orders of magnitude simpler. Rather than trying to guess at the boundaries of fields or decompile a software package, you can now use reflection to investigate and document the classes and methods available in the object representation of the data. In this way, the format can be quickly and easily reverse engineered.
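
The reflective "documentation" step is nearly a one-liner in practice. A sketch (ApiDumper is an invented name), using java.lang.String as a stand-in for the undocumented object a driver might return:

```java
import java.util.Arrays;

// Invented illustration: list the public API of whatever object the
// embedded driver hands back, without any source or documentation.
public class ApiDumper {
    public static String[] publicMethods(Class<?> type) {
        return Arrays.stream(type.getMethods())
                .map(m -> m.getReturnType().getSimpleName() + " " + m.getName())
                .sorted()
                .distinct()
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Pretend String is the mystery object returned by getInput().
        for (String signature : publicMethods(String.class)) {
            System.out.println(signature);
        }
    }
}
```

Compare this to probing a binary blob for field boundaries: the method names and types arrive for free.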

Can the embedded software access a non-payload portion of the file?

If a RandomAccess implementation allowed it to, then yes. Considering the accidental damage this could do to the file, it is important that the RandomAccess implementation passed to the main class be constrained to the payload area of the file. As far as the embedded software is aware the file starts and ends with the payload area.
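
That constraint is easy to enforce by construction. A sketch (PayloadWindow is an invented name): the driver sees positions 0 through size-1, and every access is translated and bounds-checked against the payload region.

```java
// Invented illustration: the RandomAccess view handed to the embedded
// driver covers only [offset, offset + size) of the real file, so the
// header and JAR areas are unreachable by construction.
public class PayloadWindow {
    private final byte[] file;   // stand-in for the whole on-disk file
    private final int offset;    // first byte of the payload
    private final int size;      // payload length
    private int pos = 0;         // position *within* the payload

    public PayloadWindow(byte[] file, int offset, int size) {
        this.file = file;
        this.offset = offset;
        this.size = size;
    }

    public long getLength() { return size; }

    public void seek(long position) {
        if (position < 0 || position > size)
            throw new IllegalArgumentException("seek outside payload");
        pos = (int) position;
    }

    public int read() {
        // As far as the driver can tell, the file *starts* at the payload.
        return pos < size ? file[offset + pos++] & 0xFF : -1;
    }
}
```

The host keeps the only reference that knows the true offsets, so even a buggy driver cannot scribble over its own loader.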

Won't this technology bloat the file sizes?

It will certainly increase them, yes. However, not all file types are good candidates for this scheme. A good rule of thumb is if the loader software is going to take up more than 20% of the file on average, then the file may be trivial enough to not warrant the use of the Intelligent File Format.

Will this replace all file formats including text files and XML?

No. Most textual formats are already accessible enough as-is. Replacing them with an Intelligent File Format would only serve to invalidate the entire tool chain that has been built up over the years. IFF files are much better suited to replacing binary formats such as office documents, where the format is more difficult to reverse engineer.

Can Intelligent Files be embedded inside Intelligent Files?

Many types of documents allow for other types of documents to be embedded inside their data streams. The most famous example of this is Windows OLE (Object Linking and Embedding).

Thankfully, there is absolutely nothing preventing an embedded driver from calling up the loading software and passing in a subsection of its own stream. This would work even with maximum security restrictions in place. However, such implementations need to be careful with RandomAccess files. If the file grows beyond its existing bounds it may overwrite other data in the parent file. Thus more complex implementations would need to consider some sort of paging system for ensuring that growth beyond the end of a file would be relocated to an empty section.

What about temp file access?

As of this writing, I have not fully considered the implications for large binary data files. A common method of handling such files is to swap decoded data in and out of temporary files. However, the security restrictions placed on the embedded software would preclude any sort of temp files from being accessed. Until this is solved, formats would need to page data in and out of the payload area. With CPUs being as fast as they are, and Virtual Memory Paging systems being as sophisticated as they are today, this may not be much of an issue.

Is Java an absolute requirement?

No, it is not. The concept could potentially be done using any Virtual Machine platform. If the use of Java was a problem, a new VM platform could be created to meet the security and robustness needs of this project. However, Java has a decade of commercial development and acceptance behind it that a new VM platform would not. Getting the platform accepted would be a difficult task, especially if it was only useful for loading and saving Intelligent Files.

Does an implementation of this software exist?

Not at the time of this writing. I have built early versions of the portable file system (which do work exceptionally well), but I haven't implemented the Intelligent File Format itself. I did consider doing the work before posting this document, but decided against it for two reasons:

  1. This site doesn't provide any sort of web space for uploading the software, and I'm not quite ready to make a full OSS or commercial project out of it.

  2. I wanted to solicit feedback on the concept, and see if anyone has good thoughts on improving upon the design.

Is this design complete?

Most certainly not! While the concept would work fine in theory, there are many tricky aspects to the implementation that need to be worked out. Thus I'd rather start discussion on the concept before taking steps to solidify it in code. If you have thoughts, suggestions, or criticisms you'd like to share, feel free to post a comment or email me. If your comment is interesting enough, I'll even add it to this FAQ.

How do you patch vulnerabilities in embedded JARs? -Cyrus007

The answer to this requires that we first consider the environment that the embedded code will run in. The JAR will be loaded by a secure classloader which will restrict all access to the environment except for a single I/O stream passed in. The permissions to that stream come from a higher level, so the embedded code cannot access other files. Ideally, the embedded code can't access anything other than the payload.

So what are the vectors for attack?

1. The embedded driver could have a vulnerability that could be exploited by data in the file.

Solution: The driver is restricted from accessing anything except the payload data. Thus there's no significant danger to the system. The offending data must have gotten there somehow, meaning that an attacker already had access to the file at an earlier date. Nothing needs to be done.

2. Through an unlikely set of circumstances, data has been downloaded to the file from an external source (such as the Internet) which could exploit a problem in the driver.

Solution: The exploit cannot get anything useful out of the data. It cannot be sent over the network, saved to another location on disk, or otherwise moved out of its sandbox. There's no execution path for non-protected code, either. The only possible application would be to destroy the data inside the file. A rescue utility or patch to the creating program could see to it that these files get updated to prevent data loss. Otherwise, there is no significant risk to the system.

3. There is a flaw in the JVM that allows permissions to be elevated. An attacker could create a virus file that would take over your system.

Solution: There is no logical flaw in the Java security model, so there must be a big oopsy somewhere in the JVM. The JVM needs to be patched against this to prevent both IFF files and Applets from becoming an attack vector. The driver software would remain unchanged.

Beyond those three instances, there are really no other vectors. The code loading the software already has full permissions, so it doesn't need exploits in the driver. Nor can the driver break out of its sandbox without a flaw in the JVM. Since the JVM is the software that must be patched, the driver doesn't need to change. The absolute worst that could happen is that an exploit destroys the payload of a particular file. Nothing else.

Why not just embed a URL reference to the JAR in question? -Tao

Certainly an appealing design. In theory, the JAR data would only be replicated once rather than with every file. The question this raises is, "What if the JAR is inaccessible?" For example, the website could be down or the user could be on a plane. If he's never opened the file before (assuming a caching mechanism), he'll be unable to access the data. Plus, who is going to guarantee that the driver will remain available far into the future? Will this create a situation where the data must be reverse engineered without any remaining code as reference?

Making the file completely self-contained solves these issues. The driver will always be available, and the data can always be read.

If Java disappears, would the data be harder to extract than in a regular file? -KristofU

No. The embedded driver does not add anything to the payload stream that wouldn't exist without the driver. If the ability to run the code were lost to future generations, the header and embedded driver could be thrown away in favor of analyzing the payload directly. More likely, the executable code would be of help to such "data archaeologists." Even if they are incapable of running the code, it provides a roadmap to the file's contents. As long as they have a basic understanding of the Java bytecodes they can rebuild the driver in a new language. If the understanding of Java bytecodes and contemporary processor design was lost, then nothing is lost to the analysis of the payload. The ZIP markers should help clearly identify the driver's beginning and end points.
