The Man Also Known As "I'm Batman!": The Intelligent File Format: Part 3

Category: Conceptual Design

FAQ

Can a producer "cheat" by not documenting their interfaces?

Yes and no. There's nothing in this design to prevent someone from failing to document their interfaces or providing documentation for a crippled interface while keeping the more robust one a secret. However, the matter of reverse engineering becomes orders of magnitude simpler. Rather than trying to guess at the boundaries of fields or decompile a software package, you can now use reflection to investigate and document the classes and methods available in the object representation of the data. In this way, the format can be quickly and easily reverse engineered.
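
To give a feel for how little effort that investigation takes, here is a minimal sketch using the standard java.lang.reflect API. The SecretDocument class is a made-up stand-in for whatever undocumented object a driver might return; it is not part of any real format.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class InterfaceLister {
    /** Returns one line per public method declared directly on the class, sorted. */
    static List<String> describe(Class<?> type) {
        List<String> lines = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            // Skip private/protected members and compiler-generated methods.
            if (!Modifier.isPublic(m.getModifiers()) || m.isSynthetic()) continue;
            StringBuilder sb = new StringBuilder();
            sb.append(m.getReturnType().getSimpleName()).append(' ');
            sb.append(m.getName()).append('(');
            Class<?>[] params = m.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) sb.append(", ");
                sb.append(params[i].getSimpleName());
            }
            lines.add(sb.append(')').toString());
        }
        Collections.sort(lines);
        return lines;
    }

    // Hypothetical stand-in for an undocumented document object.
    static class SecretDocument {
        public String getTitle() { return "untitled"; }
        public int getPageCount() { return 0; }
        public void setTitle(String title) { }
        private void internalOnly() { }
    }

    public static void main(String[] args) {
        for (String line : describe(SecretDocument.class)) {
            System.out.println(line);
        }
    }
}
```

The listing only recovers names and signatures. Working out what each method actually does would still take experimentation, but the field boundaries no longer have to be guessed.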

Can the embedded software access a non-payload portion of the file?

If a RandomAccess implementation allowed it to, then yes. Considering the accidental damage this could do to the file, it is important that the RandomAccess implementation passed to the main class be constrained to the payload area of the file. As far as the embedded software is aware, the file starts and ends with the payload area.
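
A rough sketch of that constraint follows. The RandomAccess interface shown here is a guess at the shape of the one from the design (its real methods may differ), PayloadWindow is a hypothetical name, and a byte array stands in for the file on disk.

```java
/** Hypothetical stand-in for the RandomAccess interface handed to a driver. */
interface RandomAccess {
    long length();
    int read(long position);
    void write(long position, int value);
}

/** Exposes only [start, start + length) of the backing store, re-based to position 0. */
class PayloadWindow implements RandomAccess {
    private final byte[] file;   // whole file: header + driver + payload
    private final int start;     // where the payload begins in the real file
    private final int length;    // payload size in bytes

    PayloadWindow(byte[] file, int start, int length) {
        if (start < 0 || length < 0 || start > file.length - length) {
            throw new IllegalArgumentException("window does not fit inside the file");
        }
        this.file = file;
        this.start = start;
        this.length = length;
    }

    public long length() { return length; }

    public int read(long position) {
        return file[start + check(position)] & 0xFF;
    }

    public void write(long position, int value) {
        file[start + check(position)] = (byte) value;
    }

    // Rejects any position outside the payload before it is translated.
    private int check(long position) {
        if (position < 0 || position >= length) {
            throw new IndexOutOfBoundsException("outside payload: " + position);
        }
        return (int) position;
    }
}
```

The driver sees position 0 as the first payload byte and has no way to express an offset into the header or the embedded JAR.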

Won't this technology bloat the file sizes?

It will certainly increase them, yes. However, not all file types are good candidates for this scheme. A good rule of thumb: if the loader software is going to take up more than 20% of the file on average, then the format may be trivial enough to not warrant the use of the Intelligent File Format.
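
As a quick worked example of that rule of thumb, the sketch below computes the driver's share of a file. The class and method names are made up for illustration, and the small header is ignored as a simplifying assumption.

```java
class DriverOverhead {
    static final double THRESHOLD = 0.20;   // the 20% rule of thumb

    /** Fraction of the file occupied by the embedded driver (header ignored). */
    static double driverShare(long driverBytes, long payloadBytes) {
        long total = driverBytes + payloadBytes;
        if (driverBytes < 0 || payloadBytes < 0 || total == 0) {
            throw new IllegalArgumentException("sizes must be non-negative and not both zero");
        }
        return (double) driverBytes / total;
    }

    /** True when the driver takes up no more than the threshold share of the file. */
    static boolean worthEmbedding(long driverBytes, long payloadBytes) {
        return driverShare(driverBytes, payloadBytes) <= THRESHOLD;
    }

    public static void main(String[] args) {
        // A 150 KB driver next to a 2 MB payload versus a 100 KB payload.
        System.out.println(worthEmbedding(150L * 1024, 2048L * 1024));
        System.out.println(worthEmbedding(150L * 1024, 100L * 1024));
    }
}
```

With a 150 KB driver, a typical 2 MB document is about 7% driver, while a typical 100 KB document would be 60% driver. Since the rule speaks of the average file, the sizes fed in should be typical ones for the format rather than a single sample.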

Will this replace all file formats including text files and XML?

No. Most textual formats are already accessible enough as-is. Replacing them with an Intelligent File Format would only serve to invalidate the entire tool chain that has been built up over the years. Intelligent Files are much better suited to replacing binary formats such as office documents, where the format is more difficult to reverse engineer.

Can Intelligent Files be embedded inside Intelligent Files?

Many types of documents allow for other types of documents to be embedded inside their data streams. The most famous example of this is Windows OLE (Object Linking and Embedding).

Thankfully, there is absolutely nothing preventing an embedded driver from calling up the loading software and passing in a subsection of its own stream. This would work even with maximum security restrictions in place. However, such implementations need to be careful with RandomAccess files. If the embedded file grows beyond its existing bounds, it may overwrite other data in the parent file. Thus more complex implementations would need to consider some sort of paging system to ensure that any growth beyond the end of an embedded file is relocated to an empty section of the parent.
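
The sketch below illustrates the idea with java.nio.ByteBuffer, which is only a convenient stand-in here; the design itself talks about RandomAccess streams. A parent carves out a bounded view of its own storage and hands that to the embedded file. The subsection helper and the sizes are assumptions chosen for the example.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

class EmbeddedStreams {
    /** Returns a view of parent[offset, offset + length) that shares storage but has its own bounds. */
    static ByteBuffer subsection(ByteBuffer parent, int offset, int length) {
        ByteBuffer view = parent.duplicate();
        view.clear();                  // position = 0, limit = capacity; contents untouched
        view.limit(offset + length);   // throws IllegalArgumentException if past the parent's end
        view.position(offset);
        return view.slice();           // index 0 of the slice is parent index `offset`
    }

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(16);             // stand-in for the outer file
        ByteBuffer outerPayload = subsection(file, 4, 8);      // outer document's payload
        ByteBuffer embedded = subsection(outerPayload, 2, 4);  // slot for the embedded document

        embedded.put(new byte[] {1, 2, 3, 4});                 // fits exactly
        try {
            embedded.put((byte) 5);                            // would spill into the parent's data
        } catch (BufferOverflowException e) {
            System.out.println("embedded file is full; the parent must relocate it");
        }
    }
}
```

Refusing the write is the simple half of the problem. The paging system mentioned above would sit in that catch block, finding a free section in the parent and moving the embedded file there before retrying.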

What about temp file access?

As of this writing, I have not fully considered the implications for large binary data files. A common method of handling such files is to swap decoded data in and out of temporary files. However, the security restrictions placed on the embedded software would preclude any sort of temp files from being accessed. Until this is solved, formats would need to page data in and out of the payload area. With CPUs being as fast as they are, and Virtual Memory Paging systems being as sophisticated as they are today, this may not be much of an issue.

Is Java an absolute requirement?

No, it is not. The concept could potentially be implemented on any Virtual Machine platform. If the use of Java were a problem, a new VM platform could be created to meet the security and robustness needs of this project. However, Java has a decade of commercial development and acceptance behind it that a new VM platform would not. Getting the platform accepted would be a difficult task, especially if it were only useful for loading and saving Intelligent Files.

Does an implementation of this software exist?

Not at the time of this writing. I have built early versions of the portable file system (which do work exceptionally well), but I haven't implemented the Intelligent File Format itself. I did consider doing the work before posting this document, but decided against it for two reasons:

1. Blogger.com doesn't provide any sort of web space for uploading the software, and I'm not quite ready to make a full OSS or Commercial project out of it.

2. I wanted to solicit feedback on the concept, and see if anyone has good thoughts on improving upon the design.

Is this design complete?

Most certainly not! While the concept would work fine in theory, there are many tricky aspects to the implementation that need to be worked out. Thus I'd rather start discussion on the concept before taking steps to solidify the concept in code. If you have thoughts, suggestions, or criticisms you'd like to share, feel free to post a comment or email me at akaimbatman@gmail.com. If your comment is interesting enough, I'll even add it to this FAQ.

How do you patch vulnerabilities in embedded JARs? -Cyrus007

The answer to this requires that we first consider the environment that the embedded code will run in. The JAR will be loaded by a secure classloader which will restrict all access to the environment except for a single I/O stream passed in. The permissions to that stream come from a higher level, so the embedded code cannot access other files. Ideally, the embedded code can't access anything other than the payload.
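
One small piece of such a loader can be sketched as a class loader that refuses to resolve anything outside an explicit allowlist. To be clear, this is an illustrative assumption and not the design's actual mechanism: AllowlistClassLoader is a made-up name, name filtering alone is far weaker than a policy-enforced sandbox, and a real loader would also have to define the driver's classes from the embedded JAR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Refuses to resolve any class that is not on an explicit allowlist. */
class AllowlistClassLoader extends ClassLoader {
    private final Set<String> allowed;

    AllowlistClassLoader(ClassLoader parent, String... allowedClassNames) {
        super(parent);
        this.allowed = new HashSet<>(Arrays.asList(allowedClassNames));
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        // Everything the driver links against passes through here, so anything
        // not listed (java.io.File, java.net.Socket, ...) never becomes visible.
        if (!allowed.contains(name)) {
            throw new ClassNotFoundException("blocked by sandbox: " + name);
        }
        return super.loadClass(name, resolve);
    }

    /** Convenience probe: true if the named class resolves through this loader. */
    boolean canLoad(String name) {
        try {
            loadClass(name);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        AllowlistClassLoader loader = new AllowlistClassLoader(
                ClassLoader.getSystemClassLoader(),
                "java.lang.Object", "java.lang.String", "java.io.InputStream");
        System.out.println(loader.canLoad("java.io.InputStream"));
        System.out.println(loader.canLoad("java.net.Socket"));
    }
}
```

The allowlist has to include the basics such as java.lang.Object, or the driver's own classes would fail to link. The single I/O stream would then be handed to the driver by the host, which is the only party holding real file permissions.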

So what are the vectors for attack?

1. The embedded driver could have a vulnerability that could be exploited by data in the file.

Solution: The driver is restricted from accessing anything except the payload data. Thus there's no significant danger to the system. The offending data must have gotten there somehow, meaning that an attacker already had access to the file at an earlier date. Nothing needs to be done.

2. Through an unlikely set of circumstances, data has been downloaded to the file from an external source (such as the Internet) which could exploit a problem in the driver.

Solution: The exploit cannot get anything useful out of the data. It cannot be sent over the network, saved to another location on disk, or otherwise moved out of its sandbox. There's no execution path for non-protected code, either. The only possible application would be to destroy the data inside the file. A rescue utility or patch to the creating program could see to it that these files get updated to prevent data loss. Otherwise, there is no significant risk to the system.

3. There is a flaw in the JVM that allows permissions to be elevated. An attacker could create a virus file that would take over your system.

Solution: There is no logical flaw in the Java security model, so there must be a big oopsy somewhere in the JVM. The JVM needs to be patched against this to prevent both IFF files and Applets from becoming an attack vector. The driver software would remain unchanged.

Beyond those three instances, there are really no other vectors. The code loading the software already has full permissions, so it doesn't need exploits in the driver. Nor can the driver break out of its sandbox without a flaw in the JVM. Since the JVM is the software that must be patched, the driver doesn't need to change. The absolute worst that could happen is that an exploit could destroy the payload for a particular file. Nothing else.

Why not just embed a URL reference to the JAR in question? -Tao

Certainly an appealing design. In theory, the JAR data would only be replicated once rather than with every file. The question this raises is, "What if the JAR is inaccessible?" For example, the website could be down or the user could be on a plane. If he's never opened the file before (assuming a caching mechanism), he'll be unable to access the data. Plus, who is going to guarantee that the driver will remain available far into the future? Will this create a situation where the data must be reverse engineered without any remaining code as reference?

Making the file completely self-contained solves these issues. The driver will always be available, and the data can always be read.

If Java disappears, would the data be harder to extract than in a regular file? -KristofU

No. The embedded driver does not add anything to the payload stream that wouldn't exist without the driver. If the ability to run the code were lost to future generations, the header and embedded driver could be thrown away in favor of analyzing the payload directly. More likely, the executable code would be of help to such "data archaeologists." Even if they are incapable of running the code, it provides a roadmap to the file's contents. As long as they have a basic understanding of Java bytecode, they can rebuild the driver in a new language. If the understanding of Java bytecode and contemporary processor design were lost, the analysis of the payload would be no worse off than it would be for a regular file. The ZIP markers should help clearly identify the driver's beginning and end points.
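
As a rough illustration of how those markers could be used, the sketch below scans a file for the ZIP local file header signature ("PK" followed by bytes 3 and 4) and the end-of-central-directory signature ("PK" followed by bytes 5 and 6). The ZipMarkers class is hypothetical, and the scan is a heuristic: it assumes the first signatures found belong to the driver, which a careful tool would verify against the offsets stored in the central directory.

```java
class ZipMarkers {
    private static final byte[] LOCAL_HEADER = {'P', 'K', 3, 4};  // start of the first ZIP entry
    private static final byte[] END_OF_DIR = {'P', 'K', 5, 6};    // end-of-central-directory record
    private static final int EOCD_FIXED_SIZE = 22;                // record size without its comment

    /** Returns {start, endExclusive} of the embedded archive, or null if none is found. */
    static int[] locateDriver(byte[] file) {
        int start = indexOf(file, LOCAL_HEADER, 0);
        if (start < 0) return null;
        int eocd = indexOf(file, END_OF_DIR, start);
        if (eocd < 0 || eocd + EOCD_FIXED_SIZE > file.length) return null;
        // The last two bytes of the fixed record hold the comment length (little-endian).
        int commentLength = (file[eocd + 20] & 0xFF) | ((file[eocd + 21] & 0xFF) << 8);
        int end = eocd + EOCD_FIXED_SIZE + commentLength;
        return end <= file.length ? new int[] {start, end} : null;
    }

    // Plain forward search for a byte pattern, starting at index `from`.
    private static int indexOf(byte[] data, byte[] pattern, int from) {
        for (int i = from; i <= data.length - pattern.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) j++;
            if (j == pattern.length) return i;
        }
        return -1;
    }
}
```

Everything before the returned start would be the header, and everything from the returned end onward would be the payload, which is the part the archaeologists actually care about.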



Monday, February 20, 2006
