The Problem with Standards

Introduction

The computer industry lives by standards. Whether it's the specification for a programming language, a data file format, or an internet protocol, engineers have to deal with standards and their specifications on a regular basis. Productivity is lost when engineers waste time trying to get things working correctly in the face of incorrect or incomplete specs. That costs companies money, makes products late, and keeps users from getting new versions of applications as soon as they otherwise might.

But it's not just engineers who have trouble with bad specs. Users have plenty of problems, too. Applications that are supposed to be able to share data can't. A font which works fine on one system won't work on another. Computers can't talk to each other over a network. Or email messages come through unreadable. There are probably more ways for users to be hurt by bad specs than there are bad specs.

I'll look at some of the causes of these problems and suggest a few ideas for trying to make life better for engineers and users alike. After all, if it's easier for the engineer to get the code working correctly, it's easier for the user to use the application the engineer's developing.

Examples of problem standards

First I'll give some specific examples of problem standards I've run into over the years. These are all real-life examples, but they may not all be problems anymore. Vendors may have actually fixed the problems I describe. I wouldn't bet on it, but it's a possibility.

TIFF

The TIFF specification is one of the first specs I tangled with. We were working on writing a TIFF parser, and wrote it strictly to the specification. If a TIFF didn't match the specification exactly, our code would refuse to read it. Sadly, this needed to change almost immediately, as we discovered that TIFFs produced by Adobe Photoshop (Adobe being the owner of the standard at that time, having bought Aldus, which had originally created it) didn't meet the specification.

What happened was that for LZW-compressed TIFF data, Photoshop would not write the end-of-data marker out correctly. This wasn't a problem for most applications, since they'd read bytes within a scan-line until they'd gotten enough for the line and then move on to the next scan-line. But we'd actually used the sample code provided within the specification, which read until it encountered an end-of-data marker. In Adobe Photoshop-produced TIFFs there was no valid end-of-data marker, so we'd end up reading into the next block, at which point we were lost within the file.
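
To make the difference concrete, here's a minimal sketch of the two reading strategies. It isn't our parser and it isn't libtiff; the names (next_datum, EOD, stream) are made up for illustration, with EOD standing in for the end-of-information code the real TIFF LZW spec defines (code 257).

    #include <stdio.h>

    /* next_datum() stands in for "decode the next value from the compressed
     * stream"; EOD stands in for the TIFF LZW end-of-information code.
     * None of this is libtiff -- the names are made up for illustration. */

    #define EOD (-1)

    /* Two scan-lines' worth of data, back to back. The writer (as Photoshop
     * did at the time) never put a valid EOD after the first line. */
    static const int stream[] = { 10, 20, 30,  40, 50, 60,  EOD };
    static size_t pos = 0;

    static int next_datum(void)
    {
        return stream[pos++];
    }

    /* Strict reader: follows the spec's sample code and reads until EOD. */
    static void read_until_eod(void)
    {
        int d;
        while ((d = next_datum()) != EOD)
            printf("strict: %d\n", d);
    }

    /* Tolerant reader: stops after bytes_per_line values, the way most
     * shipping TIFF readers behaved, so a missing EOD does no harm. */
    static void read_scanline(size_t bytes_per_line)
    {
        size_t i;
        for (i = 0; i < bytes_per_line; i++)
            printf("tolerant: %d\n", next_datum());
    }

    int main(void)
    {
        /* Tolerant parsing: each call consumes exactly one scan-line. */
        read_scanline(3);   /* 10 20 30 */
        read_scanline(3);   /* 40 50 60 */

        /* Strict parsing of "the first scan-line": with no valid EOD after
         * it, this swallows the second line's data too, and a real parser
         * would now be lost in the file. */
        pos = 0;
        read_until_eod();

        return 0;
    }

The fix, in practice, is the second loop: count out a scan-line's worth of data instead of trusting that the writer got the marker right.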

It was a trivially easy problem to discover with our original parser, since it adhered to the standard and had a specific error code for a missing end-of-data marker, and it was easy to fix once I'd discovered what the problem actually was. But it still took extra time to figure out what the correct solution was. And when I reported the problem to Adobe, the engineer in charge just shrugged and said that I was the first person to report it in the two or three years since they'd rewritten the TIFF generation code, so it must not be very important to fix. Because there was no application that would validate TIFFs and say "this is invalid," people had no simple way to test that the TIFF data their application wrote was valid.

TrueType

TrueType is another standard I wrote a parser for, again completely from scratch and based only on the specification. Almost immediately I started to discover TrueType fonts that had bad data within them. Some were missing tables that were required by the standard, others contained tables that weren't defined by the specification I was reading (before I realized there were two different specifications of the TrueType standard), and others just had invalid data within the tables. In almost all cases, the font would work fine on Windows or a Mac if you didn't try to use all the characters within the font, or if you didn't move the font to the other platform.

One of the most frustrating things about TrueType was that while it was co-developed by Apple and Microsoft, there were two or three specifications available most of the time. Apple had their version of the standard, and Microsoft had their own version. And the two defined different tables which could exist within a font. In other words, both companies had started with a well-defined standard and added their own proprietary extensions to it. As a result, there was no standard TrueType font validator available for both Macintosh and Windows.
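
To give a flavor of what the parser ran into, here's a minimal sketch of a required-table check: it walks a font's sfnt table directory and reports any required tables that are missing. It's an illustration rather than the code I actually wrote; the tag list is the one for glyf-based TrueType fonts, and it deliberately leaves out OS/2, which Microsoft's spec requires but Apple's doesn't.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* Required tables for a glyf-based TrueType font. Microsoft's spec also
     * requires OS/2; Apple's doesn't, so that check is left out here. */
    static const char *required[] = {
        "cmap", "glyf", "head", "hhea", "hmtx",
        "loca", "maxp", "name", "post"
    };

    static uint16_t read_u16(FILE *f)
    {
        int hi = fgetc(f), lo = fgetc(f);
        return (uint16_t)(((hi & 0xff) << 8) | (lo & 0xff)); /* sfnt is big-endian */
    }

    int main(int argc, char **argv)
    {
        char seen[sizeof required / sizeof required[0]] = { 0 };
        FILE *f;
        uint16_t numTables, i;
        size_t j;

        if (argc != 2) {
            fprintf(stderr, "usage: %s font.ttf\n", argv[0]);
            return 1;
        }
        f = fopen(argv[1], "rb");
        if (!f) {
            perror(argv[1]);
            return 1;
        }

        fseek(f, 4, SEEK_SET);        /* skip the 4-byte sfnt version */
        numTables = read_u16(f);
        fseek(f, 12, SEEK_SET);       /* table directory starts at offset 12 */

        for (i = 0; i < numTables; i++) {
            char tag[4];
            if (fread(tag, 1, 4, f) != 4)
                break;                /* truncated directory: also bad data */
            fseek(f, 12, SEEK_CUR);   /* skip checksum, offset, length */
            for (j = 0; j < sizeof required / sizeof required[0]; j++)
                if (memcmp(tag, required[j], 4) == 0)
                    seen[j] = 1;
        }

        for (j = 0; j < sizeof required / sizeof required[0]; j++)
            if (!seen[j])
                printf("missing required table: %s\n", required[j]);

        fclose(f);
        return 0;
    }

A real validator would go on to verify checksums, offsets, and cross-table consistency, but even a check this shallow would have flagged the fonts that were missing required tables.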

HTML 2.0

As with TrueType, the problem with HTML 2.0 was that there wasn't a single standard. Well, there was, but the reality was that both Netscape and Internet Explorer had taken the HTML 1.0 standard and added their own proprietary extensions to it. When the W3C tried to make the HTML 2.0 spec, they ended up taking some additions from each company, but the specification was lagging reality, and few people actually paid much attention to it, other than as a starting point.

With HTML 2.0, as with TrueType, two companies were both trying to make their own version of the standard, and the result just hurt users.

K&R C

In the case of K&R C, there was a good book describing the language. The language wasn't exactly simple, but it was small enough that a lot of geeks understood it well enough to be language lawyers. The problem arose when a number of microcomputer implementations of C came out. In many cases, the person (or people) developing the compiler produced an incomplete implementation, either because of limitations in the system they were targeting or because of incomplete knowledge of the language. The only validation suites for C compilers cost a fairly large amount of money, so many of the implementations were non-standard.

Examples of good standards

On the other hand, there are a number of standards which do a good job. The language or file format they describe may not be perfect, but the specification does the job, and the data produced is generally good.

HTML 4.0 and XHTML

After the HTML 2.0 and 3.0 problems, the W3C worked on making a standard that people would actually adhere to. With the help of the browser developers, HTML 3.2 and then HTML 4.0 were developed. HTML 4.0 actually contained multiple standards: a Transitional version, which retained support for many of the older vendor-specific kludges that had found their way into earlier specifications, and a Strict version, which specified the set of tags that would be supported into the future. Vendors could use the Transitional specification for a while, but the future was clearly in the Strict version of the standard.

XHTML grew out of HTML 4.0 reformulated as XML, and it shares the benefits of both well-defined standards. It's still too early to say for sure whether or not we'll see widespread adoption of standard XHTML code, but given the W3C validator for XHTML, it seems likely.

Type 1

Type 1 was initially an Adobe-private format. Over time, third parties reverse engineered the format, but since Adobe controlled the PostScript market, they retained control of the standard. In addition, the wide adoption of Adobe Type Manager (ATM) helped keep the Type 1 font format from wandering. ATM was fairly picky about the fonts it would rasterize, so fonts that didn't meet the standard wouldn't look good on users' screens. That, coupled with the stranglehold Adobe had on the PostScript printer market for many years, meant that Type 1 fonts stayed relatively standard.

Over time, Adobe published the Type 1 font format specification, but again, ATM and PostScript printers served as validation tools. And given the high-end market that Type 1 was serving, users had a strong incentive to use only valid Type 1 fonts.

ANSI C++

I originally wasn't sure whether to cite C++ as a good standard or not. But as I thought about it, I decided that there are enough standard C++ compilers to say that it's a good standard. There are some exceptions (most notably Microsoft's Visual C++ and the standard Windows headers), but many people realize that these exceptions are broken.

The important thing with C++ is that the language is well-defined by the specification, and there are large quantities of source code available which can be used to validate compilers (the STL is a notable example). This isn't quite as good as having a dedicated validation suite (one exists, but I don't know of a thorough one that's free), but it's close enough that companies trying to produce a standard C++ compiler, or to write standard C++ code, can usually manage to do so.

What's the difference?

There are a couple of differences between the good and bad examples I've cited. One of the good examples is an open, committee-developed standard. Another was a proprietary standard that only became public after other companies had reverse-engineered a large portion of it. Of the bad standards I cite, all were made public early in their history. Some were developed by committee, while others were controlled by a single company. Most of the problem standards are extensible, offering developers the opportunity to add features. But extensibility alone isn't the cause of the problems: at least one of the good standards (XHTML) can be extended by developers.

So what causes the difference? In the case of the good standards, there is a readily available application to validate the data described by the specification. In the case of HTML 4.0 and XHTML, the W3C provides a validator which will tell you if you have valid HTML. In contrast, while there was a specification for HTML 2.0, developers didn't follow it because each browser implemented its own superset of HTML 1.0, and 2.0 was cobbled together from reality. Further, there was no validator for HTML 2.0, so people writing web pages tested them by looking at them in various browsers rather than checking the code with a validator first. In the case of TrueType, not only was there no validator available, but there were multiple versions of the spec to worry about. In the case of TIFF or PDF, there's no validator available, and the standards are complex (and extensible) enough that it's easy for people to misinterpret parts of the specification.

Conclusion

From the standards I've cited, it would appear that having a utility available which can validate the data will make a large difference. Most internet protocols fall into this camp as well, since the data transferred by them will often pass through multiple different implementations, rather than staying with a single, perhaps broken implementation of the standard.

But is a validator enough?

I don't think a validator (by itself) is enough to make a standard good. Witness TrueType. Apple has released a font validator for Macintosh, but there are still Macintosh TrueType fonts which don't validate. The problem may just be that the validator isn't widely distributed, but I think there's more to it than that. Since the Mac OS will render all but the most completely broken TrueType fonts without complaining, users have little incentive to demand valid fonts. And without user demand, vendors have little incentive to make sure that their fonts are valid.

Rather than just having a validator available, I think the solution is to have a validator that's ubiquitous, and perhaps even mandatory. In the case of Type 1 fonts, ATM served this purpose. In the case of HTML 4.0, iCab, Internet Explorer, and the W3C's HTML validator are all available. While IE will render non-standard HTML, if the HTML claims to be valid and isn't, IE will likely render it incorrectly on the screen. This gives us a chance of seeing more valid HTML over time.

Copyright 2009, Dave Polaschek. Last updated on Mon, 15 Feb 2010 14:08:58.