Fixing HTML

Douglas Crockford

2007-11-28

HTML needs fixing. The HTML 4 recommendation was published in 1999. Since then, the web has grown from a document retrieval system into an application delivery system. We have made significant progress in that time, thanks to the cleverness of the web development community and the surprising expressive power of JavaScript, but we are at the limits. HTML is no longer a driver of innovation. It is now a serious impediment.

There are good ideas in HTML, but many of these were discarded in the XHTML effort. My thinking is that we should take a step back and refocus. The problems with HTML will not be solved by making it bigger and more complicated. I think instead we should generalize what it does well, while excising features that are problematic. HTML can be made into a general application delivery format without disrupting its original role as a document format.

The new language I am proposing is not totally compatible with HTML 4. But HTML 4 was not totally compatible with HTML 3, and XHTML would not have been totally compatible anyway, so that's OK.

This is my proposal for a kinder, gentler HTML 5.

HTML

The <html> tag gets an optional version attribute. If its value is 5, then the following HTML 5 rules apply. If it is 4 or if the attribute is missing, then the HTML 4 rules apply.

<html version=5>

No more doctypes.

Script

There is only one scripting language allowed on a page. This is to simplify the addition of new languages to the browser, eliminating the need to unify object models and memory models. It also paves the way for replacing JavaScript with a secure programming language. No security would be gained if an insecure language could be mixed with a secure one. The language is selected by specifying the content-script-type. The default is application/ecmascript.

<meta http-equiv=content-script-type content=application/ecmascript>

<script> tags do not specify a type or language. They are direct children of <head> or <body>. <script> tags are not immediately executed. <script> tags do not block loading of other assets. When the </head> is reached, all of the scripts that were included in the head are then executed in order. When the </body> is reached, all of the scripts that were included in the body are then executed in order.
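The loading rule above can be modeled as a pair of queues. This is only a sketch of the proposed behavior, not a browser API: createScriptQueue, add, and close are hypothetical names, and each script is represented by a plain function.

```javascript
// Hypothetical model of the proposed loading rule; not a real browser API.
function createScriptQueue() {
  const pending = { head: [], body: [] };
  return {
    // The parser meets a <script>: record it, but do not run it yet.
    add(section, run) {
      pending[section].push(run);
    },
    // The parser meets </head> or </body>: run that section's scripts
    // in the order in which they appeared.
    close(section) {
      const scripts = pending[section];
      pending[section] = [];
      for (const run of scripts) {
        run();
      }
    }
  };
}

const log = [];
const queue = createScriptQueue();

queue.add("head", () => log.push("head script 1"));
log.push("parser continues"); // adding a script does not block parsing
queue.add("head", () => log.push("head script 2"));
queue.close("head");          // </head> reached

queue.add("body", () => log.push("body script 1"));
queue.close("body");          // </body> reached

console.log(log);
```

In this model nothing runs at the point where a script is declared, so the parser entry is logged before either head script.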

No more document.write. No more in-page event handlers. No more javascript: urls.

Frames

No more framesets, frames, or iframes. The security properties of these were problematic. Instead we'll have modules.

Modules

<module> creates a sub-tree which can contain a document with a communication channel. See http://json.org/module.html for a description. I would replace its communication mechanism with the common capability communication mechanism that I am advocating for Google Gears and Adobe AIR.

CSS

The default CSS content needs to be standardized. See for example http://developer.yahoo.com/yui/reset/. The browsers should have more commonality in their default styling.

A getElementsByCSSSelector method allows for collecting elements based on CSS selector notation.

Major improvements are needed in CSS which are beyond the scope of this document. CSS could also benefit from a simplifying refocusing. It would benefit from a constraint system that could be smarter about alignment, layout, and screen management.

Encoding

The only character encoding permitted in HTML 5 is UTF-8. The allowance of a multitude of encodings with default and discovered encodings exposes users to security exploits and reduces the integrity of documents. It is not unusual for the stated encoding of a document to not match the encoding of its content. A single encoding will make it easier to get it right. The expansion of Asian text under UTF-8 can be mitigated by gzipping.
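The size trade-off can be measured directly. The sketch below assumes a Node.js runtime (Buffer and the zlib module) and uses a made-up sample string. The sample is highly repetitive, so it overstates how well real text compresses.

```javascript
// Assumes Node.js; the sample text is invented for illustration.
const zlib = require("node:zlib");

// Eight characters, each of which needs 3 bytes in UTF-8 and 2 in UTF-16.
const sample = "日本語のテキスト".repeat(200);

const utf8Bytes = Buffer.byteLength(sample, "utf8");
const utf16Bytes = Buffer.byteLength(sample, "utf16le");
const gzippedBytes = zlib.gzipSync(Buffer.from(sample, "utf8")).length;

console.log({ utf8Bytes, utf16Bytes, gzippedBytes });
```

For this sample the UTF-8 form is half again as large as the UTF-16 form before compression, and the gzipped UTF-8 form is smaller than either.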

Entities

HTML 5 is strict in the formulation of HTML entities. In the past, some browsers have been too forgiving of malformed entities, exposing users to security exploits. Browsers should not perform heroics to try to make bad content displayable. Such heroics result in security vulnerabilities.
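One way to picture strictness is a check that rejects any ampersand that does not begin a complete entity. The proposal does not define what counts as well formed, so the rule below is an assumption: a name or a decimal or hexadecimal number, always terminated by a semicolon. findMalformedEntities is a hypothetical helper, and it does not check that a name is actually defined.

```javascript
// Assumed rule: "&" must start &name; or &#digits; or &#xhex;
// Any other "&" is reported as malformed.
const MALFORMED_AMPERSAND =
  /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/g;

// Returns the index of every "&" that does not begin a complete entity.
function findMalformedEntities(text) {
  return [...text.matchAll(MALFORMED_AMPERSAND)].map((match) => match.index);
}

console.log(findMalformedEntities("a &lt; b"));  // complete entity
console.log(findMalformedEntities("a &lt b"));   // missing semicolon
console.log(findMalformedEntities("&#x3C;p>"));  // complete hex reference
```

A strict browser following this rule would refuse the second input rather than guess that &lt was meant to be &lt;.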

Empty Tags

The <empty/> tag form is allowed, but not required for <br> or <hr>. The empty form can be used by <script src="url"/> tags.

Custom Tags

Custom HTML tags have always been allowed in HTML. In HTML 5 they become first class.

CSS can be used to style custom tags.

mymenubar {display: div; width: 100%;}

The display style attribute can take a tag name. The custom tag then takes on the characteristics of the named tag.

The getElementsByTagName method can be used to collect custom tags.

Custom Attributes

Custom HTML attributes have always been allowed in HTML. In HTML 5 they become first class.

The getElementsByAttribute method can be used to collect elements that carry a custom attribute. It can take one or two parameters. The first parameter is the name of an attribute. The optional second parameter is a matching value.
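The one-or-two-parameter behavior might look like the following. This is a sketch only: it takes the root as an explicit first argument and walks plain objects of the form { tag, attributes, children } instead of a real DOM, and it assumes that, like other getElementsBy methods, it searches descendants and not the root itself.

```javascript
// Hypothetical sketch over plain objects, not a real DOM implementation.
function getElementsByAttribute(root, name, value) {
  const matches = [];
  const visit = (node) => {
    const attrs = node.attributes || {};
    const hasName = Object.prototype.hasOwnProperty.call(attrs, name);
    // With one parameter, presence is enough; with two, the value must match.
    if (hasName && (value === undefined || attrs[name] === value)) {
      matches.push(node);
    }
    (node.children || []).forEach(visit);
  };
  // Assumption: only descendants of the root are searched.
  (root.children || []).forEach(visit);
  return matches;
}

const page = {
  tag: "body",
  attributes: {},
  children: [
    {
      tag: "mymenubar",
      attributes: { role: "menu" },
      children: [
        { tag: "div", attributes: { role: "item", state: "open" }, children: [] }
      ]
    },
    { tag: "p", attributes: {}, children: [] }
  ]
};

console.log(getElementsByAttribute(page, "role").map((el) => el.tag));
console.log(getElementsByAttribute(page, "role", "item").map((el) => el.tag));
```

Results come back in document order, so the first call lists mymenubar before the div it contains.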

It is not necessary to quote attribute values that contain only digits, letters, and these special characters: + - * % . : _. Quoting is still a good practice.
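The quoting rule is simple enough to state as a pattern. In this sketch needsQuotes is a hypothetical helper, "letters" is assumed to mean ASCII letters, and an empty value is assumed to need quotes.

```javascript
// Digits, ASCII letters (an assumption), and + - * % . : _
const UNQUOTED_VALUE = /^[A-Za-z0-9+\-*%.:_]+$/;

// True when the value contains anything outside the permitted set,
// or is empty (an assumption; the rule does not mention empty values).
function needsQuotes(value) {
  return !UNQUOTED_VALUE.test(value);
}

console.log(needsQuotes("5"));         // false
console.log(needsQuotes("100%"));      // false
console.log(needsQuotes("menu item")); // true, contains a space
console.log(needsQuotes("text/css"));  // true, contains a slash
```

Note that the slash is not among the listed characters, so under this reading URLs and media types must be quoted.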

That's It

These changes significantly improve the reliability, security, and performance of HTML applications. The simplification of the language reduces the cost of training web developers. It incorporates the best practices of Ajax development. It provides extensibility without complexity. The deltas from HTML 4 are generalizations and reductions, which should make browser implementation more straightforward. This is particularly important for mobile devices that cannot tolerate the power demands of complex platforms. The only new feature here is the module, which is critical for security. Modules make safe mashups possible.