HTML

This is an old revision of this page, as edited by Martinvie (talk | contribs) at 09:48, 5 June 2006 (→‎See also). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

An excerpt of HTML code with syntax highlighting

In computing, HyperText Markup Language (HTML) is a markup language designed for the creation of web pages with hypertext and other information to be displayed in a web browser. HTML is used to structure information, denoting certain text as headings, paragraphs, lists and so on, and can be used to describe, to some degree, the appearance and semantics of a document. HTML's grammar is defined by the HTML DTD, which is written using SGML syntax.

Originally defined by Tim Berners-Lee and further developed by the IETF, HTML is now an international standard (ISO/IEC 15445:2000). Later HTML specifications are maintained by the World Wide Web Consortium (W3C).

Early versions of HTML were defined with looser syntactic rules, which helped its adoption by those unfamiliar with web publishing. Web browsers commonly made assumptions about intent and proceeded with rendering of the page. Over time, the trend in the official standards has been to create an increasingly strict language syntax; however, browsers continue to render pages that are far from valid HTML.

XHTML, which applies the stricter rules of XML to HTML to make it easier to process and maintain, is the W3C's successor to HTML. As such, many consider XHTML to be the "current version" of HTML, but it is a separate, parallel standard; the W3C continues to recommend the use of either XHTML 1.1, XHTML 1.0, or HTML 4.01 for web publishing.

Version history of the standard

HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML. HTML's successor, XHTML, is a separate language that began as a reformulation of HTML 4.01 using XML 1.0. It continues to be developed:

  • XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and republished August 1, 2002. It offers the same three flavors as HTML 4.0 and 4.01, reformulated in XML, with minor restrictions.
  • XHTML 1.1, published May 31, 2001 as a W3C Recommendation. It is based on XHTML 1.0 Strict, but includes minor changes and is reformulated using modules from Modularization of XHTML, which was published April 10, 2001 as a W3C Recommendation.
  • XHTML 2.0 is still a W3C Working Draft.

There is no official standard HTML 1.0 specification because there were multiple informal HTML standards at the time. However, some people consider the initial edition provided by Tim Berners-Lee to be the definitive HTML 1.0. That version did not include an IMG element type. Work on a successor for HTML, then called "HTML+", began in late 1993, designed originally to be "A superset of HTML…which will allow a gradual rollover from the previous format of HTML". The first formal specification was therefore given the version number 2.0 in order to distinguish it from these unofficial "standards". Work on HTML+ continued, but it never became a standard.

The HTML 3.0 standard was proposed by the newly formed W3C in March 1995, and provided many new capabilities such as support for tables, text flow around figures, and the display of complex math elements. Even though it was designed to be compatible with HTML 2.0, it was too complex at the time to be implemented, and when the draft expired in September 1995 work in this direction was discontinued due to lack of browser support. HTML 3.1 was never officially proposed, and the next standard proposal was HTML 3.2 (code-named "Wilbur"), which dropped the majority of the new features in HTML 3.0 and instead adopted many browser-specific element types and attributes which had been created for the Netscape and Mosaic web browsers. Math support as proposed by HTML 3.0 finally came about years later with a different standard, MathML.

HTML 4.0 likewise adopted many browser-specific element types and attributes, but at the same time began to try to "clean up" the standard by marking some of them as deprecated, and suggesting they not be used.

Minor editorial revisions to the HTML 4.0 specification were published as HTML 4.01.

The most common filename extension for files containing HTML is .html. However, older operating systems, such as DOS, limit file extensions to three letters, so a .htm extension is also used. Although perhaps less common now, the shorter form is still widely supported by current software.

Markup element types

Below are the kinds of markup element types in HTML.

  • Structural markup. Describes the purpose of text. For example,
<h2>Golf</h2>
directs the browser to render "Golf" as a second-level heading, similar to the "Markup element types" title at the start of this section. Structural markup does not denote any specific rendering, but most web browsers have standardised on how elements should be formatted. By default, for example, headings like these will appear in large, bold text. Further styling should be done with Cascading Style Sheets (CSS).
  • Presentational markup. Describes the appearance of the text, regardless of its function. For example,
<b>boldface</b>
will render "boldface" in bold text. In the majority of cases, using presentational markup is inappropriate, and presentation should be controlled by using CSS. In the case of both <b>bold</b> and <i>italic</i> there are elements which usually have an equivalent visual rendering but are more semantic in nature, namely <strong>strong emphasis</strong> and <em>emphasis</em> respectively. It is easier to see how an aural user agent should interpret the latter two elements. Note that many presentational markup elements were deprecated in the HTML 4.0 specification, in favour of CSS-based style design.
  • Hypertext markup. Links parts of the document to other documents. For example,
<a href="http://wikipedia.org/">Wikipedia</a>

will render the word Wikipedia as a hyperlink to the specified URL.
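
The three kinds of markup can also be told apart mechanically. The following is a minimal Python sketch, not taken from any HTML specification: the tag groupings in STRUCTURAL, PRESENTATIONAL and HYPERTEXT, and the names MarkupClassifier and classify, are assumptions chosen for illustration. It relies only on the standard library's html.parser module.

```python
from html.parser import HTMLParser

# Illustrative groupings only (an assumption of this sketch); they are
# not an exhaustive list drawn from any HTML specification.
STRUCTURAL = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "ul", "ol", "li", "em", "strong"}
PRESENTATIONAL = {"b", "i", "font", "center", "u"}
HYPERTEXT = {"a"}


class MarkupClassifier(HTMLParser):
    """Collect a (tag, kind) pair for every start tag in a fragment."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag in HYPERTEXT and dict(attrs).get("href"):
            kind = "hypertext"
        elif tag in STRUCTURAL:
            kind = "structural"
        elif tag in PRESENTATIONAL:
            kind = "presentational"
        else:
            kind = "other"
        self.found.append((tag, kind))


def classify(fragment):
    """Return the (tag, kind) pairs found in an HTML fragment."""
    parser = MarkupClassifier()
    parser.feed(fragment)
    parser.close()
    return parser.found


if __name__ == "__main__":
    sample = (
        "<h2>Golf</h2>"
        "<p><b>boldface</b> and <strong>strong emphasis</strong></p>"
        '<a href="http://wikipedia.org/">Wikipedia</a>'
    )
    for tag, kind in classify(sample):
        print(tag, kind)
```

In this sketch an a element counts as hypertext markup only when it carries an href attribute, since that attribute is what links it to another document.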

The Document Type Definition

In order to specify which version of the HTML standard they conform to, all HTML documents should start with a Document Type Declaration (informally, a "DOCTYPE"), which makes reference to a Document Type Definition (DTD). For example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
                      "http://www.w3.org/TR/html4/strict.dtd">

This declaration asserts that the document conforms to the Strict DTD of HTML 4.01, which is purely structural, leaving formatting to Cascading Style Sheets. In some cases, the presence or absence of an appropriate Document Type Declaration may influence how a web browser displays the page.

In addition to the Strict DTD, HTML 4.01 provides Transitional and Frameset DTDs. The Transitional DTD was intended to gradually phase in the changes made in the Strict DTD, while the Frameset DTD was intended for those documents which contained frames.
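
As an illustration of how software might tell the three HTML 4.01 DTDs apart, the following Python sketch reads the Document Type Declaration and looks up its public identifier. The names DoctypeReader and html401_flavour are hypothetical, and the lookup covers only HTML 4.01; real browsers apply more elaborate rules.

```python
from html.parser import HTMLParser

# Public identifiers of the three HTML 4.01 DTDs.
HTML401_DTDS = {
    "-//W3C//DTD HTML 4.01//EN": "Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "Transitional",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "Frameset",
}


class DoctypeReader(HTMLParser):
    """Remember the first DOCTYPE declaration seen in a document."""

    def __init__(self):
        super().__init__()
        self.declaration = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", for example
        # 'DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "..."'.
        if self.declaration is None and decl.upper().startswith("DOCTYPE"):
            self.declaration = decl


def html401_flavour(document):
    """Return "Strict", "Transitional" or "Frameset", or None if unknown."""
    reader = DoctypeReader()
    reader.feed(document)
    reader.close()
    if reader.declaration is None:
        return None
    for public_id, flavour in HTML401_DTDS.items():
        # Match the quoted identifier so that "Strict" cannot match a
        # longer identifier by accident.
        if f'"{public_id}"' in reader.declaration:
            return flavour
    return None


if __name__ == "__main__":
    page = (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
        '                      "http://www.w3.org/TR/html4/strict.dtd">\n'
        "<title>Example</title><p>Hello</p>"
    )
    print(html401_flavour(page))
```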

Separation of style and content

With the advent and refinement of CSS and the increasing support for it in web browsers, subsequent editions of HTML increasingly stress only using markup that suggests the structure of the document, like headings, paragraphs, block quoted text, and tables, instead of using markup which is written for visual purposes only, like <font>, <b> (bold), and <i> (italics). Some of these elements are not permitted in certain varieties of HTML, like HTML 4.01 Strict. CSS provides a way to separate document structure from the content's presentation, by keeping all code dealing with presentation defined in a CSS file. See separation of style and content.
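
A simple audit can show where a page still relies on presentational markup. The Python sketch below is illustrative only: the names PresentationalAudit and audit are hypothetical. The two sets reflect the HTML 4.01 DTDs, in which font, center, u, s, strike and basefont are deprecated and absent from Strict, while b, i, tt, big and small remain permitted.

```python
from html.parser import HTMLParser

# Presentational elements deprecated in HTML 4.01 and absent from the Strict DTD.
NOT_IN_STRICT = {"basefont", "center", "font", "s", "strike", "u"}
# Presentational elements that HTML 4.01 Strict still permits.
ALLOWED_IN_STRICT = {"b", "big", "i", "small", "tt"}


class PresentationalAudit(HTMLParser):
    """Record presentational start tags in document order."""

    def __init__(self):
        super().__init__()
        self.not_in_strict = []
        self.allowed_in_strict = []

    def handle_starttag(self, tag, attrs):
        if tag in NOT_IN_STRICT:
            self.not_in_strict.append(tag)
        elif tag in ALLOWED_IN_STRICT:
            self.allowed_in_strict.append(tag)


def audit(fragment):
    """Return (tags not allowed in Strict, presentational tags still allowed)."""
    checker = PresentationalAudit()
    checker.feed(fragment)
    checker.close()
    return checker.not_in_strict, checker.allowed_in_strict


if __name__ == "__main__":
    print(audit('<p><font color="red">Sale</font> <b>today</b> <i>only</i></p>'))
```

Elements in the first list would need to be replaced, typically by CSS rules, before the page could validate against HTML 4.01 Strict.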

Publishing HTML with HTTP

The World Wide Web is primarily composed of HTML documents transmitted from a web server to a web browser using the HyperText Transfer Protocol (HTTP). However, HTTP can be used to serve images, sound and other content in addition to HTML. To allow the web browser to know how to handle the document it received, an indication of the file format of the document must be transmitted along with the document. This vital metadata includes the MIME type (text/html for HTML 4.01 and earlier, application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see Character encodings in HTML).

In modern browsers, the MIME type that is sent with the HTML document affects how the document is interpreted. A document served with the XHTML MIME type, application/xhtml+xml, is expected to be well-formed XML, and a syntax error may cause the browser to fail to render the document. The same document served with the HTML MIME type, text/html, might still be displayed, since web browsers are more lenient with HTML.

If the MIME type is not recognized as HTML, the web browser should not attempt to render the document as HTML, even if the document is prefaced with a correct Document Type Declaration. Nevertheless, some web browsers do examine the contents or URL of the document and attempt to infer the file type. Such behaviour is discouraged due to security problems; even the most notorious offender, Internet Explorer, has mostly abandoned the practice in recent versions (as of 2005).
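
The effect of the MIME type can be sketched in Python as follows. This is a simplified model, not a description of how any particular browser works: the function names parse_content_type and handle_document and the returned status strings are assumptions made for this example. The standard library's strict XML parser stands in for XHTML processing, and its lenient HTML parser stands in for HTML processing.

```python
import xml.etree.ElementTree as ET
from email.message import Message
from html.parser import HTMLParser


def parse_content_type(header_value):
    """Return (mime_type, charset) from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()


def handle_document(content_type_header, body):
    """Decide how to treat a document based only on its declared MIME type."""
    mime_type, _charset = parse_content_type(content_type_header)
    if mime_type == "application/xhtml+xml":
        # XHTML must be well-formed XML; any syntax error is fatal.
        try:
            ET.fromstring(body)
        except ET.ParseError:
            return "error: not well-formed XML"
        return "rendered as XHTML"
    if mime_type == "text/html":
        # The HTML parser tolerates unclosed and mismatched tags.
        parser = HTMLParser()
        parser.feed(body)
        parser.close()
        return "rendered as HTML"
    # Any other type is not treated as HTML, whatever the body contains.
    return "not rendered as HTML"


if __name__ == "__main__":
    sloppy = "<p>unclosed <b>text"
    print(handle_document("application/xhtml+xml", sloppy))
    print(handle_document("text/html; charset=UTF-8", sloppy))
```

The same malformed body is rejected under application/xhtml+xml but accepted under text/html, which mirrors the difference in leniency described above.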

HTML e-mail

Some graphical e-mail clients allow the use of a subset of HTML (often ill-defined) as a pure display language. Many of these clients include a GUI HTML editor for composing emails and a rendering engine for displaying them once received. Use of HTML in email is controversial for several reasons. The main benefit is the ability to decorate an email with presentational attributes (bold headings, etc.). However, there are a number of disadvantages, which include:

  • the recipient may not have an email client that can display HTML
  • the email is larger, because heavily formatted HTML takes up much more space than the plain text equivalent. This issue is made slightly worse by the fact that, for compatibility, most clients send a plain text version as well.
  • overuse of formatting (there was at one stage a craze for making letterheads using HTML and sending them as part of every e-mail)
  • potential security issues from deceiving the recipient into accepting an email as being from an authoritative source (such as a bank) when this is not the case; this is related to phishing scams.
  • potential security issues from simply rendering a complex format like HTML, particularly if object, embed, iframe or script elements are included and parsed.
  • potential privacy issues when embedding external content such as images, which can alert a third party that an email has been read (some e-mail clients do not load external images by default for this reason).

For these reasons, many mailing lists deliberately block HTML email, either stripping out the HTML part to leave just the plain text part or rejecting the entire message.
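
The combination of a plain text part and an HTML part, and the stripping performed by some mailing lists, can be sketched with Python's standard email package. The function names build_message and plain_text_only and the subject line are hypothetical; an actual mailing list manager may work quite differently.

```python
from email.message import EmailMessage


def build_message(plain_text, html_text):
    """Build a message carrying both a plain text and an HTML version."""
    message = EmailMessage()
    message["Subject"] = "Example"  # placeholder subject for this sketch
    message.set_content(plain_text)  # becomes the text/plain part
    # Adding an alternative converts the message to multipart/alternative.
    message.add_alternative(html_text, subtype="html")
    return message


def plain_text_only(message):
    """Return the plain text body, or None if the message has none."""
    part = message.get_body(preferencelist=("plain",))
    return None if part is None else part.get_content()


if __name__ == "__main__":
    mail = build_message("Hello", "<p><b>Hello</b></p>")
    print(mail.get_content_type())
    print(plain_text_only(mail))
```

A list that strips HTML could forward only the result of plain_text_only, and one that rejects HTML outright could refuse any message for which that function returns None.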

HTML as a hypertext format

HTML is the basis of a comparatively weak hypertext implementation. Earlier hypertext systems had features such as typed links, transclusion and source tracking. Another feature lacking today is fat links.

Even some hypertext features that were in early versions of HTML, such as the link element and editable web pages, have so far been ignored by most popular web browsers.

Sometimes web services or browser manufacturers remedy these shortcomings. For instance, members of the modern social software landscape such as wikis and content management systems allow surfers to edit the web pages they visit.

See also: Jakob Nielsen on advanced hypertext for the World Wide Web.
