Canonical XML

from Wikipedia, the free encyclopedia

Canonical XML defines a canonical form for XML documents, intended to make two such documents easier to compare. To that end, the Canonical XML transformation removes insignificant differences between documents. Every XML document can be converted into canonical form.

For example, XML allows whitespace in certain places within a start tag, and attributes can be specified in any order. Such differences are very rarely, if ever, given meaning. For this reason, the following two forms are generally considered equivalent:

    <p class="a" secure="1">

    <p     secure   = "1"
              class='a'    >

When an XML document is converted to Canonical XML, attributes are sorted alphabetically by name, and whitespace and quotation marks are normalized. The second form above would thus be converted into the first.
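The equivalence of the two forms can be checked with a short sketch. It assumes Python 3.8 or later, whose standard library function `xml.etree.ElementTree.canonicalize` implements the C14N 2.0 variant of Canonical XML. A closing `</p>` tag is added to each form here so that both are complete documents; the variable names are illustrative only.

```python
from xml.etree.ElementTree import canonicalize

# The two forms from the text, each completed with an end tag.
form_a = '<p class="a" secure="1"></p>'
form_b = """<p     secure   = "1"
              class='a'    ></p>"""

canonical_a = canonicalize(form_a)
canonical_b = canonicalize(form_b)

print(canonical_b)                 # <p class="a" secure="1"></p>
print(canonical_a == canonical_b)  # True
```

Both inputs yield the same string, with attributes in name order, double quotes, and no extra whitespace inside the tag.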

Canonical XML specifies a number of other details, some of which are listed here:

  • The UTF-8 character encoding is used,
  • Line ends are represented by the character 0x0A (line feed),
  • Whitespace within attribute values is normalized,
  • Entity references are resolved,
  • CDATA sections are replaced by their character content,
  • Empty elements are encoded with start and end tags (<empty></empty>), not as empty-element tags (<empty/>),
  • Default attributes are specified explicitly,
  • Superfluous namespace declarations are removed.
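A few of these rules can be observed in a small sketch, again assuming Python's `xml.etree.ElementTree.canonicalize` (C14N 2.0, Python 3.8+). The sample document and its element names are made up for illustration, and only the rules for empty elements, CDATA sections, and character references are exercised.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: an XML declaration, an empty-element tag,
# a CDATA section, and hexadecimal/decimal character references.
source = (
    "<?xml version='1.0'?>"
    "<doc>"
    "<empty/>"
    "<text><![CDATA[a < b]]></text>"
    "<ref>&#x41;&#66;</ref>"
    "</doc>"
)

canonical = canonicalize(source)
print(canonical)
# <doc><empty></empty><text>a &lt; b</text><ref>AB</ref></doc>
```

The XML declaration disappears, the empty element gains an explicit end tag, the CDATA section becomes ordinary escaped text, and both character references are replaced by the characters they denote.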

Converting a document to Canonical XML is idempotent. That is, the first conversion may change the document's characters compared to the original, but applying the conversion again produces no further changes.
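Idempotence can be illustrated by canonicalizing a document twice and comparing the results. This is a minimal sketch using Python's `xml.etree.ElementTree.canonicalize` (C14N 2.0, Python 3.8+); the input document is an arbitrary example.

```python
from xml.etree.ElementTree import canonicalize

original = "<r  z='1' a='2'><e/><![CDATA[x & y]]></r>"

once = canonicalize(original)   # first pass changes the text
twice = canonicalize(once)      # second pass should change nothing

print(once != original)  # True
print(once == twice)     # True
```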

According to the W3C, two documents that have the same canonical form can be considered logically equivalent within the given application context (apart from a few rare cases).

However, in special environments, users may attach meaning to differences that lie outside the general notion of logical equivalence associated with Canonical XML. For example, a steganography system could hide information in an XML document by varying whitespace, the quotation marks around attribute values, the order of attributes, the use of hexadecimal versus decimal character references, and so on. Converting such a file to Canonical XML obviously destroys this hidden information. Conversely, XML files that differ only in their use of upper versus lower case, or of old versus new spelling, may be considered equivalent for certain purposes even though their canonical forms differ. Such contexts are outside the scope of Canonical XML.

Software

An implementation of Canonical XML is provided by the program xmllint, which is part of the GNOME project's libxml2 and is also available for Microsoft Windows.

Example usage:

xmllint --c14n SomeXml.xml > CanonicalVersionOf_SomeXml.xml
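A comparable file-to-file conversion can be sketched in Python. This assumes the standard library's `xml.etree.ElementTree.canonicalize` (Python 3.8+), which implements C14N 2.0, so its output may differ in some details from that of `xmllint --c14n`. The helper `canonicalize_file` and the sample file contents are hypothetical and serve only as illustration.

```python
import tempfile
from pathlib import Path
from xml.etree.ElementTree import canonicalize


def canonicalize_file(src: Path, dst: Path) -> None:
    """Write the canonical form of the XML file `src` to `dst` (hypothetical helper)."""
    # newline="\n" keeps line ends as 0x0A regardless of platform.
    with open(dst, "w", encoding="utf-8", newline="\n") as out:
        canonicalize(from_file=str(src), out=out)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "SomeXml.xml"
    dst = Path(tmp) / "CanonicalVersionOf_SomeXml.xml"
    # Illustrative input file.
    src.write_text("<?xml version='1.0'?>\n<root b='2' a='1'/>\n", encoding="utf-8")
    canonicalize_file(src, dst)
    result = dst.read_text(encoding="utf-8")

print(result)
```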
