URL encoding

from Wikipedia, the free encyclopedia

URL encoding (also called percent-encoding) is a mechanism for encoding information in a URL under certain conditions. Only certain characters of the ASCII character set are used for the encoding.

Without this encoding, some information could not be represented in a URL. For example, a space is usually interpreted by the browser as the end of the URL; subsequent characters would be ignored or cause an error. With URL encoding, a space can be transmitted as the character string %20. RFC 3986 defines how a URI (and thus also a URL) must be structured syntactically and under which conditions URL encoding is applied.

Percent-encoding is also used for characters that are not contained in the ASCII character set. So far, however, RFC 3986 only makes a recommendation for this case; a binding standard is still lacking.

Reserved and unreserved characters

URLs can consist of the following parts:

https://maxmuster:geheim@www.example.com:8080/index.html?p1=A&p2=B#ressource
\___/   \_______/ \____/ \_____________/ \__/\_________/ \_______/ \_______/
  |         |       |           |         |       |          |         |
Scheme      |    Password      Host      Port    Path      Query    Fragment
           User
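
As a rough illustration, the same breakdown can be reproduced with the urlsplit function from Python's standard urllib.parse module. Python is chosen here only for illustration, and the dictionary keys are hypothetical labels that mirror the diagram.

```python
from urllib.parse import urlsplit

# Example URL from the diagram above.
url = "https://maxmuster:geheim@www.example.com:8080/index.html?p1=A&p2=B#ressource"

parts = urlsplit(url)

# Each entry corresponds to one labelled segment of the diagram.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,        # parsed as an integer
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name:9}{value}")
```

The delimiters themselves (://, :, @, ?, #) are consumed by the parser and do not appear in the component values.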

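Certain characters within this expression mark and separate the individual segments of the URL and make it possible to break the expression down into its parts and process it. In the HTTP URL above, for example, ? introduces the query string, & separates the individual parameters, = separates a parameter name from its value, and # introduces the fragment.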

Other characters have specific meanings in the document path. The following characters are reserved:

  • : / ? # [ ] @ ! $ & ' ( ) * + , ; =

The following characters are not reserved and therefore have no predefined meaning in a URL:

  • Letters: A–Z, a–z
  • Digits: 0–9
  • - . _ ~

Percent representation

A URL consists of the reserved and unreserved characters listed above. It cannot contain any other characters. In principle, however, it must be possible to represent arbitrary byte sequences in URLs, i.e. all values between 0 and 255. In addition, there must be a way to write reserved characters in a URL so that they lose their special meaning (see also escape sequence).

The percent representation of characters meets both requirements. It assigns a three-character sequence to each byte value: the percent sign followed by the two-digit hexadecimal representation of the value.

A reserved character must be written in percent-encoded form in a URL if it has a special meaning at the position where it occurs but is not meant to have that meaning in the given context. Unreserved characters can, but should not, be percent-encoded. For all other characters (including binary data) there is usually no option other than to write them in percent-encoded form (exception: the reserved character + in place of a space in the query string).
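
The rule can be sketched in a few lines of Python. The function name percent_encode is a hypothetical helper introduced here for illustration; it conservatively encodes every byte that is not an unreserved character, which is stricter than necessary for reserved characters that keep their special meaning.

```python
from urllib.parse import quote

# Unreserved characters as listed above: letters, digits, "-", ".", "_", "~".
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode(data: bytes) -> str:
    """Percent-encode every byte that is not an unreserved character."""
    out = []
    for byte in data:
        char = chr(byte)
        if char in UNRESERVED:
            out.append(char)
        else:
            # "%" followed by the two-digit hexadecimal value of the byte.
            out.append(f"%{byte:02X}")
    return "".join(out)


print(percent_encode(b"a b#c"))          # a%20b%23c
print(percent_encode(bytes([0, 255])))   # %00%FF

# The standard library offers the same behaviour when no extra
# characters are declared safe.
print(quote("a b#c", safe=""))           # a%20b%23c
```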

Example:

In ASCII, the character # is assigned the hexadecimal character code 23. The expression %23 is therefore the percent-encoded form of the character #.

The interpretation of:

http://www.example.net/index.html?session=A54C6FE2#info

is unambiguous. A URL parameter named session is defined, to which the value A54C6FE2 is assigned, and the document anchor named info is specified. In this context, the character # has the special meaning that it is followed by the name of a document anchor. If it is to lose this meaning, i.e. if the value A54C6FE2#info is to be assigned to the URL parameter session, the character # must appear in percent-encoded form in the URL:

http://www.example.net/index.html?session=A54C6FE2%23info
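
The difference between the two URLs can be checked with Python's urllib.parse module. This is only a sketch; session_and_anchor is a hypothetical helper name, not an established API.

```python
from urllib.parse import parse_qs, urlsplit


def session_and_anchor(url: str) -> tuple[str, str]:
    """Return the decoded value of the 'session' parameter and the anchor."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # percent-decodes the parameter values
    return params["session"][0], parts.fragment


# Literal "#": it ends the query string and starts the anchor.
print(session_and_anchor("http://www.example.net/index.html?session=A54C6FE2#info"))

# Encoded "%23": it stays part of the parameter value; there is no anchor.
print(session_and_anchor("http://www.example.net/index.html?session=A54C6FE2%23info"))
```

The first call yields the value A54C6FE2 with the anchor info; the second yields the value A54C6FE2#info with an empty anchor.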

In practice, this mechanism is not always applied consistently. There are cases, however, in which it is required, for example when an anchor is addressed through a dereferrer service.

Relevant ASCII characters in percent representation

SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
%20 %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C %2D %2E %2F

:   ;   <   =   >   ?   @   [   \   ]   {   |   }
%3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %7B %7C %7D

(SP = space)

Non-ASCII characters

For characters that are not contained in the ASCII character set, the bytes are likewise encoded with a preceding %. Which byte sequence represents a character depends on the character encoding used. RFC 3986 recommends UTF-8 for the encoding, since this Unicode format covers all international characters. This makes UTF-8 the de facto standard encoding for URIs, although there is no explicit standard yet. To encode the URL correctly, one has to know or guess which character encoding was used for the requested file or which encoding the target computer uses. For this reason, it still makes sense to use only characters from the ASCII repertoire.

In the recommended UTF-8 encoding, for example, the letter "ö" (decimal Unicode code point 246) is represented as %C3%B6. UTF-8 represents all code points above 127 as sequences of two or more bytes, and each of these bytes is percent-encoded. Most Latin letters with diacritics are represented with two bytes. CJK characters, for example, require three or more bytes.

Sometimes ISO 8859-1 (Latin-1) is still used for the representation, and its identical decimal character value 246 is inserted directly into the URL in percent-encoded form. The umlaut "ö" is then represented as %F6.
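
Both forms can be reproduced with the encoding parameter of urllib.parse.quote in Python. This is a small sketch for illustration; the variable names are arbitrary.

```python
from urllib.parse import quote, unquote

# The same character, percent-encoded under two different character encodings.
utf8_form = quote("ö", encoding="utf-8")
latin1_form = quote("ö", encoding="iso-8859-1")

print(utf8_form)    # %C3%B6
print(latin1_form)  # %F6

# Decoding with the wrong character encoding does not fail here,
# but it silently yields different text.
print(unquote(utf8_form, encoding="iso-8859-1"))  # Ã¶
```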

The two representations transmit different byte sequences to the server. Although each is correctly encoded according to its own scheme, only one of them delivers the desired file, and the other usually produces only an error message. Some servers, such as those of Wikipedia, attempt to determine the encoding so that the correct file can still be delivered. If one encoding does not work, one of the other likely variants can be tried.

Uniqueness of the character decoding

Individually encoded ASCII characters (e.g. %23 for #) are encoded identically in ASCII, UTF-8 and most other common encodings such as ISO 8859-15.

The encoding is ambiguous for byte values from 128 to 255: the bytes are either a UTF-8 code sequence (or the beginning of one) or belong to an encoding with a limited repertoire of 256 characters, such as ISO 8859-15. Because UTF-8 permits only certain sequences of consecutive bytes, such limited encodings and UTF-8 can be told apart with reasonable probability: %C3%B6 is almost certainly the character "ö" in UTF-8 (and not the character string "Ã¶" in ISO 8859-15).
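
A minimal sketch of such a heuristic in Python: try strict UTF-8 first and fall back to a single-byte encoding only if the bytes are not valid UTF-8. The function name decode_guess and the choice of ISO 8859-15 as the fallback are assumptions made for illustration; real servers may use other strategies.

```python
from urllib.parse import unquote_to_bytes


def decode_guess(encoded: str) -> str:
    """Percent-decode, preferring UTF-8 and falling back to ISO 8859-15."""
    raw = unquote_to_bytes(encoded)
    try:
        # Strict decoding rejects byte sequences that are not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: a single-byte encoding accepts any byte value.
        return raw.decode("iso8859_15")


print(decode_guess("%C3%B6"))  # valid UTF-8 sequence
print(decode_guess("%F6"))     # lone byte 0xF6 is not valid UTF-8
print(decode_guess("%23"))     # plain ASCII, identical in both encodings
```

Both of the first two calls yield "ö". The heuristic can still guess wrong: text that really was ISO 8859-15 and happens to form a valid UTF-8 sequence would be decoded as UTF-8.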

Form encoding

The MIME type application/x-www-form-urlencoded can be used to identify URL-encoded data. When web form data is submitted using the POST method, this MIME type is specified as the content type. For historical reasons, the encoding does not exactly match the encoding used in URLs; in particular, a space is not encoded as %20 but as a single +.
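
The difference can be seen with Python's urllib.parse module, where urlencode produces form encoding and quote produces the URL variant. The field names and values below are made up for illustration.

```python
from urllib.parse import parse_qs, quote, urlencode

# Hypothetical form fields.
fields = {"name": "Max Muster", "note": "a+b"}

# Form encoding: a space becomes "+", so a literal "+" must become %2B.
body = urlencode(fields)
print(body)                 # name=Max+Muster&note=a%2Bb

# URL-style percent-encoding of the same value: a space becomes %20.
print(quote("Max Muster"))  # Max%20Muster

# Decoding the form body restores the original values.
print(parse_qs(body))
```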

External links

  • RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax
  • RFC 3987 - Internationalized Resource Identifiers (IRIs): a clearly distinguishable alternative for representing URIs with Unicode characters, using an extended variant of URL encoding
  • URL encoding tool

References

  1. HTML 4.01 Specification: 17.13.4 Form content types. December 24, 1999.