String

from Wikipedia, the free encyclopedia

In computer science, a character string, character sequence, or simply string is a finite sequence of characters (e.g. letters, digits, special characters and control characters) from a defined character set. Characters may be repeated within a string, and their order is significant. A string can also be empty, i.e. contain no characters and have length 0. Strings are thus sequences of symbols of finite length.

In programming, a string is a data type that holds a sequence of characters of fixed or variable length. It is mainly used to store words, sentences and entire texts. Almost every programming language has such a data type, and some languages work with this data type exclusively; examples are sed, awk and bash. In the source code of a computer program, strings represent text that is not interpreted as program instructions but carries information. For example, error messages or other output to the user can be written as string literals in the source code, and user input can be stored as strings in variables.

The foundations of programming languages are studied in theoretical computer science. There, the given character set is called the alphabet and the strings are called "words". The theory of such words is a subject of formal languages. In the context of programming languages, by contrast, the questions concern the representation, storage and handling of strings.

Representation

Strings can be represented at different levels. One is the source code of a program, which is read and interpreted by the compiler or interpreter. Another is how a string is stored in memory while the program is running.

Syntax for literals

In programming languages, a string literal is generally written by simply placing the characters one after another, enclosed in single or double quotes:

  • "Wikipedia"
  • 'This sentence is a character string.'
  • "123"
  • "" (an empty string)
  • "First solution for including the quotation mark \" in the string."
    (e.g. in C; C strings are always delimited by double quotes, but a quotation mark can still be included as part of the string by means of an escape sequence)
  • 'Second solution for including a '' in the string'
    (doubling the delimiter, e.g. in Pascal or Rexx; Pascal strings are always delimited by single quotes)
  • "Third solution for including a ' in the string"
    (using the other delimiter, e.g. in Rexx or Python)
  • L"Wide String" (a wide string in C; see the section on multibyte characters below)

Such strings usually have to be written on a single line. In some programming languages, however, such as Python, strings delimited by triple quotes can span multiple lines.
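
The quoting options from the list above can be tried out in Python, which supports escape sequences, both delimiters, and triple-quoted literals. The following is only a small sketch; the variable names are chosen for illustration.

```python
# Two ways to put a double quote inside a Python string literal.
escaped = "He said \"hello\""        # escape sequence
other_delimiter = 'He said "hello"'  # use the other delimiter instead

# A triple-quoted literal may span several lines; the line break
# becomes part of the string.
multi_line = """first line
second line"""

print(escaped == other_delimiter)
print(repr(multi_line))
```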

In Algol 60, the beginning and end of a string are marked by different characters; strings can extend across any number of line breaks. One string can also be nested inside another without syntactic ambiguity.

Internal representation

There are several methods of efficiently storing character strings. For example, a character from the character set used can be defined as a terminator. A string then ends before the first occurrence of this character. Another possibility is to save the length of the character string separately.

Representation with terminator

In programming languages such as C, the characters of a string are stored consecutively in memory and terminated with the null character (NUL in ASCII). The null character is the character whose binary representation consists entirely of zeros. The following example shows how a 5-character string could be stored in a 10-byte buffer:

F  R  A  N  K  NUL k  e  f  w
46 52 41 4E 4B 00  6B 65 66 77

The length of the string above is 5, but it occupies 6 bytes in memory. Characters after the NUL character no longer belong to the string; they may belong to another string or simply be unused. A string in C is an array of type char that contains a null character as its end marker. Such strings are therefore also called null-terminated; an older term is ASCIIZ string. Because the null character itself also takes up storage, the memory requirement of a string is always at least one character greater than its usable length. The number of characters before the end marker is referred to as the length of the string. It is determined by the C function strlen().

The advantage of this method is that the length of a string is limited only by the available memory and not additionally by the capacity of a length field. The disadvantages are that the string cannot contain null characters and that it is comparatively awkward and inefficient to handle; for example, the length of such a string can only be determined by counting its characters.
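
As a rough illustration, the 10-byte buffer from the example can be scanned in Python the way strlen() scans a C string. The helper names c_strlen and c_string are invented for this sketch; this is a simulation of the idea, not how C itself is implemented.

```python
# The 10-byte buffer from the example: "FRANK", NUL, then leftover bytes.
buffer = bytes.fromhex("46 52 41 4E 4B 00 6B 65 66 77")


def c_strlen(buf: bytes) -> int:
    """Count the bytes before the first NUL, one by one, like strlen()."""
    length = 0
    while buf[length] != 0:  # raises IndexError if there is no terminator
        length += 1
    return length


def c_string(buf: bytes) -> str:
    """Return the text stored before the first NUL byte."""
    return buf[:c_strlen(buf)].decode("ascii")


length = c_strlen(buffer)
text = c_string(buffer)
print(length, text)
```

The loop makes the cost visible: finding the length takes time proportional to the length of the string, and the bytes after the terminator are simply ignored.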

Representation with a separate length specification

Another way of storing strings is used in the programming languages Pascal, BASIC, PL/I and others:

length F  R  A  N  K  k  e  f  w
05     46 52 41 4E 4B 6B 65 66 77

Strings stored in this way cannot exceed a certain length. In Turbo Pascal, for example, the length is stored in the "zeroth" character. Since a character has 8 bits, the length is limited to 255 characters. The successor language Object Pascal widened the length field to 31 bits and thus supports strings of up to 2 gigabytes. In REXX, too, the length is stored in four bytes, so the length field imposes no tighter limit than the available memory.
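
A minimal sketch of the one-byte length prefix, using the buffer from the example. The function names are hypothetical, and one byte per character (Latin-1) is assumed for simplicity.

```python
MAX_LEN = 255  # largest value that fits into a one-byte length field


def pack_short_string(text: str) -> bytes:
    """Store the length in the first byte, followed by the characters."""
    data = text.encode("latin-1")  # assumption: one byte per character
    if len(data) > MAX_LEN:
        raise ValueError("string too long for a one-byte length field")
    return bytes([len(data)]) + data


def unpack_short_string(buf: bytes) -> str:
    """Read the length byte, then exactly that many characters."""
    length = buf[0]
    return buf[1:1 + length].decode("latin-1")


buffer = bytes.fromhex("05 46 52 41 4E 4B 6B 65 66 77")
print(unpack_short_string(buffer))
```

Unlike the terminator variant, the length is available in constant time, but a 256-character string cannot be stored at all.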

Storage in the pool

Storing strings takes up a lot of memory and is a very common task. Many high-level programming languages therefore use a dedicated management facility to make it as efficient as possible. This facility is not exposed to the application programmer, however; there is usually no way to access it directly or even to determine whether it is active.

All strings are stored in a central "pool". The aim is for each required string to be stored only once. The variable in the application program merely holds an identification number with which the string can be accessed when needed.

The management uses fast and efficient methods (usually a hash table) for its organization. Every time a string is to be stored, it checks whether a string with the same content is already known. If so, the identification number of the existing string is returned; otherwise a new entry has to be created.

Every time a string is stored, its reference counter is incremented by one. When a string is no longer needed at some point in the program (because a subroutine has ended and the literals it contains become meaningless, or because a variable receives a different value), this is reported to the management and the reference counter is decremented by one. This makes it possible to determine which of the stored strings are currently in use: if the reference counter is zero, the string is currently unused. When memory runs short, the management can therefore be reorganized and unneeded strings deleted (garbage collection). This is avoided as far as possible, however, because otherwise the same strings might be allocated anew on every call of a subroutine; advanced implementations also record how often a string is stored and delete only rarely used and long strings.
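
The pool described above might be sketched as follows. StringPool and its methods are hypothetical names for illustration only; real language runtimes implement this internally and differently in detail. A Python dict serves as the hash table.

```python
class StringPool:
    """Toy string pool: each distinct string is stored once and counted."""

    def __init__(self):
        self._ids = {}      # string content -> identification number
        self._strings = {}  # identification number -> string content
        self._refs = {}     # identification number -> reference counter
        self._next_id = 0

    def acquire(self, text: str) -> int:
        """Store text (or find it) and increment its reference counter."""
        ident = self._ids.get(text)
        if ident is None:  # not yet known: create a new entry
            ident = self._next_id
            self._next_id += 1
            self._ids[text] = ident
            self._strings[ident] = text
            self._refs[ident] = 0
        self._refs[ident] += 1
        return ident

    def release(self, ident: int) -> None:
        """Report that one user no longer needs the string."""
        self._refs[ident] -= 1

    def lookup(self, ident: int) -> str:
        return self._strings[ident]

    def refcount(self, ident: int) -> int:
        return self._refs[ident]

    def collect(self) -> int:
        """Delete all strings with counter zero; return how many were removed."""
        unused = [i for i, n in self._refs.items() if n == 0]
        for ident in unused:
            del self._ids[self._strings[ident]]
            del self._strings[ident]
            del self._refs[ident]
        return len(unused)


pool = StringPool()
a = pool.acquire("FRANK")
b = pool.acquire("FRANK")  # same content: same identification number
c = pool.acquire("MEIER")
print(a == b, pool.refcount(a))
```

In this sketch collect() is a separate step, which mirrors the point that deletion is deferred rather than performed as soon as a counter reaches zero.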

In a programming language whose source code is compiled and whose result is stored in an object file, the resulting static strings are usually given a similar table-based management in the data section once all preprocessor operations have been resolved. Here, however, there is neither deletion nor a reference counter. These literals are also not available to the central string management, because with dynamic linking it is not guaranteed that this data section is always loaded.

Multibyte characters

Traditionally, 8 bits, corresponding to one byte, were used to represent a single character, which allows up to 256 different characters. This is not enough to handle the characters of many languages at the same time, especially non-Latin scripts such as Greek.

Programming languages have since come to provide 2 or 4 bytes for storing a single character; consequently, the word byte is avoided in this context and one speaks generally of a char.

Under Microsoft Windows, all system functions that take strings are available in a version with the suffix A (for ANSI, meaning 1 byte per character according to ISO 8859) and in a version with the suffix W (for wide, multibyte). It is easier, however, not to specify this explicitly: if a program is compiled with the corresponding option, all neutral function calls are automatically mapped to the 1-byte or the multibyte variant. For C++ and C there are also preprocessor macros with which all standard functions and literals can be written in a neutral form in the source code; the appropriate function is then selected at compile time. By definition, the historical standard functions in C always process exactly 1 byte per character.

Internally, practically all current programming languages now use several bytes per character and store in them the larger code numbers defined by UCS ("Unicode").

In contrast, a mixed form is used for external communication and for storage in files. To save space in files and transmission time in remote data transmission, texts consisting predominantly of Latin (English) letters are written with 1 byte per character. If the code of a byte is less than 128 (an "ASCII character"), the character is used as it is. If, on the other hand, the byte has a value of 128 or more, it is interpreted as part of a sequence of several bytes that together represent a single character. The standardized format for this is UTF-8 (UTF-16 is likewise a variable-length encoding, but is built from 2-byte units). If such characters are encountered and several bytes per character are available for the internal representation, decoding should take place as early as possible (when the data is read in), because later it can no longer be determined how the byte sequence was meant. The same UTF-8 byte sequences are used when encoding URLs; the wiki link DF%C3%9C leads to "DFÜ".

A proprietary intermediate form was in use on Microsoft systems in the 1990s under the name "Multibyte Character Set". Various formats and encoding and decoding schemes were used to work around the problem of having to cover Asian scripts with 1 byte per character. It is still supported for external data, but internal representations and new developments no longer use it and rely on Unicode instead.

Basic operations with strings

The basic operations on strings, which occur in almost all programming languages, are determining the length, copying, comparing, concatenating, forming substrings, pattern matching, and searching for substrings or individual characters.

In many high-level programming languages, the assignment operator (usually = or :=) is used to copy strings. In C, copying is done with the standard functions strcpy() or memcpy(). How expensive copying is depends heavily on the representation of the strings. With reference counting, copying consists only of incrementing the reference counter. With other methods, memory may have to be allocated and the string copied in full.

Comparing strings for equality and inequality is supported by many high-level programming languages with the operators = and <> or !=. In some languages such as Pascal, a lexicographic comparison can also be made with < and >. Where these operators are not available, functions are used. The standard C function strcmp() has three possible results: equal, greater, or less. The first character is the most significant. There are also more elaborate comparison functions that take upper and lower case, the sorting of umlauts and so on into account. This matters when searching in dictionaries and telephone books.
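
A three-way comparison in the spirit of strcmp() could look like the sketch below. The function names are made up for illustration, and the result is normalized to -1, 0 or 1, whereas strcmp() itself only guarantees a negative, zero or positive value. The case-insensitive variant relies on Python's str.casefold() and does not attempt locale-specific dictionary ordering.

```python
def compare(a: str, b: str) -> int:
    """Lexicographic three-way comparison: -1 (less), 0 (equal), 1 (greater)."""
    for ca, cb in zip(a, b):
        if ca != cb:  # the first differing character decides
            return -1 if ca < cb else 1
    # All shared positions are equal: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


def compare_ignoring_case(a: str, b: str) -> int:
    """Same comparison after folding upper and lower case together."""
    return compare(a.casefold(), b.casefold())


print(compare("FRANK", "FRANZ"), compare("Zebra", "apple"),
      compare_ignoring_case("Zebra", "apple"))
```

The last two calls show why the plain code-point comparison is unsuitable for dictionaries: "Zebra" sorts before "apple" unless case is taken into account.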

For concatenation, many programming languages provide operators such as + (BASIC, Pascal, Python, Java, C++), & (Ada, BASIC), . (Perl, PHP), .. (Lua) or || (REXX). In C, the function strcat() serves this purpose.

To append to an existing string, some languages provide a dedicated operator (+= in Java and Python, .= in Perl and PHP). Usually the operand is not simply appended in place; instead the expression old + new is evaluated and the result assigned to the variable old, since strings are generally regarded as immutable, so the operator is merely a shorthand. However, many modern programming languages such as Java, C# or Visual Basic .NET offer so-called string builder classes that represent mutable strings. As a rule, a string and a string builder are not interchangeable but have to be converted into one another.
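
The effect of immutability can be observed in Python. In this sketch io.StringIO stands in for a string builder; Python has no class of that name, so the analogy is an assumption made for illustration.

```python
import io

old = "FRANK"
alias = old        # a second name for the same string object
old += "ENSTEIN"   # builds a new string and rebinds the name old

# A mutable text buffer playing the role of a string builder.
builder = io.StringIO()
for part in ("FRANK", "EN", "STEIN"):
    builder.write(part)       # appends to the buffer
built = builder.getvalue()    # explicit conversion back to an immutable str

print(alias, old, built)
```

Because alias still refers to the original object, it keeps the value "FRANK" after the += operation.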

String literals written directly one after another (with or without whitespace between them) are implicitly concatenated in some languages (C, C++, Python, REXX).

There are several ways to obtain a substring. A substring can be specified unambiguously by (string, start index, end index) or by (string, start index, length). This operation is often called substr(). Some programming languages, for example Python, offer syntactic sugar for this operation (see examples).

PL/SQL

In Oracle, the following basic operations are possible in stored procedures, functions and PL/SQL blocks:

DECLARE
 Text1 varchar2(30);
 Text2 varchar2(30);
 Text3 varchar2(61);
BEGIN
 Text1 := 'Frank';
 Text2 := 'Meier';
 Text3 := Text1 || ' ' || Text2;
END;
/

BASIC

 text$ = "FRANK"
 text2$ = text$

The trailing dollar sign indicates a string variable. Since a string is delimited by quotation marks, a quotation mark can only be included in a string by means of the Chr(34) or CHR$(34) function; 34 is the ASCII code of the quotation mark.

Several strings can (depending on the BASIC dialect) be combined with the plus sign or with the ampersand "&" to form one ("concatenated"):

 text2$ = "***" + text$ + "***"
 text2$ = "***" & text$ & "***"

C

This C program defines two string variables, each of which can hold 5 characters of "payload". Since strings are terminated with a null character, each array must have 6 elements. The text "FRANK" is then copied into both variables.

#include <string.h>

int main(void)
{
  char text1[6];
  char text2[6];

  strcpy(text1, "FRANK");
  strcpy(text2, text1);

  return 0;
}

The standard function strcat() is used to append one string to another. It does not, however, allocate the storage required for the target string; this must be done separately beforehand.

#include <string.h>

int main(void)
{
  char puffer[128]; // destination buffer that is large enough

  strcpy(puffer, "FRANK");
  strcat(puffer, "ENSTEIN");

  return 0;
}

Java

String text1 = "FRANK";
String text2 = text1;

Strings in Java are objects of the class String. They cannot be changed after they have been created. In the example above, text1 and text2 refer to the same object.

The concatenation of character strings is carried out by the plus operator (overloaded in this case):

String text1 = "FRANK";
String text2 = "ENSTEIN";
String ganzerName = text1 + text2;

Pascal

(Strictly speaking, the following only works since Turbo Pascal, because the Pascal language as originally created by Niklaus Wirth only knew packed arrays of char, which were somewhat more cumbersome to use.)

var vorname, nachname, name: string;
{… …}
vorname := 'FRANK';
nachname := 'MEIER';
name := vorname + ' ' + nachname;

PHP

In PHP the situation is similar to Perl.

$text = "FRANK";

$text2 = $text; // $text2 is now "FRANK"

$text3 = <<<HEREDOC
I am a longer text with quotation marks such as " or '
HEREDOC;

Strings are concatenated with a dot (.).

$text = "FRANK";
$text = "FRANK" . "ENSTEIN"; // $text is now "FRANKENSTEIN"

$text = "FRANK";
$text .= "ENSTEIN"; // $text is now "FRANKENSTEIN"

Rexx

In Rexx, everything, including numbers, is represented as a string. A string value is assigned to a variable like this: a = "Ottos Mops". The following expressions each yield the value "Ottos Mops":

  • "Ottos" "Mops"
    (implicitly concatenated; exactly one space is automatically inserted)
  • "Ottos" || ' Mops'
    (explicitly concatenated, no space inserted)
  • "Ottos"' Mops'
    (implicitly concatenated by directly appending another string that is delimited by the other delimiter)

Further operations

Determine substrings

Assume the variable s contains the string Ottos Mops hopst fort. Then the first character (O), the first five characters (Ottos), the seventh to tenth (Mops) and the last four (fort) can be determined as follows:

Python

  • s[0] ⇒ O
  • s[:5] or s[0:5] or s.split()[0] ⇒ Ottos
  • s[6:10] or s.split()[1] ⇒ Mops
  • s[-4:] or s.split()[-1] ⇒ fort

This process is called slicing (from “to slice” meaning “cut into slices” or “split up”). The first character has the index 0.
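
The Python slices listed above can be checked with a short script. The helper substr is a hypothetical function added here to show how the (string, start index, length) form maps onto a slice with a 0-based start index.

```python
s = "Ottos Mops hopst fort"


def substr(text: str, start: int, length: int) -> str:
    """(string, start index, length) form, expressed as a slice."""
    return text[start:start + length]


first_char = s[0]
first_word = s[:5]
second_word = s[6:10]    # end index 10 is exclusive
last_word = s[-4:]       # negative indices count from the end
words = s.split()        # split at whitespace into a list of words

print(first_char, first_word, second_word, last_word, substr(s, 6, 4))
```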

Rexx

  • SubStr(s, 1, 1) or Left(s, 1) ⇒ O
  • Left(s, 5) or Word(s, 1) ⇒ Ottos
  • SubStr(s, 7, 4) or Word(s, 2) ⇒ Mops
  • Right(s, 4) or Word(s, 4) ⇒ fort

Rexx can also process strings word by word, where words are separated by (any number of) spaces. As with Pascal strings, the first character has the index 1.

  • PARSE VAR s A 2 1 O M . F   ⇒ Variables A, O, M, F contain 'O', 'Ottos', 'Mops', 'fort'

This process is called tokenizing (from English "token", here meaning roughly "piece" or "chunk") and is also a standard function in other languages.

PHP

  • substr($s, 0, 5) ⇒ Ottos
  • substr($s, 6, 4) ⇒ Mops
  • substr($s, -4) ⇒ fort
  • for further examples, see the official PHP manual

BlitzBasic

  • Left(s, 5) ⇒ Ottos
  • Mid(s, 7, 4) ⇒ Mops
  • Right(s, 4) ⇒ fort

Object Pascal

  • s[1] ⇒ O
  • Copy(s, 1, 5) ⇒ Ottos
  • Copy(s, 7, 4) ⇒ Mops

Using the StrUtils unit:

  • LeftStr(s, 5) ⇒ Ottos
  • MidStr(s, 7, 4) ⇒ Mops
  • RightStr(s, 4) ⇒ fort

Algorithms

Various algorithms operate primarily on character strings.

Today, programmers usually no longer write algorithms of this kind themselves but use language constructs or library functions.

Buffer Overflow - Strings and Computer Security

Whenever strings from the outside world are taken over into the internal representation, special precautions should be taken. In addition to checking for unwanted control characters and the formatting, the maximum length of the string must also be checked.

Example: an international telephone number is to be read from a file. It should contain only digits and be separated from the address by a tab character (ASCII 9). A fixed-length string of 16 characters is provided to hold it, which is sufficient for all valid telephone numbers. The input data could, however, contain spaces or hyphens that make the telephone number longer. The result also exceeds 16 characters if the number is accidentally followed by a space, which looks the same, instead of a TAB.

If this is not caught by suitable checks and handled appropriately, a buffer overflow occurs, possibly followed by a program crash or mysterious subsequent errors.

Buffer overflows are among the most common attack vectors on web servers. The attacker tries to assign to a string variable a content whose length exceeds the space the compiler reserved for the variable in memory. This overwrites neighbouring variables in memory. If this effect is skilfully exploited, a program running on a server can be manipulated and used to attack the server. Merely crashing the server software can be enough, however: if that software is supposed to guard the network connection ("gateway"), its failure opens a gap that leaves a weakly secured server defenceless against any manipulation.

Unless validity has already been ensured in a controlled environment, string operations should only be carried out with functions that check the maximum length of the string. In C these are functions such as strncpy() and snprintf() (instead of strcpy() and sprintf()).
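
The checks from the telephone number example might be sketched as follows. Python strings cannot overflow a buffer the way a C array can, so this only illustrates the validation step; read_phone_number and MAX_PHONE_LEN are hypothetical names chosen for the sketch.

```python
MAX_PHONE_LEN = 16  # size of the fixed-length field from the example


def read_phone_number(line: str) -> str:
    """Return the phone number that precedes the first TAB in line.

    Raises ValueError instead of silently accepting malformed input.
    """
    number, separator, _address = line.partition("\t")
    if not separator:
        raise ValueError("missing TAB separator")
    if len(number) > MAX_PHONE_LEN:
        raise ValueError("phone number too long")
    if not (number.isascii() and number.isdigit()):
        raise ValueError("phone number may contain digits only")
    return number


print(read_phone_number("4930123456\tMain Street 1"))
```

A line in which the number is followed by a space instead of a TAB is rejected outright rather than being copied into the 16-character field.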

Notes and references

  1. Every single character in every computer system has a numerical value (usually a binary value) by which it can be compared with another character as equal, greater or smaller. For detecting duplicates in a set of strings, only equality matters; but if the comparison function also yields the results greater and smaller according to the rules of a total order, duplicate detection can be done much more efficiently with binary search.
  2. Because the significance of the characters decreases with increasing address, the comparison operators of various programming languages, as well as the C functions strcmp(), strncmp() and memcmp(), compare in big-endian style on every machine.
  3. php.net - Strings, according to the official PHP manual