sed (Unix)

from Wikipedia, the free encyclopedia

sed stands for Stream EDitor and is a Unix tool for editing text data streams. The data stream can also be read from a file. Unlike with a text editor, the original file is not changed.

In contrast to an interactive text editor such as vi, sed is controlled by means of a script.
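The two modes of operation can be sketched as follows. This is a minimal illustration that assumes a POSIX shell and the widely available mktemp utility; the file name demo.txt and the variable names are hypothetical.

```shell
# Work in a throwaway directory; demo.txt is a hypothetical file name.
tmpdir=$(mktemp -d)
printf 'hello world\n' > "$tmpdir/demo.txt"

# Input comes from a stream (stdin); the script is the quoted argument.
from_stream=$(printf 'hello world\n' | sed 's/world/sed/')

# Input comes from a file; the result goes to stdout, the file is not modified.
from_file=$(sed 's/world/sed/' "$tmpdir/demo.txt")
unchanged=$(cat "$tmpdir/demo.txt")

printf '%s\n' "$from_stream" "$from_file" "$unchanged"
rm -r "$tmpdir"
```

The last line printed is still "hello world", because sed only writes the edited text to its output.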

The sed command set is based on that of the line-oriented text editor ed. According to the POSIX specification, a particular kind of regular expression, the so-called (POSIX) basic regular expressions (BRE), is used for pattern matching. The GNU implementation uses GNU BREs, which differ slightly from POSIX BREs.

Even though the sed language appears quite limited and specialized, it is Turing-complete. This can be proven by programming a Turing machine in sed or by writing an interpreter for another Turing-complete language in sed.

As a result, games such as Sokoban or Arkanoid and other sophisticated programs such as debuggers could be, and have been, written in sed.

Working method

sed can work both within a pipeline and on files. Output always goes to <stdout>, error messages to <stderr>. A typical call therefore looks like this:

sed 'instruction1
     instruction2
     …
     instructionN' inputfile > outputfile
<stream> | sed 'instruction1
                instruction2
                …
                instructionN' | <stream>

sed reads an input file (or an input stream on <stdin>) line by line. Each input line first ends up in the so-called pattern space. The instructions of the specified program are executed one after the other on this pattern space. Each instruction can change the pattern space; the following instructions then operate on the result of the previous one. If an instruction deletes the pattern space (as the d command does), processing stops at that point and the instruction list starts again from the beginning with the next input line. Otherwise, the result of the last instruction is written to <stdout>, and the instruction list likewise starts again with the next input line.
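The processing cycle can be observed with two small experiments. This is a sketch with made-up input; it uses the standard -e option to pass several instructions, and the d command as the way to discard the pattern space.

```shell
# Each instruction sees the result of the previous one:
# "cat" becomes "dog", and the second instruction then turns "dog" into "bird".
chained=$(printf 'cat\n' | sed -e 's/cat/dog/' -e 's/dog/bird/')

# "d" discards the pattern space and starts the next cycle at once,
# so the substitution after it is never reached for the line "drop".
deleted=$(printf 'keep\ndrop\nkeep\n' | sed -e '/drop/d' -e 's/keep/kept/')

printf '%s\n' "$chained" "$deleted"
```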

Programming

sed instructions can be roughly divided into three groups: text manipulation, branches and others. (Most sed manuals, as well as the POSIX specification, divide instructions into 2-address, 1-address and addressless instructions, see below, but this grouping is not suitable for introductory purposes.)

Text manipulation

This is by far the most frequently used function, and the instruction set is particularly extensive here. An instruction generally has the following structure (2-address command):

<address1>,<address2> command [options]

address1 and address2 can both be omitted. If both addresses are specified, command is executed for every line from the first one that matches address1 up to the one that matches address2. If neither address1 nor address2 is specified, command is executed for every line; if only address2 is omitted, command is executed only for lines that match address1. An address is either a line number or a regular expression. Regular expressions are enclosed in two / characters. Two examples:

sed '/Beginn/,/Ende/ s/alt/NEU/' inputfile
Input       Output
x alt       x alt
Beginn      Beginn
y alt       y NEU
Ende        Ende
z alt       z alt

"alt" is replaced by "NEU", but only from the line that contains "Beginn" up to the line that contains "Ende" (2-address variant). In the second example, by contrast, the same replacement is carried out in all lines that begin with "y" or "z" (1-address variant):

sed '/^[yz]/ s/alt/NEU/' inputfile
Input       Output
x alt       x alt
Beginn      Beginn
y alt       y NEU
Ende        Ende
z alt       z NEU
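Both examples use regular expressions as addresses. Line numbers work the same way, as the following sketch suggests; it reuses the sample text from above, and the variable names are illustrative only.

```shell
input='x alt
Beginn
y alt
Ende
z alt'

# 1-address form with a line number: only line 3 is edited.
one_line=$(printf '%s\n' "$input" | sed '3 s/alt/NEU/')

# 2-address form with line numbers: lines 3 through 5 are edited.
line_range=$(printf '%s\n' "$input" | sed '3,5 s/alt/NEU/')

printf '%s\n' "$one_line" '---' "$line_range"
```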

Compound commands

Instead of a single command, an instruction can also contain a list of instructions enclosed in { … }. The rules described above apply to these instructions as well; they can themselves consist of further compound commands. An example:

sed '/^[yz]/ {
               s/^\([yz]\)/(\1)/
               s/alt/NEU/
             }' inputfile
Input       Output
x alt       x alt
Beginn      Beginn
y alt       (y) NEU
Ende        Ende
z alt       (z) NEU
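Nesting of compound commands can be sketched in the same way. In this illustrative variant (the " (last)" suffix is made up for the demonstration), the inner block has its own address and therefore only applies to a subset of the lines selected by the outer block.

```shell
nested=$(printf 'x alt\nBeginn\ny alt\nEnde\nz alt\n' | sed '/^[yz]/ {
    s/alt/NEU/
    /^z/ {
        s/$/ (last)/
    }
}')

printf '%s\n' "$nested"
```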

Branches

sed knows two types of branches: unconditional branches (jump instructions) and conditional branches, which are taken depending on whether a preceding replacement operation was performed. A typical application is the following: a source text was indented with leading tab characters, and each of these leading tabs is to be replaced by 8 blanks. Tabs may also appear elsewhere in the text, but those should not be changed. The problem is that a multiplicative relationship (replace N tabs with N * 8 blanks) cannot be expressed as a regular expression, while a global replacement would also affect the tab characters within the text. Therefore, a loop is formed with jump instructions (in the following, tabs and blanks are written as "<t>" and "<b>" for better readability):

sed ':start
     /^<b>*<t>/ {
                  s/^\(<b>*\)<t>/\1<b><b><b><b><b><b><b><b>/
                  b start
                }' inputfile

In each line, the first tab character, provided that it is preceded only by zero or more blanks, is replaced by 8 blanks, after which the jump instruction b <label> makes program execution return to the first line of the program. Once the last leading tab character has been replaced, the expression /^<b>*<t>/ no longer matches and the block is not executed, so that the end of the program is reached and the next line is read.

The similarity to assembly languages becomes clear here: a control structure comparable to the repeat-until loop of high-level languages is built from a condition and a label.
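The same loop can also be written with the conditional branch t, which jumps only if a substitution has succeeded since the last input line was read. The following is a runnable sketch with real tab and blank characters instead of the placeholders; the variable names and the sample input are assumptions made for the demonstration.

```shell
tab=$(printf '\t')   # a literal tab character

# Replace one leading tab per pass; "t start" repeats the pass
# as long as the substitution found something to replace.
script=':start
s/^\( *\)'"$tab"'/\1        /
t start'

# Two leading tabs and one tab inside the text.
expanded=$(printf '\t\tfoo\tbar\n' | sed "$script")
printf '%s\n' "$expanded"
```

The two leading tabs become 16 blanks, while the tab between "foo" and "bar" is left alone.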

Other instructions

Hold space manipulation

A powerful (albeit relatively little-known) feature of sed is the so-called hold space. This is a freely usable memory area that works in a similar way to the accumulator known from some assembly languages. The data in the hold space cannot be manipulated directly, but the content of the pattern space can be moved or copied to the hold space or exchanged with its content. The pattern space can also be appended to the hold space, or vice versa.

The following example illustrates how the hold space works: the text of a "chapter heading" is saved and appended to each line of the respective "chapter", while the line with the chapter heading itself is suppressed:

sed '/^=/ {
             s/^=//
             s/^/ (/
             s/$/)/
             h
             d
          }
     G; s/\n// ' inputfile
Input        Output
=Kapitel1    Zeile 1 (Kapitel1)
Zeile 1      Zeile 2 (Kapitel1)
Zeile 2      Zeile 3 (Kapitel1)
Zeile 3      Zeile A (Kapitel2)
=Kapitel2    Zeile B (Kapitel2)
Zeile A      Zeile C (Kapitel2)
Zeile B
Zeile C

Whenever a line begins with "=", the instruction block is executed, which removes this character and gives the rest of the line a leading blank and parentheses. This text is then copied into the hold space ("h") and deleted from the pattern space ("d"), which ends the program for this line and causes the next line to be read. Since the address of the block does not match "normal" lines, only the final instructions are carried out for them: "G" appends the content of the hold space to the pattern space, separated by a newline, and the subsequent s/\n// removes that newline.
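Another well-known use of the hold space is to reverse the order of lines. The following sketch uses made-up input: every line except the first has the collected text appended ("G"), the result is saved back ("h"), and only the final state is printed.

```shell
# 1!G : on all lines but the first, append the hold space to the pattern space
# h   : copy the pattern space (current line plus all earlier ones) to the hold space
# $p  : on the last line, print the accumulated text
reversed=$(printf 'one\ntwo\nthree\n' | sed -n '1!G;h;$p')

printf '%s\n' "$reversed"
```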

Multi-line statements

Not all text manipulations can be carried out within individual lines. Sometimes information from other lines has to be taken into account, and sometimes replacements across line boundaries are needed. For this purpose, the sed programming language provides the instructions N, P and D, with which several lines of the input text can be loaded into the pattern space at the same time ("N") and parts of it can be output ("P") or deleted ("D"). A typical application is the following one-liner (actually two one-liners), which adds line numbers to a text:

sed '=' inputfile | sed 'N; s/\n/<t>/'

The first sed call prints, for each line of the input text, the line number and then the line itself. The second sed call combines each such pair of lines into a single one by first reading the following line ("N") and then replacing the automatically inserted line separator ("\n") with a tab character.
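How N, P and D interact can be seen in a frequently cited one-liner that drops consecutive duplicate lines. It is shown here as a sketch with made-up input: the next line is appended ("N"), the first line is printed ("P") unless both lines are identical, and "D" removes the first line and restarts the program with the remainder.

```shell
# $!N                  : unless on the last line, append the next input line
# /^\(.*\)\n\1$/!P     : if the two lines differ, print the first one
# D                    : delete up to the first newline and start over
deduplicated=$(printf 'a\na\nb\nb\nb\nc\n' | sed '$!N; /^\(.*\)\n\1$/!P; D')

printf '%s\n' "$deduplicated"
```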

Applications, options, notes

Capacity limits

sed is not subject to any (real) restrictions on file size. Apart from the available disk space, which is a practical limit, most implementations implement the line counter as int or long int. With the 64-bit processors commonly used today, the risk of overflow can therefore be neglected.

Like most UNIX text manipulation tools, sed is limited in line length (more precisely: in the number of bytes up to the following newline character). The minimum size is defined by the POSIX standard; the actual size can vary from system to system and can be looked up in the system header file /usr/include/limits.h as the value of the constant LINE_MAX. The length is specified in bytes, not in characters (which is why a conversion is necessary when processing UTF-encoded files that represent individual characters with several bytes).
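Instead of reading the header file, the limit can usually also be queried from the shell. The sketch below assumes that the POSIX getconf utility is installed; the variable name is illustrative.

```shell
# Ask the system for its LINE_MAX value (in bytes).
line_max=$(getconf LINE_MAX)
printf 'LINE_MAX on this system: %s bytes\n' "$line_max"
```

Individual implementations may accept longer lines than this value suggests, so it is best read as the portable lower bound.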

Greediness

For RegExps, a distinction is made between greedy and non-greedy matching. sed RegExps are always greedy, which means that a RegExp always matches the longest possible stretch of text:

/a.*B/ ; "'a', followed by zero or more arbitrary characters, followed by 'B'"
[axyBBBskdjfhaaBBpweruB]jdfh ; longest possible match (greedy), marked here with [ ]
[axyB]BBskdjfhaaBBpweruBjdfh ; shortest possible match (non-greedy), marked here with [ ]

The reason is that sed is optimized for speed and non-greedy RegExps would require extensive backtracking. If non-greedy behavior is to be enforced, this is usually achieved with negated character classes. In the example above:

/a[^B]*B/ ; "'a', followed by zero or more non-'B' characters, followed by 'B'"
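The difference can be made visible by wrapping the matched text in brackets. This is a small sketch; the variable names are illustrative, and "&" in the replacement stands for the matched text.

```shell
text='axyBBBskdjfhaaBBpweruBjdfh'

# Greedy: the match extends to the last possible "B".
greedy=$(printf '%s\n' "$text" | sed 's/a.*B/[&]/')

# Negated character class: the match stops at the first "B".
shortest=$(printf '%s\n' "$text" | sed 's/a[^B]*B/[&]/')

printf '%s\n' "$greedy" "$shortest"
```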

Practical Limits in Shell Programming

It should not go unmentioned that the most common use of sed (and also of awk, tr and similar filter programs) in practice, namely the ad hoc manipulation of the output of other commands, for example like this:

ls -l /path/to/myfile | sed 's/^\([^ ][^ ]*\) .*/\1/' # prints file type and file mode

is, strictly speaking, a misuse. Since every call of an external program requires the expensive system calls fork() and exec(), shell-internal methods such as so-called variable expansion are usually superior to calling external programs, even if they take considerably more code to write. The rule of thumb is: if the output of the filter process is a file or a data stream, the filter program should be used; otherwise variable expansion is preferable.
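For a value that is already held in a shell variable, the comparison might look like the following sketch. The content of the variable line is a hypothetical ls -l output line, and the variable names are illustrative.

```shell
# Hypothetical line as "ls -l" might print it.
line='-rw-r--r-- 1 user group 0 Jan  1 00:00 myfile'

# External filter: a new process is started for sed.
with_sed=$(printf '%s\n' "$line" | sed 's/^\([^ ][^ ]*\) .*/\1/')

# Shell-internal: remove the longest suffix that starts with a blank.
with_expansion=${line%% *}

printf '%s\n' "$with_sed" "$with_expansion"
```

Both variants yield the first field; the second one does so without starting another process.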

In-place editing

Because of the way sed carries out text manipulation, it cannot operate directly on the input file. A separate output file is required, which is then moved over the input file if necessary.

sed '…<instructions>…' /path/to/inputfile > /path/to/output
mv /path/to/output /path/to/inputfile

This is also how the POSIX standard provides for it. In addition to the POSIX standard, the GNU version of sed offers the command line option -i. This allows a file to be changed seemingly in place, without a detour, but in fact a temporary file is created in the background here as well. This file is not deleted in the event of an error, and the metadata (owner, group, inode number, ...) of the original file is changed in any case.
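If the metadata of the original file should be kept, one possible approach is to write the result back into the existing file instead of moving a new file over it. The following is only a sketch under several assumptions: it relies on the widely available mktemp utility, the file name input.txt is hypothetical, and the write-back step is not atomic, so a failure halfway through could leave the file incomplete.

```shell
workdir=$(mktemp -d)
file="$workdir/input.txt"        # hypothetical input file
printf 'alt\n' > "$file"
inode_before=$(ls -i "$file" | sed 's/^ *\([0-9][0-9]*\) .*/\1/')

# Edit into a temporary file, then copy the content back into the original.
tmp=$(mktemp)
sed 's/alt/NEU/' "$file" > "$tmp" && cat "$tmp" > "$file"
rm -f "$tmp"

inode_after=$(ls -i "$file" | sed 's/^ *\([0-9][0-9]*\) .*/\1/')
result=$(cat "$file")
printf '%s (inode %s -> %s)\n' "$result" "$inode_before" "$inode_after"
rm -r "$workdir"
```

Because the redirection reuses the existing file, its inode number stays the same.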

RegExp notation

It has become common practice to delimit regular expressions, as in the examples above, with slashes. sed does not require this, however. Any character that follows a substitution command is accepted as the delimiter and is then expected in the rest of the command. These two statements are therefore equivalent:

s/^\([^ ][^ ]*\) \([^ ][^ ]*\)/\2 \1/ ; swaps the first and second word of a line
s_^\([^ ][^ ]*\) \([^ ][^ ]*\)_\2 \1_ ; "_" instead of "/"

This is useful if the slash is needed as part of the RegExp, because it saves the tedious escaping (marking a character as being used literally). One then simply switches to another, unused character.
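The benefit shows most clearly with file system paths. In this sketch, the path and the variable names are made up for illustration.

```shell
path='/usr/local/bin/tool'

# With "/" as the delimiter, every slash in the pattern has to be escaped.
escaped=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With "|" as the delimiter, the slashes can be written as they are.
readable=$(printf '%s\n' "$path" | sed 's|/usr/local|/opt|')

printf '%s\n' "$escaped" "$readable"
```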

Some typical practices

Deletion of text parts

Text is deleted by replacing it with nothing. Explicit deletion of part of the pattern space is provided only from its beginning up to the first line separator (D). The expression

/expression/d

however, does NOT delete the matching part of the line, but every line that contains expression! Here, expression serves as the address (see above, 1-address variant of the command d).
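The difference between the two forms can be sketched with a few lines of made-up input; the variable names are illustrative.

```shell
input='keep this
remove expression here
keep that'

# Replacing with nothing removes only the matched text.
part_removed=$(printf '%s\n' "$input" | sed 's/expression //')

# Using the pattern as an address for "d" removes the whole line.
line_removed=$(printf '%s\n' "$input" | sed '/expression/d')

printf '%s\n' "$part_removed" '---' "$line_removed"
```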

Addressing at least one character

POSIX BREs, in contrast to GNU BREs, do not provide the quantifier \+ for one or more occurrences of the preceding expression. To write portable sed scripts that do not run only with GNU sed, the expression should therefore be doubled and the * quantifier (zero or more) used.

/xa\+y/ ; GNU variant for "'x' followed by one or more (but not zero) 'a', followed by 'y'"
/xaa*y/ ; the same in POSIX: "'x' followed by 'a' followed by zero or more 'a's, followed by 'y'"
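POSIX BREs also offer interval expressions of the form \{m,n\}, so "one or more" can alternatively be written as \{1,\}. The sketch below, with made-up input, compares this with the doubled form.

```shell
input='xy
xay
xaaay'

# Doubled expression plus "*": one "a" followed by zero or more "a"s.
doubled=$(printf '%s\n' "$input" | sed -n '/xaa*y/p')

# Interval expression: at least one "a".
interval=$(printf '%s\n' "$input" | sed -n '/xa\{1,\}y/p')

printf '%s\n' "$doubled" '---' "$interval"
```

Both variants select the same lines and skip "xy".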

Replacement of several or all occurrences within a line

If no further options are specified, only the first occurrence of a search text is subject to the replacement rule:

sed 's/alt/NEU/' inputfile
Input                  Output
alt                    NEU
alt alt                NEU alt
alt alt alt            NEU alt alt
alt alt alt alt        NEU alt alt alt
alt alt alt alt alt    NEU alt alt alt alt

This behavior can, however, be changed by specifying command options: if a number N is specified, only the Nth occurrence is changed; a g (for global) changes all occurrences:

sed 's/alt/NEU/g' inputfile
Input                  Output
alt                    NEU
alt alt                NEU NEU
alt alt alt            NEU NEU NEU
alt alt alt alt        NEU NEU NEU NEU
alt alt alt alt alt    NEU NEU NEU NEU NEU
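The numeric option can be sketched in the same way; here only the second occurrence in a line is replaced (the variable name is illustrative).

```shell
# The trailing 2 restricts the replacement to the second match in the line.
second=$(printf 'alt alt alt\n' | sed 's/alt/NEU/2')

printf '%s\n' "$second"
```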

Filter specific lines

By default, sed always outputs the content of the pattern space after the last instruction. To suppress this behavior for individual lines, certain lines can be deleted by a rule (explicit filtering); alternatively, the command line option -n switches this behavior off altogether (implicit filtering). Then only what is output with the explicit print command (p) is written. p can serve either as a separate instruction or as an option of other instructions. Using the text from above, the following example outputs only the "chapter headings":

sed -n 's/^=\(.*\)$/Kapitelüberschrift: \1/p' inputfile
Input        Output
=Kapitel1    Kapitelüberschrift: Kapitel1
Zeile 1      Kapitelüberschrift: Kapitel2
Zeile 2
Zeile 3
=Kapitel2
Zeile A
Zeile B
Zeile C
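Explicit and implicit filtering can be compared side by side. The sketch uses a shortened version of the sample text; the variable names are illustrative.

```shell
input='=Kapitel1
Zeile 1
=Kapitel2
Zeile A'

# Explicit filtering: delete every line that does NOT start with "=".
explicit=$(printf '%s\n' "$input" | sed '/^=/!d')

# Implicit filtering: print nothing by default, then print matching lines with "p".
implicit=$(printf '%s\n' "$input" | sed -n '/^=/p')

printf '%s\n' "$explicit" '---' "$implicit"
```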

Debugging

For troubleshooting, it can be useful to display intermediate results in order to better follow what happens in the pattern space. The p option mentioned above can be used for this. Lines may then be output several times in a row. In the example program above, for instance:

sed '/^=/ {
             s/^=//p
             s/^/ (/p
             s/$/)/p
             h
             d
          }
     p
     G' inputfile
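When invisible characters are suspected, the standard l command may be more helpful than p, since it writes the pattern space in an unambiguous form: tabs appear as \t and the end of the line is marked with $. A minimal sketch with made-up input:

```shell
# -n suppresses the normal output, so only the "l" listing is shown.
visible=$(printf 'a\tb\n' | sed -n 'l')

printf '%s\n' "$visible"
```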
