Wide format and long format

Wide format and long format (sometimes called unstacked and stacked) are terms used to describe two different representations of tabular data.

The wide format is suitable for displaying cross-sectional data or time series data. In the wide format, several columns contain the measured values of the same variable at different times or in repeated tests, while each individual or observation unit occupies exactly one row of the table. This layout is intuitive for comparing different values of the same variable, but it is poorly suited to displaying panel data with more than one variable.

In the case of panel data, the data are available for several individuals and for several measurements per individual. If the panel contains more than one variable, such as both weight and height, these have to be placed side by side in wide format. The columns for height and weight can then be ordered either by variable or by time of measurement, which makes it difficult to keep an overview. In addition, the point in time or the repetition of the measurement can only be recognized implicitly from the column name. In this case the long format is more suitable.

In long format, all values of the repeated measurement variable are shown in the same column, and the associated time is shown in a separate variable. This is why data in long format is also referred to as "stacked". Further repeated variables each get their own column but share the values of the same time variable. The time variable (which can also indicate the repetition of the experiment or the context of the measurement) is thus stated explicitly.
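The stacking described above can be sketched by hand in base R. This is only a minimal illustration: the data frame, the column names id, weight.1, weight.2 and time, and the values are all made up for this sketch.

```r
# Two hypothetical subjects measured at two time points, in wide layout:
# one row per subject, one column per measurement occasion.
wide <- data.frame(
  id       = c(1, 2),
  weight.1 = c(80.5, 92.0),
  weight.2 = c(78.0, 90.5)
)

# Long layout: stack the two measurement columns on top of each other
# and record the measurement occasion in an explicit time column.
long <- data.frame(
  id     = rep(wide$id, times = 2),
  time   = rep(c(1, 2), each = nrow(wide)),
  weight = c(wide$weight.1, wide$weight.2)
)

print(long)
```

Each subject now appears once per measurement occasion, and the occasion is read from the time column rather than from a column name.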

In general, the wide format has more columns than the long format, whereas the long format has more rows. Because the term "format" has many different meanings in computer science, the terms "layout" or "structure" have been suggested as more precise alternatives.
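The trade-off between columns and rows can be made concrete with a small panel that contains two repeated variables. The sketch below uses base R's reshape function; the data frame and the names id, weight, height and time are hypothetical and chosen only for illustration.

```r
# Hypothetical panel: three subjects, two variables, two time points.
# In wide layout the weight and height columns sit side by side.
wide <- data.frame(
  id       = 1:3,
  weight.1 = c(80, 92, 75),
  weight.2 = c(78, 90, 74),
  height.1 = c(180, 175, 168),
  height.2 = c(180, 176, 168)
)

# Passing 'varying' as a list with one vector per variable makes the
# pairing with 'v.names' explicit (matched by position).
long <- reshape(
  wide,
  idvar     = "id",
  varying   = list(c("weight.1", "weight.2"), c("height.1", "height.2")),
  v.names   = c("weight", "height"),
  timevar   = "time",
  times     = 1:2,
  direction = "long"
)

dim(wide)  # 3 rows, 5 columns
dim(long)  # 6 rows, 4 columns
```

With more time points the wide layout gains two columns per occasion, while the long layout keeps its four columns and gains three rows per occasion.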

For some operations, especially when analyzing panel data, statistics programs require a long format in which the point in time is explicitly included. This is why the transformation of data from wide format into long format is of great importance and is implemented in many statistical programs. A fictitious example of such a transformation is carried out in the following section.

Example

A nutritionist wants to test a new diet method. For this purpose, 10 overweight people, 5 of them women and 5 men, take part in a study. The subjects are weighed immediately before the start (Weight.1) and after the end (Weight.2) of the diet in order to determine any weight reduction due to the diet. In addition, the weight is measured again a year later (Weight.3) to check the long-term success of the diet. Weight thus appears as a repeated variable in several columns, in contrast to gender, which is recorded only once for each subject. The R code uses German variable names: Probandennummer (subject number), Geschlecht (gender, with w for female and m for male), Gewicht (weight) and Zeitpunkt (point in time).

# R code to create the data set in wide format and export it to LaTeX:
library(xtable)
set.seed(42)  # make the simulated weights reproducible
datensatz.wide <- data.frame(Probandennummer = 1:10, Geschlecht = c(rep("w", 5), rep("m", 5)),
                             Gewicht.1 = rnorm(10, 150, 10), Gewicht.2 = rnorm(10, 140, 10), Gewicht.3 = rnorm(10, 135, 10))
View(datensatz.wide)  # open the data set in the interactive data viewer
xtable(datensatz.wide, caption = "Wide-Format", digits = 1, align = c("c|", "c", "c", "c", "c", "c"))

Wide format

      Subject number   Gender   Weight.1   Weight.2   Weight.3
1     1                w        163.7      153.0      131.9
2     2                w        144.4      162.9      117.2
3     3                w        153.6      126.1      133.3
4     4                w        156.3      137.2      147.1
5     5                w        154.0      138.7      154.0
6     6                m        148.9      146.4      130.7
7     7                m        165.1      137.2      132.4
8     8                m        149.1      113.4      117.4
9     9                m        170.2      115.6      139.6
10    10               m        149.4      153.2      128.6

As clear as the wide format is, some statistical methods require the data in long format, for example the repeated-measures analysis of variance provided by the function ezANOVA from the R package ez. In the wide format, the data set has a separate column for each point in time at which the weight of the subjects is measured. In the long format, by contrast, all weight measurements for the three points in time are placed in a single column, and a new variable is created so that the information about the point in time is not lost.
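To illustrate why such methods expect one measurement column plus an explicit time variable, the following sketch fits a repeated-measures analysis of variance on long-format data. It deliberately uses base R's aov with an Error term instead of ezANOVA, so that no additional package is needed; this is a swapped-in technique, not the ez interface. The simulated data set mirrors the example without the gender column, and the object names wide, long and fit are chosen for this sketch.

```r
# Simulated weights in wide layout, as in the example (gender omitted).
set.seed(42)
wide <- data.frame(
  Probandennummer = 1:10,
  Gewicht.1 = rnorm(10, 150, 10),
  Gewicht.2 = rnorm(10, 140, 10),
  Gewicht.3 = rnorm(10, 135, 10)
)

# Restructure to long layout: one Gewicht column plus an explicit Zeitpunkt.
long <- reshape(
  wide,
  idvar     = "Probandennummer",
  varying   = c("Gewicht.1", "Gewicht.2", "Gewicht.3"),
  v.names   = "Gewicht",
  timevar   = "Zeitpunkt",
  direction = "long"
)

# Subject and time must be factors for the within-subject model.
long$Probandennummer <- factor(long$Probandennummer)
long$Zeitpunkt       <- factor(long$Zeitpunkt)

# Repeated-measures ANOVA: Zeitpunkt varies within each subject.
fit <- aov(Gewicht ~ Zeitpunkt + Error(Probandennummer / Zeitpunkt), data = long)
summary(fit)
```

The model formula refers to a single response column (Gewicht) and a single time factor (Zeitpunkt), which is exactly what the long format provides and the wide format does not.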

Long format

       Subject number   Gender   Time   Weight
1.1    1                w        1      163.7
2.1    2                w        1      144.4
3.1    3                w        1      153.6
4.1    4                w        1      156.3
5.1    5                w        1      154.0
6.1    6                m        1      148.9
7.1    7                m        1      165.1
8.1    8                m        1      149.1
9.1    9                m        1      170.2
10.1   10               m        1      149.4
1.2    1                w        2      153.0
2.2    2                w        2      162.9
3.2    3                w        2      126.1
4.2    4                w        2      137.2
5.2    5                w        2      138.7
6.2    6                m        2      146.4
7.2    7                m        2      137.2
8.2    8                m        2      113.4
9.2    9                m        2      115.6
10.2   10               m        2      153.2
1.3    1                w        3      131.9
2.3    2                w        3      117.2
3.3    3                w        3      133.3
4.3    4                w        3      147.1
5.3    5                w        3      154.0
6.3    6                m        3      130.7
7.3    7                m        3      132.4
8.3    8                m        3      117.4
9.3    9                m        3      139.6
10.3   10               m        3      128.6

# R code to transform the data set from wide to long format and export it to LaTeX:
datensatz.long <- reshape(datensatz.wide, idvar = "Probandennummer", varying = c("Gewicht.1", "Gewicht.2", "Gewicht.3"),
                          timevar = "Zeitpunkt", v.names = "Gewicht", sep = ".", direction = "long")
View(datensatz.long)  # open the data set in the interactive data viewer
xtable(datensatz.long, caption = "Long-Format", digits = 1, align = c("c|", "c", "c", "c", "c"))

In R, the transformation from wide to long format can be done with the reshape function, among other options. The first argument is the data set to be restructured, here datensatz.wide. idvar names the variable that uniquely identifies the subjects, here Probandennummer with the numbers 1 to 10. varying lists the wide-format columns that hold the repeated measurement variable Gewicht, including the suffix for the point in time. Because the measurement time is separated from the variable name by a period, the three wide-format variables are given as the vector c("Gewicht.1","Gewicht.2","Gewicht.3"), and the separator is passed as sep = ".". If no separator character is used, as in Gewicht1, one would write sep = "" instead. According to the R documentation, sep is only used when reshape has to guess v.names and the time values from the column names, so it is redundant in the call above, where v.names is given explicitly. v.names gives the name of the repeated measurement variable without the measurement time, here Gewicht. timevar sets the name of the new variable that records at which point in time each value was measured, here Zeitpunkt. Finally, direction specifies the direction of the transformation, here "long".
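The case without a separator character can be sketched as follows. The example assumes a cut-down data set with the first two subjects from the table and with columns named Gewicht1 to Gewicht3; v.names is left out on purpose so that reshape has to derive the variable name and the time values from the column names using sep = "".

```r
# Hypothetical wide data set whose column names carry the time suffix
# without a separator (values taken from the first two subjects).
wide <- data.frame(
  Probandennummer = 1:2,
  Gewicht1 = c(163.7, 144.4),
  Gewicht2 = c(153.0, 162.9),
  Gewicht3 = c(131.9, 117.2)
)

# Without v.names, reshape splits each name in 'varying' into a variable
# name ("Gewicht") and a time value (1, 2, 3), guided by sep = "".
long <- reshape(
  wide,
  idvar     = "Probandennummer",
  varying   = c("Gewicht1", "Gewicht2", "Gewicht3"),
  sep       = "",
  timevar   = "Zeitpunkt",
  direction = "long"
)

print(long)
```

The result has one Gewicht column and a Zeitpunkt column holding the suffixes that were split off the original column names.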

References

  1. Stata | FAQ: Problems with reshape. Retrieved June 11, 2020.
  2. Michael A. Lawrence: ez: Easy Analysis and Visualization of Factorial Experiments. November 2, 2016. Retrieved December 16, 2016.