Google File System

The Google File System (GFS or GoogleFS) is a proprietary Linux-based distributed file system that Google LLC uses for its applications. It is optimized for Google's web search: data is often stored in files several gigabytes in size, which are rarely deleted, overwritten or shrunk, and the system is tuned for high data throughput.

Design

The Google File System is tailored to the demands of web search, which generates an enormous amount of data to be stored. GFS emerged from an earlier Google effort named “BigFiles”, developed by Larry Page and Sergey Brin during their research at Stanford University.

The data is stored in very large files, sometimes several gigabytes in size, which are deleted, overwritten or truncated only in extremely rare cases; data is usually appended or read. The file system has also been designed and optimized to run on Google's computing clusters, whose nodes are commodity PCs. This means, however, that the high failure rate of individual nodes and the associated loss of data must be treated as normal. This is also reflected in the fact that no distinction is made between a normal shutdown and an abnormal termination (crash): server processes are simply terminated with a kill command. Other design decisions favor high data throughput rates, even at the expense of latency.

A GFS cluster consists of one master and hundreds or thousands of chunk servers. The chunk servers store the files, with each file divided into 64 MB pieces ("chunks"), similar to clusters or sectors in conventional file systems.
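
The mapping from a byte offset within a file to the chunk that holds it is simple integer arithmetic. The following Python sketch illustrates this; the function names are illustrative, not part of GFS:

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed GFS chunk size

    def chunk_index(byte_offset: int) -> int:
        """Index of the chunk that contains the given byte offset of a file."""
        return byte_offset // CHUNK_SIZE

    def chunk_byte_range(index: int) -> tuple[int, int]:
        """Half-open byte range [start, end) covered by chunk number `index`."""
        start = index * CHUNK_SIZE
        return start, start + CHUNK_SIZE

    # Byte 200,000,000 of a file lies in chunk 2, i.e. the third chunk.
    print(chunk_index(200_000_000))  # -> 2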

To prevent data loss, GFS by default stores each chunk on at least three chunk servers per cluster. If a chunk server fails, there are only negligible delays until every affected chunk regains its standard number of replicas. Depending on requirements, the number can be higher, for example for executable files. Each chunk is assigned a unique 64-bit identifier, and the master maintains the logical mapping of files to their chunks.
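
As an illustration of this replica bookkeeping, the following Python sketch assigns each chunk a 64-bit handle and reports which chunks fall below the replication target after a chunk server failure. All names, and the random handle allocation, are assumptions for illustration, not Google's implementation:

    import secrets

    REPLICATION_TARGET = 3  # default number of replicas per chunk

    class ChunkRecord:
        """Master-side record for one chunk (illustrative only)."""

        def __init__(self) -> None:
            # GFS assigns the handle when the chunk is created; a random
            # 64-bit value stands in for that here.
            self.handle = secrets.randbits(64)
            self.replicas: set[str] = set()  # chunk servers holding a copy

        def needs_rereplication(self) -> bool:
            return len(self.replicas) < REPLICATION_TARGET

    def on_chunkserver_failure(chunks, failed_server):
        """Remove the failed server from all replica lists and return the
        chunks that now have fewer copies than the target."""
        under_replicated = []
        for chunk in chunks:
            chunk.replicas.discard(failed_server)
            if chunk.needs_rereplication():
                under_replicated.append(chunk)
        return under_replicated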

The master does not store chunks but rather their metadata: file names, file sizes, the locations of the chunks and their copies, which processes are currently accessing which chunk, and so on. The master receives all requests for a file and, in response, names the responsible chunk servers and grants the requesting process the appropriate locks. A client may, however, cache the address of a chunk server for a certain period of time. If the number of available replicas falls below the standard number, it is also the master that triggers the creation of a new chunk copy. The metadata is kept up to date by the master regularly sending update requests ("heartbeat messages") to the chunk servers.
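
This read path can be sketched as follows: the client contacts the master once per chunk, caches the returned chunk server addresses for a limited time, and afterwards talks to the chunk servers directly. This is a minimal sketch; the master.lookup call and the cache lifetime are hypothetical placeholders, not documented GFS interfaces:

    import time

    CACHE_TTL_SECONDS = 60.0  # assumed cache lifetime, not a documented GFS value

    class GFSClient:
        """Sketch of the client-side read path (all names hypothetical)."""

        def __init__(self, master) -> None:
            self.master = master
            # (filename, chunk index) -> (chunk server addresses, time cached)
            self.location_cache = {}

        def locate_chunk(self, filename, chunk_idx):
            key = (filename, chunk_idx)
            cached = self.location_cache.get(key)
            if cached is not None and time.monotonic() - cached[1] < CACHE_TTL_SECONDS:
                return cached[0]  # fresh cache hit: no round trip to the master
            servers = self.master.lookup(filename, chunk_idx)  # hypothetical RPC
            self.location_cache[key] = (servers, time.monotonic())
            return servers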

GFS is designed and implemented with only one master per cluster. This might appear to be a flaw that limits the system's scalability and reliability, since the cluster's maximum size and uptime would seem to depend on the performance and uptime of the master, which catalogs the metadata and through which almost all requests pass. Google's engineers, however, have shown through measurements that this is (at least so far) not the case and that GFS scales very well. The master is normally the most powerful node in the network.

To ensure reliability, several "shadow masters" mirror the master machine and step in immediately should it fail. The shadow masters also serve pure read requests, which make up the bulk of the traffic, further increasing scalability. Bottlenecks are rare, since clients ask the master only for metadata, which is held entirely in main memory as a B-tree; it is very compact, amounting to only a few bytes per megabyte of data. Using a single master node also drastically reduces software complexity, since write operations do not have to be coordinated between several masters.
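
The compactness claim is easy to check with rough arithmetic. Assuming about 64 bytes of metadata per 64 MB chunk (an assumed figure, consistent with "a few bytes per megabyte"), a petabyte of file data needs only about 1 GiB of master memory:

    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per chunk
    BYTES_PER_CHUNK_RECORD = 64     # assumed metadata cost per chunk

    def master_metadata_bytes(total_data_bytes: int) -> int:
        """Rough in-memory metadata footprint for a given amount of file data."""
        num_chunks = -(-total_data_bytes // CHUNK_SIZE)  # ceiling division
        return num_chunks * BYTES_PER_CHUNK_RECORD

    one_petabyte = 2 ** 50
    print(master_metadata_bytes(one_petabyte))  # 1073741824 bytes, i.e. 1 GiB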

