Browser cache

from Wikipedia, the free encyclopedia

Browser cache [ˈbɹaʊ̯zə(ɹ) kæʃ] is a buffer memory of the web browser in which resources that have already been retrieved (e.g. texts or images) are stored as local copies on the user's computer. If a resource is needed again later, it can be read from the cache more quickly than it could be downloaded again from the World Wide Web.

Whenever the content of a URL is required to display a page, the browser first checks whether a copy already exists in the cache.
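
The lookup order can be sketched in Python as follows. This is a minimal illustration, not a description of how any particular browser works: the dictionary `cache`, the function `get_resource`, and the `fetch` callback standing in for a network download are all hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical in-memory cache: maps a URL to its last known content.
cache: Dict[str, bytes] = {}


def get_resource(url: str, fetch: Callable[[str], bytes]) -> bytes:
    """Return the content for url, consulting the cache before the network."""
    if url in cache:
        return cache[url]      # cache hit: no network traffic
    content = fetch(url)       # cache miss: download from the origin
    cache[url] = content       # keep a copy for later requests
    return content


# Usage with a stand-in for the network that records each download.
downloads = []


def fake_fetch(url: str) -> bytes:
    downloads.append(url)
    return b"<html>example</html>"


get_resource("https://example.org/", fake_fetch)
get_resource("https://example.org/", fake_fetch)
print(len(downloads))  # 1: the second call is served from the cache
```

The sketch deliberately ignores staleness; a copy stored this way is never refreshed.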

The advantage is that network traffic and the time it takes to download all components of a website are greatly reduced. The disadvantage is that the data stored in the cache can be out of date if the website has been updated in the meantime.

Areas of application

A “resource” is anything that can be obtained from a specific URL. Different content can be present under the same URL at different times. The cache associates each URL with its last known content.

The concept used for a browser cache on a user's PC is also applied by proxy servers to entire computer networks, for example for a company site or a university at its connection point to the Internet, or for all customers of a (physical) telecommunications provider in a service area. They deliver frequently requested files (such as the Wikipedia globe logo) directly to all participants connected to this network, without the request first having to travel across the actual Internet.

Caching is in principle an option for any protocol that makes resources available. In practice, however, it is only used for HTTP/HTTPS. Anyone who requests a file download with the browser via FTP receives a fresh version at that moment; a proxy server could nevertheless keep copies of frequently requested files. mailto: sends data and does not retrieve it. Other protocols for resources offer no special support for version management and are rarely used today.

User control

With a browser, users generally have the following options for influencing the behavior of the cache via configuration settings or interactively:

  • Set the maximum amount of disk space; if it is zero, nothing is cached.
  • Reload the current page from the server (this affects only the URL visible in the address bar: a single page or a graphic).
  • Reload the current page and all the resources it contains (images, scripts, styles, etc.); this mostly concerns only HTML pages.
  • Empty the entire cache, or selectively only the resources of a certain domain (that of the current page).
  • Empty the entire cache at the end (or alternatively at the beginning) of each session.
  • Set a maximum age for resources.

Cache management

The following principles apply to the browser or proxy server:

  • No system is obliged to follow the caching hints supplied by the origin of a resource.
  • To keep their product attractive, browser developers have an interest in evaluating cache management information so that pages are displayed both quickly and up to date.

Extensive recommendations were made in 1997 in RFC 2068 and in 1999 in RFC 2616, but they do not have to be implemented in full.

Web server

A web server should provide cache information (metadata) for each individual resource, both to guarantee the user an up-to-date display and to keep the communication effort as low as possible for user and server alike.

The operator of the web server benefits by not having to answer repeated requests for unchanged resources and spend computing and network capacity on them.

To measure how often pages are visited and to gather information about readers, on the other hand, very small embedded snippets are used. These are deliberately kept out of the cache so that they have to be requested each time the page is displayed.

Version identification

For cache management, one of two optional identifiers is used for each individual resource, if the server provides it:

Timestamp
Simple timestamp in UTC.
Field: Last-Modified
ETag
Distinguishes versions of a resource whose content differs substantially.
Can in principle also be a timestamp, but equally a sequentially counted version number or a hash code.
Field: ETag

If such information is missing, the cache management knows only the time of the last successful retrieval.
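
As a rough sketch of how a server might produce the two identifiers, the following Python derives an ETag from a content hash and formats a timestamp for Last-Modified. The function names `make_etag` and `make_last_modified` are hypothetical, and hashing with SHA-256 truncated to 16 hex digits is only one of the options mentioned above, chosen here for illustration.

```python
import hashlib
from email.utils import formatdate


def make_etag(content: bytes) -> str:
    """Derive an ETag from a hash of the content (one of several options)."""
    digest = hashlib.sha256(content).hexdigest()[:16]
    return f'"{digest}"'  # ETag values are quoted strings


def make_last_modified(mtime: float) -> str:
    """Format a Unix timestamp as an HTTP date in GMT."""
    return formatdate(mtime, usegmt=True)


print(make_last_modified(0))  # Thu, 01 Jan 1970 00:00:00 GMT
```

A hash-based ETag changes whenever the content changes, whereas a timestamp changes whenever the file is rewritten, even with identical content.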

Methods

The following methods are available:

Heuristic Estimates
The cache management uses its own algorithms to determine which resources should remain and which should be removed. This applies in particular if the server supplied no caching information with the resource.
  • Keep resources indefinitely as long as enough disk space is available; delete them only when space runs short.
  • Resources that were used recently may soon be needed again. Resources that have not been requested for a long time should be deleted.
  • Resources that are requested frequently should be retained. Resources that are rarely requested and have not been needed for a long time should be deleted.
  • File size: delete very large resources first if they have not been used for a long time and are rarely needed overall, and keep many small resources instead.
  • Stability: content that has changed several times is a candidate for deletion. Resources that have remained unchanged over years and several validity checks should be retained if they are used frequently. A short remaining validity period is an indication of volatility, even though a copy that has just passed its expiry date is not necessarily unusable.
  • If POST data is sent along with the URL, the page usually cannot be meaningfully requested again, and such responses are usually not stored in the cache. Caching them would also be quite dangerous: POST data often changes content on the server, and if the response came from the cache, the submission would never reach the server.
  • If a query introduced by ? is recognized in the URL (e.g. for a database query), algorithms originally refrained from storing the response, because the combinations of query parameters produced many different URLs that were never reused. Increasingly, however, content management systems present even static pages in this URL format, so this assumption is no longer reliable.
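
One of the heuristics above, "delete what has not been used for the longest time when space runs short", can be sketched as a least-recently-used cache with a byte budget. This is only an illustration of that single rule; the class name `LruCache` and its methods are hypothetical, and real cache managers combine several of the listed criteria.

```python
from collections import OrderedDict
from typing import Optional


class LruCache:
    """Cache with a byte budget that evicts the least recently used entry first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.used = 0

    def get(self, url: str) -> Optional[bytes]:
        if url not in self.entries:
            return None
        self.entries.move_to_end(url)  # mark as most recently used
        return self.entries[url]

    def put(self, url: str, content: bytes) -> None:
        if url in self.entries:
            self.used -= len(self.entries.pop(url))
        if len(content) > self.max_bytes:
            return  # larger than the whole budget: do not cache
        self.entries[url] = content
        self.used += len(content)
        while self.used > self.max_bytes:
            _, evicted = self.entries.popitem(last=False)  # oldest entry
            self.used -= len(evicted)
```

With a budget of zero, no non-empty resource is ever stored, which matches the user setting in which a size of zero disables caching.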
"Expiry date"
The resource is given either an expiration date (a point in time) or a validity period counted from the moment it was requested (for example "three days"), from which the time of expiry can be calculated.
  • Fields:
    • Expires: with an absolute point in time
    • Cache-Control: max-age= with a relative period in seconds
  • Example: a weather report that is always valid for the following 15 minutes.
  • However, expiry does not necessarily cause the resource to be deleted from the cache; it only triggers a validity check, which can extend the validity period if the content is unchanged.
  • If the time specified in Expires is already in the past at the moment of the request, this version cannot be placed in the cache; any information stored about this URL would have to be deleted.
  • If the server does not provide information on the validity period, the cache can infer from the time of the last change, and if necessary from behavior it has recorded itself, whether the resource changes frequently or is constant. If the last change was made three years ago, the resource is probably quite stable; if it is a quarter of an hour old, or has changed twice in the last day, it should be rechecked at short intervals. How exactly the cache management deals with missing meta-information is left to the ingenuity of the programmer. Retrieving a large file from the web server every time merely because this information is missing would be clumsy and would waste both time and network capacity.
  • A maximum age, for example two weeks, may also have been specified in the user configuration.
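
The rules above can be combined into a small freshness calculation. The following Python is a sketch under stated assumptions: the function name `freshness_lifetime` is hypothetical, header names are expected in exactly the spelling shown, and the fallback of one tenth of the resource's age since Last-Modified is just one commonly suggested heuristic, not a required one.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def freshness_lifetime(
    headers: Mapping[str, str], received: datetime
) -> Optional[timedelta]:
    """How long a response may be reused without a check, or None if unknown."""
    # 1. Cache-Control: max-age is a relative period in seconds.
    for part in headers.get("Cache-Control", "").split(","):
        name, _, value = part.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return timedelta(seconds=int(value))
    # 2. Expires is an absolute point in time.
    if "Expires" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            return max(expires - received, timedelta(0))
        except (TypeError, ValueError):
            return timedelta(0)  # unusable dates count as already expired
    # 3. Heuristic (assumed factor): one tenth of the time since the last change.
    if "Last-Modified" in headers:
        modified = parsedate_to_datetime(headers["Last-Modified"])
        return max((received - modified) / 10, timedelta(0))
    return None


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_lifetime({"Cache-Control": "max-age=900"}, now))  # 0:15:00
```

The weather report example corresponds to `max-age=900`, i.e. 15 minutes; `received` must be a timezone-aware datetime so that it can be compared with the parsed header dates.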
No cache
A resource announces that it should not be kept in the cache and must be fetched fresh from the server every time a page is opened.
Fields:
  • Cache-Control: no-cache
  • Cache-Control: max-age=0
Traditionally, these two fields are transmitted together, albeit redundantly, in the hope that the browser will understand at least one of them. This is often combined with the “expiry date” January 1, 1970, which has the same effect.
Example: stock exchange prices, which change every second.
More precisely: with Cache-Control: no-cache, a browser is allowed to keep the resource in the cache, but before each use it has to check with the server that the copy is still up to date. It is at the discretion of the browser implementer whether to do this or simply not to cache such resources, which will foreseeably be out of date very quickly.
For the reader, and for the presentation desired by the provider, the effect is the same either way.
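
To illustrate how a cache might recognize these two fields, the following Python splits a Cache-Control value into its directives and checks for either of them. The function names `parse_cache_control` and `must_check_before_use` are hypothetical, and the parser is a simplification: it splits on commas and therefore does not handle quoted arguments that themselves contain commas.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        name, sep, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') if sep else None
    return directives


def must_check_before_use(value: str) -> bool:
    """True if a stored copy has to be checked with the server before each use."""
    directives = parse_cache_control(value)
    return "no-cache" in directives or directives.get("max-age") == "0"


print(must_check_before_use("no-cache, max-age=0"))  # True
```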
Version comparison
Based on an expiry time, the browser assumes that the resource currently in the cache could be out of date (stale). There are then two options:
  • Request only the short HEAD information of the resource from the server (initially without the complete content), evaluate the result locally, and then request the content if necessary (GET).
  • Send the known version information (Last-Modified / ETag) to the server. The server either replies with the HTTP status code 304 Not Modified (the version is still valid) or sends a new version (200 OK) or, in the worst case, 404 Not Found.
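
The second option can be sketched from the server's point of view. In HTTP the browser sends a known ETag back in the If-None-Match request header (and a known Last-Modified value in If-Modified-Since). The function `answer_conditional_get` below is a hypothetical illustration that handles only the ETag case with an exact string comparison; lists of ETags and weak validators are left out.

```python
from typing import Dict, Mapping, Optional, Tuple


def answer_conditional_get(
    request_headers: Mapping[str, str],
    current_etag: Optional[str],
    current_body: Optional[bytes],
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET that may carry If-None-Match."""
    if current_body is None:
        return 404, {}, b""  # the resource no longer exists
    headers = {"ETag": current_etag} if current_etag else {}
    if current_etag and request_headers.get("If-None-Match") == current_etag:
        return 304, headers, b""  # the cached copy is still valid
    return 200, headers, current_body  # send the new version


status, _, body = answer_conditional_get({"If-None-Match": '"v1"'}, '"v1"', b"page")
print(status, len(body))  # 304 0
```

A 304 response carries no body, so only a few header bytes travel over the network when the cached copy is still current.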
Invalidation
If one of the comparatively rare HTTP request methods POST, PUT or DELETE, as used in forms or WebDAV, is later sent to a URL, the entry for this URL would have to be deleted from the cache, because the request may have changed the resource on the server.
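
A minimal Python sketch of this invalidation rule, assuming the cache is a plain dictionary keyed by URL (the names `UNSAFE_METHODS` and `note_request` are hypothetical):

```python
from typing import Dict

# Methods that may change the resource on the server.
UNSAFE_METHODS = {"POST", "PUT", "DELETE"}


def note_request(cache: Dict[str, bytes], method: str, url: str) -> None:
    """Drop the cached entry for url when a modifying request is sent to it."""
    if method.upper() in UNSAFE_METHODS:
        cache.pop(url, None)  # no error if the URL was never cached


entries = {"https://example.org/form": b"old"}
note_request(entries, "GET", "https://example.org/form")   # entry is kept
note_request(entries, "POST", "https://example.org/form")  # entry is removed
print(entries)  # {}
```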
No store
By default, each successfully transferred resource is saved as a single file on the hard drive. If the transmission breaks off or the computer even crashes, page loading can be resumed with the intermediate results. In the early years of the browser, main memory was also limited and networks were slow, so this approach could hardly be avoided. This applies to all resources of the currently displayed pages, regardless of whether a cache is used.
  • After the page display has been completed, the files that can be reused later are transferred to the data structure of the cache; all other files are marked as temporary and deleted in due course.
  • Particularly sensitive resources (such as financial transactions or account data) could leave traces on the hard drive, for example because "deleting" a file only removes it from the visible file system and does not immediately overwrite it physically. If the browser or the computer crashes, files could be left behind; even after an apparent deletion, the physical hard drive could still contain sensitive information that remains readable to anyone logged on to the user account in question.
  • A browser should not write resources marked with Cache-Control: no-store to disk, but keep them only in volatile main memory.
  • However, the concept has a gap: operating systems only rarely give an application the option of requesting main memory that is guaranteed never to be swapped out to a paging file on the hard disk.

Security aspects

A user's cache allows conclusions to be drawn about which topics that user has looked up.

  • A user should have an individual cache on the computer: a separate area for each user account, protected from being read by other users.
    • Around 2000/2005, the previously common central cache directories, which were shared by all users of a PC and could be read by every one of them, were replaced by individual caches whose access rights are limited to the user logged on to the operating system and, in addition, to one user profile of the browser.
    • Most browsers have a "private mode". Among other things, an additional cache data structure is set up here, which receives all resources retrieved from then on. The "normal" cache can be used for reading in parallel, but nothing may be written to it. When private mode ends, the additional cache is deleted.
  • Proxy servers used for unencrypted communication store the pages that pass through them as well as access statistics for all network users.
  • Content accessed via HTTPS cannot be stored on proxy servers; the content and even the URL paths are encrypted.
    • HTTPS, on the other hand, has no influence on caching by the requesting user's own browser, which knows the URL and the decrypted content. However, some browsers offer a configuration setting under which information retrieved via HTTPS is not stored in the cache. Because HTTPS is now widely used for wireless connections, the use of this protocol hardly allows any conclusion that the transmitted pages have a special need for confidentiality.
  • The server response Cache-Control: private should have the effect that the resource may be stored only in the individual cache of one user, but not on proxy servers or in a shared browser cache.
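
The private-mode arrangement described above, an additional cache that takes all writes while the normal cache stays readable, can be modelled in Python with `collections.ChainMap`, whose lookups search all mappings in order but whose writes affect only the first one. This is merely a model of the description; the variable names are hypothetical and browsers do not necessarily implement private mode this way.

```python
from collections import ChainMap

normal_cache = {"https://example.org/logo.png": b"logo"}
private_cache: dict = {}

# Lookups search private_cache first, then normal_cache;
# writes and deletions only ever touch the first mapping.
session = ChainMap(private_cache, normal_cache)

session["https://example.org/secret"] = b"page"   # stored in private_cache only
logo = session["https://example.org/logo.png"]     # read from normal_cache

private_cache.clear()                               # private mode ends
print(list(normal_cache))  # ['https://example.org/logo.png']
```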

Proxy server

Some fields are specifically aimed at proxy servers, i.e. any intermediate stations between the browser and the server with the actual origin of the data. The intermediate stations can hold resources that are frequently queried in a “shared cache” for all participants in the network (or users of the computer).

Fields in the resource
  • Cache-Control: s-maxage=nSeconds
    Like max-age, but applies only to shared caches (“s” = “shared”).
  • Cache-Control: private
    Do not store in a shared cache.
  • Cache-Control: public
    Explicitly released for storage in shared caches.
A browser can send with its request
  • Pragma: no-cache
    A proxy server encountered en route should pass the request through to the origin and not answer it from its cache. It is also recommended that the proxy mark its stored copy of this URL as stale (best-before date expired), since there are evidently doubts as to whether it is up to date.
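
How a cache might apply the response fields listed above, depending on whether it is shared or private, is sketched below. The functions `may_store` and `max_age_for` are hypothetical; they expect the Cache-Control directives as a dictionary that maps each directive name to its argument (or to None when there is no argument).

```python
from typing import Dict, Optional

Directives = Dict[str, Optional[str]]


def may_store(directives: Directives, shared: bool) -> bool:
    """Decide whether a cache may keep a response with these directives."""
    if "no-store" in directives:
        return False
    if shared and "private" in directives:
        return False  # only the user's own cache may keep it
    return True


def max_age_for(directives: Directives, shared: bool) -> Optional[int]:
    """Pick the lifetime in seconds that applies to this kind of cache."""
    s_maxage = directives.get("s-maxage") or ""
    if shared and s_maxage.isdigit():
        return int(s_maxage)  # takes the place of max-age on shared caches
    max_age = directives.get("max-age") or ""
    return int(max_age) if max_age.isdigit() else None


response = {"public": None, "max-age": "60", "s-maxage": "600"}
print(max_age_for(response, shared=True), max_age_for(response, shared=False))  # 600 60
```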

HTTP caching

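In summary, the following HTTP fields mainly influence caching, provided the web server supplies them or the necessary basis has been created:

Web server
  • Last-Modified
  • ETag
  • Expires
  • Cache-Control (max-age, s-maxage, no-cache, no-store, private, public)
Browser request
  • If-Modified-Since
  • If-None-Match
  • Cache-Control
  • Pragma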

The same information that a web server transmits alongside the content can also be embedded in an HTML document, where it overrides the server's default information if necessary:

   <meta http-equiv="Last-Modified" content="..." />

References and comments

  1. RFC 2616, section 14.19
  2. RFC 2616, section 13.2.4
  3. RFC 2616, section 14.21
  4. RFC 2616, sections 14.9.3 and 14.9.4
  5. Some web servers send the corresponding Cache-Control field in the response when the URL contains a parameter such as &max-age=.
  6. RFC 2616, section 14.9.1
  7. RFC 2616, section 13.10
  8. RFC 2616, section 14.9.2
  9. RFC 2616, section 14.9.3
  10. HTML 4