Spider trap


A spider trap is a web page structure intended to detect unwanted web crawlers and, optionally, to prevent them from capturing the content of a website.

The goal is to exclude unwanted web crawlers, such as those that spread spam or probe for security vulnerabilities, from capturing the site's content, while desired crawlers such as search engine bots can work unhindered and human visitors are not impaired in their experience.

A spider trap exploits the fact that desired bots adhere to the rules defined by the site operator (for example in a robots.txt file) and therefore ignore certain parts of a website. Unwanted crawlers generally do not follow such rules. The operator can therefore place a link that is invisible to human visitors and disallowed for well-behaved crawlers; any client that follows it is assumed to be an unwanted crawler, and its IP address is blocked.
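A trap of this kind could, for example, combine a robots.txt rule with a hidden link. The path /trap.php and the way the link is hidden are only illustrative choices:

# robots.txt – well-behaved crawlers will not request this path
User-agent: *
Disallow: /trap.php

<!-- In the page markup: invisible to human visitors, but followed by crawlers that ignore robots.txt -->
<a href="/trap.php" style="display:none" rel="nofollow">&nbsp;</a>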

If a human visitor nevertheless ends up on the blocked page, the site can offer to lift the block by solving a CAPTCHA.
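A minimal sketch of this blocking side, assuming the blocklist is kept in a plain text file blocked.txt and that a page unblock.php presents the CAPTCHA (all file names are only examples):

<?php
// trap.php – any client requesting this page is added to the blocklist
file_put_contents('blocked.txt', $_SERVER['REMOTE_ADDR'] . "\n", FILE_APPEND | LOCK_EX);

// Check to include at the top of every regular page:
$blocked = is_file('blocked.txt')
    ? file('blocked.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : array();
if (in_array($_SERVER['REMOTE_ADDR'], $blocked, true)) {
    // Blocked visitors are sent to a page offering a CAPTCHA to lift the block
    header('Location: /unblock.php');
    exit;
}
?>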

Detecting web crawlers and recording their requests in a log file can be done with a simple PHP script, for example:

<?php
// Request properties to record for each trapped client
$props = array(
    'REMOTE_ADDR',
    'REMOTE_HOST',
    'HTTP_USER_AGENT',
    'SERVER_PORT',
    'QUERY_STRING',
    'HTTP_REFERER'
);

// Start the log entry with a marker and a timestamp
$log = array('evil', date(DATE_ATOM));

// Append each property, or an empty string if it is not set
foreach ($props as $prop) {
    $entry = array_key_exists($prop, $_SERVER) ? $_SERVER[$prop] : '';
    array_push($log, $entry);
}

// Write the entry as one tab-separated line to the log file
file_put_contents('bot.log', join("\t", $log) . "\n", FILE_APPEND);
?>

Of course, this can be expanded significantly, for example by writing the entries to a database instead of a flat file.
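As a sketch of such an extension, the same request properties could be stored in an SQLite database via PDO; the file name bots.sqlite and the table name bot_log are only examples:

<?php
// Open (or create) the SQLite database and the log table
$db = new PDO('sqlite:bots.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS bot_log (
    logged_at TEXT, remote_addr TEXT, remote_host TEXT,
    user_agent TEXT, server_port TEXT, query_string TEXT, referer TEXT
)');

// Insert one row per trapped request, using empty strings for missing values
$stmt = $db->prepare('INSERT INTO bot_log VALUES (?, ?, ?, ?, ?, ?, ?)');
$stmt->execute(array(
    date(DATE_ATOM),
    isset($_SERVER['REMOTE_ADDR'])     ? $_SERVER['REMOTE_ADDR']     : '',
    isset($_SERVER['REMOTE_HOST'])     ? $_SERVER['REMOTE_HOST']     : '',
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '',
    isset($_SERVER['SERVER_PORT'])     ? $_SERVER['SERVER_PORT']     : '',
    isset($_SERVER['QUERY_STRING'])    ? $_SERVER['QUERY_STRING']    : '',
    isset($_SERVER['HTTP_REFERER'])    ? $_SERVER['HTTP_REFERER']    : ''
));
?>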
