Talk:Saint-Aubin-des-Coudrais and Help:Creating a bot: Difference between pages

{{botnav}}

{{shortcut|WP:CREATEBOT|WP:MAKEBOT|WP:MKBOT}}
'''Robots''' or '''bots''' are automatic processes which interact with Wikipedia as though they were human editors. This page attempts to explain how to carry out the development of a bot for use on Wikipedia. The explanation is geared mainly towards those who have some prior programming experience, but are unsure of how to apply this knowledge to creating a Wikipedia bot.

==Why would I need to create a bot?==

Bots can automate tasks and perform them much faster than humans. If you have a simple task which you need to perform lots of times (an example might be to add a [[wikipedia:template|template]] to all pages in a category with 1000 pages) then this is a task better suited to a bot than a human.

==Considerations before creating a bot==

There are already a [[Wikipedia:Bots/Status|number of bots running on Wikipedia]]. Many of these bots publish their [[source code]], which can sometimes be reused with little additional development time. In addition, there are a number of [[Wikipedia:Semi-bots|semi-bots]] available to anyone. Most of these take the form of enhanced web browsers with Wikipedia-specific functionality. The most popular of these is [[WP:AWB|AWB]]; see [[Wikipedia:Tools/Editing tools]] for a complete list.

If you have no previous programming experience, it may be simpler to ask an existing bot to do the job, or to ask others to develop a bot for you. These requests can be made at [[Wikipedia:Bot requests]]. If you wish to write a new bot anyway, be aware that learning a programming language is a non-trivial task. However, it is not black magic – anyone can learn how to program with sufficient time and effort. Good luck!

If you decide to create a bot, planning is crucial to obtain an error-free, efficient, and effective program. The following initial considerations are important:
* Will the bot be manually assisted or fully automated?
* Will you create the bot alone, or with the help of other programmers?
* What language will be used to implement the bot?
* Will the bot's requests, edits, or other actions be logged? If so, will the logs be stored on local media, or on wiki pages?
* Will the bot run inside a web browser (for example, written in JavaScript), or will it be a standalone program?
* If the bot is a standalone program, will it run on your local computer, or on a remote server such as the [https://wiki.toolserver.org/view/Main_Page Wikimedia Toolserver]?
* If the bot runs on a remote server, will other editors be able to operate the bot or start it running?

==How does a Wikipedia bot work?==
===Overview of operation===
[[Image:Wikieditcycle.png|left]]
Just like a human editor, a Wikipedia bot reads Wikipedia pages and makes changes where it thinks changes need to be made. The difference is that although bots are faster and less prone to fatigue than humans, they are nowhere near as bright as we are. Bots are good at repetitive tasks that have easily defined patterns, where few decisions have to be made.

In the most typical case, a bot logs in to its own account and requests pages from Wikipedia just the way a browser does - though of course it never displays the page, but works on it in memory - and then programmatically examines the page code to see if any changes need to be made. It then makes whatever edits it was designed to make and submits them, again using the same requests a browser would use. This method, often called [[screen scraping]], uses standard [[Hypertext Transfer Protocol|HTTP]] requests: whenever you see '''/w/index.php?...=...&...=...''' in the browser address bar, everything after the question mark consists of variables and data sent by the GET method. There are also a handful of other [[Application Programming Interface]]s, described below, for getting pages from Wikipedia and sending edits to it.

Because bots access pages the same way people do, bots can experience the same kinds of difficulties that human users do. They can get caught in edit conflicts, have page timeouts, or run across other unexpected complications while requesting pages or making edits. Because the volume of work done by a bot is larger than that done by a live person, the bot is more likely to encounter these issues. Thus, it is important to consider these situations when writing a bot.

===APIs for bots===
In order to make changes to Wikipedia pages, a bot necessarily has to retrieve pages from Wikipedia and send edits back. There are several [[Application Programming Interface|APIs]] available for that purpose.

* [[mw:API|MediaWiki API]] (api.php). This interface was specifically written to permit automated processes such as bots to make queries and post changes. Data is available in many different machine-readable formats ([[JSON]], [[XML]], [[YAML]], ...). Features have been fully ported from the older Query API interface.
*: '''Status:''' Available on all Wikimedia projects, with a very complete set of queries. The ability to edit pages via API.php has also been enabled on all Wikimedia projects, enabling bots to operate entirely without screen scraping.

* [[Screen scraping]] (index.php). Screen scraping, as mentioned above, involves requesting a Wikipedia page, looking at the raw HTML code (what you would see if you clicked View->Source in most browsers), and then analyzing the HTML for patterns. There are certain problems with this approach: the Wikipedia interface can change without notice, which may break the bot code, and requesting HTML creates a larger server load than processing the wikitext itself. You can add an <code>action=render</code> parameter to the GET request (<code>/w/index.php?title=Wikipedia:...&action=render</code>) to produce a stripped-down version of the page (without the Wikipedia sidebars and tabs), which reduces the amount of transferred data and eases the effects of changes in the user interface. Other parameters of index.php may be useful: see the partial list at [[MW:Manual:Parameters to index.php|Manual:Parameters to index.php]]. There are very few reasons to use this technique anymore; it is mainly used by older bot frameworks written before the API had as many features.
*: '''Status:''' Deprecated.

* [[Special:Export]] can be used to obtain a bulk export of page content in XML form. See [[MW:Manual:Parameters to Special:Export|Manual:Parameters to Special:Export]] for arguments.
*: '''Status:''' Built-in feature of MediaWiki, available on all Wikimedia servers.

* Raw (wikitext) page processing: sending an <code>action=raw</code> or an <code>action=raw&templates=expand</code> GET request to index.php will return the unprocessed wikitext source code of a page.
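
As a rough illustration of the index.php requests described in the list above, the following Python sketch builds the URLs for <code>action=raw</code> and <code>action=render</code>. The helper name <code>index_url</code> and the example page title are hypothetical; the host is assumed to be the same one used for api.php elsewhere on this page. The sketch only constructs URLs and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed location of index.php on the English Wikipedia.
INDEX_URL = "http://en.wikipedia.org/w/index.php"


def index_url(title, action, **extra):
    """Build an index.php GET URL (hypothetical helper, for illustration)."""
    params = {"title": title, "action": action}
    params.update(extra)
    return INDEX_URL + "?" + urlencode(params)


# Unprocessed wikitext of a page.
raw_url = index_url("Wikipedia:Sandbox", "raw")
# Wikitext with templates expanded.
expanded_url = index_url("Wikipedia:Sandbox", "raw", templates="expand")
# Stripped-down HTML without sidebars and tabs.
render_url = index_url("Wikipedia:Sandbox", "render")
```

Building the query string with <code>urlencode</code> rather than by hand keeps page titles containing spaces or punctuation correctly escaped.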

Some Wikipedia web servers are configured to grant requests for compressed ([[gzip]]) content. This can be done by including a line "Accept-Encoding: gzip" in the HTTP request header; if the HTTP reply header contains "Content-Encoding: gzip", the document is in gzip form, otherwise, it is in the regular uncompressed form. Note that this is specific to the web server and not to the MediaWiki software. Other sites employing MediaWiki may not have this feature.
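
The gzip negotiation above might be handled along these lines in Python. This is a minimal sketch: the function names are hypothetical, and the response headers are assumed to be available as a plain dictionary.

```python
import gzip


def request_headers():
    """Headers asking the server for gzip-compressed content."""
    return {"Accept-Encoding": "gzip"}


def decode_body(response_headers, body):
    """Return the uncompressed body, whichever form the server sent.

    response_headers is assumed to be a dict of HTTP reply headers and
    body the raw bytes of the reply.
    """
    encoding = response_headers.get("Content-Encoding", "")
    if encoding.lower() == "gzip":
        return gzip.decompress(body)
    # No Content-Encoding: gzip header, so the body is already uncompressed.
    return body
```

Checking the reply header, rather than assuming compression, matters because not every server will honour the request.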

===Logging in ===
Approved bots need to be logged in to make edits. Although a bot can make read requests without logging in, bots that have completed testing should log in for all activities. Bots logged in from an account with the bot flag can obtain more results per query from the MediaWiki API (api.php).

For security, login data must be passed using the [[HTTP POST]] method. Because parameters of [[HTTP GET]] requests are easily visible in the URL, logins via GET are disabled.

To log a bot in using [[mw:API|MediaWiki API]], use this URL and POST data:
* URL: <code>http://en.wikipedia.org/w/api.php</code>
* POST parameters:
**<code>action=login</code>
** <code>lgname=BOTUSERNAME</code>
** <code>lgpassword=BOTPASSWORD</code>
** <code>format=xml</code>
This will return a result (success or error) in XML form, as documented at [[mw:API:Login]]. Other output formats are available.
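
The login request above could be prepared in Python roughly as follows. This sketch only builds the request object; the function name and the example credentials in the usage note are hypothetical, and actually sending the request (and handling the reply) is left out.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "http://en.wikipedia.org/w/api.php"


def build_login_request(username, password):
    """Build (but do not send) the api.php login request."""
    post_data = urlencode({
        "action": "login",
        "lgname": username,
        "lgpassword": password,
        "format": "xml",
    }).encode("ascii")
    # Supplying a data argument makes urllib send the request as a POST,
    # so the credentials never appear in the URL.
    return Request(API_URL, data=post_data)
```

For example, <code>build_login_request("ExampleBot", "secret")</code> yields a request that can be passed to an opener which stores the login cookies.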

A successful login attempt will result in the Wikimedia server setting several [[HTTP cookie]]s. The bot must save these cookies and send them back every time it makes a request (this is particularly crucial for editing). On the English Wikipedia, the following cookies should be used: '''enwikiUserID''', '''enwikiToken''', and '''enwikiUserName'''. The '''enwiki_session''' cookie is required to actually send an edit or commit some change; otherwise, the [[MediaWiki:Session fail preview]] error message will be returned.
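
One way to keep and resend these cookies in Python is a cookie jar attached to a URL opener, as sketched below. The helper names are hypothetical; the cookie names are the ones listed above.

```python
import http.cookiejar
import urllib.request

# Cookie names given in the text above for the English Wikipedia.
REQUIRED_COOKIES = (
    "enwikiUserID", "enwikiToken", "enwikiUserName", "enwiki_session",
)


def make_opener():
    """Return an opener that stores cookies and resends them on each request."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def missing_cookies(jar):
    """List the required cookies that the jar does not currently hold."""
    present = {cookie.name for cookie in jar}
    return [name for name in REQUIRED_COOKIES if name not in present]
```

A bot could call <code>missing_cookies</code> before editing and log in again if the list is not empty.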

===Editing; edit tokens===
Wikipedia uses a system of [[mw:Manual:Edit token|edit tokens]] for making edits to Wikipedia pages, as well as some other operations such as rollback. The token looks like a long hexadecimal number followed by '+\':
:b66655fjr7fd5drr3411ss23456s65eg+\
The role of edit tokens is to prevent ''edit hijacking'', where users are tricked into making an edit by clicking a single link.

The editing process involves two HTTP requests. First, a request for an edit token must be made. Then, a second HTTP request must be made that sends the new content of the page along with the edit token just obtained. It is not possible to make an edit in a single HTTP request.

To obtain an edit token, follow these steps:
<ul>
<li>'''MediaWiki API (api.php)'''. Make a request with the following parameters (see [[mw:API:Edit - Create&Edit pages]]).
<ul>
<li><code>action=query</code></li>
<li><code>prop=info</code></li>
<li><code>titles=PAGENAME</code></li>
<li><code>intoken=edit</code></li>
</ul>
The token will be returned in the <code>edittoken</code> attribute of the response.
</li>
</ul>

If the edit token the bot receives does not contain the hexadecimal string - i.e. the edit token is just '+\' - then the bot is most likely not logged in. This might be due to a number of factors: a failure in authentication with the server, a dropped connection, a timeout of some sort, or an error in storing or returning the correct cookies. It might be a programming error, or it may simply require logging in again to refresh the login cookies.
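
The token check described above might look like this in Python. It is a sketch under assumptions: the function names are hypothetical, and the code simply searches the XML reply for any element carrying an <code>edittoken</code> attribute rather than relying on a particular response layout.

```python
import xml.etree.ElementTree as ET


def extract_edit_token(xml_text):
    """Return the first edittoken attribute in an XML reply, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        token = element.get("edittoken")
        if token is not None:
            return token
    return None


def is_logged_in_token(token):
    """A bare '+\\' token (no hexadecimal part) suggests the bot is not logged in."""
    return token is not None and token != "+\\"


# Hypothetical reply, shaped only for illustration.
sample = ('<api><query><pages><page title="Sandbox" '
          'edittoken="b66655fjr7fd5drr3411ss23456s65eg+\\" />'
          '</pages></query></api>')
token = extract_edit_token(sample)
```

If <code>is_logged_in_token(token)</code> is false, the bot should log in again before attempting the edit.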

===Edit conflicts===


Edit conflicts occur when multiple, overlapping edit attempts are made on the same page. Almost every bot ''will'' eventually get caught in an edit conflict of one sort or another, and should include some mechanism to test for and accommodate these issues.

Bots that use the MediaWiki API (api.php) should set the <code>basetimestamp</code> attribute, and check the server responses for indications of errors. For more details, see [[mw:API:Edit - Create&Edit pages]].

Generally speaking, if an edit fails to complete, the bot should check the page again before trying to make a new edit, to make sure the edit is still appropriate. Further, if a bot rechecks a page to resubmit a change, it should be careful to avoid any behavior that could lead to an infinite loop, or that could even resemble [[WP:EW|edit warring]].
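
The recheck-before-retry advice can be sketched in Python as a bounded loop. Everything here is hypothetical scaffolding: <code>fetch_page</code>, <code>compute_edit</code> and <code>submit_edit</code> stand in for whatever functions a real bot uses to read a page, decide on a change, and save it with <code>basetimestamp</code> set.

```python
def edit_with_retry(fetch_page, compute_edit, submit_edit, max_attempts=3):
    """Try an edit a limited number of times, rechecking the page each time.

    fetch_page()   -> (text, timestamp) of the current revision
    compute_edit() -> new text, or None if no change is needed
    submit_edit()  -> True if the save succeeded, False on a conflict or error
    """
    for _ in range(max_attempts):
        # Always start from a fresh copy so a stale edit is never resubmitted.
        text, timestamp = fetch_page()
        new_text = compute_edit(text)
        if new_text is None or new_text == text:
            return "nothing to do"
        if submit_edit(new_text, basetimestamp=timestamp):
            return "saved"
    # Bounded attempts avoid infinite loops and anything resembling edit warring.
    return "gave up"
```

The fixed limit on attempts is the important design choice: a bot that gives up and logs the failure is far less harmful than one that keeps resubmitting.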

==Overview of the process of developing a bot==
Actually coding or writing a bot is only one part of developing a bot. You should generally follow the development cycle below. Failure to comply with this development cycle, particularly the sections on [[Wikipedia:Bots|Wikipedia bot policy]], may lead to your bot failing to be approved or being blocked from editing Wikipedia.
[[Image:Bot_development_cycle.svg|thumb|none|600px|Overview of Wikipedia bot development cycle]]
====Software elements analysis:====
*The first task in creating a Wikipedia bot is extracting the requirements or coming up with an idea. If you don't have an idea of what to write a bot for, you could pick up ideas at [[Wikipedia:Bot requests|requests for work to be done by a bot]].
*Make sure an existing bot isn't already doing what you think your bot should do. To see what tasks are already being performed by a bot, see [[Wikipedia:Bots/Status|the list of currently operating bots]].

====Specification:====
*[[Program specification|Specification]] is the task of precisely describing the software to be written, possibly in a rigorous way. You should come up with a detailed proposal of what you want it to do. Try to discuss this proposal with some editors and refine it based on feedback. Even a great idea can be made better by incorporating ideas from other editors.
*In the most basic form, your specified bot must meet the following criteria:
:*The bot is harmless (it must not make edits that could be considered vandalism)
:*The bot is useful (it provides a useful service more effectively than a human editor could), and
:*The bot does not waste server resources.
*Make sure your proposal meets the criteria of [[Wikipedia:Bots|Wikipedia bot policy]]

====[[Software architecture]]:====
*Think about '''how''' you might create it and which programming language and tools you would use. Architecture is concerned with making sure the software system will meet the requirements of the product as well as ensuring that future requirements can be addressed. There are [[Wikipedia:Types_of_bots|different types of bots]] and the main body of the article below will cover this technical side.

====[[computer programming|Implementation]]:====
Implementation (or coding) involves reducing design to code. It may be the most obvious part of the software engineering job but it is not necessarily the largest portion. In the implementation stage you should:
*Create a user page for your bot. Your bot's edits must not be made under your own account. Your bot will need its own account with its own username and password.
*Add these details to your proposal and post it to [[Wikipedia:Bots/Requests_for_approval|requests for bot approval]]
*Add the same information to the user page of the bot. You should also add a link to the approval page (whether approved or not) for each function. People will comment on your proposal and it will be either accepted or rejected.
*Code your bot in your chosen programming language.

====[[software testing|Testing]]:====
If your proposal is accepted, the bot will probably be given a trial period during which it may be run to fine-tune it and iron out any bugs. You should test your bot widely and ensure that it works correctly. At the end of the trial period it will hopefully be approved.

====[[software documentation|Documentation]]:====
An important (and often overlooked) task is documenting the internal design of your bot for the purpose of future maintenance and enhancement. This is especially important if you are going to allow clones of your bot. Ideally, you should post up the source code of your bot on its userpage if you want others to be able to run clones of it. This code should be well documented for ease of use.

====Software training and support:====
You should be ready to field queries or objections to your bot on your user talk page.

====[[software maintenance|Maintenance]]:====
Maintaining and enhancing your bot to cope with newly discovered problems or new requirements can take far more time than the initial development of the software. Not only may it be necessary to add code that does not fit the original design but just determining how software works at some point after it is completed may require significant effort.
*If you want to make a major functionality change to your bot in the future, you should request this as above using the [[Wikipedia:Bots/Requests_for_approval|requests for bot approval]].

==General guidelines for running a bot==
In addition to the official bot policy, which covers the main points to consider when developing your bot, there are a number of more general advisory points to keep in mind.

===Bot best practices===
* Use the [[mw:Manual:Maxlag parameter|maxlag parameter]] with a maximum lag of 5 seconds. This will enable the bot to run quickly when server load is low, and throttle the bot when server load is high.
**If writing a bot in a framework that does not support maxlag, limit the total requests (read and write requests together) to no more than 10/minute.
* Use the [[mw:API|API]] whenever possible, and set the query limits to the largest values that the server permits, to minimize the total number of requests that must be made.
* Edit (write) requests are more expensive in server time than read requests. Be edit-light and design your code to keep edits to a minimum.
** Try to consolidate edits. One single large edit is better than 10 smaller ones.
* Do not make multi-threaded requests. Wait for one server request to complete before beginning another.
* Back off upon receiving errors from the server. Errors such as squid timeouts are often an indication of heavy server load. Use a sequence of increasingly longer delays between repeated requests.
* Make use of [[mw:Extension:Assert Edit|the Assert Edit extension]], an extension explicitly designed for bots to check certain conditions, which is enabled on Wikipedia.
* Test your code thoroughly before making large automated runs. Individually examine all edits on trial runs to verify they are perfect.
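
Two of the practices above, the maxlag parameter and backing off after errors, could be combined roughly as follows. The function names, the starting delay, the doubling factor, and the cap are all illustrative assumptions, not values prescribed by policy; <code>do_request</code> stands in for whatever function performs one server request.

```python
import time


def with_maxlag(params, maxlag=5):
    """Return a copy of the request parameters with maxlag added."""
    merged = dict(params)
    merged["maxlag"] = str(maxlag)
    return merged


def backoff_delays(first=5, factor=2, attempts=5, cap=300):
    """Yield increasingly longer delays in seconds (values are assumptions)."""
    delay = first
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(do_request, sleep=time.sleep):
    """Call do_request, sleeping longer after each failure, then give up."""
    last_error = None
    for delay in backoff_delays():
        try:
            return do_request()
        except OSError as error:  # e.g. timeouts and connection failures
            last_error = error
            sleep(delay)
    raise last_error
```

Passing <code>sleep</code> as a parameter makes the delay logic easy to test without actually waiting.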

===Common bot features you should consider implementing===

====Manual assistance====
If your bot is doing anything that requires judgement or evaluation of context (e.g., correcting spelling), then you should consider making your bot manually assisted, that is, not making edits without human confirmation.

====Disabling the bot====
It is good bot policy to have a feature to disable the bot's operation if it is requested. You should probably have the bot refuse to run if a message has been left on its talk page, on the assumption that the message may be a complaint against its activities. This can be checked by looking for the "You have new messages..." banner in the HTML for the edit form. Remember that if your bot goes bad, it is your responsibility to clean up after it! You can also maintain a control page containing the word <code>True</code>, and have the bot stop whenever that text is changed. The bot can do this by fetching and checking the page before each edit.
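
Both shut-off checks described above can be expressed as one small Python function. This is a sketch only: the function name is hypothetical, and the caller is assumed to have already fetched the control page text and the edit form HTML.

```python
def bot_may_run(control_page_text, edit_form_html=""):
    """Decide whether the bot should continue before making an edit.

    control_page_text: current text of the bot's control page
    edit_form_html:    HTML of the edit form the bot just loaded
    """
    # Anyone can stop the bot by changing "True" on the control page.
    if control_page_text.strip() != "True":
        return False
    # A new talk page message may be a complaint, so stop and let a human look.
    if "You have new messages" in edit_form_html:
        return False
    return True
```

Calling this before every edit keeps the delay between a shut-off request and the bot actually stopping as short as possible.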

====Signature====
Just like a human, if your bot makes edits to a talk page on Wikipedia, it should sign its post with four tildes <nowiki>(~~~~)</nowiki>. It should not sign any edits to text in the main namespace.

== Programming languages and libraries ==

Bots can be written in almost any programming language. The choice of a language often depends on the experience of the bot writer (which languages are familiar) or on the availability of pre-developed libraries to perform the desired task. The following list includes some languages that have libraries to assist with bot tasks.

====[[Perl]]====
Perl has a [[run-time]] [[Compiler|compiler]]. This means that it is not necessary to compile [[Software build|builds]] of your code yourself as it is with other programming languages. Instead, you simply create your program using a text editor such as [[gvim]]. You then run the code by passing it to an interpreter. This can be located either on your own computer or on a remote computer ([[webserver]]). If located on a webserver, you can start your program running and interface with your program while it is running via the [[Common Gateway Interface]] from your browser. Perl is available for most operating systems, including [[Microsoft Windows]] (which most human editors use) and [[UNIX]] (which many webservers use). If your internet service provider provides you with webspace, the chances are good that you have access to a perl build on the webserver from which you can run your Perl programs.

Guides to getting started with Perl programming:
*[http://www.perl.com/pub/a/2000/10/begperl1.html A Beginner's Introduction to Perl]
*[http://www.cgi101.com/book/ CGI Programming 101: Learn CGI Today!]
*[http://www.cs.tut.fi/~jkorpela/perl/course.html Perl lessons]
*[http://perl.about.com/b/a/227771.htm Get started learning Perl]

Libraries:
* [http://search.cpan.org/~exobuzz/MediaWiki-API/ MediaWiki::API] - A Perl module for interfacing with the MediaWiki API, to allow information retrieval, editing, and file upload/download.
* [http://openfacts.berlios.de/index-en.phtml?title=Anura Anura] -- Perl interface to MediaWiki using libwww-perl. Not recommended, as the current version does not check for edit conflicts.
* [http://search.cpan.org/~markj/WWW-Mediawiki-Client/ WWW::Mediawiki::Client] -- perl module and command line client
* [http://search.cpan.org/~esummers/WWW-Wikipedia-1.9/ WWW::Wikipedia] -- perl module for interfacing wikipedia
* [http://wiki.kn.vutbr.cz/mj/index.cgi?Perl%20Wikipedia%20toolkit Perl Wikipedia ToolKit] -- perl modules, parsing wikitext and extracting data
* [[User:Shadow1/Perlwikipedia|perlwikipedia]] - A fairly-complete Wikipedia bot framework written in Perl.
* [http://search.cpan.org/perldoc?MediaWiki MediaWiki CPAN Package] by Edward Chernenko - has a rich API, but also several critical bugs.
* [https://svn.toolserver.org/svnroot/cbm/mediawiki-api/ Mediawiki::API] - a library by [[User:CBM|CBM]] with robust automatic error handling and wrappers for many common API.php uses. This is not the same as the library on CPAN.

====[[PHP]]====
[[PHP]] can also be used for programming bots. PHP is an especially good choice if you wish to provide a webform-based interface to your bot. For example, suppose you wanted to create a bot for renaming categories. You could create an HTML form into which you will type the current and desired names of a category. When the form is submitted, your bot could read these inputs, then edit all the articles in the current category and move them to the desired category. (Obviously, any bot with a form interface would need to be secured somehow from random web surfers.)

To log in your bot, you will need to know how to use PHP to send and receive cookies; to edit with your bot, you will need to know how to send form variables. Libraries like [http://snoopy.sourceforge.net/ Snoopy] simplify such actions.

Libraries:
* [http://wikisum.com/w/User:Adam/Creating_MediaWiki_bots_in_PHP BasicBot]: A basic framework with sample bot scripts. Based on [http://snoopy.sourceforge.net/ Snoopy].
* [[User:SQL/SxWiki|SxWiki]]: A very simple bot framework.
* [[User:ClueBot/Source|wikibot.classes]]: The bot framework designed for and used by [[User:ClueBot|ClueBot]], along with a few other bots.
* [[User:Kaspo/Phpwikibot|Phpwikibot]]: The bot framework designed by [[User:Kaspo|Kaspo]].

Other pages:
*[[User:GeorgeMoney/Bot_Framework|Bot framework]]
<!-- looks empty *[http://sourceforge.net/projects/phpwikipedia/ PHP Wikipedia Bot Framework] -->

====[[Python (programming language)|Python]]====
Python is a popular interpreted language with object-oriented features.

Getting started with Python:
* [http://docs.python.org/tut/ Official Python tutorial]
* [http://wiki.python.org/moin/BeginnersGuide Beginner's Guide to coding in python]

Libraries:
* [[m:Using the python wikipediabot|PyWikipediaBot]] -- Python Wikipedia Robot Framework ([http://pywikipediabot.sourceforge.net/ Home Page], [http://sourceforge.net/projects/pywikipediabot/ SF Project Page])

==== [[Microsoft .NET]] ====
Languages include [[Microsoft Visual C Sharp|C#]], [[C++|Managed C++]], [[Visual Basic .NET]], [[J Sharp|J#]], [[JScript .NET]], [[IronPython]], and [[Windows PowerShell]].<br />The free [[Microsoft Visual Studio|Microsoft Visual Studio .NET]] [[development environment]] is often used.

Getting started:
*''Add links here!''

Libraries:
* [http://dotnetwikibot.sourceforge.net/ DotNetWikiBot Framework] - a clean, full-featured C# [[API]], compiled as a [[DLL]] library, that makes it easy to build programs and web robots for managing information on MediaWiki-powered sites. Detailed documentation is available.
** [http://sourceforge.net/project/showfiles.php?group_id=158332 WikiFunctions .NET library] - Bundled with [[WP:AWB|AWB]], is a library of stuff useful for bots, such as generating lists, loading/editing articles, connecting to the recent changes IRC channel and more.
* [http://sourceforge.net/projects/wikiaccess WikiAccess library]
* [http://code.google.com/p/mediawikiengine/ MediaWikiEngine], used by [http://code.google.com/p/wikimediacommonplace/ Commonplace upload tool]
* [http://code.google.com/p/tyngmediawiki Tyng.MediaWiki class library], a MediaWiki API written in C# used by [[User:NrhpBot|NrhpBot]]

====[[Java (programming language)|Java]]====
Generally developed with an IDE, such as [[Eclipse (software)|Eclipse]]; development using a command line console (with the [[javac]] and java programs) is also an option.

Getting started:
* [http://jwbf.sourceforge.net/getting-started/ Getting started]

Libraries:
* [http://jwbf.sourceforge.net/ Java Wiki Bot Framework]
* [[User:MER-C/Wiki.java]]

==== [[Ruby (programming language)|Ruby]] ====
[http://www.rwikibot.net RWikiBot] is a Ruby framework for writing bots. Currently, it is under development and looking for contributors. It uses MediaWiki's official [http://www.mediawiki.org/wiki/API API], and as such is limited in certain capabilities.

Libraries:
* [http://www.rwikibot.net RWikiBot]

==== [[Chicken (Scheme implementation)|Chicken Scheme]] ====
Iron Chicken is an extension or "egg" for Chicken Scheme that makes the Mediawiki [http://www.mediawiki.org/wiki/API API] programmable using [[s-expressions]], and presents API and HTML output as [[SXML]] which can be queried easily.

A simple example that gets members of a category and writes them to a page in the client user's userspace is:
* [[User:Tony Sidaway/scripts/categorymembers]]

Libraries:
* [http://www.call-with-current-continuation.org/eggs/irnc-base.html irnc-base]
[[fr:Wikipédia:Créer un bot]]

[[Category:Wikipedia bots]]

Revision as of 05:03, 11 October 2008

Robots or bots are automatic processes which interact with Wikipedia as though they were human editors. This page attempts to explain how to carry out the development of a bot for use on Wikipedia. The explanation is geared mainly towards those who have some prior programming experience, but are unsure of how to apply this knowledge to creating a Wikipedia bot.

Why would I need to create a bot?

Bots can automate tasks and perform them much faster than humans. If you have a simple task which you need to perform lots of times (an example might be to add a template to all pages in a category with 1000 pages) then this is a task better suited to a bot than a human.

Considerations before creating a bot

There are already a number of bots running on Wikipedia. Many of these bots publish their source code, which can sometimes be reused with little additional development time. In addition, there are a number of semi-bots available to anyone. Most of these take the form of enhanced web browsers with Wikipedia-specific functionality. The most popular of these is AWB; see Wikipedia:Tools/Editing tools for a complete list.

If you have no previous programming experience, it may be simpler to ask an existing bot to do the job, or ask others develop a bot for you. These requests can be made at Wikipedia:Bot requests. If you wish to write a new bot anyway, be aware that learning a programming language is a non-trivial task. However, it is not black magic – anyone can learn how to program with sufficient time and effort. Good luck!

If you decide to create a bot, planning is crucial to obtain an error-free, efficient, and effective program. The following initial considerations are important:

  • Will the bot be manually assisted or fully automated?
  • Will you create the bot alone, or with the help of other programmers?
  • What language will be used to implement the bot?
  • Will the bot's requests, edits, or other actions be logged? If so, will the logs be stored on local media, or on wiki pages?
  • Will the bot run inside a web browser (for example, written in Javascript), or will it be a standalone program?
  • If the bot is a standalone program, will it run on your local computer, or on a remote server such as the Wikimedia Toolsever?
  • If the bot runs on a remote server, will other editors be able to operate the bot or start it running?

How does a Wikipedia bot work?

Overview of operation

Just like a human editor, a wikipedia bot reads wikipedia pages, and makes changes where it thinks changes need to be made. The difference is that although bots are faster and less prone to fatigue than humans, they are nowhere near as bright as we are. Bots are good at repetitive tasks that have easily defined patterns, where few decisions have to be made.

In the most typical case, a bot would log in to its own account and request pages from Wikipedia just the way a browser does - though of course it never displays the page, but works on it in memory - and then programmatically examine the page code to see if any changes need to be made. It would then make whatever edits it was designed to do and submit the edits, again using the same codes a browser would use. This method, often called screen scraping, uses the standard HTTP GET protocol: whenever you see /w/index.php?...=...&...=... in the browser address bar, everything after the question mark is variables and data sent by the GET method. There are also a handful of other Application Programming Interfaces, described below, for getting pages and sending edits to and from Wikipedia.

Because bots access pages the same way people do, bots can experience in the same kind of difficulties that human users do. They can get caught in edit conflicts, have page timeouts, or run across other unexpected complications while requesting pages or making edits. Because the volume of work done by a bot is larger than that done by a live person, the bot is more likely to encounter these issues. Thus, it is important to consider these situations when writing a bot.

APIs for bots

In order to make changes to Wikipedia pages, a bot necessarily has to retrieve pages from Wikipedia and send edits back. There are several APIs available for that purpose.

  • MediaWiki API (api.php). This library was specifically written to permit automated processes such as bots make queries and post changes. Data is available in many different machine-readable formats (JSON, XML, YAML,...). Features have been fully ported from the older Query API interface.
    Status: Available on all Wikimedia projects, with a very complete set of queries. The ability to edit pages via API.php has also been enabled on all Wikimedia projects, enabling bots to operate entirely without screen scraping.
  • Screen scraping (index.php). Screen scraping, as mentioned above, involves requesting a Wikipedia page, looking at the raw HTML code (what you would see if you clicked View->Source in most browsers), and then analyzing the HTML for patterns. There are certain problems with this approach: the wikipedia interface can change without notice, which may break the bot code, and calling for HTML creates a larger server load than processing the wikitext itself. You can include an action=render GET request -w/index.php?title=Wikipedia:...&action=render - when you call the page to produce a stripped-down version of the page (without the Wikipedia sidebars and tabs) which reduces the amount of transferred data and eases the effects of changes in the user interface. Other parameters of index.php may be useful: see the partial list at Manual:Parameters to index.php. There are very few reasons to use this technique anymore and it is mainly used by older bot frameworks written before the API had as many features.
    Status: Deprecated.
  • Raw (wikitext) page processing: sending an action=raw or an action=raw&templates=expand GET request to index.php will give the unprocessed wikitext source code of a page.
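As a rough illustration of the index.php requests described in the list above, the following Python sketch only builds the URLs; it does not fetch anything. The host name in INDEX_URL and the helper names are assumptions made for this example, not part of any bot framework.

```python
from urllib.parse import urlencode

# Assumed entry point for the English Wikipedia; other wikis use other hosts.
INDEX_URL = "https://en.wikipedia.org/w/index.php"


def index_url(title, action, **extra):
    """Build an index.php URL for the given page title and action."""
    params = {"title": title, "action": action}
    params.update(extra)
    return INDEX_URL + "?" + urlencode(params)


def raw_url(title, expand_templates=False):
    """URL for the unprocessed wikitext of a page (action=raw)."""
    if expand_templates:
        return index_url(title, "raw", templates="expand")
    return index_url(title, "raw")


def render_url(title):
    """URL for the stripped-down HTML rendering of a page (action=render)."""
    return index_url(title, "render")
```

Keeping URL construction in one helper means the title is always percent-encoded the same way, whichever action is requested.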

Some Wikipedia web servers are configured to grant requests for compressed (gzip) content. This can be done by including a line "Accept-Encoding: gzip" in the HTTP request header; if the HTTP reply header contains "Content-Encoding: gzip", the document is in gzip form, otherwise, it is in the regular uncompressed form. Note that this is specific to the web server and not to the MediaWiki software. Other sites employing MediaWiki may not have this feature.
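The header negotiation described above might look like the following minimal Python sketch. The function names and the user agent string are hypothetical; the point is only that the bot asks for gzip and then checks the reply header before deciding whether to decompress.

```python
import gzip
import urllib.request


def build_request(url, user_agent="ExampleBot/0.1 (hypothetical)"):
    """Create a request that tells the server gzip content is acceptable."""
    req = urllib.request.Request(url)
    req.add_header("Accept-Encoding", "gzip")
    req.add_header("User-Agent", user_agent)
    return req


def decode_body(content_encoding, body):
    """Decompress the body only if the reply says it is gzip-encoded."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    return body  # regular uncompressed reply


# Possible usage (performs a network request, so it is left commented out):
# with urllib.request.urlopen(build_request(some_url)) as resp:
#     data = decode_body(resp.headers.get("Content-Encoding"), resp.read())
```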

Logging in

Approved bots need to be logged in to make edits. Although a bot can make read requests without logging in, bots that have completed testing should log in for all activities. Bots logged in from an account with the bot flag can obtain more results per query from the MediaWiki API (api.php).

For security, login data must be passed using the HTTP POST method. Because parameters of HTTP GET requests are easily visible in the URL, logins via GET are disabled.

To log a bot in using MediaWiki API, use this URL and POST data:

This will return a result (success or error) in XML form, as documented at mw:API:Login. Other output formats are available.

A successful login attempt will result in the Wikimedia server setting several HTTP cookies. The bot must save these cookies and send them back every time it makes a request (this is particularly crucial for editing). On the English Wikipedia, the following cookies should be used: enwikiUserID, enwikiToken, and enwikiUserName. The enwiki_session cookie is required to actually send an edit or commit some change; otherwise, the MediaWiki:Session fail preview error message will be returned.
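One way to handle the POST login and the cookies is sketched below in Python, using only the standard library. The endpoint URL and the parameter names lgname and lgpassword are assumptions based on mw:API:Login and should be checked against that page, since the login procedure may require additional steps. The sketch builds the request and the cookie-aware opener but does not contact the server.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Assumed API endpoint for the English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def make_session():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(username, password):
    """Build (but do not send) a POST request carrying the login data."""
    data = urllib.parse.urlencode({
        "action": "login",
        "lgname": username,      # assumed parameter name, see mw:API:Login
        "lgpassword": password,  # assumed parameter name, see mw:API:Login
        "format": "xml",
    }).encode("utf-8")
    # Supplying data= makes urllib send the request with the POST method,
    # so the credentials never appear in the URL.
    return urllib.request.Request(API_URL, data=data)


# Possible usage (contacts the server, so it is left commented out):
# opener, jar = make_session()
# reply = opener.open(login_request("ExampleBot", "secret")).read()
# Later requests made through the same opener send the saved cookies back.
```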

Editing; edit tokens

Wikipedia uses a system of edit tokens for making edits to Wikipedia pages, as well as some other operations such as rollback. The token looks like a long hexadecimal number followed by '+\':

b66655fjr7fd5drr3411ss23456s65eg+\

The role of edit tokens is to prevent edit hijacking, where users are tricked into making an edit by clicking a single link.

The editing process involves two HTTP requests. First, a request for an edit token must be made. Then, a second HTTP request must be made that sends the new content of the page along with the edit token just obtained. It is not possible to make an edit in a single HTTP request.

To obtain an edit token, follow these steps:

  • MediaWiki API (api.php). Make a request with the following parameters (see mw:API:Edit - Create&Edit pages).
    • action=query
    • prop=info
    • titles=PAGENAME
    • intoken=edit

    The token will be returned in the edittoken attribute of the response.

  • If the edit token the bot receives does not have the hexadecimal string (i.e., the edit token is just '+\'), then the bot most likely is not logged in. This might be due to a number of factors: a failure in authentication with the server, a dropped connection, a timeout of some sort, or an error in storing or returning the correct cookies. It might be a programming error, or it may simply require logging in again to refresh the login cookies.
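The token request and the logged-out check described above can be sketched in Python as follows. The endpoint URL, the choice of format=json, and the exact nesting of the JSON reply (query, then pages, then one entry per page) are assumptions for this example; consult mw:API:Edit for the authoritative response layout.

```python
import json
from urllib.parse import urlencode

# Assumed API endpoint for the English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def token_query_url(title):
    """Build the query URL that asks for an edit token for one page."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": title,
        "intoken": "edit",
        "format": "json",  # assumption: JSON output is requested
    }
    return API_URL + "?" + urlencode(params)


def extract_edit_token(response_text):
    """Return the edittoken attribute from a reply with the assumed layout."""
    pages = json.loads(response_text)["query"]["pages"]
    for page in pages.values():
        token = page.get("edittoken")
        if token is not None:
            return token
    raise KeyError("no edittoken attribute in response")


def is_logged_in_token(token):
    """A bare '+\\' token suggests the bot is not logged in."""
    return token.endswith("+\\") and len(token) > 2
```

A bot that sees is_logged_in_token return False would normally log in again and request a fresh token before attempting the edit.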

    Edit conflicts

    Edit conflicts occur when multiple, overlapping edit attempts are made on the same page. Almost every bot will eventually get caught in an edit conflict of one sort or another, and should include some mechanism to test for and accommodate these issues.

    Bots that use the MediaWiki API (api.php) should set the basetimestamp attribute, and check the server responses for indications of errors. For more details, see mw:API:Edit - Create&Edit pages.

    Generally speaking, if an edit fails to complete the bot should check the page again before trying to make a new edit, to make sure the edit is still appropriate. Further, if a bot rechecks a page to resubmit a change, it should be careful to avoid any behavior that could lead to an infinite loop and any behavior that could even resemble edit warring.
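The advice above can be expressed as a small bounded retry loop. This Python sketch is only an outline: the three callables (fetch_page, compute_change, submit_edit) are hypothetical placeholders for whatever the bot uses to read a page, decide on a change, and send the edit with its basetimestamp.

```python
def edit_with_retry(fetch_page, compute_change, submit_edit, max_attempts=3):
    """Try an edit a limited number of times, rechecking the page each time.

    fetch_page()               -> (text, timestamp) of the current revision
    compute_change(text)       -> new text, or None if no edit is needed
    submit_edit(new_text, ts)  -> True on success, False on conflict or error
    """
    for _ in range(max_attempts):
        # Always re-read the page so the decision is based on current content.
        text, timestamp = fetch_page()
        new_text = compute_change(text)
        if new_text is None or new_text == text:
            return "not needed"  # someone else already fixed it, or nothing to do
        if submit_edit(new_text, timestamp):
            return "saved"
    # The hard limit prevents infinite loops and anything resembling edit warring.
    return "gave up"
```

Because the page is fetched again on every attempt, a change that another editor has already made (or reverted) is noticed before the bot resubmits.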

    Overview of the process of developing a bot

    Actually coding or writing a bot is only one part of developing a bot. You should generally follow the development cycle below. Failure to comply with this development cycle, particularly the sections on Wikipedia bot policy, may lead to your bot failing to be approved or being blocked from editing Wikipedia.

    Overview of Wikipedia bot development cycle

    Software elements analysis:

    • The first task in creating a Wikipedia bot is extracting the requirements or coming up with an idea. If you don't have an idea of what to write a bot for, you could pick up ideas at requests for work to be done by a bot.
    • Make sure an existing bot isn't already doing what you think your bot should do. To see what tasks are already being performed by a bot, see the list of currently operating bots.

    Specification:

    • Specification is the task of precisely describing the software to be written, possibly in a rigorous way. You should come up with a detailed proposal of what you want it to do. Try to discuss this proposal with some editors and refine it based on feedback. Even a great idea can be made better by incorporating ideas from other editors.
    • In the most basic form, your specified bot must meet the following criteria:
      • The bot is harmless (it must not make edits that could be considered vandalism),
      • The bot is useful (it provides a useful service more effectively than a human editor could), and
      • The bot does not waste server resources.

    Software architecture:

    • Think about how you might create it and which programming language and tools you would use. Architecture is concerned with making sure the software system will meet the requirements of the product as well as ensuring that future requirements can be addressed. There are different types of bots and the main body of the article below will cover this technical side.

    Implementation:

    Implementation (or coding) involves reducing design to code. It may be the most obvious part of the software engineering job but it is not necessarily the largest portion. In the implementation stage you should:

    • Create a user page for your bot. Your bot's edits must not be made under your own account. Your bot will need its own account with its own username and password.
    • Add these details to your proposal and post it to requests for bot approval
    • Add the same information to the user page of the bot. You should also add a link to the approval page (whether approved or not) for each function. People will comment on your proposal and it will be either accepted or rejected.
    • Code your bot in your chosen programming language.

    Testing:

    If the proposal is accepted, the bot will probably be given a trial period during which it may be run to fine-tune it and iron out any bugs. You should test your bot widely and ensure that it works correctly. At the end of the trial period, it will hopefully be approved.

    Documentation:

    An important (and often overlooked) task is documenting the internal design of your bot for the purpose of future maintenance and enhancement. This is especially important if you are going to allow clones of your bot. Ideally, you should post up the source code of your bot on its userpage if you want others to be able to run clones of it. This code should be well documented for ease of use.

    Software training and support:

    You should be ready to field queries or objections to your bot on your user talk page.

    Maintenance:

    Maintaining and enhancing your bot to cope with newly discovered problems or new requirements can take far more time than the initial development of the software. Not only may it be necessary to add code that does not fit the original design but just determining how software works at some point after it is completed may require significant effort.

    • If you want to make a major functionality change to your bot in the future, you should request this as above using the requests for bot approval.

    General guidelines for running a bot

    In addition to the official bot policy, which covers the main points to consider when developing your bot, there are a number of more general advisory points to keep in mind.

    Bot best practices

    • Use the maxlag parameter with a maximum lag of 5 seconds. This will enable the bot to run quickly when server load is low, and throttle the bot when server load is high.
      • If writing a bot in a framework that does not support maxlag, limit the total requests (read and write requests together) to no more than 10/minute.
    • Use the API whenever possible, and set the query limits to the largest values that the server permits, to minimize the total number of requests that must be made.
    • Edit (write) requests are more expensive in server time than read requests. Be edit-light and design your code to keep edits to a minimum.
      • Try to consolidate edits. One single large edit is better than 10 smaller ones.
    • Do not make multi-threaded requests. Wait for one server request to complete before beginning another.
    • Back off upon receiving errors from the server. Errors such as squid timeouts are often an indication of heavy server load. Use a sequence of increasingly longer delays between repeated requests.
    • Make use of the Assert Edit extension, an extension explicitly designed for bots to check certain conditions, which is enabled on Wikipedia.
    • Test your code thoroughly before making large automated runs. Individually examine all edits on trial runs to verify they are perfect.
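Two of the practices above, sending maxlag with every request and backing off after server errors, could be sketched in Python like this. The helper names and the default delay values are illustrative choices, and treating OSError as the signal of a server-side failure is a simplifying assumption for the example.

```python
import time


def with_maxlag(params, seconds=5):
    """Return a copy of the request parameters with maxlag added."""
    out = dict(params)
    out["maxlag"] = str(seconds)
    return out


def backoff_delays(base=5.0, factor=2.0, attempts=5, cap=300.0):
    """Yield increasingly longer delays, never exceeding the cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def call_with_backoff(request, sleep=time.sleep, **delay_options):
    """Call request() sequentially, waiting longer after each failure.

    request() is a hypothetical callable that returns a result or raises
    OSError (assumed here to represent a timeout or server error).
    """
    last_error = None
    for delay in backoff_delays(**delay_options):
        try:
            return request()
        except OSError as err:
            last_error = err
            sleep(delay)  # back off before the next attempt
    raise RuntimeError("request failed after repeated attempts") from last_error
```

The sleep function is passed in as a parameter so that the retry logic can be exercised in tests without actually waiting.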

    Common bot features you should consider implementing

    Manual assistance

    If your bot is doing anything that requires judgement or evaluation of context (e.g., correcting spelling), then you should consider making your bot manually assisted, that is, having it make no edits without human confirmation.
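A manually assisted bot can be as simple as a confirmation prompt in front of every save. The following Python sketch assumes a command-line bot; the function name and the prompt wording are hypothetical.

```python
def confirm_edit(title, old, new, ask=input):
    """Show the proposed change and return True only if the operator agrees."""
    answer = ask(f"Apply change to {title!r}? {old!r} -> {new!r} [y/N] ")
    # Anything other than an explicit yes is treated as a refusal.
    return answer.strip().lower() in ("y", "yes")
```

Defaulting to "no" means that an accidental key press or an empty reply never results in an edit.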

    Disabling the bot

    It is good bot policy to have a feature to disable the bot's operation if it is requested. You should probably have the bot refuse to run if a message has been left on its talk page, on the assumption that the message may be a complaint against its activities. This can be checked by looking for the "You have new messages..." banner in the HTML for the edit form. Remember that if your bot goes bad, it is your responsibility to clean up after it! You can also set up a control page that contains the word True and have the bot stop as soon as that text is changed. This can be done by fetching and checking the page before each edit.
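Both checks described above can be combined into one test that runs before each edit. In this Python sketch the bot is assumed to have already fetched the text of its control page and the HTML of the edit form; the function names are hypothetical, and the banner text is matched exactly as quoted above.

```python
def bot_enabled(control_page_text):
    """The bot may run only while the control page still says True."""
    return control_page_text.strip() == "True"


def has_new_messages(edit_form_html):
    """Detect the new-messages banner in the HTML of the edit form."""
    return "You have new messages" in edit_form_html


def may_edit(control_page_text, edit_form_html):
    """Combine both checks; call this before every edit."""
    return bot_enabled(control_page_text) and not has_new_messages(edit_form_html)
```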

    Signature

    Just like a human, if your bot makes edits to a talk page on Wikipedia, it should sign its post with four tildes (~~~~). It should not sign any edits to text in the main namespace.
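The signing rule can be applied automatically if the bot knows the namespace number of the page it is editing. This Python sketch relies on the MediaWiki convention that talk namespaces have odd numbers (1 for Talk, 3 for User talk, and so on) while the main namespace is 0; the function names are hypothetical.

```python
def is_talk_namespace(namespace_number):
    """Talk namespaces use odd numbers in the MediaWiki numbering scheme."""
    return namespace_number >= 0 and namespace_number % 2 == 1


def sign_if_talk(text, namespace_number):
    """Append the four-tilde signature on talk pages only."""
    stripped = text.rstrip()
    if is_talk_namespace(namespace_number) and not stripped.endswith("~~~~"):
        return stripped + " ~~~~"
    return text  # non-talk pages (including articles) are left unsigned
```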

    Programming languages and libraries

    Bots can be written in almost any programming language. The choice of a language often depends on the experience of the bot writer (which languages are familiar) or on the availability of pre-developed libraries to perform the desired task. The following list includes some languages that have libraries to assist with bot tasks.

    Perl

    Perl has a run-time compiler. This means that it is not necessary to compile builds of your code yourself as it is with other programming languages. Instead, you simply create your program using a text editor such as gvim. You then run the code by passing it to an interpreter. This can be located either on your own computer or on a remote computer (webserver). If located on a webserver, you can start your program running and interface with your program while it is running via the Common Gateway Interface from your browser. Perl is available for most operating systems, including Microsoft Windows (which most human editors use) and UNIX (which many webservers use). If your internet service provider provides you with webspace, the chances are good that you have access to a Perl build on the webserver from which you can run your Perl programs.

    Guides to getting started with Perl programming:

    Libraries:

    • MediaWiki::API - A Perl module for interfacing with the MediaWiki API, to allow information retrieval, editing, and file upload/download.
    • Anura -- Perl interface to MediaWiki using libwww-perl. Not recommended, as the current version does not check for edit conflicts.
    • WWW::Mediawiki::Client -- Perl module and command line client
    • WWW::Wikipedia -- Perl module for interfacing with Wikipedia
    • Perl Wikipedia ToolKit -- Perl modules for parsing wikitext and extracting data
    • perlwikipedia - A fairly-complete Wikipedia bot framework written in Perl.
    • MediaWiki CPAN Package by Edward Chernenko - has a rich API, but also several critical bugs.
    • Mediawiki::API - a library by CBM with robust automatic error handling and wrappers for many common API.php uses. This is not the same as the library on CPAN.

    PHP

    PHP can also be used for programming bots. PHP is an especially good choice if you wish to provide a webform-based interface to your bot. For example, suppose you wanted to create a bot for renaming categories. You could create an HTML form into which you will type the current and desired names of a category. When the form is submitted, your bot could read these inputs, then edit all the articles in the current category and move them to the desired category. (Obviously, any bot with a form interface would need to be secured somehow from random web surfers.)

    To log in your bot, you will need to know how to use PHP to send and receive cookies; to edit with your bot, you will need to know how to send form variables. Libraries like Snoopy simplify such actions.

    Libraries:

    Other pages:

    Python

    Python is a popular interpreted language with object-oriented features.

    Getting started with Python:

    Libraries:

    Microsoft .NET

    Languages include C#, Managed C++, Visual Basic .NET, J#, JScript .NET, IronPython, and Windows PowerShell.
    The free Microsoft Visual Studio .NET development environment is often used.


    Libraries:

    Java

    Generally developed with an IDE, such as Eclipse; development using a command line console (with the javac and java programs) is also an option.

    Getting started:

    Libraries:

    Ruby

    RWikiBot is a Ruby framework for writing bots. Currently, it is under development and looking for contributors. It uses MediaWiki's official API, and as such is limited in certain capabilities.

    Libraries:

    Chicken Scheme

    Iron Chicken is an extension or "egg" for Chicken Scheme that makes the MediaWiki API programmable using s-expressions, and presents API and HTML output as SXML, which can be queried easily.

    A simple example that gets members of a category and writes them to a page in the client user's userspace is:

    Libraries: