Monday, August 5, 2013

Web Robots

Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt to see what it is allowed to crawl.
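
A polite crawler can make this check itself. Here is a minimal sketch using Python's standard-library urllib.robotparser module (the user-agent name MyBot is just a placeholder):

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (a missing file means everything is allowed).
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether a crawler identifying itself as "MyBot" may visit the page.
if rp.can_fetch("MyBot", "http://www.example.com/welcome.html"):
    print("OK to crawl")
else:
    print("Blocked by robots.txt")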

The simplest robots.txt file uses two rules:

•   User-agent: the robot that the following rules apply to
•   Disallow: the URL path you want to block

Disallow indexing of everything

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

Allow indexing of everything

User-agent: *
Disallow:

Disallow indexing of a specific folder

User-agent: *
Disallow: /folder/

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
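
Fed to Python's urllib.robotparser, the two sections above give opposite answers depending on which robot is asking (the page path is made up):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/page.html"))  # True  -- matches its own section
print(rp.can_fetch("BadBot", "/page.html"))     # False -- falls through to *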

Crawl-delay directive

Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server:

User-agent: *
Crawl-delay: 10
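
A crawler can honor this delay before each request. Python's urllib.robotparser exposes the value through crawl_delay() (Python 3.6 and later; the bot name and page paths here are placeholders):

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 10"])

delay = rp.crawl_delay("MyBot") or 0    # None when no Crawl-delay applies
for path in ["/a.html", "/b.html"]:     # placeholder pages on the same server
    if rp.can_fetch("MyBot", path):
        print("fetching", path)
        time.sleep(delay)               # wait 10 seconds between requests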

Allow directive

Some major crawlers support an Allow directive, which can counteract a Disallow directive. This is useful when you want robots to avoid an entire directory but still have some HTML documents in that directory crawled and indexed.

Disallow Googlebot from indexing a folder, except for one file in that folder

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html
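
One caveat when testing this with Python's urllib.robotparser: Python applies the first rule whose path matches, while Googlebot picks the most specific match regardless of order. Listing Allow before Disallow makes both interpretations agree, as in this sketch:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Allow: /folder1/myfile.html
Disallow: /folder1/
"""
rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/folder1/myfile.html"))  # True
print(rp.can_fetch("Googlebot", "/folder1/other.html"))   # False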

Sitemap

Some crawlers support a Sitemap directive, which allows multiple Sitemaps to be listed in the same robots.txt, in the form:

Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
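
Python's urllib.robotparser (3.8 and later) collects these entries via site_maps(); a minimal sketch using the two example URLs above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml",
    "Sitemap: http://www.google.com/hostednews/sitemap_index.xml",
])

print(rp.site_maps())
# ['http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
#  'http://www.google.com/hostednews/sitemap_index.xml']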

Host

Some crawlers (Yandex, Google) support a Host directive, allowing websites with multiple mirrors to specify their preferred domain.

Host: example.com
(Or)
Host: www.example.com

This directive is not supported by all crawlers. If used, it should be inserted at the bottom of the robots.txt file, after the Crawl-delay directive.

