Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google use them to index web content, spammers use them to scan for email addresses, and they have many other uses.
It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html . Before it does so, it first checks for http://www.example.com/robots.txt and obeys the rules it finds there.
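In code, a polite crawler might perform that check with Python's standard urllib.robotparser module; the sketch below assumes a made-up crawler name, "MyCrawler":

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()  # download and parse the site's robots.txt

# Ask whether this particular robot may fetch the page before requesting it.
if parser.can_fetch("MyCrawler", "http://www.example.com/welcome.html"):
    print("allowed to crawl the page")
else:
    print("blocked by robots.txt")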
The simplest robots.txt file uses two rules:
• User-agent: the robot the following rule applies to
• Disallow: the URL you want to block
Disallow indexing of everything
User-agent: *
Disallow: /
The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Allow indexing of everything
User-agent: *
Disallow:
Disallow indexing of a specific folder
User-agent: *
Disallow: /folder/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
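Checking this configuration with the same parser shows how different robots read the same file differently ("OtherBot" is a made-up name standing in for any other crawler):

from urllib import robotparser

# Googlebot gets an empty Disallow (everything permitted); all other robots are blocked.
parser = robotparser.RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
])

print(parser.can_fetch("Googlebot", "http://www.example.com/page.html"))  # True
print(parser.can_fetch("OtherBot", "http://www.example.com/page.html"))   # False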
Crawl-delay directive
Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server:
User-agent: *
Crawl-delay: 10
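urllib.robotparser can report this value as well (crawl_delay() is available from Python 3.6 onward), so a crawler might pace itself roughly like this:

import time
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 10"])

delay = parser.crawl_delay("MyCrawler")  # 10 here; None if the file sets no delay
urls = ["http://www.example.com/a.html", "http://www.example.com/b.html"]

for url in urls:
    print("would fetch", url)  # placeholder for the actual request
    if delay:
        time.sleep(delay)  # be polite: pause between requests to the same server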
Allow directive
Some major crawlers support an Allow directive which can counteract a following Disallow directive. This is useful when one tells robots to avoid an entire directory but still wants some HTML documents in that directory crawled and indexed.
Disallow Googlebot from indexing a folder, except for one file in that folder
User-agent: Googlebot
Allow: /folder1/myfile.html
Disallow: /folder1/
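Python's urllib.robotparser understands Allow lines and applies the first matching rule, so placing the Allow line first (as above) also keeps the example verifiable with a quick sketch:

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Allow: /folder1/myfile.html",
    "Disallow: /folder1/",
])

# The folder as a whole is blocked, but the explicitly allowed file is not.
print(parser.can_fetch("Googlebot", "http://www.example.com/folder1/other.html"))   # False
print(parser.can_fetch("Googlebot", "http://www.example.com/folder1/myfile.html"))  # True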
Sitemap
Some crawlers support a Sitemap directive, allowing multiple Sitemaps to be listed in the same robots.txt file, in the form:
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
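urllib.robotparser exposes these URLs too; note that site_maps() requires Python 3.8 or later (a minimal sketch):

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.parse([
    "Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml",
    "Sitemap: http://www.google.com/hostednews/sitemap_index.xml",
])

# Returns a list of the Sitemap URLs found, or None if there were none.
print(parser.site_maps())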
Host
Some crawlers (Yandex, Google) support a Host directive, allowing websites with multiple mirrors to specify their preferred domain.
Host: example.com
(Or)
Host: www.example.com
This directive is not supported by all crawlers; if used, it should be placed at the bottom of the robots.txt file, after the Crawl-delay directive.
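urllib.robotparser has no accessor for Host, so a crawler that wants to honour it has to pick the line out itself; a minimal manual sketch:

# Host is non-standard, so this is a manual parse rather than a urllib.robotparser feature.
robots_txt = """
User-agent: *
Crawl-delay: 10
Host: www.example.com
"""

preferred_host = None
for line in robots_txt.splitlines():
    if line.lower().startswith("host:"):
        preferred_host = line.split(":", 1)[1].strip()

print(preferred_host)  # www.example.com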