How to list all search engine spiders

Some of your website's visitors are automated bots (spiders). They download your pages, index the text, and use it to build search engine results. Well-behaved robots honor the robots.txt file stored in the root of your site; if you want to learn more about it, see the official robots.txt pages. Because most bots download /robots.txt before they request anything else, those log entries are one easy way to build a list of the search engines visiting you.
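
The commands below extract the User-Agent with cut -d\" -f6, which assumes the widely used "combined" log format, where the User-Agent is the last quoted field of each line; log.txt stands for your access log file in every example here. A quick way to check your own format:

head -n 1 log.txt

A combined-format line looks roughly like this (illustrative values only):

203.0.113.5 - - [10/Oct/2008:13:55:36 +0200] "GET /robots.txt HTTP/1.1" 200 124 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"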

Using the robots.txt file

After executing this short command, you'll see all the User-Agent strings that requested your robots.txt file:

grep "\"GET /robots.txt " log.txt | cut -d\" -f6 | sort | uniq

If you want to sort the user agents by how active they are, add a couple more commands:

grep "\"GET /robots.txt " log.txt | cut -d\" -f6 | sort \
| uniq -c | sort -rn
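
If the log covers a long period the list can get long; a small, optional addition keeps only the busiest agents, for example the top ten:

grep "\"GET /robots.txt " log.txt | cut -d\" -f6 | sort \
| uniq -c | sort -rn | head -n 10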

From the User-Agent strings you can recognize the most popular bots and search engines.

For example, look for these strings (a quick counting sketch follows the list):

  • Google – Googlebot/
  • Yahoo – Yahoo! Slurp
  • MSN – msnbot/
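
Counting each bot's requests gives a rough comparison between them. A minimal sketch, using grep -c and the User-Agent substrings listed above:

for bot in "Googlebot/" "Yahoo! Slurp" "msnbot/"; do
    printf "%s\t%s\n" "$bot" "$(grep -c "$bot" log.txt)"
done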

Using the User-Agent string

If you are looking for one specific search engine spider, you can filter on its User-Agent string. Choose one of the strings you found earlier (for example Googlebot) and use it like this:

grep " Googlebot/" log.txt

You will get all requests made by Google's crawler. With a little more work, you can also list every page downloaded by Google, along with how many times each page was requested:

grep " Googlebot/" log.txt | cut -d\" -f2 | cut -d" " -f2 \
| sort | uniq -c | sort -rn
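
If you only care about pages Googlebot actually received (HTTP status 200), you can also filter on the status code. A sketch assuming the combined log format mentioned earlier, where the status code comes right after the quoted request:

grep " Googlebot/" log.txt | awk -F'"' '$3 ~ /^ 200 / {print $2}' \
| cut -d" " -f2 | sort | uniq -c | sort -rn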