Powerful tips to control your computer and website
How to list all search engine spiders
Some of your website's visitors are automated bots (spiders). They download pages of your site and index the text so it can be displayed in search engine results. Well-behaved robots honor the robots.txt file stored in the root of your site; if you want to learn more about robots.txt, see the Official Robots.txt pages. Because most of them download /robots.txt before they request any other files, this gives you one way to build a list of search engine spiders.
Using the robots.txt file
After running this short command, you'll see all the User-Agent strings that requested your robots.txt file:
grep "\"GET /robots.txt " log.txt | cut -d\" -f6 | sort | uniq
If you want to sort the User-Agents by how active they are, you can add a few more commands:
grep "\"GET /robots.txt " log.txt | cut -d\" -f6 | sort \ | uniq -c | sort -rn
From the User-Agent string you can identify the most popular bots and search engines. For example, look for these strings:
- Google – Googlebot/
- Yahoo – Yahoo! Slurp
- MSN – msnbot/
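If you just want a quick activity count for these three bots, a small loop works too. This is only a sketch, assuming the same log.txt file used above:

# count how many log lines match each bot's User-Agent substring
for bot in "Googlebot/" "Yahoo! Slurp" "msnbot/"; do
    printf '%-15s %s\n' "$bot" "$(grep -c "$bot" log.txt)"
done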
Using the User-Agent string
If you are looking for one specific search engine spider, you can filter by its User-Agent. Pick one of the strings you found (for example Googlebot) and use it like this:
grep " Googlebot/" log.txt
You will get every request made by the Google search engine. With a little more work, you can list all the pages downloaded by Google, together with how many times each page was requested:
grep " Googlebot/" log.txt | cut -d\" -f2 | cut -d" " -f2 \ | sort | uniq -c | sort -rn