This is a mirror of official site: http://jasper-net.blogspot.com/

Create Your Own Googlebot

Sunday, August 21, 2011
Introduction

Googlebot finds and indexes the web through hyperlinks: it reaches new pages by following links on pages it has already visited. My searchbot (Xearchbot) can find sites and store each site's URL, title, keywords metatag and description metatag in a database. In the future, it will also store the body of the document converted to plain text. I don't calculate PageRank, because it is very time-consuming.
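To illustrate what "storing the URL, title, keywords metatag and description metatag" involves, here is a minimal Python sketch using the standard-library HTML parser. It is not the article's actual code (Xearchbot's source is on CodeProject); the class and field names are my own illustration.

```python
from html.parser import HTMLParser

class PageInfoParser(HTMLParser):
    """Collects the <title>, keywords metatag and description metatag
    of a page, i.e. the fields the searchbot stores in its database."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            name = (a.get("name") or "").lower()
            if name == "keywords":
                self.keywords = a.get("content", "")
            elif name == "description":
                self.description = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A tiny inline page instead of a real download, to keep the sketch offline.
html = """<html><head><title>Example</title>
<meta name="keywords" content="bot, crawler" />
<meta name="description" content="A demo page." />
</head><body></body></html>"""
p = PageInfoParser()
p.feed(html)
print(p.title, p.keywords, p.description, sep=" | ")
```

In a real crawler the `html` string would come from an HTTP download, and the three fields would be written to the database together with the page's URL.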

Before Googlebot downloads a page, it downloads the site's robots.txt file. This file tells bots which folders they may visit and which they mustn't. This is an example of the file's contents:

# All bots can't go to folder "nobots":
User-Agent: *
Disallow: /nobots

# But, ExampleBot can go to all folders:
User-Agent: ExampleBot
Disallow:

# BadBot can't go to "nobadbot1" and "nobadbot2" folders:
User-Agent: BadBot
Disallow: /nobadbot1
Disallow: /nobadbot2
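Rules like the ones above don't have to be parsed by hand; Python's standard library ships `urllib.robotparser` for exactly this. The sketch below feeds it the example rules directly (a real bot would call `set_url(...)` and `read()` against the live site) and checks which bot may fetch which path:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse the example rules inline to keep the sketch offline;
# normally: rp.set_url("http://example.com/robots.txt"); rp.read()
rp.parse("""
User-Agent: *
Disallow: /nobots

User-Agent: ExampleBot
Disallow:
""".splitlines())

# A generic bot falls under "*" and is kept out of /nobots:
print(rp.can_fetch("SomeBot", "http://example.com/nobots/page.html"))
# ExampleBot has its own entry with an empty Disallow, so it may go anywhere:
print(rp.can_fetch("ExampleBot", "http://example.com/nobots/page.html"))
```

A well-behaved crawler calls `can_fetch()` before every download and simply skips URLs for which it returns `False`.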

There is a second way to block Googlebot: the robots metatag. Its name attribute is "robots" and its content attribute holds values separated by commas: "index" or "noindex" (the document may or may not be indexed) and "follow" or "nofollow" (hyperlinks should or should not be followed). To allow both indexing of the document and following of its hyperlinks, the metatag looks like this:

<meta name="Robots" content="index, follow" />
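Interpreting that content attribute is a one-liner-level task. The helper below is my own illustration (not part of Xearchbot); it treats a value as permissive unless the corresponding "no" token is present, matching the convention described above:

```python
def parse_robots_meta(content):
    """Interpret the content attribute of a robots metatag.
    Returns (may_index, may_follow); both default to True
    unless "noindex" / "nofollow" appears among the tokens."""
    tokens = {t.strip().lower() for t in content.split(",")}
    may_index = "noindex" not in tokens
    may_follow = "nofollow" not in tokens
    return may_index, may_follow

print(parse_robots_meta("index, follow"))
print(parse_robots_meta("noindex, follow"))
```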

Blocking a single link from being followed is also supported: add rel="nofollow" to the anchor tag. Malware robots and antivirus robots ignore robots.txt, metatags and rel="nofollow", but our bot will be a well-behaved searchbot and must obey all of these blocking mechanisms.
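Honouring rel="nofollow" means checking each anchor's rel attribute while extracting links. A hedged Python sketch (again using the standard-library parser, not the article's code):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags, skipping rel="nofollow" links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if "nofollow" in rel:
            return  # the site asked bots not to follow this link
        if "href" in a:
            self.links.append(a["href"])

c = LinkCollector()
c.feed('<a href="/ok.html">ok</a> <a rel="nofollow" href="/no.html">no</a>')
print(c.links)
```

Only `/ok.html` ends up in the list of links queued for crawling; the nofollow link is dropped.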

Read more: Codeproject

