Search Engine Robots or Web Crawlers
By: Susmita
Most users rely on search engines to find the information they need. But how do search engines provide this information, and where do they collect it from? Most search engines maintain their own database of information. This database lists the sites available on the web and, for each site, holds details of its individual pages. Search engines do this background work with robots, which collect the information and keep the database up to date. They catalogue the gathered information and then present it publicly or, at times, for private use.
In this article we will discuss the entities that roam the global internet environment: the web crawlers that move around netspace. We will learn:
· What they are and what purpose they serve.
· The pros and cons of using these entities.
· How we can keep our pages away from crawlers.
· The differences between common crawlers and robots.
The discussion below is divided into two sections:
I. Search Engine Spider : Robots.txt.
II. Search Engine Robots : Meta-tags Explained.
I. Search Engine Spider : Robots.txt
What is a robots.txt file?
A web robot is a program, typically search engine software, that visits sites regularly and automatically. It crawls through the web’s hypertext structure by fetching a document and then recursively retrieving all the documents it references. Sometimes site owners do not want all of their pages to be crawled by web robots, so they exclude some pages from crawling through a standard mechanism. Most robots abide by the ‘Robots Exclusion Standard’, a set of constraints that restricts robot behavior.
The ‘Robots Exclusion Standard’ is a protocol used by site administrators to control the movement of robots. When a search engine robot comes to a site, it looks for a file named robots.txt in the root of the domain (http://www.anydomain.com/robots.txt). This is a plain text file that implements the ‘Robots Exclusion Protocol’ by allowing or disallowing specific files and directories. A site administrator can disallow access to cgi, temporary or private directories, and can target individual robots by specifying their user agent names.
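As a rough illustration of where a robot looks, the sketch below derives the robots.txt address from any page URL on the same site. It uses only Python's standard urllib.parse module; the function name robots_txt_url and the sample page path are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt location for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.anydomain.com/links/listing.html"))
```

Whatever page the robot starts from, the file it consults is always the one at the root of that host.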
The format of the robots.txt file is very simple. Each record consists of two kinds of field: a User-agent field and one or more Disallow fields.
What is User-agent?
This is the technical name for a client program in the World Wide Web environment. In the robots.txt file it names the specific search engine robot that a record applies to.
For example :
User-agent: googlebot
We can also use the wildcard character “*” to specify all robots :
User-agent: *
This means the record applies to every robot that visits.
What is Disallow?
The second field in the robots.txt file is Disallow. These lines tell robots which files must not be crawled. For example, to prevent robots from downloading email.htm the syntax is:
Disallow: /email.htm
To prevent crawling through a directory the syntax is:
Disallow: /cgi-bin/
White Space and Comments :
Any line in the robots.txt file that begins with # is treated as a comment. A comment at the top of the file, as in the following example, is a convenient place to note which site the file belongs to.
# robots.txt for www.anydomain.com
Entry Details for robots.txt :
1) User-agent: *
Disallow:
The asterisk (*) in the User-agent field denotes “all robots”. As nothing is disallowed, all robots are free to crawl the whole site.
2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
All robots are allowed to crawl all files except those in the cgi-bin, temp and private directories.
3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl any part of the site. “/” stands for all directories.
4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
The blank line marks the start of a new User-agent record. Dangerbot is barred from the whole site, while all other bots may crawl every directory except “temp”.
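One way to check how a two-record file like entry 4 is read is to feed it to Python's standard urllib.robotparser module. This is only a sketch: the robot name otherbot and the page paths are invented for the test, and a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Entry 4 from above: dangerbot is shut out, everyone else avoids /temp/.
RULES = """\
User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
"""


def build_parser(text):
    """Parse robots.txt text held in memory instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)
# dangerbot may not fetch anything.
print(parser.can_fetch("dangerbot", "http://www.anydomain.com/index.html"))
# Any other robot may fetch ordinary pages...
print(parser.can_fetch("otherbot", "http://www.anydomain.com/index.html"))
# ...but not pages under /temp/.
print(parser.can_fetch("otherbot", "http://www.anydomain.com/temp/a.html"))
```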
5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html
Dangerbot is not allowed to fetch the listing page in the links directory. All other robots are allowed into every directory but may not download the email.html page.
6) User-agent: abcbot
Disallow: /*.gif$
To block all files of a specific type (e.g. .gif) we use the above robots.txt entry.
7) User-agent: abcbot
Disallow: /*?
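To stop a web crawler from crawling dynamic pages we use the above robots.txt entry.
Note: The Disallow field may contain “*” to match any sequence of characters and may end with “$” to mark the end of the name. These wildcards are an extension honoured by major crawlers such as Googlebot; they were not part of the original standard, so smaller robots may ignore them.
E.g.: to exclude all gif files from Google's image crawler while still allowing other image files: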
User-agent: Googlebot-Image
Disallow: /*.gif$
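The wildcard behaviour described in the note can be sketched as a small matcher. This is a hand-written illustration of the “*” and “$” rules, not any crawler's actual implementation; the function name rule_matches and the sample paths are assumptions for the example.

```python
import re


def rule_matches(pattern, path):
    """Return True if a Disallow pattern with * and $ applies to path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # "*" stands for any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # Rules are matched from the start of the path.
    return re.match(regex, path) is not None


print(rule_matches("/*.gif$", "/images/logo.gif"))      # gif file: blocked
print(rule_matches("/*.gif$", "/images/logo.gif?x=1"))  # does not end in .gif
print(rule_matches("/*?", "/page.php?id=3"))            # dynamic page: blocked
print(rule_matches("/*?", "/page.php"))                 # static page: allowed
```

Without the trailing “$” a pattern behaves as a prefix, which is why “/*?” catches every URL that contains a query string.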
Disadvantages of robots.txt :
Problem with Disallow field:
Disallow: /css/ /cgi-bin/ /images/
Different spiders will read the above field in different ways. Some will ignore the spaces and read /css//cgi-bin//images/, while others may consider only /images/ or /css/ and ignore the rest.
The correct syntax should be :
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
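A file with this mistake can be repaired mechanically. The helper below is a minimal sketch (the name split_disallow is made up here) that rewrites one Disallow line listing several paths as one line per path.

```python
def split_disallow(line):
    """Rewrite one Disallow line with several paths as one line per path."""
    field, _, value = line.partition(":")
    paths = value.split()
    if len(paths) <= 1:
        # Nothing to split: an empty or single-path line is already valid.
        return [line.strip()]
    return [f"{field.strip()}: {path}" for path in paths]


for fixed in split_disallow("Disallow: /css/ /cgi-bin/ /images/"):
    print(fixed)
```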
Listing all files:
Specifying each and every file name within a directory is the most common mistake:
Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html
Above portion can be written as:
Disallow: /ab/
Disallow: /op/
The trailing slash matters: it means the whole directory is off limits.
Capitalization:
USER-AGENT: REDBOT
DISALLOW:
Field names are not case sensitive, but data such as directory and file names are case sensitive.
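The difference can be seen with Python's standard urllib.robotparser, used here purely as one example of a parser. The path /Private/ is a hypothetical directory added for the demonstration, since the entry above disallows nothing.

```python
from urllib.robotparser import RobotFileParser

case_rules = RobotFileParser()
case_rules.parse([
    "USER-AGENT: REDBOT",
    "DISALLOW: /Private/",  # hypothetical directory, for illustration only
])

# Upper-case field names and robot name are still recognised...
print(case_rules.can_fetch("redbot", "/Private/notes.html"))
# ...but the path is compared exactly, so a different case is not blocked.
print(case_rules.can_fetch("redbot", "/private/notes.html"))
```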
Conflicting syntax:
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
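What will happen? Does Redbot's permission to crawl everything override the blanket Disallow, or does the Disallow override the permission? Under the Robots Exclusion Standard a robot obeys the record that names it and falls back to the “*” record only when no record matches its name, so a compliant Redbot may crawl everything while every other robot is shut out.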
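The conflicting file can be tried out with Python's standard urllib.robotparser. This shows how that one parser resolves the two records; the robot name otherbot and the page URL are assumptions for the example, and other robots are not guaranteed to behave the same way.

```python
from urllib.robotparser import RobotFileParser

CONFLICT = """\
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
"""

conflict_rules = RobotFileParser()
conflict_rules.parse(CONFLICT.splitlines())

page = "http://www.anydomain.com/index.html"
# The record that names Redbot is used for Redbot.
print(conflict_rules.can_fetch("Redbot", page))
# Robots without a record of their own fall back to the * record.
print(conflict_rules.can_fetch("otherbot", page))
```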
II. Search Engine Robots: Meta-tag Explained:
What is the robots meta tag?
Besides robots.txt, there is another tool for controlling how search engines crawl web pages. This is the robots META tag, which tells a web spider whether to index a page and follow the links on it. It can be more helpful in some cases because it works on a page-by-page basis. It is also helpful in case you do not have the permission needed to access the server’s root directory and edit the robots.txt file.
Format of the Robots Meta tag :
In the HTML document the tag is placed in the HEAD section:
<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to…….">
<title>……………</title>
</head>
<body>
Robots Meta Tag options :
There are four options that can be used in the CONTENT attribute of the robots meta tag: index, noindex, follow and nofollow.
With “index,follow”, the tag allows search engine robots to index the page and follow all the links on it. If the site admin does not want a page to be indexed or its links to be followed, they can replace “index,follow” with “noindex,nofollow”.
Depending on the requirements, the site admin can use the robots meta tag with the following options:
<META NAME="robots" CONTENT="index,follow"> Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> Don’t index this page, but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> Index this page, but don’t follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> Don’t index this page, don’t follow links from this page.
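To see how a robot might read these options, here is a minimal sketch that extracts the robots meta tag with Python's standard html.parser module. The class and function names are invented for this example, and it assumes a page with no robots meta tag is treated as “index,follow”.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every robots meta tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for token in content.split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def robots_directives(html):
    """Return (index, follow) flags; both are True unless switched off."""
    parser = RobotsMetaParser()
    parser.feed(html)
    index = "noindex" not in parser.directives
    follow = "nofollow" not in parser.directives
    return index, follow


page = '<html><head><META NAME="robots" CONTENT="noindex,follow"></head><body></body></html>'
print(robots_directives(page))
```

For the sample page the function reports that the page should not be indexed but its links may be followed.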
Article keywords: SEO, SEM, Search Engine Optimization, Dynamic Page, robots, robots.txt
Article Source: http://www.articles2k.com
Susmita loves researching web marketing and SEO-related issues. She started her blog to share and gather knowledge as well as to share her views on different aspects of life.