On the first access to a site the file /robots.txt will be retrieved, if it exists. Settings there will be respected: any encountered URL that is disallowed by robots.txt will be discarded. Meta robots directives are also respected for each page retrieved. See http://www.robotstxt.org/wc/exclusion.html for the robots.txt and meta robots standards.
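For illustration, here is a minimal Python sketch of this standard robots.txt behavior using the urllib.robotparser module from the standard library. It is not the appliance's own implementation; the example.com URLs are placeholders, and the robot name passed to can_fetch is simply the crawler's User-agent name (for the appliance it is ThunderstoneSA, described below).

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt on first access.
rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Any URL disallowed for this robot name would be discarded.
url = "http://www.example.com/text/page1.html"
if rp.can_fetch("ThunderstoneSA", url):
    print("allowed:", url)
else:
    print("disallowed by robots.txt:", url)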
If there are any HTML trees that you don't want indexed, you may want to set up a robots.txt file, add meta robots tags within the HTML pages, or use the various exclusion options of the Parametric Search Appliance. For example: if you had a "text only" version of your web server that duplicated the content of your normal server, you would not want to index it. (On the other hand, if most of your meaningful text is contained in graphics, Java, or JavaScript, you may want to walk the text tree instead of the normal one, since graphics and Java are not searchable.)
Suppose your "text only" pages were all under a directory called /text. The simplest way to prevent traversal of that tree would be to use an exclusion or an exclusion prefix.
The exclusion would look something like this:
/text/
The exclusion prefix would look something like this:
http://www.example.com/text/
Either setting will prevent retrieval of any pages under the /text tree.
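As a rough sketch (assuming a plain string-prefix match on the full URL; the appliance's actual matching is configured through its web interface and may be richer), an exclusion prefix behaves like this:

excluded_prefixes = ["http://www.example.com/text/"]

def is_excluded(url):
    # A URL is skipped if it begins with any configured prefix.
    return any(url.startswith(p) for p in excluded_prefixes)

print(is_excluded("http://www.example.com/text/faq.html"))   # True, under /text
print(is_excluded("http://www.example.com/index.html"))      # False, will be retrieved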
This does not prevent other Web robots from retrieving the /text tree. To set up a permanent global exclusion list you need to create a file called robots.txt in your document root directory. The format of that file is as follows:
User-agent: *
Disallow: /text
Here * is the name of the robot to block; * means any robot not specifically named (all robots in this case, since no others are named). You could instead specify the name of a particular robot; for the Parametric Search Appliance it would be ThunderstoneSA.
You may specify several "Disallow" lines for any given robot (see below). Each "Disallow" is a simple path prefix; wildcards are not allowed.
You may also specify different "Disallow" sets for different robots: simply insert a blank line and add another "User-agent" line followed by its "Disallow" lines.
Here's a larger example:
User-agent: *
Disallow: /text
Disallow: /junk
User-agent: ThunderstoneSA
Disallow: /text
Disallow: /thunderstonesa
User-agent: Scooter
Disallow: /text
Disallow: /junk
Disallow: /big
The Scooter robot will be blocked from accessing any pages under the /text, /junk, and /big trees. The Parametric Search Appliance will be blocked from accessing any pages under /text and /thunderstonesa. All other robots will be blocked from accessing pages under /text and /junk.
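To see those rules in action, here is a hedged Python sketch that feeds the example above to the standard urllib.robotparser module and checks a few placeholder example.com URLs for different robot names; the group a robot falls into is decided by the "User-agent" lines, with * as the fallback.

from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /text
Disallow: /junk

User-agent: ThunderstoneSA
Disallow: /text
Disallow: /thunderstonesa

User-agent: Scooter
Disallow: /text
Disallow: /junk
Disallow: /big
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Scooter is blocked from /big; ThunderstoneSA is not blocked from /junk;
# unnamed robots fall back to the * group and are blocked from /junk.
print(rp.can_fetch("Scooter", "http://www.example.com/big/data.html"))       # False
print(rp.can_fetch("ThunderstoneSA", "http://www.example.com/junk/a.html"))  # True
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/junk/a.html"))    # False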
Use of robots.txt is not enforced in any way; robots may or may not use it. The Parametric Search Appliance will, by default, always look for it and use it if present. This may be disabled by turning off robots.txt under the Robots setting. When using robots.txt you may still use "Exclusions" for manual exclusion.
Meta robots provides another method of controlling robots such as the Parametric Search Appliance. Any HTML page may contain a meta tag of the following form in its source:
<meta name="robots" content="WHAT-TO-DO">
WHAT-TO-DO may contain any of the following keywords. Multiple keywords may be used by placing a comma (,) between them.
Keyword  | Meaning
INDEX    | Index the text of this page
NOINDEX  | Don't index the text of this page
FOLLOW   | Follow hyperlinks on this page
NOFOLLOW | Don't follow hyperlinks on this page
ALL      | Synonym for INDEX,FOLLOW
NONE     | Synonym for NOINDEX,NOFOLLOW
Like robots.txt, this is not enforced in any way; robots may or may not use it. The Parametric Search Appliance always indexes and follows hyperlinks by default, so it only looks for NOINDEX, NOFOLLOW, and NONE.
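As a hedged illustration of how those keywords might be read (not the appliance's internal code), this Python sketch uses the standard html.parser module and applies the same defaults: index and follow unless NOINDEX, NOFOLLOW, or NONE appears.

from html.parser import HTMLParser

class MetaRobots(HTMLParser):
    # Default behavior: index the page and follow its links.
    def __init__(self):
        super().__init__()
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            keywords = {k.strip().upper() for k in a.get("content", "").split(",")}
            if "NOINDEX" in keywords or "NONE" in keywords:
                self.index = False
            if "NOFOLLOW" in keywords or "NONE" in keywords:
                self.follow = False

p = MetaRobots()
p.feed('<html><head><meta name="robots" content="NOINDEX,FOLLOW"></head></html>')
print(p.index, p.follow)   # False True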