Syntax: select Yes or No buttons
robots.txt
With this set to Yes, the Parametric Search Appliance will initially get /robots.txt from any site being indexed and respect its directives for what prefixes to ignore. Turning this setting off is not generally recommended. Supported directives in robots.txt include User-agent:, Disallow:, Allow:, Sitemap:, and Crawl-delay:.
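For illustration, a robots.txt file exercising each of the supported directives might look like the following (the host and paths are hypothetical):

    User-agent: *
    Disallow: /private/
    Allow: /private/summary.html
    Crawl-delay: 5

    Sitemap: http://www.example.com/sitemap.xml

With a file like this in place, the walk would skip URLs whose paths begin with /private/ (except the explicitly allowed summary.html), pause between fetches per the Crawl-delay directive (conventionally a delay in seconds between requests), and also crawl the listed sitemap.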
Any Sitemap: links in robots.txt will be crawled as well, subject to normal exclusion settings. Sitemaps not listed in robots.txt may be added via the Base URL(s) or Sitemap URL(s) settings.
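A Sitemap: directive points at a sitemap file in the standard sitemaps.org XML format; a minimal example (with a hypothetical URL) is:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/products.html</loc>
      </url>
    </urlset>

Each <loc> URL found in the sitemap becomes a candidate for crawling, still subject to the normal exclusion settings noted above.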
Meta
Respect the meta tag called robots. With this set to Yes, the Parametric Search Appliance will process and respect the robot control information within each retrieved HTML page.
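For example, a page can opt out of indexing and link-following with a robots meta tag in its <head>; the noindex and nofollow values shown here are the usual convention, though this section does not enumerate which values the appliance recognizes:

    <head>
      <meta name="robots" content="noindex, nofollow">
    </head>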
Placeholder
Whether to still put an (empty) entry - a placeholder - in the html search table for URLs that are excluded via <meta name="robots"> tags. Leaving a placeholder improves refresh crawls, as the URL can then have its own individual refresh time like any other stored URL. Without a placeholder, the URL would be fetched every time a link to it is found, because no record that it had recently been fetched would be stored.
The downside to placeholders is that if the URL is also being searched in queries - i.e. Url is part of Index Fields - then the excluded URL might be found in results. Placeholders have empty text fields (e.g. no body, meta, etc.) to avoid matches on text, but the URL field must remain.
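As a purely illustrative sketch (the actual html table schema is not described in this section, so the field names below are assumptions), a placeholder entry might look like:

    Url:   http://www.example.com/private/report.html
    Title: (empty)
    Body:  (empty)
    Meta:  (empty)

A query for "report" could still match this entry through the Url field when Url is among the Index Fields, which is exactly the downside described above.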
See also Robots.txt, section 6.6.