A group of settings that control whether and how to split anytotx plugin output into multiple sub-URLs in the table.
Non-text files, such as PDFs, that anytotx processes are often very large or composed of sub-files. The Plugin Split settings allow these files to be split up for finer-grained searching. Splitting a file causes more than one URL to be entered in the html table (and thus also in potential search results) for the original URL. The subsequent URLs have an anchor appended to distinguish them from each other; usually this is the sub-file name, but it may be generic, e.g. "#part5", if there are no sub-files.
Note: adjusting any of these settings can affect the ability of Refresh-type rewalks to complete successfully (New walks operate as usual).
Note: Data from Field and other crawl processing is not currently performed on Plugin Split URLs.
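The anchor naming can be pictured with a minimal sketch. The function name `sub_urls` and its parameters are hypothetical, not part of the product. The sketch assumes generic parts are numbered from 1 and that every part gets an anchor; whether the first part keeps the bare original URL is not stated here.

```python
def sub_urls(base_url, subfile_names=None, part_count=0):
    """Illustrative only: build the distinct URLs for one split file."""
    if subfile_names:
        # Usual case: the anchor is the sub-file name.
        return [f"{base_url}#{name}" for name in subfile_names]
    # No sub-files: fall back to generic anchors (numbering from 1 is assumed).
    return [f"{base_url}#part{i}" for i in range(1, part_count + 1)]


print(sub_urls("http://example.com/docs.zip", ["a.txt", "b.txt"]))
print(sub_urls("http://example.com/big.pdf", part_count=3))
```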
Depth
The Depth setting controls at what depth to split anytotx output. Each time a multi-file archive is unpacked by anytotx, the depth increases. (Note that the depth does not increase with any subdir(s) that may be created by each unpacking.) Depth 0 (the default) means split at the top level (i.e. do not split). Depth 1 would therefore insert each file of a ZIP file as a separate URL in the table. Files deeper than the Depth setting are left merged; e.g. another ZIP file contained within a ZIP file would have its files' text remain merged at Depth 1.
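The effect of Depth can be sketched on a toy model. Here an archive is a dict mapping entry names to either text (a plain file) or another dict (a nested archive). The names `split_at_depth` and `merged_text`, the "/" path separator and the newline used when merging are all assumptions made for illustration; anytotx's actual behavior may differ in those details.

```python
def merged_text(node):
    """Concatenate the text of a file or of everything inside an archive."""
    if isinstance(node, str):
        return node
    return "\n".join(merged_text(child) for child in node.values())


def split_at_depth(node, depth, name=""):
    """Return (anchor, text) pairs; entries deeper than `depth` stay merged."""
    if depth == 0 or isinstance(node, str):
        return [(name, merged_text(node))]
    parts = []
    for child_name, child in node.items():
        path = f"{name}/{child_name}" if name else child_name
        # Unpacking one archive level uses up one level of depth.
        parts.extend(split_at_depth(child, depth - 1, path))
    return parts


archive = {"a.txt": "A", "inner.zip": {"b.txt": "B", "c.txt": "C"}}
print(split_at_depth(archive, 0))  # one merged entry
print(split_at_depth(archive, 1))  # inner.zip stays merged
print(split_at_depth(archive, 2))  # inner.zip is split too
```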
Bytes
The Bytes setting controls how many bytes each part will be after the file has been split. The default of 0 indicates do not split. This is useful for large monolithic files that have no detectable sub-file or page structure. If both Pages and Bytes are set, the first limit reached is used for each part.
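As a rough illustration of size-based splitting, the sketch below cuts a byte string into fixed-size parts. The function name `split_by_bytes` is hypothetical, and the sketch deliberately ignores the page-break handling described under AtPage, so real part sizes may differ.

```python
def split_by_bytes(data: bytes, limit: int) -> list[bytes]:
    """Cut data into parts of at most `limit` bytes; 0 means do not split."""
    if limit <= 0:
        return [data]
    parts = [data[i:i + limit] for i in range(0, len(data), limit)]
    return parts or [data]  # keep a single (empty) part for empty input


print(split_by_bytes(b"abcdefghij", 4))
```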
AtPage
The AtPage setting controls whether to force the Bytes-controlled splitting to occur at a page boundary (a Ctrl-L). Checking this may make each part arbitrarily larger than the Bytes setting, because a part may extend to the next page break. With this setting unchecked, a part may be up to 50% larger than the Bytes setting, because the page-break check will only go that far over the limit.
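The two AtPage behaviors can be compared with a small sketch. The function name `split_with_page_check` is hypothetical. Two details are assumptions not stated above: the form feed is kept at the end of the part it closes, and when no page break is found within the 50% window the part is cut exactly at the Bytes limit.

```python
FORM_FEED = b"\x0c"  # Ctrl-L, the page-break character


def split_with_page_check(data: bytes, limit: int, at_page: bool) -> list[bytes]:
    """Split by bytes, extending each part to a page break where allowed."""
    if limit <= 0:
        return [data]
    parts, start = [], 0
    while len(data) - start > limit:
        # AtPage checked: search to the end of the data.
        # AtPage unchecked: search only up to 50% past the limit.
        window_end = len(data) if at_page else start + limit + limit // 2
        ff = data.find(FORM_FEED, start + limit - 1, window_end)
        if ff != -1:
            end = ff + 1           # end the part just after the page break
        elif at_page:
            end = len(data)        # no further page break: the rest is one part
        else:
            end = start + limit    # assumed fallback: hard cut at the limit
        parts.append(data[start:end])
        start = end
    if start < len(data):
        parts.append(data[start:])
    return parts


sample = b"aaaa\x0cbbbbbbbb\x0ccc"
print(split_with_page_check(sample, 6, at_page=True))
print(split_with_page_check(sample, 6, at_page=False))
```

With AtPage checked, the first part in this sample runs to 14 bytes against a limit of 6 because the next usable page break is that far away; unchecked, no part exceeds 9 bytes.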
Pages
The Pages setting controls how many pages to group in a part. The default of 0 means do not split by pages. If both Pages and Bytes are set, the first limit reached is used for each part. For example, setting Pages to 10 and Bytes to 100000 would break at 10 pages or 100,000 bytes (100 KB), whichever comes first. This is useful to catch page-bounded documents like PDFs, and simultaneously avoid generating huge text for non-paged documents.
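The "whichever comes first" rule can be sketched as follows. The function name `split_pages_or_bytes` is hypothetical; the sketch assumes pages are delimited by form feeds (Ctrl-L) and leaves out the AtPage adjustment, so it shows only how the two limits interact.

```python
FORM_FEED = b"\x0c"  # Ctrl-L, assumed page delimiter


def split_pages_or_bytes(data: bytes, pages: int, byte_limit: int) -> list[bytes]:
    """End each part at `pages` pages or `byte_limit` bytes, whichever is first."""
    if pages <= 0 and byte_limit <= 0:
        return [data]
    parts, start = [], 0
    while start < len(data):
        end = len(data)
        if pages > 0:
            # Find the position just past the `pages`-th page break.
            pos = start
            for _ in range(pages):
                ff = data.find(FORM_FEED, pos)
                if ff == -1:
                    pos = len(data)
                    break
                pos = ff + 1
            end = min(end, pos)
        if byte_limit > 0:
            end = min(end, start + byte_limit)
        parts.append(data[start:end])
        start = end
    return parts


doc = b"p1\x0cp2\x0cp3\x0c" + b"x" * 10
print(split_pages_or_bytes(doc, pages=2, byte_limit=8))
```

In this sample the first part ends on the page limit (two pages in 6 bytes), while the later parts end on the byte limit because the unpaged tail has no page breaks.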