afs_web_crawl - AFS - Reference Guides

AFS Filters Description

Product: AFS
Platform: 7.12
Category: Reference Guides
Language: English

This filter crawls documents over the HTTP, HTTPS, and FTP protocols.

The filter is declared with the afs_web_crawl type. It is in the antidot-paf-crawl package. It is a processor filter.

Only one instance of this filter can exist at any given moment. The filter does not read the "instances" parameter in the configuration.

The Web Crawler filter specifications are described in the following table:

| Parameter name | Mandatory | Type | Default | Description |
|---|---|---|---|---|
| user-agent | Yes | string | N/A | User-Agent to use in HTTP headers. It is recommended to choose a different User-Agent for each crawl, and to ensure that visited web sites can contact you if they need to. |
| num_threads | No | integer | 0 | Number of threads to use for crawling. If set to 0 (the default), the number of threads used is 10 times the number of CPU cores. |
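The num_threads default rule can be illustrated with a short Python sketch. This is not the filter's own code: the function name `effective_thread_count` and its `cpu_cores` argument are hypothetical, and the sketch assumes the core count is the one reported by the operating system.

```python
import os
from typing import Optional


def effective_thread_count(num_threads: int = 0,
                           cpu_cores: Optional[int] = None) -> int:
    """Return the number of crawl threads implied by a num_threads setting.

    Hypothetical helper mirroring the documented rule:
    0 (the default) means 10 times the number of CPU cores.
    """
    if num_threads != 0:
        # An explicit value is used as given.
        return num_threads
    if cpu_cores is None:
        # Assumption: use the core count reported by the OS, falling back to 1.
        cpu_cores = os.cpu_count() or 1
    return 10 * cpu_cores


if __name__ == "__main__":
    print(effective_thread_count(0, cpu_cores=8))
    print(effective_thread_count(4, cpu_cores=8))
```

Under this rule, a machine with 8 cores and num_threads left at 0 would crawl with 80 threads, so setting an explicit value may be preferable when the target sites are small.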

This filter crawls (fetches) documents using web protocols while respecting netiquette. The Document Manager is used as a local cache in order to minimize bandwidth usage. robots.txt files, sitemaps, and RSS and Atom feeds are handled automatically. Full HTTP reply and response headers are stored in the HTTP layer.
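To make the robots.txt part of this behaviour concrete, the following Python sketch shows how a crawler can decide whether a URL may be fetched for a given User-Agent. It uses the Python standard library's `urllib.robotparser`, not the filter's internal implementation, which is not documented here. The User-Agent string, the sample robots.txt content, and the helper names `build_robots_parser` and `may_fetch` are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent: distinct per crawl and carrying a contact address.
USER_AGENT = "example-crawl/1.0 (+mailto:crawler-admin@example.org)"

# Sample robots.txt content, parsed locally so the sketch needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str,
              user_agent: str = USER_AGENT) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = build_robots_parser(SAMPLE_ROBOTS_TXT)
    for candidate in ("http://www.example.org/index.html",
                      "http://www.example.org/private/report.html"):
        print(candidate, may_fetch(robots, candidate))
```

With afs_web_crawl this check is automatic; the sketch only shows the kind of decision made before each fetch.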