Crawl Specificities - AIF

AIF Crawl

Product: AIF
Category: Technical Notes
Language: English
Audience: public

Syndication feeds:

The Antidot crawler natively parses RSS and Atom syndication feeds and extracts the URLs they contain.
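As an illustration of what extracting URLs from a feed involves, here is a minimal Python sketch. It is not the Antidot crawler's implementation: the function name extract_feed_urls is hypothetical, and the sketch assumes only RSS 2.0 and Atom 1.0 documents (RSS 1.0/RDF feeds are not handled).

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 elements (RFC 4287).
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_feed_urls(feed_xml):
    """Return the item/entry links of an RSS 2.0 or Atom 1.0 feed.

    Hypothetical helper written for illustration only.
    """
    root = ET.fromstring(feed_xml)
    urls = []
    if root.tag == ATOM_NS + "feed":
        # Atom: each <entry> carries one or more <link href="..."/> elements.
        for entry in root.iter(ATOM_NS + "entry"):
            for link in entry.findall(ATOM_NS + "link"):
                href = link.get("href")
                # A missing rel attribute means rel="alternate".
                if href and link.get("rel", "alternate") == "alternate":
                    urls.append(href)
    else:
        # RSS 2.0: <rss><channel><item><link>URL</link></item>...
        for item in root.iter("item"):
            link = item.findtext("link")
            if link and link.strip():
                urls.append(link.strip())
    return urls
```

Only links attached to individual items or entries are returned; the feed-level link, which usually points at the site home page, is ignored in this sketch.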

Sitemaps:

Like syndication feeds, sitemaps are taken into account automatically, even when they are stored on the crawled site as an archive (in which case the archive is unzipped on the fly).

When a sitemap is detected, the URLs it contains are sent to the afs_web_crawl filter as new URLs to crawl.
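The sitemap handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the crawler's actual code: the function name extract_sitemap_urls is hypothetical, the archive is assumed to be gzip-compressed (the usual sitemap.xml.gz case), and the sitemap is assumed to follow the sitemaps.org 0.9 schema.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def extract_sitemap_urls(payload):
    """Return the <loc> URLs of a sitemap given as raw bytes.

    Hypothetical helper: a gzip-compressed payload is decompressed
    in memory before parsing.
    """
    if payload[:2] == GZIP_MAGIC:
        payload = gzip.decompress(payload)
    root = ET.fromstring(payload)
    # <loc> appears under <url> in a urlset and under <sitemap> in a
    # sitemap index; both kinds are collected here.
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text and loc.text.strip()
    ]
```

In a sitemap index, the returned locations are themselves sitemaps, so a real crawler would fetch and parse them in turn rather than queue them as pages.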

This behaviour can, however, be disabled for a given site with the afs:disableSitemaps option (<afs:disableSitemaps value="true">).

HTTPS websites:

The configuration of an HTTPS website must be stored in the dedicated directory structure under the conf/ directory (for example, $AFS7/conf/perimeter/https/www.mysite.com/conf.xml).
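To make the layout concrete, the following Python sketch derives such a path from a site URL. It generalizes from the single example above and rests on assumptions: the directory is taken to be named after the site's host name, the function name https_conf_path is hypothetical, and the installation root (the value of $AFS7) is passed in explicitly.

```python
import posixpath
from urllib.parse import urlsplit


def https_conf_path(site_url, afs_root):
    """Build the expected conf.xml path for an HTTPS site.

    Hypothetical helper: assumes the per-site directory is the host
    name, as in the www.mysite.com example.
    """
    parts = urlsplit(site_url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("not an HTTPS URL: %r" % site_url)
    return posixpath.join(
        afs_root, "conf", "perimeter", "https", parts.hostname, "conf.xml"
    )
```

For instance, https_conf_path("https://www.mysite.com/index.html", "/opt/afs7") yields /opt/afs7/conf/perimeter/https/www.mysite.com/conf.xml, where /opt/afs7 is only a placeholder for the real installation root.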

This kind of website requires authentication information, which can be set in the crawl settings. The crawler checks these authentication settings before performing any other action. Once authenticated, it works as it does for HTTP sites.