Crawl Principles - AIF

Product: AIF
AFS_Version: 7.7
Category: Technical Notes
Language: English
Audience: public

The basic principle of crawling is to consider a web page, extract the links it contains, follow those links to discover new pages, extract their links in turn, and so on.
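
As a sketch, this loop can be expressed as a breadth-first traversal over a frontier of URLs. The following minimal Python illustration is given for clarity only and is not AIF's actual implementation; the function name and the use of the requests and BeautifulSoup libraries are assumptions:

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_url, max_pages=100):
        """Fetch pages breadth-first, extracting links to find new pages."""
        frontier = deque([seed_url])   # URLs waiting to be fetched
        seen = {seed_url}              # never queue the same URL twice
        fetched = 0

        while frontier and fetched < max_pages:
            url = frontier.popleft()
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue               # skip unreachable pages
            fetched += 1

            # Extract the links of the page and queue the new ones.
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)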

Without restrictions, crawling is obviously unbounded and can therefore run forever. To avoid this, several restrictions must be set (they are combined in the sketch after the list):

  • Force the crawl process to stay within the site(s) it is currently crawling
  • Enforce a maximum depth (the number of links needed to reach a given page)
  • Forbid some kinds of URLs (for instance, do not follow the .../en/... pages, which are in English)
  • Check the robots.txt files to know which areas the crawler may or may not crawl
  • ...
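
The sketch below shows how these restrictions could be layered on top of the crawl loop. It is again an assumption rather than AIF's actual code: the helper names, the allowed-host set, and the URL pattern are hypothetical, while the robots.txt handling relies on Python's standard urllib.robotparser:

    import re
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    ALLOWED_HOSTS = {"www.example.com"}   # force the crawl to stay on the site(s)
    MAX_DEPTH = 5                         # maximum number of links to reach a page
    FORBIDDEN = re.compile(r"/en/")       # e.g. skip the English pages

    _robots_cache = {}

    def robots_allow(url, user_agent="aif-crawler"):
        """Check the site's robots.txt, caching one parser per host."""
        parts = urlparse(url)
        base = f"{parts.scheme}://{parts.netloc}"
        parser = _robots_cache.get(base)
        if parser is None:
            parser = RobotFileParser(base + "/robots.txt")
            try:
                parser.read()
            except OSError:
                pass  # on network failure the parser denies everything by default
            _robots_cache[base] = parser
        return parser.can_fetch(user_agent, url)

    def should_follow(url, depth):
        """Apply the restrictions listed above to a candidate link."""
        if urlparse(url).netloc not in ALLOWED_HOSTS:
            return False                  # outside the crawled site(s)
        if depth > MAX_DEPTH:
            return False                  # too deep relative to the seed page
        if FORBIDDEN.search(url):
            return False                  # forbidden URL pattern
        return robots_allow(url)          # obey robots.txt

In practice the frontier would carry (url, depth) pairs so that should_follow can filter each extracted link before it is queued.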