afs_html_scrape - AFS - Reference Guides

AFS Filters Description

Product: AFS
Platform: 7.11
Category: Reference Guides
Language: English

Extracts useful text from HTML pages using machine learning algorithms. It works well on news articles and is fast: less than 40 ms per news document.

The filter is declared with the afs_html_scrape type. It is in the afs-html-scrape package. It is a processor filter.

The specifications of the HTML scrape filter are described in the following table:

| Parameter name | Mandatory | Type | Default | Description |
|---|---|---|---|---|
| input_layer | No | layer | CONTENTS | Input layer |
| output_layer | No | layer | CONTENTS | Output layer |
| output_format | No | string | html | Type of the output. Can be 'html', 'text' or 'highlight'. 'html': HTML content from which boilerplate content and scripts have been removed. 'text': text containing only the useful content. 'highlight': the HTML input with useful content highlighted in green and boilerplate content highlighted in red. |
| algorithm | No | string | boilerpipe | Two algorithms are implemented in this filter: 'bte' and 'boilerpipe'. Default is 'boilerpipe'. |
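
The defaults and allowed values listed above can be summarized in a small sketch. This is only an illustration of the parameter table: `resolve_params`, `DEFAULTS` and `ALLOWED` are hypothetical names, not part of AFS, and the sketch does not reflect how the filter itself is configured or how it validates its parameters.

```python
# Hypothetical helper, not part of AFS: it only mirrors the parameter table.
# Defaults as documented for the afs_html_scrape filter.
DEFAULTS = {
    "input_layer": "CONTENTS",
    "output_layer": "CONTENTS",
    "output_format": "html",
    "algorithm": "boilerpipe",
}

# Documented value sets for the two enumerated string parameters.
ALLOWED = {
    "output_format": {"html", "text", "highlight"},
    "algorithm": {"bte", "boilerpipe"},
}


def resolve_params(user_params=None):
    """Merge user-supplied parameters over the documented defaults."""
    params = dict(DEFAULTS)
    for name, value in (user_params or {}).items():
        if name not in DEFAULTS:
            raise ValueError(f"unknown parameter: {name}")
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(
                f"{name} must be one of {sorted(allowed)}, got {value!r}"
            )
        params[name] = value
    return params
```

Since no parameter is mandatory, calling `resolve_params()` with no argument returns the documented defaults, and `resolve_params({"output_format": "text"})` overrides only the output format.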

Input is HTML text. The output contains only the useful, readable text, as either HTML or plain text. Three modes are implemented:
  • text: the output is plain text containing only the useful textual content. Template and navigational content (also called boilerplate) is removed.
  • html: the output is the same HTML file with template and navigational content removed.
  • highlight: the output is again an HTML file, in which useful content is highlighted in green and template and navigational content is highlighted in red.
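
The difference between the three modes can be illustrated with a toy sketch. It assumes the page has already been split into blocks, each flagged as useful content or boilerplate. The flags are set by hand here; the classification step (done by 'bte' or 'boilerpipe' in the filter) is not shown. The `render` function and the exact markup it emits are assumptions made for illustration, not the filter's actual output.

```python
from html import escape


def render(blocks, output_format="html"):
    """Render pre-classified (text, is_content) blocks in one of the three modes."""
    if output_format == "text":
        # Plain text: keep only the useful blocks.
        return "\n".join(text for text, is_content in blocks if is_content)
    if output_format == "html":
        # HTML: keep only the useful blocks, as markup.
        return "\n".join(
            f"<p>{escape(text)}</p>" for text, is_content in blocks if is_content
        )
    if output_format == "highlight":
        # Highlight: keep every block, green for content, red for boilerplate.
        return "\n".join(
            '<p style="background-color: {}">{}</p>'.format(
                "green" if is_content else "red", escape(text)
            )
            for text, is_content in blocks
        )
    raise ValueError(f"unsupported output_format: {output_format!r}")


# Hand-labelled example blocks: navigation, article text, footer.
blocks = [
    ("Home | About | Contact", False),
    ("Storm hits the coast", True),
    ("Copyright 2020", False),
]
print(render(blocks, "text"))  # prints: Storm hits the coast
```

In this sketch, the 'text' and 'html' modes drop the navigation and footer blocks, while 'highlight' keeps all three blocks so that the classification can be inspected visually.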
Note: This filter is intended to be used as the first filter in a pipe. Received files cannot be processed. For more information, see PaF Design Patterns.