afs_word_freq_build - AFS - Reference Guides

AFS Filters Description

Product: AFS
Platform: 7.11
Category: Reference Guides
Language: English

The Word Frequency Builder

This filter is used together with the indexing filter when the language of the documents requires the compound splitter analyzer to split compound words. Supported languages are: DE, CS, HU, SV, DA, ES, FI, NL, TR.

The filter is declared with the afs_word_freq_build type. It is in the antidot-paf-tbd package. It is a visitor filter.

This filter can be instantiated only once at any given moment. It will not read the "instances" parameter in the configuration.

The Word Frequency Builder filter parameters are described in the following table:

| Parameter name | Mandatory | Type | Default | Description |
|---|---|---|---|---|
| input_layer | No | layer | CONTENTS | Layer read by the filter. |
| min_word_length | No | integer | 3 | Minimum length, in characters, of the words to include in the corpus. |
| affix_blacklist | No | list | Empty | List of prefixes or suffixes blacklisted by the compound splitter analyzer, i.e. words containing these affixes will not be split. Automatically initialized if the lang parameter is provided. If both parameters are set, the provided blacklist is added to the preloaded one. |
| lang | No | string | Empty | Language of the corpus. If provided, it automatically initializes the affix blacklist. |
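
The interaction between lang and affix_blacklist described in the table can be illustrated with a minimal Python sketch. This is not the filter's implementation: the PRELOADED_AFFIXES dictionary, its sample affixes, and the effective_blacklist function are hypothetical names and values chosen for illustration only. The real preloaded lists ship with the product.

```python
# Hypothetical per-language defaults, for illustration only.
# The actual preloaded blacklists are provided by the product.
PRELOADED_AFFIXES = {
    "de": ["ge", "ung"],
    "nl": ["ge", "heid"],
}


def effective_blacklist(lang=None, affix_blacklist=None):
    """Combine the preloaded list for `lang` with the user-provided list.

    Assumption: "added" in the parameter description means the two lists
    are merged, with duplicates kept only once.
    """
    merged = []
    if lang:
        # Start from the preloaded list for this language, if any.
        merged.extend(PRELOADED_AFFIXES.get(lang.lower(), []))
    for affix in affix_blacklist or []:
        # Append user-provided affixes that are not already present.
        if affix not in merged:
            merged.append(affix)
    return merged
```

Under these assumptions, calling effective_blacklist("DE", ["ung", "keit"]) yields the preloaded German entries followed by "keit", while calling it with no arguments yields an empty list, matching the "Empty" defaults in the table.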

The filter reads the content of the input layer and produces a corpus data file that contains all words with their statistics in the corpus. This file will be used by the compound splitter analyzer of the indexing filter to recognize and split compound words that can be found in the corpus.
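
To make the role of the corpus file more concrete, here is a minimal Python sketch of the general technique: counting word occurrences, then using those counts to choose a split point in a compound word. It is an illustration under stated assumptions, not the filter's or the analyzer's actual algorithm. The function names, the tokenization rule, and the scoring rule (two-part splits only, ranked by the frequency of the rarer part) are all hypothetical, and the real corpus data file format is not described here.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_word_frequencies(documents, min_word_length=3):
    """Count lowercase words of at least `min_word_length` characters."""
    counts = Counter()
    for text in documents:
        for word in WORD_RE.findall(text.lower()):
            if len(word) >= min_word_length:
                counts[word] += 1
    return counts


def split_compound(word, counts, min_word_length=3):
    """Return the best two-part split of `word`, or [word] if none is known.

    A split is accepted only if both parts occur in the corpus; among
    accepted splits, the one whose rarer part is most frequent wins.
    """
    word = word.lower()
    best, best_score = [word], 0
    for i in range(min_word_length, len(word) - min_word_length + 1):
        head, tail = word[:i], word[i:]
        score = min(counts.get(head, 0), counts.get(tail, 0))
        if score > best_score:
            best, best_score = [head, tail], score
    return best


docs = ["Das Haus am See.", "Ein Boot am Haus."]
counts = build_word_frequencies(docs)
parts = split_compound("Hausboot", counts)  # ["haus", "boot"]
```

In this sketch, "am" is dropped because it is shorter than min_word_length, and "Hausboot" is split because both "haus" and "boot" appear in the sample corpus. A word whose parts are absent from the corpus is returned unsplit, which is why a corpus built from the indexed documents can help with domain-specific compounds.
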
Tip: This filter is not required in order to use the compound splitter analyzer: by default, a reference corpus containing most of the usual words in each installed language is provided with the PaF.
Attention: This filter must be included in a pipe which precedes the pipe containing the indexing filter.