afs_classifier_train - AFS - Reference Guides

AFS Filters Description

Product: AFS
AFS_Version: 7.12
Category: Reference Guides
Language: English

This filter generates a classify database. This database can then be used by the afs_doc_classify filter.

The filter is declared with the afs_classifier_train type. It is in the antidot-paf-misc package. It is a processor filter.

This filter can be instantiated only once at any given moment. It will not read the "instances" parameter in the configuration.

The Classifier Train filter specifications are described in the following table:

| Parameter name | Mandatory | Type | Default | Description |
|---|---|---|---|---|
| input_layer | No | layer | CONTENTS | Layer containing the input data. |
| nsmap | No | map | Empty map | Namespaces used to interpret the XPaths. |
| db_dir | No | directory | $AFS7/classifier | Output directory of the generated classify database. This directory can then be used as the value of the db_dir parameter of the afs_doc_classify filter. |
| content_xpaths | Yes | list | N/A | XPaths selecting the text to be extracted from the documents. |
| labels_xpath | Yes | string | N/A | XPath selecting one or several labels of the document. |
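Since only content_xpaths and labels_xpath are mandatory, a declaration can rely on the defaults for everything else. The following is a minimal sketch, not a reference configuration: the pipe layout and the `uri` value are illustrative, and the XPaths (`//title`, `//body`, `//category`) are hypothetical and must be adapted to the structure of the training documents.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<afs:PaF xmlns:afs="http://ref.antidot.net/v7/afs#" name="Test" service="0">
  <afs:pipe name="index" run="once">
    <afs:filter uri="#classifier_train" type="afs_classifier_train">
      <afs:args>
        <!-- input_layer, nsmap and db_dir are omitted: the defaults listed in the table apply -->
        <!-- Mandatory: text used for training (hypothetical XPaths) -->
        <afs:arg name="content_xpaths">
          <afs:list>
            <afs:param value="//title"/>
            <afs:param value="//body"/>
          </afs:list>
        </afs:arg>
        <!-- Mandatory: label(s) of the document (hypothetical XPath) -->
        <afs:arg name="labels_xpath" value="//category"/>
      </afs:args>
    </afs:filter>
  </afs:pipe>
</afs:PaF>
```

With such a declaration, the input is read from the CONTENTS layer and the database is written to the default directory, $AFS7/classifier.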

The filter takes XML documents as input, each containing text and one or several labels. It outputs a classify database that makes it possible to tag new, unlabeled documents with the existing labels. The behavior is similar to the spam filter of a mail client:
  • afs_classifier_train learns from documents known to be spam and documents known to be not-spam (the documents must form a representative sample).
  • afs_doc_classify tags new documents as spam or not-spam using the classify database generated by afs_classifier_train.
  • In practice, the behavior goes beyond this binary case: a new document can be tagged with several labels, and a probability is given for each label.

The following configuration is an example of use of the filter:

<pre><?xml version="1.0" encoding="UTF-8" standalone="no"?>
<afs:PaF xmlns:afs="http://ref.antidot.net/v7/afs#" name="Test" service="0">
  <afs:pipe name="index" run="once">
    <afs:filter uri="#classifier_train" type="afs_classifier_train">
      <afs:args>
        <!-- Where to write the generated classify database -->
        <afs:arg name="db_dir" value="iptc-fr"/>
        <!-- Input layer -->
        <afs:arg name="input_layer" value="CONTENTS"/>
        <!-- Text extracted from the document -->
        <afs:arg name="content_xpaths">
          <afs:list>
            <afs:param value="//p"/>
            <afs:param value="//Property[@FormalName='Keyword']/@Value"/>
            <afs:param value="//HeadLine"/>
          </afs:list>
        </afs:arg>
        <!-- Label(s) of the document -->
        <afs:arg name="labels_xpath" value="//SubjectCode/*[not(@Vocabulary)]/@FormalName"/>
      </afs:args>
    </afs:filter>
  </afs:pipe>
</afs:PaF></pre>
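To make the XPaths of this example more concrete, here is a sketch of a training document they could apply to. The element layout is a hypothetical, simplified NewsML-style structure and all values are illustrative; real input documents may be organized differently.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical, simplified training document (illustrative values only) -->
<NewsItem>
  <NewsComponent>
    <NewsLines>
      <!-- Matched by //HeadLine -->
      <HeadLine>Local club wins the national cup</HeadLine>
    </NewsLines>
    <DescriptiveMetadata>
      <SubjectCode>
        <!-- No Vocabulary attribute: its FormalName is selected as a label -->
        <Subject FormalName="15000000"/>
        <!-- Carries a Vocabulary attribute: excluded by not(@Vocabulary) -->
        <SubjectMatter FormalName="15054000" Vocabulary="urn:example:vocabulary"/>
      </SubjectCode>
      <!-- Matched by //Property[@FormalName='Keyword']/@Value -->
      <Property FormalName="Keyword" Value="football"/>
    </DescriptiveMetadata>
    <ContentItem>
      <DataContent>
        <!-- Matched by //p -->
        <p>The final was decided in extra time.</p>
      </DataContent>
    </ContentItem>
  </NewsComponent>
</NewsItem>
```

For this document, the content XPaths select the paragraph text, the keyword value "football" and the headline, while labels_xpath selects the single label "15000000".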