afs_entity_extract - AFS - Reference Guides

AFS Filters Description

Product
AFS
Platform
7.12
Category
Reference Guides
Language
English

This filter extracts named entities from the content of an input layer. More information is available after the parameter descriptions.

The filter is declared with the afs_entity_extract type. It is in the antidot-paf-entity-extract package. It is a processor filter.

The named entities extractor filter's parameters are described below:

extractor (optional; type: map; no default)
    Chooses which named entities extractor to use. Keys are ISO 639-1 language codes; values are an extractor type in {dbpedia, syllabs, antidot, none}. A fallback rule can be set with the "default" key. The default extractor is antidot. The none extractor does not extract any entities; a common pattern is to use none as the default rule for all unknown languages.

url (optional; type: map; no default)
    Overrides the extractor endpoint: the URL for web-service-based extractors such as DBpedia Spotlight or Syllabs, or the filesystem path to the database for the antidot extractor.

default_lang (optional; type: string; default: fr)
    Default language to use when handling a non-localized document, in ISO 639-1 format.

api_key (optional; type: string; no default)
    API key for the extractor, if applicable. Currently required for the Syllabs extractor.

input_layer (optional; type: layer; default: CONTENTS)
    Layer containing the plain text to extract the entities from.

output_layer (optional; type: layer; default: USER_1)
    Layer to fill with the extracted entities.

output_format (optional; type: string; default: XML)
    Output format in the output layer. Possible values are XML, JSON, and serialized protobuf.

enable_post_process (optional; type: boolean; default: true)
    When enabled, the output of entity extraction is cleaned and standardized per document.

The named entities extractor filter's deprecated parameters are described below:

syllabs_api_key (deprecated since 7.7; replaced by api_key)
    API key required by the Syllabs extractor. Mandatory for Syllabs, ignored by the other extractors.

Named entities are words or groups of words that belong to predefined textual categories (such as persons, locations, or organizations).

This filter provides three extractors: Antidot Extraction Technology, Syllabs, and DBpedia Spotlight.

Antidot Extraction Technology (AET), also known as Caribou, runs locally on the PaF server. AET needs databases, which currently exist for French and English only. Those databases are provided in the filter package.

Syllabs is a proprietary solution, accessed through a web service at http://api.syllabs.com/v0/entities. The filter queries the web service. An API key is needed and must be set in the api_key parameter. Supported languages are: English, Spanish, French, and Italian. Documentation is available at http://docs.api.syllabs.com/ref/resources/entities.html.

DBpedia Spotlight is a free solution, accessed through a web service; a demo is available at http://dbpedia-spotlight.github.io/demo/. The filter queries the web service. Supported languages are: Danish, German, English, Spanish, French, Hungarian, Italian, Portuguese, Russian, Swedish, and Turkish. Documentation can be found at https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
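To give an idea of what consuming a DBpedia Spotlight answer looks like downstream, here is a minimal sketch that parses a response in the shape of Spotlight's JSON annotate output. The sample payload is illustrative, not real API output; only the field names ("Resources", "@surfaceForm", "@offset", "@URI") follow Spotlight's documented response format.

```python
import json

# Trimmed payload in the shape DBpedia Spotlight's /annotate endpoint returns
# with "Accept: application/json" (sample data is illustrative).
sample_response = json.dumps({
    "@text": "Barack Obama was re-elected president.",
    "Resources": [
        {
            "@URI": "http://dbpedia.org/resource/Barack_Obama",
            "@surfaceForm": "Barack Obama",
            "@offset": "0",
            "@similarityScore": "0.999",
            "@types": "DBpedia:Person,Schema:Person",
        }
    ],
})

def spotlight_entities(raw):
    """Convert a Spotlight JSON payload into (surface form, offset, URI) tuples."""
    doc = json.loads(raw)
    return [
        (r["@surfaceForm"], int(r["@offset"]), r["@URI"])
        for r in doc.get("Resources", [])
    ]

print(spotlight_entities(sample_response))
```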

Attention: This filter does not use a cache, so every processed document triggers an HTTP request to the chosen API. The web service behind that API must therefore be able to answer at the required rate.

By default, this filter uses Antidot Extraction Technology (AET) to extract named entities. Since AET is a machine learning technology, the filter package ships the training databases that are mandatory for extraction. As of now, the package comes with databases for two languages: French and English. Documents in other languages cannot be processed by AET, and their paf_status will be set to KO.

The Antidot algorithm is trained to extract named entities from news documents. It extracts three entity classes: PERSON, LOCATION, and ORGANIZATION. In this context, it achieves an F1 score of 94.1% (recall: 92.3%, precision: 95.6%).

Attention: Outside of the news context, expect lower result quality.

Note: in English, four categories are extracted: PERSON, LOCATION, ORGANIZATION, and PRODUCT.

The Antidot entity extractor is based on a machine learning algorithm. A training database has been built for each language from news documents (only French and English are available at the moment). Here are some insights for a better understanding of what can be expected from Antidot entity extraction.

Training: for each document in the training dataset, a human labeled each entity. For instance, in the sentence "Barack Obama was re-elected president in November 2012.", "Barack Obama" is labeled as PERSON.

In English, possible labels are PERSON, LOCATION, ORGANIZATION, and PRODUCT. In French, only PERSON, LOCATION, and ORGANIZATION are available.

The algorithm then computes a set of features for each word. Some example features:
  • syntactic features:
    • word is capitalized (True/False); previous word is capitalized; next word is capitalized
    • current/previous/next word contains a digit
    • ...
  • gazetteer features:
    • word/group of words is in column N of gazetteer G
    • ...
  • other features...
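As a rough illustration of the syntactic features above, a toy extractor might look like the following sketch. The function name and the exact feature set are invented for illustration; they are not the filter's actual implementation.

```python
def word_features(words, i):
    """Compute toy syntactic features for the word at index i (illustrative only)."""
    w = words[i]
    prev_w = words[i - 1] if i > 0 else ""
    next_w = words[i + 1] if i < len(words) - 1 else ""
    return {
        "is_capitalized": w[:1].isupper(),
        "prev_is_capitalized": prev_w[:1].isupper(),
        "next_is_capitalized": next_w[:1].isupper(),
        "contains_digit": any(c.isdigit() for c in w),
    }

words = "Barack Obama was re-elected in 2012".split()
print(word_features(words, 1))  # features for "Obama"
```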

At training time, a model is built. For each feature (there are thousands) and each label, the model stores a weight for the rule "this feature generates this label"; the weight can be understood as the importance of that rule.
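The weighted-rule idea can be sketched as a simple linear scorer. All weights, feature names, and labels below are invented for illustration; the real model is far larger and works on whole word sequences, not isolated words.

```python
# Toy model: one weight per (feature, label) rule; all values are illustrative.
WEIGHTS = {
    ("is_capitalized", "PERSON"): 1.5,
    ("is_capitalized", "LOCATION"): 1.2,
    ("contains_digit", "PERSON"): -2.0,
    ("in_city_gazetteer", "LOCATION"): 3.0,
}
LABELS = ["PERSON", "LOCATION", "ORGANIZATION", "OTHER"]

def best_label(active_features):
    """Score each label as the sum of the weights of its active feature rules,
    then return the highest-scoring label."""
    scores = {
        label: sum(WEIGHTS.get((f, label), 0.0) for f in active_features)
        for label in LABELS
    }
    return max(scores, key=scores.get)

print(best_label({"is_capitalized", "in_city_gazetteer"}))
```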

Inference: entity computation

The afs_entity_extract filter loads this model. For each document paragraph, for instance "Paris Hilton lives in London":
  1. Every feature of each word in the paragraph is computed.
  2. According to the feature values, the most likely labels for the paragraph are computed, and a score is given to each entity.
  3. Finally, entities in a document are merged to avoid duplicates. For instance, the two entities ("Barack Obama", PERSON, 0.9) and ("Obama", PERSON, 0.8) are grouped into a single entity with two occurrences. Information about entity positions (the offsets) is kept. This last step can be skipped with the option enable_post_process="false".
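The duplicate-merging step can be sketched as follows. This is only an illustration of the grouping idea (a shorter mention joins a longer entity of the same type); the filter's actual merge logic may differ.

```python
def merge_entities(entities):
    """Merge duplicate mentions: a shorter mention joins a longer entity of the
    same type when it equals one of the longer entity's words (sketch only)."""
    merged = []  # each item: {"text", "type", "matches": [(offset, form), ...]}
    # Process longest mentions first so short forms can attach to them;
    # scores are ignored in this simplified sketch.
    for text, etype, _score, offset in sorted(
        entities, key=lambda e: len(e[0]), reverse=True
    ):
        target = None
        for m in merged:
            if m["type"] == etype and (text == m["text"] or text in m["text"].split()):
                target = m
                break
        if target is None:
            target = {"text": text, "type": etype, "matches": []}
            merged.append(target)
        target["matches"].append((offset, text))
    return merged

ents = [("Barack Obama", "PERSON", 0.9, 10), ("Obama", "PERSON", 0.8, 120)]
result = merge_entities(ents)
print(result[0]["text"], len(result[0]["matches"]))
```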

We would like to emphasize that afs_entity_extract is not afs_annotate: afs_entity_extract can find entities that are not in any gazetteer and that have never been seen at training time, and it uses information from the whole paragraph.

Our training was done on news documents. Extracting entities from documents of another kind will not give results as good as for news documents.

Language configuration:

Named entity extraction is a localized process. This filter can be configured to choose an extractor depending on the document language. The document language can be set by the afs_lang_set filter or detected by the afs_lang_detect filter.

It is possible to specify which extractor will process each language. For instance, this configuration map:
<afs:filter uri="ner_uri" type="ner_id">
    <afs:args>
        <afs:arg name="extractor">
            <afs:map>
            <afs:param key="FR" value="antidot"/>
            <afs:param key="EN" value="antidot"/>
            <afs:param key="DE" value="dbpedia"/>
            <afs:param key="default" value="none"/>
            </afs:map>
        </afs:arg>
    </afs:args>
</afs:filter>
English and French documents will be processed by AET, German documents will be processed by DBpedia Spotlight, and for any other language, entities will not be extracted. The PaF document status of these last documents will nevertheless be set to OK.

The filter output (in the format selected by the output_format parameter) contains the list of found named entities, with the following information for each entity:
  • type: entity type (PERSON, LOCATION, ORGANIZATION, or UNDEFINED if not provided by the API or not recognized).
  • text (string): character string corresponding to the entity, as recognized in the source text.
  • count (integer): number of times the entity appears in the source text.
  • confidence (float): confidence value assigned to the entity recognition (if provided by the API).
  • uri (string): URI associated with the entity (if provided by the API).
  • types (string): type(s) as provided by the API (raw value).
  • matches: list of the entity's occurrences, with, for each occurrence:
    • offset: entity position (UTF-8),
    • form: text form at this position.
  • firstname / lastname: if the entity is of type PERSON, this extension contains the entity's details.

Filter output example (with output_format parameter set to XML):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<afs:Entities xmlns:afs="http://ref.antidot.net/v7/afs#">
    <afs:entities type="LOCATION" text="Saint-Julien-en-Genevois" count="1" confidence="0.866">
        <afs:matches offset="2445" form="Saint-Julien-en-Genevois"/>
    </afs:entities>
    <afs:entities type="LOCATION" text="France" count="4" confidence="0.990">
        <afs:matches offset="47" form="France"/>
        <afs:matches offset="213" form="France"/>
        <afs:matches offset="446" form="France"/>
        <afs:matches offset="2533" form="France"/>
    </afs:entities>
    <afs:entities type="PERSON" text="Daniel Tasoeur" count="1" confidence="0.997">
        <afs:matches offset="2854" form="Daniel Tasoeur"/>
    </afs:entities>
    <afs:entities type="ORGANIZATION" text="Pouvoirs Publics" count="1" confidence="0.682">
        <afs:matches offset="2778" form="Pouvoirs Publics"/>
    </afs:entities>
    <afs:entities type="LOCATION" text="Tours" count="1" confidence="0.954">
        <afs:matches offset="2324" form="Tours"/>
    </afs:entities>
    <afs:entities type="PERSON" text="Manuel Valls" count="2" confidence="0.838">
        <afs:matches offset="1914" form="Manuel Valls"/>
        <afs:matches offset="3022" form="Manuel Valls"/>
    </afs:entities>
</afs:Entities>
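Downstream code can read this XML output with any standard parser; the only detail to watch is the afs namespace. A minimal sketch with Python's xml.etree, using a trimmed fragment in the shape of the example above:

```python
import xml.etree.ElementTree as ET

# Trimmed fragment in the shape of the filter's XML output.
xml_output = """<afs:Entities xmlns:afs="http://ref.antidot.net/v7/afs#">
    <afs:entities type="LOCATION" text="France" count="2" confidence="0.990">
        <afs:matches offset="47" form="France"/>
        <afs:matches offset="213" form="France"/>
    </afs:entities>
</afs:Entities>"""

# Map the afs prefix to its namespace URI so findall() can resolve it.
NS = {"afs": "http://ref.antidot.net/v7/afs#"}

root = ET.fromstring(xml_output)
for entity in root.findall("afs:entities", NS):
    offsets = [int(m.get("offset")) for m in entity.findall("afs:matches", NS)]
    print(entity.get("type"), entity.get("text"), offsets)
```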