afs_dates_normalize - AFS - Reference Guides

AFS Filters Description

Product: AFS
AFS_Version: 7.11
Category: Reference Guides
Language: English

The date normalization filter converts dates found in a document into a single, standardized format.

The filter is declared with the afs_dates_normalize type. It is in the antidot-paf-misc package. It is a processor filter.

This filter can be instantiated only once at any given moment. It will not read the "instances" parameter in the configuration.

The Dates normalization filter specifications are described in the following table:

| Parameter name | Mandatory | Type | Default | Description |
|---|---|---|---|---|
| xpaths | Yes | list | N/A | List of XPaths to read. |
| date_format | No | string | ISO8601 | Parsing format. This parameter will be deprecated in the next release; use the date_formats parameter instead. |
| date_formats | No | list | N/A | ICU patterns tried sequentially to normalize the date (first match wins). When date_format is set to a valid ICU pattern, it is placed at the top of the list of date formats. |
| normalized_date_format | No | string | yyyy-MM-dd | Output format of the normalized date. |
| strict | No | boolean | true | If true (default), every date found for the provided XPaths must be valid. If false, at least one date must be valid. |
| input_layer | No | layer | CONTENTS | Input layer. |
| nsmap | No | map | Empty map | Namespace map used to interpret every XPath. |
| output_format | No | string | XML | Serialized format in the OUTPUT layer. Values can be XML, JSON or SERIALIZED_PROTOBUF. |
| output_layer | No | layer | CONTENTS | Output layer. Note that the content of the layer is overwritten by the output of the filter. |
| default_locale | No | string | FR | Default locale used to parse dates containing words (such as July, Mars, etc.) when the document has no language set. |
| force_locale | No | string | No force, use doc language or default_locale | If set, forces the locale. The document language and the default_locale parameter are then ignored. |
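The "first match wins" behaviour of date_formats can be illustrated with a minimal Python sketch. This is not the filter's implementation: Python strptime directives stand in for ICU patterns, and the function name parse_first_match and the FORMATS list are hypothetical.

```python
from datetime import datetime


def parse_first_match(value, formats):
    """Return (datetime, format) for the first format that parses value, else None."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt), fmt
        except ValueError:
            # This format does not match; try the next one in the list.
            continue
    return None


# Assumption: strptime equivalents of the ICU patterns dd/MM/yyyy and yyyy-MM-dd.
FORMATS = ["%d/%m/%Y", "%Y-%m-%d"]

parsed = parse_first_match("12/11/2007", FORMATS)
if parsed is not None:
    date_value, used_format = parsed
    # Re-serialize with the equivalent of the default normalized_date_format.
    print(date_value.strftime("%Y-%m-%d"), "matched", used_format)
```

The order of the list matters: an ambiguous value is interpreted by whichever pattern appears first.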

Typically, this filter is used to prepare the XML data for the categorization process of the facet engine. The following is an example of the filter output:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<afs:date_normalisations xmlns:afs="http://ref.antidot.net/v7/afs#">
    <afs:xpath xpath="/page/date">
        <afs:NormalizedDate origin="2007-11-12" state="true" normalized="2007-11-12" startDate="2007-11-12" endDate="2007-11-12">
            <afs:centuries_ext>
                <afs:century>2000</afs:century>
            </afs:centuries_ext>
            <afs:decades_ext>
                <afs:decade>2000</afs:decade>
            </afs:decades_ext>
            <afs:years_ext>
                <afs:year>2007</afs:year>
            </afs:years_ext>
        </afs:NormalizedDate>
    </afs:xpath>
</afs:date_normalisations>
afs:date_normalisations is the root element. It contains one afs:xpath element for each configured XPath. Each afs:xpath contains an afs:NormalizedDate element, which has the following attributes:
  • origin: the original date, as in the input document.
  • state: false if the filter was unable to normalize the date (date format not recognized).
  • normalized: normalized value.
  • startDate: start date of the interval associated with the date. Every date is associated with an interval. For example, if the date is 2012-01 with the yyyy-MM format, the day is not given, so the interval is the whole month ("1 to 31 January 2012") and startDate is 2012-01-01.
  • endDate: end date of the interval associated with the date.
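The interval logic behind startDate and endDate can be sketched as follows. This is an illustration only, not the filter's code: the function name date_interval is hypothetical, and the year-only case (interval covering the whole year) is an assumption extrapolated from the month example above.

```python
import calendar
from datetime import date


def date_interval(year, month=None, day=None):
    """Return (start, end) of the interval covered by a possibly partial date."""
    if month is None:
        # Assumption: a year-only date covers the whole year.
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:
        # Month given without a day: the interval is the whole month.
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    # Complete date: start and end are the same day.
    exact = date(year, month, day)
    return exact, exact


start, end = date_interval(2012, 1)
print(start.isoformat(), end.isoformat())
```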
afs:NormalizedDate contains the following sub-elements (extensions):
  • afs:centuries_ext: contains the century.
  • afs:decades_ext: contains the decade.
  • afs:years_ext: contains the year.
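A downstream consumer could read this output with any namespace-aware XML parser. The sketch below uses Python's standard xml.etree.ElementTree on the sample shown above; the function name read_normalized_dates and the dictionary keys are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample output.
AFS_NS = {"afs": "http://ref.antidot.net/v7/afs#"}

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<afs:date_normalisations xmlns:afs="http://ref.antidot.net/v7/afs#">
    <afs:xpath xpath="/page/date">
        <afs:NormalizedDate origin="2007-11-12" state="true" normalized="2007-11-12" startDate="2007-11-12" endDate="2007-11-12">
            <afs:centuries_ext><afs:century>2000</afs:century></afs:centuries_ext>
            <afs:decades_ext><afs:decade>2000</afs:decade></afs:decades_ext>
            <afs:years_ext><afs:year>2007</afs:year></afs:years_ext>
        </afs:NormalizedDate>
    </afs:xpath>
</afs:date_normalisations>
"""


def read_normalized_dates(xml_bytes):
    """Collect one dict per afs:NormalizedDate found under each afs:xpath."""
    root = ET.fromstring(xml_bytes)
    results = []
    for xpath_el in root.findall("afs:xpath", AFS_NS):
        for nd in xpath_el.findall("afs:NormalizedDate", AFS_NS):
            results.append({
                "xpath": xpath_el.get("xpath"),
                "origin": nd.get("origin"),
                # state is "false" when the date could not be normalized.
                "ok": nd.get("state") == "true",
                "normalized": nd.get("normalized"),
                "start": nd.get("startDate"),
                "end": nd.get("endDate"),
                "year": nd.findtext("afs:years_ext/afs:year", namespaces=AFS_NS),
            })
    return results


for entry in read_normalized_dates(SAMPLE):
    print(entry["xpath"], entry["normalized"], entry["ok"])
```

Checking the state attribute before using normalized avoids treating an unrecognized date as a valid value.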
The date_format parameter supports two kinds of syntax:
  • A keyword from a predefined set, each of which selects a dedicated parsing algorithm:
    • ISO8601 parses dates with the ISO8601 rev 2003 algorithm. Recognized format (partial dates are allowed, e.g. 2016-07): yyyy[-]MM[-]dd'T'HH[:]mm[:]ss[.,]<digit>*[Z<timezone>]
    • RFC822 parses dates with the RFC822 algorithm. Format: EEE, dd MMM yy HH:mm:ss V (e.g. Tue, 19 Jul 2016 06:50:17 GMT)
    • YYYY_MM_DD parses dates with the construction Year <separator> month <separator> day where several separators are tried (month and day are optional). This means that a date like 2010-08 is parsed successfully.
      • YYYY_MM is the same as YYYY_MM_DD, but day is prohibited.
      • YYYY stands for year only.
  • An ICU pattern is directly sent to the ICU processor. The exact vocabulary which can be used to describe the patterns can be found in the ICU documentation.
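To make the YYYY_MM_DD behaviour concrete, here is a rough Python sketch of a "year, optional month, optional day" parser that tries several separators. It is not the filter's algorithm: the separator set is an assumption (the text above only says that several separators are tried), the function name parse_yyyy_mm_dd is hypothetical, and no range check is performed on month or day.

```python
import re

# Assumption: illustrative separator set, not taken from the AFS documentation.
SEPARATORS = ["-", "/", "."]


def parse_yyyy_mm_dd(value):
    """Return (year, month, day) with None for missing parts, or None if no match."""
    for sep in SEPARATORS:
        s = re.escape(sep)
        # Year is mandatory; month and day are optional but must use the same separator.
        pattern = rf"(\d{{4}})(?:{s}(\d{{1,2}})(?:{s}(\d{{1,2}}))?)?"
        m = re.fullmatch(pattern, value)
        if m:
            year, month, day = (int(g) if g is not None else None for g in m.groups())
            return year, month, day
    return None


print(parse_yyyy_mm_dd("2010-08"))
print(parse_yyyy_mm_dd("2010/08/15"))
```

A YYYY_MM variant would drop the innermost optional group, and a YYYY variant would keep only the four-digit year.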

The language used to parse the date is determined as follows:
  • If the force_locale parameter is set, the forced locale is used.
  • Otherwise, if the document has a specific language set, this language is used.
  • If the document language is UNKNOWN, the default locale set by the default_locale parameter is used.
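The rules above amount to a simple priority order, sketched here in Python. The function name resolve_locale and its argument names are hypothetical; only the parameter names force_locale and default_locale, the UNKNOWN value and the FR default come from this page.

```python
def resolve_locale(force_locale=None, doc_language=None, default_locale="FR"):
    """Pick the parsing locale: forced locale, then document language, then default."""
    if force_locale:
        # force_locale overrides both the document language and default_locale.
        return force_locale
    if doc_language and doc_language != "UNKNOWN":
        return doc_language
    return default_locale


print(resolve_locale(doc_language="EN"))
print(resolve_locale(doc_language="UNKNOWN"))
print(resolve_locale(force_locale="DE", doc_language="EN"))
```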

Attention: a date_format value that selects a complex algorithm (ISO8601, RFC822, etc.) cannot be combined with the date_formats parameter. If both are set, the date_format parameter is ignored.
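The interaction between date_format and date_formats can be summarized with a short Python sketch. It only restates the rules given on this page and is not the filter's code; the function name effective_formats and the returned tuple shape are hypothetical.

```python
# Keywords listed on this page for the date_format parameter.
KEYWORDS = {"ISO8601", "RFC822", "YYYY_MM_DD", "YYYY_MM", "YYYY"}


def effective_formats(date_format=None, date_formats=None):
    """Return ("keyword", name) or ("patterns", [icu_pattern, ...])."""
    patterns = list(date_formats or [])
    if date_format in KEYWORDS:
        if patterns:
            # A keyword date_format is ignored when date_formats is defined.
            return "patterns", patterns
        return "keyword", date_format
    if date_format:
        # An ICU pattern given as date_format goes to the top of the list.
        patterns.insert(0, date_format)
    if patterns:
        return "patterns", patterns
    # Neither parameter set: date_format defaults to ISO8601.
    return "keyword", "ISO8601"


print(effective_formats("ISO8601", ["dd/MM/yyyy"]))
print(effective_formats("yyyy.MM.dd", ["dd/MM/yyyy"]))
```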