afs_data_diff - AFS - Reference Guides

AFS Filters Description

Product: AFS
Platform: 7.12
Category: Reference Guides
Language: English

Optimizes the processing of a full export by converting it into an incremental update. The filter must be placed in a pipe before the actual processing, and usually after an unzip or split filter.

The filter is declared with the afs_data_diff type. It is in the antidot-paf-misc package. It is a visitor filter.

This filter can be instantiated only once at any given moment. It will not read the "instances" parameter in the configuration.

The parameters of the "Convert full export to incremental update" filter are described in the following table:

Parameter name   Mandatory   Type    Default    Description
--------------   ---------   -----   --------   ------------------------------------------------
input_layer      No          layer   CONTENTS   The layer holding the data to check for updates

This filter is used when input data cannot easily or efficiently be exported incrementally, meaning that only full exports are available. Working only with full exports is not optimal, however, because:
  • If only some of the documents have been updated, it is much more efficient to detect the updated documents and process only those, rather than the whole export. For example, it is not uncommon in eCommerce exports for 98% of the documents to be unchanged.
  • ACS Alert mechanisms rely on a processing tag (PaFId) set on every document when it is processed. If all documents are reprocessed on every run, the Alert mechanism becomes useless, because it detects every document as updated after each PaF execution. With afs_data_diff, only modified data is processed and alerts work as expected.

afs_data_diff relies on AIF’s Rich Versioned Layered Documents. By default, the revision mechanism is disabled: it must be activated by setting PaF/Filters/Layers/overwrite to false. At least two revisions of the layers must be stored, so it is safe to leave PaF/Filters/Layers/maxRevisions at its default value.

Operating mode: When operating inside a data-driven PaF, this filter automatically detects the processing mode (full or diff). In diff mode, the filter does not perform any action. In full mode, or outside a data-driven PaF, it acts as described below.

afs_data_diff is a visitor filter. Let E1 and E2 be two consecutive full exports on which afs_data_diff will work. Let p1 be the PaFId associated with E1, and p2 the PaFId associated with E2. afs_data_diff performs the following operations:
  • it detects deleted documents (present in E1 but not in E2) and deletes them (sets their status to DELETED),
  • it detects unmodified documents (identical in E1 and E2), sets their PaFId to p1 and their status to END_OF_LIFE,
  • it detects modified documents (different between E1 and E2) and does not alter them (PaFId p2, ...),
  • it detects new documents (present in E2 but not in E1) and does not alter them (PaFId p2, ...).
At the end of its run, the filter logs statistics such as the percentages of new, deleted, and modified documents.
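
The four cases above can be sketched in a few lines of Python. This is only an illustration of the classification logic, not the filter's implementation: the `classify` and `percentages` functions, the representation of an export as a dictionary mapping document URI to a content fingerprint, and the choice to compute percentages over all URIs seen in either export are assumptions made for the example.

```python
from typing import Dict, List


def classify(e1: Dict[str, str], e2: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two full exports given as {document URI: content fingerprint}."""
    result: Dict[str, List[str]] = {
        "deleted": [], "unchanged": [], "modified": [], "new": [],
    }
    for uri in e1:
        if uri not in e2:
            result["deleted"].append(uri)      # in E1 only: status DELETED
    for uri, fingerprint in e2.items():
        if uri not in e1:
            result["new"].append(uri)          # in E2 only: left as-is (PaFId p2)
        elif e1[uri] == fingerprint:
            result["unchanged"].append(uri)    # identical: PaFId p1, END_OF_LIFE
        else:
            result["modified"].append(uri)     # different: left as-is (PaFId p2)
    return result


def percentages(result: Dict[str, List[str]]) -> Dict[str, float]:
    """Share of each category over all URIs seen in either export (assumed base)."""
    total = sum(len(uris) for uris in result.values())
    if total == 0:
        return {category: 0.0 for category in result}
    return {category: 100.0 * len(uris) / total for category, uris in result.items()}


if __name__ == "__main__":
    export_1 = {"doc:1": "a", "doc:2": "b", "doc:3": "c"}
    export_2 = {"doc:2": "b", "doc:3": "C", "doc:4": "d"}
    outcome = classify(export_1, export_2)
    print(outcome)
    print(percentages(outcome))
```

In this sketch only the documents listed under "modified" and "new" would go on to the processing pipe, which is where the savings come from when most of an export is unchanged.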

Advanced information: afs_data_diff is simple and efficient:
  • To detect whether a document has been modified between two exports, the revision mechanism is used. If the CONTENTS layer has a new revision for p2, the document has been updated. Otherwise p1 and p2 share the same revision, and the layer metadata indicates that (at least) two copies exist for this revision.
  • New and deleted documents are found by comparing the lists of documents in p1 and p2. Documents in p2 but not in p1 are new; documents in p1 but not in p2 are deleted.
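
The revision-based check can be modelled with a small toy data structure. The sketch below is a conceptual model only: the `Revision` and `Layer` classes and their methods are hypothetical and do not reflect the actual storage format or API of AIF's versioned layers. It assumes that storing identical content for a new PaFId reuses the latest revision instead of creating a new one.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Revision:
    content: str
    paf_ids: Set[str] = field(default_factory=set)  # runs sharing this revision


@dataclass
class Layer:
    revisions: List[Revision] = field(default_factory=list)

    def store(self, paf_id: str, content: str) -> None:
        """Reuse the latest revision when content is identical, else add one."""
        if self.revisions and self.revisions[-1].content == content:
            self.revisions[-1].paf_ids.add(paf_id)
        else:
            self.revisions.append(Revision(content, {paf_id}))

    def revision_of(self, paf_id: str) -> Optional[Revision]:
        """Return the revision attached to the given run, if any."""
        for revision in self.revisions:
            if paf_id in revision.paf_ids:
                return revision
        return None

    def is_modified(self, p1: str, p2: str) -> bool:
        """True when p2 has its own revision instead of sharing p1's.

        Assumes the document exists in both runs; new and deleted documents
        are detected separately by comparing document lists.
        """
        return self.revision_of(p1) is not self.revision_of(p2)


if __name__ == "__main__":
    unchanged = Layer()
    unchanged.store("p1", "<product>42</product>")
    unchanged.store("p2", "<product>42</product>")
    print(unchanged.is_modified("p1", "p2"))

    changed = Layer()
    changed.store("p1", "<product>42</product>")
    changed.store("p2", "<product>43</product>")
    print(changed.is_modified("p1", "p2"))
```

The point of the model is that no content comparison is needed at diff time: the decision reduces to checking whether two runs point at the same stored revision.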

Guidelines:
  • afs_data_diff is a visitor filter: it should therefore be placed in a first pipe dedicated to extraction, before the pipe used for processing.
  • Processing of large XML files or of archive files usually implies using afs_xml_split or afs_zip_extract. In this pattern afs_data_diff should be used after these filters in order to detect the updates in the resulting documents (and not in the “source” atomic document).

Tip: The generator filters (such as unzip and/or split) preceding afs_data_diff must produce stable document URIs from one run to the next, so that afs_data_diff can identify the same documents in consecutive runs and therefore work properly.
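
To see why URI stability matters, compare two hypothetical ways of naming the documents produced by a split step. The functions, the `id` field, and the URI scheme below are invented for illustration and say nothing about how afs_xml_split or afs_zip_extract actually build URIs.

```python
from typing import Dict, List


def positional_uris(source: str, records: List[Dict[str, str]]) -> List[str]:
    """Unstable naming: the URI depends on the record's position in the export."""
    return [f"{source}#{index}" for index, _ in enumerate(records)]


def keyed_uris(source: str, records: List[Dict[str, str]], key: str = "id") -> List[str]:
    """Stable naming: the URI depends only on a business key carried by the record."""
    return [f"{source}#{record[key]}" for record in records]


if __name__ == "__main__":
    run_1 = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    run_2 = [{"id": "B"}, {"id": "C"}]  # record A was removed upstream

    # Positional naming: B inherits A's URI and C inherits B's, so a diff
    # would compare unrelated documents and report the wrong one as deleted.
    print(positional_uris("catalog.xml", run_1))
    print(positional_uris("catalog.xml", run_2))

    # Keyed naming: B and C keep their URIs, and only A disappears.
    print(keyed_uris("catalog.xml", run_1))
    print(keyed_uris("catalog.xml", run_2))
```

With positional naming, removing a single record shifts every following URI, so most of the export would look modified; with keyed naming the diff reflects the real change.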