De-Duplication

A good ID is bi-directionally unique; an ID represents a single object, and a single object is represented by only one ID. In order to guarantee universal uniqueness, EIDR content registrations go through a central system that uses a de-duplication module to guarantee that an object is unique. Once a unique ID is assigned to an object, the ID becomes a persistent and permanent part of the registry, available for use by the media and entertainment ecosystem.

Overview

The de-duplication module responds to a registry request with one of four outcomes.

  • No Duplicate: The record submitted is unique.
  • Duplicate: The record submitted is a duplicate of an existing object in the registry.
  • Potential Duplicate: It is likely, but not certain, that the record submitted is a duplicate of one or more records in the registry. Synchronous transactions (such as Match) return the match candidate list with confidence scores for the user to review; asynchronous transactions refer the match candidate list to EIDR Operations for review, after which the user receives either a New ID (no duplicate) or a Matched ID (duplicate).
  • Rejected: The submitted record was erroneous or ambiguous and could not be processed further by the registry operator. After correcting the errors or omissions, the record may be re-submitted.

When an attempt is made to register or modify an EIDR content record, the system first decides whether the action would result in duplicate records in the database. This may occur, for example, if two back-catalogs contain records for the same movie or TV show. Allowing the registration or modification would violate the principle of one record/one ID.

There are sets of rules for determining candidates that match the newly created or modified record, from which a matching system generates scores. Different referent types and relationships have different scoring rules. Scores are compared against two thresholds:

  • Low threshold: Anything below this is presumably not a potential duplicate
  • High threshold: Anything greater than or equal to this is almost certainly an exact duplicate of the requested registration or modification

The normal operational mode for registration and modification is the asynchronous path. If there is only one candidate and it is at or above the High Threshold, the system returns the identified duplicate ID in lieu of processing the requested registration or modification. The item returned as the duplicate is probably good enough for the registry to use as the EIDR ID for the requested registration. However, if you are certain that the submitted record is unique despite its apparent similarity to the identified duplicate, you have two options: provide additional metadata sufficient to disambiguate the record and re-submit the registry request, or request a manual review by setting the Operation element’s dedupMode flag to “manual”.

If there are no candidates above the Low Threshold, then the registration or modification is allowed to complete. Upon successful registration, you receive a new EIDR ID; otherwise, for modification requests, the underlying metadata is modified accordingly.

In all other cases in the asynchronous workflow where there are candidates above the Low Threshold, EIDR manually reviews the attempted registration or modification to determine whether it is an exact duplicate or a new item.

In the synchronous workflow, which includes basic UI interaction, there are three possible outcomes:

  • Success with no matches (new ID issued or existing record modified)
  • One high threshold match (existing ID returned, registration or modification not processed)
  • One low threshold match or multiple matches of any type (return match list with confidence scores)
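
The asynchronous and synchronous rules above can be summarized in a short Python sketch. This is an illustration only, not the registry's implementation: the names `Outcome` and `classify` are hypothetical, the threshold values used when calling it are arbitrary, and it assumes that only scores at or above the Low Threshold count as candidates.

```python
from enum import Enum
from typing import Sequence


class Outcome(Enum):
    NO_DUPLICATE = "no duplicate"      # request is processed normally
    DUPLICATE = "duplicate"            # existing ID is returned instead
    MANUAL_REVIEW = "manual review"    # asynchronous workflow only
    CANDIDATE_LIST = "candidate list"  # synchronous workflow only


def classify(scores: Sequence[float], low: float, high: float,
             synchronous: bool) -> Outcome:
    """Map candidate scores to an outcome using the two thresholds."""
    # Assumption: scores below the low threshold are not candidates at all.
    candidates = [s for s in scores if s >= low]
    if not candidates:
        return Outcome.NO_DUPLICATE
    # Exactly one candidate, at or above the high threshold.
    if len(candidates) == 1 and candidates[0] >= high:
        return Outcome.DUPLICATE
    # One candidate between the thresholds, or several candidates of any type.
    return Outcome.CANDIDATE_LIST if synchronous else Outcome.MANUAL_REVIEW
```

For example, with hypothetical thresholds of 50 and 90, a single candidate scoring 60 leads to manual review in the asynchronous workflow and to a returned candidate list in the synchronous one.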

When reviewing the results after a failure, the user takes one of the following actions:

  • The user identifies an exact match and uses that ID. This completes the task, though updating the existing record with expanded metadata (including alternate IDs) is recommended practice.
  • The metadata must be modified (e.g., fields added) to avoid a match: the user edits the metadata and resubmits.
  • The metadata is correct as submitted: the user submits asynchronously to trigger manual review and returns later to check the status using the token.
  • There is a metadata or validation error: the user fixes the metadata and resubmits.

Matching API

It is possible to obtain de-duplication results without submitting a create or modify request to the registry by using the Match API. This API call is similar to the asynchronous Register API except that no changes are applied to the registry. Instead, the de-duplication results are returned to the user for review. As before, the possible results for each match transaction indicate what would happen at registration time: No Match (the record should be re-submitted to create a new ID), Match (an ID already exists for the record), or Candidate List (possible matches are returned with their confidence scores).

Tokens and Batches

The EIDR HTTP API has several calls, referred to as batchable operations, that can modify the contents of the Registry. These are Create, Modify, AddRelationship, RemoveRelationship, ReplaceRelationship, Delete, Alias, and Promote.

The EIDR API distinguishes requests along two dimensions: single versus batch requests, and immediate versus asynchronous responses.

  • In reality, all EIDR requests are batch requests. What is often called a single or non-batch request is just a batch containing one request.
  • The immediate/async response is a general mechanism, but can only be used for a batch size of one (a single request).

Additionally, the EIDR API uses tokens to track the status of batchable operations.

This section contains a short overview describing how batch/single requests, immediate/async responses, and tokens interact with each other, with examples of the Registry responses for the various combinations.

NOTE: It is important to distinguish the Response returned by the Registry from the value returned by a call through the SDK. In particular, the SDK provides some help with various errors and the registry’s occasionally inconsistent error replies, but the objects returned by the SDK have a direct mapping to the Registry Response elements.

Single/Batch

All requests are submitted through the HTTP API as a batch. Batches of one (single requests) are treated somewhat differently from batches with multiple requests.

All the operations in a batch must be the same (for example, all Create or all Modify). The registry returns an Invalid Request Error for a batch that violates this constraint.
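
A client can catch a mixed batch before sending it. The sketch below is a hypothetical client-side pre-check, not part of the EIDR SDK: `check_batch` is an invented name, the operation names are the batchable operations listed in this article, and rejecting an empty batch is an added assumption.

```python
from typing import Sequence

# Batchable operation names as listed in this article.
BATCHABLE = {"Create", "Modify", "AddRelationship", "RemoveRelationship",
             "ReplaceRelationship", "Delete", "Alias", "Promote"}


def check_batch(operations: Sequence[str]) -> None:
    """Raise ValueError if the batch is empty, unknown, or mixes operation types."""
    if not operations:
        # Assumption: an empty batch is treated as a client-side error.
        raise ValueError("batch is empty")
    unknown = sorted(set(operations) - BATCHABLE)
    if unknown:
        raise ValueError(f"not a batchable operation: {unknown}")
    if len(set(operations)) > 1:
        raise ValueError("all operations in a batch must be the same type")
```

Calling `check_batch(["Create", "Create"])` returns silently, while `check_batch(["Create", "Modify"])` raises `ValueError`. The registry remains the authority; this only avoids a round trip for an obviously invalid batch.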

Immediate/Asynchronous

In order to guarantee uniqueness, EIDR sends requests that modify an object’s metadata to the de-duplication system. In most cases the de-duplication system returns a result automatically. If there is ambiguity that cannot be resolved by the software, one of two things will happen:

  • If the request is marked as immediate-response (synchronous), the registry immediately returns a response to the application, giving details of the potential problems. In some cases, immediate-response requests return more detailed status information than asynchronous requests.
  • If the request is not marked as immediate-response, it is sent for manual de-duplication. Registry operators make a decision, which is returned to the application. This process is not real-time, and these requests are usually referred to as asynchronous.

Immediate response applies only to single requests, and all multiple-request batches are non-immediate (asynchronous). If an application requests immediate response for a batch of more than one item, the registry returns an Invalid Request error. For example:

<Response xmlns="https://www.eidr.org/schema" version="2.6.0">
 <Status>
  <Code>3</Code>
  <Type>invalid request</Type>
 </Status>
</Response>
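
An application can read the status code and type from such a response with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: `read_status` is a hypothetical helper name, and the namespace URI is copied from the example above.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

# Namespace URI copied from the example response above.
NS = {"e": "https://www.eidr.org/schema"}

RESPONSE = """<Response xmlns="https://www.eidr.org/schema" version="2.6.0">
 <Status>
  <Code>3</Code>
  <Type>invalid request</Type>
 </Status>
</Response>"""


def read_status(xml_text: str) -> Tuple[int, str]:
    """Return (code, type) from the top-level Status element of a Response."""
    root = ET.fromstring(xml_text)
    code = root.findtext("e:Status/e:Code", namespaces=NS)
    kind = root.findtext("e:Status/e:Type", namespaces=NS)
    if code is None or kind is None:
        raise ValueError("response has no Status/Code or Status/Type element")
    return int(code), kind


print(read_status(RESPONSE))
```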

Tokens

Every batchable request generates a token for the request; a multi-item request additionally generates a token for each operation in the batch. This is done with two kinds of tokens:

  • Operation tokens, which refer to individual Create, Modify, etc. requests and are returned in the /Response/RequestStatusResults/OperationStatus/Token XML element.
  • Batch tokens, which refer to the status of a batch request. These are returned in the /Response/RequestStatus/Token element.

In addition, the user can assign a unique User Token to any batchable operation, generally the user’s internal ID for the associated transaction. This may simplify certain integrated system workflows, since the user’s system will not have to store the EIDR token.

Information is extracted from tokens with the StatusLookup request. Operation Tokens have detailed information about the status of an individual request (for example, a single Create or Modify). Batch tokens have information about the status of the batch and any available information about the individual items within the batch. This information includes the Operation Token and current state for each item in the batch.

Batches with a single item generate only a single token. This is treated as an Operation Token whenever information relating to it is returned from the Registry (for example, when it is initially generated, and when it is requested via StatusLookup).
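
The two element paths given above can be used to pull tokens out of a response. The sketch below uses Python's standard `xml.etree.ElementTree`. The sample document is invented for illustration: it contains only the elements named in this section, the token values are placeholders, and a real Registry response will carry more fields and may not contain both kinds of token at once. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

NS = {"e": "https://www.eidr.org/schema"}

# Illustrative sample only; token values are placeholders.
SAMPLE = """<Response xmlns="https://www.eidr.org/schema">
 <RequestStatus><Token>BATCH-PLACEHOLDER</Token></RequestStatus>
 <RequestStatusResults>
  <OperationStatus><Token>OP-PLACEHOLDER-1</Token></OperationStatus>
  <OperationStatus><Token>OP-PLACEHOLDER-2</Token></OperationStatus>
 </RequestStatusResults>
</Response>"""


def batch_token(root: ET.Element) -> Optional[str]:
    """Return /Response/RequestStatus/Token, or None if absent."""
    return root.findtext("e:RequestStatus/e:Token", namespaces=NS)


def operation_tokens(root: ET.Element) -> List[str]:
    """Return every /Response/RequestStatusResults/OperationStatus/Token."""
    path = "e:RequestStatusResults/e:OperationStatus/e:Token"
    return [el.text for el in root.findall(path, NS) if el.text]


root = ET.fromstring(SAMPLE)
print(batch_token(root), operation_tokens(root))
```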

Operation Tokens

The /RequestStatusResults/OperationStatus/Status element will not change once it has reached a terminal state. Anything other than Pending is a terminal state. The /RequestStatusResults/OperationStatus/Status/Code is a numeric value from 0-6 with corresponding OperationStatus/Status/Type strings. These elements, as well as the codes and types for other fields, are defined in api-common.xsd.

OperationStatus Code    OperationStatus Type
0                       success
1                       duplicate
2                       pending
3                       authorization error
4                       validation error
5                       other error
6                       rejected

Records can be re-submitted to obtain a different result after making suitable changes:

  • Duplicate Error (change the metadata in the request, the metadata of the object(s) that were found as duplicates, or the de-dupe mode).
  • Authorization Error (change the credentials in the request, ACL of any objects involved, or the roles allowed to the requester).
  • Validation Error (change the metadata in the request or the metadata on related object(s) that caused the problem).
  • Other Error is returned for various transient problems (such as bad communication with the de-duplication system) and can be retried without necessarily changing the submitted data. Since it may reflect some other error, and transient does not necessarily mean short-lived, some caution should be used – if this error is returned a second time, it may not be productive to try it a third time without first contacting EIDR Operations.

Pending is the only non-terminal state for an operation token.
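
Client code may find it convenient to mirror these codes in an enumeration. The sketch below is one possible representation, not an SDK type: the class and helper names are hypothetical, the numeric values come from the table above, and `RESUBMITTABLE` covers only the four statuses that the list above describes as worth re-submitting or retrying.

```python
from enum import IntEnum


class OperationStatus(IntEnum):
    SUCCESS = 0
    DUPLICATE = 1
    PENDING = 2
    AUTHORIZATION_ERROR = 3
    VALIDATION_ERROR = 4
    OTHER_ERROR = 5
    REJECTED = 6


def is_terminal(status: OperationStatus) -> bool:
    """Anything other than pending is a terminal state."""
    return status != OperationStatus.PENDING


# Statuses the list above describes as candidates for re-submission.
RESUBMITTABLE = frozenset({
    OperationStatus.DUPLICATE,
    OperationStatus.AUTHORIZATION_ERROR,
    OperationStatus.VALIDATION_ERROR,
    OperationStatus.OTHER_ERROR,
})
```

A numeric code read from a response can be converted with `OperationStatus(code)`, which raises `ValueError` for a value outside the table.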

Batch Tokens

For batches containing more than one item, once a batch has passed top-level authentication, syntax checking, etc., there are three possible states:

  • 1 (batch received) means that the batch has passed the preliminary validation and is being turned into individual requests. No further information is available at this point.
  • 2 (batch queued) means that the individual requests have all been submitted. In this state the individual tokens and the current state for each are returned when you call StatusLookup with the batch token.
  • 3 (invalid batch) can result from a bad user token. This can also result from abnormal operation of the Registry, which should be reported to EIDR support.

Batch queued and invalid batch are the terminal states for a batch token.
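
The batch states can be represented the same way. This is a hypothetical client-side sketch using the three codes listed above; the names `BatchStatus`, `is_terminal_batch`, and `operation_tokens_available` are not part of the EIDR SDK.

```python
from enum import IntEnum


class BatchStatus(IntEnum):
    BATCH_RECEIVED = 1
    BATCH_QUEUED = 2
    INVALID_BATCH = 3


def is_terminal_batch(status: BatchStatus) -> bool:
    """Batch queued and invalid batch are the terminal states."""
    return status in (BatchStatus.BATCH_QUEUED, BatchStatus.INVALID_BATCH)


def operation_tokens_available(status: BatchStatus) -> bool:
    """Individual operation tokens are returned only once the batch is queued."""
    return status == BatchStatus.BATCH_QUEUED
```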

Scores

Scores can be returned in the response to immediate-mode requests to indicate how close any duplicate items are to the requested registration. Scores are valid only for immediate-mode (synchronous) requests. If they are present in the response to an async request, they should be ignored.

Polling

You must periodically poll the status of a Token using StatusLookup until it reaches one of the possible end states.

For a Batch Token you can extract all the Operation Tokens once the batch has reached the “batch queued” state and manage them all individually, or you can poll on the Batch Token, dealing with each Operation Token as it reaches an end state or after all of them have reached an end state. The former is usually preferable, since you do not have to continually poll on the Batch Token; the latter is less efficient but may be preferable when you do not want the complexity of managing multiple tokens. Using the Web UI, you can submit only one record at a time but can poll any token. If you search for a batch token, you will see the results for its operation tokens, and you can look up each operation token separately.
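
A polling loop for a single Operation Token might look like the following sketch. It makes several assumptions: `lookup` is a caller-supplied function that performs the StatusLookup request and returns the numeric OperationStatus code (the actual SDK call is not shown here), and the interval and attempt limit are arbitrary defaults rather than EIDR recommendations.

```python
import time
from typing import Callable

PENDING = 2  # OperationStatus code for "pending"


def poll_until_done(token: str,
                    lookup: Callable[[str], int],
                    interval: float = 30.0,
                    max_attempts: int = 20,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call lookup(token) until it returns a non-pending status code."""
    for attempt in range(max_attempts):
        code = lookup(token)
        if code != PENDING:
            return code  # terminal state reached
        if attempt < max_attempts - 1:
            sleep(interval)  # wait before the next StatusLookup
    raise TimeoutError(f"token {token!r} still pending after {max_attempts} lookups")
```

Because manual de-duplication is not real-time, a long interval is usually appropriate. Passing `sleep` as a parameter keeps the function easy to test without real delays.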

Updated on April 9, 2021