Text Processing and Queries

Field Rules

There are three kinds of matches:

  • Token Match – checks whether the query tokens occur in the candidate string, in any order. This uses the <field> <string> <string> syntax.
  • Exact Match – checks whether the query tokens occur in the candidate string as an ordered sequence. This uses the <field> “<string>” syntax.
  • Complete Match – checks whether the field is identical to the query string. This is generally used for fields containing controlled vocabulary or IDs, but can also be used on some text fields. A non-tokenized field must be identical to the query string; a tokenized field must be identical to the tokenized query string. This uses the IS and ISNOT operators.

Controlled vocabulary fields have punctuation replaced by a space, and are then tokenized.

Field | Match type | Normalize
Any field called DisplayName | Token | Yes
VirtualField (Full and self-defined); see the description of Virtual Fields | Token | Yes
FindPartiesByName | See function definition | Yes
FindPartiesFromCatalog | See function definition | Yes

Simple Queries

The metadata fields to be tested are represented with a subset of XPath notation that supports only complete paths to elements and attributes. (For more information on XPath, see https://www.w3.org/TR/xpath20/.) The XPath used in query requests can be based on:

  • /FullMetadata, which is of type fullObjectInfoType: This queries across the objects’ inherited metadata.
  • /ProvenanceMetadata, which is of type provenanceInfoType: This queries across the objects’ provenance metadata.

In these examples the parentheses are not strictly necessary, but improve legibility. Note that XML, at the protocol level, requires escaping of special characters (<, >, etc.), although procedural implementations of the API may hide that from the application.

This query finds all objects longer than 20 minutes and shorter than 40 minutes:

(/FullMetadata/BaseObjectData/ApproximateLength > PT20M00S) AND (/FullMetadata/BaseObjectData/ApproximateLength < PT40M00S)
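As a rough illustration of the XML escaping mentioned above, the sketch below builds the same duration range query and escapes it for embedding in an XML request body. The helper name `duration_range_query` is hypothetical and not part of any EIDR SDK; `xml.sax.saxutils.escape` comes from the Python standard library. How the escaped text is wrapped in an actual request is not shown, since that depends on the API binding in use.

```python
from xml.sax.saxutils import escape

LENGTH_FIELD = "/FullMetadata/BaseObjectData/ApproximateLength"


def duration_range_query(field: str, low: str, high: str) -> str:
    """Build '(field > low) AND (field < high)'. Hypothetical helper."""
    return f"({field} > {low}) AND ({field} < {high})"


query = duration_range_query(LENGTH_FIELD, "PT20M00S", "PT40M00S")

# escape() replaces &, < and > with their XML entities.
xml_safe_query = escape(query)
print(xml_safe_query)
```

A procedural client library may already perform this step, in which case escaping a second time would corrupt the query.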

This finds all records modified in 2010 and not modified since. (See below for the precision of date comparisons.)

(/ProvenanceMetadata/LastModificationDate = 2010)
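The precision rule for date comparisons (both sides are truncated to the less precise of the two) can be modelled roughly as follows. This is only a sketch of the documented behavior, not the registry's implementation: the function names are hypothetical, and it assumes plain YYYY, YYYY-MM, or YYYY-MM-DD strings with no time component.

```python
def parse_partial_date(text: str) -> tuple[int, ...]:
    """Turn 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' into a tuple of ints."""
    return tuple(int(part) for part in text.split("-"))


def compare_dates(left: str, right: str) -> int:
    """Return -1, 0 or 1 after truncating both dates to the coarser precision."""
    a, b = parse_partial_date(left), parse_partial_date(right)
    precision = min(len(a), len(b))  # number of components both sides share
    a, b = a[:precision], b[:precision]
    return (a > b) - (a < b)


# A stored date of 2010-10-10 compared with the query value 2010:
# both are truncated to the year, so they compare as equal.
print(compare_dates("2010-10-10", "2010"))
```

Under this model, `= 2010` matches any date within 2010, which is why the query above finds records last modified during that year.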

There are several kinds of simple queries, not all of which are applicable to all fields.

Exact Value: These queries use IS and ISNOT. They are applicable to

  • Fields containing DOIs
  • Fields containing controlled vocabulary
  • Certain text fields.

NOTE: For a non-existent field, ISNOT returns TRUE.
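The behavior in the note can be pictured with a small model in which a record is a mapping from field path to stored value. This is an illustrative sketch only: the function names are hypothetical, the case-insensitive comparison is an assumption, and so is the choice that IS evaluates to False for a missing field (the text above only specifies ISNOT).

```python
from typing import Mapping, Optional


def field_is(record: Mapping[str, str], path: str, value: str) -> bool:
    """Model of IS: the field must exist and equal the query value."""
    stored: Optional[str] = record.get(path)
    # Assumption: comparison ignores case.
    return stored is not None and stored.casefold() == value.casefold()


def field_isnot(record: Mapping[str, str], path: str, value: str) -> bool:
    """Model of ISNOT: True for a different value and for a missing field."""
    return not field_is(record, path, value)


record = {"/FullMetadata/BaseObjectData/ReferentType": "Movie"}
country = "/FullMetadata/BaseObjectData/CountryOfOrigin"

# CountryOfOrigin is absent from the record, so ISNOT evaluates to True.
print(field_isnot(record, country, "fr"))
```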

Exact Value-language: This special case of Exact Value fields for language fields uses IS and ISNOT and behaves as follows:

  • If only a pre-dash component is supplied in the query, it matches anything with that prefix.
  • If the language code in the query contains a dash (-), it matches only a field with exactly the same code.
  • Examples:
    • es matches objects that have es, es-ES, and es-419
    • de-CH matches de-CH, but not de or de-DE.
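The language-matching rule above can be expressed as a short predicate. This is a sketch under stated assumptions rather than the registry's code: the name `language_matches` is hypothetical, and treating codes case-insensitively is an assumption.

```python
def language_matches(query_code: str, field_code: str) -> bool:
    """Apply the prefix rule for language codes in IS comparisons."""
    query = query_code.lower()
    field = field_code.lower()
    if "-" in query:
        # A query with a dash must match the stored code exactly.
        return field == query
    # A bare prefix matches itself and any code that extends it with a dash.
    return field == query or field.startswith(query + "-")


for stored in ("es", "es-ES", "es-419", "de", "de-CH", "de-DE"):
    print(stored, language_matches("es", stored), language_matches("de-CH", stored))
```

Requiring the dash after the prefix keeps a query for es from matching an unrelated code that merely begins with the same letters.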

Order: Queries can be done using comparisons (<, <=, >=, >) as well as equality and inequality (=, <>, !=) for fields that contain:

  • Integers
  • Dates
  • Durations.

Existence: The existence of a field can be queried. This is useful for optional elements that represent large optional sub-blocks (e.g. the subtitle tracks of a Manifestation).

For example, this finds all objects that have information about separately encoded subtitles:

(/FullMetadata/ExtraObjectMetadata/ManifestationInfo/Digital/Track/Subtitle EXISTS)

Text Matching:

  • Text queries are case-insensitive.
  • Both the text in the query string and the text stored in the registry are generally processed into tokens before matching. Tokenization consists of one or more of the following steps.
  • Normalization: Sequences of whitespace are collapsed into a single space; some punctuation is converted to spaces; and some punctuation is removed (causing concatenation of the string before it with the string after it). This gives a series of tokens.
  • For Description fields only, two filters can be applied to the tokens that result from normalization:
    • Stop words (small, common words, such as “the” or “in” in English or “la” and “en” in Spanish) are filtered.
    • Words are stemmed; stemming removes plurals, turns inflected words into the appropriate root, and so on.
  • Strings represented using the Latin alphabet can be searched with or without diacritic significance. ASCII-based searches ignore diacritic marks (“ü” is equivalent to “u”), while non-ASCII searches treat characters with different diacritics as distinct.

Search Expressions

There are two kinds of text queries:

  • The form <field> <string1>…<stringN> is true for any field that has one or more of the strings. It is equivalent to <field> <string1> OR <field> <string2> OR … OR <field> <stringN>
  • The form <field> “<string>” is true for any field that has exactly <string> in it. <string> is tokenized before the comparison. Stated another way, the token sequence generated by <string> must appear exactly in <field>. The tokenization rules applied to <string> are those applied to <field>.

The grammar for query expressions is:

<expression> ::= <term>
               | <expression> OR <term>
               | <expression> AND <term>
               | NOT <term>
               | ASCII (<expression>)

<term>       ::= <field> EXISTS
               | <field> <string> <string>*
               | <field> "<string>"
               | <field> IS "<string>"
               | <field> ISNOT "<string>"
               | <field> <logop> <value>
               | ( <expression> )

NOTE: * is the equivalent of EBNF {}, and “<term> || NOT <term>” could be written in EBNF as “[NOT] <term>”.

<field>      ::= legal xpath attribute
               | legal xpath element

<value>      ::= number | date | time | duration
<logop>      ::= = | <> | < | <= | > | >= | !=

ASCII Searches

Using the ASCII operator in a query string changes the way Latin alphabet-based text strings are compared so that characters with and without diacritic marks are evaluated identically by mapping them all to their ASCII equivalents. The mapping is based on Unicode NFKD decomposition plus the Latin supplement (Latin-ASCII.xml) from the Unicode Common Locale Data Repository. When searching in this mode, “ü” is equivalent to “u” and “ł” is equivalent to “l”. This means that in most cases, ASCII versions of Latin content titles no longer need to be created manually.

To search in this mode, add an ASCII modifier to one or more query expression clauses:

ASCII((/FullMetadata/BaseObjectData/Credits/Actor/DisplayName Martín) OR (/FullMetadata/BaseObjectData/Credits/Actor/DisplayName Jose))

This would find actors named Martin or Martín, and Jose or José.

NOTE: ASCII searches do not automatically account for locale-dependent forms, such as the German “ü”, which may be represented as “ue” in English. Nor do they cover Latin transliterations (Romanization) of non-Latin scripts such as Cyrillic, Chinese, or Arabic; these must still be produced manually.
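A rough approximation of this kind of folding can be written with the standard `unicodedata` module. The sketch below is not the registry's implementation: `ascii_fold` is a hypothetical name, and `EXTRA_MAP` is a tiny hand-picked stand-in for the much larger CLDR Latin-ASCII table, covering a few letters that NFKD alone does not decompose.

```python
import unicodedata

# Illustrative subset only; the real Latin-ASCII table is far larger.
EXTRA_MAP = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"}


def ascii_fold(text: str) -> str:
    """Strip diacritics via NFKD, then apply the extra letter mappings."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (accents, umlauts) left behind by decomposition.
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_MAP.get(ch, ch) for ch in without_marks)


for name in ("Martín", "José", "Müller", "Wałęsa"):
    print(name, "->", ascii_fold(name))
```

Consistent with the note above, this maps “ü” to “u” and never to “ue”.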

Notes and Examples

  • The types on each side of a <logop> must be compatible.
  • Wildcards are not currently supported; normalization and stemming cover the problems for which wildcards are generally used.
  • Comparison operations for dates and times truncate to the lowest precision in the expression.
  • Although ranges are not directly supported, they can be implemented using two simple queries combined by AND.
  • Fields that contain controlled vocabulary are tokenized, with punctuation characters removed.
  • The empty string matches nothing, rather than everything.
  • For non-existent fields, all ISNOT comparisons evaluate as True. For example, if there is no CountryOfOrigin field, (/FullMetadata/BaseObjectData/CountryOfOrigin ISNOT fr) is True.
  • IS and ISNOT apply to Value, Value-language, and Text fields.
    • For Value fields, they are useful for testing for controlled vocabulary words, equality of DOIs, and equality of non-tokenized fields such as HouseSequence, AlternateID, and the various private data fields.
    • For Value-language, they are used as described above.
  • Comparisons are done to the precision of the least precise argument. For example, a date field containing 2010 is >=, <=, and = to 2010-10-10. Using >=2012-01-01 would return the records in 2012 and later.
  • Some applications may want to do queries across only metadata on the object itself, as opposed to the full metadata. This can be useful for applications whose main purpose is dealing with the metadata, rather than dealing with the objects defined by the metadata. This can be done by doing a regular query, calling Resolve() to return only self-defined metadata, and then examining those results.
  • As an example, imagine a Registry that has objects with the following titles in the ResourceName fields:
    • Batman: The Dark Knight
    • Knight of Dark Stories
    • Dancing In The Dark
    • Darkness At Noon
    • Darkness Waits
    • The Ghost and The Darkness
    • Shanghai Knights
    • Shanghai Noon
    • Sinbad: The Battle of the Dark Knights
    • First Knight

Querying on /FullMetadata/BaseObjectData/ResourceName (abbreviated to field in the table) gives these results:

Query | Results | Explanation
field Dark | Batman: The Dark Knight; Knight of Dark Stories; Dancing In The Dark; Sinbad: The Battle of the Dark Knights | Anything with “Dark”.
field “Dark Knight” | Batman: The Dark Knight | Anything with exactly the sequence “Dark Knight”. Sinbad: The Battle of the Dark Knights is not included because titles are not stemmed.
field Dark Knight | Batman: The Dark Knight; Knight of Dark Stories; Dancing In The Dark; Sinbad: The Battle of the Dark Knights; First Knight | Any title that has “Dark” or “Knight”.
field Knights | Shanghai Knights; Sinbad: The Battle of the Dark Knights | Any title with “Knights”.
(field Dark) AND (field Knight) | Batman: The Dark Knight; Knight of Dark Stories | Any title with both “Dark” and “Knight”, in any order and any position.
(field Dark) AND NOT (field The) | Knight of Dark Stories | Sinbad: The Battle of the Dark Knights is not included because comparison is case-insensitive.
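The results in the table above can be reproduced with a small model of token and phrase matching. This is a sketch for intuition only: the function names are hypothetical, and the tokenizer (lowercase, every punctuation character becomes a space, no stemming) is a simplifying assumption rather than the registry's exact normalization.

```python
import re

TITLES = [
    "Batman: The Dark Knight", "Knight of Dark Stories", "Dancing In The Dark",
    "Darkness At Noon", "Darkness Waits", "The Ghost and The Darkness",
    "Shanghai Knights", "Shanghai Noon",
    "Sinbad: The Battle of the Dark Knights", "First Knight",
]


def tokenize(text: str) -> list[str]:
    """Lowercase, turn punctuation into spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def token_match(field_text: str, query: str) -> bool:
    """<field> <string1>...<stringN>: true if any query token is present."""
    field_tokens = set(tokenize(field_text))
    return any(token in field_tokens for token in tokenize(query))


def phrase_match(field_text: str, phrase: str) -> bool:
    """<field> "<string>": the query tokens must appear as a contiguous run."""
    field_tokens, wanted = tokenize(field_text), tokenize(phrase)
    n = len(wanted)
    return n > 0 and any(
        field_tokens[i:i + n] == wanted for i in range(len(field_tokens) - n + 1)
    )


def find(predicate) -> list[str]:
    """Return the titles for which the predicate holds, in list order."""
    return [title for title in TITLES if predicate(title)]


print(find(lambda t: phrase_match(t, "Dark Knight")))
print(find(lambda t: token_match(t, "Dark") and not token_match(t, "The")))
```

Because an empty query produces no tokens, both predicates return False for it, which agrees with the rule that the empty string matches nothing.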

NOTE: The ISNOT, NOT and <> operators can be inefficient when applied globally.

Example Queries

Finding Types of Objects

Find all Series, method 1 (also works with “Season”):

(/FullMetadata/BaseObjectData/ReferentType Series)

Find all Series, method 2 (also works with SeasonInfo, EpisodeInfo, ClipInfo, CompilationInfo, CompositeInfo, ManifestationInfo, PackagingInfo, PromotionInfo, AlternateContentInfo, SupplementalContentInfo):

(/FullMetadata/ExtraObjectMetadata/SeriesInfo EXISTS)

Find all records:

(/FullMetadata EXISTS)

Find all root objects. This is done by checking for the absence of any relationship that requires a Parent:

(NOT ((/FullMetadata/ExtraObjectMetadata/SeasonInfo EXISTS) OR (/FullMetadata/ExtraObjectMetadata/EpisodeInfo EXISTS) OR (/FullMetadata/ExtraObjectMetadata/EditInfo EXISTS) OR (/FullMetadata/ExtraObjectMetadata/ClipInfo EXISTS) OR (/FullMetadata/ExtraObjectMetadata/ManifestationInfo EXISTS)))

Registrant and AssociatedOrg

Find all “in development” records for a Registrant:

(/FullMetadata/BaseObjectData/Administrators/Registrant 10.5237/ABCD-EF01) AND (/FullMetadata/BaseObjectData/Status Dev)

Find all valid records with a particular AssociatedOrg:

(/FullMetadata/BaseObjectData/[email protected] 10.5237/ABCD-EF01) AND (/FullMetadata/BaseObjectData/Status valid)

Find all valid records with one or the other of two AssociatedOrg IDs. Generalization is left to the reader:

((/FullMetadata/BaseObjectData/[email protected] 10.5237/ABCD-EF01) OR (/FullMetadata/BaseObjectData/[email protected] 10.5237/2345-6789)) AND (/FullMetadata/BaseObjectData/Status Valid)

Find things registered by EIDR Operations:

(/ProvenanceMetadata/Administrators/Registrant IS 10.5237/superparty)

Find things not registered by EIDR Operations:

(/ProvenanceMetadata/Administrators/Registrant ISNOT 10.5237/superparty)

Looking for Possible Data Quality Problems

Find all items with an English title and a non-English primary language. Such records may be correct (The Hangover was released as Very Bad Trip in France) but quite often are not, so they are worth investigating, especially for bulk registration.

Method 1:

(/FullMetadata/BaseObjectData/[email protected] en) AND (/FullMetadata/BaseObjectData/OriginalLanguage ISNOT en)

Method 2:

(/FullMetadata/BaseObjectData/[email protected] en) AND NOT (/FullMetadata/BaseObjectData/OriginalLanguage en)

Find everything from 1936 or earlier that is not a Movie or a Series:

(/FullMetadata/BaseObjectData/ReleaseDate <= 1936) AND (/FullMetadata/BaseObjectData/ReferentType ISNOT Movie) AND (/FullMetadata/BaseObjectData/ReferentType ISNOT Series)

Find bad Season end dates. These can creep in if an export program uses a silly default when there is no date in the database. Change SeasonInfo to SeriesInfo for bad Series end dates. Here YEAR is some year in the far future:

(/FullMetadata/ExtraObjectMetadata/SeasonInfo/EndDate > YEAR)

You can also generate queries like this for checking consistency for series, using either tools and scripts or the SDK:

  • Do a Full resolution
  • Extract the end date
  • Construct the query, using the series as the root of the query

Find things with EIDR Operations as AssociatedOrg:

(/FullMetadata/BaseObjectData/[email protected] 10.5237/Superparty)

Statistics (the -n flag in QueryTool is useful here)

Find all records modified since 31 December 2010. Use this to do incremental backups, setting the date to be the day before you started the last one (to avoid race conditions and time zone issues):

(/ProvenanceMetadata/LastModificationDate >= 2010-12-31)

Find any record that was submitted in February 2013:

(/ProvenanceMetadata/CreationDate >= 2013-02-01) AND (/ProvenanceMetadata/CreationDate < 2013-03-01)

Find all records registered by Registrant 10.5237/ABCD-EF01 in August 2012:

(/ProvenanceMetadata/Administrators/Registrant IS 10.5237/ABCD-EF01) AND ((/ProvenanceMetadata/CreationDate >= 2012-08-01) AND (/ProvenanceMetadata/CreationDate < 2012-09-01))
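Month-range queries like the ones above are easy to generate in a script. The following is a minimal sketch with a hypothetical helper name, `month_range_query`; it only builds the query text using the half-open range pattern (>= first day of the month, < first day of the next month) and does not submit anything to the registry.

```python
from datetime import date


def month_range_query(field: str, year: int, month: int) -> str:
    """Build a query matching dates that fall within the given month."""
    start = date(year, month, 1)
    # Roll over to January of the following year after December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return f"({field} >= {start.isoformat()}) AND ({field} < {end.isoformat()})"


print(month_range_query("/ProvenanceMetadata/CreationDate", 2013, 2))
```

Using an exclusive upper bound avoids having to know how many days the month has.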

Language-Specific Filtering

NOTE: Language-specific filtering applies only to registry queries operating on the Base Object Data Description field. It does not apply to queries on other data elements and does not apply to de-duplication.

There are language-specific lists for English, French, Spanish, Italian, and German for:

  • Punctuation that turns into spaces
  • Punctuation that collapses two words together
  • Stop words that get filtered out

Language-specific rules are given in the table below. If a field has a language attribute, the rules for that language are applied; otherwise, English rules are applied. The query string is processed using the same rules as the queried field, that is, according to the field's language attribute, or with English rules if there is none.

NOTE: The dash in language fields (e.g. de-CH) is not removed.

Language | Spacing Punctuation | Collapsing Punctuation | Stop Words
English | . , ; : ^ & ! + – = ( ) [ ] { } < > ~ # $ / * @ € £ ? ” (double quote) – (hyphen) | ‘ (single quote) ’ (apostrophe) | a, the, this, that, these, some, is, are, and, or, but, so, as, at, by, of, on, for, in, into, to, with, I, you, he, she, it, we, they, them, its, theirs
French | . , ; : ^ & ! + – = ( ) [ ] { } < > ~ # $ / * @ € £ ? ” (double quote) – (hyphen) « » | ‘ (single quote) ’ (apostrophe) | un, une, le, la, les, l, ce, ces, c, de, du, des, d, est, sont, a, ont, ne, pas, n, et, ou, mais, que, qui, qu, à, aux, sur, dans, en, par, avec, y, il, elle, ils, elles, lui, leurs, son, sa, ses, leur
Italian | . , ; : ^ & ! + – = ( ) [ ] { } < > ~ # $ / * @ € £ ? ” (double quote) – (hyphen) « » | ‘ (single quote) ’ (apostrophe) | ad, al, allo, ai, agli, all, agl, alla, alle, con, col, coi, da, dal, dallo, dai, dagli, dall, dagl, dalla, dalle, di, del, dello, dei, degli, dell, degl, della, delle, in, nel, nello, nei, negli, nell, negl, nella, nelle, su, sul, sullo, sui, sugli, sull, sugl, sulla, sulle, per, tra, contro, lui, lei, noi, loro, suo, sua, suoi, sue, lo, la, li, le, gli, ne, il, un, uno, una, ma, ed, se, perché, anche, come, dov, dove, che, chi, cui, non, più, quale, quanto, quanti, quanta, quante, quello, quelli, quella, quelle, questo, questi, questa, queste, si, a, c, e, i, l, o, sono, è
Spanish | . , ; : ^ & ! + – = ( ) [ ] { } < > ~ # $ / * @ € £ ? ” (double quote) – (hyphen) « » ¿ ¡ | ‘ (single quote) ’ (apostrophe, but it is not used in Spanish) | un, unos, una, unas, el, los, la, las, este, esta, esto, estos, estas, ese, esa, eso, esos, esas, es, son, está, están, hay, y, o, pero, de, en, para, como, con, por, sobre, el, ella, ellos, ellas, se, su, sus, suyo, suya, suyos, suyas
German | . , ; : ^ & ! + – = ( ) [ ] { } < > ~ # $ / * @ € £ ? ” (double quote) – (hyphen) „ “ « » | ‘ (single quote) ’ (apostrophe) [NOTE: This means that an apostrophe at the end of a word is dropped, and one in the middle of a word collapses the two parts together.] | ein, einer, eine, eines, einem, einen, der, die, das, den, ist, sein, und, oder, durch, als, von, mit, für, am, in, aus, er, sie, es, sie, ihn, ihm, ihr, ihnen, sein, siene, ihre
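To make the English row of the table concrete, here is a minimal sketch of the filtering it describes: collapsing punctuation is deleted, spacing punctuation becomes a space, and stop words are dropped. The function name is hypothetical, stemming is deliberately left out, and treating the ASCII double quote, hyphen, and apostrophe the same as their typographic counterparts in the table is an assumption.

```python
# Spacing punctuation from the English row, plus ASCII " and - (assumption).
SPACING = set('.,;:^&!+–=()[]{}<>~#$/*@€£?”') | {'"', "-"}
# Collapsing punctuation from the English row, plus ASCII ' (assumption).
COLLAPSING = {"‘", "’", "'"}
STOP_WORDS = {
    "a", "the", "this", "that", "these", "some", "is", "are", "and", "or",
    "but", "so", "as", "at", "by", "of", "on", "for", "in", "into", "to",
    "with", "i", "you", "he", "she", "it", "we", "they", "them", "its", "theirs",
}


def normalize_english(text: str) -> list[str]:
    """Return the tokens left after punctuation handling and stop-word removal."""
    lowered = text.lower()
    # Collapsing punctuation is removed, joining the text on either side.
    collapsed = "".join(ch for ch in lowered if ch not in COLLAPSING)
    # Spacing punctuation is replaced by a space, splitting the text.
    spaced = "".join(" " if ch in SPACING else ch for ch in collapsed)
    return [token for token in spaced.split() if token not in STOP_WORDS]


print(normalize_english("The Knight's Tale: a story of well-known heroes"))
```

The same function applied to a query string yields tokens that can be compared directly with the tokens of a stored Description, which is the reason both sides are processed with the same rules.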

Updated on April 9, 2021