Data Quality Policy
There are several aspects to consider when defining data quality and deciding how to measure it:
- Data completeness
- Data accuracy
- Matching accuracy
Sparse records, which lack significant distinguishing elements, cannot be admitted to the ISNI database: without distinguishing metadata, there is no way to relate them to records in external databases. ISNI defines a non-sparse record as follows:
- all records must include a name (personal or organisation)
- all records must include a local identifier
- names in the common surnames list must score at least 2 from the following list; other names must score at least 1:
- Birth date (type is “lived”: score 1)
- Death date (type is “lived”: score 1)
- Birth date (type is not “lived” AND date is before 1800: score 1)
- Death date (type is not “lived” AND date is before 1800: score 1)
- Publisher (publisher or publishers: score 1)
- ISBN (ISBN or ISBNs: score 1)
- Title (title or titles: score 2)
- Co-author or Organisation affiliation (one: score 1; two or more: score 2)
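The scoring rule above can be sketched in code. This is a minimal illustration, not the ISNI system's implementation: the `Record` and `DateInfo` classes, the field names, and the "Surname, Forename" name layout are all assumptions made for the example, and the required scores are read as minimums.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DateInfo:
    year: int
    date_type: str  # "lived" is the only date type named in the policy


@dataclass
class Record:
    # Hypothetical record layout, for illustration only.
    name: str                      # assumed "Surname, Forename" for persons
    local_identifier: str
    birth: Optional[DateInfo] = None
    death: Optional[DateInfo] = None
    publishers: List[str] = field(default_factory=list)
    isbns: List[str] = field(default_factory=list)
    titles: List[str] = field(default_factory=list)
    related_names: List[str] = field(default_factory=list)  # co-authors or affiliations


def date_score(d: Optional[DateInfo]) -> int:
    """A date scores 1 if its type is "lived", or if it falls before 1800."""
    if d is None:
        return 0
    return 1 if d.date_type == "lived" or d.year < 1800 else 0


def distinguishing_score(r: Record) -> int:
    score = date_score(r.birth) + date_score(r.death)
    score += 1 if r.publishers else 0       # one or many publishers: 1
    score += 1 if r.isbns else 0            # one or many ISBNs: 1
    score += 2 if r.titles else 0           # one or many titles: 2
    score += min(len(r.related_names), 2)   # one: 1; two or more: 2
    return score


def is_non_sparse(r: Record, common_surnames: Set[str]) -> bool:
    # A name and a local identifier are mandatory for every record.
    if not r.name.strip() or not r.local_identifier.strip():
        return False
    surname = r.name.split(",")[0].strip().lower()
    required = 2 if surname in common_surnames else 1
    return distinguishing_score(r) >= required
```

Under these assumptions, a record for a common surname that carries only an ISBN scores 1 and is treated as sparse, while the same record with a title scores 2 and is admitted.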
Rich records may be assigned an ISNI, even if the data is only supplied by one source. Records that are admitted to the database but are not rich will be assigned an ISNI if they match with records from other sources. ISNI defines rich records as follows:
- A personal name is unique and includes a surname and forename AND one or more of the following cases:
- The request includes an “isNot” statement for an identity with the same name string
- Full date of birth and death supplied
- Year of birth + 1 title or instrument + 1 related name (co-author or affiliated institution)
- 1 title or instrument + 1 external URL link of type encyclopaedia or home page (not social network page) + 1 related name (co-author or affiliated institution)
- An organisation name is unique and does not consist only of initials, AND the record includes all of the following:
- A full UN/LOCODE (http://www.unece.org/cefact/locode/service/location.)
- Organisation type
- Organisation URL
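The rich-record conditions can be expressed as two boolean checks. The sketch below is illustrative only: the class names, field names and link-type labels are assumptions, name uniqueness is taken as a flag computed elsewhere (it needs a database lookup), and the UN/LOCODE is only checked for presence, not for being a full code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Link-type labels are assumed; the policy names encyclopaedia and home page
# and excludes social network pages.
QUALIFYING_LINK_TYPES = {"encyclopaedia", "home page"}


@dataclass
class PersonRequest:
    # Hypothetical request layout, for illustration only.
    has_surname: bool
    has_forename: bool
    name_is_unique: bool                 # result of a lookup done elsewhere
    has_is_not_statement: bool = False
    birth_date: Optional[date] = None    # set only when the full date is supplied
    death_date: Optional[date] = None
    birth_year: Optional[int] = None
    titles_or_instruments: int = 0
    related_names: int = 0               # co-authors or affiliated institutions
    link_types: List[str] = field(default_factory=list)


def is_rich_person(p: PersonRequest) -> bool:
    if not (p.name_is_unique and p.has_surname and p.has_forename):
        return False
    has_work = p.titles_or_instruments >= 1
    has_related = p.related_names >= 1
    has_link = any(t in QUALIFYING_LINK_TYPES for t in p.link_types)
    knows_birth_year = p.birth_year is not None or p.birth_date is not None
    # Any one of the four cases is sufficient.
    return (
        p.has_is_not_statement
        or (p.birth_date is not None and p.death_date is not None)
        or (knows_birth_year and has_work and has_related)
        or (has_work and has_link and has_related)
    )


@dataclass
class OrganisationRequest:
    name_is_unique: bool
    name_is_only_initials: bool
    un_locode: Optional[str] = None
    organisation_type: Optional[str] = None
    url: Optional[str] = None


def is_rich_organisation(o: OrganisationRequest) -> bool:
    # All three elements are required in addition to the name conditions.
    return (
        o.name_is_unique
        and not o.name_is_only_initials
        and bool(o.un_locode)
        and bool(o.organisation_type)
        and bool(o.url)
    )
```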
There are several cases to consider in relation to accuracy.
- inaccuracy of data provided (spelling and typing accuracy)
- two or more persons in the same record (same name but different identity)
- two or more identities of the same person in the same record
- Duplicate assignment
Inaccurate data is difficult to detect by program. The ISNI system includes a date anomaly check program that looks for creators who created before their 10th birthday. This program is run regularly and the results are reviewed manually. An anomaly could point to a typographical error or to two or more persons in the same record. Other checks are made by searching for unusual source combinations. The Quality Team at the British Library, National Library of the Netherlands, and Bibliothèque nationale de France regularly samples the database and incoming test data. For data to be loaded by batch, the data is returned if the error rate for duplicates and mixed identities is above 5%. Results of database samples may indicate corrections that can be made programmatically or areas of the database that require manual review.
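As a rough illustration of the two numeric checks described here, the following sketch flags publication years that fall before a creator's 10th birthday and applies the 5% batch threshold. The function names are hypothetical, dates are simplified to years, a publication year of zero is treated as unknown, and the duplicate and mixed-identity errors are assumed to be counted together against the sample size.

```python
from typing import Iterable, List

MIN_CREATOR_AGE = 10          # from the policy: creation before the 10th birthday
MAX_BATCH_ERROR_PERCENT = 5   # from the policy: batch returned above 5%


def anomalous_years(birth_year: int, publication_years: Iterable[int]) -> List[int]:
    """Return publication years that are certainly before the 10th birthday.

    With year-only data, a work published in the year the creator turns 10
    cannot be judged, so only strictly earlier years are flagged.
    A year of 0 is assumed to mean "unknown" and is skipped.
    """
    return [
        y for y in publication_years
        if y != 0 and y < birth_year + MIN_CREATOR_AGE
    ]


def batch_is_returned(sample_size: int, duplicates: int, mixed_identities: int) -> bool:
    """True if the sampled error rate is above the threshold."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    errors = duplicates + mixed_identities
    # Integer comparison avoids floating-point rounding at exactly 5%.
    return errors * 100 > MAX_BATCH_ERROR_PERCENT * sample_size
```

Flagged years are only candidates for manual review: they may indicate a typing error in a date or two persons sharing one record.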
Two or more persons or organisations in the same record
Two or more persons or organisations in the same record may occur because of errors in the incoming data or because of matching errors. Where records are found that need splitting because they contain two or more persons or organisations (same name but different identity), the matching program is refined as far as possible. To date, program modifications have been successful for all reported errors. It is a delicate balance to ensure maximum matching without erroneous matching. As a compromise, the class “possible match” was devised and data contributors are invited to help resolve these by checking them manually. Where an ISNI is determined to include two or more separate identities, the ISNI Quality Team is notified and performs the splitting procedure, which automatically splits off good data to new ISNI records and notifies all concerned sources. The old ISNI record is deprecated.
Two or more identities of the same person in the same record
Some sources regard pseudonyms as name variants rather than separate identities. The ISNI system includes a suite of programs to detect these cases and convert them appropriately. ISNI’s policy is to assign different identifiers to separate public identities of the same person. The relationship between the identities may be public or private. Where it is public, a link is made between the two records on the ISNI database if the data is also coming from a separate source.
The ISNI system errs on the side of not making a false association. As a consequence, duplicate assignment may occur. When a duplicate is detected, the two ISNI records are merged and the deprecated ISNI is recorded in the merged record, so that a search or link using the deprecated ISNI will always retrieve or resolve to that record. All concerned sources are notified of merges in a dated notification message that remains on the database record. The public are invited to report duplicates via the ISNI web search facility and all members of ISNI are able to resolve duplicates. All cases of seemingly inexplicable failure to merge are examined with a view to finding a remedy. For example, one record was found not to match because of a zero in the publication date of one of its titles. Where records are merged, the ISNI that is regarded as the “retained ISNI” is determined according to the following sequence:
- A record with assigned status trumps a record with provisional or other status (the ISNI has already been diffused)
- The record containing a preferred source: either JISC Names (JNAM) or Ringgold (RING) (Preferred sources are recommended by the ISNI Technical Advisory Committee and approved by the Board as being base sources in their area with high quality curated data that has been disambiguated)
- ISNI with the most sources (therefore where the ISNI has been more widely diffused)
- If still no ISNI chosen, the one with the lowest value (first minted)
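The sequence above behaves like a multi-level sort key, where each rule only breaks ties left by the previous one. A minimal sketch, assuming a hypothetical `IsniRecord` structure and status labels written as plain strings:

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

# Preferred source codes as named in the policy.
PREFERRED_SOURCES = frozenset({"JNAM", "RING"})


@dataclass(frozen=True)
class IsniRecord:
    # Hypothetical record layout, for illustration only.
    isni: str                # 16 characters, optionally grouped with spaces
    status: str              # "assigned", "provisional", or another status
    sources: FrozenSet[str]  # source codes present on the record


def retention_key(r: IsniRecord) -> Tuple[bool, bool, int, str]:
    """Smaller key wins; False sorts before True."""
    return (
        r.status != "assigned",             # 1. assigned status first
        not (r.sources & PREFERRED_SOURCES),  # 2. contains a preferred source
        -len(r.sources),                    # 3. most sources
        r.isni.replace(" ", ""),            # 4. lowest ISNI (equal-length strings)
    )


def retained_isni(records: Iterable[IsniRecord]) -> str:
    """Return the ISNI to keep when the given records are merged."""
    return min(records, key=retention_key).isni
```

Comparing the ISNIs as strings is equivalent to comparing their numeric values only because all ISNIs have the same length once spaces are removed.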
Requests for Data Removal or Deletion
The ISNI data sources are at liberty to request that their data be removed at any time. If the source is the last remaining source, the core metadata, including at a minimum the name and ISNI, will be retained and the source of the data will be designated "ISNI". For example, a source may no longer need a record for an organisation that has ceased. In this case, the source is changed to "ISNI" and the ceased date may also be recorded. Via the web interface, an individual or organisation is able to request deletion of data elements or complete removal of the ISNI data. Such requests are reviewed, verified and enacted by the ISNI Quality Team.
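The last-remaining-source rule can be illustrated with a short sketch. The `StoredRecord` structure and `remove_source` function are hypothetical; the example only shows that the name and ISNI survive and that the source is relabelled "ISNI" when no contributing source is left.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ISNI_SOURCE = "ISNI"  # designation used when no contributing source remains


@dataclass
class StoredRecord:
    # Hypothetical record layout, for illustration only.
    isni: str
    name: str
    sources: List[str]
    metadata_by_source: Dict[str, dict] = field(default_factory=dict)


def remove_source(record: StoredRecord, source: str) -> None:
    """Remove one source's data; keep core metadata if it was the last source."""
    if source not in record.sources:
        raise ValueError(f"{source!r} is not a source on this record")
    record.metadata_by_source.pop(source, None)
    record.sources.remove(source)
    if not record.sources:
        # Name and ISNI stay on the record; the source becomes "ISNI".
        record.sources.append(ISNI_SOURCE)
```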
The Quality Infrastructure
- System for reporting errors
- in place for data contributors
- for the general public via the web search interface
- Ability to upgrade data for data contributors
- System for the Quality Team for corrections
- Other corrections
- System for reporting corrections to data contributors
- Regular anomaly checking by program
- Date anomalies
- Pseudonym and real name inconsistencies
- Regular data sampling and analysis of problems
- Adjustments to correct or accommodate incoming data
- E.g. minor deviations from the format, character set, too many occurrences of people born on 1st January, date of publication zero, expanding abbreviations, surname only, and more
- Data enrichment to increase matching
- E.g. application of Dewey
- Identification of new data sources that can serve to increase matching by providing bridging information, e.g. encyclopaedias
Responsibility for Data Quality
There are five parties involved:
- The data contributors
- The Quality Team
- The ISNI-IA Assignment Agency (OCLC)
- The ISNI-IA
- The general public
The Data Contributors
Each data contributor takes reasonable steps to establish unique Public Identity Reference Metadata. Data contributors are invited to check the accuracy of ISNI assignment and report any errors. They are also invited to review the possible matches and report via the web interface. They may enrich their records thus causing further matching. Data contributors periodically receive notification reports of errors and new ISNI assignments that concern their records.
The Quality Team
The Quality Team at the British Library, National Library of the Netherlands, and Bibliothèque nationale de France responds to error reports from Data Contributors and from the public. Hidden notes fields that can be seen by RAGs and members are used to record errors and details of fixes made, as appropriate. The ISNI-IA Assignment Agency keeps a history of all email correspondence generated by the ISNI Quality Team. The Quality Team systematically reviews reports of potential problems provided by the Assignment Agency. In addition, it requests statistical samples from the Assignment Agency, manually reviews those records and reports on the findings. The Quality Team merges records where required, which results in a notification to all sources on the resulting record. Where there is a data error or a record with mixed identities and only one source on the record, the error is reported to the contributing source, the record is marked and an ISNI is not assigned. Where there are several sources on the record, the Quality Team fixes the record and notifies all the sources where appropriate. If the Quality Team splits a record into two or more records, the end result may be a record for a single source that still contains mixed identities; in that case the source is notified and no ISNI is assigned for that record. If the Quality Team finds that the matching system caused an error, the ISNI Assignment Agency is notified.
The Assignment Agency
The Assignment Agency is responsible for the system referred to above as the Quality Infrastructure. It responds to queries from the Quality Team about possible matching errors. It regularly runs the anomaly and statistical sampling programs and feeds the reports to the Quality Team.
The ISNI-IA
The ISNI-IA is the owner of the database and is ultimately responsible for the quality of the data within it. The ISNI-IA may set thresholds for the percentage of data error that is acceptable. As stated at the start of this policy, there are different types and causes of errors, and thus different thresholds may be applied. Where it is found that the error rate is unsatisfactory, the ISNI-IA, as the owner of the data, is responsible for the solution. The ISNI-IA works with the ISNI Technical Advisory Committee, including the Assignment Agency and the Quality Team, to devise and resource a solution.
The General Public
The general public is able to make contributions via the web search interface. A Captcha is used to ensure that the input is from a genuine human being, but identification is not mandatory. The data is collected into hidden indexed fields that generate a nightly alert. The Quality Team receives this alert and aims to respond within three working days. An email response is sent if an address was given with the information. The Quality Team has set priorities for responding in the case of high volume:
- all input involving data correction, merging and splitting
- all suggestions for enrichment of records coming from persons associated with a single ISNI (i.e. persons themselves, heirs etc.)
- all suggestions for enrichment from other parties that will facilitate ISNI linking (future matching etc.)
- Other enrichment
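The priority order above amounts to a stable sort of the incoming queue. A small sketch, in which the `Priority` names and the `(priority, message)` tuple layout are assumptions made for the example:

```python
from enum import IntEnum
from typing import List, Tuple


class Priority(IntEnum):
    # Lower value is handled first; names are illustrative labels only.
    CORRECTION_MERGE_SPLIT = 1
    ENRICHMENT_FROM_ASSOCIATED_PERSON = 2
    ENRICHMENT_FOR_LINKING = 3
    OTHER_ENRICHMENT = 4


def triage(queue: List[Tuple[Priority, str]]) -> List[Tuple[Priority, str]]:
    """Order public input by priority, keeping arrival order within a priority."""
    # sorted() is stable, so items of equal priority keep their original order.
    return sorted(queue, key=lambda item: item[0])
```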