SCSB Matching Algorithm

Collection Group Designation

CGD: Committed
Shareable: Shareable
Retention Commitment: Yes
Notes:

  • CGD is not changed by the Matching Algorithm.
  • These are items purchased via cooperative collection agreements, identified by the vendor.
  • Two matching items are not expected to both be submitted with a CGD of Committed, but if that happens, both will retain the Committed designation.
  • If a Committed item matches a Shared item, both will retain their designations as Committed and Shared.

CGD: Shared
Shareable: Shareable
Retention Commitment: Yes

CGD: Open
Shareable: Shareable
Retention Commitment: No
Notes:

  • When an item is submitted with a CGD of Open, the CGD is not changed by the Matching Algorithm.
  • It is not expected that items will be submitted as Open going forward, but it has been done in the past.

CGD: Uncommittable
Shareable: Shareable
Retention Commitment: No
Notes:

  • CGD is not changed by the Matching Algorithm.

CGD: Private
Shareable: Not shareable
Retention Commitment: No
Notes:

  • CGD is not changed by the Matching Algorithm.

Ongoing Matching Algorithm

The ongoing matching algorithm is run on items with the same Material Type when:

  • new items are accessioned to SCSB with a CGD of “Shared”
  • existing items are updated using the Submit Collection API to a CGD of “Shared”

The new/updated item is compared to all the existing items in the database with a CGD of Shared or Committed.

When two Bibs match:

  • The first check is whether any of the matching Bibs has undergone Initial Matching. If so, the one that underwent Initial Matching remains Shared and the rest are set to Open.
  • If neither Bib underwent the Initial Matching Algorithm, the item that was accessioned earlier retains the Shared CGD, while the newer item is set to Open.
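The two checks above can be sketched in Python. This is a minimal illustration, not the SCSB implementation: the `MatchedItem` class and its field names are hypothetical, only Shared items are modelled, and breaking a tie between several items that all underwent Initial Matching by earliest accession date is an assumption the text does not confirm.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MatchedItem:
    # Hypothetical record shape used only for this sketch.
    barcode: str
    accessioned: date
    initial_matching: bool  # True if the item went through Initial Matching
    cgd: str = "Shared"


def resolve_shared(items):
    """Keep one item Shared and set every other matched item to Open."""
    # Items that underwent Initial Matching sort first (False < True);
    # among the rest, the earliest accession date wins.
    keeper = min(items, key=lambda i: (not i.initial_matching, i.accessioned))
    for item in items:
        item.cgd = "Shared" if item is keeper else "Open"
    return keeper
```

For example, an item accessioned in 2021 that went through Initial Matching keeps Shared over an item accessioned in 2019 that did not.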

Bibs are considered a “Multi Match” when at least two of the following control numbers match:

  • ISBN
  • ISSN
  • LCCN
  • OCLC

In the case of a “Multi Match” the title field is not compared.

Bibs are considered a “Single Match” when only one control number matches and the first four words of the title match.
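The Multi Match and Single Match rules can be illustrated with a short Python sketch. The `Bib` class, its field names, and the `classify_match` function are hypothetical. The sketch assumes control numbers and title words have already been normalized, and that a control number "matches" when the two Bibs share at least one value for that field.

```python
from dataclasses import dataclass, field

CONTROL_NUMBER_FIELDS = ("isbn", "issn", "lccn", "oclc")


@dataclass
class Bib:
    # Hypothetical record shape: normalized title words plus one set of
    # normalized values per control-number type.
    title_words: list
    isbn: set = field(default_factory=set)
    issn: set = field(default_factory=set)
    lccn: set = field(default_factory=set)
    oclc: set = field(default_factory=set)


def classify_match(a, b):
    """Return "multi", "single", or None for a pair of Bibs."""
    matched = [f for f in CONTROL_NUMBER_FIELDS if getattr(a, f) & getattr(b, f)]
    if len(matched) >= 2:
        return "multi"  # the title is not compared in this case
    if len(matched) == 1 and a.title_words[:4] == b.title_words[:4]:
        return "single"
    return None  # no match, or one number matched but the title did not
```

A pair that shares one control number but differs in the first four title words returns `None` here, which corresponds to the case listed in the Title Exception Report.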

Normalization

  • ISBN, ISSN, LCCN, OCLC - non-numeric characters are removed
  • Title - diacritics are removed and the case is ignored
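As a minimal sketch of the control-number rule, the following Python function removes every character that is not an ASCII digit. The function name is hypothetical, and treating "non-numeric" as "anything other than 0-9" is an assumption.

```python
import re


def normalize_control_number(value):
    """Strip every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", value)
```

Under this reading, prefixes such as `(OCoLC)ocm` and the hyphens in an ISBN are dropped, and so is a trailing `X` check digit on an ISBN-10.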


Title Match Comparison

  • Field <245> <a> is considered for Title Match Comparison
  • Diacritics and blank spaces are removed from the title. All titles are converted to lower case before comparison.
  • The words "A", "AN" and "THE" are not considered for the title match, regardless of their position in the title.
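The title comparison rules can be sketched as a function that builds a comparison key from the 245 $a value. This is an illustration under assumptions: the function name is hypothetical, the order of the steps (strip diacritics, lower-case, drop articles, keep the first four words, then remove spaces) is inferred rather than stated, and punctuation is left untouched because the text does not describe how it is handled.

```python
import unicodedata

ARTICLES = {"a", "an", "the"}


def title_key(title_245a, words=4):
    """Build a comparison key from the first few significant title words."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title_245a)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case, split on whitespace, and skip articles wherever they occur.
    tokens = [w for w in stripped.lower().split() if w not in ARTICLES]
    # Keep the leading words and join them without blank spaces.
    return "".join(tokens[:words])
```

Two titles are then treated as matching when their keys are equal, for example "The Éducation of a Young Poet in Paris" and "Education of the young poet and others".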

Exception

  • Matching copies from a single institution are not compared. It is up to the submitting institutions to submit only one copy of matching items to SCSB as “Shared,” and the rest as “Open.”
  • When two items have a single number match but different bib levels (one is a monograph and the other a serial), that is, when the material type does not match, the matching algorithm is not applied.
  • When the CGD of an item is changed manually to Shared using the SCSB UI, the matching algorithm is not run on the item.  This means that there could be two items with the Shared CGD.
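The first two exceptions amount to a check made before a pair of items is compared. A minimal Python sketch follows; the `Candidate` class, its field names, and the example values are hypothetical. The third exception concerns what triggers the algorithm rather than which pairs are compared, so it is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record shape used only for this sketch.
    institution: str    # e.g. "PUL"
    material_type: str  # e.g. "Monograph" or "Serial"


def should_compare(a, b):
    """Return True only when the pair is eligible for matching."""
    if a.institution == b.institution:
        return False  # copies from a single institution are not compared
    if a.material_type != b.material_type:
        return False  # e.g. a monograph against a serial
    return True
```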

Reports

The following reports are generated after each run of the matching algorithm:

  • Matching Summary Report
  • Matching Serial MVM Report
  • Title Exception Report
  • CGD Round Trip Report

Matching Summary Report

  • Example file name: MatchingSummaryReport-27Jul2021080127.csv
  • Columns: Institution, Total Bibs, Total Items, Shared Items Before Matching, Shared Items After Matching, Difference of Shared Items, Open Items Before Matching, Open Items After Matching, Difference of Open Items
  • Rows: PUL, CUL, NYPL, HL
  • Example data:

Matching Serial MVM Report

  • Includes items that were changed from Shared to Open for Serial and MVM material types
  • All items that match are changed from Shared to Open
  • Example file name: MatchingSerialMvmReport-27Jul2021080114.csv
  • Columns: OwningInstitutionId, Title, Summary Holdings, Volume Part Year, Use Restriction, BibId, OwningInstitutionBibId, Barcode

Example data:

Title Exception Report

  • Items that match only one number and not the first four words of the normalized title will be included in this report.  Previously, these items were considered a match, and the CGD was affected, but as of v4.3 (July 2021), they are not considered matched.
  • Columns: OwningInstitution, BibId, OwningInstitutionBibId, MaterialType, OCLCNumber, ISBN, ISSN, LCCN, Title1, Title2, Title3, Title4, Title5, Title6, Title7, Title8, Title9, Title10, Title11, Title12, Title13, Title14, Title15, Title16, Title17, Title18, Title19

Example data:

CGD Round Trip Report

  • A report will be created if any item’s CGD is changed by the matching algorithm.
  • All the items with a change to the CGD will be included in the report.
  • The report will be written to the SCSB AWS S3 bucket.
  • The report is institution-specific and will be put into the corresponding directory for that report and institution.
  • The directory in the S3 bucket will be: reports/cgd-round-trip/<institution>/
  • The name of the report will be CGD_RoundTripReport_<timestamp>.csv (ex: CGD_RoundTripReport_20210322_185905.csv)
  • Columns: Item Barcode, Old CGD, CGD, Date of Action
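As an illustration of the naming and layout described above, the following Python sketch builds the report's object key and renders its rows as CSV. The function names are hypothetical, the timestamp format `YYYYMMDD_HHMMSS` is inferred from the example file name, and the use of an institution code such as "PUL" for the directory name and the format of the Date of Action value are assumptions. Uploading to S3 is left out.

```python
import csv
import io
from datetime import datetime

COLUMNS = ["Item Barcode", "Old CGD", "CGD", "Date of Action"]


def report_key(institution, when):
    """Object key for one institution's report; timestamp format is inferred."""
    return (
        f"reports/cgd-round-trip/{institution}/"
        f"CGD_RoundTripReport_{when:%Y%m%d_%H%M%S}.csv"
    )


def render_report(changes):
    """Render (barcode, old_cgd, new_cgd, date_of_action) tuples as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(changes)
    return buf.getvalue()
```

Calling `report_key("PUL", datetime(2021, 3, 22, 18, 59, 5))` yields a key ending in the example file name given above.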

Example data:

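Prior to v4.3

  1. The matching algorithm considered Use Restrictions. As of v4.3 (July 2021), it no longer does.
  2. The matching algorithm considered two items to be a match when only one control number matched, even if the titles did not. As of v4.3, a single control number match also requires the first four words of the normalized title to match.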

Note / Submit Collection API

When items are updated with the Submit Collection API, the CGD in SCSB is not updated when the data files are in the “cgd_protection” folder, but it is updated when they are in the “cgd_no_protection” folder.


For technical details of the matching algorithm, see: Technical Documentation for Matching Algorithm