The matching algorithm was developed to meet the partners' need to identify a single item in each set of duplicates that keeps the CGD (Collection Group Designation) of “Shared.” Additional copies in a matched set are designated “Open.”

All items attached to matched bibs for serials and for multi-volume monographs with more than one holding will be designated “Open.”

Business Rules:

Some revisions and modifications have taken place throughout the development process and are summarized here:

...

Matching Algorithm

Collection Group Designation

CGD: Committed
Shareable: Shareable
Retention Commitment: Yes
Notes:

  • CGD not changed by Matching Algorithm.
  • These are items purchased via cooperative collection agreements, identified by the vendor.
  • Do not expect that two items will match and both have the CGD of Committed.
  • If two items match and are submitted as Committed, both will retain the Committed designation.
  • If a Committed item matches a Shared item, both will retain their designations as Committed and Shared.

CGD: Shared
Shareable: Shareable
Retention Commitment: Yes

CGD: Open
Shareable: Shareable
Retention Commitment: No
Notes:

  • When submitted as Open, not changed by Matching Algorithm.
  • It is not expected that items will be submitted as Open going forward, but it has been done in the past.

CGD: Uncommittable
Shareable: Shareable
Retention Commitment: No
Notes:

  • CGD not changed by Matching Algorithm.

CGD: Private
Shareable: Not shareable
Retention Commitment: No
Notes:

  • CGD not changed by Matching Algorithm.

Ongoing Matching Algorithm

The ongoing matching algorithm is run on items with the same Material Type when:

  • new items are accessioned to SCSB with a CGD of “Shared”
  • existing items are updated to a CGD of “Shared” and are in the “cgd_no_protection” folder


The new/updated item is compared to all the existing items in the database with a CGD of Shared or Committed.


When two items match, the item that was accessioned earlier will retain the Shared status, while the newer item will be bumped to Open.
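The retention rule above can be sketched as follows. This is a minimal illustration, not the SCSB implementation: the `Item` class, its field names, and `resolve_shared_pair` are hypothetical names introduced here. It assumes the pair has already been found to match, and it leaves a Committed/Shared pair untouched, as the Committed notes describe.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    """Hypothetical item record; field names are assumptions for illustration."""
    barcode: str
    cgd: str
    accessioned: date


def resolve_shared_pair(existing: Item, incoming: Item) -> None:
    """For two matched items, keep Shared on the earlier accession only.

    Pairs where either item is not Shared (for example Committed) are left
    unchanged, since a Committed item and a Shared item both retain their CGD.
    """
    if existing.cgd != "Shared" or incoming.cgd != "Shared":
        return
    # sorted() is stable, so on equal dates the existing item counts as older.
    _older, newer = sorted((existing, incoming), key=lambda item: item.accessioned)
    newer.cgd = "Open"
```

Sorting by accession date rather than by which item triggered the run means an existing item that is updated to Shared can still keep Shared if it was accessioned first.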


Items are considered a “match” when two of the following points match:

  • ISBN
  • ISSN
  • LCCN
  • OCLC
  • Title (first 4 words) 
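As a rough sketch of the "two points" rule, the function below counts how many of the listed match points agree between two records. The `BibRecord` class and function names are hypothetical, the values are assumed to be normalized already, and lowercasing the title before comparison is an assumption that the source does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BibRecord:
    """Hypothetical record holding already-normalized match points."""
    isbn: Optional[str] = None
    issn: Optional[str] = None
    lccn: Optional[str] = None
    oclc: Optional[str] = None
    title: Optional[str] = None


def title_key(title: Optional[str]) -> Optional[str]:
    """Return the first four words of a title, or None when there is no title."""
    if not title:
        return None
    words = title.lower().split()  # lowercasing is an assumption
    return " ".join(words[:4]) if words else None


def matching_points(a: BibRecord, b: BibRecord) -> List[str]:
    """List the match points that are present on both records and equal."""
    points = []
    for name in ("isbn", "issn", "lccn", "oclc"):
        value_a, value_b = getattr(a, name), getattr(b, name)
        if value_a and value_b and value_a == value_b:
            points.append(name)
    key_a, key_b = title_key(a.title), title_key(b.title)
    if key_a and key_b and key_a == key_b:
        points.append("title")
    return points


def is_match(a: BibRecord, b: BibRecord) -> bool:
    """Two records match when at least two points agree."""
    return len(matching_points(a, b)) >= 2
```

A missing value never counts as a match point, so two records that both lack an ISSN do not match on ISSN.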

Normalization

  • ISBN, ISSN, LCCN, OCLC - non-numeric characters are removed
  • Title - diacritics are removed
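The two normalization rules above might look like this in code. It is a sketch under assumptions: the function names are hypothetical, and diacritics are removed by Unicode decomposition, which is one common technique and not necessarily the one SCSB uses.

```python
import re
import unicodedata


def normalize_identifier(value: str) -> str:
    """Remove every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", value)


def strip_diacritics(title: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", title)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applied literally, the identifier rule also drops the `X` check digit of an ISBN-10 and any alphabetic prefix on an OCLC number or LCCN.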

Exceptions

  1. Matching copies from a single institution are not compared. It is up to the submitting institutions to submit only one copy of duplicate matching items to SCSB as “Shared,” and the rest as “Open.”
  2. If one or more numbers match, the items will be processed so only one item is “Shared”. The initial designations are assigned in a way designed to create an even distribution of “Shared” designations among the partners, while considering use restrictions.
  3. However, if only one number matches, an additional title verification will take place on the normalised data. When two items have a single number match but different bib levels (one is a monograph and the other a serial), that is, when the material type does not match, the matching algorithm is not applied.
  4. When the CGD of an item is changed manually to Shared using the SCSB UI, the matching algorithm is not run on the item.  This means that there could be two items with the Shared CGD.
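One way to read the first three exceptions together is as a classification of each candidate pair. The sketch below is illustrative only: `Candidate`, `classify_pair`, and the outcome labels are hypothetical. It treats two or more number matches as a match, and a single number match as a match only when the first four title words also agree. Exception 4 (manual changes in the SCSB UI) is outside its scope, because the algorithm is not run at all in that case.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Candidate:
    """Hypothetical record; numbers maps e.g. 'isbn' to a normalized value."""
    institution: str
    material_type: str
    numbers: Dict[str, str] = field(default_factory=dict)
    title: str = ""


def first_four_words(title: str) -> str:
    """First four words of a title, lowercased (lowercasing is an assumption)."""
    return " ".join(title.lower().split()[:4])


def classify_pair(a: Candidate, b: Candidate) -> str:
    """Return 'not_compared', 'not_applied', 'match', 'title_exception' or 'no_match'."""
    if a.institution == b.institution:
        return "not_compared"  # exception 1: same institution
    if a.material_type != b.material_type:
        return "not_applied"  # exception 3: e.g. monograph vs serial
    common = a.numbers.keys() & b.numbers.keys()
    number_matches = sum(1 for key in common if a.numbers[key] == b.numbers[key])
    if number_matches >= 2:
        return "match"
    if number_matches == 1:
        key_a, key_b = first_four_words(a.title), first_four_words(b.title)
        if key_a and key_a == key_b:
            return "match"
        return "title_exception"  # reported, but CGD is not changed
    return "no_match"
```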

Reports

The following reports are generated after each run of the matching algorithm:

  • Matching Summary Report
  • Matching Serial MVM Report
  • Title Exception Report
  • CGD Round Trip Report

Matching Summary Report

  • Example file name: MatchingSummaryReport-27Jul2021080127.csv
  • Columns: Institution, Total Bibs, Total Items, Shared Items Before Matching, Shared Items After Matching, Difference of Shared Items, Open Items Before Matching, Open Items After Matching, Difference of Open Items
  • Rows: PUL, CUL, NYPL, HL

Matching Serial MVM Report

  • Includes items that were changed from Shared to Open for Serial and MVM material types
  • All items that match are changed from Shared to Open
  • Example file name: MatchingSerialMvmReport-27Jul2021080114.csv
  • Columns: OwningInstitutionId, Title, Summary Holdings, Volume Part Year, Use Restriction, BibId, OwningInstitutionBibId, Barcode

Title Exception Report

  • Items that match on only one number and not the first four words of the normalized title will be

...

  1. These title exception records will not be eligible for the grouping of bibs or for the CGD update process (applies to both the Initial Matching Algorithm and the Ongoing Matching Algorithm).

...

  1. A summary report will be created and made available in AWS S3.
  2. Monographs and multi-volume monographs with only one item - designated per the ⅓ algorithm
  3. Serials (with one to N items) and MVMs with more than one item - all items attached to matching bibs will be designated “Open.”
  4. A title report of the serials that match will be available in AWS S3 at [AWS S3 /scsb-{environment}/reports/matching-reports].

...

  • included in this report.  Previously, these items were considered a match and the CGD was affected, but as of v4.3 (July 2021), they are not considered matched.
  • Columns: OwningInstitution, BibId, OwningInstitutionBibId, MaterialType, OCLCNumber, ISBN, ISSN, LCCN, Title1, Title2, Title3, Title4, Title5, Title6, Title7, Title8, Title9, Title10, Title11, Title12, Title13, Title14, Title15, Title16, Title17, Title18, Title19

CGD Round Trip Report

  • A report will be created if any item’s CGD is changed by the matching algorithm.
  • All the items with a change to the CGD will be included in the report.
  • The report will be written to the SCSB AWS S3 bucket.
  • The report is institution specific and will be put into the corresponding directory for that report and institution.
  • The directory in the S3 bucket will be: reports/cgd-round-trip/<institution>/
  • The name of the report will be CGD_RoundTripReport_<timestamp>.csv (ex: CGD_RoundTripReport_20210322_185905.csv)
  • Columns: Item Barcode, Old CGD, CGD, Date of Action
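The directory and file naming described above could be assembled as in the sketch below. The function name is hypothetical, and the timestamp layout (`YYYYMMDD_HHMMSS`) is inferred from the single example file name, so treat it as an assumption.

```python
from datetime import datetime


def cgd_round_trip_key(institution: str, when: datetime) -> str:
    """Build the S3 object key for one institution's CGD Round Trip Report."""
    stamp = when.strftime("%Y%m%d_%H%M%S")  # format assumed from the example name
    return f"reports/cgd-round-trip/{institution}/CGD_RoundTripReport_{stamp}.csv"
```

For example, `cgd_round_trip_key("PUL", datetime(2021, 3, 22, 18, 59, 5))` yields a key ending in `CGD_RoundTripReport_20210322_185905.csv`.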

Prior to v4.3

  1. Prior to v4.3, the matching algorithm considered Use Restrictions. As of v4.3 (July 2021), it no longer does.


Technical documentation for the matching algorithm: Technical Documentation for Matching Algorithm