SCSB Technical Documentation For Matching Algorithm

Initial Matching Algorithm

The tables involved in the Initial Matching Algorithm:

  1. MATCHING_MATCHPOINTS_T
  2. MATCHING_BIB_T
  3. REPORT_T
  4. REPORT_DATA_T


The following steps are involved in the Initial Matching Algorithm:

  1. The first step identifies the matchpoint values (ISBN, ISSN, LCCN and OCLC) that are shared by two or more records in SCSB. Each value is recorded in the MATCHING_MATCHPOINTS_T table along with its count of occurrences. The algorithm uses Apache Solr queries with facets to identify these matchpoints.
  2. Using the matchpoint values, the bibliographic data for each matching record is then identified: the owning institution, SCSB bibliographic id, title, matchpoint value and material type. This data is populated in the MATCHING_BIB_T table for further analysis. The table also has a flag that marks whether a particular row has been processed.
  3. The rows in the MATCHING_BIB_T table are processed iteratively with SQL queries, using matchpoint combinations for multi match (OCLC-ISBN, OCLC-ISSN, OCLC-LCCN, ISBN-ISSN, ISBN-LCCN, ISSN-LCCN) and individual matchpoints for single match (ISBN, ISSN, LCCN and OCLC). As each row is processed, its flag in MATCHING_BIB_T is marked ‘Complete’. After processing, the REPORT_T and REPORT_DATA_T tables are populated with the matching bib data.
  4. The next step is updating the CGD (Collection Group Designation) for Monographs, Serials and MVMs.
    1. For Monographs, all the records with material type ‘Monograph’ are fetched and examined to determine whether each is a Monograph or an MVM. Monographs are then processed separately, based on the use restriction and the count of shared records of each institution.
    2. For MVMs and Serials, all the items are marked as ‘Open’.
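
Steps 1 and 3 above can be illustrated with a minimal in-memory sketch. It is not the SCSB implementation, which uses Solr facets and SQL; the record layout (a dict with one list of values per matchpoint) and the function names are assumptions made for the example. The first function does roughly what a facet count with a minimum count of 2 would return, and the second enumerates the multi-match pairs followed by the single matchpoints.

```python
from collections import Counter
from itertools import combinations

# Matchpoint names as given in the text above.
MATCHPOINTS = ["OCLC", "ISBN", "ISSN", "LCCN"]


def duplicate_matchpoints(records, field):
    """Return {value: count} for values of `field` shared by two or more records.

    `records` is a hypothetical list of dicts such as
    {"id": 1, "ISBN": ["111"], "OCLC": ["a"]}.
    """
    # set() makes sure a record that repeats a value is counted only once.
    counts = Counter(
        value for record in records for value in set(record.get(field, []))
    )
    return {value: n for value, n in counts.items() if n > 1}


def match_criteria():
    """Return the six multi-match pairs followed by the four single matchpoints."""
    multi = [",".join(pair) for pair in combinations(MATCHPOINTS, 2)]
    return multi + MATCHPOINTS
```

With three records where two share ISBN 111 and two share ISBN 222, `duplicate_matchpoints(records, "ISBN")` returns both values with a count of 2, which corresponds to the value and count stored per row in MATCHING_MATCHPOINTS_T.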

The following SQL query file is used for report generation in Initial Matching:

Title_Exception_Report_Query.sql


Ongoing Matching Algorithm

The tables involved in the Ongoing Matching Algorithm:

  1. REPORT_T
  2. REPORT_DATA_T

The following steps are involved in the Ongoing Matching Algorithm:

  1. The first step is to fetch from Solr all the bibs that have been updated, based on the last updated date of the bib.
  2. The matchpoints of these bibs (ISBN, ISSN, LCCN and OCLC) are extracted and used in Solr queries to find duplicates among the existing SCSB records.
  3. On a successful match, the records (the original as well as its duplicates) are recorded in the REPORT_T and REPORT_DATA_T tables.
  4. The items of the bibs are fetched from the DB. The use restriction of each item and the INITIAL_MATCHING_DATE column in the ITEM_T table are checked.
  5. For Monographs, if the use restrictions differ, the item with the least restrictive use restriction remains ‘Shared’ and the rest are marked ‘Open’.
  6. For Monographs, if the use restrictions are the same, the item with a date in the INITIAL_MATCHING_DATE column is retained as ‘Shared’ and the others are marked ‘Open’. If INITIAL_MATCHING_DATE is empty, the item that was created first (based on the CREATED_DATE column of the ITEM_T table) is retained as ‘Shared’ and the rest are marked ‘Open’.
  7. For Serials and MVMs, the CGD of all items is marked ‘Open’.
  8. Finally, the reports are generated in the configured FTP location (/share/recap/matching-reports/prod).
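
The CGD decision in steps 5 to 7 can be sketched as a single preference ordering. This is an illustration under stated assumptions, not the SCSB code: the `Item` class, the `resolve_cgd` function, the material type labels and the restriction ranking are all hypothetical, and the text above does not say how ties among equally restrictive items are broken when restrictions differ, so the sketch reuses the rule from step 6 for that case.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical ranking, lower = less restrictive. The actual use restriction
# values are not listed in this document.
RESTRICTION_RANK = {"No Restrictions": 0, "In Library Use": 1, "Supervised Use": 2}


@dataclass
class Item:
    item_id: int
    use_restriction: str
    created_date: date                            # CREATED_DATE in ITEM_T
    initial_matching_date: Optional[date] = None  # INITIAL_MATCHING_DATE in ITEM_T


def resolve_cgd(items, material_type):
    """Return {item_id: 'Shared' | 'Open'} for one group of matched items."""
    if not items:
        return {}
    # Step 7: Serials and MVMs are always opened.
    if material_type in ("Serial", "MVM"):
        return {item.item_id: "Open" for item in items}

    def preference(item):
        return (
            RESTRICTION_RANK[item.use_restriction],  # step 5: least restrictive first
            item.initial_matching_date is None,      # step 6: dated items first
            item.created_date,                       # step 6: then earliest created
        )

    keeper = min(items, key=preference)
    return {
        item.item_id: "Shared" if item is keeper else "Open" for item in items
    }
```

For two Monograph items with the same use restriction and no INITIAL_MATCHING_DATE, the one with the earlier `created_date` comes back as ‘Shared’ and the other as ‘Open’.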


If a new institution is added and the Initial Matching Algorithm needs to be run again, follow these steps:

  1. Clear the values in the above-mentioned tables by executing the following scripts in the DB:
    1. truncate recap.matching_matchpoints_t;
    2. truncate recap.matching_bib_t;
    3. delete from report_data_t where record_num in (select record_num from report_t where type in ('MatchingCount','MatchingMonographCGDSummary','MatchingMVMCGDSummary','MatchingSerialCGDSummary','MaterialTypeException','MultiMatch','SingleMatch','TitleException') and file_name in ('ISBN','ISBN,ISSN','ISBN,LCCN','ISSN','ISSN,LCCN','LCCN','MatchingCGDSummaryReport','MatchingSummaryCount','OCLC','OCLC,ISBN','OCLC,ISSN','OCLC,LCCN','PendingBibMatches'));
    4. delete from report_t where type in ('MatchingCount','MatchingMonographCGDSummary','MatchingMVMCGDSummary','MatchingSerialCGDSummary','MaterialTypeException','MultiMatch','SingleMatch','TitleException') and file_name in ('ISBN','ISBN,ISSN','ISBN,LCCN','ISSN','ISSN,LCCN','LCCN','MatchingCGDSummaryReport','MatchingSummaryCount','OCLC','OCLC,ISBN','OCLC,ISSN','OCLC,LCCN','PendingBibMatches');
  2. Then run the Initial Matching Algorithm from the solrAdmin Initial Matching Algorithm UI.
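
The order of the two delete statements above matters: the report_data_t delete selects its rows through report_t, so it has to run before the matching report_t rows are removed. The sketch below shows that ordering with Python's built-in sqlite3 module, purely for illustration. The function name and the shortened filter lists are assumptions; only the table names and the record_num, type and file_name columns come from the scripts above, and the production database is not necessarily SQLite.

```python
import sqlite3

# Shortened filter lists for illustration; the full lists are in the scripts above.
TYPES = ("SingleMatch", "MultiMatch")
FILE_NAMES = ("ISBN", "OCLC,ISBN")


def clear_matching_reports(conn):
    """Delete matching report rows, child table (report_data_t) first."""
    type_marks = ",".join("?" * len(TYPES))
    file_marks = ",".join("?" * len(FILE_NAMES))
    where = f"type in ({type_marks}) and file_name in ({file_marks})"
    params = TYPES + FILE_NAMES
    with conn:  # one transaction: commits on success, rolls back on error
        # The subquery needs the report_t rows, so this must run first.
        conn.execute(
            "delete from report_data_t where record_num in "
            f"(select record_num from report_t where {where})",
            params,
        )
        conn.execute(f"delete from report_t where {where}", params)
```

Running both deletes in one transaction is a design choice of the sketch: if the second statement fails, the first is rolled back, so report_t rows are not left without their data rows.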


UI steps to run the Initial Matching Algorithm:

  1. The whole Initial Matching Algorithm can be run by selecting the first option, “Run Matching Full Process”, in the drop-down of the matching UI and entering the current date in the date field.
  2. To run the matching as a step-by-step process instead, run it using the options listed below:
    1. Find Matching and Save Reports.
    2. Update Monographs CGD in DB.
    3. Update Serials CGD in DB.
    4. Update MVMs CGD in DB.
    5. Update CGD in solr (in the date field, enter the date on which you started the matching algorithm process).

UI steps to run the Ongoing Matching Algorithm:

The Ongoing Matching Algorithm can be run from the SCSB Solr Admin user interface. Under the Ongoing Matching Job tab, select the ongoing matching algorithm job as the job type; this runs all the processes for ongoing matching.

The Ongoing Matching Algorithm can also be run via scheduled jobs, with the job name 'OngoingMatchingAlgorithm'.