...

  1. The first step identifies the match points (ISBN, ISSN, LCCN and OCLC) that have one or more duplicate counterparts among all the records in SCSB. The values are recorded in the MATCHING_MATCHPOINTS_T table along with their occurrence counts. The algorithm uses Apache Solr queries with facets to identify these match points. (Initially, match points were identified only for 'Shared' items. With the recent modifications, especially for grouping bibs, match points are found across items of all CGD types, i.e. 'Shared', 'Private', 'Open', 'Committed' and 'Uncommitted'.)
  2. Next, the match point values are used to identify bibliographic data such as the owning institution, SCSB bibliographic id, title, match point value and material type. This data is populated in the MATCHING_BIB_T table for further analysis. The table also has a flag that marks whether a particular row has been processed.
  3. The values in the MATCHING_BIB_T table are processed iteratively. This is done with SQL queries that use combinations for multi match (OCLC-ISBN, OCLC-ISSN, OCLC-LCCN, ISBN-ISSN, ISBN-LCCN, ISSN-LCCN) and single match (ISBN, ISSN, LCCN and OCLC). As each row is processed, its flag in MATCHING_BIB_T is marked ‘Complete’. After processing, the REPORT_T and REPORT_DATA_T tables are populated with the matching bib data.
  4. Once the REPORT_T and REPORT_DATA_T tables are ready, the bibs are grouped with a 'Matching identifier'.
  5. The next step is updating CGD for Monographs, Serials and MVMs.
    1. For Monographs, all the records with material type ‘Monograph’ are fetched. This data is processed to determine whether each record is a Monograph or an MVM. Monographs are processed separately based on the use restriction and the count of shared records of each institution. (The check on 'Use Restrictions' has since been removed. Similarly, 'Single Match' items with a Title Exception are not eligible for CGD updates or grouping.)
    2. For MVMs and Serials, all the items are marked as “Open”.
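
The duplicate detection and multi-match steps above can be sketched in memory. This is only an illustration: the actual process uses Solr facet queries and SQL against MATCHING_BIB_T, whereas the function names, the dictionary layout of a bib and the use of plain Python collections here are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Match points named in the text; the order yields the same six pairs
# listed for multi match (OCLC-ISBN, OCLC-ISSN, ..., ISSN-LCCN).
MATCH_POINTS = ("OCLC", "ISBN", "ISSN", "LCCN")


def duplicated_match_points(bibs):
    """Return {(field, value): count} for values found on more than one bib.

    Hypothetical stand-in for the facet counts stored in
    MATCHING_MATCHPOINTS_T.
    """
    counts = Counter()
    for bib in bibs:
        for field in MATCH_POINTS:
            # set() so a value repeated within one bib is counted once
            for value in set(bib.get(field, [])):
                counts[(field, value)] += 1
    return {key: n for key, n in counts.items() if n > 1}


def multi_match_groups(bibs):
    """Group bib ids that agree on two different match points."""
    # (field, value) -> ids of the bibs carrying that value
    index = defaultdict(set)
    for bib in bibs:
        for field in MATCH_POINTS:
            for value in bib.get(field, []):
                index[(field, value)].add(bib["id"])

    groups = {}
    for first, second in combinations(MATCH_POINTS, 2):
        for bib in bibs:
            for v1 in bib.get(first, []):
                for v2 in bib.get(second, []):
                    shared = index[(first, v1)] & index[(second, v2)]
                    if len(shared) > 1:
                        groups[(first, v1, second, v2)] = shared
    return groups
```

A bib that matches another only on ISBN would appear in `duplicated_match_points` but not in `multi_match_groups`, which mirrors the split between single match and multi match.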

...

  1. The first step is to fetch all the bibs that have been updated based on the last updated date in the bib from Solr.
  2. The match points of these bibs, namely ISBN, ISSN, LCCN and OCLC, are extracted and used in a Solr query to find duplicates among the existing SCSB records.
  3. On successful matching, the records (original as well as duplicates) are recorded in the REPORT_T and REPORT_DATA_T tables.
  4. The items of the bibs are fetched from the DB. The use restriction of each item and the INITIAL_MATCHING_DATE flag in the ITEM_T table are checked.
  5. If the use restrictions are different, then the item having the least use restriction will remain ‘Shared’ and the rest would be marked as ‘Open’ for Monographs. In case of serials or MVM, all items’ CGD will be marked as ‘Open’.
  6. If the use restrictions are the same:
    1. If there are any 'Committed' items, the algorithm retains them as 'Committed' and marks the 'Shared' items as 'Open'.
    2. If there are only 'Shared' items, the item with a date in the INITIAL_MATCHING_DATE column is retained as ‘Shared’ and the others are marked ‘Open’. If INITIAL_MATCHING_DATE is empty, the item that entered first (based on the CREATED_DATE column of the ITEM_T table) is retained as ‘Shared’ and the rest are marked ‘Open’ for Monographs.
  7. In case of Serials and MVMs, the CGD of all items is marked as ‘Open’.
  8. Finally, the reports are generated in the configured AWS S3 location (/share/recapscsb-{environment}/reports/matching-reports/prod).
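
The CGD decision rules in the steps above can be sketched as a single function. This is a minimal sketch, not the SCSB implementation: the `Item` class, the numeric `restriction_rank` (lower meaning less restrictive), the tie-breaking on dates and the assumption that every input item is either 'Shared' or 'Committed' are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Item:
    item_id: int
    cgd: str                       # assumed to be 'Shared' or 'Committed'
    restriction_rank: int          # assumption: lower = less restrictive
    created_date: date             # stands in for ITEM_T.CREATED_DATE
    initial_matching_date: Optional[date] = None


def resolve_cgd(items, material_type):
    """Return {item_id: new CGD} for a set of matched items."""
    # Serials and MVMs: every item becomes 'Open'.
    if material_type in ("Serial", "MVM"):
        return {i.item_id: "Open" for i in items}

    def keep_one_shared(keeper):
        return {i.item_id: "Shared" if i is keeper else "Open" for i in items}

    # Monographs with differing use restrictions: the least restricted
    # item stays 'Shared' (ties broken by creation date, an assumption).
    if len({i.restriction_rank for i in items}) > 1:
        keeper = min(items, key=lambda i: (i.restriction_rank, i.created_date))
        return keep_one_shared(keeper)

    # Same use restrictions, some 'Committed': keep those, open the rest.
    if any(i.cgd == "Committed" for i in items):
        return {
            i.item_id: "Committed" if i.cgd == "Committed" else "Open"
            for i in items
        }

    # Same use restrictions, only 'Shared': prefer an item that already
    # has INITIAL_MATCHING_DATE, else the earliest CREATED_DATE.
    dated = [i for i in items if i.initial_matching_date is not None]
    if dated:
        keeper = min(dated, key=lambda i: i.initial_matching_date)
    else:
        keeper = min(items, key=lambda i: i.created_date)
    return keep_one_shared(keeper)
```

Returning a mapping rather than mutating the items keeps the decision separate from the database update, which makes each rule easy to check against a small set of sample items.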


If a new institution is onboarded and the initial matching algorithm needs to be run again, the following steps need to be followed:

...