Solr Indexing Guide

Shared Collection Service Bus (SCSB) uses Apache Solr to provide fast, enterprise-grade search for the bibliographic search feature. Data must be indexed before Solr can search it and return results. When the application is installed and started for the very first time, the relevant data might need to be indexed. The time taken for indexing varies with the volume of data. Subsequent updates to the data are indexed in real time.

SCSB allows the administrator to initiate the index process through a Graphical User Interface (GUI).

The SCSB Solr Admin page has four tabs: Solr Index, Matching Algorithm Save Reports, Ongoing Matching Job and Generate Reports. The Solr Index tab allows users to initiate the indexing process. The Matching Algorithm Save Reports tab allows users to run the various matching algorithm jobs and to save the related reports to the database for subsequent export. The Ongoing Matching Job tab allows users to initiate jobs related to the matching process. The Generate Reports tab allows users to export reports produced by processes such as Solr Index, Accession, Deaccession and Submit Collection.

The user can define various parameters to optimize the indexing process. The Number of threads field takes an integer value, which should be chosen according to the RAM available on the server. A machine with 16 GB of RAM can comfortably handle 5 threads simultaneously, and 5 remains the recommended number of threads in SCSB. The Number of Documents to be processed per thread field sets the maximum number of documents that can be processed in a thread. Past experience suggests that 1000 is the optimal number of documents for SCSB to process in a single thread. Commit Interval is the number of records after which Solr makes a commit. The value used in SCSB is 50000; in other words, Solr commits only after 50000 records have been processed.

Institution Code allows users to configure which institution’s records are to be indexed. Setting the value to ‘ALL’ indexes all records irrespective of institution. The Date From field is a date picker used to set the date from which records are processed for indexing. If the Auto Refresh checkbox is checked, the Full Indexing Status display refreshes automatically to show the progress of indexing. If the Clean checkbox is checked, any existing Solr documents from past indexing runs are removed (wiped clean) before fresh indexing commences.
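To get a feel for how the three tuning parameters interact, the arithmetic can be sketched as below. This is only an illustration, not SCSB code: the function name plan_indexing, the IndexPlan type, and the assumptions that records are split into batches of at most "documents per thread", that up to "number of threads" batches run at once, and that a final commit covers any remainder smaller than the commit interval are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexPlan:
    """Rough workload estimate for one indexing run (hypothetical type)."""
    batches: int   # chunks of at most docs_per_thread records
    rounds: int    # waves of batches, with `threads` batches running at once
    commits: int   # commits, assuming one per commit_interval records


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def plan_indexing(total_records: int,
                  threads: int = 5,
                  docs_per_thread: int = 1000,
                  commit_interval: int = 50000) -> IndexPlan:
    """Estimate batches, rounds and commits for a given record count.

    Defaults mirror the values quoted in this guide. The batching and
    commit model is an assumption made for illustration only.
    """
    if total_records < 0:
        raise ValueError("total_records must not be negative")
    if threads < 1 or docs_per_thread < 1 or commit_interval < 1:
        raise ValueError("threads, docs_per_thread and commit_interval must be >= 1")

    batches = ceil_div(total_records, docs_per_thread)
    rounds = ceil_div(batches, threads)
    # Assumes a last commit is issued for a partial interval at the end.
    commits = ceil_div(total_records, commit_interval)
    return IndexPlan(batches=batches, rounds=rounds, commits=commits)


if __name__ == "__main__":
    print(plan_indexing(1_200_000))
```

Under these assumptions, 1,200,000 records with the values quoted above work out to 1200 batches, processed in 240 rounds of 5 threads, with 24 commits. Raising the thread count reduces the number of rounds but increases memory pressure on the server, which is why the thread count is tied to available RAM.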

The Matching Algorithm Save Reports tab allows the user to run individually the tasks that make up the Matching Algorithm process. If Find Matching and Run Reports is selected, all existing records are processed for matches according to predefined conditions and a report is run. If Run Only Reports is selected, only the reports are generated on existing records. If the Update CGD in DB task is selected, the CGD (Collection Group Designation) is updated in the database for records found to be duplicates, depending on their use restriction and the total count of shared records of each partner. If the Update CGD in Solr task is selected, the CGD is updated in the Solr document so that the change is reflected in the SCSB user interface. If the Populate Data For Data Dump task is selected, duplicate record information is populated in the SCSB XML data. The data dump API picks up this information later while generating records.

Under the Ongoing Matching Job tab, the user can initiate the individual jobs that are part of the Ongoing Matching algorithm, which runs as part of the ongoing accession process. When new bibliographic items are added to the existing ReCAP collection, the Ongoing Matching algorithm looks for duplicates and updates the CGD of the records according to predefined conditions. The Ongoing Matching Algorithm Job identifies duplicates and updates the CGD, while the Populate Data For DataDump Job records duplicate items in the SCSB XML for future export through the data dump API.

The Generate Reports tab allows the user to generate reports such as the deaccession summary report, ongoing accession report, accession summary report, submit collection rejection report and submit collection exception report. Reports can be generated for all partner institutions or for a specific institution, selected through the Institution dropdown. Reports are transmitted either through the file system of the server where the application is hosted or to a configured FTP location, as selected through the Transmission Type dropdown.