ETL Guide

Initial Accession is the process of adding all existing title and item information held at the ReCAP facility to the Shared Collection Service Bus (SCSB) application. It involves loading data that the partners provide in a specially defined format (SCSB XML). The partners extract this data from their respective Integrated Library Systems (ILS) using a list of barcodes provided by the Generation Fifth Applications’ Library Archival System (GFA LAS). Initial Accession is executed only once, by the administrator, during the initial phase of the project, to load existing data as part of the application setup. All subsequent accessions at ReCAP are captured through the Ongoing Accession process.

 

The process loads the contents of the SCSB XML into tables such as bibliographic_t, holdings_t and item_t, and into related tables such as bibliographic_holdings_t, bibliographic_item_t and item_holdings_t. Another table, xml_records_t, serves as an intermediate table that holds the individual bibliographic records from the files. Loading data into xml_records_t is automatic: a polling service checks the pending folder, configured as etl.load.directory in the application-<environment>.properties file, at a predefined interval and loads the files it finds into the xml_records_t table.
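The polling step described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the SCSB implementation: the record element name bibRecord, the xml_record column, the use of SQLite, and the function names are all hypothetical. Only the xml_records_t table, its XML_FILE column and the idea of a pending folder polled at an interval come from this guide.

```python
import sqlite3
import time
import xml.etree.ElementTree as ET
from pathlib import Path


def create_xml_records_table(conn):
    # Hypothetical, simplified schema; only XML_FILE is named in this guide.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS xml_records_t ("
        "id INTEGER PRIMARY KEY, xml_file TEXT, xml_record TEXT)"
    )


def load_file(path, conn):
    """Insert one row per record element of a single file; return the row count."""
    root = ET.parse(path).getroot()
    count = 0
    # Assumption: each bibliographic record is a <bibRecord> element without a namespace.
    for record in root.iter("bibRecord"):
        conn.execute(
            "INSERT INTO xml_records_t (xml_file, xml_record) VALUES (?, ?)",
            (path.name, ET.tostring(record, encoding="unicode")),
        )
        count += 1
    conn.commit()
    return count


def poll(pending_dir, conn, interval_seconds, cycles):
    """Check pending_dir a fixed number of times, loading each new file once."""
    seen = set()
    total = 0
    for _ in range(cycles):
        for path in sorted(Path(pending_dir).glob("*.xml")):
            if path.name not in seen:
                total += load_file(path, conn)
                seen.add(path.name)
        time.sleep(interval_seconds)
    return total
```

In this sketch pending_dir plays the role of the folder configured as etl.load.directory. How the real service tracks files it has already loaded is not covered in this guide; the in-memory set here is only a stand-in.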

 

The ETL process picks up the data from the xml_records_t table, processes it and moves it into the relevant bibliographic, holdings and item tables. This process is initiated through the SCSB ETL user interface (UI).

 

 

The SCSB ETL UI has three tabs: ETL, Upload Files and Reports. The ETL tab is where the ETL process is initiated. Its form takes File Name, Batch Size and Institution as inputs. The File Name is the name of the file that was used to load the xml_records_t table through the pending folder; this name is also recorded in the XML_FILE column of xml_records_t. It can be used to restrict the ETL process to certain loaded files. The File Name field also accepts wildcards through ‘*’. For example, to process all files, enter only ‘*’; to process only files whose names start with CUL, enter ‘CUL*’. The Batch Size is the number of records that SCSB processes simultaneously; the optimal value for SCSB is 1000. The Institution drop down restricts processing to data belonging to the selected institution. Check the Auto Refresh checkbox if the ETL Status text field (greyed out, with the text ETL Bulk Ingesting Status) should be updated in real time as processing progresses. The ETL button initiates the process.
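The effect of the File Name wildcard and the Batch Size can be illustrated with a short Python sketch. It does not show how SCSB evaluates these inputs internally; the function names are hypothetical, and fnmatchcase is a stand-in that also treats ‘?’ and ‘[...]’ as wildcards, which the UI may not do.

```python
from fnmatch import fnmatchcase


def select_files(loaded_file_names, pattern):
    """Return the loaded file names that match a File Name pattern such as 'CUL*'."""
    return [name for name in loaded_file_names if fnmatchcase(name, pattern)]


def batches(records, batch_size=1000):
    """Yield consecutive groups of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

With the hypothetical file names CUL_1.xml, PUL_1.xml and CUL_2.xml, the pattern ‘CUL*’ selects the two CUL files and ‘*’ selects all three. With a Batch Size of 1000, a set of 2500 records is handled as groups of 1000, 1000 and 500.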

 

The Upload Files tab is used to upload files that are to be processed for ETL. However, uploading this way is discouraged because of the overhead it places on the performance of the application. For this reason, uploads through this interface are restricted to files smaller than 5MB. The recommended way to get files loaded is to copy them manually into the pending folder. There are no memory or volume limitations on the size of a file added this way.
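As a rough illustration of the size rule above, the following Python sketch decides which route a file would take. The function name and the returned labels are hypothetical, and reading 5MB as 5 * 1024 * 1024 bytes is an assumption; the guide does not say how the limit is computed.

```python
import os

# Assumption: "5MB" is read as 5 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def choose_load_route(path):
    """Return 'upload' for files under the UI limit, else 'pending folder'."""
    if os.path.getsize(path) < MAX_UPLOAD_BYTES:
        return "upload"
    return "pending folder"
```

Since copying into the pending folder is the recommended route in every case, a check like this only indicates whether the Upload Files tab is an option at all.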

 

The Reports tab allows the user to generate reports for both the ETL and the Batch Export (Export Data Dump) processes, depending on the process chosen in the Operation Type drop down. The Report Type drop down selects either a success report or a failure report. The institution for which the report is run is chosen from the Institution drop down, and the delivery method (File system or FTP) is chosen through the Transmission Type. The File Name field and the date range given by the Date From and Date To fields are additional filters that can be applied to the report.