The first step was to establish the pattern for content ingestion: mapping how to pull the different types of content from the five data silos.
- Pipeline modelling
To start with, the Nordcloud team focused on one of the silos: journal content. The team defined an event-driven data pipeline and mapped a range of data processing tasks, then replicated this model across the other four systems.
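As a minimal sketch of that replicated-pipeline idea (the `IngestEvent` shape and task names here are illustrative assumptions, not Nordcloud's actual code), a single ordered task list can be reused verbatim for events from every silo:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class IngestEvent:
    silo: str          # source system, e.g. "journal"
    article_id: str
    payload: dict      # raw content pulled from the silo

# A processing task takes an event and returns it, possibly enriched.
Task = Callable[[IngestEvent], IngestEvent]

def extract_metadata(event: IngestEvent) -> IngestEvent:
    # Hypothetical metadata extraction step.
    event.payload.setdefault("metadata", {})["title"] = event.payload.get("title", "")
    return event

def validate(event: IngestEvent) -> IngestEvent:
    if not event.article_id:
        raise ValueError("article_id is required")
    return event

def index(event: IngestEvent) -> IngestEvent:
    print(f"indexed {event.article_id} from {event.silo}")
    return event

# One pipeline definition, replicated across all five silos.
PIPELINE: List[Task] = [extract_metadata, validate, index]

def handle(event: IngestEvent) -> IngestEvent:
    for task in PIPELINE:
        event = task(event)
    return event

if __name__ == "__main__":
    handle(IngestEvent(silo="journal", article_id="art-001", payload={"title": "Example"}))
```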
- Data validation
Scientific articles are inherently difficult to extract metadata from, and hard to group. Each article contains complex components and files that need to be indexed. This required additional validation, plus custom data models covering every possible content format within an article: images, tables and charts, histology slides, and the copy itself.
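A hedged sketch of what such a custom data model might look like in Python (the component kinds and field names are assumptions based on the formats listed above):

```python
from dataclasses import dataclass, field
from typing import List

# Assumed component kinds, taken from the formats described above.
ALLOWED_KINDS = {"image", "table", "chart", "histology_slide", "copy"}

@dataclass
class ArticleComponent:
    kind: str          # one of ALLOWED_KINDS
    file_name: str

    def __post_init__(self) -> None:
        # Validation runs as each component is constructed.
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown component kind: {self.kind!r}")
        if not self.file_name:
            raise ValueError("component must reference a file")

@dataclass
class Article:
    article_id: str
    components: List[ArticleComponent] = field(default_factory=list)

    def validate(self) -> None:
        # Every article must carry its copy (body text) to be indexable.
        if not any(c.kind == "copy" for c in self.components):
            raise ValueError(f"{self.article_id}: missing article copy")

if __name__ == "__main__":
    article = Article("art-001", [
        ArticleComponent("copy", "art-001.xml"),
        ArticleComponent("image", "fig1.png"),
    ])
    article.validate()
```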
- File packaging
To complete the processing tasks, a packaging system was designed to order the files; it was applied to every article to ensure that each and every component was in place and ready to be indexed.
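A simplified sketch of such a packaging step (the `PACKAGE_ORDER` grouping, completeness rule, and function names are hypothetical):

```python
from typing import Dict, List

# Hypothetical packaging order: which component groups a package
# contains, and the sequence they are assembled in.
PACKAGE_ORDER: List[str] = ["copy", "images", "tables", "charts", "histology_slides"]

def build_package(article_id: str, files: Dict[str, List[str]]) -> List[str]:
    """Order an article's files into a package, failing if a required
    component group is missing. The real checks would be richer."""
    package: List[str] = []
    for kind in PACKAGE_ORDER:
        entries = files.get(kind, [])
        if kind == "copy" and not entries:
            raise ValueError(f"{article_id}: package incomplete, no copy found")
        package.extend(sorted(entries))
    return package

if __name__ == "__main__":
    print(build_package("art-001", {
        "copy": ["art-001.xml"],
        "images": ["fig2.png", "fig1.png"],
        "tables": ["tbl1.csv"],
    }))
```

Running the same check over every article guarantees a uniform package layout, so the downstream indexer can rely on each component being present and in a predictable position.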