BMJ case study

BMJ.

Building a cloud-native content data pipeline for a leading publisher.

  • Processed 4TB across the BMJ’s 5 data systems
  • Created consistent data patterns to ensure complete search results
  • Enabled growth opportunities through content search and retrieval solution

The client.

BMJ – originally the British Medical Journal – is a medical publisher and one of the world’s most trusted knowledge providers.

BMJ offers quick, accurate, concise, evidence-based answers to clinical questions, plus access to the latest research and guidelines. Tailored medical education resources help clinicians identify learning needs and keep up with the latest evidence, guidelines and best practice.

To continue better supporting the medical industry and sharing knowledge and expertise to improve healthcare outcomes, BMJ partnered with Nordcloud to build a modernised content data pipeline in AWS.

Project background.

BMJ wanted a future-proof content data pipeline built on top of its existing AWS architecture. The pipeline should enable teams to leverage, repurpose and repackage content across various sources. With a plethora of resources hidden away across its virtual libraries, BMJ wanted to be able to repackage content for its customers and add value in new ways. The solution should also allow later integration with machine learning tools for categorisation or entity extraction.

Challenges.

  • Breaking data silos

The data is stored across five different systems: journals; clinical decision support; education; video management systems; and blog/website content. 

With no common data lake or warehouse, different teams used separate systems, with data completely isolated between each. 

To search across content, users had to query each system separately, using different search methods or keywords, and received inconsistent levels of information because of the disparate data and pipelines in place.

  • Managing data complexity

For the content pipeline to be effective, the sub-components (graphics, tables, videos) and the structure of the content must be retained.

All these components need to be processed and packaged to ensure each item is correctly represented in the index.

Had we not been through this joint process, or had the working relationship not been as effective as it was, the result would never have been so successful. 

Working with Nordcloud in this way has provided us with great insight on how to collaboratively work with technology partners. We were involved throughout the development iterations and Nordcloud could challenge our existing ways of working and proactively deliver effective solutions.

OLIVIER RENARD

Head of Software Architecture; BMJ

Our Approach.

Build content pipelines

The first step was to establish the pattern for content ingestion, mapping how to pull in the different types of content from the five different data silos.

  • Pipeline modelling

To start with, the Nordcloud team focused on one of the silos: journal content. The team defined the event-driven data pipeline and mapped a range of data processing tasks, then replicated this model across the four other systems.

  • Data validation

Scientific articles are inherently difficult to extract metadata from or to group. Complex components and files from each article need to be indexed. This required additional validation and custom data models to cover all the possible content formats in each article: images, tables and charts, histology slides, plus the copy.

  • File packaging

To complete the processing tasks, a packaging system was designed to order the files. It was applied to every article to ensure each component was in place and could be indexed.
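The validate-then-package pattern described above can be illustrated with a minimal Python sketch. It is not the implementation built for BMJ: the event shape, the `Component` and `ArticlePackage` types, the set of component kinds and the ordering rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical component kinds, based on the formats listed above.
ALLOWED_KINDS = {"image", "table", "chart", "histology_slide", "copy"}


@dataclass
class Component:
    kind: str
    path: str


@dataclass
class ArticlePackage:
    article_id: str
    source: str
    components: list = field(default_factory=list)


def validate(components):
    """Return a list of problems; an empty list means the article is valid."""
    problems = []
    for c in components:
        if c.kind not in ALLOWED_KINDS:
            problems.append(f"unknown component kind: {c.kind}")
        if not c.path:
            problems.append(f"missing path for {c.kind}")
    if not any(c.kind == "copy" for c in components):
        problems.append("article has no copy")
    return problems


def package(article_id, source, components):
    """Validate the components, then order them: copy first, then by kind and path."""
    problems = validate(components)
    if problems:
        raise ValueError("; ".join(problems))
    ordered = sorted(components, key=lambda c: (c.kind != "copy", c.kind, c.path))
    return ArticlePackage(article_id, source, ordered)


def handle_event(event):
    """Entry point for one ingestion event (assumed to be a plain dict)."""
    components = [Component(**c) for c in event["components"]]
    return package(event["article_id"], event["source"], components)
```

In an event-driven setup, a function like `handle_event` would run once per incoming content event, so the same handler could be reused for each of the five source systems with a different `source` value.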

Enable search API

Once the content ingestion pipeline was in place, the next step was to index the data using Elasticsearch, then create an API to enable cross-product search and retrieval.

  • Elasticsearch

The raw data from each storage system is ingested through the content data pipeline and passed into Elasticsearch, where it is indexed.

  • Complete search capability

This enables BMJ to search across the five data sources. This function is a massive enabler, allowing unprecedented flexibility and opportunity for pulling together bespoke content in real time. This is key for a modern publisher that has content going live continuously.

  • API gateway

Combining the above with a custom API allows BMJ to write applications that use the index to search all its content and link each result back to its original source: journals; best practices; learning and education; videos; and websites.
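As a rough sketch of how a cross-source search request might look, the Python below builds an Elasticsearch `multi_match` query body and maps the hits back to their source. The index names, the document fields (`title`, `abstract`, `body`, `source`, `source_url`) and both helper functions are hypothetical; the case study does not describe the real index layout or API.

```python
# Hypothetical index names, one per source system described above.
SOURCE_INDICES = {
    "journals": "content-journals",
    "clinical_decision_support": "content-cds",
    "education": "content-education",
    "video": "content-video",
    "web": "content-web",
}


def build_search_request(text, sources=None, size=10):
    """Return (comma-separated index list, query body) for a cross-source search."""
    chosen = sources or sorted(SOURCE_INDICES)
    indices = ",".join(SOURCE_INDICES[s] for s in chosen)
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                # Assumed field names; the title is boosted over the other fields.
                "fields": ["title^2", "abstract", "body"],
            }
        },
    }
    return indices, body


def to_results(response):
    """Flatten a search response into records that link back to the source system."""
    results = []
    for hit in response["hits"]["hits"]:
        doc = hit["_source"]
        results.append(
            {
                "title": doc["title"],
                "source": doc["source"],
                "url": doc["source_url"],
                "score": hit["_score"],
            }
        )
    return results
```

The index list and body would be passed to whichever Elasticsearch client the API layer uses. Keeping the query construction separate from the client call makes it possible to unit test the search logic without a running cluster.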

DevSecOps Review

The Nordcloud team also reviewed BMJ’s existing DevSecOps practices, along with data and tooling, to identify areas that could be improved.

  • More than a process

DevSecOps isn’t just a one-off exercise. It’s an approach to culture, automation, and AWS platform design that integrates security as a shared responsibility throughout the entire IT lifecycle.

  • Optimisation

The review provides documentation to help measure the effectiveness of a company’s AWS security controls, along with recommendations to promote continuous improvement.

  • Future-proofing

BMJ welcomed this project, challenging its own team to enforce best practices and fill capabilities gaps – beyond the scope of the content pipeline project – to support BMJ’s future AWS efforts.

Results.

For example, if a global organisation wants to research an area that is a malaria hotspot, BMJ can pull together a package of resources such as statistics on malaria in that area, the risks of malaria to businesses, and prevention and treatment advice. All types of content can be quickly packaged as a kind of micro-product that is up to date and adaptable to the market.

  • Connected data

BMJ now has a solution for capturing, indexing and searching all of its content across its various data systems. In total, over 4TB has been processed across the BMJ’s 5 data systems.

  • New products

This facilitates BMJ’s goal of enabling new products and building the search application through the API. This will empower teams to easily search and group different types of content and provide tailored content packages to customers.

  • Speed

The process, from retrieving the content through to packaging it and delivering it to a customer, now takes minutes instead of days.

  • Complete data

Previously, a search query might result in incomplete or even empty results. Now, with the data pipeline pulling content consistently from each system in real time, the search results are 100% complete.

We have enjoyed a strong working relationship with Nordcloud which has helped us effectively build the content pipeline solution that will enable future products.

For a business such as ours, where there’s such a volume of valuable content, this solution allows us to explore opportunities and provide more dynamic and flexible new products for our customers at a lower cost.

This was a key landmark in putting the modern infrastructure in place to enable these new possibilities. We have some exciting use cases lined up for this solution and we’re looking forward to supporting new applications through the API.

SEAN HARROP

Content Architect; BMJ
