The processing pipelines were not set up as optimal Beam pipelines. Every time the team ran into a case that the Python version of the framework didn’t support, they worked around the limitation by writing code outside the Beam programming model. This code eroded the performance gains Dataflow could provide.
Energyworx.
Embracing Kubernetes to reduce infrastructure spend.
- 50% savings on compute workloads
- Opened future cost saving potential
- Deployed across other hyperscaler environments
Project Summary.
The client.
Energyworx is a Dutch scale-up that produces big data platforms for customers in the energy sector.
Their customers, both energy suppliers and network operators, use the platform to process and store time series data from energy meters placed in houses and businesses across their area of operations.
The platform is used to automate business processes like billing and predicting energy usage.
Energyworx approached Nordcloud to help them migrate their data processing pipelines to Kubernetes, and achieve cost savings across their compute workloads.
Project background.
Energyworx had already built a large, distributed platform on GCP, mostly using managed services such as App Engine, Dataflow, Pub/Sub, Bigtable and BigQuery. But as the load on their platform rose, so did the costs.
The biggest cost factor was Google Dataflow. This, among other reasons, led to the decision to migrate a significant compute load from the managed platform to Kubernetes.
Energyworx was paying for the functionality these services deliver through a premium on compute resources. Compared to regular VMs, a Dataflow instance, for example, could cost 2.5x as much for the same resources.
The team also found that the Python version of Dataflow performed worse than the Java implementation and lagged behind in features. They experienced frequent crashes, which led to support tickets from customers waiting for their data.
The team had to dedicate significant time to monitoring their flows, and restarting after a crash also led to data loss.
Challenges.
Performance limitations
Aggressive auto-scaling
Dataflow auto-scaled aggressively when a burst of data entered the system, but it didn’t scale down quickly enough after the data had been processed. This led to large, unnecessary costs whenever large batches of data entered the system.
Dependency management
With so many dependencies bound to the Apache Beam version, all of them had to be updated at the same time. This became a problem when a framework upgrade brought with it a new version of a dependent library that had breaking changes. Such updates are complicated and a burden on developers.
Our Approach.
Addressing data duplication
Data processing in the Energyworx platform occurs in several stages, and Dataflow has built-in features for deduplicating the data in the platform.
The Nordcloud team had to address data duplication, one of the inherent problems of distributed systems, when migrating away from Dataflow. This meant implementing various new methods to counteract it.
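The case study does not describe the deduplication methods that were built. As a minimal sketch of one common approach, assuming each record can be identified by a hash of its content, the snippet below skips records that were already handled. The `Deduplicator` class, the record fields and the in-memory store are all hypothetical; a real deployment would need a shared store that survives container restarts.

```python
import hashlib
import json


class Deduplicator:
    """Remembers which records were already handled so replays are skipped."""

    def __init__(self):
        # In-memory set for illustration only (assumption): a real platform
        # would need a store shared between containers.
        self._seen = set()

    @staticmethod
    def record_key(record: dict) -> str:
        # Hash the whole record with sorted keys, so that two records with
        # the same content always map to the same key.
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def is_new(self, record: dict) -> bool:
        key = self.record_key(record)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def process_once(records, dedup, handler):
    """Call handler only for records not seen before; return the handled count."""
    handled = 0
    for record in records:
        if dedup.is_new(record):
            handler(record)
            handled += 1
    return handled
```

With this shape, a stage that receives the same meter reading twice, for example after a retry upstream, passes it on only once.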
Auto-scaling services
Kubernetes enables granular control over auto-scaling, but this also means more to manage. For best performance across multiple environments with different data, we used a combination of metrics.
We also had to handle scale-down. Data didn’t arrive at a steady rate but in bursts throughout the day, and the platform needed to guarantee processing within a defined time limit regardless. We used a sufficiently large stabilisation window to solve these issues.
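To illustrate how a combination of metrics and a stabilisation window interact, here is a simplified Python model of the documented Horizontal Pod Autoscaler behaviour: each metric proposes a replica count, the largest proposal wins, and scale-down only happens once no higher recommendation remains inside the window. This is a sketch that ignores details such as the autoscaler's tolerance band; the metric values and window length are made up, and the function and class names are hypothetical.

```python
import math
from collections import deque


def desired_replicas(current_replicas, metrics):
    """Return the replica count proposed by a set of metrics.

    metrics is a list of (current_value, target_value) pairs. Each metric
    proposes ceil(current_replicas * current / target); the largest wins.
    """
    proposals = [
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    ]
    return max(proposals)


class ScaleDownStabilizer:
    """Keeps recent recommendations and returns the highest one in the window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommendation) pairs

    def stabilize(self, now, recommendation):
        self._history.append((now, recommendation))
        # Drop recommendations that are older than the window.
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # A higher recommendation applies at once; a lower one only takes
        # effect after every higher value has aged out of the window.
        return max(rec for _, rec in self._history)
```

A larger window keeps capacity around between bursts at the price of some idle time, which is the trade-off the stabilisation window controls.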
Failure handling
Concerns such as retrying failed API calls or monitoring the application status were previously covered by Dataflow and now needed to be handled explicitly.
This meant more failure handling. When running thousands of containers in Kubernetes, containers are often shut down. When that happens halfway through processing or while sending output data, it causes performance issues and data duplication. We designed new services to handle these situations.
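The services built for this are not described in detail. As a rough sketch of two generic techniques for these situations, the snippet below retries a failing call with exponential backoff, and lets a worker finish its current item when Kubernetes sends SIGTERM before stopping a container. All names (`call_with_retries`, `GracefulWorker`) and defaults are hypothetical.

```python
import random
import signal
import time


def call_with_retries(func, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call func(); on an exception, retry with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from many containers.
            sleep(delay + random.uniform(0, delay / 2))


class GracefulWorker:
    """Processes items one by one and stops between items when asked to."""

    def __init__(self):
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signature matches what signal.signal() expects from a handler.
        self.stop_requested = True

    def install_signal_handler(self):
        # Kubernetes sends SIGTERM first and kills the container only after
        # the termination grace period has passed.
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self, items, handle):
        done = []
        for item in items:
            if self.stop_requested:
                break  # Leave the remaining items for another container.
            handle(item)
            done.append(item)
        return done
```

Stopping only between items avoids half-written output, though the remaining items still need to be picked up elsewhere, and a deduplication step remains useful for the case where a container is killed outright.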
Solutions.
Enabled Kubernetes pipelines
A large part of the codebase was not tightly coupled to the Beam programming model. Therefore, the team was able to remove the Beam framework entirely and dockerise the pipelines, making it possible to run them on Kubernetes without a Beam runner.
Complete auto-scaling control
While the auto-scaling feature was in principle a big plus, not being able to fine-tune it at a granular level was a significant downside. Kubernetes allows complete control over auto-scaling using Horizontal Pod Autoscalers.
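For readers unfamiliar with the resource, the sketch below builds a Horizontal Pod Autoscaler manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The field names follow the `autoscaling/v2` API; the deployment name, replica bounds, CPU target and window length are illustrative assumptions, not values from the Energyworx platform.

```python
import json


def build_hpa_manifest(name, min_replicas, max_replicas,
                       cpu_utilization, scale_down_window_seconds):
    """Return an autoscaling/v2 HorizontalPodAutoscaler manifest as a dict."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # Assumes the workload is a Deployment with the same name.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            # One CPU metric for brevity; more entries can be added here.
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": cpu_utilization,
                    },
                },
            }],
            # Delay scale-down until the window has passed.
            "behavior": {
                "scaleDown": {
                    "stabilizationWindowSeconds": scale_down_window_seconds,
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    manifest = build_hpa_manifest("pipeline-worker", 1, 50, 70, 900)
    print(json.dumps(manifest, indent=2))
```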
Balancing dependencies
Upgrading a dependent library was often not possible without upgrading the framework version and all the other dependent libraries. This is normal when working with a large framework, but it became more manageable once we removed the Beam framework, which allows more freedom in choosing dependency libraries.
Results.
This successful migration to Kubernetes enabled Energyworx to achieve several business outcomes that ensure their platform can keep evolving in a sustainable way into the future.
The platform achieves the same business value for customers with over 50% savings on compute workloads.
By moving workloads away from Google managed services to Kubernetes, Energyworx is now one step closer to being able to deploy the platform across other hyperscaler environments.
New avenues have been opened for future cost-saving initiatives, such as migrating more components from other managed services (such as Google App Engine) to Kubernetes, or using preemptible GKE nodes.