High Performance Computing and Associated Platforms
Introduction
High-performance computing (HPC) is the field that utilises large-scale computing resources of specific architectures to solve challenging and complex calculations far faster than conventional systems allow.
It has emerged over the past couple of decades as a critical platform for scientific breakthroughs, from the discovery stages of COVID vaccines to the creation of generative AI models. Over the past decade it has also become a critical tool for the financial services and insurance (FSI) sector.
HPC enables FSIs to process vast amounts of data and build models of unprecedented complexity for tasks such as derivatives pricing and risk management. It also helps FSIs perform real-time market analysis, identify viable investment opportunities and improve return on investment for stakeholders.
In recent years, cloud computing has evolved to offer HPC platforms to a broad range of customers, all the while abstracting away the intricate operating mechanisms. Partnerships between Intel, Nvidia, AMD and Microsoft have been steadily growing.
AWS, although the market leader in cloud services overall, has been slowly but surely eroding Microsoft's share of the HPC cloud market. Oracle and IBM have also been chasing market share, though they target different audiences.
While there are valid reasons for choosing one cloud provider over another, the focus of this article is to review the specific platforms used to deploy HPC in Azure, with an emphasis on user-friendliness, capability and the benefits of choosing either platform for running FSI workloads.
Platforms
There are currently two main platform services for HPC workloads in Azure: Batch and CycleCloud. The differentiating factor between the two is the degree of control over the workload manager: in Batch, the workload manager is abstracted away and only minimal settings are exposed for the user/developer to adjust.
CycleCloud, by contrast, requires a deeper understanding of the workload managers involved and of the templates used to create clusters bespoke to the requirement. The following sections delve deeper into Batch and CycleCloud.
Azure Batch
Azure Batch can be thought of as a developer-driven platform for scheduling workload runs on HPC resources and building chained execution pipelines. It abstracts away the knowledge required for industry workload managers such as Slurm, PBS and IBM LSF, and exposes basic job scheduling parameters to suit batch processing requirements.
Batch then schedules the jobs, based on developer requests and input, to run on a managed pool of virtual machines, handling scaling behind the scenes for the requested walltime. For example, developers can build a service with Batch to run a Monte Carlo risk simulation with a deadline of eight hours or less; by default, all Azure Batch concerns itself with is scaling the node pools to ensure that deadline is met.
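As a minimal sketch of what that might look like, assuming the azure-batch Python SDK (the account, pool and script names are placeholders, not prescriptions):

```python
import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Assumed account details; substitute your own Batch account and key.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.uksouth.batch.azure.com")

# A job pinned to a hypothetical pool, with an eight-hour wall-clock ceiling.
job = batchmodels.JobAddParameter(
    id="mc-risk-job",
    pool_info=batchmodels.PoolInformation(pool_id="mc-pool"),
    constraints=batchmodels.JobConstraints(
        max_wall_clock_time=datetime.timedelta(hours=8)),
)
client.job.add(job)

# One simulation task; run_mc.py stands in for the actual workload.
client.task.add(
    job_id="mc-risk-job",
    task=batchmodels.TaskAddParameter(
        id="sim-001",
        command_line="/bin/bash -c 'python3 run_mc.py --paths 1000000'"),
)
```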
The following sections describe how the platform operates, covering scaling and scheduling, communication modes, application loading, storage integration and observability.
Scaling:
Batch can provision fixed-size or flexible, auto-scalable node pools. In either case, users can specify a mix of spot and dedicated instances to meet a minimum standby requirement for workload requests, as well as auto-scaling against thresholds given in a defined formula.
The auto-scaling formula is written in Batch's own autoscale language and is applied through the SDKs (such as .NET) or the REST API. The formulae written by developers and users alike are free-form expressions that can include both service-defined variables, supplied by the Batch service, and user-defined variables.
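For illustration, here is a sketch of applying such a formula through the Python SDK. The pool name is hypothetical, and the formula is adapted from the pending-task heuristics of the kind the documentation describes; the $-prefixed names are service-defined variables:

```python
import datetime

# Autoscale formula in Batch's own language: scale dedicated nodes with the
# queue of pending tasks, capped at 25 VMs.
formula = """
startingNumberOfVMs = 2;
maxNumberOfVMs = 25;
samplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTasks = samplePercent < 70 ? startingNumberOfVMs :
    avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicatedNodes = min(maxNumberOfVMs, pendingTasks);
$NodeDeallocationOption = taskcompletion;
"""

# `client` as in the earlier sketch; "mc-pool" is a hypothetical pool id.
client.pool.enable_auto_scale(
    pool_id="mc-pool",
    auto_scale_formula=formula,
    auto_scale_evaluation_interval=datetime.timedelta(minutes=5),
)
```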
Scheduling:
Batch operates in a headless fashion, letting users focus solely on building pipelines that run workloads to completion by a designated time. The platform abstracts away all the knowledge required to run a workload manager, a head node, compute nodes, an entire HPC cluster, and the configuration for resource sharing and allocation between teams.
Users simply operate in a click-ops manner or use a developer-oriented schema to schedule and run their workloads.
Some of the options available to users are:
- Uploading applications into a node pool.
- Setting the wall-clock time for a given task or job.
- Customising node pools and the VMs within them (SKU, disks, cache, images, resizing options, certificates, etc.).
- Customising the minimum and maximum number of nodes on standby for a selected period of time (e.g., a ramp-up period).
- Customising inter-node communication within a node pool (see the sketch after this list).
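A minimal pool sketch exercising some of these options (SKU, image, node counts, inter-node communication). The image reference and node agent SKU are assumptions and should be checked against the images Batch currently supports:

```python
import azure.batch.models as batchmodels

pool = batchmodels.PoolAddParameter(
    id="fsi-pool",
    vm_size="Standard_F16s_v2",  # SKU choice, as per the list above
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical",
            offer="0001-com-ubuntu-server-focal",
            sku="20_04-lts",
            version="latest"),
        node_agent_sku_id="batch.node.ubuntu 20.04"),
    target_dedicated_nodes=2,     # dedicated standby minimum
    target_low_priority_nodes=8,  # spot/low-priority mix
    enable_inter_node_communication=True,  # node-to-node traffic in the pool
)
client.pool.add(pool)  # `client` as in the earlier sketches
```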
Communication modes:
Part of Batch's aim is to reduce, or expose, the complexity of communication between nodes in a given pool, depending on the user's needs. It removes the need to understand the intricacies of networking and communication layers, such as (non-)blocking paths, UCX protocols and Message Passing Interface (MPI) processes, so users can go straight to selecting the communication mode that suits the deadline's needs.
Batch can provision node pools with a virtual network, giving control over traffic to and from the pools, or without one, in which case maintaining the network infrastructure is left in Azure's hands. In either mode, there are three main communication modes for node pools in Batch:
- Simplified
- Classic
- Default
The simplified communication mode abstracts much of the complexity and requires only a baseline of whitelisted outbound connection rules. The classic communication mode, on the other hand, involves whitelisting inbound and outbound rules between the Azure Batch service and the VMs in a node pool, which can be troublesome for air-gapped or ring-fenced environments with strict networking requirements.
The default communication mode depends on whether the node pool is provisioned with a virtual network attachment: pools without a virtual network may be created in either classic or simplified mode, while pools with a virtual network always default to classic.
It is also worth mentioning that specifying a communication mode does not guarantee Batch will use it. It is a preference, honoured only where the node pool's configuration (e.g., whether public IPs are allowed, virtual network association, etc.) permits.
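In recent SDK versions (an assumption worth verifying against your installed azure-batch release), the preference is expressed on the pool itself:

```python
import azure.batch.models as batchmodels

# Preference only: Batch may still fall back to another mode depending on
# the pool's networking configuration, as noted above.
pool = batchmodels.PoolAddParameter(
    id="fsi-pool-simplified",
    vm_size="Standard_D4s_v3",
    virtual_machine_configuration=vm_config,  # as in the earlier pool sketch
    target_dedicated_nodes=2,
    target_node_communication_mode=batchmodels.NodeCommunicationMode.simplified,
)
```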
Application loading:
Batch offers a simple way of sideloading applications into a node pool. The user only needs to wrap the portable executable and its dependency files in a compressed archive and supply the command-line flags expected for execution. Batch then copies these packages and distributes them across the nodes to process the required task.
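A hedged sketch of wiring a pre-uploaded application package onto a pool (the application id and version are hypothetical), and invoking it from a task via the environment variable Batch sets for each package:

```python
import azure.batch.models as batchmodels

# Reference an application package previously uploaded as a .zip to the
# Batch account; "pricingengine" and "1.0" are placeholder values.
app_ref = batchmodels.ApplicationPackageReference(
    application_id="pricingengine", version="1.0")

pool = batchmodels.PoolAddParameter(
    id="app-pool",
    vm_size="Standard_D4s_v3",
    virtual_machine_configuration=vm_config,  # as in the earlier sketches
    target_dedicated_nodes=2,
    application_package_references=[app_ref],
)

# On Linux nodes, Batch exposes the unpacked path through an environment
# variable of the form AZ_BATCH_APP_PACKAGE_<id>_<version>.
task = batchmodels.TaskAddParameter(
    id="price-001",
    command_line="/bin/bash -c "
                 "'$AZ_BATCH_APP_PACKAGE_pricingengine_1_0/run --input trades.csv'",
)
```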
It abstracts away much of the environment customisation that drives optimal performance in a traditional HPC environment, where installers are supplied with compiler flags matched to the advanced vector extensions and associated libraries of a targeted architecture.
This gives Batch the edge for use cases that require processing, curating or transforming data in parallel. Alternatively, developers can pre-compile their applications to match a given architecture in an Azure VM to eke out some performance, though not as optimally as low-level, environment-based compilation.
Storage:
Batch offers users the ability to link their managed compute nodes to a set of storage options (file shares) residing on-premises or in Azure. Batch supports NFS-based shares (Azure NetApp Files or traditional NFS servers), CIFS, Blob and Azure Files. The type of compute node in question also determines which file shares are compatible.
For example, Linux VMs support all the types mentioned, while Windows ones are limited to Azure Files. It goes without saying that mounting an on-premises NFS file share should be approached with caution, given the complexities of optimal routing and latency, but the option is available.
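As an illustrative sketch (the storage account, shares, key and NFS source are placeholders), mounting an Azure Files share and an NFS export onto a pool's nodes:

```python
import azure.batch.models as batchmodels

mounts = [
    # Azure Files share, available to Linux and Windows nodes.
    batchmodels.MountConfiguration(
        azure_file_share_configuration=batchmodels.AzureFileShareConfiguration(
            account_name="fsistorage",
            azure_file_url="https://fsistorage.file.core.windows.net/marketdata",
            account_key="<storage-key>",
            relative_mount_path="marketdata")),
    # NFS export (e.g., Azure NetApp Files), Linux nodes only.
    batchmodels.MountConfiguration(
        nfs_mount_configuration=batchmodels.NFSMountConfiguration(
            source="10.0.0.4:/riskmodels",
            relative_mount_path="riskmodels",
            mount_options="-o vers=4,minorversion=1")),
]

pool = batchmodels.PoolAddParameter(
    id="mounted-pool",
    vm_size="Standard_D4s_v3",
    virtual_machine_configuration=vm_config,  # as earlier
    target_dedicated_nodes=2,
    mount_configuration=mounts,
)
```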
Observability:
Batch processes can be probed using a set of APIs, command-line scripts, or the Azure portal to configure, manage and monitor the jobs in the queue. It also provides an accessible, if simplistic, dashboard through the Azure portal (exportable to Power BI) showing pending, running, failed and completed tasks, along with some service telemetry.
Developers and users are also able to leverage Azure’s native Monitor extensions to keep an eye on the service’s overall health and compute node uptime.
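A brief sketch of polling job state through the SDK, reusing the hypothetical job id from the earlier examples:

```python
from collections import Counter

# Tally task states for a job; simple polling, no Azure Monitor required.
states = Counter(t.state.value for t in client.task.list("mc-risk-job"))
print(dict(states))  # e.g. {'completed': 40, 'running': 8, 'active': 2}

# The service also keeps aggregate task counts per job.
counts = client.job.get_task_counts("mc-risk-job")
print(counts)
```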
Azure CycleCloud
Azure CycleCloud is an enterprise-level platform that enables the deployment of HPC clusters within minutes while offering broad scope for customising cluster features. These features include, but are not limited to, parallel file system configurations and storage options, heterogeneous or homogeneous SKUs, network and OS-level environment setups, and workload managers.
It provides the freedom to tune low-level workload manager configurations (e.g., chargeback, accounting, group resource allocation), monitor jobs and cluster telemetry in real time, and surface intuitive messaging about background operations occurring in Azure (e.g., scaling events, availability of SKUs in the Azure pools, subscription quota messages, etc.).
CycleCloud targets HPC operators (administrators and users) who want to deploy HPC systems on Azure, or to burst into Azure by replicating some of the infrastructure they run on-premises. These operators are particularly engaged in supporting applications, workflow engines and computational pipelines without having to retool their internal processes.
CycleCloud comes with a rich, declarative templating language that lets users describe their HPC cluster, from the cluster topology (the number and types of cluster nodes) down to the mount points and applications deployed on each node.
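A minimal, hypothetical template sketch gives a feel for the format; the cluster name, region, subnet and SKUs below are all assumptions, and the template is written out here as a Python string for convenience:

```python
# Write a minimal cluster template to disk; the format is CycleCloud's own.
template = """
[cluster fsi-slurm]
  [[node defaults]]
    Credentials = azure
    Region = uksouth
    SubnetId = hpc-rg/hpc-vnet/compute

  [[node scheduler]]
    MachineType = Standard_D8s_v3

  [[nodearray execute]]
    MachineType = Standard_F72s_v2
    MaxCoreCount = 1152
"""

with open("fsi-slurm.txt", "w") as fh:
    fh.write(template)

# Then, from a machine with the CycleCloud CLI configured:
#   cyclecloud import_cluster fsi-slurm -f fsi-slurm.txt
```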
The following sections highlight the capabilities of CycleCloud, looking at scaling, scheduling, communication modes, storage integration and observability.
Scaling:
CycleCloud ships with an auto-scaling feature embedded in the platform as libraries, and it is scheduler-agnostic: whichever scheduler an HPC team is comfortable provisioning and using, the auto-scaling libraries are injected into its orchestrating interface.
Whenever a user submits a job via the scheduler, the request is passed to the CycleCloud interface, which digests the node requirements, provisions the nodes from the Azure pool and configures them to the requirement. All of this happens behind the scenes without user intervention, leaving users and HPC operators free to focus on new tasks and innovations in the organisation.
Scheduling:
CycleCloud does not have internal job scheduling functionality as Batch does. Instead, it mandates a head node and execution nodes, and allows HPC administrators or operators to deploy their own schedulers onto the cluster or use pre-built ones. The full list of pre-built options is Open PBS, PBS Pro, Slurm, IBM's LSF, Grid Engine, HTCondor and Microsoft's HPC Pack.
Each scheduler can be customised as freely as in an on-premises environment: how it behaves under load, how jobs are handled under different criteria, the application of accounting and cost controls, and the enabling of monitoring and alerting.
This keeps things familiar for HPC operators, since they are already accustomed to the schedulers they choose to provision. Some Azure-specific integration work remains, but the cost of training is lowered substantially.
Communication modes:
CycleCloud uses virtual machine scale sets (VMSS) as the backbone for elastic compute. This model flexes as clusters come in and out of demand, relieving organisations of the cost burden of underutilised or over-provisioned clusters: it auto-scales as workloads ramp up on a given cluster, provisioning more nodes as requested and deprovisioning them once they have sat underutilised for a short period.
The VMSS model also gives HPC operators the freedom to specify which SKUs should be coupled and placed on the same "switch" or "server rack" behind the scenes in Azure, via declarative options in CycleCloud's templates.
This helps nodes located on the same "switch" or "rack" communicate faster than nodes sparsely distributed across a data centre. That freedom also extends to telling the cluster not to couple nodes together: some teams focus solely on high-throughput computing (e.g., training AI models), where each node operates independently with its own resources (e.g., GPUs) and inter-node communication is irrelevant. A hypothetical template fragment follows.
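As a sketch extending the earlier template example (the SKU, group id and limits are assumptions), a tightly coupled nodearray can be pinned into a single placement group, while an independent high-throughput nodearray would simply omit the setting:

```python
# Hypothetical nodearray fragment; PlacementGroupId asks CycleCloud to keep
# these nodes within one VM scale set placement group ("same switch").
mpi_nodearray = """
  [[nodearray mpi]]
    MachineType = Standard_HB120rs_v3
    Azure.MaxScalesetSize = 100
    PlacementGroupId = mpi-group-1
"""
```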
Storage:
CycleCloud offers multiple high-performance storage solutions dedicated to delivering performance and throughput when and where required. Once configured, these solutions attach as mount points on a provisioned cluster. Some of the available options are industry standards such as BeeGFS, Azure NetApp Files and Lustre.
There is also freedom to attach other parallel file systems, such as OpenZFS. Users are of course free to mount Azure-native storage options such as Azure Blob storage or Azure Files, but for highly intensive workloads these solutions lack the required performance.
Observability:
The core elements of observability in CycleCloud lie within the web server that ships with it and the scheduler chosen at deployment. Users and operators can probe their jobs and associated costs from the head node running the scheduler, and gain further information from the interactive portal.
It offers event logging relating to node health, node evictions and the inability to provision nodes when the Azure pool lacks availability. Alerting and logging can be tied to Azure-native messaging services such as Event Grid, for ingestion and onward transmission, or configured to send emails via SMTP to a dedicated, monitored mailbox within the organisation.
Either way, the observability stack gives users various degrees of freedom to keep track of spend, status and overall cluster health, much as they would on-premises with their own bespoke monitoring tools.
The Case for FSI Workloads
Choosing either platform for FSI workloads depends entirely on the degrees of freedom required and what each platform offers. With Azure Batch, creating a cluster and configuring its features and components (e.g., auto-scaling, storage, networks, etc.) can be done through the Batch REST API, the Batch command-line interface or click-ops in the portal.
There is no right answer, and each approach has its own pros and cons. It all depends on the user base and whether they are experienced developers or users who simply want to get the job done.
Azure CycleCloud, on the other hand, is oriented towards traditional HPC ecosystems of the kind found on-premises, built around schedulers and workload managers. The learning curve for CycleCloud is somewhat steeper than Batch's, since the service is provisioned atop traditional schedulers, but once that hurdle is overcome it is business as usual.
From a cost perspective, Azure Batch requires a minimum of one node to operate, and that node can be the sole compute instance. CycleCloud, by contrast, requires the VM hosting the service, a scheduler node and at least one compute instance to make up the "HPC cluster".
From a Role-Based Access Control (RBAC) perspective, Azure Batch integrates closely with Azure's RBAC schema: users are granted the permissions they need to operate, as with any other Azure service. CycleCloud stops short of that; access to the HPC cluster's nodes is dictated by local administrator accounts provisioned under the service layer, declared in the templates and created at the OS layer.
FSI workloads vary quite broadly. Some deal with algorithmic trading, market risk analysis and forecasting; others with transaction processing and monitoring, high-frequency trading, or data modelling based on historical data, among other use cases. If the workloads in question are Python-based, such as those dealing with transaction processing, the dependencies are easier to manage and Azure Batch would be a suitable choice.
If the workloads make intricate use of advanced vector extensions, as in high-frequency trading, algorithmic trading or even Monte Carlo simulations, then CycleCloud would be the better choice: compiling applications with flags specific to the compute infrastructure (for example, -O3 -march=native with GCC) makes all the difference to the speed of the calculations.
Ultimately, it depends on the nature of the workload, familiarity with workload managers and schedulers, and whether the focus is on reaching the fastest mean time to solution, on converging traditional HPC with cloud infrastructure, or on democratising it through a developer-oriented, click-ops platform.
Conclusion
The choice between Azure Batch and CycleCloud is highly dependent on the requirements set for dealing with HPC. Azure has invested in its ecosystem to provide platforms that appeal to both camps: developers and traditional system administrators. Every use case is different. The scientific computing industry prefers the traditional HPC ecosystem because it offers the freedom to customise infrastructure to its liking and to the workloads involved.
The financial services industry leans more towards getting things done, with or without a traditional HPC environment. An argument can certainly be made for CycleCloud by those who wish to spin up an HPC cluster with all its bells and whistles in minutes; others lean towards a developer approach or click-ops.
Whichever platform is used, users should keep in mind that achieving great performance requires an understanding of the underlying technology that enables it.
Get in Touch.
Let’s discuss how we can help with your cloud journey. Our experts are standing by to talk about your migration, modernisation, development and skills challenges.