The Multi-Agent System Balancer (MASB) is the prototype implementation of an AI-driven load balancer. It can be used to orchestrate task allocation on massive Cloud and cluster systems. The MASB framework offers an alternative approach to task allocation in that all processing of scheduling logic is offloaded to the nodes themselves. The full details of the presented solution can be found here.
A taxonomy analysing deployed and actively used workload schedulers can be found here. The article divides those systems into several hierarchical groups based on their architecture and design, namely Operating System, Cluster and Big Data schedulers. The review focuses on the key design factors that affect the throughput and scalability of a given solution, as well as the incremental improvements that refined each architecture. Special attention is given to Google’s Borg, which is one of the most advanced and published systems of this kind.
KEY DESIGN PRINCIPLES
The MASB project was developed on top of the AGOCS framework over several years, during which time it has undergone many changes in both the technology used and the design of the architecture. Examples of these improvements include the migration from Java to Scala, the change from thread pools to the Akka Actors/Streams framework, and the introduction of concurrency packages and non-locking object structures, such as TrieMap. The project remains under active development, with new features being added regularly.
The main design principles are as follows:
- To provide a stable and robust (i.e. no single point of failure) load balancer and scheduler for a Cloud-class system;
- To efficiently reduce the cost of scaling a Cloud-class system so that it can perform in an acceptable manner on smaller clusters (where there are tens of nodes) as well as on huge installations (where there are thousands of nodes);
- To provide an easy way of tuning the behaviours of a load balancer where the distribution of tasks across system nodes can be controlled.
In the prototype implementation of the MASB system, each node is represented by a Node Agent (NA). The NA monitors its node’s resource allocation levels, ensuring that the node is not overloaded. When the allocated tasks exceed the node’s resources, the NA’s AI module selects the overloading tasks and communicates with other agents in an attempt to offload them from its node.
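The overload check and task selection described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's Scala code. The class names, the two resource dimensions (CPU and memory), and the greedy "largest task first" rule are all assumptions made for illustration; the greedy rule is a simple stand-in for the meta-heuristic AI module, not a description of it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    name: str
    cpu: float
    mem: float


@dataclass
class NodeAgent:
    """Hypothetical Node Agent tracking the tasks allocated to one node."""
    cpu_capacity: float
    mem_capacity: float
    tasks: list = field(default_factory=list)

    def _exceeds(self, tasks):
        # True when the summed demand is above capacity in any dimension.
        cpu = sum(t.cpu for t in tasks)
        mem = sum(t.mem for t in tasks)
        return cpu > self.cpu_capacity or mem > self.mem_capacity

    def is_overloaded(self):
        return self._exceeds(self.tasks)

    def _relative_size(self, task):
        # Size of a task as a fraction of the node's scarcest resource.
        return max(task.cpu / self.cpu_capacity, task.mem / self.mem_capacity)

    def select_tasks_to_offload(self):
        """Greedy stand-in for the AI module: mark the largest tasks for
        offloading until the remaining tasks fit on the node."""
        remaining = sorted(self.tasks, key=self._relative_size)
        selected = []
        while remaining and self._exceeds(remaining):
            selected.append(remaining.pop())  # pop() removes the largest task
        return selected


agent = NodeAgent(cpu_capacity=4.0, mem_capacity=8.0, tasks=[
    Task("a", cpu=2.0, mem=2.0),
    Task("b", cpu=1.5, mem=3.0),
    Task("c", cpu=1.0, mem=2.0),
])
to_offload = agent.select_tasks_to_offload()
```

In this example the node is over its CPU capacity (4.5 requested against 4.0 available), and removing task "a" alone is enough to bring it back within limits.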
SERVICE ALLOCATION NEGOTIATION
The decentralisation of load balancing logic removes the dependence on a centralised cell state store. As such, the MASB can support much larger computing cells than the current limit of 25k nodes. The MASB prototype design was tested at up to 100k nodes, meaning that it could support the concurrent scheduling of over one million tasks. All communication in the system is performed via a purpose-built Service Allocation Negotiation protocol.
The framework uses loose coupling at every stage of its operations flow: scheduling decisions are made only on locally cached knowledge, and all communication between nodes is kept to a minimum.
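To make the idea of negotiating on locally cached knowledge more concrete, the sketch below simulates a single request/reply exchange in Python. It does not reproduce the actual Service Allocation Negotiation protocol: the `Agent` class, the `handle_request` and `offload` methods, the single scalar resource, and the rule of trying the peer with the most cached free capacity first are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent with one scalar resource and a local peer cache."""
    name: str
    capacity: float
    used: float = 0.0
    # Peer name -> last known free capacity. This may be stale.
    cache: dict = field(default_factory=dict)

    def free(self):
        return self.capacity - self.used

    def handle_request(self, demand):
        """Target side: accept only if the task fits the node's real state."""
        accepted = demand <= self.free()
        if accepted:
            self.used += demand
        return accepted, self.free()

    def offload(self, demand, peers):
        """Source side: contact only peers the cache says can fit the task,
        most free capacity first, and refresh the cache from each reply."""
        candidates = sorted(
            (name for name, free in self.cache.items() if free >= demand),
            key=lambda name: self.cache[name],
            reverse=True,
        )
        for name in candidates:
            accepted, free_now = peers[name].handle_request(demand)
            self.cache[name] = free_now  # reply carries fresh knowledge
            if accepted:
                self.used -= demand
                return name
        return None  # no peer accepted; the task stays where it is


# Node "a" is overloaded; its cached view of "b" is out of date.
a = Agent("a", capacity=4.0, used=5.0, cache={"b": 3.0, "c": 2.0})
b = Agent("b", capacity=4.0, used=3.5)
c = Agent("c", capacity=4.0, used=2.0)
target = a.offload(1.0, peers={"b": b, "c": c})
```

Here "b" rejects the request because its real free capacity is lower than the cached value, "a" corrects its cache from the reply, and "c" then accepts the task. No central state store is consulted at any point, which is the property the paragraph above describes.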
AI-DRIVEN LOAD BALANCING
The processing of scheduling logic is no longer constrained to the resources available on the head machine. As a result, more advanced algorithms can be used, leading to significant improvements in the quality of task allocation as well as greater scalability. Load balancing decisions are driven by meta-heuristic algorithms within the AI module and by easily tuneable rulesets.
The built-in scoring functions are focused on promoting the highest utilisation of nodes whilst leaving adequate headroom should service usage increase. Administrators can easily tune the scoring logic to achieve the best results to fit their desired parameters. This is of particular importance when building clusters designed for Big Data frameworks and where the allocation of a task close to its data source is critical.
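As an illustration of what such a scoring function might look like, the Python sketch below rewards high projected utilisation, rejects placements that would eat into a reserved headroom, and adds a bonus when the task's data is local to the node. The function names, the default headroom of 10%, and the locality bonus are assumptions chosen for the example and are not taken from the MASB rulesets.

```python
def score_node(used, capacity, demand, headroom=0.1,
               data_local=False, locality_bonus=0.2):
    """Hypothetical node score: higher is better, 0.0 means 'do not place'."""
    projected = (used + demand) / capacity
    if projected > 1.0 - headroom:
        return 0.0  # placement would consume the reserved headroom
    score = projected  # favour tightly packed nodes
    if data_local:
        score += locality_bonus  # favour nodes close to the task's data
    return score


def pick_node(nodes, demand, **tuning):
    """Return the best-scoring node name, or None if no node is suitable.

    `nodes` maps a node name to a (used, capacity, data_local) tuple.
    """
    best_name, best_score = None, 0.0
    for name, (used, capacity, data_local) in nodes.items():
        score = score_node(used, capacity, demand,
                           data_local=data_local, **tuning)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


nodes = {
    "n1": (2.0, 10.0, True),   # lightly loaded, holds the task's data
    "n2": (6.0, 10.0, False),  # well utilised, still has headroom
    "n3": (8.5, 10.0, False),  # placement would breach the headroom
}
default_choice = pick_node(nodes, demand=1.0)
locality_choice = pick_node(nodes, demand=1.0, locality_bonus=0.5)
```

With the default tuning the well-utilised node "n2" wins; raising `locality_bonus` shifts the choice to "n1", which is the kind of adjustment an administrator of a Big Data cluster might make to keep tasks close to their data.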
PERFORMANCE
High-quality task allocation decisions can be realised through a configurable set of node scoring functions and AI-driven offloading routines. The efficacy of this approach was demonstrated by throughput results comparable with the performance of Google’s world-class Borg scheduler.
The MASB could manage eight times the number of nodes of the original workload provided in traces from the Google Cluster Data repository without any decline in task allocation quality. This result suggests that the presented solution scales further while preserving the high utilisation of cluster nodes.
FURTHER INFORMATION
The presented concept of AI-driven scheduling has been developed over several years and the prototype is still being actively developed. If you are interested in the workings of the MASB or its potential use in your organisation, please contact us.