Using Grid Technologies in the Cloud for High Scalability

Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for Cloud User Group
Victoria Livschitz, CEO, Grid Dynamics
[email_address]
September 17th, 2008

A word about Grid Dynamics
- Who we are: global leader in scalability engineering
- Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
- Value proposition: fusion of innovation with best practices
- Focused on the “physics”, “economics” and “engineering” of extreme scale
- Founded in 2006, 30 people and growing, HQ in Silicon Valley
- Services
  - Technology consulting
  - Application & systems architecture, design, development
- Customers
  - Users of scalable applications: eBay, Bank of America, web start-ups
  - Makers of scalable middleware: GigaSpaces, Sun, Microsoft
- Partners: GridGain, GigaSpaces, Terracotta, Data Synapse, Sun, MS

Why am I speaking here tonight?
- We do scalability engineering for a living
- Cloud computing is new, very exciting and terribly over-hyped
  - Not a lot of solid data on performance, scalability, usability, stability…
- Many of our customers are early adopters or enablers
  - Their pains, discoveries and lessons are worth sharing
  - The practitioner perspective
- Recently completed 3 benchmark projects that we can make public
  - Results are presented here tonight

Exploring Scalability thru Benchmarking

1. Test scalability of EC2 on the simplest map-reduce problem
   Cloud: public commercial cloud, EC2
   Vendor: Amazon
   Middleware: GridGain
   Application: Monte-Carlo

2. Test scalability of data-driven HPC applications, similar to those used in practice
   Cloud: public commercial cloud, EC2
   Vendor: Amazon
   Middleware: GigaSpaces
   Application: Risk Management

3. Explore performance implications of data “in the cloud” vs. “outside the cloud”
   Cloud: incubator compute cloud for academic use, CompFin
   Vendor: Microsoft
   Middleware: Windows HPC Server, Velocity
   Application: Data-intensive Analytics

Benchmark #1: Scalability of Simple Map/Reduce Application on EC2

Basic Scalability of Simple Map/Reduce
- Goal: establish the upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
- Why Monte-Carlo: simple, widely used, perfectly scalable problem
- Why EC2: most popular public cloud
- Why GridGain: simple, open-source map-reduce middleware
- Intended claims:
  - EC2 scales linearly as a grid execution platform
  - GridGain scales linearly as map-reduce middleware
  - Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies
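To make "Monte-Carlo as a map/reduce problem" concrete, here is a minimal sketch that estimates pi by splitting random sampling across workers. It is not the benchmark's code and uses no GridGain API: the names `mapTask`, `reduceResults` and `estimatePi` are illustrative assumptions, and the "grid" is simulated by running the map tasks one after another in a single process.

```javascript
// Map step: one worker's share of the simulation. Counts random points
// that land inside the unit quarter-circle.
function mapTask(iterations, random) {
  let hits = 0;
  for (let i = 0; i < iterations; i++) {
    const x = random();
    const y = random();
    if (x * x + y * y <= 1) hits++;
  }
  return { hits, iterations };
}

// Reduce step: combine the partial counts into one estimate of pi.
function reduceResults(partials) {
  let hits = 0;
  let iterations = 0;
  for (const p of partials) {
    hits += p.hits;
    iterations += p.iterations;
  }
  return (4 * hits) / iterations;
}

// Local stand-in for the grid (an assumption): run the map tasks sequentially.
function estimatePi(totalIterations, workers, random = Math.random) {
  const perWorker = Math.floor(totalIterations / workers);
  const partials = [];
  for (let w = 0; w < workers; w++) {
    partials.push(mapTask(perWorker, random));
  }
  return reduceResults(partials);
}

console.log(estimatePi(400000, 8));
```

Because the map tasks share no state and the reduce step only sums counters, the problem scales with the number of workers as long as the platform and middleware do, which is what the benchmark sets out to test.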

Other Goals
- Understand “process bottlenecks” of the EC2 platform
  - Changes to the programming, deployment, management model
  - Ease of use
  - Security
  - Metering and payment
- Identify scalability bottlenecks at any level in the stack
  - EC2
  - GridGain
  - Glueware
- Robustness
- Stability
- Predictability

Architecture (diagram)
- Zones: Corporate Intranet, Amazon EC2 Cloud
- Components: Grid Console, HTTP Server, Head Node, OpenMQ Server (JMS), Worker Nodes, Spare Capacity
- Captions: Job Execution; Spare EC2 Instances; JMS Message Processing; Manages worker nodes and tasks; Discovery & Task Assignment; Controls Grid Operation; Configuration & Task Repository
- Technology stack: EC2, GridGain, Typica, OpenMQ

Performance Methodology & Results
- Same algorithm exercised on a wide range of node counts: 2, 4, 8, 16, …, 256, 512
  - Limited by Amazon's permission of 550 nodes
- Simultaneously double the amount of computation and the number of nodes
- Measure completion time
- Repeat several times to get statistical averages
- Conclusions
  - Total degradation from 13 to 16 seconds, or about 20%
  - Discarding the first 8 nodes, near-perfect scale up to 128
  - Slight degradation from 128 to 256 (3%) and from 256 to 512 (7%)
  - => Proves the point of near-linear scalability end-to-end
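The 13 and 16 second figures can be turned into a scaling metric in more than one way, and the slide does not say which one it uses. The sketch below shows two common readings under that assumption: weak-scaling efficiency (baseline time divided by scaled time) and relative slowdown. The function names are illustrative.

```javascript
// Weak scaling: work per node is constant, so the ideal completion time is flat.
// Efficiency is the baseline time divided by the time on the larger grid.
function weakScalingEfficiency(baselineSeconds, scaledSeconds) {
  return baselineSeconds / scaledSeconds;
}

// Relative slowdown of the scaled run compared with the baseline.
function slowdown(baselineSeconds, scaledSeconds) {
  return (scaledSeconds - baselineSeconds) / baselineSeconds;
}

const efficiency = weakScalingEfficiency(13, 16); // 0.8125
const extraTime = slowdown(13, 16);               // 3 / 13, about 0.23
console.log(efficiency.toFixed(4), extraTime.toFixed(4));
```

Read as efficiency, the run keeps about 81% of ideal throughput (a loss of about 19%). Read as slowdown, completion time grows by about 23%. Both are consistent with the rounded "20%" figure.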

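Simple scaling script

    // `grid` and `runTask` are provided by the benchmark framework.
    // Iterations assigned to each node: total work grows with the grid size.
    var itersPerNode = 5000;
    // Grid sizes to test.
    var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
    for (var i = 0; i < cnode.length; i++) {
        var n = cnode[i];
        grid.growEC2Grid(n, true);      // grow the EC2 grid to n nodes
        grid.waitForGridInstances(n);   // wait until n instances are in the grid
        runTask(itersPerNode * n, n, 3);
    }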

Observations
- Deployment considerations
  - Start-up for the whole grid in different configurations takes 0.5 - 3 min
  - 2-step deployment process
    - First, bring up one EC2 node as controller
    - Next, use the controller on the inside to coordinate bootstrapping
  - Some EC2 nodes don’t finish bootstrapping successfully
    - On average, 0.5% of nodes come up in an incomplete state
    - The nature of the problem is not clear
    - If exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting the computation
  - IP address deadlock issue
    - IP addresses of the nodes are needed to start & configure the grid
    - IP addresses are not available until the grid is up & configured
    - Need to carefully choreograph bootstrapping and pass IPs as parameters into controlling scripts
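The "kill off the sick ones and bring up a few new ones" advice can be sketched as a top-up loop. Everything here is hypothetical: `makeFakeCloud` simulates a cloud client in memory (it is not the EC2 or Typica API), and the `healthy` flag stands in for whatever health check the real bootstrapping uses.

```javascript
// Hypothetical cloud client, simulated in memory so the sketch can run.
// `failEvery` makes every Nth launched node come up in an incomplete state.
function makeFakeCloud(failEvery) {
  let launched = 0;
  return {
    launch(count) {
      const nodes = [];
      for (let i = 0; i < count; i++) {
        launched++;
        nodes.push({ id: "node-" + launched, healthy: launched % failEvery !== 0 });
      }
      return nodes;
    },
    terminate(node) {
      node.terminated = true;
    },
  };
}

// Launch nodes, terminate the sick ones, and top up until exactly `target`
// healthy nodes are available or the retry budget runs out.
function bringUpGrid(cloud, target, maxRounds = 5) {
  const healthy = [];
  for (let round = 0; round < maxRounds && healthy.length < target; round++) {
    const batch = cloud.launch(target - healthy.length);
    for (const node of batch) {
      if (node.healthy) healthy.push(node);
      else cloud.terminate(node);
    }
  }
  if (healthy.length < target) {
    throw new Error("could not reach " + target + " healthy nodes");
  }
  return healthy;
}

// One failure per 200 launches approximates the 0.5% rate reported above.
const grid = bringUpGrid(makeFakeCloud(200), 512);
console.log(grid.length); // 512
```

Doing the top-up before the computation starts keeps the node count, and therefore the measured processing power, the same across runs.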

Observations
- Monitoring considerations
  - Connection to each node from outside is possible, but not efficient
  - Check heartbeat from the internal management nodes
  - Local scripts must be stored on S3 or passed back before termination
- Programming model considerations
  - EC2 does not support IP multicast
    - Switched to JMS instead
    - Luckily, GridGain supports multiple protocols
  - Typica: 3rd-party connectivity library that uses the EC2 query interface
    - Undocumented limit on URL length is hit with 100s of nodes
    - Amazon just disconnects on improper URLs without specifying the error, so debugging was hard
    - Workaround: rewrote some parts of our framework to inquire about individual running nodes. Works, but less efficient
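The workaround described above queries nodes individually. A different option, shown here only as a sketch and not as what the authors did, is to batch instance IDs so that each query URL stays under a chosen length. The limit is undocumented, so `maxLength` is a caller-supplied guess, and the base URL is a placeholder. The `InstanceId.N` parameter naming follows the EC2 Query API convention.

```javascript
// Query parameter for the Nth instance ID in a request.
function paramFor(index, id) {
  return "&InstanceId." + index + "=" + encodeURIComponent(id);
}

// Split instance IDs into batches so that each resulting URL stays within
// `maxLength` characters. A single ID that is too long still gets its own batch.
function batchByUrlLength(baseUrl, instanceIds, maxLength) {
  const batches = [];
  let current = [];
  let length = baseUrl.length;
  for (const id of instanceIds) {
    let param = paramFor(current.length + 1, id);
    if (current.length > 0 && length + param.length > maxLength) {
      batches.push(current);
      current = [];
      length = baseUrl.length;
      param = paramFor(1, id); // numbering restarts in the new batch
    }
    current.push(id);
    length += param.length;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Build the query URL for one batch.
function buildUrl(baseUrl, batch) {
  return baseUrl + batch.map((id, i) => paramFor(i + 1, id)).join("");
}

// Placeholder endpoint and made-up instance IDs, for illustration only.
const exampleBase = "https://ec2.example.invalid/?Action=DescribeInstances";
const exampleIds = Array.from({ length: 300 }, (_, i) => "i-" + String(i).padStart(8, "0"));
const exampleBatches = batchByUrlLength(exampleBase, exampleIds, 2000);
console.log(exampleBatches.length, buildUrl(exampleBase, exampleBatches[0]).length);
```

Batching trades a handful of requests for one oversized request, which sits between the failing single call and the per-node queries the authors fell back to.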

Observations
- Metering and payment
  - Amazon sets a limit on concurrent VMs
    - Eventually got approval for 550 VMs after some due diligence from Amazon
  - Amazon charges by full or partial VM-hours
  - Sometimes, short usage of VMs is not metered
    - Not clear why
    - One hypothesis: metering “sweeps” happen every so often
  - Be careful with usage bills for testing
    - A test may need to be run multiple times
    - Beware of rogue scripts
- Test everything on smaller configurations first
  - Scale gradually, or you will miss the bottlenecks
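Charging by full or partial VM-hours means a short test on a large grid is billed as if every node ran for a whole hour. The sketch below works through that arithmetic. It assumes every started hour is charged in full and that a grid torn down and restarted is billed again for each run. The price is left as a parameter because no rate appears in the slides.

```javascript
// Billable instance-hours when every started hour is charged in full.
function billedInstanceHours(nodes, minutesPerRun, runs) {
  const hoursPerRun = Math.ceil(minutesPerRun / 60);
  return nodes * hoursPerRun * runs;
}

// Cost for a given (hypothetical) price per instance-hour.
function estimatedCost(nodes, minutesPerRun, runs, pricePerInstanceHour) {
  return billedInstanceHours(nodes, minutesPerRun, runs) * pricePerInstanceHour;
}

// 512 nodes, a 5-minute test repeated 10 times with a fresh grid each time.
console.log(billedInstanceHours(512, 5, 10)); // 5120
// The same 10 runs back to back on one grid kept up for 50 minutes.
console.log(billedInstanceHours(512, 50, 1)); // 512
```

Under these assumptions, reusing one grid for repeated runs is ten times cheaper than restarting it for each run, which is one reason to debug on small configurations first.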

Achieving scalability
- Software breaks at scale, including the glueware
  - Barrier #1 was hit at 100 nodes because of ActiveMQ scalability
    - Correction: switched from ActiveMQ to OpenMQ
    - Comment: some users report better ActiveMQ scalability with 5.x
  - Barrier #2 was hit at 300 nodes because of the Typica URL length limit
    - Correction: changed our use of the API
- Security considerations
  - EC2 credentials are passed to the Head Node
  - 3rd-party GridGain tasks can access them
  - Sounds like a potential vulnerability

What have we learned?
- EC2 is ready for production usage on large-scale stateless computations
  - Price/performance
  - Strong linear scale curve
- GridGain showed itself very well
  - Scale, stability, ease of use, pluggability
  - Solid open-source choice of map-reduce middleware
- Some level of effort is required to “port” a grid system to EC2
  - Deployment, monitoring, programming model, metering, security

What’s next?
- Can we go higher than 512?
- What is the behavior of more complex applications?

Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2

Data-driven Risk Management on EC2
- Goal: investigate scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces
- Why risk management: class of problems widely used in financial services
- Why GigaSpaces: leading middleware platform for compute & data grids
- Intended claims:
  - EC2 scales linearly for data-driven HPC applications
  - GigaSpaces scales well as both compute and data grid middleware
  - Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies

Architecture (diagram)
- Components: Amazon EC2 Grid, Compute Grid, Data Grid, Grid Console, Service Grid Manager, Master
- Master writes tasks into the data grid and waits for results…
- Workers take tasks, perform calculations, write results back
- User uses ec2-gdc-tools to manage the grid
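The master-worker flow through a shared data grid can be sketched with a tiny in-memory stand-in. This is not the GigaSpaces API: `Space`, `write` and `take` are illustrative names, `take` returns `null` instead of blocking, and the single worker runs in the same process rather than concurrently on other nodes.

```javascript
// Minimal in-memory stand-in for a data grid: write() stores an entry,
// take() removes and returns the first entry of the given type, or null.
class Space {
  constructor() {
    this.entries = [];
  }
  write(entry) {
    this.entries.push(entry);
  }
  take(type) {
    const index = this.entries.findIndex((e) => e.type === type);
    return index === -1 ? null : this.entries.splice(index, 1)[0];
  }
}

// Master: split the job into tasks and write them into the space.
function submitTasks(space, inputs) {
  inputs.forEach((input, id) => space.write({ type: "task", id, input }));
  return inputs.length;
}

// Worker: take tasks until none are left, writing one result per task.
function runWorker(space, compute) {
  let task;
  while ((task = space.take("task")) !== null) {
    space.write({ type: "result", id: task.id, value: compute(task.input) });
  }
}

// Master: collect the expected number of results, ordered by task id.
function collectResults(space, expected) {
  const values = new Array(expected);
  for (let i = 0; i < expected; i++) {
    const result = space.take("result");
    if (result === null) throw new Error("missing result");
    values[result.id] = result.value;
  }
  return values;
}

const space = new Space();
const count = submitTasks(space, [1, 2, 3, 4]);
runWorker(space, (x) => x * x);
console.log(collectResults(space, count)); // [ 1, 4, 9, 16 ]
```

Because master and workers only meet through the space, adding workers needs no change to the master. The same property makes the data grid the shared resource that every worker contends for.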

Performance methodology & results
- Same algorithm exercised on a wide range of node counts: 16, 32, 128, 256, 512
  - Still limited by Amazon's permission of 550
- Constant size of data grid (4 large EC2 nodes)
- Double the nodes with a constant amount of work
- Measure completion time (strive for linear time reduction)
- Conclusions
  - Near-perfect scale from 16 to 256 nodes
  - 28% degradation from 256 to 512, since the data cache becomes a bottleneck
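With a constant amount of work, doubling the nodes should ideally halve the completion time. The sketch below shows one way to express degradation against that ideal. The slides do not give the raw timings or the exact formula behind the 28% figure, so the numbers used here are hypothetical and chosen only to illustrate the arithmetic.

```javascript
// Strong scaling: total work is fixed, so the ideal time shrinks in
// proportion to the number of nodes.
function idealTime(baseTime, baseNodes, nodes) {
  return (baseTime * baseNodes) / nodes;
}

// Fraction by which the measured time exceeds the ideal time.
function degradation(measuredTime, baseTime, baseNodes, nodes) {
  return measuredTime / idealTime(baseTime, baseNodes, nodes) - 1;
}

// Hypothetical timings (not from the benchmark): 100 s on 256 nodes and
// 64 s on 512 nodes. The ideal time on 512 nodes would be 50 s.
console.log(degradation(64, 100, 256, 512).toFixed(2)); // 0.28
```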

What have we learned?
- EC2 is ready for production usage for classes of large-scale data-driven HPC applications, common to Risk Management
- GigaSpaces showed itself very well
  - Compute-data grid scales well in the master-worker pattern
- Some level of effort is required to “port” a grid system to EC2
  - Deployment, monitoring, programming model, metering, security
  - Bootstrapping this system is far more complex than GridGain’s. For more details, contact me offline

What’s next?
- How does the data grid scale?
- What about more complex applications?
- What’s the scalability of a co-located compute-data grid configuration?

Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications

Data-intensive Analytics on MS cloud
- Goal: investigate performance improvements from data “in the cloud” vs. “outside the cloud” for complex data-intensive analytical applications in the context of the HPC CompFin++ Labs environment using Velocity
- What is CompFin++ Labs: MS-funded “incubator” compute cloud for exploration of modern compute & data challenges on massive scale
- What is Velocity: MS’s new in-memory data grid middleware, still in CTP1
- The model: computes correlation between stock prices over time
  - Algorithms use a significant amount of data which could be cached
  - Maximum cache hit ratio for the model is around 90%
- Intended claims: measure the impact of data “closeness” to the computation on the cloud
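A 90% cache hit ratio bounds how much a cache can help, because the remaining misses still pay the full remote cost. The sketch below uses the standard expected-access-time formula. The latencies are hypothetical and not taken from the benchmark.

```javascript
// Expected cost of one data access for a given cache hit ratio.
function expectedAccessTime(hitRatio, cacheTime, remoteTime) {
  return hitRatio * cacheTime + (1 - hitRatio) * remoteTime;
}

// Hypothetical latencies in milliseconds: 1 ms from cache, 100 ms remote.
const uncached = expectedAccessTime(0, 1, 100);
const cached = expectedAccessTime(0.9, 1, 100);
console.log((uncached / cached).toFixed(1)); // 9.2
```

With these made-up numbers the cache gives roughly a 9x improvement rather than 100x, since the 10% of misses dominate the average. How close the data sits to the computation on a miss therefore still matters.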

Architecture: Anticipated Bottlenecks

Architecture: CompFin + Velocity

Benchmarked configurations
- Same analytical model with complex queries
- Perfect linear scale curve (baseline)
- Original CompFin
- Distributed cache (original CompFin + Velocity distributed cache for financial data)
- Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing)
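Data-aware routing sends each task to the node that already holds the data it needs, so the near cache on that node is hit instead of the distributed cache. The sketch below shows the general idea with a simple hash partitioner. It is not Velocity's or CompFin's routing logic: the hash function, the node names and the `symbol` field are all illustrative assumptions.

```javascript
// Map a data key (for example a stock symbol) to a partition using a
// simple polynomial string hash kept within unsigned 32 bits.
function partitionFor(key, partitionCount) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % partitionCount;
}

// Group tasks by the node that owns their data, so each node mostly
// reads keys that its near cache already holds.
function routeTasks(tasks, nodes) {
  const assignment = new Map(nodes.map((node) => [node, []]));
  for (const task of tasks) {
    const node = nodes[partitionFor(task.symbol, nodes.length)];
    assignment.get(node).push(task);
  }
  return assignment;
}

// Hypothetical tasks and node names.
const tasks = [
  { symbol: "MSFT", day: 1 },
  { symbol: "MSFT", day: 2 },
  { symbol: "JAVA", day: 1 },
];
const plan = routeTasks(tasks, ["node-a", "node-b", "node-c"]);
for (const [node, assigned] of plan) {
  console.log(node, assigned.length);
}
```

All tasks for the same symbol land on the same node, which is what lets a local cache absorb repeated reads and take load off the distributed cache.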

Test methodology
- 3 ways of measuring scalability were used
  - Fixed amount of computations, increasing amount of data
  - Fixed amount of data, increasing amount of computations
  - Proportional increase of computations and nodes
- “Node” = 1 core
- “Data unit” = 32 million records or 512 megabytes of tick data

Column groups: Test 1, Test 2, Test 3

Test #       1    2    3    4    5    6    7    8    9
Nodes        8   32   32   32   32   32   64  128  200
Data Units   1    1    1    6   12   12   24   48   69

Conclusions
- Data “on the cloud” definitely matters!
  - Performance improvements of up to 31 times over “outside the cloud”
- Velocity distributed cache has some scalability challenges
  - Failure on a 50-node cluster with 200 concurrent clients
  - Good news: it’s a very young product and MS is actively improving it
- Compute-data affinity matters too!
  - Significant performance gain of local cache over distributed cache
  - Local cache resolved the distributed cache scalability issue by reducing its load

Final Remarks
- Clouds are proving themselves out
  - Early adopters are there already
  - The rest of the real world will join soon
- There are still significant adoption challenges
  - Technology immaturity
  - Lack of real data, best practices, robust design patterns
  - “Fitting” of application middleware to cloud platforms is just starting
- Amazon is the leading commercial cloud provider, but is not the only game in town
  - Companies are building public, private, dedicated and special-purpose clouds

Thank You!
Victoria Livschitz
[email_address]
