Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for Cloud User Group
Victoria Livschitz, CEO, Grid Dynamics [email_address]
September 17th, 2008
A word about Grid Dynamics
Who we are: global leader in scalability engineering
  Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
  Value proposition: fusion of innovation with best practices
  Focused on the “physics”, “economics” and “engineering” of extreme scale
  Founded in 2006, 30 people and growing, HQ in Silicon Valley
Services
  Technology consulting
  Application & systems architecture, design, development
Customers
  Users of scalable applications: eBay, Bank of America, web start-ups
  Makers of scalable middleware: GigaSpaces, Sun, Microsoft
Partners: GridGain, GigaSpaces, Terracotta, Data Synapse, Sun, MS
Why am I speaking here tonight?
We do scalability engineering for a living
Cloud computing is new, very exciting and terribly over-hyped
  Not a lot of solid data on performance, scalability, usability, stability…
Many of our customers are early adopters or enablers
  Their pains, discoveries and lessons are worth sharing
The practitioner perspective
  Recently completed 3 benchmark projects that we can make public
  Results are presented here tonight
Exploring Scalability through Benchmarking
Benchmark 1: Test scalability of EC2 on the simplest map-reduce problem
  Cloud: Public commercial cloud, EC2
  Vendor: Amazon
  Middleware: GridGain
  Application: Monte-Carlo
Benchmark 2: Test scalability of data-driven HPC applications, similar to those used in practice
  Cloud: Public commercial cloud, EC2
  Vendor: Amazon
  Middleware: GigaSpaces
  Application: Risk Management
Benchmark 3: Explore performance implications of data “in the cloud” vs. “outside the cloud”
  Cloud: Incubator compute cloud for academic use, CompFin
  Vendor: Microsoft
  Middleware: Windows HPC Server, Velocity
  Application: Data-intensive Analytics
Benchmark #1: Scalability of Simple Map/Reduce Application on EC2
Basic Scalability of Simple Map/Reduce
Goal: Establish an upper limit on the scalability of Monte-Carlo simulations performed on EC2 using GridGain
Why Monte-Carlo: simple, widely used, perfectly scalable problem
Why EC2: most popular public cloud
Why GridGain: simple, open-source map-reduce middleware
Intended Claims:
  EC2 scales linearly as a grid execution platform
  GridGain scales linearly as map-reduce middleware
  Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies
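To illustrate why Monte-Carlo splits so cleanly into map and reduce steps, here is a minimal single-process sketch. It assumes a toy problem (estimating pi), which is not the simulation used in the benchmark, and it does not use the GridGain API. The names `mapJob`, `reduceJobs` and `estimatePi` are hypothetical; the seeded generator only makes the run repeatable.

```javascript
// Small seeded PRNG (mulberry32) so that the sketch is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Map step: one node's independent share of the samples.
function mapJob(samples, seed) {
  const rand = mulberry32(seed);
  let inside = 0;
  for (let i = 0; i < samples; i++) {
    const x = rand();
    const y = rand();
    if (x * x + y * y <= 1) inside++;
  }
  return { inside, samples };
}

// Reduce step: combine the partial counts into one estimate.
function reduceJobs(partials) {
  let inside = 0;
  let samples = 0;
  for (const p of partials) {
    inside += p.inside;
    samples += p.samples;
  }
  return (4 * inside) / samples;
}

// Jobs share no state, so each could run on a different node.
// Here they simply run one after another in the same process.
function estimatePi(itersPerNode, nodeCount) {
  const partials = [];
  for (let node = 0; node < nodeCount; node++) {
    partials.push(mapJob(itersPerNode, node + 1));
  }
  return reduceJobs(partials);
}

const estimate = estimatePi(5000, 8);
console.log("estimate of pi:", estimate);
```

Because the only communication is one small partial result per job, adding nodes adds work capacity without adding coordination, which is what makes the problem a clean probe of platform and middleware overhead.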
Other Goals
Understand “process bottlenecks” of the EC2 platform
  Changes to the programming, deployment, management model
  Ease of use
  Security
  Metering and payment
Identify scalability bottlenecks at any level in the stack
  EC2
  GridGain
  Glueware
Robustness
Stability
Predictability
Architecture
[Diagram] Zones: Corporate Intranet, Amazon EC2 Cloud
Components: Grid Console, HTTP Server, Head Node, OpenMQ Server (JMS), Worker Nodes, Spare Capacity
Labels: Job Execution; Spare EC2 Instances; JMS Message Processing; Manages worker nodes and tasks; Discovery & Task Assignment; Controls Grid Operation; Configuration & Task Repository
Technology Stack: EC2, GridGain, Typica, OpenMQ
Performance Methodology & Results
Same algorithm exercised on a wide range of node counts: 2, 4, 8, 16, …, 256, 512
  Limited by Amazon permission of 550 nodes
Simultaneously double the amount of computation and the number of nodes
Measure completion time
Repeat several times to get statistical averages
Conclusions
  Total degradation from 13 to 16 seconds, or about 20%
  Discarding the first 8 nodes, near-perfect scaling up to 128
  Slight degradation from 128 to 256 (3%) and from 256 to 512 (7%)
  => Proves the point of near-linear scalability end-to-end
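Since work and nodes are doubled together, this is a weak-scaling test: the ideal completion time stays flat. The sketch below shows two common ways to express the degradation, using only the 13-second and 16-second endpoints quoted above. The function names are hypothetical.

```javascript
// Weak scaling: work grows with the node count, so ideal time is constant.
// Efficiency is the baseline time divided by the observed time.
function weakScalingEfficiency(baseSeconds, seconds) {
  return baseSeconds / seconds;
}

// Extra time relative to the baseline run.
function relativeSlowdown(baseSeconds, seconds) {
  return (seconds - baseSeconds) / baseSeconds;
}

const efficiency = weakScalingEfficiency(13, 16);
const extraTime = relativeSlowdown(13, 16);
console.log("efficiency:", efficiency, "extra time:", extraTime);
```

Efficiency comes out at 13/16, a loss of just under 19%, while the extra time is 3/13, about 23%. The two measures bracket the roughly 20% quoted on the slide.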
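Simple scaling script
    // Grow the grid step by step and run the same per-node workload at each size.
    var itersPerNode = 5000;
    var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
    for (var i = 0; i < cnode.length; i++) {
      var n = cnode[i];
      grid.growEC2Grid(n, true);       // request a grid of n nodes
      grid.waitForGridInstances(n);    // wait for n instances before running
      runTask(itersPerNode * n, n, 3); // total iterations grow in proportion to n
    }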
Observations
Deployment considerations
  Start-up of the whole grid in different configurations takes 0.5 - 3 min
  2-step deployment process
    First, bring up one EC2 node as the controller
    Next, use the controller on the inside to coordinate bootstrapping
  Some EC2 nodes don’t finish bootstrapping successfully
    On average, 0.5% of nodes come up in an incomplete state
    The nature of the problem is not clear
    If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting the computation
  IP address deadlock issue
    IP addresses of the nodes are needed to start & configure the grid
    IP addresses are not available until the grid is up & configured
    Need to carefully choreograph bootstrapping and pass IPs as parameters into the controlling scripts
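The "kill off the sick ones and bring up a few new ones" step can be sketched as a small loop. This is an illustration only: the `cloud` object with `launch`, `isHealthy` and `terminate` is a hypothetical interface, not the Typica or EC2 API, and the fake cloud below exists purely to exercise the loop.

```javascript
// Keep launching replacements until `target` healthy nodes exist
// or `maxRounds` launch rounds have been used.
function provisionHealthyNodes(cloud, target, maxRounds) {
  const healthy = [];
  for (let round = 0; round < maxRounds && healthy.length < target; round++) {
    const started = cloud.launch(target - healthy.length);
    for (const id of started) {
      if (cloud.isHealthy(id)) {
        healthy.push(id);
      } else {
        cloud.terminate(id); // sick node: kill it, replace it next round
      }
    }
  }
  return healthy;
}

// Fake cloud for demonstration: every 4th node launched comes up sick.
function makeFakeCloud() {
  let next = 0;
  const terminated = [];
  return {
    terminated,
    launch(count) {
      const ids = [];
      for (let i = 0; i < count; i++) ids.push(next++);
      return ids;
    },
    isHealthy(id) {
      return id % 4 !== 3;
    },
    terminate(id) {
      terminated.push(id);
    },
  };
}

const cloud = makeFakeCloud();
const nodes = provisionHealthyNodes(cloud, 8, 5);
console.log("healthy:", nodes.length, "terminated:", cloud.terminated);
```

Running the health check before the computation starts keeps the node count exact, at the price of a slightly longer start-up.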
Observations
Monitoring considerations
  Connecting to each node from outside is possible, but not efficient
  Check heartbeats from the internal management nodes
  Local scripts must be stored on S3 or passed back before termination
Programming model considerations
  EC2 does not support IP multicast
    Switched to JMS instead
    Luckily, GridGain supports multiple protocols
  Typica: 3rd-party connectivity library that uses the EC2 query interface
    An undocumented limit on URL length is hit with 100s of nodes
    Amazon just disconnects on improper URLs without specifying the error, so debugging was hard
    Workaround: rewrote some parts of our framework to enquire about individual running nodes. Works, but less efficient
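The workaround above amounts to one query per node. As a hedged variation, the sketch below batches instance ids so that each query URL stays under an assumed length limit. The real limit is undocumented, so `maxUrlLength` is a guess to be tuned. The base URL is a placeholder, request signing is omitted, and this is not the Typica API.

```javascript
// Build a DescribeInstances-style query URL for a batch of instance ids.
// Authentication and signature parameters are deliberately left out.
function buildQuery(baseUrl, ids) {
  const params = ids.map(
    (id, i) => "InstanceId." + (i + 1) + "=" + encodeURIComponent(id)
  );
  return baseUrl + "?Action=DescribeInstances&" + params.join("&");
}

// Split ids into batches whose query URL fits within maxUrlLength.
// A single id always forms a batch, even if its URL alone is too long.
function batchByUrlLength(baseUrl, ids, maxUrlLength) {
  const batches = [];
  let current = [];
  for (const id of ids) {
    const candidate = current.concat([id]);
    const tooLong = buildQuery(baseUrl, candidate).length > maxUrlLength;
    if (current.length > 0 && tooLong) {
      batches.push(current);
      current = [id];
    } else {
      current = candidate;
    }
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Hypothetical inputs: 300 made-up instance ids and an assumed limit.
const baseUrl = "https://ec2.example.invalid/";
const ids = [];
for (let i = 0; i < 300; i++) ids.push("i-" + i.toString(16).padStart(8, "0"));
const batches = batchByUrlLength(baseUrl, ids, 2000);
console.log("queries needed:", batches.length);
```

Batching trades the inefficiency of per-node queries against the risk of hitting the limit again, so a conservative `maxUrlLength` is the safer choice.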
Observations
Metering and payment
  Amazon sets a limit on concurrent VMs
    Eventually got approval for 550 VMs after some due diligence from Amazon
  Amazon charges by full or partial VM-hours
  Sometimes, short usage of VMs is not metered
    Not clear why
    One hypothesis: metering “sweeps” happen every so often
  Be careful with usage bills for testing
    A test may need to be run multiple times
    Beware of rogue scripts
    Test everything on smaller configurations first
    Scale gradually, or you will miss the bottlenecks
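A rough way to anticipate a testing bill is sketched below. It assumes that every partial hour is billed as a full hour and that VMs are shut down after each run, so every repeat is billed again. The hourly rate is a placeholder, not a quoted Amazon price, and `estimateTestCost` is a hypothetical helper.

```javascript
// Estimate billed VM-hours and cost for repeated short test runs.
function estimateTestCost(vmCount, minutesPerRun, runs, hourlyRate) {
  const billedHoursPerVm = Math.ceil(minutesPerRun / 60); // partial hour rounds up
  const vmHours = vmCount * billedHoursPerVm * runs;
  return { vmHours, cost: vmHours * hourlyRate };
}

// Hypothetical example: 512 VMs, 10-minute runs, 5 repeats, rate of 0.10 per hour.
const bill = estimateTestCost(512, 10, 5, 0.10);
console.log("billed VM-hours:", bill.vmHours, "estimated cost:", bill.cost);
```

Under these assumptions a 10-minute test costs the same as a 60-minute one, which is one reason to validate on small configurations before scaling up.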
Achieving scalability
Software breaks at scale, including the glueware
  Barrier #1 was hit at 100 nodes because of ActiveMQ scalability
    Correction: switched from ActiveMQ to OpenMQ
    Comment: some users report better ActiveMQ scalability with 5.x
  Barrier #2 was hit at 300 nodes because of the Typica URL length limit
    Correction: changed our use of the API
Security considerations
  EC2 credentials are passed to the Head Node
  3rd-party GridGain tasks can access them
  Sounds like a potential vulnerability
What have we learned?
EC2 is ready for production usage on large-scale stateless computations
  Price/performance
  Strong linear scale curve
GridGain showed itself very well
  Scale, stability, ease of use, pluggability
  Solid open-source choice of map-reduce middleware
Some level of effort is required to “port” a grid system to EC2
  Deployment, monitoring, programming model, metering, security
What’s next?
  Can we go higher than 512?
  What is the behavior of more complex applications?
Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2
Data-driven Risk Management on EC2
Goal: Investigate the scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces
Why risk management: class of problems widely used in financial services
Why GigaSpaces: leading middleware platform for compute & data grids
Intended Claims:
  EC2 scales linearly for data-driven HPC applications
  GigaSpaces scales well as both compute and data grid middleware
  Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies
Architecture
[Diagram labels: Amazon EC2 Grid Compute Grid Data Grid Grid Console Service Grid Manager Master]
User uses ec2-gdc-tools to manage the grid
Master writes tasks into the data grid and waits for results…
Workers take tasks, perform calculations, write results back
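The master-worker flow on this slide can be sketched with an in-memory stand-in for the data grid. This is not the GigaSpaces API: `Space`, `workerStep` and `runMaster` are hypothetical names, workers are simulated round-robin in one process, and the calculation is a placeholder.

```javascript
// In-memory stand-in for the data grid shared by master and workers.
class Space {
  constructor() {
    this.tasks = [];
    this.results = [];
  }
  writeTask(task) {
    this.tasks.push(task);
  }
  takeTask() {
    return this.tasks.shift(); // undefined when no task is left
  }
  writeResult(result) {
    this.results.push(result);
  }
}

// Worker: take a task, perform the calculation, write the result back.
function workerStep(space, workerId) {
  const task = space.takeTask();
  if (task === undefined) return false;
  const value = task.inputs.reduce((sum, x) => sum + x * x, 0); // placeholder work
  space.writeResult({ taskId: task.id, workerId, value });
  return true;
}

// Master: write all tasks into the space, then wait until they are drained.
function runMaster(space, taskCount, workerCount) {
  for (let id = 0; id < taskCount; id++) {
    space.writeTask({ id, inputs: [id, id + 1] });
  }
  let busy = true;
  while (busy) {
    busy = false;
    for (let w = 0; w < workerCount; w++) {
      if (workerStep(space, w)) busy = true;
    }
  }
  return space.results;
}

const results = runMaster(new Space(), 6, 3);
console.log(results);
```

Because workers pull tasks rather than having them pushed, adding workers needs no change on the master side. The shared space then becomes the component whose capacity limits the whole system.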
Performance methodology & results
Same algorithm exercised on a wide range of node counts: 16, 32, 128, 256, 512
  Still limited by Amazon permission of 550
Constant size of data grid (4 large EC2 nodes)
Double the nodes with a constant amount of work
Measure completion time (strive for linear time reduction)
Conclusions
  Near-perfect scaling from 16 to 256 nodes
  28% degradation from 256 to 512, since the data cache becomes a bottleneck
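Here the work is constant while nodes double, so this is a strong-scaling test: the ideal completion time halves at each step. The sketch below shows how speedup and efficiency are usually computed for such a test. The timings are made-up round numbers for illustration, not measurements from this benchmark, and `strongScaling` is a hypothetical helper.

```javascript
// Strong scaling: fixed work, so ideal speedup equals the node ratio.
function strongScaling(baseNodes, baseSeconds, nodes, seconds) {
  const speedup = baseSeconds / seconds;
  const idealSpeedup = nodes / baseNodes;
  return { speedup, efficiency: speedup / idealSpeedup };
}

// Made-up timings, NOT benchmark results.
const runs = [
  { nodes: 16, seconds: 1600 },
  { nodes: 32, seconds: 800 },
  { nodes: 512, seconds: 64 },
];

const report = runs.map((r) => ({
  nodes: r.nodes,
  ...strongScaling(16, 1600, r.nodes, r.seconds),
}));
console.log(report);
```

An efficiency that stays near 1 and then drops at the largest size is the signature of a shared resource, such as a fixed-size data grid, saturating.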
What have we learned?
EC2 is ready for production usage for classes of large-scale data-driven HPC applications, common to Risk Management
GigaSpaces showed itself very well
  Compute-data grid scales well in the master-worker pattern
Some level of effort is required to “port” a grid system to EC2
  Deployment, monitoring, programming model, metering, security
  Bootstrapping this system is far more complex than GridGain’s. For more details, contact me offline
What’s next?
  How does the data grid scale?
  What about more complex applications?
  What’s the scalability of a co-located compute-data grid configuration?
Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications
Data-intensive Analytics on MS cloud
Goal: Investigate performance improvements from data “in the cloud” vs. “outside the cloud” for complex data-intensive analytical applications in the context of the HPC CompFin++ Labs environment using Velocity
What is CompFin++ Labs: MS-funded “incubator” compute cloud for exploration of modern compute & data challenges on massive scale
What is Velocity: MS’s new in-memory data grid middleware, still in CTP1
The Model: Computes correlation between stock prices over time. Algorithms use a significant amount of data which could be cached. Maximum cache hit ratio for the model is around 90%.
Intended Claims: Measure impact of data “closeness” to the computation on the cloud
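The slide does not say which correlation measure the model uses. As an assumption, the sketch below computes the Pearson correlation coefficient between two price series of equal length. The sample prices are invented for illustration.

```javascript
// Pearson correlation coefficient of two equally long numeric series.
// Returns NaN if either series is constant (zero variance).
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two series of equal length, at least 2 points");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Invented price series.
const stockA = [10, 11, 12, 13];
const stockB = [20, 22, 24, 26]; // moves with stockA
const stockC = [8, 6, 4, 2];     // moves against stockA
console.log(pearson(stockA, stockB), pearson(stockA, stockC));
```

Each pairwise computation reads two full series, so repeated pairs over the same stocks re-read the same data. That reuse is what makes caching worthwhile for this kind of model.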
Architecture: CompFin
Architecture: Anticipated Bottlenecks
Architecture: CompFin + Velocity
Benchmarked configurations
Same analytical model with complex queries
  Perfect linear scale curve (baseline)
  Original CompFin
  Distributed cache (original CompFin + Velocity distributed cache for financial data)
  Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing)
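The "near cache with data-aware routing" configuration can be sketched as follows. This is a single-process illustration, not the Velocity API: `DistributedCache`, `ComputeNode` and `routeByKey` are hypothetical names, and the string hash is an arbitrary choice.

```javascript
// Stand-in for the shared distributed cache; counts remote reads.
class DistributedCache {
  constructor(data) {
    this.data = new Map(Object.entries(data));
    this.remoteReads = 0;
  }
  get(key) {
    this.remoteReads++;
    return this.data.get(key);
  }
}

// Compute node with a local (near) cache in front of the distributed one.
class ComputeNode {
  constructor(remote) {
    this.remote = remote;
    this.near = new Map();
  }
  read(key) {
    if (!this.near.has(key)) {
      this.near.set(key, this.remote.get(key)); // miss: one remote read
    }
    return this.near.get(key);
  }
}

// Data-aware routing: the same key is always sent to the same node.
function routeByKey(key, nodeCount) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % nodeCount;
}

const remote = new DistributedCache({ AAA: 10, BBB: 20, CCC: 30 });
const computeNodes = [new ComputeNode(remote), new ComputeNode(remote)];
const requests = ["AAA", "BBB", "AAA", "CCC", "BBB", "AAA"];
const values = requests.map(
  (key) => computeNodes[routeByKey(key, computeNodes.length)].read(key)
);
console.log(values, "remote reads:", remote.remoteReads);
```

With routing by key, each distinct key is fetched from the distributed cache once, however many requests arrive. Without it, every node would end up fetching every key it happens to see.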
Test methodology
3 ways of measuring scalability were used
  Fixed amount of computations, increasing amount of data
  Fixed amount of data, increasing amount of computations
  Proportional increase of computations and nodes
“Node” = 1 core
“Data unit” = 32 million records or 512 megabytes of tick data
Test runs (the slide groups the columns under Test 1, Test 2 and Test 3):
  Test #      1   2   3   4   5   6   7    8    9
  Nodes       8  32  32  32  32  32  64  128  200
  Data Units  1   1   1   6  12  12  24   48   69
Performance results
Performance results
Conclusions
Data “on the cloud” definitely matters!
  Performance improvements of up to 31 times over “outside the cloud”
Velocity distributed cache has some scalability challenges
  Failure on a 50-node cluster with 200 concurrent clients
  Good news: it’s a very young product and MS is actively improving it
Compute-data affinity matters too!
  Significant performance gain of local cache over distributed cache
  Local cache resolved the distributed cache scalability issue by reducing its load
Final Remarks
Clouds are proving themselves out
  Early adopters are there already
  The rest of the real world will join soon
There are still significant adoption challenges
  Technology immaturity
  Lack of real data, best practices, robust design patterns
  “Fitting” of application middleware to cloud platforms is just starting
Amazon is the leading commercial cloud provider, but is not the only game in town
  Companies are building public, private, dedicated and special-purpose clouds
Victoria Livschitz [email_address] Thank You!
YogeshIJTSRD
 
Machine Learning Inference at the Edge
Machine Learning Inference at the EdgeMachine Learning Inference at the Edge
Machine Learning Inference at the Edge
Julien SIMON
 
IncQuery-D: Incremental Queries in the Cloud
IncQuery-D: Incremental Queries in the CloudIncQuery-D: Incremental Queries in the Cloud
IncQuery-D: Incremental Queries in the Cloud
Gábor Szárnyas
 
Using Grid Technologies in the Cloud for High Scalability

  • 1. Victoria Livschitz, CEO, Grid Dynamics [email_address]. September 17th, 2008. Using Grid Technologies on the Cloud for High Scalability: A Practitioner Report for Cloud User Group
  • 2. A word about Grid Dynamics. Who we are: global leader in scalability engineering. Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence. Value proposition: fusion of innovation with best practices; focused on “physics”, “economics” and “engineering” of extreme scale. Founded in 2006, 30 people and growing, HQ in Silicon Valley. Services: technology consulting; application & systems architecture, design, development. Customers: users of scalable applications (eBay, Bank of America, web start-ups) and makers of scalable middleware (GigaSpaces, Sun, Microsoft). Partners: GridGain, GigaSpaces, Terracotta, Data Synapse, Sun, MS.
  • 3. Why am I speaking here tonight? We do scalability engineering for a living. Cloud computing is new, very exciting and terribly over-hyped. There is not a lot of solid data on performance, scalability, usability, stability… Many of our customers are early adopters or enablers; their pains, discoveries and lessons are worth sharing. The practitioner perspective: we recently completed 3 benchmark projects that we can make public. Results are presented here tonight.
  • 4. Exploring Scalability thru Benchmarking
        Benchmark | Cloud | Vendor | Middleware | Application
        1. Test scalability of EC2 on the simplest map-reduce problem | Public commercial cloud, EC2 | Amazon | GridGain | Monte-Carlo
        2. Test scalability of data-driven HPC applications, similar to those used in practice | Public commercial cloud, EC2 | Amazon | GigaSpaces | Risk Management
        3. Explore performance implications of data “in the cloud” vs. “outside the cloud” | Incubator compute cloud for academic use, CompFin | Microsoft | Windows HPC Server, Velocity | Data-intensive Analytics
  • 5. Benchmark #1: Scalability of Simple Map/Reduce Application on EC2
  • 6. Basic Scalability of Simple Map/Reduce. Goal: establish upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain. Why Monte-Carlo: simple, widely-used, perfectly scalable problem. Why EC2: most popular public cloud. Why GridGain: simple, open-source map-reduce middleware. Intended claims: EC2 scales linearly as grid execution platform; GridGain scales linearly as map-reduce middleware; businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies.
  • 7. Other Goals. Understand “process bottlenecks” of EC2 platform: changes to the programming, deployment, management model; ease of use; security; metering and payment. Identify scalability bottlenecks at any level in the stack: EC2, GridGain, glueware. Robustness, stability, predictability.
  • 8. Architecture (diagram). Zones: Corporate Intranet; Amazon EC2 Cloud. Components: Grid Console, HTTP Server, Head Node, OpenMQ Server (JMS), Worker Nodes, Spare Capacity (spare EC2 instances). Diagram labels: Job Execution; JMS Message Processing; Manages worker nodes and tasks; Discovery & Task Assignment; Controls Grid Operation; Configuration & Task Repository. Technology stack: EC2, GridGain, Typica, OpenMQ.
  • 9. Performance Methodology & Results. Same algorithm exercised on a wide range of nodes: 2, 4, 8, 16, …, 256, 512 (limited by Amazon's permission of 550 nodes). Simultaneously double the amount of computations and nodes. Measure completion time. Repeat several times to get statistical averages. Conclusions: total degradation from 13 to 16 seconds, or 20%. Discarding first 8 nodes, near-perfect scale up to 128. Slight degradation from 128 to 256 (3%) and from 256 to 512 (7%). => Proves the point of near-linear scalability end-to-end.
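The methodology on slide 9 is a weak-scaling test: work and node count double together, so the ideal completion time stays flat. A minimal sketch of how the reported percentages could be derived from measured times is given here; the function names and the sample timings are illustrative assumptions, not the measured benchmark data.

```javascript
// Weak-scaling efficiency: with work per node held constant, the ideal
// completion time is flat, so efficiency = baseline time / observed time.
function weakScalingEfficiency(baselineSeconds, observedSeconds) {
  if (baselineSeconds <= 0 || observedSeconds <= 0) {
    throw new RangeError("times must be positive");
  }
  return baselineSeconds / observedSeconds;
}

// Relative slowdown between two consecutive grid sizes.
function degradation(previousSeconds, currentSeconds) {
  if (previousSeconds <= 0) {
    throw new RangeError("previous time must be positive");
  }
  return (currentSeconds - previousSeconds) / previousSeconds;
}

// Hypothetical averaged timings in seconds, keyed by node count.
var sampleTimings = { 128: 10.0, 256: 10.5, 512: 11.0 };
console.log("efficiency at 512:", weakScalingEfficiency(sampleTimings[128], sampleTimings[512]).toFixed(3));
console.log("degradation 128 -> 256:", degradation(sampleTimings[128], sampleTimings[256]).toFixed(3));
```

Averaging several runs per grid size before calling these functions mirrors the "repeat several times" step on the slide.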
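  • 10. Simple scaling script
        // Work per node stays constant, so total work grows with the grid size.
        var itersPerNode = 5000;
        // Grid sizes to test, doubling at each step.
        var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
        for (var i = 0; i < cnode.length; i++) {
          var n = cnode[i];
          grid.growEC2Grid(n, true);       // grow the EC2 grid to n nodes
          grid.waitForGridInstances(n);    // block until all n instances have joined
          runTask(itersPerNode * n, n, 3); // total iterations scale with n
        }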
  • 11. Observations: deployment considerations. Start-up for the whole grid in different configurations takes 0.5 - 3 min. 2-step deployment process: first, bring up one EC2 node as controller; next, use the controller on-the-inside to coordinate bootstrapping. Some EC2 nodes don’t finish bootstrapping successfully: an average of 0.5% of nodes come up in an incomplete state, and the nature of the problem is not clear. If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting computation. IP address deadlock issue: IP addresses of the nodes are needed to start & configure the grid, but IP addresses are not available until the grid is up & configured. Need to carefully choreograph bootstrapping and pass IPs as parameters into controlling scripts.
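The "start the nodes, kill off the sick ones, bring up a few new ones" advice on slide 11 can be sketched as a small retry loop. This is only an illustration under stated assumptions: `launch`, `isHealthy` and `terminate` are hypothetical callbacks standing in for whatever provisioning and health-check calls a real deployment uses, and nothing here is the framework used in the benchmark.

```javascript
// Keep launching replacements until `target` healthy nodes exist or
// `maxRounds` launch rounds have been used.
// launch(count) -> array of node handles (hypothetical)
// isHealthy(node) -> boolean (hypothetical health check)
// terminate(node) -> releases a node that failed bootstrapping (hypothetical)
function provisionHealthyGrid(target, launch, isHealthy, terminate, maxRounds) {
  var healthy = [];
  for (var round = 0; round < maxRounds && healthy.length < target; round++) {
    var started = launch(target - healthy.length);
    for (var i = 0; i < started.length; i++) {
      if (isHealthy(started[i])) {
        healthy.push(started[i]);
      } else {
        terminate(started[i]); // kill the sick node before computation starts
      }
    }
  }
  if (healthy.length < target) {
    throw new Error("only " + healthy.length + " of " + target + " nodes are healthy");
  }
  return healthy;
}

// Simulated usage: node ids are integers and every 5th node is "sick".
var demoNextId = 0;
var demoGrid = provisionHealthyGrid(
  8,
  function (count) {
    var nodes = [];
    for (var k = 0; k < count; k++) nodes.push(++demoNextId);
    return nodes;
  },
  function (node) { return node % 5 !== 0; },
  function (node) { console.log("terminating node", node); },
  5
);
console.log("healthy nodes:", demoGrid.join(", "));
```

Bounding the number of rounds keeps a systematic bootstrap failure from launching (and billing) instances indefinitely.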
  • 12. Observations: monitoring considerations. Connection to each node from outside is possible, but not efficient; check heartbeat from the internal management nodes. Local scripts must be stored on S3 or passed back before termination. Programming model considerations: EC2 does not support IP multicast, so we switched to JMS instead; luckily, GridGain supports multiple protocols. Typica: 3rd-party connectivity library that uses the EC2 query interface. An undocumented limit on URL length is hit with 100s of nodes. Amazon just disconnects on improper URLs without specifying the error, so debugging was hard. Workaround: rewrote some parts of our framework to enquire about individual running nodes. Works, but less efficient.
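The URL length problem on slide 12 arises when one query lists hundreds of instance IDs. The slide's workaround was to query nodes individually; a different technique, shown here purely as a sketch, is to batch IDs so that each query string stays under a chosen length. The limit value and the `InstanceId.N` parameter layout are assumptions for illustration, and `batchByQueryLength` is a hypothetical helper, not part of Typica.

```javascript
// Split instance IDs into batches so each batch's query string
// ("InstanceId.1=...&InstanceId.2=...") stays within maxLength characters.
// A single ID longer than maxLength still gets a batch of its own.
function batchByQueryLength(instanceIds, maxLength) {
  var batches = [];
  var current = [];
  var currentLength = 0;
  instanceIds.forEach(function (id) {
    var param = "InstanceId." + (current.length + 1) + "=" + encodeURIComponent(id);
    var added = param.length + (current.length > 0 ? 1 : 0); // +1 for the "&" separator
    if (current.length > 0 && currentLength + added > maxLength) {
      batches.push(current);
      current = [];
      currentLength = 0;
      // Parameter numbering restarts in the new batch.
      added = ("InstanceId.1=" + encodeURIComponent(id)).length;
    }
    current.push(id);
    currentLength += added;
  });
  if (current.length > 0) batches.push(current);
  return batches;
}

// Usage with made-up IDs and an assumed 33-character budget.
console.log(JSON.stringify(batchByQueryLength(["i-1", "i-2", "i-3"], 33)));
```

Batching sits between the two extremes on the slide: fewer round trips than one query per node, but no single request large enough to hit the limit.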
  • 13. Observations: metering and payment. Amazon sets a limit on concurrent VMs; we eventually got approval for 550 VMs after some due diligence from Amazon. Amazon charges by full or partial VM-hours. Sometimes, short usage of VMs is not metered; it is not clear why. One hypothesis: metering “sweeps” happen every so often. Be careful with usage bills for testing: a test may need to be run multiple times. Beware of rogue scripts. Test everything on smaller configurations first. Scale gradually, or you will miss the bottlenecks.
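Because partial VM-hours are charged as full hours (slide 13), even short test runs on large grids add up. A rough estimator is sketched here under the assumption that every run starts fresh instances; the hourly rate is a caller-supplied parameter and the figures in the usage example are invented for illustration, not actual EC2 prices.

```javascript
// VM-hours billed for one run: each node is charged for every started hour.
function estimateVmHours(nodes, runMinutes) {
  if (nodes < 0 || runMinutes < 0) {
    throw new RangeError("nodes and runMinutes must be non-negative");
  }
  return nodes * Math.ceil(runMinutes / 60);
}

// Total cost when the same test is repeated `repeats` times on fresh instances.
// ratePerVmHour is an assumption supplied by the caller.
function estimateCost(nodes, runMinutes, repeats, ratePerVmHour) {
  return estimateVmHours(nodes, runMinutes) * repeats * ratePerVmHour;
}

// Hypothetical example: 512 nodes, 5-minute run, 3 repeats, 0.5 per VM-hour.
console.log("VM-hours per run:", estimateVmHours(512, 5));
console.log("estimated cost:", estimateCost(512, 5, 3, 0.5));
```

Running such an estimate before each test campaign is one way to act on the slide's advice to watch usage bills.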
  • 14. Achieving scalability. Software breaks at scale, including the glueware. Barrier #1 was hit at 100 nodes because of ActiveMQ scalability. Correction: switched ActiveMQ for OpenMQ. Comment: some users report better ActiveMQ scalability with 5.x. Barrier #2 was hit at 300 nodes because of the Typica URL length limit. Correction: changed our use of the API. Security considerations: EC2 credentials are passed to the Head Node, and 3rd-party GridGain tasks can access them. Sounds like a potential vulnerability.
  • 15. What have we learned? EC2 is ready for production usage on large-scale stateless computations: price/performance, strong linear scale curve. GridGain showed itself very well: scale, stability, ease-of-use, pluggability; a solid open-source choice of map-reduce middleware. Some level of effort is required to “port” a grid system to EC2: deployment, monitoring, programming model, metering, security. What’s next? Can we go higher than 512? What is the behavior of more complex applications?
  • 16. Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2
  • 17. Data-driven Risk Management on EC2. Goal: investigate scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces. Why risk management: class of problems widely used in financial services. Why GigaSpaces: leading middleware platform for compute & data grids. Intended claims: EC2 scales linearly for data-driven HPC applications; GigaSpaces scales well as both compute and data grid middleware; businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies.
  • 18. Architecture (diagram). Workers take tasks, perform calculations, and write results back. User uses ec2-gdc-tools to manage the grid. Master writes tasks into the data grid and waits for results… Diagram labels: Amazon EC2 Grid, Compute Grid, Data Grid, Grid Console, Service Grid Manager, Master.
  • 19. Performance methodology & results. Same algorithm exercised on a wide range of nodes: 16, 32, 128, 256, 512 (still limited by Amazon's permission of 550). Constant size of data grid (4 large EC2 nodes). Double the nodes with a constant amount of work. Measure completion time (strive for linear time reduction). Conclusions: near-perfect scale from 16 to 256 nodes; 28% degradation from 256 to 512, since the data cache becomes a bottleneck.
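Unlike Benchmark #1, slide 19 describes a strong-scaling test: the amount of work is constant, so doubling the nodes should ideally halve the completion time. A minimal sketch of the corresponding metric follows; the function name and the timings in the usage line are hypothetical, not the measured results.

```javascript
// Strong-scaling metrics relative to a baseline grid size.
// speedup    = baseline time / observed time
// ideal      = observed nodes / baseline nodes (linear time reduction)
// efficiency = speedup / ideal (1.0 means perfectly linear scaling)
function strongScaling(baseNodes, baseSeconds, nodes, seconds) {
  if (baseNodes <= 0 || nodes <= 0 || baseSeconds <= 0 || seconds <= 0) {
    throw new RangeError("all arguments must be positive");
  }
  var speedup = baseSeconds / seconds;
  var ideal = nodes / baseNodes;
  return { speedup: speedup, ideal: ideal, efficiency: speedup / ideal };
}

// Hypothetical timings: 160 s on 16 nodes, 100 s on 32 nodes.
console.log(JSON.stringify(strongScaling(16, 160, 32, 100)));
```

An efficiency that drops sharply between two grid sizes, while the data grid stays fixed at the same size, is the kind of signal that points to the cache rather than the compute nodes as the bottleneck.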
  • 20. What have we learned? EC2 is ready for production usage for classes of large-scale data-driven HPC applications, common to Risk Management. GigaSpaces showed itself very well: the compute-data grid scales well in the master-worker pattern. Some level of effort is required to “port” a grid system to EC2: deployment, monitoring, programming model, metering, security. Bootstrapping this system is far more complex than GridGain’s. For more details, contact me offline. What’s next? How does the data grid scale? What about more complex applications? What’s the scalability of a co-located compute-data grid configuration?
  • 21. Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications
  • 22. Data-intensive Analytics on MS cloud. Goal: investigate performance improvements from data “in the cloud” vs. “outside the cloud” for complex data-intensive analytical applications in the context of the HPC CompFin++ Labs environment using Velocity. What is CompFin++ Labs: MS-funded “incubator” compute cloud for exploration of modern compute & data challenges on massive scale. What is Velocity: MS's new in-memory data grid middleware, still CTP1. The model: computes correlation between stock prices over time. Algorithms use a significant amount of data which could be cached. Maximum cache hit ratio for the model is around 90%. Intended claims: measure impact of data “closeness” to the computation on the cloud.
  • 26. Benchmarked configurations. Same analytical model with complex queries. Perfect linear scale curve (baseline). Original CompFin. Distributed cache (original CompFin + Velocity distributed cache for financial data). Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing).
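The "local cache" configuration on slide 26 relies on data-aware routing: a task is sent to the node whose near cache already holds the data it needs. The sketch below illustrates the general idea with a simple hash partitioner; it is not Velocity's API, and `partitionOf`, `routeTasks`, the `symbol` field and the node names are all hypothetical.

```javascript
// Map a data key (for example a stock symbol) to a partition number with a
// simple deterministic string hash kept in the unsigned 32-bit range.
function partitionOf(key, partitionCount) {
  var hash = 0;
  for (var i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0;
  }
  return hash % partitionCount;
}

// Group tasks by the node that owns the partition of their data key, so each
// task runs next to the cached data it reads.
function routeTasks(tasks, nodes) {
  var plan = {};
  nodes.forEach(function (node) { plan[node] = []; });
  tasks.forEach(function (task) {
    var owner = nodes[partitionOf(task.symbol, nodes.length)];
    plan[owner].push(task);
  });
  return plan;
}

// Usage with made-up symbols and node names.
var demoPlan = routeTasks(
  [{ symbol: "AAA", window: 30 }, { symbol: "BBB", window: 30 }, { symbol: "AAA", window: 90 }],
  ["node-0", "node-1", "node-2", "node-3"]
);
console.log(JSON.stringify(demoPlan));
```

Because the mapping is deterministic, repeated queries for the same symbol land on the same node, which is what lets a near cache absorb load that would otherwise hit the distributed cache.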
  • 27. Test methodology. 3 ways of measuring scalability were used: fixed amount of computations, increasing amount of data; fixed amount of data, increasing amount of computations; proportional increase of computations and nodes. “Node” = 1 core. “Data unit” = 32 million records or 512 megabytes of tick data. Test matrix (column groups on the slide: Test 1, Test 2, Test 3):
        Test #     | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8   | 9
        Nodes      | 8 | 32 | 32 | 32 | 32 | 32 | 64 | 128 | 200
        Data Units | 1 | 1  | 1  | 6  | 12 | 12 | 24 | 48  | 69
  • 30. Conclusions. Data “on the cloud” definitely matters! Performance improvements of up to 31 times over “outside the cloud”. Velocity distributed cache has some scalability challenges: failure on a 50-node cluster with 200 concurrent clients. Good news: it’s a very young product and MS is actively improving it. Compute-data affinity matters too! Significant performance gain of local cache over distributed cache. Local cache resolved the distributed cache scalability issue by reducing its load.
  • 31. Final Remarks. Clouds are proving themselves out: early adopters are there already, and the rest of the real world will join soon. There are still significant adoption challenges: technology immaturity; lack of real data, best practices, robust design patterns. “Fitting” of application middleware to cloud platforms is just starting. Amazon is the leading commercial cloud provider, but is not the only game in town. Companies are building public, private, dedicated and special-purpose clouds.