SlideShare a Scribd company logo
Seattle 2018
@danielhochman / Engineer / Lyft
Instrumenting and Scaling
Cloud-Native Databases with Envoy
Seattle 2018
Database outage
1. Disk I/O wait spikes briefly
2. Client opens more connections
3. Slowdown due to auth overhead of new
connections
4. Client opens more connections
5. Hit max connection limit
Seattle 2018
Databases in the cloud
Instantly provision resilient, high-throughput infrastructure
No access to underlying VM and/or shared hardware
Limited access to telemetry
Limited access to configuration
Closed source or no ability to run custom binary
Seattle 2018
Cloud Native
Cloud native technologies empower organizations to build and run
scalable applications in modern, dynamic environments such as
public, private, and hybrid clouds.
Seattle 2018
Service Mesh topology
Service mesh
Edge
DiscoveryEnvoy Proxy is deployed at every hop
Seattle 2018
Instance topology
Application communicates over locally to Envoy
which will proxy all traffic
localhost:6001
localhost:6101
localhost:7000
…
(internal services)
(third-party services)
(cloud services)
and more!
Seattle 2018
Layer 3 / 4: Proxying TCP
- DNS aware
- Load balancing: round robin, least request, ring hash, random, etc
- Impose an idle timeout
- Healthchecking
- Access logging
localhost:7000
Stats
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
Other benefits
iot.us-east-1.amazonaws.com
174.217.14.202
174.217.14.234
Seattle 2018
Layer 5 / 6: Offloading SSL
Stats
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
Other benefits
- Efficient
- Up-to-date and secure (TLS 1.3)
- SNI, cert pinning, session resumption, etc.
- Easier to upgrade
localhost:7000 172.217.14.202:443
Seattle 2018
Layer 7: Managing HTTP
Stats
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
Other benefits
- Transparent upgrade from HTTP/1 to HTTP/2 (multiplexed)
- Manage request retries and timeouts
- Access logging
- Offload GZIP decompression
HTTP/1
HTTP/2
Seattle 2018
Statistics
TCP (L3/L4) SSL (L5/L6) HTTP (L7)
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
cx_length_ms (hist)
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
and more!
Seattle 2018
Dashboards
Live templating
or {% macro envoy_stats(origin, destination) %}
Seattle 2018
Observability
Homogenous telemetry data makes it easier
to observe and correlate behavior in large
systems.
Seattle 2018
Observability
Libraries are heterogenous!
SSL ciphers? Status code metrics? Retry?
import pynamodb
use AwsDynamoDbDynamoDbClient;
import "github.com/aws/dynamodb"
&aws.Config{
Endpoint:aws.String("http://localhost:8000")
}
e.g.
Envoy provides standard access logs, stats,
alarms, retry, etc
Seattle 2018
Layer 7: Beyond HTTP
Envoy supports three other database-specific L7 protocols today
Seattle 2018
DynamoDB
- Protocol: JSON over HTTP
- Cloudwatch telemetry
- min, avg, max latency
- per-table capacity unit throughput
- per-minute
- Benefits of Envoy:
- Histogram of latency (percentiles)
- Custom windowing of metrics
- Per-host, per-zone, and per-cluster statistics
Seattle 2018
DynamoDB with codec
Seattle 2018
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.GetItem
{
"TableName": "pets",
"Key": {
"Name": {"S": "Patty"}
}
}
DynamoDB with codec
dynamodb.table.pets.GetItem.upstream_rq_time
Seattle 2018
DynamoDB
What was the per-30s p99 for write requests from the
users-streamlistener canary to the pets table?
ts(
envoy.dynamodb.pets.PutItem.upstream_rq_time.p99,
window=30,
group=users-streamlistener,
canary=true,
)
Seattle 2018
MongoDB
- Protocol: Binary JSON (BSON)
- Benefits of Envoy in TCP mode:
- Per-host, per-cluster, per-zone network I/O
- Benefits of Envoy with Mongo codec:
- Per-operation latency
- Count size and number of documents
- Count scattered gets in sharded cluster
How did the number of documents returned by queries
change in us-east-1a after the 3pm deploy of my service?
Seattle 2018
MongoDB at scale
Help! My Mongo database is experiencing outages:
- Disk I/O wait spikes briefly
- Client opens more connections
- Slowdown due to auth overhead of new connections
- Open more connections
- Hit max connection limit
Envoy will rate limit new connections to apply backpressure so that query
times can recover.
Seattle 2018
MongoDB at scale
Help! I deleted an index. I read the code but it was in a 3,000 line class.
The index was still in use and everything fell over until we could
recreate it.
Envoy will efficiently log all Mongo queries in JSON format so that a week
of logs can be audited for usage of the index's fields.
Have you tried the built-in query profiler?
Yes, it caused a serious outage because it's expensive and results in 3x
CPU usage.
Seattle 2018
MongoDB at scale
Envoy will:
- globally rate limit new connections
- efficiently log all Mongo queries
- track the number of queries with no timeout set
- parse the $comment field of a query so we can time and count queries of
individual application methods, log how many records they returned, etc.
… for applications in 3 different languages across 8 clusters.
… 6 months and several outages later ...
Seattle 2018
/var/log/envoy/mongo/0.log
{
"time": "2018-10-13T21:17:08.483Z",
"upstream_host": "172.18.3.19:27817"
"message": {
"opcode": "OP_QUERY",
"query": {
"findAndModify": "user",
"query": {"_id": 903730},
"update": {"$set": {"stats.rating": 4.9}},
"$comment": "{
"hostname": "users-3ae3r",
"httpUniqueId": "91aaaaaf-4c3d-9400-bcbf-c4aaaaaaadb7",
"callingFunction": "users.UpdateRating" }"
},
},
}
envoy.mongo.callsite.users.UpdateRating.reply_time_ms
Seattle 2018
Redis partitioning proxy
Consistent hashingRedis protocol
+=
Seattle 2018
Redis at scale
localhost:6379
…
SET msg hello
INCR comm
MGET lyft hello
SET msg hello
GET hello
INCR comm
GET lyft
OK
1
nil
To the application, the proxy looks like a single instance of Redis.
Seattle 2018
Approaches
TCP
HTTP
…
Bump-in-the-wire Fully routing
vs
Seattle 2018
Future codecs
Seattle 2018
Roadmap
- More codecs
- Full L7 capability vs bump-in-the-wire
- Better integration of tracing
- More fault injection coverage
- Role-based access control
Seattle 2018
Thanks!
@danielhochman
Ad

More Related Content

What's hot (20)

Linking Metrics to Logs using Loki
Linking Metrics to Logs using LokiLinking Metrics to Logs using Loki
Linking Metrics to Logs using Loki
Knoldus Inc.
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
confluent
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
DataWorks Summit
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
 
초보자를 위한 분산 캐시 이야기
초보자를 위한 분산 캐시 이야기초보자를 위한 분산 캐시 이야기
초보자를 위한 분산 캐시 이야기
OnGameServer
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
Databricks
 
How Apache Kafka® Works
How Apache Kafka® WorksHow Apache Kafka® Works
How Apache Kafka® Works
confluent
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
Chhavi Parasher
 
Amazon EKS Deep Dive
Amazon EKS Deep DiveAmazon EKS Deep Dive
Amazon EKS Deep Dive
Andrzej Komarnicki
 
Elk
Elk Elk
Elk
Caleb Wang
 
Data Localisation Trends
Data Localisation TrendsData Localisation Trends
Data Localisation Trends
Martina F. Ferracane
 
Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...
HostedbyConfluent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
Aljoscha Krettek
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformation
Lars Marius Garshol
 
Kafka basics
Kafka basicsKafka basics
Kafka basics
João Paulo Leonidas Fernandes Dias da Silva
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Timothy Spann
 
Rate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX PlusRate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX Plus
NGINX, Inc.
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
DataWorks Summit/Hadoop Summit
 
Linking Metrics to Logs using Loki
Linking Metrics to Logs using LokiLinking Metrics to Logs using Loki
Linking Metrics to Logs using Loki
Knoldus Inc.
 
Kafka 101 and Developer Best Practices
Kafka 101 and Developer Best PracticesKafka 101 and Developer Best Practices
Kafka 101 and Developer Best Practices
confluent
 
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
Migrating your clusters and workloads from Hadoop 2 to Hadoop 3
DataWorks Summit
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
 
초보자를 위한 분산 캐시 이야기
초보자를 위한 분산 캐시 이야기초보자를 위한 분산 캐시 이야기
초보자를 위한 분산 캐시 이야기
OnGameServer
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
Databricks
 
How Apache Kafka® Works
How Apache Kafka® WorksHow Apache Kafka® Works
How Apache Kafka® Works
confluent
 
Fundamentals of Apache Kafka
Fundamentals of Apache KafkaFundamentals of Apache Kafka
Fundamentals of Apache Kafka
Chhavi Parasher
 
Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...Managing multiple event types in a single topic with Schema Registry | Bill B...
Managing multiple event types in a single topic with Schema Registry | Bill B...
HostedbyConfluent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
Aljoscha Krettek
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformation
Lars Marius Garshol
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Timothy Spann
 
Rate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX PlusRate Limiting with NGINX and NGINX Plus
Rate Limiting with NGINX and NGINX Plus
NGINX, Inc.
 

Similar to Instrumenting and Scaling Databases with Envoy (20)

Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Kevin Mao
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Streamsets Inc.
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Rick Bilodeau
 
Migrating the elastic stack to the cloud, or application logging @ travix
 Migrating the elastic stack to the cloud, or application logging @ travix Migrating the elastic stack to the cloud, or application logging @ travix
Migrating the elastic stack to the cloud, or application logging @ travix
Ruslan Lutsenko
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data Analytics
Cisco Canada
 
How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.
Renzo Tomà
 
Big Data, Mob Scale.
Big Data, Mob Scale.Big Data, Mob Scale.
Big Data, Mob Scale.
darach
 
Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)
jaxLondonConference
 
MongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data StreamsMongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB
 
Enterprise Data Lakes
Enterprise Data LakesEnterprise Data Lakes
Enterprise Data Lakes
Farid Gurbanov
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Canada
 
Cisco project ideas
Cisco   project ideasCisco   project ideas
Cisco project ideas
VIT University
 
Writing New Relic Plugins: NSQ
Writing New Relic Plugins: NSQWriting New Relic Plugins: NSQ
Writing New Relic Plugins: NSQ
lxfontes
 
Bigdata meetup dwarak_realtime_score_app
Bigdata meetup dwarak_realtime_score_appBigdata meetup dwarak_realtime_score_app
Bigdata meetup dwarak_realtime_score_app
Dwarakanath Ramachandran
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
Surendar S
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13
DECK36
 
Addressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh IntegrationAddressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh Integration
ThomasGraf42
 
Log aggregation and analysis
Log aggregation and analysisLog aggregation and analysis
Log aggregation and analysis
Dhaval Mehta
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
Scott Miao
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
Thomas Weible
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Kevin Mao
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Streamsets Inc.
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Rick Bilodeau
 
Migrating the elastic stack to the cloud, or application logging @ travix
 Migrating the elastic stack to the cloud, or application logging @ travix Migrating the elastic stack to the cloud, or application logging @ travix
Migrating the elastic stack to the cloud, or application logging @ travix
Ruslan Lutsenko
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data Analytics
Cisco Canada
 
How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.
Renzo Tomà
 
Big Data, Mob Scale.
Big Data, Mob Scale.Big Data, Mob Scale.
Big Data, Mob Scale.
darach
 
Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)
jaxLondonConference
 
MongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data StreamsMongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Canada
 
Writing New Relic Plugins: NSQ
Writing New Relic Plugins: NSQWriting New Relic Plugins: NSQ
Writing New Relic Plugins: NSQ
lxfontes
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
Surendar S
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13
DECK36
 
Addressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh IntegrationAddressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh Integration
ThomasGraf42
 
Log aggregation and analysis
Log aggregation and analysisLog aggregation and analysis
Log aggregation and analysis
Dhaval Mehta
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
Scott Miao
 
Ad

Recently uploaded (20)

AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
Developing System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptxDeveloping System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptx
wondimagegndesta
 
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
CSUC - Consorci de Serveis Universitaris de Catalunya
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
João Esperancinha
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Viam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdfViam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdf
camilalamoratta
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Config 2025 presentation recap covering both days
Config 2025 presentation recap covering both daysConfig 2025 presentation recap covering both days
Config 2025 presentation recap covering both days
TrishAntoni1
 
Developing System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptxDeveloping System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptx
wondimagegndesta
 
Dark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanizationDark Dynamism: drones, dark factories and deurbanization
Dark Dynamism: drones, dark factories and deurbanization
Jakub Šimek
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
AI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of DocumentsAI Agents at Work: UiPath, Maestro & the Future of Documents
AI Agents at Work: UiPath, Maestro & the Future of Documents
UiPathCommunity
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
Could Virtual Threads cast away the usage of Kotlin Coroutines - DevoxxUK2025
João Esperancinha
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
IT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information TechnologyIT488 Wireless Sensor Networks_Information Technology
IT488 Wireless Sensor Networks_Information Technology
SHEHABALYAMANI
 
Viam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdfViam product demo_ Deploying and scaling AI with hardware.pdf
Viam product demo_ Deploying and scaling AI with hardware.pdf
camilalamoratta
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Limecraft Webinar - 2025.3 release, featuring Content Delivery, Graphic Conte...
Maarten Verwaest
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Ad

Instrumenting and Scaling Databases with Envoy

  • 1. Seattle 2018 @danielhochman / Engineer / Lyft Instrumenting and Scaling Cloud-Native Databases with Envoy
  • 2. Seattle 2018 Database outage 1. Disk I/O wait spikes briefly 2. Client opens more connections 3. Slowdown due to auth overhead of new connections 4. Client opens more connections 5. Hit max connection limit
  • 3. Seattle 2018 Databases in the cloud Instantly provision resilient, high-throughput infrastructure No access to underlying VM and/or shared hardware Limited access to telemetry Limited access to configuration Closed source or no ability to run custom binary
  • 4. Seattle 2018 Cloud Native Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds.
  • 5. Seattle 2018 Service Mesh topology Service mesh Edge DiscoveryEnvoy Proxy is deployed at every hop
  • 6. Seattle 2018 Instance topology Application communicates over locally to Envoy which will proxy all traffic localhost:6001 localhost:6101 localhost:7000 … (internal services) (third-party services) (cloud services) and more!
  • 7. Seattle 2018 Layer 3 / 4: Proxying TCP - DNS aware - Load balancing: round robin, least request, ring hash, random, etc - Impose an idle timeout - Healthchecking - Access logging localhost:7000 Stats cx_active cx_connect_fail cx_idle_timeout cx_total cx_tx_bytes_total cx_rx_bytes_total Other benefits iot.us-east-1.amazonaws.com 174.217.14.202 174.217.14.234
  • 8. Seattle 2018 Layer 5 / 6: Offloading SSL Stats handshakes tls_session_reused fail_verify_no_cert fail_verify_ca_error fail_verify_san cipher.<cipher> days_until_cert_expires Other benefits - Efficient - Up-to-date and secure (TLS 1.3) - SNI, cert pinning, session resumption, etc. - Easier to upgrade localhost:7000 172.217.14.202:443
  • 9. Seattle 2018 Layer 7: Managing HTTP Stats cx_http1_total cx_http2_total cx_protocol_error rq_2xx rq_4xx rq_5xx rq_retry rq_time_ms (hist) rq_timeout Other benefits - Transparent upgrade from HTTP/1 to HTTP/2 (multiplexed) - Manage request retries and timeouts - Access logging - Offload GZIP decompression HTTP/1 HTTP/2
  • 10. Seattle 2018 Statistics TCP (L3/L4) SSL (L5/L6) HTTP (L7) cx_active cx_connect_fail cx_idle_timeout cx_total cx_tx_bytes_total cx_rx_bytes_total cx_length_ms (hist) handshakes tls_session_reused fail_verify_no_cert fail_verify_ca_error fail_verify_san cipher.<cipher> days_until_cert_expires cx_http1_total cx_http2_total cx_protocol_error rq_2xx rq_4xx rq_5xx rq_retry rq_time_ms (hist) rq_timeout and more!
  • 11. Seattle 2018 Dashboards Live templating or {% macro envoy_stats(origin, destination) %}
  • 12. Seattle 2018 Observability Homogenous telemetry data makes it easier to observe and correlate behavior in large systems.
  • 13. Seattle 2018 Observability Libraries are heterogenous! SSL ciphers? Status code metrics? Retry? import pynamodb use AwsDynamoDbDynamoDbClient; import "github.com/aws/dynamodb" &aws.Config{ Endpoint:aws.String("http://localhost:8000") } e.g. Envoy provides standard access logs, stats, alarms, retry, etc
  • 14. Seattle 2018 Layer 7: Beyond HTTP Envoy supports three other database-specific L7 protocols today
  • 15. Seattle 2018 DynamoDB - Protocol: JSON over HTTP - Cloudwatch telemetry - min, avg, max latency - per-table capacity unit throughput - per-minute - Benefits of Envoy: - Histogram of latency (percentiles) - Custom windowing of metrics - Per-host, per-zone, and per-cluster statistics
  • 17. Seattle 2018 POST / HTTP/1.1 X-Amz-Target: DynamoDB_20120810.GetItem { "TableName": "pets", "Key": { "Name": {"S": "Patty"} } } DynamoDB with codec dynamodb.table.pets.GetItem.upstream_rq_time
  • 18. Seattle 2018 DynamoDB What was the per-30s p99 for write requests from the users-streamlistener canary to the pets table? ts( envoy.dynamodb.pets.PutItem.upstream_rq_time.p99, window=30, group=users-streamlistener, canary=true, )
  • 19. Seattle 2018 MongoDB - Protocol: Binary JSON (BSON) - Benefits of Envoy in TCP mode: - Per-host, per-cluster, per-zone network I/O - Benefits of Envoy with Mongo codec: - Per-operation latency - Count size and number of documents - Count scattered gets in sharded cluster How did the number of documents returned by queries change in us-east-1a after the 3pm deploy of my service?
  • 20. Seattle 2018 MongoDB at scale Help! My Mongo database is experiencing outages: - Disk I/O wait spikes briefly - Client opens more connections - Slowdown due to auth overhead of new connections - Open more connections - Hit max connection limit Envoy will rate limit new connections to apply backpressure so that query times can recover.
  • 21. Seattle 2018 MongoDB at scale Help! I deleted an index. I read the code but it was in a 3,000 line class. The index was still in use and everything fell over until we could recreate it. Envoy will efficiently log all Mongo queries in JSON format so that a week of logs can be audited for usage of the index's fields. Have you tried the built-in query profiler? Yes, it caused a serious outage because it's expensive and results in 3x CPU usage.
  • 22. Seattle 2018 MongoDB at scale Envoy will: - globally rate limit new connections - efficiently log all Mongo queries - track the number of queries with no timeout set - parse the $comment field of a query so we can time and count queries of individual application methods, log how many records they returned, etc. … for applications in 3 different languages across 8 clusters. … 6 months and several outages later ...
  • 23. Seattle 2018 /var/log/envoy/mongo/0.log { "time": "2018-10-13T21:17:08.483Z", "upstream_host": "172.18.3.19:27817" "message": { "opcode": "OP_QUERY", "query": { "findAndModify": "user", "query": {"_id": 903730}, "update": {"$set": {"stats.rating": 4.9}}, "$comment": "{ "hostname": "users-3ae3r", "httpUniqueId": "91aaaaaf-4c3d-9400-bcbf-c4aaaaaaadb7", "callingFunction": "users.UpdateRating" }" }, }, } envoy.mongo.callsite.users.UpdateRating.reply_time_ms
  • 24. Seattle 2018 Redis partitioning proxy Consistent hashingRedis protocol +=
  • 25. Seattle 2018 Redis at scale localhost:6379 … SET msg hello INCR comm MGET lyft hello SET msg hello GET hello INCR comm GET lyft OK 1 nil To the application, the proxy looks like a single instance of Redis.
  • 28. Seattle 2018 Roadmap - More codecs - Full L7 capability vs bump-in-the-wire - Better integration of tracing - More fault injection coverage - Role-based access control
  翻译: