NAVIGATING THE WORLD OF USER DATA MANAGEMENT AND DATA DISCOVERY
SMITI SHARMA, VIRTUSTREAM - EMC
About Me
Principal Engineer & Lead, Big Data Cloud, Virtustream
• Oracle – Principal Engineer & PM team
• EMC – Big Data Lead
• Pivotal – Global CTO
• Virtustream EMC
Areas of expertise
• Architecting, developing and managing Mission Critical Transactional and Analytics Platforms
• RDBMS & NoSQL – Hadoop Platforms
• Product Management and Development
• ODPi RT member
Sessions at Hadoop Summit 2016
• Wed: 11:30am Room 211: User Data Management in HD
• Wed: 4:10pm Room 212: Building PAAS for Smart cities
Smiti.sharma@virtustream.com
@smiti_sharma
About Virtustream
• Enterprise-Class Cloud Solutions
− Cloud IAAS, PAAS
− Cloud Software Solutions
− Cloud Managed Services
− Cloud Professional Services
• Developer of xStream Cloud Management Platform SW
• Inventor of the µVM™ (MicroVM) Cloud Technology
• Industry leading Cloud offers in areas of
− SAP Landscape, HANA
− Storage
− Big Data
• Close partnerships with SAP, EMC, VMware
• Service Provider to 2,000 Global Workloads
Global Footprint
Agenda
• Data Management Overview
• Project Background & Context
• High level Architecture and data flow
• Solution criteria
• Evaluation criteria of Data Management Tools
• Differentiation factors
• Proposed Solution
• Conclusions
User Data Landscape
Master Data Management
Metadata
- Business
- Technical
- Operational
Reference Data
Transactional data
Driving Factors for Data Management
• IT custodian of business data
• Data Characteristics
– Business Value
• Analytical vs. transactional systems
– Volume and Volatility
– Complexity of Data type and formats
– Adaptive feedback from IT to Business
– Reusability factor – across different teams
– De-duplication factor
What is Master Data Management
MDM is an organizational approach to managing data as a corporate asset
Application framework and collection of tools implemented with the business goal of providing, managing & using data intelligently and productively
Multi-Domain & Organizational MDM
Domain 1: Product. Master Data: Metadata, Reference Data, Transactional Data
Domain 2: Supply Chain. Master Data: Metadata, Reference Data, Transactional Data
Domain 3: Logistics. Master Data: Metadata, Reference Data, Transactional Data
Universal Ideology..somewhat..
Data Management for Hadoop: Why, At what stage
[Fig 1 and Fig 2, source: TDWI. Recovered chart labels: "Today", "In three years", "Growth Indicator"; values 7%, 4%, 9%; panel title "Primary Strategy to improve Quality of Data managed in Hadoop"]
Traditional MDM Challenges
• Uni-directional movement of data
• Static and limited identification patterns
• Focused mainly on Transactional systems – data type/Hadoop OSS integration limited
• Non-adaptive solutions to rapidly changing “schema”
• Limiting performance
Building “Data Management Layer” for Hadoop Data Lake
Project(s) and Context
• Project initiated at two Large Retailers
• Goal to extend the analytical Data Lake
– As of Late 2015 Data Lake built only for Analytics
– Pulls data from Transactional, ERP, POS systems
– Implemented using ODPi (Pivotal/Hortonworks) Distribution and Greenplum for MPPDB
• Next Generation Data Lake
– Current ETL system reaching performance and scale limits; move ETL into Hadoop
– Move BIDW and Transactional reporting to Hadoop
– Increase users on this system – Security and Quality constraints
– In-store SKU count ~500 million; online SKU count ~5 million
• Complex Master Data Management around existing systems
– For Hadoop – the EIM integration didn’t exist and/or processes were not in place
– Little to no interest from EIM data integration team
Key Problem Statement (at least 1 of them!)
Evaluate and Prototype the Data Management Strategy and Product(s) to enhance and enrich the “Next Generation Data Lake”
High Level Logical Data Architecture
Metadata repository, Policy management and Business Rules Engine
Enterprise Security Framework (AD/LDAP)
Query/Access and Visualization Layer
• API to access data sources
• Interfaces with Metadata Repository to define data query path
• Potentially Custom Portals for User queries as well as standard tools
Access Data Sources
Ingest
Data Sources (Raw)/Aggregated*
• Inventory data
• Logistics
• Product/Vendor Data
Data Fabric/“Landing Zone”
• Processing Framework
• In-Memory Processing
• Object store
• HDFS
• MPP DB
• RDBMS
• NoSQL/NewSQL
Metadata Management
Ingestion and Indexing
Data Management
LEGEND
• Data Ingest to Persistence or memory layer
• Federated query
• Ingest to Metadata management Layer
• Cross reference for Rules, policies and Metadata
Solution Requirements
• Inherent Data processing requirements
• Incoming data from sources e.g. Kafka, Storm, Sqoop, Spark
• Be able to manage complex data types e.g. Video files from POS
• Data placement based on priority and sensitivity – memory or disk
• Handling both Synchronous and Async (In-band and out-of-band)
• Integration with existing EIM tools
• Performance requirements
• Increasing ingest volume of data and expanding sources
• Varied Data Type support and considerations
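The requirement above about placing data in memory or on disk by priority and sensitivity can be sketched as a small policy function. This is only an illustration: the `Dataset` fields, the 1 to 5 priority scale, the threshold, and the rule that sensitive data always lands on disk are assumptions, not part of the project described in these slides.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MEMORY = "memory"
    DISK = "disk"


@dataclass(frozen=True)
class Dataset:
    name: str
    priority: int    # hypothetical scale: 1 (low) to 5 (high)
    sensitive: bool  # hypothetical flag, e.g. set by a classification step


def choose_tier(ds: Dataset, memory_priority_threshold: int = 4) -> Tier:
    """Pick a storage tier from priority and sensitivity.

    Assumed policy (not from the slides): sensitive data always goes to
    disk; otherwise data at or above the priority threshold goes to memory.
    """
    if ds.sensitive:
        return Tier.DISK
    if ds.priority >= memory_priority_threshold:
        return Tier.MEMORY
    return Tier.DISK
```

A real policy engine would likely read these rules from the metadata repository rather than hard-code them.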
File Format Considerations
File format type | Embedded Metadata | Compression & Splittable | HQL/SQL interface viability | Popularity in current and new landscape | Support for Schema evolution
CSV/Text | No | No^ | Hive/Hawq | Most common | Limited
Avro | Yes | Yes | Hive | Increasing footprint | Yes
JSON | Yes | No^ | Hive/MongoDB | Increasing footprint | Yes
RC Files | Limited | Not as well | Hive (RW) | Yes | No
ORC Files | No | Yes | Hive (RW), Impala (R) | Yes | No
Sequence Files (binary format) | No | Yes | Hive (RW), Impala (R) | None today | Limited
Parquet | Yes | Yes | Yes, Hive and Impala | Increasing footprint | Limited
• Read/Write performance
• Source, Application and development effort and support
• Hierarchical model
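One way to put the comparison above to work is to encode it as data and filter on the properties a given pipeline needs. The sketch below copies the table's values as they appear on the slide; the `FormatProfile` type and the `shortlist` function are hypothetical helpers introduced for illustration.

```python
from typing import List, NamedTuple


class FormatProfile(NamedTuple):
    name: str
    embedded_metadata: str
    compression_splittable: str
    sql_interface: str
    schema_evolution: str


# Values copied from the slide's table, not independently verified.
FORMATS = [
    FormatProfile("CSV/Text", "No", "No^", "Hive/Hawq", "Limited"),
    FormatProfile("Avro", "Yes", "Yes", "Hive", "Yes"),
    FormatProfile("JSON", "Yes", "No^", "Hive/MongoDB", "Yes"),
    FormatProfile("RC Files", "Limited", "Not as well", "Hive (RW)", "No"),
    FormatProfile("ORC Files", "No", "Yes", "Hive (RW), Impala (R)", "No"),
    FormatProfile("Sequence Files", "No", "Yes", "Hive (RW), Impala (R)", "Limited"),
    FormatProfile("Parquet", "Yes", "Yes", "Hive and Impala", "Limited"),
]


def shortlist(
    formats: List[FormatProfile],
    need_embedded_metadata: bool = False,
    need_splittable: bool = False,
    need_schema_evolution: bool = False,
) -> List[str]:
    """Return names of formats rated a plain "Yes" on every required property."""
    names = []
    for f in formats:
        if need_embedded_metadata and f.embedded_metadata != "Yes":
            continue
        if need_splittable and f.compression_splittable != "Yes":
            continue
        if need_schema_evolution and f.schema_evolution != "Yes":
            continue
        names.append(f.name)
    return names
```

For example, requiring embedded metadata and splittable compression leaves Avro and Parquet; adding schema evolution leaves only Avro under the slide's ratings.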
Key Evaluation and Selection Criteria
Initial Challenge
• Too many tools to choose from
• Each claimed to be a Metadata management tool
• Each claimed security and integration features
• Resistance from the EIM team when initially involved
• Translating Data Management Ideology into evaluation tasks
Project Approach
• Build a list of KPIs to evaluate tools
• Working with the EIM team (best-practices advice & SME engagement), and with the business and IT teams supporting the Data Lake project
• Vendor Identification – List of 5
• Implementation
• Minimized scope of project
• Decided to tackle integration with legacy EIM at a later date
• After evaluation, focused on implementing no more than 2 data management tools for the Next-Gen Data Fabric Platform
Metadata Curation and Management (1/2)
Data Profiling
• Define Business Metadata (is reference data available within the tool or outside?)
• Automation and flexibility in crawling HDFS and understanding the various formats
– Range of file formats supported
– Reading each file to extract metadata
– Both for data already persisted and for new files arriving in real time
– Cross reference with lookup or repository for pre-existing classes and profiles
– Maturity of attaching context or facet to the atomic data
– Ability to retrieve descriptive and structural metadata even with no metadata within the content
• Storing the profiled data – actual data and metadata in a repository
• Custom tagging as well as recognizing metadata information
• Translation and integration with industry certification and models
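To make the crawling criterion concrete, here is a minimal sketch of a crawler that walks a directory tree and records technical metadata for each file, identifying the format from its leading magic bytes rather than its extension. It is not how any of the evaluated tools work. A local directory stands in for HDFS (an assumption made so the example is self-contained), and `sniff_format` and `crawl` are hypothetical names.

```python
import os
from typing import Dict, Iterator

# Leading magic bytes of some common Hadoop file formats.
MAGIC = {
    b"PAR1": "parquet",      # Parquet files start (and end) with PAR1
    b"Obj\x01": "avro",      # Avro object container file header
    b"ORC": "orc",           # ORC file header
    b"SEQ": "sequencefile",  # Hadoop SequenceFile header
}


def sniff_format(path: str) -> str:
    """Guess a file's format from its first four bytes."""
    with open(path, "rb") as fh:
        head = fh.read(4)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return "unknown"


def crawl(root: str) -> Iterator[Dict[str, object]]:
    """Yield one technical-metadata record per file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in sorted(filenames):
            path = os.path.join(dirpath, fname)
            stat = os.stat(path)
            yield {
                "path": path,
                "format": sniff_format(path),
                "size_bytes": stat.st_size,
                "modified": stat.st_mtime,
            }
```

Text formats such as CSV and JSON have no magic bytes and come back as "unknown" here; profiling them would need content sampling, which is where the tools' own repositories of classes and profiles come in.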
Metadata Curation and Management (2/2)
Data Classification
• Ability to classify data based on user-defined categories
– Search/crawl and identification (“Facet Finder”) and efficiency of internal repository
– Presence of data models, if any
– Features around custom metadata and tagging
• Once classified, ability for metadata information to be indexed and searchable through API or web interfaces
– Efficiency of search and indexing
– Richness of integration with NLPTK
• Data remediation
• Data archiving and policy implementation
• Notification: configurable triggers based on user-defined criteria
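Classification by user-defined categories, followed by indexing for search, can be illustrated with a small rule-based tagger. The sketch assumes categories are expressed as regular expressions over sampled column values; the rule set, the 80% match threshold, and the function names are all hypothetical and far simpler than a commercial facet finder.

```python
import re
from typing import Dict, Iterable, List, Pattern

# Hypothetical user-defined categories, each a regex over a single value.
RULES: Dict[str, Pattern[str]] = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "sku": re.compile(r"^SKU-\d{6,}$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}


def classify_column(values: Iterable[str], min_match_ratio: float = 0.8) -> List[str]:
    """Return every tag whose pattern matches enough of the non-empty values."""
    sample = [v for v in values if v]
    if not sample:
        return []
    tags = []
    for tag, pattern in RULES.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= min_match_ratio:
            tags.append(tag)
    return tags


def build_index(columns: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build an inverted index from tag to the column names carrying it."""
    index: Dict[str, List[str]] = {}
    for column, values in columns.items():
        for tag in classify_column(values):
            index.setdefault(tag, []).append(column)
    return index
```

The inverted index is what a search API or web interface would query, for example to list every column tagged `email` before applying a remediation or archiving policy.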
Lineage and Versioning
• Be able to identify the origin of data – i.e. from
– Transactional systems, Dump files, Another HDFS file, Repository etc.
– Level of depth of data origination and lineage
• Ability of the solution to sense and preserve Metadata Versions around a given entity
during Capture process and post
• Ability to support Deduplication with the Entity’s metadata
– On the fly without impacting the performance
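The versioning and deduplication criteria above can be sketched with content fingerprints: hash a canonical form of an entity's metadata, and only store a new version when the hash has not been seen. This is an assumed approach for illustration, not one attributed to any evaluated vendor; `EntityHistory` and its fields are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def fingerprint(metadata: dict) -> str:
    """Hash a canonical JSON form so that key order does not matter."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class EntityHistory:
    entity_id: str
    source: Optional[str] = None  # origin, e.g. a transactional system or another HDFS file
    versions: List[dict] = field(default_factory=list)
    _seen: Dict[str, int] = field(default_factory=dict)  # fingerprint -> version number

    def record(self, metadata: dict) -> int:
        """Store metadata as a new version unless an identical one exists.

        Returns the 1-based version number holding this metadata.
        """
        fp = fingerprint(metadata)
        if fp in self._seen:
            return self._seen[fp]
        self.versions.append(metadata)
        self._seen[fp] = len(self.versions)
        return len(self.versions)
```

Because the lookup is a dictionary hit on a fixed-size hash, the duplicate check adds little cost per capture, which is the spirit of the "on the fly" requirement. Deeper lineage would chain `source` references from one entity to the next.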
Integration
• Ability to integrate its meta store with enterprise MDM /EIM systems
– Maturity of Metadata Entity Readers (Input/Output) Artifacts from Metastore
– Bi-directional API for other tool integration to identify lineage
– Bi-directional API for other tool integration for SIEM threat assessment and detection
– While maintaining user and security context
• Integration with the various tools of Ingestion, Transformation & Consumption
– Spark, Storm, Kafka, Informatica, DataStage etc.
• Integration with security tools – LDAP, ACLs, encryption
• Rules and Policy engine
Performance, Accuracy and Ease of Use
• Sample visualization of Metadata with Native Reporting tools & others
• Ability to process compressed and encrypted files
• Level of Error and exception handling built in during all processes
• Impact on performance from
– Crawling, scanning and profiling
– Classification & transformation
• Enable notifications of data availability - how customizable are they?
• Self-service discovery portal leveraging curated artifacts
Some of the notable Vendors evaluated
• Attivio
• Global ID
• Waterline Data
• Zaloni
• Adaptive Inc.
At the time of this study, Falcon and Ranger were new, so little analysis was done on these products.
Vendor Evaluation Scoreboard (Template)
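The scoreboard itself did not survive extraction, so the following is only a sketch of what a weighted KPI matrix of this kind might look like. The four criterion groups follow the evaluation categories in this deck; the weights, the 1 to 5 scale, and the vendor names and scores are placeholders, not results from the study.

```python
from typing import Dict, List

# Hypothetical weights per criterion group; they should sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "metadata_curation": 0.4,
    "lineage_versioning": 0.2,
    "integration": 0.2,
    "performance_ease_of_use": 0.2,
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (assumed 1 to 5) into one weighted total."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[k] * scores[k] for k in weights)


def rank(vendors: Dict[str, Dict[str, float]]) -> List[str]:
    """Return vendor names ordered from highest to lowest weighted score."""
    return sorted(vendors, key=lambda name: weighted_score(vendors[name]), reverse=True)


# Placeholder data for illustration only.
EXAMPLE = {
    "Vendor A": {"metadata_curation": 4, "lineage_versioning": 3,
                 "integration": 2, "performance_ease_of_use": 5},
    "Vendor B": {"metadata_curation": 3, "lineage_versioning": 5,
                 "integration": 5, "performance_ease_of_use": 3},
}
```

Making the weights explicit is the point: two teams with different priorities can reuse the same scores and see how the ranking shifts.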
Vendor Evaluation Summary Results
Vendors per criterion, in the order shown on the slide:
Metadata curation and management: Attivio, Global ID, Waterline Data, Zaloni
Lineage and versioning: Global ID, Attivio, Waterline Data, Zaloni
Integration: Zaloni, Attivio, Global ID, Waterline Data
Performance, Accuracy and Ease of use: Attivio, Zaloni, Waterline Data, Global ID
Evaluation Summary
• All tools had satisfactory features overall, with emphasis in 1 or 2 areas.
• Your choice of tools needs to align with Business and User Requirements
• Waterline: Automated data discovery, self-service
• Attivio: Data Curation – Discovery, Search, flexibility of tagging, performant and scalable
• Global ID: Efficient in mapping logical models, overlapping data identification and pattern matching
• Zaloni: Had notable interface for Data mapping and flow, integration with external tools
CAVEAT: Based on criteria driven by customer needs. Your evaluation and updates from vendors will affect results.
High Level Logical Data Architecture
Metadata repository, Policy management and Business Rules Engine
Enterprise Security Framework (AD/LDAP)
Custom Portal/other evaluations (TBD)
• API to access data sources
• Interfaces with Metadata Repository to define data query path
• Potentially Custom Portals for User queries as well as standard tools
Access Data Sources
Flume/Kafka/SpringXD
Data Sources (Raw)/Aggregated*
• Inventory data
• Logistics
• Product/Vendor Data
Data Fabric/“Landing Zone”
• Processing Framework
• Apache Spark/GemFire
• Object store
• HDFS
• MPP DB
• RDBMS
• NoSQL/NewSQL
Metadata Management
• Attivio
• Global ID
LEGEND
• Data Ingest to Persistence or memory layer
• Federated query
• Ingest to Metadata management Layer
• Cross reference for Rules, policies and Metadata
Key Takeaways
• Market
− Metadata management tools in the market are still evolving for Data Lake architectures
− Ever-growing and rich partner ecosystem
− Hadoop does not offer a sufficient policy engine or action framework
• Customer
− Choice of tool is IT and business driven. Sponsorship is important!
− To drive adoption, an easy-to-use and intuitive product is a must
− Balancing multi-vendor and functionality: limit number of tools to 3
− Recommendation to use Information Management Professional Services with selected tool(s)
PROCESS
• Evaluation of the tools
− Reviews and demo of the tools versus a full-fledged POC
− Build an adaptive matrix of KPI measurements, customized to your organization; unless quantified, evaluation would be very subjective
• Beware of the trap: analysis paralysis
− Multiple business units drive this decision
− Functionality scope: workflows, ETL processes and integration, or pure-play data management
− Integration with existing EIM tools was delayed as a priority: huge part of the success
• Investment/Cost: Existing tools, Level of Effort and implementation
References
• References to the following documents were made
– TDWI- Hadoop for enterprise
– MDM institute
• Acknowledgements from the following authors and additional work
– EMC IT Team
– Customer’s IT team for Prototyping along with EMC Field resources
37
Navigating the World of User Data Management and Data Discovery
Ad

More Related Content

What's hot (20)

Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
DataWorks Summit
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
Mark Rittman
 
Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks
DataWorks Summit/Hadoop Summit
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
DataWorks Summit
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
DataWorks Summit/Hadoop Summit
 
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache AtlasGDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
DataWorks Summit
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
DataWorks Summit/Hadoop Summit
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
DataWorks Summit/Hadoop Summit
 
Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
DataWorks Summit/Hadoop Summit
 
Data-In-Motion Unleashed
Data-In-Motion UnleashedData-In-Motion Unleashed
Data-In-Motion Unleashed
DataWorks Summit
 
Solving Performance Problems on Hadoop
Solving Performance Problems on HadoopSolving Performance Problems on Hadoop
Solving Performance Problems on Hadoop
Tyler Mitchell
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaExtend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
DataWorks Summit/Hadoop Summit
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
DataWorks Summit/Hadoop Summit
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data Lakes
Kiran Kamreddy
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
NoSQLmatters
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Denodo
 
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Rittman Analytics
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
DataWorks Summit
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
Mark Rittman
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
DataWorks Summit
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
DataWorks Summit/Hadoop Summit
 
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache AtlasGDPR-focused partner community showcase for Apache Ranger and Apache Atlas
GDPR-focused partner community showcase for Apache Ranger and Apache Atlas
DataWorks Summit
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic
 
Solving Performance Problems on Hadoop
Solving Performance Problems on HadoopSolving Performance Problems on Hadoop
Solving Performance Problems on Hadoop
Tyler Mitchell
 
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & TrifactaExtend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
Extend Governance in Hadoop with Atlas Ecosystem: Waterline, Attivo & Trifacta
DataWorks Summit/Hadoop Summit
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
DataWorks Summit/Hadoop Summit
 
Data Governance for Data Lakes
Data Governance for Data LakesData Governance for Data Lakes
Data Governance for Data Lakes
Kiran Kamreddy
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
NoSQLmatters
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
DataWorks Summit
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Denodo
 
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Rittman Analytics
 

Viewers also liked (20)

Deploying a Governed Data Lake
Deploying a Governed Data LakeDeploying a Governed Data Lake
Deploying a Governed Data Lake
WaterlineData
 
IDC Report - Unified Information Access on a Solid Search Base
IDC Report - Unified Information Access on a Solid Search BaseIDC Report - Unified Information Access on a Solid Search Base
IDC Report - Unified Information Access on a Solid Search Base
Attivio
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
Murtaza Doctor
 
Big Data Applications
Big Data ApplicationsBig Data Applications
Big Data Applications
Joseph Zwicker
 
Elephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopElephant grooming: quality with Hadoop
Elephant grooming: quality with Hadoop
Roman Nikitchenko
 
Hadoop do data warehousing rules apply
Hadoop do data warehousing rules applyHadoop do data warehousing rules apply
Hadoop do data warehousing rules apply
DataWorks Summit
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
Inside Analysis
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
DataWorks Summit/Hadoop Summit
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop Clusters
DataWorks Summit/Hadoop Summit
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed Hadoop
DataWorks Summit
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it
DataWorks Summit/Hadoop Summit
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
DataWorks Summit/Hadoop Summit
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Reinventing the Modern Information Pipeline: Paxata and MapR
Reinventing the Modern Information Pipeline: Paxata and MapRReinventing the Modern Information Pipeline: Paxata and MapR
Reinventing the Modern Information Pipeline: Paxata and MapR
Lilia Gutnik
 
Extreme Analytics @ eBay
Extreme Analytics @ eBayExtreme Analytics @ eBay
Extreme Analytics @ eBay
DataWorks Summit/Hadoop Summit
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
DataWorks Summit/Hadoop Summit
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the Cloud
DataWorks Summit/Hadoop Summit
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
Self-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons LearnedSelf-Service Analytics on Hadoop: Lessons Learned
Self-Service Analytics on Hadoop: Lessons Learned
DataWorks Summit/Hadoop Summit
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
RTTS
 
Deploying a Governed Data Lake
Deploying a Governed Data LakeDeploying a Governed Data Lake
Deploying a Governed Data Lake
WaterlineData
 
IDC Report - Unified Information Access on a Solid Search Base
IDC Report - Unified Information Access on a Solid Search BaseIDC Report - Unified Information Access on a Solid Search Base
IDC Report - Unified Information Access on a Solid Search Base
Attivio
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
Murtaza Doctor
 
Elephant grooming: quality with Hadoop
Elephant grooming: quality with HadoopElephant grooming: quality with Hadoop
Elephant grooming: quality with Hadoop
Roman Nikitchenko
 
Hadoop do data warehousing rules apply
Hadoop do data warehousing rules applyHadoop do data warehousing rules apply
Hadoop do data warehousing rules apply
DataWorks Summit
 
Hadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality ChallengeHadoop 2.0 - Solving the Data Quality Challenge
Hadoop 2.0 - Solving the Data Quality Challenge
Inside Analysis
 
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC IsilonImproving Hadoop Resiliency and Operational Efficiency with EMC Isilon
Improving Hadoop Resiliency and Operational Efficiency with EMC Isilon
DataWorks Summit/Hadoop Summit
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop Clusters
DataWorks Summit/Hadoop Summit
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed Hadoop
DataWorks Summit
 
What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it What the #$* is a Business Catalog and why you need it
What the #$* is a Business Catalog and why you need it
DataWorks Summit/Hadoop Summit
 
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with...
DataWorks Summit/Hadoop Summit
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Reinventing the Modern Information Pipeline: Paxata and MapR
Reinventing the Modern Information Pipeline: Paxata and MapRReinventing the Modern Information Pipeline: Paxata and MapR
Reinventing the Modern Information Pipeline: Paxata and MapR
Lilia Gutnik
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the Cloud
DataWorks Summit/Hadoop Summit
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
Big Data Testing : Automate theTesting of Hadoop, NoSQL & DWH without Writing...
RTTS
 
Ad

Navigating the World of User Data Management and Data Discovery

  • 1. NAVIGATING THE WORLD OF USER DATA MANAGEMENT AND DATA DISCOVERY SMITI SHARMA, VIRTUSTREAM - EMC
  • 2. About Me
      – Principal Engineer & Lead, Big Data Cloud, Virtustream
      – Previous roles: Oracle (Principal Engineer & PM team); EMC (Big Data Lead); Pivotal (Global CTO); Virtustream EMC
      – Areas of expertise
          – Architecting, developing and managing mission-critical transactional and analytics platforms
          – RDBMS & NoSQL, Hadoop platforms
          – Product management and development
          – ODPi RT member
      – Sessions at Hadoop Summit 2016
          – Wed 11:30am, Room 211: User Data Management in HD
          – Wed 4:10pm, Room 212: Building PAAS for Smart Cities
      – Contact: Smiti.sharma@virtustream.com, @smiti_sharma
  • 3. About Virtustream
      – Enterprise-class cloud solutions: cloud IaaS and PaaS, cloud software solutions, cloud managed services, cloud professional services
      – Developer of the xStream Cloud Management Platform software
      – Inventor of the µVM™ (MicroVM) cloud technology
      – Industry-leading cloud offers for SAP landscapes and HANA, storage, and big data
      – Close partnerships with SAP, EMC, and VMware
      – Service provider to 2,000 global workloads
      – Global footprint
  • 4. Agenda
      – Data Management Overview
      – Project Background & Context
      – High-Level Architecture and Data Flow
      – Solution Criteria
      – Evaluation Criteria of Data Management Tools
      – Differentiation Factors
      – Proposed Solution
      – Conclusions
  • 6. User Data Landscape
      – Master Data Management
      – Metadata: business, technical, operational
      – Reference data
      – Transactional data
  • 7. Driving Factors for Data Management
      – IT as custodian of business data
      – Data characteristics
          – Business value (analytical vs. transactional systems)
          – Volume and volatility
          – Complexity of data types and formats
          – Adaptive feedback from IT to the business
          – Reusability factor across different teams
          – De-duplication factor
  • 8. What Is Master Data Management
      – MDM is an organizational approach to managing data as a corporate asset
      – An application framework and collection of tools implemented with the business goal of providing, managing, and using data intelligently and productively
  • 9. Multi-Domain & Organizational MDM
      – Domain 1: Product. Master data, metadata, reference data, transactional data
      – Domain 2: Supply Chain. Master data, metadata, reference data, transactional data
      – Domain 3: Logistics. Master data, metadata, reference data, transactional data
  • 11. Data Management for Hadoop: Why, and at What Stage
      – [Fig. 1 (source: TDWI): Primary strategy to improve the quality of data managed in Hadoop]
      – [Fig. 2 (source: TDWI): Today vs. in three years, with growth indicator; values shown on the slide: 7%, 4%, 9%]
  • 12. Traditional MDM Challenges
      – Uni-directional movement of data
      – Static and limited identification patterns
      – Focused mainly on transactional systems; data type and Hadoop OSS integration is limited
      – Solutions do not adapt to a rapidly changing “schema”
      – Limiting performance
  • 14. Project(s) and Context
      – Project initiated at two large retailers
      – Goal: extend the analytical Data Lake
          – As of late 2015, the Data Lake was built only for analytics
          – Pulls data from transactional, ERP, and POS systems
          – Implemented using an ODPi (Pivotal/Hortonworks) distribution and Greenplum as the MPP database
      – Next-generation Data Lake
          – Current ETL system is reaching performance and scale limits, so move ETL into Hadoop
          – Move BI/DW and transactional reporting to Hadoop
          – Increase the number of users on this system
          – Security and quality constraints
          – In-store SKU count ~500 million; online SKU count ~5 million
      – Complex Master Data Management around existing systems
          – For Hadoop, the EIM integration didn’t exist and/or processes were not in place
          – Little to no interest from the EIM data integration team
  • 15. Key Problem Statement (at least one of them!): Evaluate and prototype the data management strategy and product(s) to enhance and enrich the “Next Generation Data Lake”
  • 16. High-Level Logical Data Architecture (diagram components)
      – Metadata repository, policy management, and business rules engine
      – Enterprise security framework (AD/LDAP)
      – Query/access and visualization layer
          – API to access data sources
          – Interfaces with the metadata repository to define the data query path
          – Potentially custom portals for user queries, as well as standard tools
      – Access to data sources
      – Ingest
      – Data sources (raw/aggregated*): inventory data, logistics, product/vendor data
      – Data fabric / “landing zone”: processing framework, in-memory processing, object store, HDFS, MPP DB, RDBMS, NoSQL/NewSQL
      – Legend: data ingest to persistence or memory layer; federated query; ingest to metadata management layer; cross-reference for rules, policies, and metadata
      – Layer labels: Metadata Management; Ingestion and Indexing; Data Management
  • 17. Solution Requirements
      – Inherent data processing requirements
          – Incoming data from sources such as Kafka, Storm, Sqoop, and Spark
          – Ability to manage complex data types, e.g. video files from POS
          – Data placement based on priority and sensitivity: memory or disk
          – Handling both synchronous and asynchronous (in-band and out-of-band) processing
          – Integration with existing EIM tools
      – Performance requirements
          – Increasing ingest volume and expanding sources
          – Varied data type support and considerations
  • 18. File Format Considerations
      File format type | Embedded metadata | Compression & splittable | HQL/SQL interface viability | Popularity in current and new landscape | Support for schema evolution
      CSV/Text | No | No^ | Hive/HAWQ | Most common | Limited
      Avro | Yes | Yes | Hive | Increasing footprint | Yes
      JSON | Yes | No^ | Hive/MongoDB | Increasing footprint | Yes
      RC Files | Limited | Not as well | Hive (RW) | Yes | No
      ORC Files | No | Yes | Hive (RW), Impala (R) | Yes | No
      Sequence Files (binary format) | No | Yes | Hive (RW), Impala (R) | None today | Limited
      Parquet | Yes | Yes | Yes (Hive and Impala) | Increasing footprint | Limited
      Additional considerations:
      – Read/write performance
      – Source, application, and development effort and support
      – Hierarchical model
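A comparison like the one on the file-format slide can be encoded as data and filtered against a project's requirements. The sketch below is only an illustration: the `FormatProfile` type and `shortlist` helper are hypothetical names, the values are copied from the slide's table (with the undefined "^" footnote marker dropped), and exact string matching is an assumed simplification.

```python
from typing import NamedTuple


class FormatProfile(NamedTuple):
    """One row of the slide's comparison table (hypothetical container type)."""
    name: str
    embedded_metadata: str
    compression_splittable: str
    sql_interface: str
    schema_evolution: str


# Values transcribed from the slide; "No^" is recorded as "No" because the
# footnote behind "^" is not defined in the deck.
FORMATS = [
    FormatProfile("CSV/Text", "No", "No", "Hive/HAWQ", "Limited"),
    FormatProfile("Avro", "Yes", "Yes", "Hive", "Yes"),
    FormatProfile("JSON", "Yes", "No", "Hive/MongoDB", "Yes"),
    FormatProfile("RC Files", "Limited", "Not as well", "Hive (RW)", "No"),
    FormatProfile("ORC Files", "No", "Yes", "Hive (RW), Impala (R)", "No"),
    FormatProfile("Sequence Files", "No", "Yes", "Hive (RW), Impala (R)", "Limited"),
    FormatProfile("Parquet", "Yes", "Yes", "Hive and Impala", "Limited"),
]


def shortlist(formats, **required):
    """Return names of formats whose attributes equal every required value."""
    return [
        f.name
        for f in formats
        if all(getattr(f, attr) == value for attr, value in required.items())
    ]


if __name__ == "__main__":
    # Formats that carry their own metadata and remain splittable when compressed.
    print(shortlist(FORMATS, embedded_metadata="Yes", compression_splittable="Yes"))
```

Keeping the table as data makes it easy to re-run the shortlist when a requirement changes, for example when schema evolution becomes mandatory.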
  • 20. Key Evaluation and Selection Criteria
  • 21. Initial Challenge
      – Too many tools to choose from
          – Each claimed to be a metadata management tool
          – Each claimed security and integration features
      – Resistance from the EIM team when initially involved
      – Translating data management ideology into evaluation tasks
  • 22. Project Approach • Build a list of KPIs to evaluate tools • Work with the EIM team (best-practice advice and SME engagement), with business and IT teams supporting the data lake project • Vendor identification – list of 5 • Implementation • Minimized scope of project • Decided to tackle integration with legacy EIM at a later date • After evaluation, focused on implementing no more than 2 data management tools for the Next-Gen Data Fabric Platform
  • 23. Metadata Curation and Management (1/2): Data Profiling • Define business metadata (is reference data available within the tool or outside it?) • Automation and flexibility in crawling HDFS and understanding the various formats – Range of file formats supported – Reading each file to extract metadata – Both for data already persisted and for new files arriving in real time – Cross-reference with a lookup or repository for pre-existing classes and profiles – Maturity of attaching context or facets to the atomic data – Ability to retrieve descriptive and structural metadata even when the content carries no metadata • Storing the profiled data – actual data and metadata – in a repository • Custom tagging as well as recognizing metadata information • Translation and integration with industry certifications and models
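The profiling criterion above asks for structural metadata even when a file embeds none, as with CSV/text. The following is a minimal Python sketch of that idea, not the method of any evaluated tool. It assumes the CSV has a header row; the function names `infer_type` and `profile_csv` and the coarse type labels are hypothetical.

```python
import csv
import io


def infer_type(values):
    """Infer a coarse column type from sample string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    # Try the narrowest type first; fall back to string if no cast fits.
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text, sample_rows=100):
    """Return structural metadata for CSV text (assumes a header row)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Sample at most `sample_rows` data rows to bound the profiling cost.
    rows = [row for _, row in zip(range(sample_rows), reader)]
    columns = []
    for i, name in enumerate(header):
        values = [row[i] for row in rows if i < len(row)]
        columns.append({
            "name": name,
            "type": infer_type(values),
            "null_count": sum(1 for v in values if v == ""),
        })
    return {
        "column_count": len(header),
        "sampled_rows": len(rows),
        "columns": columns,
    }
```

Sampling a bounded number of rows keeps the crawl cheap, at the price of possibly mislabeling a column whose later rows differ from the sample.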
  • 24. Metadata Curation and Management (2/2): Data Classification • Ability to classify data based on user-defined categories – Search/crawl and identification ("Facet Finder") and efficiency of the internal repository – Presence of data models, if any – Features around custom metadata and tagging • Once classified, ability for metadata information to be indexed and searchable through API or web interfaces – Efficiency of search and indexing – Richness of integration with NLPTK • Data remediation • Data archiving and policy implementation • Notification: configurable triggers based on user-defined criteria
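Classification by user-defined categories, followed by indexing so that tags become searchable, can be sketched in a few lines of Python. This is an illustration only: the categories, regular expressions, 80% match threshold, and the names `classify_column` and `build_tag_index` are all assumptions, not features of the tools discussed.

```python
import re

# Hypothetical user-defined categories; the patterns are illustrative only.
CATEGORY_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def classify_column(values, patterns=CATEGORY_PATTERNS, threshold=0.8):
    """Return sorted tags whose pattern matches at least `threshold`
    of the non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    tags = []
    for tag, pattern in patterns.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            tags.append(tag)
    return sorted(tags)


def build_tag_index(columns):
    """Invert {column: values} into {tag: [columns]} so tags are searchable."""
    index = {}
    for column, values in columns.items():
        for tag in classify_column(values):
            index.setdefault(tag, []).append(column)
    return index
```

The threshold tolerates a few dirty values per column; a real deployment would tune it per category and likely combine patterns with reference-data lookups.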
  • 25. Lineage and Versioning • Ability to identify the origin of data, i.e. from transactional systems, dump files, another HDFS file, a repository, etc. – Level of depth of data origination and lineage • Ability of the solution to sense and preserve metadata versions for a given entity during and after the capture process • Ability to support deduplication with the entity's metadata – On the fly, without impacting performance
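One common way to combine metadata versioning with on-the-fly deduplication is to fingerprint each captured metadata record and create a new version only when the fingerprint changes. The Python sketch below assumes metadata is a JSON-serializable dict; the class `MetadataVersionStore` and its in-memory layout are hypothetical and stand in for a real repository.

```python
import hashlib
import json


def fingerprint(metadata):
    """Hash a metadata dict in canonical form so key order does not matter."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class MetadataVersionStore:
    """In-memory store that keeps one version list per entity."""

    def __init__(self):
        self._versions = {}  # entity id -> list of (fingerprint, metadata)

    def capture(self, entity_id, metadata):
        """Record metadata and return the 1-based version number it maps to."""
        history = self._versions.setdefault(entity_id, [])
        digest = fingerprint(metadata)
        if history and history[-1][0] == digest:
            # Unchanged since the last capture: deduplicate, no new version.
            return len(history)
        history.append((digest, dict(metadata)))
        return len(history)

    def history(self, entity_id):
        """Return all preserved metadata versions for an entity, oldest first."""
        return [m for _, m in self._versions.get(entity_id, [])]
```

Comparing a fixed-size digest rather than the full record keeps the duplicate check cheap during capture, which is the performance concern the slide raises.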
  • 26. Integration • Ability to integrate its metastore with enterprise MDM/EIM systems – Maturity of metadata entity readers (input/output), artifacts from the metastore – Bi-directional API for other tool integration to identify lineage – Bi-directional API for other tool integration for SIEM threat assessment and detection – While maintaining user and security context • Integration with the various tools for ingestion, transformation and consumption – Spark, Storm, Kafka, Informatica, DataStage, etc. • Integration with security tools – LDAP, ACLs, encryption • Rules and policy engine
  • 27. Performance, Accuracy and Ease of Use • Sample visualization of Metadata with Native Reporting tools & others • Ability to process compressed and encrypted files • Level of Error and exception handling built in during all processes • Impact on performance from – Crawling, scanning and profiling – Classification & transformation • Enable notifications of data availability - how customizable are they? • Self-service discovery portal leveraging curated artifacts
  • 28. Some of the notable vendors evaluated • Attivio • Global ID • Waterline Data • Zaloni • Adaptive Inc. Note: At the time of this study, Falcon and Ranger were new, and little analysis of these products was done.
  • 31. Evaluation Summary
    Tools per evaluation category, in the order listed on the slide:
    Metadata curation and management: Attivio, Global ID, Waterline Data, Zaloni
    Lineage and versioning: Global ID, Attivio, Waterline Data, Zaloni
    Integration: Zaloni, Attivio, Global ID, Waterline Data
    Performance, accuracy and ease of use: Attivio, Zaloni, Waterline Data, Global ID
    • All tools had satisfactory features overall, with emphasis in 1 or 2 areas • Your choice of tools needs to align with business and user requirements • Waterline: automated data discovery, self-service • Attivio: data curation – discovery, search, flexibility of tagging, performant and scalable • Global ID: efficient in mapping logical models, overlapping data identification and pattern matching • Zaloni: had a notable interface for data mapping and flow, integration with external tools
    CAVEAT: Based on criteria driven by customer needs. Your evaluation and updates from vendors will affect results.
  • 32. High Level Logical Data Architecture (diagram) • Metadata repository, policy management and business rules engine • Enterprise security framework (AD/LDAP) • Query/access and visualization layer – API to access data sources – Interfaces with metadata repository to define data query path – Potentially custom portals for user queries as well as standard tools • Ingest • Data sources (raw/aggregated*) – Inventory data, logistics, product/vendor data • Data fabric/"landing zone" – Processing framework, in-memory processing, object store, HDFS, MPP DB, RDBMS, NoSQL/NewSQL • Metadata management – Ingestion and indexing, data management • Legend: data ingest to persistence or memory layer; federated query; ingest to metadata management layer; cross reference for rules, policies and metadata
  • 33. High Level Logical Data Architecture (diagram, with selected products) • Metadata repository, policy management and business rules engine • Enterprise security framework (AD/LDAP) • Custom portal/other evaluations (TBD) – API to access data sources – Interfaces with metadata repository to define data query path – Potentially custom portals for user queries as well as standard tools • Ingest: Flume/Kafka/SpringXD • Data sources (raw/aggregated*) – Inventory data, logistics, product/vendor data • Data fabric/"landing zone" – Processing framework: Apache Spark/GemFire – Object store, HDFS, MPP DB, RDBMS, NoSQL/NewSQL • Metadata management: Attivio, Global ID • Legend: data ingest to persistence or memory layer; federated query; ingest to metadata management layer; cross reference for rules, policies and metadata
  • 35. Key Takeaways • Market – Metadata management tools in the market are still evolving for data lake architectures – Ever-growing and rich partner ecosystem – Hadoop does not offer a sufficient policy engine or action framework • Customer – Choice of tool is IT and business driven; sponsorship is important – To drive adoption, an easy-to-use and intuitive product is a must – Balancing multi-vendor and functionality: limit the number of tools to 3 – Recommendation to use information management professional services with the selected tool(s)
  • 36. Key Takeaways: Process • Evaluation of the tools – Reviews and demos of the tools versus a full-fledged POC – Build an adaptive matrix of KPI measurements, customized to your organization; unless quantified, evaluation would be very subjective • Beware of the analysis-paralysis trap – Multiple business units drive this decision – Functionality scope: workflows, ETL processes and integration, or pure-play data management – Integration with existing EIM tools was delayed as a priority: a huge part of the success • Investment/cost: existing tools, level of effort and implementation
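The "adaptive matrix of KPI measurements" can be as simple as a weighted scoring table. Below is a minimal Python sketch using the four evaluation categories from the deck as KPIs. The weights, the 1-5 scale, the tool names "Tool A" and "Tool B", and every score are placeholders for illustration, not results from this study.

```python
# Hypothetical weights per evaluation category (they should sum to 1.0).
WEIGHTS = {
    "metadata_curation": 0.4,
    "lineage_versioning": 0.2,
    "integration": 0.2,
    "performance_ease_of_use": 0.2,
}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted sum of 1-5 category scores for one tool."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[c] * scores[c] for c in weights)


def rank_tools(score_table, weights=WEIGHTS):
    """Rank tools from highest to lowest weighted score."""
    totals = {tool: weighted_score(s, weights) for tool, s in score_table.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder scores, not measurements from the evaluation.
EXAMPLE = {
    "Tool A": {"metadata_curation": 5, "lineage_versioning": 3,
               "integration": 3, "performance_ease_of_use": 4},
    "Tool B": {"metadata_curation": 3, "lineage_versioning": 5,
               "integration": 4, "performance_ease_of_use": 3},
}
```

Calling `rank_tools(EXAMPLE)` orders the tools by total; changing `WEIGHTS` is how the matrix adapts to an organization whose priorities differ, which is why the same tools can rank differently for different customers.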
  • 37. References • References to the following documents were made – TDWI: Hadoop for enterprise – MDM Institute • Acknowledgements to the following authors and for additional work – EMC IT team – Customer's IT team for prototyping, along with EMC field resources