SlideShare a Scribd company logo
Copyright 2011 Trend Micro Inc. 1
Bytewise Approximate Match: Theory,
Algorithms and Applications
Liwei Ren, Ph.D, Trend Micro
EISIC 2015, Manchester, UK, Sept, 2015
Trend Micro
Copyright 2011 Trend Micro Inc.
Agenda
• Background
• Byte-wise Approximate Matching : 6 Cases
• A Framework : Theory, Algorithms, Technologies
• A Few Algorithms and Analysis
• Practical Applications
• Q & A
Classification 9/10/2015 2
Copyright 2011 Trend Micro Inc.
Background
• A Problem in DLP (Data Loss Prevention):
– In 2005, when designing the DLP system for my startup, I had to
solve this problem:
– S = {d1, d2,…, dn} is a bag of sensitive documents. Given any
document T and 0<δ≤1, find a document d ∊ S such that RLV(d,T)≥
δ.
• where RLV(s,t) is a function to measure the relevance of two documents.
• Two challenges: how to construct RLV ? How to make search scalable?
Classification 9/10/2015 3
Copyright 2011 Trend Micro Inc.
Background
• In early 2014, I studied fuzzy hashing :
– a family of similarity preserving hashing techniques & tools
– Example: TLSH, ssdeep, sdhash
– Problem: Given two binary strings s1 and s2, measure the similarity
by SIM(H(s1), H(s2)) = s.
• H is a hash function that preserves string similarity.
• SIM is a function to measure similarity of two hash values
• A challenge: how to evaluate pros & cons between them?
Classification 9/10/2015 4
Copyright 2011 Trend Micro Inc.
Background
• In early 2014, the NIST specification document NIST.SP.800-168
introduces a novel concept of approximate matching :
– To replace the concept of binary similarity matching.
– Four use cases are used to describe this concept:
• Similarity Detection: to identify different versions of a document.
• Cross Correlation: to identify a common object between two documents.
• Embedded Object Detection: to identify a given object inside a document.
• Fragment Detection: to identify the presence of traces/fragments of a
known document in network stream.
• In 2013, I noticed that eDiscovery has a near deduplication
problem that needs to group similar documents together.
5
Copyright 2011 Trend Micro Inc.
Background
• I started this effort in 2014:
6
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
• We extend these NIST cases to 6 cases due to our practice in
DLP & malware analysis.
– that can be described rigorously.
• Conceptual description :
7
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
8
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
9
Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching : 6 Cases
• Intuitive description with binary strings:
10
Copyright 2011 Trend Micro Inc.
Byte-wise Approximate Match: A Rigorous Definition
• Let us start with a few concepts:
– A string s is β-nontrivial if Len(s) ≥β.
• In practice, set β=64.
• This is excluding the triviality of substrings such as substrings of a few bytes.
– Let SS(β)={ s | string s is β-nontrivial } !!!
– SSIM(s1, s2) measures similarity between two nontrivial strings.
• Definition 1: Given R[1,..,n], T[1,…,m] ∊ SS(β), we introduce six problems to describe
byte-wise approximate matching:
1. R and T are identical if R and T are the same in bytes, i.e., R=T. This is the problem of
identicalness. We denote it as EM1.
2. If SSIM(R,T) > 0, R and T are similar. This is the problem of similarity. We denote it as AM1.
3. R contains T if there is a β-nontrivial substring R[p, …,q] such that T=R[p, …,q]. This is the
problem of containment. We denote it as EM2.
4. R has a β-nontrivial substring r that is similar to T. This is the problem of approximate
containment. We denote it as AM2.
5. R and T are cross-sharing if there exist one or multiple pairs of β-nontrivial substrings <R[p, …,
q], T[u,…,v]> such that R[p,…, q]= T[u,…,v]. This is the problem of cross-sharing. We denote it
as EM3.
6. R and T have two sets of β-nontrivial substrings {r1, r2,…, rn} and {t1, t2,…, tn} respectively such
that rk and tk are similar for k ∊ {1,…,n}. This is the problem of approximate cross-sharing.
We denote it as AM3.
11
Copyright 2011 Trend Micro Inc.
Byte-wise Approximate Match: A Rigorous Definition
• Definition 2: Given R[1,..,n] and T[1,…,m] ∊ SS(β), if any case
in definition 1 is true, we say R and T are byte-wise relevant.
This is a novel relationship.
– We denote this as BR(R,T)= 1, otherwise BR(R,T)= 0.
• Definition 3: Let X , Y ∊ { EM1,EM2, EM3 ,AM1, AM2, AM3}. If
problem X is a special case of problem Y , we denote this as X ↪ Y.
12
EM1 EM2 EM3
AM1 AM2 AM3
↪ ↪
↪ ↪
↪
↪
↪
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• S = An object space S.
•R = A relationship for objects in S.
• Three problems of interest:
1. Matching: Given G1 , G2 ∊ S, one determines if R (G1,G2) =1.
2. Searching : B ⊆ S is a bag of objects . Given o ∊ S , find b ∊ B
such that R (o, b )=1.
3. Clustering: Given a bag B of objects, partition B into a set of
groups { G1, G2,…,Gm} based on R.
• Given the byte-wise relevance BR , we need solutions for
– Matching
– Searching
– Clustering
13
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• Matching problems & the solutions:
Classification 9/10/2015 14
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• Searching problem for the relationship BR :
– B is a bag of β-nontrivial strings. Given T ∊ SS(β), find s ∊ B such
that BR(T, s)=1.
Classification 9/10/2015 15
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• How to solve the searching problem?
– Brute force approach : for every s ∊ B, we evaluate BR(T, s). What about
when B has 1 million strings .
• It is a lazy idea!
– Candidate selection approach:
• STEP 1: select a few candidates {s1, s2,…,sm} quickly
• STEP 2: evaluate each BR(T, sk).
– How to select “good” candidates?
• String tokenizer: extract tokens from each string from B.
• Indexer : index the tokens along with the string ID to create a index DB as FP-DB.
• Searcher : given T, generate tokens {FP1, FP2,…,FPq} , we use them to search
possible candidates from FP-DB. Then we evaluate BR(T, s) for each candidate s.
– NOTE:
• This is similar to a keyword based search engine where the keywords are the
tokens.
• A special token is string fingerprint.
– Other tokens include k-grams, k-subsequences, blocks and chunks .
» Fingerprints are generated from special blocks.
16
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• Architecture of candidate selection based approach:
17
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• A clustering problem based on the relationship BR :
• Given a bag B of β-nontrivial strings, partition B into a set of groups { G1,
G2,…,Gm} based on BR.
Classification 9/10/2015 18
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• A solution to clustering problem:
• A graph based approach: if BR(s,t)=1, Node(s) and Node(t) are connected.
• A group is a connected sub-graph of the G(V,E) where V=Node(B).
Classification 9/10/2015 19
Copyright 2011 Trend Micro Inc.
A Framework of Theory, Algorithms and Technologies
• The framework can be summarized as follows:
Classification 9/10/2015 20
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Let us go quickly over the theory and algorithm for two matching
problems :
– Similarity AM1
– Cross-sharing EM3
• What is similarity? How to measure it?
– A traditional approach is to compare two strings directly such as the
LCS method ( largest common subsequence).
– The popular fuzzy hash {ssdeep, sdhash and TLSH} use different
algorithms for measuring the similarity.
Classification 9/10/2015 21
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Fuzzy hash can be summarized in the following 3 steps:
Classification 9/10/2015 22
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• ssdeep:
– STEP 1: split the string into a sequence of consecutive chunks.
– STEP 2: hash each chunk into 6 bits and place them into a 80-byte
container sequentially.
– STEP 3: Use Levenshtein distance to measure the similarity.
Classification 9/10/2015 23
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• sdhash:
– STEP 1: Select a few 64-grams of higher entropy values.
– STEP 2: Generate a hash for each and them all hashes into one or
more 256-byte bloom filters.
– STEP 3: Use Hamming distance to measure the similarity.
Classification 9/10/2015 24
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• TLSH:
– STEP 1: For every 5-gram, select 6 triplets out of total 10 (=C5
3).
– STEP 2: Generate a hash for each triplet and map them into a 32-byte
container.
– STEP 3: Use a heuristic diff algorithm to measure the similarity.
Classification 9/10/2015 25
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Summary of Three Fuzzy Hashing Algorithms:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy
values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest which
is able to preserve similarity to certain degree.
• ssdeep model: a sequence of chunks is mapped into a 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or multiple 256-byte
bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.
Classification 9/10/2015 26
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Three approaches for measuring similarity … {ssdeep, sdhash
& TLSH use digest comparison.
Classification 9/10/2015 27
• 1st model plays critical role for similarity comparison.
• Let focus on discussing various 1st models today.
• Based on a unified format.
• 2nd model saves space but further reduces accuracy.
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Unified format for 1st model:
– A string is described as a collection of tokens (aka, features)
organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:
Classification 9/10/2015 28
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• 4 categories of tokens :
– k-grams where k is as small as 3,4,…
– k-subsequences: any subsequence with length k. The triplet in TLSH
is an example.
– Chunks: whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
• 8 ways to describe a string for similarity:
Classification 9/10/2015 29
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Evaluate a fuzzy hash based on follows:
– Data Structure:
• Bag: a bag ignores the order of tokens. It is good at handling content swapping.
• Sequence: a sequence organizes tokens in an order. This is weak for handling
content swapping.
– Tokens:
• k-grams: Due to the small k ( 3,4,5,…), this fine granularity is good at handling
fragmentation.
• k-sequences: Due to the small k ( 3,4,5,…), this fine granularity is good at handling
fragmentation .
• Chunks: This approach takes account of every byte in raw granularity. It should be
OK at handling containment and cross sharing
• Blocks: Depending on different selection functions, even though it does not take
account of every byte, but it may present a string more efficiently and that is good
for generating similarity digests. Due to the nature of fixed length blocks, it is good
at handling containment and cross sharing.
• M2.4 leads to a novel fuzzy hashing scheme : TSFP
– It has some capabilities beyond existing schemes.
– I am not introduce it today due to limited time.
30
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Let us investigate EM3 :
– The cross-sharing problem.
• What is cross-sharing ? And how to measure it?
• Given a string, any two substrings follow one out of three cases:
31
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Cross-sharing … some examples :
32
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Definition 1: Given T∊ SS(β), let Ω(T)= { s | s is a β-nontrivial substring
of T}.
– Ω(T) is the set of all β-nontrivial substrings of T.
• Definition 2: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). If there exists
a bijective mapping F: SR  ST such that F(r)=t and r=t, we say that SR
and ST are canonical with F.
33
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Theorem 1: Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T), SR and ST are
canonical with F: SR  ST. ∀ r1 , r2 ∊ SR, one of following cases holds:
34
NOTE: we are only interested in case 1-3.
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Definition 3 : Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). SR and ST are
canonical with F: SR  ST.
– ∀ r1 and r2 ∊ SR which are not identical, if only case 1 holds, we say that SR
and ST are translative.
– ∀ r1 and r2 ∊ SR which are not identical, if one of case 1 -3 holds, we say that SR
and ST are weakly translative.
– Let L(SR, ST) = 𝐋𝐞𝐧(𝐫)<𝐫,𝐭> for measuring SR × ST .
35
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Definition 4: Given R, T∊ SS(β), we define two measurements for cross-
sharing between R and T at two levels:
– L1(R, T) = Max { L(SR, ST) | SR and ST are translative }
– L2 (R, T) = Max { L(SR, ST) | SR and ST are weakly translative }
• Definition 5 : Given T∊ SS(β), if its β-grams (i.e., β-length substrings)
are different to each other, T is β-nonrepetitive.
– This is to measure how random T is.
• Theorem 2 : Given R,T ∊ SS(β), we have:
– L1(R, T) ≤ L2(R, T)
– If both R and T are β-nonrepetitive, L1(R, T) = L2(R, T)
36
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Algorithm ( identify cross-shared substrings)
– INPUT: INPUT: R[1, …, m] and T[1, …, n].
– SUMMARY:
1. Use a rolling hash function H(x) to slide a β-width window along the string R for generating
m+1- β hash values. We store them into a hash table HT with separate chaining using
linked-list to resolve collisions. The nodes in the linked-lists of the hash table HT save the
offsets where hash values are created.
2. From the first offset of T with a the same rolling hash to slide the window of β-width along T,
do match with hash table H. If not matched, continue the next offset, otherwise try to
match the maximum, then continue from an offset around the end of the matched substring
of T.
– OUTPUT: SR ST .
37
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
Theorem 3: SR×ST is from Algorithm 1. If ST ={}, L1(R,T)=L2(R,T) = 0,
otherwise, we have
– SR and ST are weakly translative ;
– L1(R, T) ≤ L(SR, ST) ≤ L2 (R, T).
– If R is β–nonrepetitive, L(SR, ST) = L2(R, T).
– If T is β–nonrepetitive, we have (a) L1(R, T) = L(SR, ST) ; (b) SR and ST are translative; .
38
Copyright 2011 Trend Micro Inc.
A Few Algorithms and Analysis
• Let me summarize what have been done:
Classification 9/10/2015 39
Copyright 2011 Trend Micro Inc.
Practical Applications
• This framework can be applied to the following areas.
– E-Discovery
• Grouping near duplicate documents
• Comparing of near duplicate documents
– Digital forensic analysis
• Identifying similar objects or files
– Anti-plagiarism
• Copy detection
– Source code governance
– Malware analysis
– Spam filtering
– DLP
– etc
40
Copyright 2011 Trend Micro Inc.
Q&A
• Thank you for your attention.
• Do you have any questions?
• Email: liwei_ren@trendmicro.com
• Home page: http://pitt.academia.edu/LiweiRen for external
talks.
41
Ad

More Related Content

What's hot (20)

Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
Svitlana volkova
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
Marco Righini
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
Zhibo Xiao
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
Yueshen Xu
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
Karol Grzegorczyk
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
Bhaskar Mitra
 
Topic Models
Topic ModelsTopic Models
Topic Models
Claudia Wagner
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
Steffen Staab
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
Hiroyuki Kuromiya
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
Bhaskar Mitra
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
Soojung Hong
 
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Innovation Quotient Pvt Ltd
 
A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)
Alexander Decker
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
rusbase
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...
Innovation Quotient Pvt Ltd
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
Bhaskar Mitra
 
Author Topic Model
Author Topic ModelAuthor Topic Model
Author Topic Model
FReeze FRancis
 
Mit203 analysis and design of algorithms
Mit203  analysis and design of algorithmsMit203  analysis and design of algorithms
Mit203 analysis and design of algorithms
smumbahelp
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphJoint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
FedorNikolaev
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
Marco Righini
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
Zhibo Xiao
 
Topic model an introduction
Topic model an introductionTopic model an introduction
Topic model an introduction
Yueshen Xu
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
Bhaskar Mitra
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
Steffen Staab
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
Hiroyuki Kuromiya
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
Bhaskar Mitra
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
Soojung Hong
 
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Domain-Specific Term Extraction for Concept Identification in Ontology Constr...
Innovation Quotient Pvt Ltd
 
A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)A probabilistic data encryption scheme (pdes)
A probabilistic data encryption scheme (pdes)
Alexander Decker
 
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
Сергей Кольцов —НИУ ВШЭ —ICBDA 2015
rusbase
 
Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...Usage of word sense disambiguation in concept identification in ontology cons...
Usage of word sense disambiguation in concept identification in ontology cons...
Innovation Quotient Pvt Ltd
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
Bhaskar Mitra
 
Mit203 analysis and design of algorithms
Mit203  analysis and design of algorithmsMit203  analysis and design of algorithms
Mit203 analysis and design of algorithms
smumbahelp
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
Nik Spirin
 
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge GraphJoint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
Joint Word and Entity Embeddings for Entity Retrieval from Knowledge Graph
FedorNikolaev
 

Similar to Bytewise Approximate Match: Theory, Algorithms and Applications (20)

Bytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringBytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clustering
Liwei Ren任力偉
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
Liwei Ren任力偉
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical Problems
Liwei Ren任力偉
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting Tools
Liwei Ren任力偉
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
Liwei Ren任力偉
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
Rich Heimann
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
DataTactics
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
Poonam Kshirsagar
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential Compression
Liwei Ren任力偉
 
Exploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's trafficExploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's traffic
Sandipan Biswas
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
Experfy
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
SAIL_QU
 
Relational machine-learning
Relational machine-learningRelational machine-learning
Relational machine-learning
Bhushan Kotnis
 
Clique and sting
Clique and stingClique and sting
Clique and sting
Subramanyam Natarajan
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.ppt
Arumugam90
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Jonathon Hare
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...
Madan Golla
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
YONG ZHENG
 
Data_Prep_Techniques_Challenges_Methods.pdf
Data_Prep_Techniques_Challenges_Methods.pdfData_Prep_Techniques_Challenges_Methods.pdf
Data_Prep_Techniques_Challenges_Methods.pdf
Shailja Thakur
 
TINET_FRnOG_2008_public
TINET_FRnOG_2008_publicTINET_FRnOG_2008_public
TINET_FRnOG_2008_public
Davide Cherubini
 
Bytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clusteringBytewise approximate matching, searching and clustering
Bytewise approximate matching, searching and clustering
Liwei Ren任力偉
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
Liwei Ren任力偉
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical Problems
Liwei Ren任力偉
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting Tools
Liwei Ren任力偉
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
Liwei Ren任力偉
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
Rich Heimann
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
DataTactics
 
Current clustering techniques
Current clustering techniquesCurrent clustering techniques
Current clustering techniques
Poonam Kshirsagar
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential Compression
Liwei Ren任力偉
 
Exploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's trafficExploiting tls to disrupt privacy of web application's traffic
Exploiting tls to disrupt privacy of web application's traffic
Sandipan Biswas
 
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
Experfy
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
SAIL_QU
 
Relational machine-learning
Relational machine-learningRelational machine-learning
Relational machine-learning
Bhushan Kotnis
 
CS3114_09212011.ppt
CS3114_09212011.pptCS3114_09212011.ppt
CS3114_09212011.ppt
Arumugam90
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Jonathon Hare
 
Document clustering for forensic analysis an approach for improving compute...
Document clustering for forensic   analysis an approach for improving compute...Document clustering for forensic   analysis an approach for improving compute...
Document clustering for forensic analysis an approach for improving compute...
Madan Golla
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
YONG ZHENG
 
Data_Prep_Techniques_Challenges_Methods.pdf
Data_Prep_Techniques_Challenges_Methods.pdfData_Prep_Techniques_Challenges_Methods.pdf
Data_Prep_Techniques_Challenges_Methods.pdf
Shailja Thakur
 
Ad

More from Liwei Ren任力偉 (19)

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇
Liwei Ren任力偉
 
企业安全市场综述
企业安全市场综述 企业安全市场综述
企业安全市场综述
Liwei Ren任力偉
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
Liwei Ren任力偉
 
聊一聊大明朝的火器
聊一聊大明朝的火器聊一聊大明朝的火器
聊一聊大明朝的火器
Liwei Ren任力偉
 
防火牆們的故事
防火牆們的故事防火牆們的故事
防火牆們的故事
Liwei Ren任力偉
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维
Liwei Ren任力偉
 
硅谷的那点事儿
硅谷的那点事儿硅谷的那点事儿
硅谷的那点事儿
Liwei Ren任力偉
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究
Liwei Ren任力偉
 
世纪猜想
世纪猜想世纪猜想
世纪猜想
Liwei Ren任力偉
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based Security
Liwei Ren任力偉
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Liwei Ren任力偉
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Liwei Ren任力偉
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Liwei Ren任力偉
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillators
Liwei Ren任力偉
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problem
Liwei Ren任力偉
 
Math stories
Math storiesMath stories
Math stories
Liwei Ren任力偉
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and Solutions
Liwei Ren任力偉
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) Technology
Liwei Ren任力偉
 
Securing Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the CloudSecuring Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the Cloud
Liwei Ren任力偉
 
信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇
Liwei Ren任力偉
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
Liwei Ren任力偉
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维
Liwei Ren任力偉
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究
Liwei Ren任力偉
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based Security
Liwei Ren任力偉
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Liwei Ren任力偉
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Liwei Ren任力偉
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Liwei Ren任力偉
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillators
Liwei Ren任力偉
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problem
Liwei Ren任力偉
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and Solutions
Liwei Ren任力偉
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) Technology
Liwei Ren任力偉
 
Securing Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the CloudSecuring Your Data for Your Journey to the Cloud
Securing Your Data for Your Journey to the Cloud
Liwei Ren任力偉
 
Ad

Recently uploaded (20)

A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
Sérgio Sacani
 
Mycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes FungiMycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes Fungi
SAYANTANMALLICK5
 
Fatigue and its management in aviation medicine
Fatigue and its management in aviation medicineFatigue and its management in aviation medicine
Fatigue and its management in aviation medicine
ImranJewel2
 
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptxA CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
ANJALICHANDRASEKARAN
 
Freshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and FactorsFreshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and Factors
mytriplemonlineshop
 
Top 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptxTop 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptx
alexbagheriam
 
Coral_Reefs_and_Bleaching_Presentation (1) (1).pptx
Coral_Reefs_and_Bleaching_Presentation (1) (1).pptxCoral_Reefs_and_Bleaching_Presentation (1) (1).pptx
Coral_Reefs_and_Bleaching_Presentation (1) (1).pptx
Nishath24
 
Black hole and its division and categories
Black hole and its division and categoriesBlack hole and its division and categories
Black hole and its division and categories
MSafiullahALawi
 
Somato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptxSomato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptx
klynct
 
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.pptSULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
HRUTUJA WAGH
 
CORONARY ARTERY BYPASS GRAFTING (1).pptx
CORONARY ARTERY BYPASS GRAFTING (1).pptxCORONARY ARTERY BYPASS GRAFTING (1).pptx
CORONARY ARTERY BYPASS GRAFTING (1).pptx
DharaniJajula
 
Preparation of Experimental Animals.pptx
Preparation of Experimental Animals.pptxPreparation of Experimental Animals.pptx
Preparation of Experimental Animals.pptx
klynct
 
Cleaned_Expanded_Metal_Nanoparticles_Presentation.pptx
Cleaned_Expanded_Metal_Nanoparticles_Presentation.pptxCleaned_Expanded_Metal_Nanoparticles_Presentation.pptx
Cleaned_Expanded_Metal_Nanoparticles_Presentation.pptx
zainab98aug
 
anatomy of larynx , trachea and bronchi ppt.
anatomy of larynx , trachea and bronchi ppt.anatomy of larynx , trachea and bronchi ppt.
anatomy of larynx , trachea and bronchi ppt.
EmanHamada5
 
Controls over genes.ppt. Gene Expression
Controls over genes.ppt. Gene ExpressionControls over genes.ppt. Gene Expression
Controls over genes.ppt. Gene Expression
NABIHANAEEM2
 
Anti fungal agents Medicinal Chemistry III
Anti fungal agents Medicinal Chemistry  IIIAnti fungal agents Medicinal Chemistry  III
Anti fungal agents Medicinal Chemistry III
HRUTUJA WAGH
 
Reticular formation_groups_organization_
Reticular formation_groups_organization_Reticular formation_groups_organization_
Reticular formation_groups_organization_
klynct
 
What Are Dendritic Cells and Their Role in Immunobiology?
What Are Dendritic Cells and Their Role in Immunobiology?What Are Dendritic Cells and Their Role in Immunobiology?
What Are Dendritic Cells and Their Role in Immunobiology?
Kosheeka : Primary Cells for Research
 
Seismic evidence of liquid water at the base of Mars' upper crust
Seismic evidence of liquid water at the base of Mars' upper crustSeismic evidence of liquid water at the base of Mars' upper crust
Seismic evidence of liquid water at the base of Mars' upper crust
Sérgio Sacani
 
Transgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative BiolabsTransgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative Biolabs
Creative-Biolabs
 
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
Sérgio Sacani
 
Mycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes FungiMycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes Fungi
SAYANTANMALLICK5
 
Fatigue and its management in aviation medicine
Fatigue and its management in aviation medicineFatigue and its management in aviation medicine
Fatigue and its management in aviation medicine
ImranJewel2
 
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptxA CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
ANJALICHANDRASEKARAN
 
Freshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and FactorsFreshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and Factors
mytriplemonlineshop
 
Top 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptxTop 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptx
alexbagheriam
 
Coral_Reefs_and_Bleaching_Presentation (1) (1).pptx
Coral_Reefs_and_Bleaching_Presentation (1) (1).pptxCoral_Reefs_and_Bleaching_Presentation (1) (1).pptx
Coral_Reefs_and_Bleaching_Presentation (1) (1).pptx
Nishath24
 
Black hole and its division and categories
Black hole and its division and categoriesBlack hole and its division and categories
Black hole and its division and categories
MSafiullahALawi
 
Somato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptxSomato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptx
klynct
 
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.pptSULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
HRUTUJA WAGH
 
CORONARY ARTERY BYPASS GRAFTING (1).pptx
CORONARY ARTERY BYPASS GRAFTING (1).pptxCORONARY ARTERY BYPASS GRAFTING (1).pptx
CORONARY ARTERY BYPASS GRAFTING (1).pptx
DharaniJajula
 
Preparation of Experimental Animals.pptx
Preparation of Experimental Animals.pptxPreparation of Experimental Animals.pptx
Preparation of Experimental Animals.pptx
klynct
 
Cleaned_Expanded_Metal_Nanoparticles_Presentation.pptx
Cleaned_Expanded_Metal_Nanoparticles_Presentation.pptxCleaned_Expanded_Metal_Nanoparticles_Presentation.pptx
Cleaned_Expanded_Metal_Nanoparticles_Presentation.pptx
zainab98aug
 
anatomy of larynx , trachea and bronchi ppt.
anatomy of larynx , trachea and bronchi ppt.anatomy of larynx , trachea and bronchi ppt.
anatomy of larynx , trachea and bronchi ppt.
EmanHamada5
 
Controls over genes.ppt. Gene Expression
Controls over genes.ppt. Gene ExpressionControls over genes.ppt. Gene Expression
Controls over genes.ppt. Gene Expression
NABIHANAEEM2
 
Anti fungal agents Medicinal Chemistry III
Anti fungal agents Medicinal Chemistry  IIIAnti fungal agents Medicinal Chemistry  III
Anti fungal agents Medicinal Chemistry III
HRUTUJA WAGH
 
Reticular formation_groups_organization_
Reticular formation_groups_organization_Reticular formation_groups_organization_
Reticular formation_groups_organization_
klynct
 
Seismic evidence of liquid water at the base of Mars' upper crust
Seismic evidence of liquid water at the base of Mars' upper crustSeismic evidence of liquid water at the base of Mars' upper crust
Seismic evidence of liquid water at the base of Mars' upper crust
Sérgio Sacani
 
Transgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative BiolabsTransgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative Biolabs
Creative-Biolabs
 

Bytewise Approximate Match: Theory, Algorithms and Applications

  • 1. Copyright 2011 Trend Micro Inc. 1 Bytewise Approximate Match: Theory, Algorithms and Applications Liwei Ren, Ph.D, Trend Micro EISIC 2015, Manchester, UK, Sept, 2015 Trend Micro
  • 2. Copyright 2011 Trend Micro Inc. Agenda • Background • Byte-wise Approximate Matching : 6 Cases • A Framework : Theory, Algorithms, Technologies • A Few Algorithms and Analysis • Practical Applications • Q & A Classification 9/10/2015 2
  • 3. Copyright 2011 Trend Micro Inc. Background • A Problem in DLP (Data Loss Prevention): – In 2005, when designing the DLP system for my startup, I had to solve this problem: – S = {d1, d2,…, dn} is a bag of sensitive documents. Given any document T and 0<δ≤1, find a document d ∊ S such that RLV(d,T)≥ δ. • where RLV(s,t) is a function to measure the relevance of two documents. • Two challenges: how to construct RLV ? How to make search scalable? Classification 9/10/2015 3
  • 4. Copyright 2011 Trend Micro Inc. Background • In early 2014, I studied fuzzy hashing : – a family of similarity preserving hashing techniques & tools – Example: TLSH, ssdeep, sdhash – Problem: Given two binary strings s1 and s2, measure the similarity by SIM(H(s1), H(s2)) = s. • H is a hash function that preserves string similarity. • SIM is a function to measure similarity of two hash values • A challenge: how to evaluate pros & cons between them? Classification 9/10/2015 4
  • 5. Copyright 2011 Trend Micro Inc. Background • In early 2014, the NIST specification document NIST.SP.800-168 introduces a novel concept of approximate matching : – To replace the concept of binary similarity matching. – Four use cases are used to describe this concept: • Similarity Detection: to identify different versions of a document. • Cross Correlation: to identify a common object between two documents. • Embedded Object Detection: to identify a given object inside a document. • Fragment Detection: to identify the presence of traces/fragments of a known document in network stream. • In 2013, I noticed that eDiscovery has a near deduplication problem that needs to group similar documents together. 5
  • 6. Copyright 2011 Trend Micro Inc. Background • I started this effort in 2014: 6
  • 7. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases • We extend these NIST cases to 6 cases due to our practice in DLP & malware analysis. – that can be described rigorously. • Conceptual description : 7
  • 8. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases 8
  • 9. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases 9
  • 10. Copyright 2011 Trend Micro Inc. Bytewise Approximate Matching : 6 Cases • Intuitive description with binary strings: 10
  • 11. Copyright 2011 Trend Micro Inc. Byte-wise Approximate Match: A Rigorous Definition • Let us start with a few concepts: – A string s is β-nontrivial if Len(s) ≥β. • In practice, set β=64. • This is excluding the triviality of substrings such as substrings of a few bytes. – Let SS(β)={ s | string s is β-nontrivial } !!! – SSIM(s1, s2) measures similarity between two nontrivial strings. • Definition 1: Given R[1,..,n], T[1,…,m] ∊ SS(β), we introduce six problems to describe byte-wise approximate matching: 1. R and T are identical if R and T are the same in bytes, i.e., R=T. This is the problem of identicalness. We denote it as EM1. 2. If SSIM(R,T) > 0, R and T are similar. This is the problem of similarity. We denote it as AM1. 3. R contains T if there is a β-nontrivial substring R[p, …,q] such that T=R[p, …,q]. This is the problem of containment. We denote it as EM2. 4. R has a β-nontrivial substring r that is similar to T. This is the problem of approximate containment. We denote it as AM2. 5. R and T are cross-sharing if there exist one or multiple pairs of β-nontrivial substrings <R[p, …, q], T[u,…,v]> such that R[p,…, q]= T[u,…,v]. This is the problem of cross-sharing. We denote it as EM3. 6. R and T have two sets of β-nontrivial substrings {r1, r2,…, rn} and {t1, t2,…, tn} respectively such that rk and tk are similar for k ∊ {1,…,n}. This is the problem of approximate cross-sharing. We denote it as AM3. 11
  • 12. Copyright 2011 Trend Micro Inc. Byte-wise Approximate Match: A Rigorous Definition • Definition 2: Given R[1,..,n] and T[1,…,m] ∊ SS(β), if any case in definition 1 is true, we say R and T are byte-wise relevant. This is a novel relationship. – We denote this as BR(R,T)= 1, otherwise BR(R,T)= 0. • Definition 3: Let X , Y ∊ { EM1,EM2, EM3 ,AM1, AM2, AM3}. If problem X is a special case of problem Y , we denote this as X ↪ Y. 12 EM1 EM2 EM3 AM1 AM2 AM3 ↪ ↪ ↪ ↪ ↪ ↪ ↪
  • 13. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • S = An object space S. •R = A relationship for objects in S. • Three problems of interest: 1. Matching: Given G1 , G2 ∊ S, one determines if R (G1,G2) =1. 2. Searching : B ⊆ S is a bag of objects . Given o ∊ S , find b ∊ B such that R (o, b )=1. 3. Clustering: Given a bag B of objects, partition B into a set of groups { G1, G2,…,Gm} based on R. • Given the byte-wise relevance BR , we need solutions for – Matching – Searching – Clustering 13
  • 14. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • Matching problems & the solutions: Classification 9/10/2015 14
  • 15. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • Searching problem for the relationship BR : – B is a bag of β-nontrivial strings. Given T ∊ SS(β), find s ∊ B such that BR(T, s)=1. Classification 9/10/2015 15
  • 16. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • How to solve the searching problem? – Brute force approach : for every s ∊ B, we evaluate BR(T, s). What about when B has 1 million strings . • It is a lazy idea! – Candidate selection approach: • STEP 1: select a few candidates {s1, s2,…,sm} quickly • STEP 2: evaluate each BR(T, sk). – How to select “good” candidates? • String tokenizer: extract tokens from each string from B. • Indexer : index the tokens along with the string ID to create a index DB as FP-DB. • Searcher : given T, generate tokens {FP1, FP2,…,FPq} , we use them to search possible candidates from FP-DB. Then we evaluate BR(T, s) for each candidate s. – NOTE: • This is similar to a keyword based search engine where the keywords are the tokens. • A special token is string fingerprint. – Other tokens include k-grams, k-subsequences, blocks and chunks . » Fingerprints are generated from special blocks. 16
  • 17. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • Architecture of candidate selection based approach: 17
  • 18. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • A clustering problem based on the relationship BR : • Given a bag B of β-nontrivial strings, partition B into a set of groups { G1, G2,…,Gm} based on BR. Classification 9/10/2015 18
  • 19. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • A solution to clustering problem: • A graph based approach: if BR(s,t)=1, Node(s) and Node(t) are connected. • A group is a connected sub-graph of the G(V,E) where V=Node(B). Classification 9/10/2015 19
  • 20. Copyright 2011 Trend Micro Inc. A Framework of Theory, Algorithms and Technologies • The framework can be summarized as follows: Classification 9/10/2015 20
  • 21. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Let us go quickly over the theory and algorithm for two matching problems : – Similarity AM1 – Cross-sharing EM3 • What is similarity? How to measure it? – A traditional approach is to compare two strings directly such as the LCS method ( largest common subsequence). – The popular fuzzy hash {ssdeep, sdhash and TLSH} use different algorithms for measuring the similarity. Classification 9/10/2015 21
  • 22. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Fuzzy hash can be summarized in the following 3 steps: Classification 9/10/2015 22
  • 23. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • ssdeep: – STEP 1: split the string into a sequence of consecutive chunks. – STEP 2: hash each chunk into 6 bits and place them into a 80-byte container sequentially. – STEP 3: Use Levenshtein distance to measure the similarity. Classification 9/10/2015 23
  • 24. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • sdhash: – STEP 1: Select a few 64-grams of higher entropy values. – STEP 2: Generate a hash for each and them all hashes into one or more 256-byte bloom filters. – STEP 3: Use Hamming distance to measure the similarity. Classification 9/10/2015 24
  • 25. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • TLSH: – STEP 1: For every 5-gram, select 6 triplets out of total 10 (=C5 3). – STEP 2: Generate a hash for each triplet and map them into a 32-byte container. – STEP 3: Use a heuristic diff algorithm to measure the similarity. Classification 9/10/2015 25
  • 26. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Summary of Three Fuzzy Hashing Algorithms: – Using a first model to describe a binary string with selected features: • ssdeep model: a string is a sequence of chunks (split from the string). • sdhash model: a string is a bag of 64-byte blocks (selected with entropy values). • TLSH model: a string is a bag of triplets (selected from all 5-grams). – Using a second model to map the selected features into a digest which is able to preserve similarity to certain degree. • ssdeep model: a sequence of chunks is mapped into a 80-byte digest. • sdhash model: a bag of blocks is mapped into one or multiple 256-byte bloom filter bitmaps. • TLSH model: a bag of triplets is mapped into a 32-byte container. Classification 9/10/2015 26
  • 27. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Three approaches for measuring similarity … {ssdeep, sdhash & TLSH use digest comparison. Classification 9/10/2015 27 • 1st model plays critical role for similarity comparison. • Let focus on discussing various 1st models today. • Based on a unified format. • 2nd model saves space but further reduces accuracy.
  • 28. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Unified format for 1st model: – A string is described as a collection of tokens (aka, features) organized by a data structure: • ssdeep: a sequence of chunks. • sdhash: a bag of 64-byte blocks with high entropy values. • TLSH: a bag of selected triplets. – Two types of data structures: sequence, bag. – Three types of tokens: chunks, blocks, triplets. • Analogical comparison: Classification 9/10/2015 28
  • 29. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • 4 categories of tokens : – k-grams where k is as small as 3,4,… – k-subsequences: any subsequence with length k. The triplet in TLSH is an example. – Chunks: whole string is split into non-overlapping chunks. – Blocks: selected substrings of fixed length. • 8 ways to describe a string for similarity: Classification 9/10/2015 29
  • 30. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Evaluate a fuzzy hash based on follows: – Data Structure: • Bag: a bag ignores the order of tokens. It is good at handling content swapping. • Sequence: a sequence organizes tokens in an order. This is weak for handling content swapping. – Tokens: • k-grams: Due to the small k ( 3,4,5,…), this fine granularity is good at handling fragmentation. • k-sequences: Due to the small k ( 3,4,5,…), this fine granularity is good at handling fragmentation . • Chunks: This approach takes account of every byte in raw granularity. It should be OK at handling containment and cross sharing • Blocks: Depending on different selection functions, even though it does not take account of every byte, but it may present a string more efficiently and that is good for generating similarity digests. Due to the nature of fixed length blocks, it is good at handling containment and cross sharing. • M2.4 leads to a novel fuzzy hashing scheme : TSFP – It has some capabilities beyond existing schemes. – I am not introduce it today due to limited time. 30
  • 31. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Let us investigate EM3 : – The cross-sharing problem. • What is cross-sharing ? And how to measure it? • Given a string, any two substrings follow one out of three cases: 31
  • 32. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Cross-sharing … some examples : 32
  • 33. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Definition 1: Given T∊ SS(β), let Ω(T)= { s | s is a β-nontrivial substring of T}. – Ω(T) is the set of all β-nontrivial substrings of T. • Definition 2: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). If there exists a bijective mapping F: SR  ST such that F(r)=t and r=t, we say that SR and ST are canonical with F. 33
  • 34. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Theorem 1: Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T), SR and ST are canonical with F: SR  ST. ∀ r1 , r2 ∊ SR, one of following cases holds: 34 NOTE: we are only interested in case 1-3.
  • 35. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Definition 3 : Given R, T∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). SR and ST are canonical with F: SR  ST. – ∀ r1 and r2 ∊ SR which are not identical, if only case 1 holds, we say that SR and ST are translative. – ∀ r1 and r2 ∊ SR which are not identical, if one of case 1 -3 holds, we say that SR and ST are weakly translative. – Let L(SR, ST) = 𝐋𝐞𝐧(𝐫)<𝐫,𝐭> for measuring SR × ST . 35
  • 36. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Definition 4: Given R, T∊ SS(β), we define two measurements for cross- sharing between R and T at two levels: – L1(R, T) = Max { L(SR, ST) | SR and ST are translative } – L2 (R, T) = Max { L(SR, ST) | SR and ST are weakly translative } • Definition 5 : Given T∊ SS(β), if its β-grams (i.e., β-length substrings) are different to each other, T is β-nonrepetitive. – This is to measure how random T is. • Theorem 2 : Given R,T ∊ SS(β), we have: – L1(R, T) ≤ L2(R, T) – If both R and T are β-nonrepetitive, L1(R, T) = L2(R, T) 36
  • 37. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Algorithm ( identify cross-shared substrings) – INPUT: INPUT: R[1, …, m] and T[1, …, n]. – SUMMARY: 1. Use a rolling hash function H(x) to slide a β-width window along the string R for generating m+1- β hash values. We store them into a hash table HT with separate chaining using linked-list to resolve collisions. The nodes in the linked-lists of the hash table HT save the offsets where hash values are created. 2. From the first offset of T with a the same rolling hash to slide the window of β-width along T, do match with hash table H. If not matched, continue the next offset, otherwise try to match the maximum, then continue from an offset around the end of the matched substring of T. – OUTPUT: SR ST . 37
  • 38. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis Theorem 3: SR×ST is from Algorithm 1. If ST ={}, L1(R,T)=L2(R,T) = 0, otherwise, we have – SR and ST are weakly translative ; – L1(R, T) ≤ L(SR, ST) ≤ L2 (R, T). – If R is β–nonrepetitive, L(SR, ST) = L2(R, T). – If T is β–nonrepetitive, we have (a) L1(R, T) = L(SR, ST) ; (b) SR and ST are translative; . 38
  • 39. Copyright 2011 Trend Micro Inc. A Few Algorithms and Analysis • Let me summarize what have been done: Classification 9/10/2015 39
  • 40. Copyright 2011 Trend Micro Inc. Practical Applications • This framework can be applied to the following areas. – E-Discovery • Grouping near duplicate documents • Comparing of near duplicate documents – Digital forensic analysis • Identifying similar objects or files – Anti-plagiarism • Copy detection – Source code governance – Malware analysis – Spam filtering – DLP – etc 40
  • 41. Copyright 2011 Trend Micro Inc. Q&A • Thank you for your attention. • Do you have any questions? • Email: liwei_ren@trendmicro.com • Home page: http://pitt.academia.edu/LiweiRen for external talks. 41
  翻译: