Bytewise Approximate Match: Theory, Algorithms and Applications

Copyright 2011 Trend Micro Inc.
Liwei Ren, Ph.D, Trend Micro
EISIC 2015, Manchester, UK, Sept, 2015
Trend Micro

Agenda
• Background
• Byte-wise Approximate Matching : 6 Cases
• A Framework : Theory, Algorithms, Technologies
• A Few Algorithms and Analysis
• Practical Applications
• Q & A

Background
• A Problem in DLP (Data Loss Prevention):
– In 2005, when designing the DLP system for my startup, I had to
solve this problem:
– S = {d1, d2,…, dn} is a bag of sensitive documents. Given any
document T and 0<δ≤1, find a document d ∊ S such that RLV(d,T)≥
δ.
• where RLV(s,t) is a function to measure the relevance of two documents.
• Two challenges: how to construct RLV ? How to make search scalable?

Background
• In early 2014, I studied fuzzy hashing :
– a family of similarity preserving hashing techniques & tools
– Examples: TLSH, ssdeep, sdhash
– Problem: Given two binary strings s1 and s2, measure the similarity
by SIM(H(s1), H(s2)) = s.
• H is a hash function that preserves string similarity.
• SIM is a function to measure similarity of two hash values
• A challenge: how to evaluate the pros & cons of each?

Background
• In early 2014, the NIST specification document NIST.SP.800-168
introduces a novel concept of approximate matching :
– To replace the concept of binary similarity matching.
– Four use cases are used to describe this concept:
• Similarity Detection: to identify different versions of a document.
• Cross Correlation: to identify a common object between two documents.
• Embedded Object Detection: to identify a given object inside a document.
• Fragment Detection: to identify the presence of traces/fragments of a
known document in network stream.
• In 2013, I noticed that eDiscovery has a near-duplicate detection
problem: similar documents need to be grouped together.

Background
• I started this effort in 2014:

Bytewise Approximate Matching : 6 Cases
• We extend the four NIST use cases to 6 cases, based on our practice in
DLP & malware analysis.
– All six can be described rigorously.
• Conceptual description :

• Intuitive description with binary strings:

Byte-wise Approximate Match: A Rigorous Definition
• Let us start with a few concepts:
– A string s is β-nontrivial if Len(s) ≥β.
• In practice, set β=64.
• This excludes trivial substrings, such as substrings of only a few bytes.
– Let SS(β) = { s | string s is β-nontrivial }.
– SSIM(s1, s2) measures similarity between two nontrivial strings.
• Definition 1: Given R[1,..,n], T[1,…,m] ∊ SS(β), we introduce six problems to describe
byte-wise approximate matching:
1. R and T are identical if R and T are the same in bytes, i.e., R=T. This is the problem of
identicalness. We denote it as EM1.
2. If SSIM(R,T) > 0, R and T are similar. This is the problem of similarity. We denote it as AM1.
3. R contains T if there is a β-nontrivial substring R[p, …,q] such that T=R[p, …,q]. This is the
problem of containment. We denote it as EM2.
4. R has a β-nontrivial substring r that is similar to T. This is the problem of approximate
containment. We denote it as AM2.
5. R and T are cross-sharing if there exist one or multiple pairs of β-nontrivial substrings <R[p, …,
q], T[u,…,v]> such that R[p,…, q]= T[u,…,v]. This is the problem of cross-sharing. We denote it
as EM3.
6. R and T have two sets of β-nontrivial substrings {r1, r2,…, rn} and {t1, t2,…, tn} respectively such
that rk and tk are similar for k ∊ {1,…,n}. This is the problem of approximate cross-sharing.
We denote it as AM3.
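
The three exact-match cases of Definition 1 (EM1, EM2, EM3) can be sketched directly from their wording. This is only a brute-force illustration, not the method used in the talk: the function names and the `beta` parameter are chosen here for the example, and the approximate cases (AM1 to AM3) are left out because they depend on a concrete choice of SSIM.

```python
BETA = 64  # the slides suggest beta = 64 in practice


def is_nontrivial(s: bytes, beta: int = BETA) -> bool:
    """A string is beta-nontrivial if Len(s) >= beta."""
    return len(s) >= beta


def em1(r: bytes, t: bytes) -> bool:
    """EM1 (identicalness): R and T are the same in bytes."""
    return r == t


def em2(r: bytes, t: bytes, beta: int = BETA) -> bool:
    """EM2 (containment): T equals some beta-nontrivial substring of R."""
    return is_nontrivial(t, beta) and t in r


def em3(r: bytes, t: bytes, beta: int = BETA) -> bool:
    """EM3 (cross-sharing): R and T share a substring of length >= beta.

    Sharing any substring of length >= beta is equivalent to sharing
    at least one beta-gram, so comparing beta-grams is sufficient.
    """
    grams = {r[i:i + beta] for i in range(len(r) - beta + 1)}
    return any(t[j:j + beta] in grams for j in range(len(t) - beta + 1))
```

With a small `beta` such as 4, `em2(b"xxABCDEFyy", b"ABCDEF", 4)` holds while `em1` does not, and any pair that satisfies `em2` also satisfies `em3`.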

Byte-wise Approximate Match: A Rigorous Definition
• Definition 2: Given R[1,..,n] and T[1,…,m] ∊ SS(β), if any case
in definition 1 is true, we say R and T are byte-wise relevant.
This is a novel relationship.
– We denote this as BR(R,T)= 1, otherwise BR(R,T)= 0.
• Definition 3: Let X , Y ∊ { EM1,EM2, EM3 ,AM1, AM2, AM3}. If
problem X is a special case of problem Y , we denote this as X ↪ Y.
EM1 ↪ EM2 ↪ EM3
AM1 ↪ AM2 ↪ AM3
EM1 ↪ AM1, EM2 ↪ AM2, EM3 ↪ AM3

A Framework of Theory, Algorithms and Technologies
• S = an object space.
• R = a relationship for objects in S.
• Three problems of interest:
1. Matching: Given G1 , G2 ∊ S, one determines if R (G1,G2) =1.
2. Searching : B ⊆ S is a bag of objects . Given o ∊ S , find b ∊ B
such that R (o, b )=1.
3. Clustering: Given a bag B of objects, partition B into a set of
groups { G1, G2,…,Gm} based on R.
• Given the byte-wise relevance BR , we need solutions for
– Matching
– Searching
– Clustering

• Matching problems & the solutions:

• Searching problem for the relationship BR :
– B is a bag of β-nontrivial strings. Given T ∊ SS(β), find s ∊ B such
that BR(T, s)=1.

• How to solve the searching problem?
– Brute-force approach: for every s ∊ B, evaluate BR(T, s). What if
B has 1 million strings?
• It is a lazy idea!
– Candidate selection approach:
• STEP 1: select a few candidates {s1, s2,…,sm} quickly
• STEP 2: evaluate each BR(T, sk).
– How to select “good” candidates?
• String tokenizer: extract tokens from each string in B.
• Indexer: index the tokens along with the string ID to create an index DB, the FP-DB.
• Searcher: given T, generate tokens {FP1, FP2,…,FPq} and use them to look up
possible candidates in the FP-DB. Then evaluate BR(T, s) for each candidate s.
– NOTE:
• This is similar to a keyword based search engine where the keywords are the
tokens.
• A special token is string fingerprint.
– Other tokens include k-grams, k-subsequences, blocks and chunks .
» Fingerprints are generated from special blocks.
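
The tokenizer / indexer / searcher pipeline can be sketched as a small inverted index. This is a minimal illustration under assumptions: plain k-grams stand in for the tokens (the slides prefer fingerprints, which are sparser), the FP-DB is an in-memory dictionary, and the names `tokens`, `build_index`, `candidates` and `min_hits` are hypothetical.

```python
from collections import defaultdict


def tokens(s: bytes, k: int = 8) -> set:
    """Tokenizer: here simply the set of all k-grams of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_index(bag: dict, k: int = 8) -> dict:
    """Indexer: map every token to the IDs of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in bag.items():
        for tok in tokens(s, k):
            index[tok].add(sid)
    return index


def candidates(index: dict, t: bytes, k: int = 8, min_hits: int = 1) -> list:
    """Searcher, STEP 1: rank string IDs by the number of tokens shared with t."""
    hits = defaultdict(int)
    for tok in tokens(t, k):
        for sid in index.get(tok, ()):
            hits[sid] += 1
    ranked = sorted(hits, key=lambda sid: (-hits[sid], sid))
    return [sid for sid in ranked if hits[sid] >= min_hits]
```

Only the returned candidates then need the expensive BR(T, s) evaluation of STEP 2, which is what keeps the search from degrading into the brute-force scan.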

• Architecture of candidate selection based approach:

• A clustering problem based on the relationship BR :
• Given a bag B of β-nontrivial strings, partition B into a set of groups { G1,
G2,…,Gm} based on BR.

• A solution to clustering problem:
• A graph-based approach: if BR(s,t)=1, Node(s) and Node(t) are connected by an edge.
• A group is a connected component of the graph G(V,E), where V=Node(B).
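
The graph-based grouping can be sketched with a union-find structure. This is an illustration, not the talk's implementation: the `relevant` callback stands in for BR, and the pairwise loop is quadratic, so a real system would presumably restrict it to candidates from the search step.

```python
from itertools import combinations


def cluster(bag: list, relevant) -> list:
    """Partition bag into groups: the connected components of the graph
    whose edges are the pairs (s, t) with relevant(s, t) true."""
    parent = list(range(len(bag)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(bag)), 2):
        if relevant(bag[i], bag[j]):
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(len(bag)):
        groups.setdefault(find(i), []).append(bag[i])
    return list(groups.values())
```

Note that grouping is transitive even when the relationship is not: two strings land in the same group if a chain of relevant pairs links them.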

• The framework can be summarized as follows:

A Few Algorithms and Analysis
• Let us go quickly over the theory and algorithm for two matching
problems :
– Similarity AM1
– Cross-sharing EM3
• What is similarity? How to measure it?
– A traditional approach is to compare two strings directly, such as the
LCS method (longest common subsequence).
– The popular fuzzy hashes {ssdeep, sdhash and TLSH} use different
algorithms for measuring the similarity.
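
As a point of reference for the direct-comparison approach, the length of the longest common subsequence can be computed with the standard dynamic program. The normalization in `lcs_similarity` is an assumption made for this sketch; the slides do not specify how an LCS length is turned into a score.

```python
def lcs_length(a: bytes, b: bytes) -> int:
    """Length of the longest common subsequence of a and b.

    Runs in O(len(a) * len(b)) time, keeping one row of the table at a time.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a: bytes, b: bytes) -> float:
    """Assumed normalization: 2 * LCS / (Len(a) + Len(b)), a value in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

The quadratic cost on raw strings is one reason digest-based schemes are attractive for large inputs.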

• Fuzzy hash can be summarized in the following 3 steps:

• ssdeep:
– STEP 1: split the string into a sequence of consecutive chunks.
– STEP 2: hash each chunk into 6 bits and place them into an 80-byte
container sequentially.
– STEP 3: Use Levenshtein distance to measure the similarity.

• sdhash:
– STEP 1: Select a few 64-grams of higher entropy values.
– STEP 2: Generate a hash for each of them and insert all hashes into one or
more 256-byte Bloom filters.
– STEP 3: Use Hamming distance to measure the similarity.

• TLSH:
– STEP 1: For every 5-gram, select 6 triplets out of a total of 10 (= C(5,3)).
– STEP 2: Generate a hash for each triplet and map them into a 32-byte
container.
– STEP 3: Use a heuristic diff algorithm to measure the similarity.
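
The count of 10 in STEP 1 is just the number of 3-element subsequences of a 5-byte window. The sketch below only enumerates them; which 6 of the 10 TLSH actually selects is not stated on the slide and is not modeled here. The function name `all_triplets` is hypothetical.

```python
from itertools import combinations
from math import comb


def all_triplets(window: bytes) -> list:
    """All order-preserving 3-byte subsequences of a window."""
    return [bytes(c) for c in combinations(window, 3)]


triplets = all_triplets(b"ABCDE")
# A 5-byte window yields C(5,3) triplets in total.
assert len(triplets) == comb(5, 3)
```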

• Summary of Three Fuzzy Hashing Algorithms:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy
values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest which
is able to preserve similarity to a certain degree.
• ssdeep model: a sequence of chunks is mapped into an 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or multiple 256-byte
bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.

• Three approaches for measuring similarity … {ssdeep, sdhash
& TLSH} use digest comparison.
• The 1st model plays a critical role in similarity comparison.
• Let us focus on discussing various 1st models today.
• Based on a unified format.
• The 2nd model saves space but further reduces accuracy.

• Unified format for 1st model:
– A string is described as a collection of tokens (aka, features)
organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:

• 4 categories of tokens :
– k-grams where k is as small as 3,4,…
– k-subsequences: any subsequence with length k. The triplet in TLSH
is an example.
– Chunks: whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
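
The four token categories can be sketched as four small extractors. These are simplified illustrations: `chunks` uses a fixed chunk size, whereas a scheme such as ssdeep decides chunk boundaries from the content, and the `select` predicate for blocks is a hypothetical placeholder for whatever selection function a scheme uses (for example an entropy threshold).

```python
from itertools import combinations


def k_grams(s: bytes, k: int) -> list:
    """All contiguous substrings of length k (overlapping)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def k_subsequences(s: bytes, k: int) -> list:
    """All order-preserving subsequences of length k.

    The count grows combinatorially, so this is meant for short windows.
    """
    return [bytes(c) for c in combinations(s, k)]


def chunks(s: bytes, size: int) -> list:
    """Split the whole string into non-overlapping chunks (fixed size here)."""
    return [s[i:i + size] for i in range(0, len(s), size)]


def blocks(s: bytes, length: int, select) -> list:
    """Fixed-length substrings kept only when select(block) is true."""
    return [g for g in k_grams(s, length) if select(g)]
```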
• 8 ways to describe a string for similarity:

• Evaluate a fuzzy hash based on the following:
– Data Structure:
• Bag: a bag ignores the order of tokens. It is good at handling content swapping.
• Sequence: a sequence organizes tokens in an order. This is weak for handling
content swapping.
– Tokens:
• k-grams: Due to the small k (3,4,5,…), this fine granularity is good at handling
fragmentation.
• k-subsequences: Due to the small k (3,4,5,…), this fine granularity is good at handling
fragmentation.
• Chunks: This approach takes account of every byte at a coarse granularity. It should be
OK at handling containment and cross-sharing.
• Blocks: Depending on the selection function, blocks do not take account of every
byte, but they may represent a string more efficiently, which is good
for generating similarity digests. Because the blocks have a fixed length, they are good
at handling containment and cross-sharing.
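
The bag-versus-sequence trade-off under content swapping can be made concrete with a toy experiment. Everything here is an assumption chosen for illustration: a multiset Jaccard score over 4-grams plays the role of a bag model, and `difflib.SequenceMatcher` over fixed 8-byte chunks plays the role of a sequence model. Neither is the scoring used by ssdeep, sdhash or TLSH.

```python
from collections import Counter
from difflib import SequenceMatcher


def k_grams(s: bytes, k: int) -> list:
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def chunks(s: bytes, size: int) -> list:
    return [s[i:i + size] for i in range(0, len(s), size)]


def bag_similarity(a: bytes, b: bytes, k: int = 4) -> float:
    """Multiset Jaccard over k-grams: token order is ignored."""
    ca, cb = Counter(k_grams(a, k)), Counter(k_grams(b, k))
    union = sum((ca | cb).values())
    return sum((ca & cb).values()) / union if union else 1.0


def sequence_similarity(a: bytes, b: bytes, size: int = 8) -> float:
    """Order-sensitive ratio over the two chunk sequences."""
    matcher = SequenceMatcher(None, chunks(a, size), chunks(b, size), autojunk=False)
    return matcher.ratio()


# Swap the two halves of a string and compare the two views.
x, y = bytes(range(32)), bytes(range(100, 132))
print(bag_similarity(x + y, y + x), sequence_similarity(x + y, y + x))
```

In this toy case the bag score stays close to 1 after the swap, while the sequence score drops sharply because only one of the two moved halves can be aligned.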
• M2.4 leads to a novel fuzzy hashing scheme: TSFP
– It has some capabilities beyond existing schemes.
– I will not introduce it today due to limited time.

• Let us investigate EM3 :
– The cross-sharing problem.
• What is cross-sharing ? And how to measure it?
• Given a string, any two substrings follow one out of three cases:

• Cross-sharing … some examples :

• Definition 1: Given T∊ SS(β), let Ω(T)= { s | s is a β-nontrivial substring
of T}.
– Ω(T) is the set of all β-nontrivial substrings of T.
• Definition 2: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T). If there exists
a bijective mapping F: SR → ST such that F(r)=t and r=t, we say that SR
and ST are canonical with F.

Theorem 1: Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T), where SR and ST are
canonical with F: SR → ST. ∀ r1, r2 ∊ SR, one of the following cases holds:
NOTE: we are only interested in cases 1-3.

Definition 3 : Given R, T ∊ SS(β), SR ⊆ Ω(R) and ST ⊆ Ω(T), where SR and ST are
canonical with F: SR → ST.
– If case 1 holds for all r1, r2 ∊ SR that are not identical, we say that SR
and ST are translative.
– If one of cases 1-3 holds for all r1, r2 ∊ SR that are not identical, we say that SR
and ST are weakly translative.
– Let L(SR, ST) = Σ Len(r), summed over all pairs <r,t> with t = F(r), for measuring SR × ST.

• Definition 4: Given R, T∊ SS(β), we define two measurements for cross-
sharing between R and T at two levels:
– L1(R, T) = Max { L(SR, ST) | SR and ST are translative }
– L2 (R, T) = Max { L(SR, ST) | SR and ST are weakly translative }
• Definition 5 : Given T ∊ SS(β), if its β-grams (i.e., β-length substrings)
are pairwise distinct, T is β-nonrepetitive.
– This is to measure how random T is.
• Theorem 2 : Given R,T ∊ SS(β), we have:
– L1(R, T) ≤ L2(R, T)
– If both R and T are β-nonrepetitive, L1(R, T) = L2(R, T)
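
Definition 5 translates into a one-pass check over the β-grams of T. The function name `is_nonrepetitive` is hypothetical, and the sketch stores every β-gram, which is fine for illustration but memory-hungry for long inputs.

```python
def is_nonrepetitive(t: bytes, beta: int) -> bool:
    """True if all beta-grams of t are pairwise distinct."""
    seen = set()
    for i in range(len(t) - beta + 1):
        gram = t[i:i + beta]
        if gram in seen:
            return False  # the same beta-gram occurs at two offsets
        seen.add(gram)
    return True
```

For example, with β = 3 the string `b"abcabc"` is repetitive because `abc` occurs twice, while `b"abcdefgh"` is β-nonrepetitive.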

Algorithm 1 (identify cross-shared substrings)
– INPUT: R[1, …, m] and T[1, …, n].
– SUMMARY:
1. Use a rolling hash function H(x) to slide a β-width window along the string R, generating
m+1-β hash values. Store them in a hash table HT that uses separate chaining with
linked lists to resolve collisions. The nodes in the linked lists of HT save the
offsets where the hash values were created.
2. Starting from the first offset of T, slide a β-width window along T with the same rolling hash
and look up each hash value in HT. If there is no match, continue with the next offset; otherwise try to
extend the match to its maximum length, then continue from an offset around the end of the matched substring
of T.
– OUTPUT: SR × ST.

Theorem 3: Let SR × ST be the output of Algorithm 1. If ST = {}, L1(R,T) = L2(R,T) = 0;
otherwise, we have
– SR and ST are weakly translative;
– L1(R, T) ≤ L(SR, ST) ≤ L2(R, T).
– If R is β-nonrepetitive, L(SR, ST) = L2(R, T).
– If T is β-nonrepetitive, we have (a) L1(R, T) = L(SR, ST); (b) SR and ST are translative.

• Let me summarize what has been done:

Practical Applications
• This framework can be applied to the following areas.
– E-Discovery
• Grouping near-duplicate documents
• Comparing near-duplicate documents
– Digital forensic analysis
• Identifying similar objects or files
– Anti-plagiarism
• Copy detection
– Source code governance
– Malware analysis
– Spam filtering
– DLP
– etc

Q&A
• Thank you for your attention.
• Do you have any questions?
• Email: liwei_ren@trendmicro.com
• Home page: http://pitt.academia.edu/LiweiRen for external
talks.
