How to understand Vectors of Subword Tokens in NLP
#Vectorization is a critical process in NLP: it transforms text into a numerical format that machines can interpret. In this post, let us understand the meaning and the mathematical essence of these vectors.
Sentence: जब युद्ध होता है, humanity loses its way. Let's strive for शांति और समर्थन. 🤲🕊️ #HumanityPrevail
(The Hindi phrases translate as "when war happens" and "peace and support".)
# Tokens: ['जब', ' ', 'युद्ध', ' ', 'होता', ' ', 'है', ',', ' ', 'humanity', ' ', 'loses', ' ', 'its', ' ', 'way', '.', ' ', 'Let', 's', ' ', 's', 'tri', 've', ' ', 'for', ' ', 'शांति', ' ', 'और', ' ', 'समर्थन', '.', ' ', ' ', '#', 'HumanityPreva', 'il']
SubWord_Tokens: [['जब'], [' '], ['युद्ध'], [' '], ['होता'], [' '], ['है'], [','], [' '], ['humanity'], [' '], ['loses'], [' '], ['its'], [' '], ['way'], ['.'], [' '], ['Let'], ['s'], [' '], ['s'], ['tri'], ['ve'], [' '], ['for'], [' '], ['शांति'], [' '], ['और'], [' '], ['समर्थन'], ['.'], [' '], [' '], ['#'], ['HumanityP', 're', 'va'], ['il']]
# Vectorized Tokens Output :
[[-0.0848 -0.0405  0.0749 ...  0.0125 -0.0187 -0.2443]
 [ 0.      0.      0.     ...  0.      0.      0.    ]
 [-0.0059  0.0279 -0.1062 ...  0.0127 -0.0176  0.0212]
 ...
 [-0.0955  0.0334 -0.1510 ... -0.2156 -0.5055  0.1788]
 [-0.0589 -0.0189  0.1828 ...  0.1231 -0.1339  0.1068]
 [-0.1945  0.0178  0.0848 ...  0.0090 -0.1379  0.0734]]
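To see how such per-token vectors can be produced, here is a minimal sketch. The library (Hugging Face transformers) and the model name are my assumptions; the article does not state which tokenizer or embedding model generated the output above, so the exact token splits and values will differ.

from transformers import AutoTokenizer, AutoModel
import torch

sentence = ("जब युद्ध होता है, humanity loses its way. "
            "Let's strive for शांति और समर्थन. #HumanityPrevail")

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

with torch.no_grad():
    outputs = model(**inputs)

# One row per subword token; each row is that token's contextual vector.
vectors = outputs.last_hidden_state[0].numpy()
print(tokens)          # subword tokens
print(vectors.shape)   # (num_subword_tokens, hidden_size)
print(vectors)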
This article walks through an illustrative example to explain how vectors encapsulate linguistic features and semantic relationships in NLP.
By dissecting the process of vector analysis, we aim to elucidate the mathematical essence and interpretative power of vectors in processing and understanding language.
# Vector Analysis Output :
(5,
 0.3086408762101168,
 [(2, 0.6058216546146239), (3, 0.2858868943201141)],
 'spread across many',
 1)
The vector analysis reveals that, out of the total subwords provided, 5 contain non-zero values, indicating they are the active or meaningful tokens in this context.
The average strength of these subwords, calculated as the mean magnitude of their respective vectors, is approximately 0.309. This suggests a moderate level of significance, or contribution of each subword to the overall semantics.
The two main subwords, at indices 2 and 3, have the highest magnitudes (approximately 0.606 and 0.286) and are identified as the most influential or "strongest" in terms of their vector representation.
These main #subwords significantly contribute to the meaning or direction of the overall vector, serving as critical focal points in the analysis.
The distribution of strengths across the #subwords is 'spread across many', indicating a diverse representation of features or dimensions within the vector space.
This spread suggests that the information is not overly concentrated or sparse but is relatively evenly distributed among the different subwords.
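The helper below is a hypothetical reconstruction of the analysis that could produce such a tuple; the original function is not shown in the article, so the name analyze_subword_vectors, the top-k choice, and the "spread" heuristic are all assumptions.

import numpy as np

def analyze_subword_vectors(vectors, top_k=2):
    magnitudes = np.linalg.norm(vectors, axis=1)      # per-subword strength
    active = magnitudes > 0                           # non-zero subwords only
    n_active = int(active.sum())
    avg_strength = float(magnitudes[active].mean())
    # Strongest subwords as (row_index, magnitude) pairs.
    top = sorted(enumerate(magnitudes), key=lambda p: p[1], reverse=True)[:top_k]
    # Crude heuristic: do the top-k rows hold most of the total strength?
    concentrated = sum(m for _, m in top) > 0.5 * magnitudes.sum()
    spread = "concentrated in few" if concentrated else "spread across many"
    return n_active, avg_strength, top, spread

# Demo with random stand-in vectors (the article's real embeddings are not
# reproduced here):
rng = np.random.default_rng(42)
print(analyze_subword_vectors(rng.normal(scale=0.05, size=(6, 768))))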
When considering cosine similarity, a common measure of the orientation or similarity between vectors, exactly one vector (the vector itself) meets the similarity threshold of 0.85.
In a broader analysis, this step would involve comparing the given vector to a set of other vectors to count similar vectors, aiding in understanding the vector's uniqueness or commonality in the context of a larger dataset.
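A sketch of that similarity check follows. The 0.85 threshold comes from the text above; the candidates list is a stand-in for the larger comparison set, which the article does not provide.

import numpy as np

def count_similar(vector, candidates, threshold=0.85):
    # Count candidate vectors whose cosine similarity with `vector`
    # meets or exceeds the threshold.
    v = vector / np.linalg.norm(vector)
    return sum(1 for c in candidates
               if np.dot(c, v) / np.linalg.norm(c) >= threshold)

rng = np.random.default_rng(7)
target = rng.normal(size=768)
candidates = [target] + [rng.normal(size=768) for _ in range(9)]
# A set that contains the vector itself yields at least 1, matching the
# final element of the analysis tuple.
print(count_similar(target, candidates))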
In essence, the vector represents a multi-dimensional space where each word/subword token contributes to the overall semantic direction, with a couple of subwords standing out as particularly influential. The distribution of strength across these subwords is relatively balanced, and the vector's orientation is such that it might share similarity with others in a larger set.
How to understand and #interpret Vectors?
Step 1: The number of rows in the output shown earlier corresponds to the number of subword tokens in your input sentence. The strength or magnitude of each value in the embedding vector represents the significance of that feature for the corresponding subword token; larger magnitudes typically indicate higher importance.
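As a self-contained illustration of Step 1 (the tokens and vectors here are stand-ins; real embeddings would come from the tokenizer/model snippet earlier):

import numpy as np

tokens = ['जब', 'युद्ध', 'होता', 'है', 'humanity', 'loses']
rng = np.random.default_rng(0)
vectors = rng.normal(scale=0.1, size=(len(tokens), 768))  # stand-in embeddings

assert vectors.shape[0] == len(tokens)        # one row per subword token
magnitudes = np.linalg.norm(vectors, axis=1)  # per-token strength
for tok, mag in zip(tokens, magnitudes):
    print(f"{tok!r}: magnitude {mag:.4f}")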
Step 2 (Comparative/Relative Strength): Comparing the magnitudes of values within a single vector can give you insights into which features are more dominant for a particular token. Tokens with larger-magnitude values in certain dimensions may be more strongly associated with those features.
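Step 2 in code: find the dominant dimensions of a single token's vector. Here vec is a stand-in row; with real embeddings you would use a row such as vectors[i].

import numpy as np

rng = np.random.default_rng(1)
vec = rng.normal(scale=0.1, size=768)         # stand-in for one token's vector

top_dims = np.argsort(np.abs(vec))[::-1][:5]  # the 5 largest-|value| dimensions
for d in top_dims:
    print(f"dimension {d}: value {vec[d]:+.4f}")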
Step 3: Similar subwords are likely to have similar vector representations. You can measure the similarity between two vectors using cosine similarity or Euclidean distance. If two subwords have highly similar vectors, they are likely semantically related in the context of the training data.
Step 4: Semantic relationships can be inferred by examining the vectors of similar subwords. For example, if two subwords related to weather (e.g., "sunny" and "breeze") have similar vectors, it suggests a semantic relationship in the context of weather-related language. Similarly, if the vectors of subwords related to conflict (e.g., "war" and "chaos") are similar, it indicates a semantic relationship in the context of conflict-related language.
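Steps 3 and 4 in code: the vectors below are synthetic stand-ins constructed so that the weather pair is related; with a trained model you would look up the real embeddings for these words. The word pairs follow the examples in the text.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(2)
weather = rng.normal(size=300)                 # shared "weather" direction
sunny  = weather + 0.3 * rng.normal(size=300)
breeze = weather + 0.3 * rng.normal(size=300)
war    = rng.normal(size=300)                  # unrelated direction

print(cosine_similarity(sunny, breeze))  # high: semantically related
print(cosine_similarity(sunny, war))     # near 0: unrelated
# Euclidean distance is the alternative: np.linalg.norm(sunny - breeze)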
Remember that these interpretations are general guidelines, and the specific meanings of the dimensions in the embedding space depend on the training data and model architecture.
To gain more meaningful insights, you may need to refer to the documentation of the specific word embedding model used and understand the context in which it was trained.
#Vector analysis, through its systematic approach to quantifying and comparing the characteristics of subwords, offers profound insights into the #linguistic dimensions encoded within text. This analysis not only highlights the relative strengths and influences of subwords but also provides a framework for understanding their #semantic relationships.
By employing methods such as magnitude calculation and cosine similarity, researchers can unravel the complex tapestry of language, revealing patterns and connections that lie beneath the surface.
Through such analytical endeavors, the mathematical essence of language becomes increasingly accessible, paving the way for advancements in NLP and related fields.
I hope this helps ...