Mastering Document Field Matching: A complete (?) guide
Modern AI models like GPT-4 Vision and Qwen-2-VL can analyze images and generate structured JSON outputs directly. These models understand both visual content and text, making them ideal for document extraction tasks. When AI models extract structured data from documents—whether using Azure AI Document Intelligence (formerly Form Recognizer) or other vision-based foundation models—we often need to compare one model’s output against either human-labeled ground truth or the output from another approach (e.g., a second vision model, a customer CRM system, or another data source). This matching is crucial for evaluating accuracy (for example, by computing precision and recall) or determining whether a human should review the output. However, aligning predicted fields with the correct ground truth fields is non-trivial if the order is different, some predictions are missing or extra, or confidences vary.
In this post, I will walk through an algorithmic solution for this matching problem, covering various matching algorithms and how to effectively use them. We will work through step-by-step examples and compare the results across different matching strategies. Let’s dive in.
Problem Setup: AI Extracted Data vs Ground Truth
Imagine we have an AI model that processed a document (for example, an invoice) and extracted some key fields. Azure’s Document Intelligence service typically returns results in JSON format, including field names, the extracted text, and a confidence score for each field. Meanwhile, the “ground truth” data could come from a manually labeled source, a CRM, or even another vision model. Our goal is to pair each predicted field with the correct field in the second dataset—whether that second dataset is truly human-labeled, from a CRM, or from another AI model. Unmatched predictions become false positives, while fields in the second dataset that lack a match become false negatives.
Suppose the second dataset (acting as our “ground truth”) for an invoice contains three fields: Invoice Number, Date, and Total. The AI model (Azure Document Intelligence) also extracted three fields: it accurately identified the invoice number and date (though with a slight date discrepancy), and it predicted an extra field that isn’t in the second dataset (“Document Type”). Below is a simplified representation of the data:
# Azure Document Intelligence predictions JSON, simplified for brevity
predicted_data = [
{"field": "Invoice Num", "text": "INV-0001", "confidence": 0.95},
{"field": "Date", "text": "2021-01-02", "confidence": 0.80},
{"field": "Document Type", "text": "Invoice", "confidence": 0.70}
]
# "Ground truth" (could be human-labeled, CRM data, or from another model)
ground_truth = [
{"field": "Invoice Number", "text": "INV-0001"},
{"field": "Date", "text": "2021-01-01"},
{"field": "Total", "text": "$100.00"}
]
We can see that "Invoice Num" corresponds to the ground truth "Invoice Number" (same value, slightly different label), "Date" corresponds to "Date" (the value is off by one day), the extra "Document Type" prediction has no counterpart in the ground truth, and the ground truth "Total" has no prediction at all. Our task is to automatically match the first two and flag the latter two as unmatched.
Note: The JSON structure shown above is significantly simplified for clarity. In practice, the JSON returned by Azure Document Intelligence can be more complex. However, parsing this JSON is not a major challenge. I have a working example in my GitHub repository.
Constructing a Cost Matrix for Field Matching
Before we dive into the matching algorithms, it’s essential to establish a systematic way to compare AI-extracted predictions with ground truth fields. This is where the cost matrix comes into play. A cost matrix is simply a two-dimensional grid that quantifies the “mismatch” between every pair of predicted and ground truth fields. In our matrix, each row corresponds to a predicted field and each column to a ground truth field.
Each cell in the matrix (i.e., at position [i][j]) contains a numerical value—a cost—that reflects how “bad” it would be to match prediction i with ground truth j. The lower the cost, the better the match. By formalizing the problem in this way, we can then use optimization techniques to select the assignment of predictions to ground truth fields that minimizes the overall cost.
Why Use a Cost Matrix?
Designing a Cost Function: We need to compare a predicted field (with its text and confidence) to a ground truth field (text). Several factors can contribute to the cost: how similar the field names are, how similar the extracted text values are, and how confident the model is in its prediction.
We'll combine these factors into a single cost. One simple approach: define similarity between predicted text and ground truth text (and/or field labels) as a value in [0,1], then cost = 1 - similarity (so 0 cost if perfectly identical). Then add the confidence penalty: total_cost = (1 - similarity) + (1 - confidence). This way, a near-perfect match (similarity ≈ 0.95) with 95% confidence yields cost ≈ 0.05 + 0.05 = 0.10 (very low), whereas a poor text match or low confidence pushes cost toward 1 or more.
Step-by-step Cost Calculation: Let’s compute the cost matrix for our example. We’ll use difflib.SequenceMatcher to measure text similarity (for simplicity) and apply the formula above.
import difflib
# Helper to compute similarity between two strings (0 to 1)
def similarity(a, b):
return difflib.SequenceMatcher(None, a, b).ratio()
# Compute cost matrix
pred_fields = predicted_data
gt_fields = ground_truth
cost_matrix = [] # will be a list of lists
for pred in pred_fields:
pred_label = pred["field"]
pred_text = pred["text"]
pred_conf = pred["confidence"]
row_costs = []
for gt in gt_fields:
gt_label = gt["field"]
gt_text = gt["text"]
# Field name similarity and value similarity
name_sim = similarity(pred_label.lower(), gt_label.lower())
text_sim = similarity(pred_text, gt_text)
sim_score = 0.5 * name_sim + 0.5 * text_sim # weigh them equally
cost = (1 - sim_score) + (1 - pred_conf) # base cost + confidence penalty
row_costs.append(cost)
cost_matrix.append(row_costs)
# Display the cost matrix
gt_field_names = [gt["field"] for gt in gt_fields]
# ANSI color codes
GREEN = "\033[92m" # For predicted field names
BLUE = "\033[94m" # For ground truth field names
YELLOW = "\033[93m" # For cost values
RESET = "\033[0m" # Reset to default
for i, (pred, row) in enumerate(zip(pred_fields, cost_matrix)):
pred_label = pred["field"]
# Format costs to two decimals
formatted_costs = [f"{cost:.2f}" for cost in row]
print(
f"Pred {i} ({GREEN}\"{pred_label}\"{RESET}) vs GT {BLUE}{gt_field_names}{RESET} costs: {YELLOW}{formatted_costs}{RESET}"
)
Running this for our example yields:
Pred 0 ("Invoice Num") vs GT ['Invoice Number', 'Date', 'Total'] costs: ['0.11', '0.76', '0.79']
Pred 1 ("Date") vs GT ['Invoice Number', 'Date', 'Total'] costs: ['0.98', '0.25', '0.91']
Pred 2 ("Document Type") vs GT ['Invoice Number', 'Date', 'Total'] costs: ['1.05', '1.18', '1.19']
Let’s interpret each row:
For Pred 0 (Invoice Num = INV-0001, conf 0.95): the cost against “Invoice Number” is very low (0.11) because the value matches exactly and the label is nearly identical, while the costs against “Date” and “Total” are high (0.76 and 0.79).
For Pred 1 (Date = 2021-01-02, conf 0.80): the cost against “Date” is low (0.25) — the label matches and the value differs only in the last day — while the other columns are close to 1.
For Pred 2 (Document Type = "Invoice", conf 0.70): all costs are greater than 1.0, extremely high (a poor match to any of the ground truth fields). The field name “Document Type” is unrelated to Invoice Number, Date, or Total, and its value "Invoice" doesn’t match any of the ground truth values. Even with 70% confidence, it’s likely not supposed to match any available ground truth field.
Confidence Thresholding: Before matching, we can optionally discard low-confidence predictions altogether. In practice, one might set a threshold (say 0.5 or 0.6). Any prediction with confidence below that is dropped, on the assumption it's likely wrong or noise. In our example, if we had set a threshold of 0.8, the “Document Type” prediction (70% confidence) would be filtered out before matching, leaving only the first two predictions to match with ground truth. We’ll proceed without dropping it to illustrate how algorithms handle an unmatched prediction, but in production you’d tune this threshold based on validation data to improve precision.
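As a minimal sketch of this pre-filtering step (reusing predicted_data from above; the 0.8 threshold is just an illustrative value):

# Minimal sketch: drop low-confidence predictions before matching.
# CONFIDENCE_THRESHOLD is a hypothetical value; tune it on validation data.
CONFIDENCE_THRESHOLD = 0.8
filtered_predictions = [
    pred for pred in predicted_data
    if pred["confidence"] >= CONFIDENCE_THRESHOLD
]
print(f"Kept {len(filtered_predictions)} of {len(predicted_data)} predictions")
# With a 0.8 threshold, the "Document Type" prediction (confidence 0.70) is dropped.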
Now that we have a cost matrix, the task reduces to a matching problem: pair up some of the rows and columns such that the sum of costs of chosen pairs is minimized, without reusing any row or column more than once (each prediction and each ground truth can only be matched once). We also allow that some rows or columns may remain unmatched if pairing them is not beneficial (e.g., we might prefer to leave “Document Type” unmatched than assign it incorrectly to "Total").
This is a classic assignment problem in graph theory: we have a bipartite graph between predictions and ground truth entries with edge weights = cost, and we want a minimum weight matching. Next, we’ll explore various algorithms to solve this.
Algorithm 1: Kuhn-Munkres Algorithm (Hungarian Algorithm)
The Hungarian algorithm is a well-known method for solving the assignment problem optimally in polynomial time. It was developed by Kuhn in 1955 (and later refined to O(n³) complexity). The Hungarian algorithm is guaranteed to find the global minimum total cost matching – in other words, the best overall pairing between predictions and ground truth.
How it works (briefly): The algorithm operates on the cost matrix through a series of steps (subtracting row/column minima, drawing minimum lines through zeros, adjusting remaining costs, etc.) to find an optimal set of zero-cost assignments after transformations. The details can be complex, but many libraries implement it. Under the hood, it’s finding an optimal matching in a bipartite graph (it’s also equivalent to finding a minimum weight perfect matching).
Handling unequal numbers of items: The Hungarian method is typically described for a square matrix (equal number of jobs and workers, or predictions and truths). If the sets are unequal, a common trick is to add dummy rows or columns filled with a certain cost (e.g., zero or a threshold penalty) to balance it. Matching a prediction to a dummy column means leaving that prediction unmatched (and possibly incurring a penalty cost), and similarly for dummy rows. We’ll use this approach so that the algorithm can choose to leave something unmatched if that is better than a very high-cost incorrect match. For example, we can add a dummy ground-truth entry with a cost of, say, 0.5 for any prediction matching to it. This gives the algorithm the option to “pay” 0.5 cost to leave a prediction unmatched instead of forcing a match with cost >1.0.
Python Implementation: We will use scipy.optimize.linear_sum_assignment, which implements the Hungarian algorithm. It takes a cost matrix (can be rectangular) and returns the indices of the optimal assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment
# ANSI color codes
GREEN = "\033[92m" # For predicted field names
BLUE = "\033[94m" # For ground truth field names
YELLOW = "\033[93m" # For headings or cost values if needed
RESET = "\033[0m" # Reset to default
# Reuse the cost_matrix computed in the previous section.
# Prepare cost matrix with a dummy column to allow leaving some predictions unmatched
cost_matrix_np = np.array(cost_matrix)
n_pred, n_gt = cost_matrix_np.shape
if n_pred == n_gt:
dummy_col = np.full((n_pred, 1), 0.5) # cost 0.5 for leaving a pred unmatched
cost_matrix_np = np.hstack([cost_matrix_np, dummy_col])
gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"]
elif n_pred > n_gt:
dummy_cols = np.full((n_pred, n_pred - n_gt), 0.5)
cost_matrix_np = np.hstack([cost_matrix_np, dummy_cols])
gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"] * (n_pred - n_gt)
elif n_gt > n_pred:
dummy_rows = np.full((n_gt - n_pred, n_gt), 0.5)
cost_matrix_np = np.vstack([cost_matrix_np, dummy_rows])
gt_labels = [gt["field"] for gt in ground_truth]
pred_fields += [{"field": "<No Prediction>", "text": "", "confidence": 1.0}] * (n_gt - n_pred)
# Solve assignment using the Hungarian algorithm
row_idx, col_idx = linear_sum_assignment(cost_matrix_np)
assigned_pairs = list(zip(row_idx, col_idx))
# Neatly print the colored assignments
print(f"{YELLOW}Optimal assignment:{RESET}")
for pred_index, gt_index in assigned_pairs:
pred_field = pred_fields[pred_index]["field"]
gt_field = gt_labels[gt_index]
print(f"Pred {pred_index} ({GREEN}\"{pred_field}\"{RESET}) -> GT ({BLUE}{gt_field}{RESET})")
For our example, this might output something like:
Optimal assignment:
Pred 0 ("Invoice Num") -> GT (Invoice Number)
Pred 1 ("Date") -> GT (Date)
Pred 2 ("Document Type") -> GT (<No Match>)
The results are fairly obvious, but let’s walk through them for the sake of clarity.
So, the algorithm chose to match the first two predictions with their correct fields, and match the third prediction (Document Type) to the dummy “No Match” (incurring a cost of 0.5) instead of forcing it to match one of the actual fields at cost >1. This is exactly what we want: it found the minimal-cost solution which leaves the spurious prediction unmatched and leaves the “Total” ground truth unmatched as well. The total cost of this assignment would be 0.11 + 0.25 + 0.5 = 0.86. Any other assignment would cost more (for instance, if it had matched Pred2 to “Total”, the cost would have been ~1.19, making total cost >1.5).
Hungarian algorithm ensured the globally optimal matching. It’s widely used in similar tasks, for example in evaluating object detection where you match predicted bounding boxes to ground truth boxes. The DETR object detection model famously uses Hungarian matching, with a cost that combines class prediction probability and bounding-box overlap, to assign model-detected objects to ground truth objects. In our case, the “class probability” is analogous to the confidence score, and the “bbox difference” is analogous to text difference – we combined them into our cost.
Complexity: Hungarian algorithm runs in O(n³) for n items, which is efficient for moderate sizes (hundreds or a few thousands of items). For a small number of fields per document, it’s extremely fast. If you had a very large number of items to match, this could become a bottleneck, but typically other approaches (min-cost flow variants or approximations) might be considered in those cases.
Output of Matching: We can now produce a report of matched vs unmatched items: in our example, “Invoice Num” and “Date” are matched to their ground truth counterparts, “Document Type” is flagged as an unmatched prediction (a false positive), and “Total” is flagged as an unmatched ground truth field (a false negative).
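A minimal sketch of such a report, reusing assigned_pairs, pred_fields, and gt_labels from the Hungarian code above:

# Minimal sketch of a match report built from the Hungarian output above.
matched, false_positives = [], []
covered_gt = set()
for pred_index, gt_index in assigned_pairs:
    pred_label = pred_fields[pred_index]["field"]
    gt_label = gt_labels[gt_index]
    if gt_label == "<No Match>":
        false_positives.append(pred_label)   # prediction with no ground truth partner
    elif pred_label == "<No Prediction>":
        continue                             # dummy row; the ground truth shows up as a false negative below
    else:
        matched.append((pred_label, gt_label))
        covered_gt.add(gt_index)
false_negatives = [gt["field"] for j, gt in enumerate(ground_truth) if j not in covered_gt]
print("Matched:", matched)                                            # [('Invoice Num', 'Invoice Number'), ('Date', 'Date')]
print("False positives (unmatched predictions):", false_positives)    # ['Document Type']
print("False negatives (unmatched ground truth):", false_negatives)   # ['Total']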
Later, we’ll discuss how to visualize the cost matrix or matches (e.g., creating a heatmap of the cost matrix where low costs = green, high = red, which helps identify the best matches visually).
Algorithm 2: Greedy Nearest-Neighbor Matching
A simpler but heuristic approach is to use a greedy algorithm: repeatedly pick the available prediction-ground truth pair with the lowest cost (nearest match) and mark them as matched, until no more pairs can be matched. This is essentially a locally optimal strategy – at each step, take the best immediate pair. While this is faster and easier to implement, it does not guarantee a globally optimal assignment in all cases. A greedy algorithm can get stuck in a suboptimal configuration because it never reconsiders earlier choices. For example, a classic counterexample is given in the assignment problem: if you greedily assign one agent to a job with cost 1 (best single pair) and the other agent to the remaining job with cost 8 (total 9), you might miss the optimal arrangement of costs 5 and 2 (total 7).
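To make that counterexample concrete, here is a minimal sketch comparing greedy and Hungarian on a hand-crafted 2×2 cost matrix with those costs:

import numpy as np
from scipy.optimize import linear_sum_assignment
# Classic counterexample: greedy grabs the single cheapest pair and gets stuck.
costs = np.array([[1.0, 5.0],
                  [2.0, 8.0]])
# Greedy: pick (0, 0) with cost 1, then the only remaining pair (1, 1) costs 8 -> total 9.
greedy_total = costs[0, 0] + costs[1, 1]
# Hungarian: optimal assignment is (0, 1) and (1, 0) -> total 5 + 2 = 7.
row_idx, col_idx = linear_sum_assignment(costs)
optimal_total = costs[row_idx, col_idx].sum()
print(f"Greedy total: {greedy_total}, optimal total: {optimal_total}")  # 9.0 vs 7.0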
Nevertheless, greedy matching can work reasonably if the cost matrix has a clear one-to-one structure. Let’s apply a greedy matching to our example and see what happens.
Python Implementation: This is a fairly simple implementation; apart from numpy, no special libraries are required.
import numpy as np
# ANSI color codes
GREEN = "\033[92m" # For predicted field names
BLUE = "\033[94m" # For ground truth field names
YELLOW = "\033[93m" # For cost values or headers
RESET = "\033[0m" # Reset to default
# Prepare numpy cost matrix with dummy columns/rows.
cost_matrix_np = np.array(cost_matrix)
n_pred, n_gt = cost_matrix_np.shape
if n_pred == n_gt:
dummy_col = np.full((n_pred, 1), 0.5) # cost 0.5 for leaving a pred unmatched
cost_matrix_np = np.hstack([cost_matrix_np, dummy_col])
gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"]
elif n_pred > n_gt:
dummy_cols = np.full((n_pred, n_pred - n_gt), 0.5)
cost_matrix_np = np.hstack([cost_matrix_np, dummy_cols])
gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"] * (n_pred - n_gt)
elif n_gt > n_pred:
dummy_rows = np.full((n_gt - n_pred, n_gt), 0.5)
cost_matrix_np = np.vstack([cost_matrix_np, dummy_rows])
gt_labels = [gt["field"] for gt in ground_truth]
pred_fields += [{"field": "<No Prediction>", "text": "", "confidence": 1.0}] * (n_gt - n_pred)
# Greedy matching based on all possible pairs sorted by cost ascending
pairs = [(cost_matrix_np[i, j], i, j)
for i in range(cost_matrix_np.shape[0])
for j in range(cost_matrix_np.shape[1])]
pairs.sort(key=lambda x: x[0])
matched_pred = set()
matched_gt = set()
greedy_pairs = []
for cost, i, j in pairs:
if i not in matched_pred and j not in matched_gt:
greedy_pairs.append((i, j, cost))
matched_pred.add(i)
matched_gt.add(j)
# Print the greedy matching results with colored output
print(f"{YELLOW}Greedy matched pairs:{RESET}")
for i, j, cost in greedy_pairs:
pred_field = pred_fields[i]["field"]
gt_field = gt_labels[j]
print(f"Pred {i} ({GREEN}\"{pred_field}\"{RESET}) -> GT ({BLUE}\"{gt_field}\"{RESET}) with cost: {YELLOW}{cost:.2f}{RESET}")
This yields the following output for our example (including dummy column):
Greedy matched pairs:
Pred 0 ("Invoice Num") -> GT ("Invoice Number") with cost: 0.11
Pred 1 ("Date") -> GT ("Date") with cost: 0.25
Pred 2 ("Document Type") -> GT ("<No Match>") with cost: 0.50
We can also exclude the dummy column by restricting j to the real ground truth indices (0 to orig_n_gt - 1, where orig_n_gt is the original number of ground truth fields captured before padding), like this:
# Create sorted list of candidate pairs using only real GT indices
pairs = [(cost_matrix_np[i, j], i, j)
for i in range(cost_matrix_np.shape[0])
for j in range(orig_n_gt)]
This yields the following output for our example (excluding dummy column):
Greedy matched pairs:
Pred 0 ("Invoice Num") -> GT ("Invoice Number") with cost: 0.11
Pred 1 ("Date") -> GT ("Date") with cost: 0.25
Pred 2 ("Document Type") -> GT ("Total") with cost: 1.19
The greedy algorithm did match Pred0→GT0 and Pred1→GT1 (same as Hungarian for those), but then it matched Pred2→GT2 (Document Type to Total) with cost 1.19, because that was the only remaining pair. It never considered leaving Pred2 unmatched. Interestingly, when the dummy column is included in the pair list (as in the first variant), greedy chooses Pred2→Dummy (cost 0.5) over Pred2→Total (cost 1.19), since 0.5 is lower. Without an explicit dummy or threshold, however, a naive greedy approach will keep matching one-to-one until one side is exhausted.
Result: Greedy matching in this case gave a suboptimal result: it incurred a higher total cost and produced an incorrect match (Document Type matched to Total, which is wrong). The total cost with greedy = 0.11 + 0.25 + 1.19 = 1.55, whereas Hungarian’s optimal was 0.86. The greedy algorithm made a locally optimal choice at each step but ended up pairing two items that shouldn’t be matched. This illustrates the common caveat: greedy algorithms run faster but can yield suboptimal solutions.
However, we can improve the greedy approach by adding simple rules: for example, skipping any pair whose cost exceeds a threshold (leaving those items unmatched), or explicitly including the dummy “no match” option in the candidate list as we did above. A sketch of the threshold rule follows below.
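Here is a minimal sketch of the threshold rule, reusing the pairs list restricted to the real ground truth columns (the second variant above); the maximum acceptable cost of 1.0 is a hypothetical value to tune:

# Greedy matching with a simple cost threshold: skip pairs that are too costly
# to be a plausible match. MAX_COST is a hypothetical value to tune.
MAX_COST = 1.0
matched_pred, matched_gt, thresholded_pairs = set(), set(), []
for cost, i, j in sorted(pairs, key=lambda x: x[0]):
    if cost > MAX_COST:
        break  # pairs are sorted ascending, so everything after this is also too expensive
    if i not in matched_pred and j not in matched_gt:
        thresholded_pairs.append((i, j, cost))
        matched_pred.add(i)
        matched_gt.add(j)
# Anything not in matched_pred / matched_gt is left unmatched
# (e.g., "Document Type" and "Total" in our example).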
Complexity: Sorting all pairs is O(n² log n) for n items, and then iterating over them is O(n²). This is usually fine for small n, but becomes expensive when n is large (the number of pairs grows as n²). Using a priority queue instead of a full sort does not change the asymptotics (still O(n² log n)), so greedy’s appeal is simplicity rather than a speed guarantee. Because of its potential to be wrong in some cases, it’s often used only when a quick approximate solution is acceptable or when the cost structure is such that a near-greedy strategy is provably optimal (which is not the general case for assignment).
Algorithm 3: Gale-Shapley (Stable Marriage) Matching
Another approach to matching is the Stable Marriage algorithm, also known as the Gale-Shapley algorithm. This algorithm is designed to find a stable matching between two sets given their preference rankings. A matching is stable if there is no prediction–ground truth pair that would both prefer to be matched to each other over their current partners. Gale-Shapley is guaranteed to find a stable matching (one always exists when preference lists are complete) in O(n²) time, and it works even when each side has its own preference list over the other side.
How can we use this in our context? We can treat each prediction as having preferences over which ground truth field it would rather match (based on ascending cost, i.e., lowest cost = highest preference). Likewise, each ground truth can be given a preference order over predictions (also by increasing cost or perhaps by some priority like importance). Running Gale-Shapley on these preferences will yield a matching where no unmatched prediction-ground truth pair would both prefer each other over their current match.
Important: Stable matching does not guarantee minimum total cost – its goal is different. It ensures a sort of fairness or stability, which can be relevant in certain situations (for example, matching reviewers to conference papers where both have preferences – you don’t want a situation where a reviewer and a paper would prefer to be matched but aren’t). In our scenario, ground truth fields don’t really have intrinsic preferences (any correct prediction is fine), but we can still assign them preferences based on cost (i.e. they “prefer” the prediction that best matches them).
To illustrate Gale-Shapley, we’ll implement it and run it on our example. We’ll assume each prediction ranks the ground truth fields by ascending cost (lowest cost = most preferred), and each ground truth field ranks the predictions the same way.
Then we run the stable marriage algorithm with predictions proposing to ground truth (this tends to favor the proposing side – here, predictions – in terms of getting their more preferred matches, which is fine).
################################################################
# Gale–Shapley stable matching algorithm
def gale_shapley(pred_prefs, gt_prefs):
# pred_prefs: dict of {pred_index: [gt_index preference list]}
# gt_prefs: dict of {gt_index: [pred_index preference list]}
free_preds = list(pred_prefs.keys())
current_match = {} # gt_index -> pred_index
# Precompute ranking for quick comparison: rank[gt][pred] = position in gt's list (lower = more preferred)
rank = {j: {pred: r for r, pred in enumerate(gt_prefs[j])} for j in gt_prefs}
next_proposal = {i: 0 for i in pred_prefs} # index in pref list each pred will propose next
while free_preds:
i = free_preds[0] # take a free prediction
if next_proposal[i] >= len(pred_prefs[i]):
free_preds.pop(0) # no one left to propose
continue
j = pred_prefs[i][next_proposal[i]]
next_proposal[i] += 1
if j not in current_match:
current_match[j] = i
free_preds.pop(0)
else:
k = current_match[j]
if rank[j][i] < rank[j][k]:
current_match[j] = i
free_preds.pop(0)
free_preds.append(k)
# Otherwise, j rejects i (i remains free)
return current_match
################################################################
# Build preference lists from cost_matrix (not including dummy for stable matching logic)
pred_preferences = {}
for i, row in enumerate(cost_matrix):
sorted_gt = sorted(range(len(row)), key=lambda j: row[j])
pred_preferences[i] = sorted_gt
gt_preferences = {}
for j in range(len(cost_matrix[0])):
sorted_pred = sorted(range(len(cost_matrix)), key=lambda i: cost_matrix[i][j])
gt_preferences[j] = sorted_pred
stable_match = gale_shapley(pred_preferences, gt_preferences)
print(f"\n{YELLOW}Stable matches (gt_index -> pred_index):{RESET}", stable_match)
For our example, the stable matching is as follows:
Cost Matrix (rows: predictions, columns: ground truths):
Pred 0 (Invoice Num): 0.11 0.76 0.79
Pred 1 (Date): 0.98 0.25 0.91
Pred 2 (Document Type): 1.05 1.18 1.19
Stable matches (gt_index -> pred_index): {0: 0, 1: 1, 2: 2}
This means GT 0 (Invoice Number) is matched to Pred 0 (Invoice Num), GT 1 (Date) to Pred 1 (Date), and GT 2 (Total) to Pred 2 (Document Type).
So Gale-Shapley ended up matching everything (each ground truth got a partner) in this scenario. Is this matching stable? We need to check whether any unmatched pair would prefer each other: Pred 2 would rather be matched to Invoice Number or Date than to Total, but those ground truth fields prefer their current partners (Pred 0 and Pred 1, which have much lower costs), so no such blocking pair exists and the matching is stable.
In scenarios where both sides have strong opinions (preferences), stable matching is a great solution. In our case, the ground truth “prefers” the correct predictions strongly, so stability ends up pairing them correctly, but it had no mechanism to decide to leave GT2 unmatched instead of taking a poor match. We could introduce a dummy on each side as an additional participant that represents “no match” with an appropriate preference (this gets complicated and isn’t standard stable marriage).
Takeaway: Gale-Shapley can be used for matching if we care about the stability criterion. However, for pure accuracy evaluation, we usually care more about optimizing the matches (even if that means leaving some unmatched). So stable matching is an interesting alternative approach, but it might not be as useful here unless we reformulate the problem with preferences differently. It does run in O(n²) which is efficient, and ensures no pair of pred+gt would secretly rather be matched to each other than their current partners.
Algorithm 4: Linear Programming (ILP) Approach
The assignment problem can also be formulated and solved as an Integer Linear Program (ILP) or as a special case of a minimum-cost flow problem. The formulation is: introduce a binary variable x[i,j] that is 1 if prediction i is matched to ground truth j; minimize the total cost Σ c[i,j]·x[i,j], subject to each prediction being matched at most once (Σⱼ x[i,j] ≤ 1) and each ground truth being matched at most once (Σᵢ x[i,j] ≤ 1).
This ILP is a classic bipartite matching formulation. We can solve it with a MILP solver or even a specialized algorithm (Hungarian is essentially an optimized solver for this with the constraints as equalities for a perfect matching). If we drop the integrality (allow x[i,j] to be fractional between 0 and 1), the problem becomes a linear program whose constraint matrix is totally unimodular, meaning the optimal solution will still be integer in this assignment structure. However, allowing ≤ 1 instead of = 1 constraints means the trivial solution x = 0 (match nothing) is feasible with cost 0, which a solver would choose unless we add something to encourage matching. Typically, one would either enforce matching all ground truths (if evaluating recall one might want to match all truths if possible) or add a dummy with a moderate cost as we did.
Python Implementation: We’ll use PuLP (a Python LP library) to illustrate setting up the ILP, though any MILP solver or OR-Tools CP-SAT could be used.
from pulp import LpProblem, LpVariable, lpSum, LpMinimize, LpStatus
# ANSI color codes for colorful printing
GREEN = "\033[92m" # For predicted field names
BLUE = "\033[94m" # For ground truth field names
YELLOW = "\033[93m" # For headings or emphasis
RESET = "\033[0m" # Reset to default
# Problem dimensions: real ground truth fields plus one dummy "no match" option
n_pred = len(predicted_data)
n_gt = len(ground_truth)
n_total_gt = n_gt + 1  # the last index is the dummy option
# Cost lookup: real costs from cost_matrix, cost 0.5 for leaving a prediction unmatched
cost = {(i, j): cost_matrix[i][j] if j < n_gt else 0.5
        for i in range(n_pred) for j in range(n_total_gt)}
# Create the ILP model as a minimization problem.
prob = LpProblem('Assignment', LpMinimize)
# Create binary variables for each prediction - ground_truth (including dummy) pair (i,j)
x = {(i, j): LpVariable(f"x_{i}_{j}", lowBound=0, upBound=1, cat='Binary')
for i in range(n_pred) for j in range(n_total_gt)}
# Objective: minimize total matching cost.
prob += lpSum(cost[(i, j)] * x[(i, j)]
for i in range(n_pred)
for j in range(n_total_gt))
# Constraints: Force every prediction to be assigned exactly one option (real or dummy)
for i in range(n_pred):
prob += lpSum(x[(i, j)] for j in range(n_total_gt)) == 1
# For real ground truth options, ensure at most one prediction is matched.
for j in range(n_gt):
prob += lpSum(x[(i, j)] for i in range(n_pred)) <= 1
# (No constraint on dummy options so they can be assigned to multiple predictions)
# Solve the ILP
prob.solve()
# Gather matched pairs based on binary variable values
matches = [(i, j) for (i, j) in x if x[(i, j)].value() == 1]
# Colorful output of results similar to the Hungarian code
print(f"{YELLOW}ILP Optimal Assignment (Status: {LpStatus[prob.status]}):{RESET}")
for i, j in matches:
pred_field = predicted_data[i]["field"]
if j < n_gt:
gt_field = ground_truth[j]["field"]
print(f"Pred {i} ({GREEN}\"{pred_field}\"{RESET}) -> GT {j} ({BLUE}\"{gt_field}\"{RESET})")
else:
print(f"Pred {i} ({GREEN}\"{pred_field}\"{RESET}) -> GT (dummy) ({BLUE}\"<No Match>\"{RESET})")
This should give a matching similar to the earlier methods.
ILP Optimal Assignment (Status: Optimal):
Pred 0 ("Invoice Num") -> GT 0 ("Invoice Number")
Pred 1 ("Date") -> GT 1 ("Date")
Pred 2 ("Document Type") -> GT (dummy) ("<No Match>")
However, had we not included a dummy and had we relaxed the prediction constraints to ≤ 1 (rather than requiring every prediction to be matched), the ILP would find the zero assignment (match nothing) as optimal, since all costs are positive. In our example, all real match costs are ≥ 0.11, so the solver would set x = 0 for all pairs at cost 0, which is not what we want. To fix this, we add dummy options or constraints to ensure we match what we can. For illustration, a misconfigured model can happily return an arbitrary (and clearly wrong) assignment such as:
ILP Optimal Assignment (Status: Optimal):
Pred 0 ("Invoice Num") -> GT 2 ("Total")
Pred 1 ("Date") -> GT 1 ("Date")
Pred 2 ("Document Type") -> GT 0 ("Invoice Number")
The linear programming approach is very flexible: you can add side constraints (for example, forbidding certain pairings, forcing specific fields to be matched, or allowing one-to-many matches), change the objective, or incorporate confidence thresholds directly as constraints, all without changing the solver.
However, solving an ILP with a generic solver might be slower than the Hungarian algorithm for large n, since ILP solvers use general branch-and-bound (though the assignment LP is quite easy for them). In practice, one might use a specialized algorithm (Hungarian, min-cost flow algorithms) for speed, but ILP is a good sanity check and useful if extending the problem beyond a basic assignment.
In summary, the ILP yields the same optimal matches (if set up properly) as the Hungarian algorithm, since it’s essentially solving the same optimization problem by a different means. For brevity, we won’t dive more into the solver output here.
Algorithm 5: Approximate Nearest Neighbor (Vector Matching)
The last approach we'll consider is less about solving the assignment problem optimally and more about quickly finding closest matches in a high-dimensional space. If our fields can be embedded as vectors (for example, using a language model to embed the text of each field, or using the spatial coordinates of the field on the document), we could use an Approximate Nearest Neighbor (ANN) search algorithm to find potential matches.
ANN algorithms (like KD-trees, locality-sensitive hashing (LSH), or modern vector indexes) allow us to query for the nearest neighbor of a point without exhaustively scanning all possibilities. This is extremely useful when you have thousands or millions of items to match (not typical for fields in one document, but could be if matching many documents or large tables).
Use case: Suppose each predicted field and each ground truth field has a feature vector combining various attributes (text embedding, position, etc.). We could pre-index the ground truth vectors in a KD-tree and for each predicted vector, query the nearest neighbor among ground truths. This gives a likely match in roughly O(log N) time per query instead of O(N). However, to ensure a one-to-one matching, we’d still need a second step to handle conflicts (two predictions pointing to the same ground truth). A common strategy could be: sort all prediction–neighbor pairs by distance, assign them greedily while skipping ground truths that are already taken, and leave anything whose distance exceeds a threshold unmatched.
This becomes a greedy strategy seeded by ANN results. It may not yield the optimal assignment, but if the embedding space and distance measure are good, it could be a reasonable approximation.
Example (conceptual): In our case, we might embed each field’s text using a simple vector (like [similarity_to_invoice_number, similarity_to_date, similarity_to_total] or some other encoding). But that’s overkill for a small example. Instead, let's demonstrate using a KD-tree on made-up page positions for each field (just for illustration):
from math import inf
from sklearn.neighbors import NearestNeighbors
import numpy as np
# Let's pretend each ground truth has a 2D vector (for example, x,y position on page).
# This is a made up example as an output of Document Intelligence.
# For our example, we'll just make up some coordinates for ground truth fields:
gt_positions = np.array([
[100, 100], # Invoice Number position
[200, 100], # Date position
[100, 200] # Total position
])
# And some positions for predictions:
pred_positions = np.array([
[105, 105], # Invoice Num (close to Invoice Number)
[205, 100], # Date (close to Date)
[300, 300] # Document Type (somewhere else, not near any ground truth)
])
# Use a KD-tree (via NearestNeighbors) to find nearest GT for each pred
nn = NearestNeighbors(n_neighbors=1, metric='euclidean').fit(gt_positions)
distances, indices = nn.kneighbors(pred_positions)
print("Nearest neighbor GT for each prediction:", indices.flatten())
print("Distances:", distances.flatten())
We get the following output:
Nearest neighbor GT for each prediction: [0 1 2]
Distances: [7.07 5.00 223.61]
This indicates that Pred 0’s nearest ground truth is GT 0 (Invoice Number, distance ≈ 7), Pred 1’s is GT 1 (Date, distance 5), and Pred 2’s nearest is GT 2 (Total), but at a very large distance (≈ 224).
If we assign according to this, we’d match Pred0→GT0, Pred1→GT1, Pred2→GT2. In this dummy scenario, that matched everything, but we see Pred2 was actually very far from GT2. We could decide that a distance of ~224 is beyond a threshold and treat Pred2 as unmatched.
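As a minimal sketch (reusing distances and indices from the query above; the distance threshold of 50 is an arbitrary illustrative value), we can enforce one-to-one matching and discard far-away neighbors like this:

# Resolve conflicts greedily and drop matches beyond a distance threshold.
DISTANCE_THRESHOLD = 50.0  # hypothetical value; tune for your coordinate scale
candidates = sorted(
    (dist, pred_i, gt_i)
    for pred_i, (dist, gt_i) in enumerate(zip(distances.flatten(), indices.flatten()))
)
used_gt, ann_matches = set(), []
for dist, pred_i, gt_i in candidates:
    if dist <= DISTANCE_THRESHOLD and gt_i not in used_gt:
        ann_matches.append((pred_i, gt_i, dist))
        used_gt.add(gt_i)
print("ANN matches (pred, gt, distance):", ann_matches)
# For our made-up coordinates: Pred 0 -> GT 0 and Pred 1 -> GT 1; Pred 2 is left unmatched.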
This approach is essentially using spatial (or feature) proximity as the matching criterion directly. It’s a good approach when the matching criterion is purely distance in a feature space and one-to-one matching is less strict or can be enforced by other means. For a large-scale system (say matching hundreds of thousands of entries), one might use ANN on text embeddings to shortlist a few candidate matches for each item, and then run a more exact algorithm like Hungarian on that reduced set.
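A minimal sketch of that two-stage idea, assuming you already have embedding vectors pred_vectors and gt_vectors (hypothetical inputs); the shortlist size k and the big_cost penalty are arbitrary choices:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.optimize import linear_sum_assignment

def shortlist_then_hungarian(pred_vectors, gt_vectors, k=3, big_cost=1e6):
    # Stage 1: ANN shortlist - each prediction only considers its k nearest ground truths.
    k = min(k, len(gt_vectors))
    nn = NearestNeighbors(n_neighbors=k).fit(gt_vectors)
    distances, indices = nn.kneighbors(pred_vectors)
    # Stage 2: exact assignment on a cost matrix that is "big" outside the shortlist.
    cost = np.full((len(pred_vectors), len(gt_vectors)), big_cost)
    for i, (dists, cols) in enumerate(zip(distances, indices)):
        cost[i, cols] = dists
    rows, cols = linear_sum_assignment(cost)
    # Pairs assigned at big_cost are effectively "no match" and are dropped.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < big_cost]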
Caveats: Approximate NN search may return a close but incorrect match if the feature representation is not perfect. It also doesn’t inherently handle the one-to-one constraint – you have to enforce it after the fact. It shines when speed is crucial and a good-enough match is acceptable. In our context of evaluating document extraction, we usually prefer an exact optimal matching (we’re not dealing with millions of fields in one document). But if you were integrating information across databases, vector similarity search could be your friend.
To sum up this section: ANN methods (KD-trees, LSH, etc.) partition the vector space to dramatically speed up neighbor searches at the cost of occasional approximation. They are an advanced tool in the matching arsenal, especially for high-dimensional representations from deep models (foundation models that embed text or layout information).
Putting it all together
To play around with these approaches, I created a synthetic data generator. In this function, I generate the data by first creating a set of ground truth fields, where each field has a unique label and corresponding value (for example, "Field_0" with "Value_0", etc.). Predicted fields are then generated by sometimes correctly copying these ground truth values—with a small chance of minor modifications (like appending a random lowercase letter)—and by introducing extra or missing fields. Each predicted field carries a confidence score representing its reliability. This setup simulates real-world scenarios where data extraction might suffer from noise, inaccuracies, or missing information. The confidence threshold is later used to filter out less-reliable predictions, ensuring that only predictions above a certain confidence level are considered by the matching algorithms.
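The generator in the repository is more elaborate, but a minimal sketch of the idea (the function name, field naming scheme, and probabilities here are illustrative) looks like this:

import random
import string

def generate_synthetic_data(n_fields=10, p_modify=0.2, p_drop=0.1, p_extra=0.2):
    # Ground truth: unique label/value pairs like ("Field_0", "Value_0").
    ground_truth = [{"field": f"Field_{i}", "text": f"Value_{i}"} for i in range(n_fields)]
    predicted = []
    for gt in ground_truth:
        if random.random() < p_drop:
            continue  # simulate a missing prediction (false negative)
        text = gt["text"]
        if random.random() < p_modify:
            text += random.choice(string.ascii_lowercase)  # small extraction error
        predicted.append({"field": gt["field"], "text": text,
                          "confidence": round(random.uniform(0.5, 1.0), 2)})
    # Extra, spurious predictions (false positives)
    for k in range(int(n_fields * p_extra)):
        predicted.append({"field": f"Extra_{k}", "text": f"Noise_{k}",
                          "confidence": round(random.uniform(0.3, 0.9), 2)})
    return predicted, ground_truth

preds, gts = generate_synthetic_data()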
Results
The charts provide a visual evaluation of this matching process.
The cost matrix heatmap displays the computed matching cost between every predicted field and every ground truth field, where darker regions indicate a lower cost—meaning a higher similarity and thus a more likely correct match.
The bar chart compares the number of true positives (TPs), false positives (FPs), and false negatives (FNs) for each matching algorithm, allowing you to quickly see which method most accurately recovers the ground truth.
The confidence distribution plot shows how the confidence scores of the filtered predictions are spread, which can reveal the overall quality of the predictions. Looking at the results from one run, for example, the Hungarian algorithm achieved 9 TPs with only 1 FP and 1 FN, translating to precision, recall, and F1 scores of 90%, all in a very short execution time of 0.11 ms. In contrast, the Linear Programming and Approx NN methods recorded zero correct matches in this instance, clearly indicating that under the current data conditions, these approaches may not be as effective.
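For reference, those metrics follow directly from the TP/FP/FN counts; a minimal sketch using the counts from the Hungarian run above:

# Precision, recall and F1 from matching counts.
tp, fp, fn = 9, 1, 1  # counts from the Hungarian run described above
precision = tp / (tp + fp)                            # 0.90
recall = tp / (tp + fn)                               # 0.90
f1 = 2 * precision * recall / (precision + recall)    # 0.90
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")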
Production Guidelines: Using Confidence Scores and Thresholds Effectively
We touched on confidence scores earlier, but let’s consolidate how to use them in matching: they can be folded into the cost function as a penalty (as we did with 1 - confidence), used as a pre-filter threshold to drop unreliable predictions before matching, and used downstream to decide whether a matched field is auto-accepted or routed to a human reviewer.
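As a consolidated sketch (the filter threshold, penalty weight, and review threshold below are hypothetical values to tune on your own data; similarity and predicted_data come from earlier sections):

# Hypothetical knobs: tune these on validation data.
CONF_FILTER = 0.5        # drop predictions below this before matching
CONF_WEIGHT = 1.0        # how strongly confidence influences the matching cost
REVIEW_THRESHOLD = 0.85  # matched fields below this confidence go to a human

usable = [p for p in predicted_data if p["confidence"] >= CONF_FILTER]

def match_cost(pred, gt):
    name_sim = similarity(pred["field"].lower(), gt["field"].lower())
    text_sim = similarity(pred["text"], gt["text"])
    sim_score = 0.5 * name_sim + 0.5 * text_sim
    return (1 - sim_score) + CONF_WEIGHT * (1 - pred["confidence"])

# After matching (with any algorithm above), route low-confidence matches to review:
needs_review = [p for p in usable if p["confidence"] < REVIEW_THRESHOLD]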
Below are some of the other optimizations, edge cases, and improvements that you can consider in your implementations.
Production Guidelines: Trade-offs and Choosing an Approach
Each algorithm has situations where it shines: the Hungarian algorithm is the default choice when you want the optimal matching and the number of fields is moderate; greedy matching is attractive when speed and simplicity matter and the cost structure clearly separates good matches from bad ones; Gale-Shapley is useful when both sides have genuine preferences and stability matters; the ILP formulation is the most flexible when you need constraints beyond a plain assignment; and approximate nearest neighbor search is the tool for very large-scale matching, where it can shortlist candidates before a more exact algorithm takes over.
Conclusion
We developed a complete solution for matching AI-extracted document fields with ground truth, covering everything from constructing a meaningful cost matrix (taking into account text similarity and confidence scores) to exploring multiple matching algorithms. The Hungarian algorithm provided an optimal benchmark, while other methods like greedy and stable matching gave insight into different strategies (and their pitfalls). Incorporating confidence scores from Azure’s Document Intelligence output was key to improving match quality – high-confidence predictions are favored, and low-confidence ones can be filtered or penalized, aligning with best practices of using confidence to decide automated vs human-involved processing.
This solution can be made production-ready fairly easily: the code is modular (we could easily wrap the matching logic into a function), and we’ve considered edge cases (unequal numbers of items, completely missing predictions, etc.). In a real deployment, one would further refine the cost function (perhaps using domain-specific similarity measures, or even ML models to predict match vs no-match) and set appropriate thresholds. Logging or visualizing the cost matrix and matches (e.g., via heatmaps or summary reports of matched/unmatched items) can help continuously monitor the model’s extraction quality.
By implementing these algorithms and understanding their outputs, we can build a robust evaluation pipeline for document AI systems or even use these techniques to merge AI-extracted data with existing databases or CRM solutions. The combination of foundational algorithmic solutions and AI confidence scores ensures that we match data accurately and intelligently, minimizing errors and highlighting uncertainties for review.
If you made it this far, please consider subscribing to my newsletter and sharing or reposting this to your community.