Mastering Document Field Matching: A complete (?) guide
(Cover image: the wedding arch stands in for a bipartite graph, with nodes and edges symbolizing the connections of a stable matching.)

Modern AI models like GPT-4 Vision and Qwen-2-VL can analyze images and generate structured JSON outputs directly. These models understand both visual content and text, making them ideal for document extraction tasks. When AI models extract structured data from documents—whether using Azure AI Document Intelligence (formerly Form Recognizer) or other vision-based foundation models—we often need to compare one model’s output against either human-labeled ground truth or the output from another approach (e.g., a second vision model, a customer CRM system, or another data source). This matching is crucial for evaluating accuracy (for example, by computing precision and recall) or determining whether a human should review the output. However, aligning predicted fields with the correct ground truth fields is non-trivial if the order is different, some predictions are missing or extra, or confidences vary.

In this post, I will walk through an algorithmic solution for this matching problem, covering various matching algorithms and how to effectively use them. We will work through step-by-step examples and compare the results across different matching strategies. Let’s dive in.


Table of Contents

  1. Problem Setup: AI Extracted Data vs Ground Truth
  2. Constructing a Cost Matrix for Field Matching
  3. Algorithm 1: Kuhn-Munkres Algorithm (Hungarian Algorithm)
  4. Algorithm 2: Greedy Nearest-Neighbor Matching
  5. Algorithm 3: Gale-Shapley (Stable Marriage) Matching
  6. Algorithm 4: Linear Programming (ILP) Approach
  7. Algorithm 5: Approximate Nearest Neighbor (Vector Matching)
  8. Putting it all together - Comparison across all the algorithms
  9. Production Guidelines: Using Confidence Scores and Thresholds Effectively
  10. Production Guidelines: Trade-offs and Choosing an Approach
  11. Conclusion


Code Repository: jump straight to the code if you'd like to explore it.


Problem Setup: AI Extracted Data vs Ground Truth

Imagine we have an AI model that processed a document (for example, an invoice) and extracted some key fields. Azure’s Document Intelligence service typically returns results in JSON format, including field names, the extracted text, and a confidence score for each field. Meanwhile, the “ground truth” data could come from a manually labeled source, a CRM, or even another vision model. Our goal is to pair each predicted field with the correct field in the second dataset—whether that second dataset is truly human-labeled, from a CRM, or from another AI model. Unmatched predictions become false positives, while fields in the second dataset that lack a match become false negatives.

Suppose the second dataset (acting as our “ground truth”) for an invoice contains three fields: Invoice Number, Date, and Total. The AI model (Azure Document Intelligence) also extracted three fields: it accurately identified the invoice number and date (though with a slight date discrepancy), and it predicted an extra field that isn’t in the second dataset (“Document Type”). Below is a simplified representation of the data:

# Azure Document Intelligence predictions JSON, simplified for brevity
predicted_data = [
    {"field": "Invoice Num", "text": "INV-0001", "confidence": 0.95},
    {"field": "Date", "text": "2021-01-02", "confidence": 0.80},
    {"field": "Document Type", "text": "Invoice", "confidence": 0.70}
]

# "Ground truth" (could be human-labeled, CRM data, or from another model)
ground_truth = [
    {"field": "Invoice Number", "text": "INV-0001"},
    {"field": "Date", "text": "2021-01-01"},
    {"field": "Total", "text": "$100.00"}
]        

We can see that:

  • “Invoice Num” (predicted) likely corresponds to Invoice Number (ground truth) with the same value INV-0001 (a correct match).
  • “Date” is predicted as 2021-01-02 whereas ground truth is 2021-01-01 (off by one day – a partial mismatch).
  • “Total” is present in ground truth ($100.00) but the model did not extract it (a missed field → will be unmatched).
  • The model predicted a field “Document Type: Invoice” which isn’t in the ground truth (an extra field → will be unmatched).

Our task is to automatically match the first two and flag the latter two as unmatched.

Note: The JSON structure shown above is significantly simplified for clarity. In practice, the JSON returned by Azure Document Intelligence can be more complex. However, parsing this JSON is not a major challenge. I have a working example in my GitHub repository.

Constructing a Cost Matrix for Field Matching

Before we dive into the matching algorithms, it’s essential to establish a systematic way to compare AI-extracted predictions with ground truth fields. This is where the cost matrix comes into play. A cost matrix is simply a two-dimensional grid that quantifies the “mismatch” between every pair of predicted and ground truth fields. In our matrix:

  • Rows represent the predictions made by the AI model.
  • Columns represent the corresponding ground truth fields.

Each cell in the matrix (i.e., at position [i][j]) contains a numerical value—a cost—that reflects how “bad” it would be to match prediction i with ground truth j. The lower the cost, the better the match. By formalizing the problem in this way, we can then use optimization techniques to select the assignment of predictions to ground truth fields that minimizes the overall cost.

Why Use a Cost Matrix?

  1. Quantitative Comparison: It allows us to combine different factors—such as text similarity and model confidence—into a single numeric score.
  2. Optimization Ready: Once we’ve populated the matrix with costs, the problem of pairing predictions with ground truth becomes an instance of the well-studied assignment problem. This paves the way for applying efficient algorithms like the Hungarian algorithm to find the best overall matching.
  3. Handling Imperfections: In many real-world scenarios, predictions may be missing, extra, or have varying confidence levels. A cost matrix helps us decide not only which fields should be paired but also whether some predictions should be left unmatched if no good match exists.

Designing a Cost Function: We need to compare a predicted field (with its text and confidence) to a ground truth field (text). Several factors can contribute to the cost:

  • Text similarity: If the predicted text exactly matches the ground truth text (e.g., the same invoice number or date), this should yield a low cost. If they are very different, the cost is high. We can use string similarity metrics (like Levenshtein edit distance or sequence matching) for this.
  • Field name similarity: Sometimes the field labels help (e.g., “Invoice Num” vs “Invoice Number” are essentially the same field). If the predicted field name differs from the expected field (like a date predicted where total is expected), that should incur a higher cost.
  • Confidence score: When working with Azure Document Intelligence, we get a confidence score together with the predicted value. A prediction with a low confidence should be considered less reliable. We can incorporate this by penalizing low confidence – e.g., using (1 - confidence) as an added cost. High confidence (close to 1.0 or 100%) adds little cost, while low confidence adds a larger cost, effectively telling the algorithm that matching this prediction to anything is “expensive” (thus it might be left unmatched if possible). This aligns with the idea of using confidence to decide whether to trust a prediction or send it for human review.

We'll combine these factors into a single cost. One simple approach: define similarity between predicted text and ground truth text (and/or field labels) as a value in [0,1], then cost = 1 - similarity (so 0 cost if perfectly identical). Then add the confidence penalty: total_cost = (1 - similarity) + (1 - confidence). This way, a near-perfect match (similarity ≈ 0.95) with 95% confidence yields cost ≈ 0.05 + 0.05 = 0.10 (very low), whereas a poor text match or low confidence pushes cost toward 1 or more.

Step-by-step Cost Calculation: Let’s compute the cost matrix for our example. We’ll use difflib.SequenceMatcher to measure text similarity (for simplicity) and apply the formula above.

import difflib

# Helper to compute similarity between two strings (0 to 1)
def similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

# Compute cost matrix
pred_fields = predicted_data
gt_fields = ground_truth

cost_matrix = []  # will be a list of lists
for pred in pred_fields:
    pred_label = pred["field"]
    pred_text = pred["text"]
    pred_conf = pred["confidence"]
    row_costs = []
    for gt in gt_fields:
        gt_label = gt["field"]
        gt_text = gt["text"]
        # Field name similarity and value similarity
        name_sim = similarity(pred_label.lower(), gt_label.lower())
        text_sim = similarity(pred_text, gt_text)
        sim_score = 0.5 * name_sim + 0.5 * text_sim  # weigh them equally
        cost = (1 - sim_score) + (1 - pred_conf)     # base cost + confidence penalty
        row_costs.append(cost)
    cost_matrix.append(row_costs)

# Display the cost matrix
gt_field_names = [gt["field"] for gt in gt_fields]

# ANSI color codes
GREEN = "\033[92m"    # For predicted field names
BLUE = "\033[94m"     # For ground truth field names
YELLOW = "\033[93m"   # For cost values
RESET = "\033[0m"     # Reset to default

for i, (pred, row) in enumerate(zip(pred_fields, cost_matrix)):
    pred_label = pred["field"]
    # Format costs to two decimals
    formatted_costs = [f"{cost:.2f}" for cost in row]
    print(
        f"Pred {i} ({GREEN}\"{pred_label}\"{RESET}) vs GT {BLUE}{gt_field_names}{RESET} costs: {YELLOW}{formatted_costs}{RESET}"
    )        

Running this for our example yields:

Pred 0 ("Invoice Num") vs GT ['Invoice Number', 'Date', 'Total'] costs: ['0.11', '0.76', '0.79']
Pred 1 ("Date") vs GT ['Invoice Number', 'Date', 'Total'] costs: ['0.98', '0.25', '0.91']
Pred 2 ("Document Type") vs GT ['Invoice Number', 'Date', 'Total'] costs: ['1.05', '1.18', '1.19']        

Let’s interpret each row:

For Pred 0 (Invoice Num = INV-0001, conf 0.95):

  • Cost to match GT Invoice Number (INV-0001) is 0.11, which is very low (good match). This makes sense: the field labels are almost the same (“Invoice Num” vs “Invoice Number”) and the text INV-0001 exactly matches; confidence 95% adds only 0.05 cost.
  • Cost to match GT Date is 0.76 – quite high. Different field name and completely different text (an invoice number vs a date) gives low similarity, hence high cost.
  • Cost to match GT Total is 0.79, also a bad match (different field and text).

For Pred 1 (Date = 2021-01-02, conf 0.80):

  • Cost to GT Date is 0.25 (pretty low). Field name matches exactly, text is very similar to 2021-01-01 (off by one day, but ~90% similar), confidence 80% adds 0.20.
  • Cost to GT Invoice Number or Total are high (>0.9), as a date doesn’t match those fields.

For Pred 2 (Document Type = "Invoice", conf 0.70):

All costs are greater than 1.0, extremely high (poor match to any of the ground truth fields). The field name “Document Type” is unrelated to Invoice Number, Date, or Total; its value "Invoice" doesn’t match any of the ground truth values. Even with 70% confidence, it’s likely not supposed to match any available ground truth field.

Confidence Thresholding: Before matching, we can optionally discard low-confidence predictions altogether. In practice, one might set a threshold (say 0.5 or 0.6). Any prediction with confidence below that is dropped, on the assumption it's likely wrong or noise. In our example, if we had set a threshold of 0.8, the “Document Type” prediction (70% confidence) would be filtered out before matching, leaving only the first two predictions to match with ground truth. We’ll proceed without dropping it to illustrate how algorithms handle an unmatched prediction, but in production you’d tune this threshold based on validation data to improve precision.
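
As a concrete illustration, here is a minimal pre-filter sketch using the predicted_data structure from above; the CONFIDENCE_THRESHOLD value is a hypothetical, tunable parameter:

# Minimal sketch: drop low-confidence predictions before building the cost matrix.
# CONFIDENCE_THRESHOLD is a hypothetical cutoff; tune it on validation data.
CONFIDENCE_THRESHOLD = 0.8

filtered_predictions = [
    pred for pred in predicted_data
    if pred["confidence"] >= CONFIDENCE_THRESHOLD
]
print(f"Kept {len(filtered_predictions)} of {len(predicted_data)} predictions")
# With a 0.8 threshold, "Document Type" (confidence 0.70) would be filtered out here.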

Now that we have a cost matrix, the task reduces to a matching problem: pair up some of the rows and columns such that the sum of costs of chosen pairs is minimized, without reusing any row or column more than once (each prediction and each ground truth can only be matched once). We also allow that some rows or columns may remain unmatched if pairing them is not beneficial (e.g., we might prefer to leave “Document Type” unmatched than assign it incorrectly to "Total").

This is a classic assignment problem in graph theory: we have a bipartite graph between predictions and ground truth entries with edge weights = cost, and we want a minimum weight matching. Next, we’ll explore various algorithms to solve this.


Algorithm 1: Kuhn-Munkres Algorithm (Hungarian Algorithm)

The Hungarian algorithm is a well-known method for solving the assignment problem optimally in polynomial time. It was developed by Kuhn in 1955 (and later refined to run in O(n³) time). The Hungarian algorithm is guaranteed to find the global minimum total cost matching – in other words, the best overall pairing between predictions and ground truth.

How it works (briefly): The algorithm operates on the cost matrix through a series of steps (subtracting row/column minima, drawing minimum lines through zeros, adjusting remaining costs, etc.) to find an optimal set of zero-cost assignments after transformations. The details can be complex, but many libraries implement it. Under the hood, it’s finding an optimal matching in a bipartite graph (it’s also equivalent to finding a minimum weight perfect matching).

Handling unequal numbers of items: The Hungarian method is typically described for a square matrix (equal number of jobs and workers, or predictions and truths). If the sets are unequal, a common trick is to add dummy rows or columns filled with a certain cost (e.g., zero or a threshold penalty) to balance it. Matching a prediction to a dummy column means leaving that prediction unmatched (and possibly incurring a penalty cost), and similarly for dummy rows. We’ll use this approach so that the algorithm can choose to leave something unmatched if that is better than a very high-cost incorrect match. For example, we can add a dummy ground-truth entry with a cost of, say, 0.5 for any prediction matching to it. This gives the algorithm the option to “pay” 0.5 cost to leave a prediction unmatched instead of forcing a match with cost >1.0.

Python Implementation: We will use scipy.optimize.linear_sum_assignment, which implements the Hungarian algorithm. It takes a cost matrix (can be rectangular) and returns the indices of the optimal assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

# ANSI color codes
GREEN = "\033[92m"    # For predicted field names
BLUE = "\033[94m"     # For ground truth field names
YELLOW = "\033[93m"   # For headings or cost values if needed
RESET = "\033[0m"     # Reset to default

# Reuse the cost_matrix computed in the previous section.
# Prepare the cost matrix with dummy columns/rows so some predictions can be left unmatched.
cost_matrix_np = np.array(cost_matrix)
n_pred, n_gt = cost_matrix_np.shape
if n_pred == n_gt:
    dummy_col = np.full((n_pred, 1), 0.5)  # cost 0.5 for leaving a pred unmatched
    cost_matrix_np = np.hstack([cost_matrix_np, dummy_col])
    gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"]
elif n_pred > n_gt:
    dummy_cols = np.full((n_pred, n_pred - n_gt), 0.5)
    cost_matrix_np = np.hstack([cost_matrix_np, dummy_cols])
    gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"] * (n_pred - n_gt)
elif n_gt > n_pred:
    dummy_rows = np.full((n_gt - n_pred, n_gt), 0.5)
    cost_matrix_np = np.vstack([cost_matrix_np, dummy_rows])
    gt_labels = [gt["field"] for gt in ground_truth]
    pred_fields += [{"field": "<No Prediction>", "text": "", "confidence": 1.0}] * (n_gt - n_pred)

# Solve assignment using the Hungarian algorithm
row_idx, col_idx = linear_sum_assignment(cost_matrix_np)
assigned_pairs = list(zip(row_idx, col_idx))

# Neatly print the colored assignments
print(f"{YELLOW}Optimal assignment:{RESET}")
for pred_index, gt_index in assigned_pairs:
    pred_field = pred_fields[pred_index]["field"]
    gt_field = gt_labels[gt_index]
    print(f"Pred {pred_index} ({GREEN}\"{pred_field}\"{RESET}) -> GT ({BLUE}{gt_field}{RESET})")        

For our example, this might output something like:

Optimal assignment:
Pred 0 ("Invoice Num") -> GT (Invoice Number)
Pred 1 ("Date") -> GT (Date)
Pred 2 ("Document Type") -> GT (<No Match>)

The results are straightforward, but for the sake of clarity:

  • Pred 0 → GT 0 (our Pred0 “Invoice Num” matched to GT0 “Invoice Number”),
  • Pred 1 → GT 1 (“Date” matched to “Date”),
  • Pred 2 → GT 3, where column 3 is the dummy <No Match> we added.

So, the algorithm chose to match the first two predictions with their correct fields, and match the third prediction (Document Type) to the dummy “No Match” (incurring a cost of 0.5) instead of forcing it to match one of the actual fields at cost >1. This is exactly what we want: it found the minimal-cost solution which leaves the spurious prediction unmatched and leaves the “Total” ground truth unmatched as well. The total cost of this assignment would be 0.11 + 0.25 + 0.5 = 0.86. Any other assignment would cost more (for instance, if it had matched Pred2 to “Total”, the cost would have been ~1.19, making total cost >1.5).

Hungarian algorithm ensured the globally optimal matching. It’s widely used in similar tasks, for example in evaluating object detection where you match predicted bounding boxes to ground truth boxes. The DETR object detection model famously uses Hungarian matching, with a cost that combines class prediction probability and bounding-box overlap, to assign model-detected objects to ground truth objects. In our case, the “class probability” is analogous to the confidence score, and the “bbox difference” is analogous to text difference – we combined them into our cost.

Complexity: Hungarian algorithm runs in O(n³) for n items, which is efficient for moderate sizes (hundreds or a few thousands of items). For a small number of fields per document, it’s extremely fast. If you had a very large number of items to match, this could become a bottleneck, but typically other approaches (min-cost flow variants or approximations) might be considered in those cases.

Output of Matching: We can now produce a report of matched vs unmatched items:

  • Matched: Invoice Number ←→ Invoice Num (INV-0001), confidence 95% – matched correctly
  • Matched: Date ←→ Date (2021-01-01 vs 2021-01-02), confidence 80% – matched (with a minor value discrepancy, possibly send to human review).
  • Unmatched (Ground Truth): Total $100.00 – model missed this field
  • Unmatched (Prediction): Document Type = “Invoice”, confidence 70% – no corresponding field in truth
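
For completeness, here is one way such a report could be assembled from the Hungarian output; this sketch reuses the assigned_pairs, pred_fields, gt_labels, and ground_truth variables from the previous code block:

# Sketch: turn the Hungarian assignment into a matched/unmatched report.
matched_gt_indices = set()
for pred_index, gt_index in assigned_pairs:
    pred = pred_fields[pred_index]
    if pred["field"] == "<No Prediction>":
        continue  # dummy row added for padding; the real GT shows up as unmatched below
    if gt_labels[gt_index] == "<No Match>":
        print(f"Unmatched (Prediction): {pred['field']} = {pred['text']!r}, "
              f"confidence {pred['confidence']:.0%}")
    else:
        matched_gt_indices.add(gt_index)
        gt = ground_truth[gt_index]
        print(f"Matched: {gt['field']} <-> {pred['field']} "
              f"({gt['text']!r} vs {pred['text']!r}), confidence {pred['confidence']:.0%}")

for j, gt in enumerate(ground_truth):
    if j not in matched_gt_indices:
        print(f"Unmatched (Ground Truth): {gt['field']} = {gt['text']!r}")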

Later, we’ll discuss how to visualize the cost matrix or matches (e.g., creating a heatmap of the cost matrix where low costs = green, high = red, which helps identify the best matches visually).


Algorithm 2: Greedy Nearest-Neighbor Matching

A simpler but heuristic approach is to use a greedy algorithm: repeatedly pick the available prediction-ground truth pair with the lowest cost (nearest match) and mark them as matched, until no more pairs can be matched. This is essentially a locally optimal strategy – at each step, take the best immediate pair. While this is faster and easier to implement, it does not guarantee a globally optimal assignment in all cases. A greedy algorithm can get stuck in a suboptimal configuration because it never reconsiders earlier choices. For example, a classic counterexample is given in the assignment problem: if you greedily assign one agent to a job with cost 1 (best single pair) and the other agent to the remaining job with cost 8 (total 9), you might miss the optimal arrangement of costs 5 and 2 (total 7).

Nevertheless, greedy matching can work reasonably if the cost matrix has a clear one-to-one structure. Let’s apply a greedy matching to our example and see what happens.

Python Implementation: This is a fairly simple implementation; aside from NumPy, no special libraries are required.

import numpy as np

# ANSI color codes
GREEN = "\033[92m"    # For predicted field names
BLUE = "\033[94m"     # For ground truth field names
YELLOW = "\033[93m"   # For cost values or headers
RESET = "\033[0m"     # Reset to default

# Prepare numpy cost matrix with dummy columns/rows.
cost_matrix_np = np.array(cost_matrix)
n_pred, n_gt = cost_matrix_np.shape
if n_pred == n_gt:
    dummy_col = np.full((n_pred, 1), 0.5)  # cost 0.5 for leaving a pred unmatched
    cost_matrix_np = np.hstack([cost_matrix_np, dummy_col])
    gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"]
elif n_pred > n_gt:
    dummy_cols = np.full((n_pred, n_pred - n_gt), 0.5)
    cost_matrix_np = np.hstack([cost_matrix_np, dummy_cols])
    gt_labels = [gt["field"] for gt in ground_truth] + ["<No Match>"] * (n_pred - n_gt)
elif n_gt > n_pred:
    dummy_rows = np.full((n_gt - n_pred, n_gt), 0.5)
    cost_matrix_np = np.vstack([cost_matrix_np, dummy_rows])
    gt_labels = [gt["field"] for gt in ground_truth]
    pred_fields += [{"field": "<No Prediction>", "text": "", "confidence": 1.0}] * (n_gt - n_pred)

# Greedy matching based on all possible pairs sorted by cost ascending
pairs = [(cost_matrix_np[i, j], i, j)
         for i in range(cost_matrix_np.shape[0])
         for j in range(cost_matrix_np.shape[1])]
pairs.sort(key=lambda x: x[0])

matched_pred = set()
matched_gt = set()
greedy_pairs = []

for cost, i, j in pairs:
    if i not in matched_pred and j not in matched_gt:
        greedy_pairs.append((i, j, cost))
        matched_pred.add(i)
        matched_gt.add(j)

# Print the greedy matching results with colored output
print(f"{YELLOW}Greedy matched pairs:{RESET}")
for i, j, cost in greedy_pairs:
    pred_field = pred_fields[i]["field"]
    gt_field = gt_labels[j]
    print(f"Pred {i} ({GREEN}\"{pred_field}\"{RESET}) -> GT ({BLUE}\"{gt_field}\"{RESET}) with cost: {YELLOW}{cost:.2f}{RESET}")        

This yields the following output for our example (including dummy column):

Greedy matched pairs: 
Pred 0 ("Invoice Num") -> GT ("Invoice Number") with cost: 0.11
Pred 1 ("Date") -> GT ("Date") with cost: 0.25
Pred 2 ("Document Type") -> GT ("<No Match>") with cost: 0.50        

We can exclude the dummy column by restricting j to the real ground truth indices (0 through n_gt - 1, where n_gt still holds the original number of ground truth fields captured before padding), like this:

# Create sorted list of candidate pairs using only real GT indices
pairs = [(cost_matrix_np[i, j], i, j)
         for i in range(cost_matrix_np.shape[0])
         for j in range(n_gt)]

This yields the following output for our example (excluding dummy column):

Greedy matched pairs:
Pred 0 ("Invoice Num") -> GT ("Invoice Number") with cost: 0.11
Pred 1 ("Date") -> GT ("Date") with cost: 0.25
Pred 2 ("Document Type") -> GT ("Total") with cost: 1.19        

The greedy algorithm did match Pred0→GT0 and Pred1→GT1 (same as Hungarian for those), but then it matched Pred2→GT2 (Document Type to Total) with cost 1.19, because at the end that was the only remaining pair. Without the dummy column in the candidate list, it has no way to leave those items unmatched. Interestingly, if we do include the dummy (cost 0.5) in the pair list, greedy chooses Pred2→Dummy (0.5) over Pred2→Total (1.19), since 0.5 is lower. Without an explicit dummy or threshold, however, a naive greedy approach will keep matching one-to-one until one side is exhausted.

Result: Greedy matching in this case gave a suboptimal result: it incurred a higher total cost and produced an incorrect match (Document Type matched to Total, which is wrong). The total cost with greedy = 0.11 + 0.25 + 1.19 = 1.55, whereas Hungarian’s optimal was 0.86. The greedy algorithm made a locally optimal choice at each step but ended up pairing two items that shouldn’t be matched. This illustrates the common caveat: greedy algorithms run faster but can yield suboptimal solutions.

However, we can improve the greedy approach by adding simple rules (a short sketch follows this list):

  • If the lowest available cost is above a certain threshold, we might decide not to match that pair (i.e., leave those items unmatched). In our example, noticing that 1.19 is high and perhaps above a 1.0 threshold could have prevented the bad match.
  • We could also integrate the dummy matching concept: include a “no match” option for each prediction with a moderate cost (like 0.5) so that the greedy algorithm can choose that if all real options are worse. Essentially, that converts greedy to a heuristic that can skip poor matches.
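
Here is a minimal sketch of the thresholded variant, reusing the sorted pairs list from the greedy code above; MAX_MATCH_COST is a hypothetical cutoff, not a value from this post's experiments:

# Sketch: greedy matching that refuses any pair whose cost exceeds a cutoff.
MAX_MATCH_COST = 1.0   # hypothetical threshold; tune on validation data

matched_pred, matched_gt, thresholded_pairs = set(), set(), []
for cost, i, j in pairs:               # pairs is already sorted by ascending cost
    if cost > MAX_MATCH_COST:
        break                          # every remaining pair is even more expensive
    if i not in matched_pred and j not in matched_gt:
        thresholded_pairs.append((i, j, cost))
        matched_pred.add(i)
        matched_gt.add(j)
# Anything that never made it into thresholded_pairs is left unmatched.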

Complexity: Sorting all pairs is O(n² log n) for n items, and then iterating is O(n²). This is usually fine for small n, but becomes expensive if n is large (because n² pairs grow quickly). A more efficient greedy implementation could pick the minimum cost pair in O(n²) without full sorting, or use a priority queue for O(n² log n). Greedy is simpler than Hungarian (conceptually), but because of its potential to be wrong in some cases, it’s often used only when a quick approximate solution is acceptable or when the cost structure is such that a near-greedy strategy is provably optimal (which is not the general case for assignment).


Algorithm 3: Gale-Shapley (Stable Marriage) Matching

Another approach to matching is the Stable Marriage algorithm, also known as the Gale-Shapley algorithm. It is designed to find a stable matching between two sets given their preference rankings. A matching is stable if there is no prediction-ground truth pair who would both prefer to be matched to each other over their current partners. A stable matching always exists in this setting, and Gale-Shapley finds one in O(n²) time for n participants per side; it works even when each side has its own preference ordering over the other side.

How can we use this in our context? We can treat each prediction as having preferences over which ground truth field it would rather match (based on ascending cost, i.e., lowest cost = highest preference). Likewise, each ground truth can be given a preference order over predictions (also by increasing cost or perhaps by some priority like importance). Running Gale-Shapley on these preferences will yield a matching where no unmatched prediction-ground truth pair would both prefer each other over their current match.

Important: Stable matching does not guarantee minimum total cost – its goal is different. It ensures a sort of fairness or stability, which can be relevant in certain situations (for example, matching reviewers to conference papers where both have preferences – you don’t want a situation where a reviewer and a paper would prefer to be matched but aren’t). In our scenario, ground truth fields don’t really have intrinsic preferences (any correct prediction is fine), but we can still assign them preferences based on cost (i.e. they “prefer” the prediction that best matches them).

To illustrate Gale-Shapley, we’ll implement it and run on our example. We’ll assume:

  • Each prediction ranks ground truth fields by increasing cost (lowest cost = top choice).
  • Each ground truth ranks predictions by increasing cost (i.e., whichever prediction seems most similar/accurate for that field is most preferred).

Then we run the stable marriage algorithm with predictions proposing to ground truth (this tends to favor the proposing side – here, predictions – in terms of getting their more preferred matches, which is fine).


################################################################
# Gale–Shapley stable matching algorithm
def gale_shapley(pred_prefs, gt_prefs):
    # pred_prefs: dict of {pred_index: [gt_index preference list]}
    # gt_prefs: dict of {gt_index: [pred_index preference list]}
    free_preds = list(pred_prefs.keys())
    current_match = {}  # gt_index -> pred_index
    # Precompute ranking for quick comparison: rank[gt][pred] = position in gt's list (lower = more preferred)
    rank = {j: {pred: r for r, pred in enumerate(gt_prefs[j])} for j in gt_prefs}
    next_proposal = {i: 0 for i in pred_prefs}  # index in pref list each pred will propose next

    while free_preds:
        i = free_preds[0]  # take a free prediction
        if next_proposal[i] >= len(pred_prefs[i]):
            free_preds.pop(0)  # no one left to propose
            continue
        j = pred_prefs[i][next_proposal[i]]
        next_proposal[i] += 1
        if j not in current_match:
            current_match[j] = i
            free_preds.pop(0)
        else:
            k = current_match[j]
            if rank[j][i] < rank[j][k]:
                current_match[j] = i
                free_preds.pop(0)
                free_preds.append(k)
            # Otherwise, j rejects i (i remains free)
    return current_match

################################################################
# Build preference lists from cost_matrix (not including dummy for stable matching logic)
pred_preferences = {}
for i, row in enumerate(cost_matrix):
    sorted_gt = sorted(range(len(row)), key=lambda j: row[j])
    pred_preferences[i] = sorted_gt

gt_preferences = {}
for j in range(len(cost_matrix[0])):
    sorted_pred = sorted(range(len(cost_matrix)), key=lambda i: cost_matrix[i][j])
    gt_preferences[j] = sorted_pred

stable_match = gale_shapley(pred_preferences, gt_preferences)
print(f"\n{YELLOW}Stable matches (gt_index -> pred_index):{RESET}", stable_match)
        

For our example, the stable matching is as follows:

Cost Matrix (rows: predictions, columns: ground truths):
Pred 0 (Invoice Num): 0.11  0.76  0.79
Pred 1 (Date): 0.98  0.25  0.91
Pred 2 (Document Type): 1.05  1.18  1.19

Stable matches (gt_index -> pred_index): {0: 0, 1: 1, 2: 2}        

This means:

  • GT0 (Invoice Number) matched with Pred0 (Invoice Num),
  • GT1 (Date) matched with Pred1 (Date),
  • GT2 (Total) matched with Pred2 (Document Type).

So Gale-Shapley ended up matching everything (each ground truth got a partner) in this scenario. Is this matching stable? We need to check whether any prediction-ground truth pair that is not matched to each other would mutually prefer to be:

  • Every prediction and every ground truth ended up with a partner (Pred2 is matched to GT2 here), so we only need to check pairs that are matched to different partners.
  • Notice that this stable matching is the same result greedy gave (and it is not the optimal assignment): it matched the Document Type prediction to the Total field, which is incorrect. Why did stable matching do that? From the predictions' perspective, Pred2 (Document Type) prefers GT0 most (cost 1.05 vs 1.18 and 1.19), but GT0 was already matched to Pred0, whom GT0 prefers (Pred0 cost 0.11 vs Pred2 cost 1.05). Pred2 then tried GT1, which was matched to Pred1, whom GT1 prefers (0.25 vs 1.18). Finally Pred2 proposed to GT2, which was still free at that point and accepted.
  • Once matched, no “rogue pairs” exist that would break stability. For instance, Pred2 and GT0 are not matched, and Pred2 would indeed prefer GT0 over its current GT2; but would GT0 prefer Pred2 over its current Pred0? No (GT0 likes Pred0 much better). No mutual preference violation exists anywhere, so the matching is stable but not optimal in terms of cost.

In scenarios where both sides have strong opinions (preferences), stable matching is a great solution. In our case, the ground truth “prefers” the correct predictions strongly, so stability ends up pairing them correctly, but it had no mechanism to decide to leave GT2 unmatched instead of taking a poor match. We could introduce a dummy on each side as an additional participant that represents “no match” with an appropriate preference (this gets complicated and isn’t standard stable marriage).

Takeaway: Gale-Shapley can be used for matching if we care about the stability criterion. However, for pure accuracy evaluation, we usually care more about optimizing the matches (even if that means leaving some unmatched). So stable matching is an interesting alternative approach, but it might not be as useful here unless we reformulate the problem with preferences differently. It does run in O(n²) which is efficient, and ensures no pair of pred+gt would secretly rather be matched to each other than their current partners.


Algorithm 4: Linear Programming (ILP) Approach

The assignment problem can also be formulated and solved as an Integer Linear Program (ILP) or as a special case of a minimum-cost flow problem. The formulation is:

  • Decision variables: x[i,j] for each prediction i and ground truth j, where x[i,j] = 1 if prediction i is matched to gt j, 0 otherwise.
  • Objective: minimize ∑_{i,j} cost[i,j] · x[i,j].
  • Constraints: each prediction i can be matched to at most one ground truth (∑_j x[i,j] ≤ 1), and each ground truth j can be matched to at most one prediction (∑_i x[i,j] ≤ 1). If we require every prediction or every truth to be matched, we use equality constraints; but in our case we allow unmatched, so “≤ 1” is appropriate.
  • Additionally, we can allow matches to a dummy to represent unmatched (or equivalently, not force the sums to equal 1).

This ILP is a classic bipartite matching formulation. We can solve it with a MILP solver or even a specialized algorithm (Hungarian is essentially an optimized solver for this, with the constraints as equalities for a perfect matching). If we drop the integrality requirement (allow x[i,j] to be fractional between 0 and 1), the problem becomes a linear program whose constraint matrix is totally unimodular, meaning the optimal solution will still be integral for this assignment structure. However, allowing ≤ 1 instead of = 1 constraints means the trivial solution x = 0 (match nothing) is feasible with cost 0, which a solver would choose unless we add something to encourage matching. Typically, one would either enforce matching all ground truths (if evaluating recall, one might want to match all truths if possible) or add a dummy with a moderate cost, as we did.

Python Implementation: We’ll use PuLP (a Python LP library) to illustrate setting up the ILP, though any MILP solver or OR-Tools CP-SAT could be used.

from pulp import LpProblem, LpVariable, lpSum, LpMinimize, LpStatus

# ANSI color codes for colorful printing
GREEN  = "\033[92m"    # For predicted field names
BLUE   = "\033[94m"    # For ground truth field names
YELLOW = "\033[93m"    # For headings or emphasis
RESET  = "\033[0m"     # Reset to default

# Reuse the padded cost matrix (real GT columns plus dummy "<No Match>" columns)
# that we built in the Hungarian section.
n_pred, n_total_gt = cost_matrix_np.shape
n_gt = len(ground_truth)
cost = {(i, j): cost_matrix_np[i, j] for i in range(n_pred) for j in range(n_total_gt)}

# Create the ILP model as a minimization problem.
prob = LpProblem('Assignment', LpMinimize)

# Create binary variables for each prediction - ground_truth (including dummy) pair (i,j)
x = {(i, j): LpVariable(f"x_{i}_{j}", lowBound=0, upBound=1, cat='Binary')
     for i in range(n_pred) for j in range(n_total_gt)}

# Objective: minimize total matching cost.
prob += lpSum(cost[(i, j)] * x[(i, j)]
              for i in range(n_pred)
              for j in range(n_total_gt))

# Constraints: Force every prediction to be assigned exactly one option (real or dummy)
for i in range(n_pred):
    prob += lpSum(x[(i, j)] for j in range(n_total_gt)) == 1

# For real ground truth options, ensure at most one prediction is matched.
for j in range(n_gt):
    prob += lpSum(x[(i, j)] for i in range(n_pred)) <= 1

# (No constraint on dummy options, so they can absorb multiple predictions)

# Solve the ILP
prob.solve()

# Gather matched pairs based on binary variable values
matches = [(i, j) for (i, j) in x if x[(i, j)].value() == 1]

# Colorful output of results, similar to the Hungarian code
print(f"{YELLOW}ILP Optimal Assignment (Status: {LpStatus[prob.status]}):{RESET}")
for i, j in matches:
    pred_field = pred_fields[i]["field"]
    if j < n_gt:
        gt_field = ground_truth[j]["field"]
        print(f"Pred {i} ({GREEN}\"{pred_field}\"{RESET}) -> GT {j} ({BLUE}\"{gt_field}\"{RESET})")
    else:
        print(f"Pred {i} ({GREEN}\"{pred_field}\"{RESET}) -> GT (dummy) ({BLUE}\"<No Match>\"{RESET})")

This should give a matching similar to the earlier methods.

ILP Optimal Assignment (Status: Optimal):
Pred 0 ("Invoice Num") -> GT 0 ("Invoice Number")
Pred 1 ("Date") -> GT 1 ("Date")
Pred 2 ("Document Type") -> GT (dummy) ("<No Match>")        

However, had we not included a dummy or required every prediction to be matched, the ILP would find the zero assignment (match nothing) optimal whenever all costs are positive. In our example, all real match costs are ≥ 0.11, so the solver would indeed set x = 0 for all pairs at cost 0, which is not what we want. To fix this, we add dummy options or constraints to ensure we match what we can; otherwise a mis-specified formulation can also return a clearly wrong assignment, such as the one below.

ILP Optimal Assignment (Status: Optimal):
Pred 0 ("Invoice Num") -> GT 2 ("Total")
Pred 1 ("Date") -> GT 1 ("Date")
Pred 2 ("Document Type") -> GT 0 ("Invoice Number")        

The linear programming approach is very flexible:

  • We can add additional constraints easily. E.g., if we had a rule that at least X fields must be matched, or that certain pairs are not allowed (set x[i,j]=0 for some pairs), etc., ILP can handle that.
  • We can also modify the objective. For example, if we instead maximize total matched confidence or something, it’s easy to tweak.

However, solving an ILP with a generic solver might be slower than the Hungarian algorithm for large n, since ILP solvers use general branch-and-bound (though the assignment LP is quite easy for them). In practice, one might use a specialized algorithm (Hungarian, min-cost flow algorithms) for speed, but ILP is a good sanity check and useful if extending the problem beyond a basic assignment.

In summary, the ILP yields the same optimal matches (if set up properly) as the Hungarian algorithm, since it’s essentially solving the same optimization problem by a different means. For brevity, we won’t dive more into the solver output here.


Algorithm 5: Approximate Nearest Neighbor (Vector Matching)

The last approach we'll consider is less about solving the assignment problem optimally and more about quickly finding closest matches in a high-dimensional space. If our fields can be embedded as vectors (for example, using a language model to embed the text of each field, or using the spatial coordinates of the field on the document), we could use an Approximate Nearest Neighbor (ANN) search algorithm to find potential matches.

ANN algorithms (like KD-trees, locality-sensitive hashing (LSH), or modern vector indexes) allow us to query for the nearest neighbor of a point without exhaustively scanning all possibilities. This is extremely useful when you have thousands or millions of items to match (not typical for fields in one document, but could be if matching many documents or large tables).

Use case: Suppose each predicted field and each ground truth field has a feature vector combining various attributes (text embedding, position, etc.). We could pre-index the ground truth vectors in a KD-tree and for each predicted vector, query the nearest neighbor among ground truths. This gives a likely match in roughly O(log N) time per query instead of O(N). However, to ensure a one-to-one matching, we’d still need a second step to handle conflicts (two predictions pointing to the same ground truth). A common strategy could be:

  1. Query each prediction’s nearest ground truth.
  2. Sort predictions by how close their nearest match is.
  3. Assign them in that order, and if a ground truth is already taken, move to the next nearest for that prediction.

This becomes a greedy strategy seeded by ANN results. It may not yield the optimal assignment, but if the embedding space and distance measure are good, it could be a reasonable approximation.
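
Here is a rough sketch of that strategy, assuming each side has already been embedded into a NumPy feature matrix; the pred_vectors and gt_vectors names, the k candidate count, and the max_distance cutoff are all placeholders rather than part of the original code:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def ann_greedy_match(pred_vectors, gt_vectors, k=3, max_distance=None):
    """Sketch: greedily match each prediction to its nearest still-free ground truth."""
    k = min(k, len(gt_vectors))
    nn = NearestNeighbors(n_neighbors=k).fit(gt_vectors)
    distances, indices = nn.kneighbors(pred_vectors)   # each row holds k candidates per prediction

    matches = {}    # prediction index -> (ground truth index, distance)
    taken = set()   # ground truths already claimed
    # Handle the predictions whose best candidate is closest first.
    for i in np.argsort(distances[:, 0]):
        for dist, j in zip(distances[i], indices[i]):
            if max_distance is not None and dist > max_distance:
                break   # candidates are sorted by distance, so the rest are even farther
            if int(j) not in taken:
                matches[int(i)] = (int(j), float(dist))
                taken.add(int(j))
                break
        # If no candidate was free (or all were too far), prediction i stays unmatched.
    return matches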

Example (conceptual): In our case, we might embed each field’s text using a simple vector (like [similarity_to_invoice_number, similarity_to_date, similarity_to_total] or some encoding). But that’s overkill for a small example. Instead, let's demonstrate using a KD-tree on the positions in the cost matrix (just for illustration):

from sklearn.neighbors import NearestNeighbors
import numpy as np

# Let's pretend each ground truth has a 2D vector (for example, x,y position on page).
# This is a made up example as an output of Document Intelligence. 
# For our example, we'll just make up some coordinates for ground truth fields:
gt_positions = np.array([
    [100, 100],  # Invoice Number position
    [200, 100],  # Date position
    [100, 200]   # Total position
])
# And some positions for predictions:
pred_positions = np.array([
    [105, 105],  # Invoice Num (close to Invoice Number)
    [205, 100],  # Date (close to Date)
    [300, 300]   # Document Type (somewhere else, not near any ground truth)
])

# Use a KD-tree (via NearestNeighbors) to find nearest GT for each pred
nn = NearestNeighbors(n_neighbors=1, metric='euclidean').fit(gt_positions)
distances, indices = nn.kneighbors(pred_positions)
print("Nearest neighbor GT for each prediction:", indices.flatten())
print("Distances:", distances.flatten())
        

We get the following output:

Nearest neighbor GT for each prediction: [0 1 2]  
Distances: [7.07 5.00 212.13]        

This indicates:

  • Pred0’s nearest GT is 0 (Invoice Number) at distance ~7 (close by),
  • Pred1’s nearest GT is 1 (Date) at distance 5,
  • Pred2’s nearest GT is 2 (Total) at distance 212 (really far, likely indicating it doesn’t correspond to any field location).

If we assign according to this, we’d match Pred0→GT0, Pred1→GT1, Pred2→GT2. In this dummy scenario, that matched everything, but we see Pred2 was actually very far from GT2. We could decide that 212 is beyond a threshold and treat Pred2 as unmatched.
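
A tiny follow-up sketch of that thresholding step, reusing the distances and indices arrays from the query above; the 150-unit cutoff is an arbitrary illustration:

# Treat a nearest neighbor as "no match" if it lies beyond a chosen cutoff.
MAX_DISTANCE = 150   # arbitrary illustrative cutoff

for i, (dist, j) in enumerate(zip(distances.flatten(), indices.flatten())):
    if dist <= MAX_DISTANCE:
        print(f"Pred {i} -> GT {j} (distance {dist:.1f})")
    else:
        print(f"Pred {i} -> unmatched (nearest GT {j} is {dist:.1f} away)")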

This approach is essentially using spatial (or feature) proximity as the matching criterion directly. It’s a good approach when the matching criterion is purely distance in a feature space and one-to-one matching is less strict or can be enforced by other means. For a large-scale system (say matching hundreds of thousands of entries), one might use ANN on text embeddings to shortlist a few candidate matches for each item, and then run a more exact algorithm like Hungarian on that reduced set.

Caveats: Approximate NN search may return a close but incorrect match if the feature representation is not perfect. It also doesn’t inherently handle the one-to-one constraint – you have to enforce it after the fact. It shines when speed is crucial and a good-enough match is acceptable. In our context of evaluating document extraction, we usually prefer an exact optimal matching (we’re not dealing with millions of fields in one document). But if you were integrating information across databases, vector similarity search could be your friend.

To sum up this section: ANN methods (KD-trees, LSH, etc.) partition the vector space to dramatically speed up neighbor searches at the cost of occasional approximation. They are an advanced tool in the matching arsenal, especially for high-dimensional representations from deep models (foundation models that embed text or layout information).


Putting it all together

To play around with these approaches, I created a synthetic data generator. In this function, I generate the data by first creating a set of ground truth fields, where each field has a unique label and corresponding value (for example, "Field_0" with "Value_0", etc.). Predicted fields are then generated by sometimes correctly copying these ground truth values—with a small chance of minor modifications (like appending a random lowercase letter)—and by introducing extra or missing fields. Each predicted field carries a confidence score representing its reliability. This setup simulates real-world scenarios where data extraction might suffer from noise, inaccuracies, or missing information. The confidence threshold is later used to filter out less-reliable predictions, ensuring that only predictions above a certain confidence level are considered by the matching algorithms.

Results



The charts provide a visual evaluation of this matching process.

[Figure: Cost matrix heatmap]

The cost matrix heatmap displays the computed matching cost between every predicted field and every ground truth field, where darker regions indicate a lower cost—meaning a higher similarity and thus a more likely correct match.

[Figure: True positives, false positives, and false negatives per algorithm]

The bar chart compares the number of true positives (TPs), false positives (FPs), and false negatives (FNs) for each matching algorithm, allowing you to quickly see which method most accurately recovers the ground truth.


[Figure: Confidence distribution plot]

The confidence distribution plot shows how the confidence scores of the filtered predictions are spread, which can reveal the overall quality of the predictions. Looking at the results above, for example, the Hungarian algorithm achieved 9 TPs with only 1 FP and 1 FN, translating to high precision, recall, and F1 scores of 90%, all in a very short execution time of 0.11 ms. In contrast, the Linear Programming and Approximate NN methods recorded zero correct matches in this instance, clearly indicating that under the current data conditions, these approaches may not be as effective.


Production Guidelines: Using Confidence Scores and Thresholds Effectively

We touched on confidence scores earlier, but let’s consolidate how to use them in matching:

  • As part of cost: Incorporating (1 - confidence) into the cost function, as we did, directly uses the model’s probability estimates. A prediction with 99% confidence will have a 0.01 penalty, whereas one with 50% confidence has a 0.5 penalty – this often makes the algorithm prefer to match high-confidence predictions and be wary of low-confidence ones. This is valuable because the model itself signals uncertainty; combining that with similarity measures yields a more robust cost. In the DETR example mentioned, using the negative log-likelihood of the class (which is like a confidence) as part of cost is analogous.
  • Threshold filtering: Before matching, filter out predictions below a confidence cutoff. This is commonly done in production systems. For instance, if a field prediction is below 60% confidence, you might exclude it from automatic processing and send it for manual review. By filtering, you reduce the chance of bad matches (garbage in, garbage out). In our example, if “Document Type” had only 40% confidence, removing it upfront means the algorithms only deal with the two solid predictions. Those get matched to their truths, and the missing “Total” is clearly unmatched (which might trigger an error handling or human review).
  • Post-matching threshold: Another angle is after obtaining a match, consider the confidence of that match. If the match cost is still high (or the similarity is low), you might flag that even if the algorithm matched something. For example, the date had 80% confidence and a slight mismatch – we matched it, but perhaps we’d flag it for verification due to the date discrepancy. This could be done by checking if the cost of a matched pair exceeds some limit.
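
A small sketch of that post-matching check, reusing assigned_pairs, cost_matrix_np, gt_labels, and pred_fields from the Hungarian section; REVIEW_COST_LIMIT is a hypothetical cutoff:

# Flag matched pairs whose assignment cost is still suspiciously high.
REVIEW_COST_LIMIT = 0.2   # hypothetical cutoff; tune on validation data

for pred_index, gt_index in assigned_pairs:
    if gt_labels[gt_index] == "<No Match>":
        continue  # unmatched predictions are handled separately
    pair_cost = cost_matrix_np[pred_index, gt_index]
    if pair_cost > REVIEW_COST_LIMIT:
        print(f"Review: {pred_fields[pred_index]['field']} -> {gt_labels[gt_index]} "
              f"(cost {pair_cost:.2f} exceeds {REVIEW_COST_LIMIT})")

In the running example, this would flag the Date pair (cost 0.25), matching the intuition above.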

Below are some of the other optimizations, edge cases, and improvements that you can consider in your implementations.

  • Edge Cases: If either the extracted or true list is empty, our logic should handle it (the cost matrix will be empty in one dimension, and we interpret all outputs as missed or spurious accordingly). We should also handle cases where schema differences occur (e.g., a key present in truth not present in prediction or vice versa, which our code accounts for by treating missing keys as mismatches). If multiple items are identical or very similar, the cost matrix might have equal costs causing ties – the Hungarian algorithm will still find an optimal assignment, but it’s arbitrary which identical item pairs with which. This typically doesn’t affect overall accuracy counts but is something to note when analyzing results.
  • Conversion and Tolerance: We implemented basic conversion (strings to floats, case normalization) in comparing primitives. In practice, you might need more robust normalization (date parsing, ignoring punctuation, handling synonyms). For numeric fields, consider a tolerance (e.g., if a price is off by a few cents, you might or might not count it as correct depending on requirements). The cost function can also be made continuous (e.g., cost = absolute difference in price) if we want to allow small deviations to be penalized less.
  • Weighted Costs: Not all fields are equally important. In some cases, you might weight certain fields higher in the cost. For example, to ensure items are matched primarily by a key field (like a product code or name), you could use a low cost for matching identical names and a high cost for mismatched names, so the algorithm prioritizes pairing items with the same name (a small sketch follows this list).
  • Partial Credit vs Exact Match: Our accuracy metrics counted a field as correct only if it exactly matched after conversion. Depending on needs, you could assign partial credit. For instance, if an address has multiple components and the AI got some right and some wrong, you might score that partially. The recursive approach we used naturally handles partial correctness by counting at the field level. If you wanted an item-level “exact match” metric, we showed how to count fully correct items separately.
  • Automation and Libraries: We manually implemented the comparison logic for clarity. In real projects, libraries like DeepDiff can compute differences between nested structures and even provide a similarity score. In fact, one approach is to use a distance metric (like DeepDiff’s “deep distance”) to fill the cost matrix, then apply the Hungarian algorithm.
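
To make the weighting and tolerance ideas concrete, here is a minimal sketch of a cost function along those lines; the 0.7/0.3 weights and the 0.01 numeric tolerance are illustrative assumptions, not values used elsewhere in this post:

import difflib

def field_cost(pred, gt, name_weight=0.7, value_weight=0.3, numeric_tolerance=0.01):
    # Weight field-name similarity more heavily than value similarity.
    name_sim = difflib.SequenceMatcher(None, pred["field"].lower(), gt["field"].lower()).ratio()
    try:
        # Numeric values: compare with a tolerance (after stripping currency symbols/commas).
        pred_val = float(pred["text"].replace("$", "").replace(",", ""))
        gt_val = float(gt["text"].replace("$", "").replace(",", ""))
        value_sim = 1.0 if abs(pred_val - gt_val) <= numeric_tolerance else 0.0
    except ValueError:
        # Non-numeric values: fall back to string similarity.
        value_sim = difflib.SequenceMatcher(None, pred["text"], gt["text"]).ratio()
    sim = name_weight * name_sim + value_weight * value_sim
    return (1 - sim) + (1 - pred.get("confidence", 1.0))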


Production Guidelines: Trade-offs and Choosing an Approach

Each algorithm has situations where it shines:

  • If you have a typical evaluation size and need absolute confidence in the evaluation accuracy, the Hungarian algorithm is the go-to solution, as it will maximize true matches and is reasonably fast for most evaluation tasks (often completing in fractions of a second for hundreds of items). It guarantees the maximum number of correct matches (or minimum cost). For example, when evaluating an extraction model on a small batch of documents where precision of evaluation matters, Hungarian gives the fairest scoring by considering the best global alignment between prediction and ground truth. It’s the best choice when in doubt, ensuring the evaluation isn’t skewed by how matching is done.
  • If the dataset is very large or you need results very quickly, and you can tolerate the possibility that the matching isn’t perfect, a Greedy matching or ANN-based method might be preferable. For example, in a streaming scenario or extremely large-scale experiment, an ANN method can give a result in near-real-time where an exact method might be too slow. The cost is a potential tiny drop in evaluation accuracy (e.g., missing out on a match or two that an optimal algorithm would have found). In real-time systems or preliminary analyses, a greedy match (perhaps combined with confidence thresholding) might be acceptable. Be cautious: if matching quality is critical, greedy might miss some correct pairings that an optimal method would catch.
  • Use Approximate Nearest Neighbor (ANN) based matching when dealing with very large scale matching (hundreds of thousands of items or more) or high-dimensional embeddings representing the items, where exact matching would be too slow. ANN is common in domains like image or text retrieval (RAG scenarios). In document data extraction, if each field is embedded into a vector space (for semantic matching) or if you need to quickly match detected text snippets to reference text, ANN can dramatically speed up the candidate finding step. Keep in mind the slight accuracy hit – you might miss a correct match occasionally due to approximation. In practice, a two-stage approach is often used: ANN to propose likely matches, followed by a verification step to ensure correctness.
  • Use Stable Marriage only if your matching scenario genuinely involves two-sided preferences (which is uncommon in straightforward AI vs ground truth matching). Its main advantage (stability) doesn’t typically translate to better precision/recall – it’s more about fairness. In virtually all cases of prediction matching, you’d prioritize accuracy over the kind of fairness stable matching provides, so this is more of a theoretical inclusion. One real-world scenario for stable matching in data processing could be entity resolution where two data sources have to agree on pairings without one side being dominant.
  • ILP methods shine when the matching problem deviates from the standard assumptions. For instance, maybe each ground truth can be matched by at most two predictions (say, an ensemble of models voting on a ground truth) – you can model that with an ILP. Or perhaps you want to maximize a custom score that isn’t just a sum of individual match scores; ILP gives that freedom. For example, if you wanted to match fields but also limit that certain fields can only be matched if a total confidence sum exceeds a threshold (a constraint that’s not part of a basic one-to-one match), you might model this in an ILP. The trade-off is computational: for large problems, ILP might be significantly slower. In practice, if you stick to a linear assignment formulation, specialized solvers (like Hungarian or min-cost flow algorithms) will outperform a general ILP solver.


Conclusion

We developed a complete solution for matching AI-extracted document fields with ground truth, covering everything from constructing a meaningful cost matrix (taking into account text similarity and confidence scores) to exploring multiple matching algorithms. The Hungarian algorithm provided an optimal benchmark, while other methods like greedy and stable matching gave insight into different strategies (and their pitfalls). Incorporating confidence scores from Azure’s Document Intelligence output was key to improving match quality – high-confidence predictions are favored, and low-confidence ones can be filtered or penalized, aligning with best practices of using confidence to decide automated vs human-involved processing.

This solution can be production-ready very easily: the code is modular (we could easily wrap the matching logic into a function), and we’ve considered edge cases (unequal numbers of items, completely missing predictions, etc.). In a real deployment, one would further refine the cost function (perhaps using domain-specific similarity measures, or even ML models to predict match vs no-match) and set appropriate thresholds. Logging or visualizing the cost matrix and matches (e.g., via heatmaps or summary reports of matched/unmatched) can help continuously monitor the model’s extraction quality.

By implementing these algorithms and understanding their outputs, we can build a robust evaluation pipeline for document AI systems or even use these techniques to merge AI-extracted data with existing databases or CRM solutions. The combination of foundational algorithmic solutions and AI confidence scores ensures that we match data accurately and intelligently, minimizing errors and highlighting uncertainties for review.


If you made it this far, please consider subscribing to my newsletter and sharing or reposting this to your community.
