Metasearch: Search and RAG multiple datasets without data governance chaos

Metasearch systems take your query, send it to multiple search engines, and then show you the combined results. 

Most successful travel websites are metasearch engines. Travel sites like Kayak don’t have a single database of all the world’s flights, hotel rooms and available rental cars. Instead, they take your request, send it to dozens of other search engines, and then show you the ranked results. 

This same metasearch trick makes sense in the world of enterprise data management. Most organizations have roughly a dozen different data sources that typically need to be searched to produce an answer to a difficult question—especially one of those questions from upper management that start “Have we ever worked on…” or “Is there a single person in our organization who has ever…”

For example, when I worked at the US Department of Homeland Security, I frequently needed to search with Google (for the public internet), Google (for the DHS website), ServiceNow (for our knowledge base), GitHub (for a few open source projects), Jira (because other stuff was in Jira), Microsoft OneDrive (because I might save a document as a PDF), Outlook (to search my mail), and, of course, SharePoint. There were other data sources to search as well. Then, there were all of the data sources that I would have wanted to search if only I had known about them.

This is where Swirl comes in and why I am excited to have recently joined Swirl’s advisory board.

Swirl is a metasearch system that allows organizations like DHS* to create and deploy their own metasearch engines. You configure Swirl with a group of data sources. When users type their query into Swirl, it sends the query to each data source, ranks the combined results, and displays them. This ranking is done with state-of-the-art natural language processing.
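To make the fan-out-and-rank idea concrete, here is a minimal Python sketch. It is not Swirl's code or API: the `Result` class, the connector functions, and the `overlap_score` ranking are all hypothetical, and the term-overlap score is only a toy stand-in for real natural language processing.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    source: str  # which data source returned this hit
    title: str
    url: str


# Hypothetical connector type: takes a query string, returns that source's hits.
Connector = Callable[[str], List[Result]]


def overlap_score(query: str, result: Result) -> float:
    """Fraction of query terms found in the title (toy stand-in for NLP ranking)."""
    terms = set(query.lower().split())
    words = set(result.title.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def metasearch(query: str, connectors: Dict[str, Connector]) -> List[Result]:
    """Send the query to every connector in parallel, then merge and re-rank."""
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda search: search(query), connectors.values()))
    merged = [hit for batch in batches for hit in batch]
    # sorted() is stable, so tied results keep the order in which they arrived.
    return sorted(merged, key=lambda r: overlap_score(query, r), reverse=True)


# Two fake in-memory sources so the sketch runs without any network access.
def fake_jira(query: str) -> List[Result]:
    return [Result("jira", "Ticket: printer outage", "https://jira.example/1")]


def fake_sharepoint(query: str) -> List[Result]:
    return [Result("sharepoint", "Quantum networking pilot report", "https://sp.example/2")]


ranked = metasearch(
    "quantum networking pilot",
    {"jira": fake_jira, "sharepoint": fake_sharepoint},
)
```

In this sketch the SharePoint hit ranks first because its title contains every query term. A production system would also need timeouts and error handling for slow or unavailable sources.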

In addition to ranking, Swirl can send the search results to a large language model (LLM) like OpenAI’s ChatGPT, using a process known as Retrieval Augmented Generation (RAG). The idea here is to use LLMs to summarize the results that Swirl got from those searches. RAG combines the results from many sources into a single synthetic answer that’s faster for a human to read and digest. RAG also preserves pointers to the original results. This is how search engines like Bing can now provide “references” for AI-generated answers.
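One way to picture how RAG keeps pointers to the original results: number each result in the prompt and keep a map from number back to URL. The sketch below assumes a hypothetical result shape (`title`, `snippet`, `url`) and prompt wording; it only builds the prompt and makes no call to any LLM.

```python
from typing import Dict, List, Tuple


def build_rag_prompt(
    question: str, results: List[Dict[str, str]]
) -> Tuple[str, Dict[int, str]]:
    """Number each result so the model can cite it, and keep a number -> URL map."""
    references: Dict[int, str] = {}
    context_lines = []
    for number, result in enumerate(results, start=1):
        references[number] = result["url"]
        context_lines.append(f"[{number}] {result['title']}: {result['snippet']}")
    prompt = (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number, like [1].\n\n"
        + "\n".join(context_lines)
        + f"\n\nQuestion: {question}"
    )
    return prompt, references


# Made-up search results standing in for what a metasearch step returned.
results = [
    {"title": "Pilot report", "snippet": "The 2021 pilot covered three sites.",
     "url": "https://sp.example/report"},
    {"title": "Ticket 42", "snippet": "Follow-up work was tracked here.",
     "url": "https://jira.example/42"},
]
prompt, references = build_rag_prompt("Have we ever run a pilot?", results)
```

When the model's answer cites "[2]", the `references` map resolves that number back to the original document, which is what lets the reader check the summary against its sources.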

Swirl executes each search with the user’s own credentials rather than using special credentials set up for the Swirl application. This means that my M365 searches differ from yours: each search shows only the data to which we are each authorized. These access rules are enforced by the search providers themselves, rather than being replicated within Swirl.
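A toy Python illustration of that division of labor follows. The provider, its documents, and the tokens are all invented for this sketch. The point is only that the metasearch layer forwards the caller's credential and applies no access rules of its own.

```python
from typing import List

# Fake provider-side data: each document lists the users allowed to read it.
# In a real deployment this check lives inside the provider, not the metasearch layer.
DOCUMENTS = [
    {"title": "All-hands notes", "readers": {"alice", "bob"}},
    {"title": "Budget draft", "readers": {"alice"}},
]
TOKENS = {"token-alice": "alice", "token-bob": "bob"}  # hypothetical token -> user map


def provider_search(query: str, user_token: str) -> List[str]:
    """Simulated provider: resolves the caller's token and filters by its own ACLs."""
    user = TOKENS.get(user_token)
    return [
        doc["title"]
        for doc in DOCUMENTS
        if user in doc["readers"] and query.lower() in doc["title"].lower()
    ]


def metasearch_as_user(query: str, user_token: str) -> List[str]:
    """The metasearch layer only forwards the user's token; it adds no access rules."""
    return provider_search(query, user_token)


alice_hits = metasearch_as_user("draft", "token-alice")
bob_hits = metasearch_as_user("draft", "token-bob")
```

Here the same query returns the budget draft for Alice and nothing for Bob, without the metasearch layer holding any copy of the permission data.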

Of course, as the number of search providers increases, it would be poor form to send every search query to every search provider, since that could cause a lot of needless searches. Here’s another application for traditional natural language processing: evaluating the user’s query to determine which set of search engines it should be sent to.
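As a rough sketch of query routing, the simplest possible version matches query terms against a keyword table per provider. The table, the provider names, and the fallback below are assumptions made for illustration; a real router would likely use a trained classifier or embeddings rather than keyword matching.

```python
from typing import Dict, List, Set

# Hypothetical routing table: terms that suggest a provider is worth querying.
ROUTES: Dict[str, Set[str]] = {
    "jira": {"ticket", "bug", "sprint"},
    "github": {"repo", "commit", "pull"},
    "outlook": {"email", "meeting", "invite"},
}
DEFAULT_PROVIDERS = ["sharepoint"]  # assumed fallback when nothing matches


def select_providers(query: str) -> List[str]:
    """Return only the providers whose keywords overlap with the query terms."""
    terms = set(query.lower().split())
    chosen = [name for name, keywords in ROUTES.items() if terms & keywords]
    return chosen or DEFAULT_PROVIDERS


bug_targets = select_providers("open bug in last sprint")
fallback_targets = select_providers("quarterly report")
```

The first query goes only to Jira, and the second falls back to the default provider, so neither query fans out to every source.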

Swirl is open source software, which means that engineers within large organizations can just try it out without engaging with their purchasing departments. In fact, the whole thing runs in a Docker container, so you can run it on a laptop without even setting up IT infrastructure.

Swirl has both an open-source offering and a commercial offering that includes additional connectors and support. It works today, and because it’s open source, it’s only going to get better.


*Although I use DHS as an example here, Swirl can also be used by much smaller organizations. Whenever I write DHS above, I’m just using DHS as an example of a prototypical big organization with big organization IT and data governance issues. The views expressed in this article are mine alone and do not represent those of DHS or the US Government.

By Simson Garfinkel