Patent Search Explained: From Keywords to Semantic AI
Digital search works so well that it often feels magical—how else could systems sift through vast amounts of data in milliseconds and even seem to “understand” what you’re looking for? But there’s no magic involved—just some preparation behind the scenes.
How Search Actually Works
When you search for something, the software doesn’t start rummaging through every piece of information it has to find a match. That would be far too slow, even for today’s powerful computers. Instead, the system prepares the data to be searched efficiently by assigning values to pieces of information. These values act like shortcuts, making it quick and easy to find what you’re looking for.
The Basic Idea: The Index
Think of a phone book. Phone numbers are assigned to names, and the names are ordered alphabetically. If you want to find someone’s number, you don’t flip through every page—you jump straight to the section with their name.
This is exactly how search engines work with a concept called an index. For example, if a document has a unique ID like a patent number, that ID is added to a list (the index), which links it to the document. This allows the system to locate documents quickly without combing through all the data.
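The phone-book analogy maps directly onto a hash-map lookup. Here is a minimal sketch in Python, with invented patent numbers and titles purely for illustration:

```python
# A toy index: patent numbers (keys) mapped to documents (values).
# The patent numbers and titles below are invented for illustration.
index = {
    "EP1234567": "Method for charging an electric vehicle battery",
    "US7654321": "Sensor arrangement for a self-driving car",
}

# Looking up a known ID is a single hash lookup -- no scanning required.
doc = index.get("US7654321")
print(doc)  # Sensor arrangement for a self-driving car
```

However large the collection grows, retrieving a document by its ID stays a constant-time operation, because the index does the work up front.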
Applying Indexing to Text
Text works the same way, but it needs a bit more preparation to be searchable. Let’s say you have a document that mentions “self-driving cars”:
- Breaking Text into Pieces:
The document is split into smaller parts, like individual words or word stems. For example:
- “driving” might be reduced to “driv”
- “Cars” might just be “car”
- Adding Entries to the Index:
These processed words (or stems) are added to the index. This means that if you search for “car” (not “cars”) or “drive” (not “driving”), the document will still match.
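The two steps above, splitting text and adding entries, can be sketched as a tiny inverted index. The crude suffix-stripping "stemmer" below is only a stand-in for a real stemming algorithm such as Porter's:

```python
def stem(word: str) -> str:
    """Very crude stemmer: lowercase and strip a few common suffixes.
    A stand-in for a real stemmer such as Porter's algorithm."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each stem to the set of document IDs that contain it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for token in text.replace("-", " ").split():
            index.setdefault(stem(token), set()).add(doc_id)
    return index

docs = {"D1": "self-driving cars"}
index = build_index(docs)

print(index["driv"])  # {'D1'} -- "driving" was reduced to "driv"
print(index["car"])   # {'D1'} -- "cars" was reduced to "car"
```

A query goes through the same stemming step, so "drive", "driving", and "driven-style" variants can all land on the same index entry.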
This process is fast and effective, but there’s a limitation: it doesn’t understand meaning. A search for “battery-driven car” would match the “self-driving cars” document, because the stems “driv” and “car” appear in both, but “autonomous vehicle” wouldn’t—even though they mean almost the same thing.
Enter AI: Semantic Understanding with Embeddings
To go beyond simple keyword matching, modern systems use embeddings—a way of assigning values to text that capture its meaning.
Here’s how it works:
- Turning Text into Numbers:
Instead of just breaking text into words, an AI model transforms the entire document into a list of numbers, called a vector. This vector represents the meaning of the document in a multi-dimensional space.
- Finding Similar Documents:
When you search for something like “autonomous vehicle,” the search term is also converted into a vector. The system then looks for documents with similar vectors, which represent similar meanings.
This is done using an algorithm called K-Nearest Neighbors (KNN), which finds the vectors closest to your search term in that multi-dimensional space.
- Results That “Understand”:
With this method, a search for “autonomous vehicle” could return documents about “self-driving cars” because the system recognizes their semantic similarity—even though the words themselves are different.
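The similarity lookup described above can be sketched with cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors below are invented for illustration; real models produce vectors with hundreds of dimensions, computed from the text itself:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" (values invented purely for illustration).
doc_vectors = {
    "self-driving cars": [0.9, 0.8, 0.1],
    "battery chemistry": [0.1, 0.2, 0.9],
}
query_vector = [0.8, 0.9, 0.2]  # pretend embedding of "autonomous vehicle"

# Nearest neighbour with k=1: rank documents by similarity to the query.
best = max(doc_vectors,
           key=lambda d: cosine_similarity(query_vector, doc_vectors[d]))
print(best)  # self-driving cars
```

Even though "autonomous vehicle" and "self-driving cars" share no keywords, their vectors point in nearly the same direction, so the KNN step surfaces the right document.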
The Core Idea
Whether you’re using traditional indexing or advanced AI techniques like embeddings, it all boils down to one principle:
Assign values to pieces of information so they can be looked up quickly.
- With indexes, those values are based on keywords or stems.
- With embeddings, they’re multi-dimensional vectors that capture deeper meaning.
By combining these approaches—or choosing the one that best fits your needs—you can build search systems that are not only fast but also smart enough to understand what users are really looking for.