The industry-wide impulse to rebrand as an AI company, often by means of the slap-a-chatbot-on-it approach, can make products labeled “AI” seem like little more than vaporware. When it comes to search, however, there is substance behind the hype. AI’s promise has long rested on its ability to quickly and accurately make sense of vast amounts of data, a feat for which traditional search methods, dependent on keyword matching and basic algorithms, have often fallen short. During AI’s Great Flowering it should come as little surprise that vendors in the search space are all rebranding as AI companies, but because AI is poised to revolutionize the delivery of relevant results, these claims have the potential to be refreshingly far from superficial.
Companies intent on alleviating the pain of subpar data querying have long looked to improve search using techniques now lumped by marketers under the heading of AI, such as machine learning and natural language processing (NLP). Yet despite this technology’s longevity within the querying domain, only recently has AI come to dominate these companies’ marketing pitches. For instance, although Sinequa was founded in 2002 as an enterprise search solution (“’Le Monde’ s’archive dans une base XML”, trans. “Le Monde is archived in an XML database”), today Sinequa brands itself as “The most capable Search-Powered AI Assistant Platform,” making it an ideal solution for “Powering your GenAI Assistants.” Typesense differentiates its search service by its use of “cutting-edge search algorithms that take advantage of the latest advances in Hardware Capabilities & Machine Learning.” And perhaps most importantly, considering it holds an estimated <90% market share for search, Algolia’s headline promises to “Show users what they need with AI search that understands them.” For search-as-a-service platforms in particular, both startups and incumbents are marketing themselves as leaders in the AI revolution that is transforming how we find and interact with information.
What interests me about this shift are its dual implications, from both a marketing and a computer science perspective. Centering a company’s go-to-market strategy on AI can be both strategic and genuinely disruptive, particularly when it comes to the evergreen computational problem of data management. Search stands apart from the hype-driven trend of every company becoming an AI company because it focuses on profoundly AI-adjacent challenges such as data transformation, retrieval, and analysis. Search companies tout AI-powered features that include advanced query understanding, personalization, and prediction because, unlike many other submarkets in the software industry, search is uniquely positioned to actually leverage the native capabilities of AI and machine learning.
The Way We Search Now
Search has historically relied on keyword matching, which has significant limitations when it comes to understanding context and language nuance. Of course computers can’t really “understand” anything, and I use this term like the AI community uses “intelligence”: as an anthropomorphized shorthand to describe complex algorithmic and computational processes. And yet understanding is precisely the dream of many AI enthusiasts.
When it comes to data retrieval, the problem of computer understanding is tied to the distinction in data science between a “query,” which must match exactly, and “search,” which allows for imperfect matches but often leads to irrelevant or incomplete results. Database administrators deal with queries, but these hyper-specific prompts are notoriously finicky because they are rigidly rule-based and can lead to unintended and sometimes catastrophic actions. Conversely, search is hindered by keyword imprecision (synonyms, ambiguous language) and overall lack of context. One way of framing this difference is to say that in order to make requests by querying, humans use the computer’s syntax, whereas in search, the computer uses human syntax.
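To make the query/search distinction concrete, here is a minimal sketch in Python, with invented records and a crude word-overlap scorer standing in for a real ranking function: the query returns nothing unless the title matches exactly, while the search tolerates imperfect matches.

```python
# Minimal sketch: exact-match "query" vs. forgiving keyword "search".
# The records and scoring are invented for illustration only.
records = [
    {"id": 1, "title": "Intro to Vector Databases"},
    {"id": 2, "title": "Vector search for beginners"},
    {"id": 3, "title": "Relational database tuning"},
]

def query(items, title):
    """Query: the value must match exactly, as in a SQL WHERE clause."""
    return [r for r in items if r["title"] == title]

def search(items, text):
    """Search: rank items by how many words they share with the request."""
    words = set(text.lower().split())
    scored = [(len(words & set(r["title"].lower().split())), r) for r in items]
    return [r for overlap, r in sorted(scored, key=lambda s: -s[0]) if overlap]

print(query(records, "vector search"))   # [] -- nothing matches exactly
print(search(records, "vector search"))  # imperfect but useful matches
```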
Enter AI. AI-powered search promises to bypass the need for querying by making search work well through understanding intent, analyzing user behavior, and adapting to deliver more relevant results. The ways in which search is improving are legion. In case you haven’t been paying attention to the latest-and-greatest in AI-enhanced search techniques, I have collected and defined some significant key concepts below. If you’re well-versed in this domain feel free to skip this section.
Grouping in Search: Grouping in search refers to the technique of organizing search results into clusters based on certain criteria. This helps users to easily navigate through the results by categorizing similar items together, making it more convenient to find relevant information. Some typical applications of grouping in search include search engines that cluster search results by source, topic, or type of content (news, images, videos); e-commerce for grouping products by category, brand, price range, or customer ratings; and customer support for grouping help articles or FAQs by topic, issue type, or product line. Some techniques for grouping include using clustering algorithms like k-means, hierarchical clustering, or Density-based spatial clustering of applications with noise (DBSCAN) to group similar items based on their features; faceted search that allows users to apply multiple filters that act as groupings, such as by category, brand, or other attributes; and taxonomies and ontologies to utilize predefined hierarchies or relationships between data points to group them logically.
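As a rough illustration of grouping, the sketch below clusters a handful of made-up result titles using k-means over TF-IDF vectors with scikit-learn; the titles and the cluster count are assumptions chosen for demonstration rather than a production pipeline.

```python
# Sketch: group search results by clustering TF-IDF vectors with k-means.
# The result titles and number of clusters are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

results = [
    "Wireless noise-cancelling headphones",
    "Bluetooth over-ear headphones",
    "Stainless steel water bottle",
    "Insulated travel water bottle",
]

vectors = TfidfVectorizer().fit_transform(results)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for title, label in zip(results, labels):
    print(label, title)  # similar results land in the same group
```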
Metadata Filtering: Metadata filtering involves using metadata (data about data, such as author, date, tags, categories, file type, and size) to refine and narrow down search results. Metadata provides additional context and descriptive information about the primary data, which can be used to improve the relevance of search results. Typical use cases for metadata filtering include e-commerce applications that need to sort products by price, brand, category, ratings, and availability; digital libraries needing to filter by publication date, author, genre, and format; and Content Management Systems (CMSs) looking to filter documents by file type, creation date, author, and tags. Techniques such as metadata indexing, which stores metadata separately or alongside the primary data to facilitate quick access and filtering, and faceted navigation, which provides a user interface for applying multiple filters (facets), allow for more targeted search refinement.
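A minimal metadata filtering sketch might look like the following, where the document records and filter fields are invented for illustration; real systems index these fields rather than scanning a list.

```python
# Sketch: narrow a result set using metadata fields.
# The documents and filter criteria are invented for illustration.
docs = [
    {"title": "Q3 earnings deck", "file_type": "pdf",  "author": "finance", "year": 2024},
    {"title": "Onboarding guide", "file_type": "docx", "author": "hr",      "year": 2023},
    {"title": "Q3 earnings memo", "file_type": "pdf",  "author": "finance", "year": 2023},
]

def filter_by_metadata(items, **criteria):
    """Keep only documents whose metadata matches every supplied criterion."""
    return [d for d in items if all(d.get(k) == v for k, v in criteria.items())]

print(filter_by_metadata(docs, file_type="pdf", year=2024))
# [{'title': 'Q3 earnings deck', 'file_type': 'pdf', 'author': 'finance', 'year': 2024}]
```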
Multi-Vector Search: Multi-vector search is an information retrieval and machine learning technique that enhances the accuracy of complex queries by capturing the data’s various dimensions using multiple vectors to represent and search for relevant data. Approaches to multi-vector search include:
- Vector/Word Embedding: Creates multiple vectors representing different aspects of the data using embedding techniques such as Google’s Word2Vec, Bidirectional Encoder Representations from Transformers (BERT), and image embeddings.
- Similarity Measure: Calculates similarity scores using various distance metrics (cosine similarity, Euclidean distance) for each vector, and combines these scores to rank results.
- Fusion Method: Merges outputs from various vectors using techniques such as weighted averaging or concatenation.
- Retrieval-Augmented Generation (RAG): A method combining the retrieval of relevant information from large datasets with generative models to produce more accurate and contextually enriched outputs by using a specified set of documents to respond to queries. Because RAG extracts semantic meaning from those datasets, it can go beyond merely predicting plausible text to responding with information grounded in the specified sources.
Because multi-vector search can be extremely precise, taking into account various facets of the data, it has found applications in search engines, which incorporate user preferences and contextual information; recommendation systems, which leverage vectors to represent user behavior, preferences, and item features; and image retrieval, which combines vectors that capture, for instance, color, texture, shape, and other features of images to improve retrieval accuracy.
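To make the fusion step concrete, here is a minimal sketch of weighted-average score fusion across two vector representations; numpy is assumed, and the vectors and weights are invented stand-ins for real text and image embeddings.

```python
# Sketch: multi-vector scoring via weighted averaging of cosine similarities.
# The vectors and weights are invented stand-ins for learned embeddings.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two representations of the query and of a candidate item,
# e.g. a text embedding and an image embedding.
query_text, query_image = np.array([0.9, 0.1, 0.3]), np.array([0.2, 0.8])
item_text, item_image = np.array([0.8, 0.2, 0.4]), np.array([0.1, 0.9])

weights = {"text": 0.7, "image": 0.3}  # assumed relative importance
score = (weights["text"] * cosine(query_text, item_text)
         + weights["image"] * cosine(query_image, item_image))
print(round(score, 3))  # a single fused relevance score used for ranking
```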
Natural Language Processing (NLP): NLP is in many respects the bedrock of AI today because it enables machines to understand, interpret, and generate human language, facilitating tasks such as translation, sentiment analysis, and conversational interactions. It has improved search by employing processes such as tokenization, where text is broken down into individual words or phrases, and parsing, where the grammatical structure of the text is analyzed, enabling search engines to interpret the nuances of human language. NLP then applies semantic analysis to determine the meaning and intent behind the text, which allows for more accurate and context-aware processing. NLP can also refine and expand queries by predicting user intent and then suggesting more precise search terms through automated prompt refinement. By handling natural language queries that are more conversational and intuitive for end users, search vendors leveraging NLP are able to offer a better experience.
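As a toy illustration of the tokenization step, the sketch below splits a query into word tokens with a regular expression; production NLP stacks (trained tokenizers, parsers, transformer models) do far more, so treat this as a stand-in for the first stage of the pipeline.

```python
# Toy sketch of tokenization, the first stage of an NLP search pipeline.
# Real systems use trained tokenizers and parsers; this regex is a stand-in.
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("What's the cheapest noise-cancelling headphone under $200?"))
# ["what's", 'the', 'cheapest', 'noise', 'cancelling', 'headphone', 'under', '200']
```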
Semantic search: Semantic search focuses on understanding the meaning and intent behind the words in a query, retrieving results based on the context and relationships between concepts. Unlike lexical search, which relies on exact keyword matching and returns results containing the specific words used in the query regardless of their context, semantic search (a keystone of NLP) provides more relevant and accurate results by taking into consideration synonyms, related terms, and overall meaning, which lexical search might miss when the exact keywords are absent.
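The difference is easiest to see with a synonym. In the toy sketch below, a hand-built synonym table stands in for the learned embeddings a real semantic search engine would use; the lexical match misses the document that says “automobiles” when the query says “car.”

```python
# Toy sketch: lexical matching misses a synonym that semantic search catches.
# The synonym table is a hand-built stand-in for learned embeddings.
docs = ["Certified pre-owned automobiles for sale", "Car insurance quotes"]
synonyms = {"car": {"car", "automobile", "vehicle"}}

def lexical_match(term, doc):
    return term.lower() in doc.lower()

def semantic_match(term, doc):
    related = synonyms.get(term.lower(), {term.lower()})
    return any(word in doc.lower() for word in related)

print([lexical_match("car", d) for d in docs])   # [False, True]
print([semantic_match("car", d) for d in docs])  # [True, True]
```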
Sparse vs Dense Search: Sparse search refers to searching through data that has a large number of possible values or features, but only a small subset of them are relevant. In other words, most of the elements in the dataset have no significant value. Sparse search is common in text data, especially in documents represented by word vectors where only a few words are present in each document. Dense search, conversely, refers to searching through data where most of the elements have significant, non-zero values. In this context, the data is richly populated, and almost all features or values are relevant. Dense search is common in image data (where each pixel has a value), sensor data, and even some scientific studies of genetic data. Unlike sparse data, which can be handled by algorithms like tf-idf and Okapi BM25, dense data typically requires more memory and computational power to process. Therefore, sparse data uses storage formats that skip over zero values, while dense data is stored in regular matrices.
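The storage difference is easy to see in code. The sketch below, using numpy and scipy with an invented vocabulary size, stores the same tiny document as a sparse term-count vector and as a dense embedding-style vector.

```python
# Sketch: sparse vs. dense representations of the same document.
# The vocabulary size and embedding dimension are invented for illustration.
import numpy as np
from scipy.sparse import csr_matrix

vocab_size = 10_000
# Sparse: only the few terms present in the document carry non-zero counts.
rows, cols, counts = [0, 0, 0], [12, 847, 5033], [2, 1, 1]
sparse_doc = csr_matrix((counts, (rows, cols)), shape=(1, vocab_size))
print(sparse_doc.nnz, "non-zero entries out of", vocab_size)

# Dense: every dimension of the embedding-style vector holds a value.
dense_doc = np.random.default_rng(0).normal(size=384)
print(dense_doc.shape[0], "values, essentially all non-zero")
```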
Top-k: Top-k refers to the process of identifying and retrieving the k most relevant, significant, or highest-scoring items from a larger dataset based on a specific user-defined set of criteria. This is useful because retrieving and processing all possible matches to a search query can be computationally expensive and time-consuming, especially with large datasets. By focusing on the top-k results, systems can significantly reduce the computational load and deliver faster responses.
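A minimal top-k selection sketch, with invented relevance scores, using the heap utilities in Python’s standard library:

```python
# Sketch: pick the k highest-scoring results without sorting everything.
# The document IDs and scores are invented for illustration.
import heapq

scored_results = [("doc_a", 0.42), ("doc_b", 0.91), ("doc_c", 0.17),
                  ("doc_d", 0.88), ("doc_e", 0.55)]

k = 3
top_k = heapq.nlargest(k, scored_results, key=lambda pair: pair[1])
print(top_k)  # [('doc_b', 0.91), ('doc_d', 0.88), ('doc_e', 0.55)]
```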
Vector Search/Nearest Neighbor Search: Vector search, often referred to as nearest neighbor search, is a technique used to find the data points closest to a given query point, usually within a specified distance, in a high-dimensional space. It works by converting data points into high-dimensional vectors that capture semantic meaning and relationships. Search queries are converted into vectors, and the engine retrieves the most relevant data points based on their proximity to the query vector in the vector space. Mathematical distance metrics, such as cosine similarity or Euclidean distance, are used to determine similarity. This technique can be useful for recommendation systems in which the user needs to identify items within a certain similarity threshold of a base item; image retrieval for finding images that are similar to a query image; anomaly detection to flag data points that fall outside the range of normal behavior; and geospatial search for finding locations or points of interest within a certain distance from a given location. A modified (1+ε)-approximate nearest neighbor search approach can be useful when dealing with large datasets where exact nearest neighbor searches would be computationally expensive and inefficient.
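For intuition, here is a brute-force nearest neighbor sketch over a small in-memory matrix of random stand-in embeddings using cosine similarity; production systems rely on approximate indexes (HNSW, IVF, and the like) rather than scanning every vector.

```python
# Sketch: brute-force nearest neighbor search with cosine similarity.
# The embeddings are random stand-ins; real ones come from an embedding model.
import numpy as np

rng = np.random.default_rng(42)
corpus = rng.normal(size=(1_000, 64))   # 1,000 item vectors, 64 dimensions each
query = rng.normal(size=64)             # the query vector

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

scores = normalize(corpus) @ normalize(query)  # cosine similarity to every item
nearest = np.argsort(-scores)[:5]              # indices of the 5 closest items
print(nearest, scores[nearest])
```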
Evolution of NLP
AI-enhanced search, and NLP specifically, is not a new technology, but has come a long way in recent years. MIT’s Computer Science and Artificial Intelligence Laboratory introduced the START (SynTactic Analysis using Reversible Transformations) Natural Language Question Answering Machine project in 1993 in order to allow users to deploy everyday language for querying an online database of information. START was followed by the NLP search engine Ask Jeeves in 1996 (rebranded as Ask.com in 2006). As an old millennial I can attest to Ask Jeeves’s wow-factor, although, in this high schooler’s experience, it couldn’t hold a candle to Yahoo in the late-90s.
In the early 2000s we entered the Google search era, which not only improved search capabilities on the internet but also successfully monetized them. Search quality and revenue were never mutually exclusive, as Bill Gross noted when he purchased Oliver McBryan’s early search engine the World Wide Web Worm (1993) as the basis of the “pay per click” search engine Goto.com (1996), which became Overture. Overture’s model of auctioning search terms proved lucrative and enduring, but, in the end, Google’s superior inbound link aggregating system prevailed, even after Yahoo’s acquisition of Overture in 2003.
Much of Google’s success in the search space can be attributed to its technological superiority in NLP. Zoom ahead to the 2010s, and Google engineers had made tremendous improvements in NLP, particularly through the development of transformers. Google’s Word2Vec (2013), a suite of optimizations and architectures useful for learning word embeddings from large datasets, and BERT (2019), which uses transformers to “process words in relation to all the other words in a sentence, rather than one-by-one in order.” Unfortunately, by over-rotating on paid search by preferencing paid results above the best results, recently Google has, in Cory Doctorow’s words, “enshittified its search”—a situation that has driven folks like Tejas Kumar to prefer NLP chat-bot search products like Perplexity AI over traditional search engines.
I am so impressed with the team's work on @perplexity_ai. It has fully replaced both Google and OpenAI as my go-to product for accurate quality information.
Truly exceptional, inspired execution.
I wonder if I could/should get their CEO on the podcast.
— Tejas Kumar (@TejasKumar_) July 9, 2024
For over a decade, and arguably since the earliest days of the modern internet, the innovations that enabled the 2020s AI boom, such as transformers and NLP, have been the same ones that made full-context search possible. It is largely owing to the marketing hype surrounding AI right now that these technologies have moved beyond nerddom (academic papers, Hacker News) to enter the limelight of broader consumer awareness and interest (marketing website homepages, blog posts targeted at business readers). For example, in a recent blog post for the search software company Coveo addressing how “our in-house NLP team is focused on identifying and productizing the best approach for a given use case,” Kurt Cagle, editor in chief of The Cagle Report, explains personalization’s importance for search:
search also needs to be sensitive to the person asking the question. Some information may be available to the CEO that might not be available to a visitor to the company website. Thus, the context for such queries includes determining who should be told what, what is currently embargoed content, and which information cannot be passed on due to privacy regulations.
Of course, Coveo is not alone in asserting the importance of NLP for its AI-enhanced search. Dustin Coates, Principal Product Manager at Algolia, addresses how “Algolia’s search and discovery APIs leverage NLP,” while Tessa Roberts, formerly a content & communications manager at Bloomreach, articulates: “How Natural Language Processing Can Help Product Discovery.”
The Blast Radius
If AI search is a truly disruptive force, as I have argued, then it is a worthwhile exercise to consider who within the tech community is feeling the fallout. Below I have identified a few categories of developer tools and techniques that are deeply tied to search in order to sketch out how these domains are experiencing this shift’s most acute impact.
Business Intelligence (BI): The rise of augmented analytics for BI, which employs AI and machine learning to automate data preparation and insight generation, leverages advances in data search and interpretation to improve business report and suggestion creation. This has three major implications for the field:
- Velocity: The pace of data collection, cleaning, and analysis has accelerated significantly.
- Efficiency: Manual BI run by human analysts is ill-equipped to manage large and complex data sets. Automation permits more sophisticated analysis because computers can make sense of even the most unwieldy data sets.
- Skills Gap: Many BI tools now integrate NLP, meaning that instead of relying on trained analysts, anyone with business questions can run their own queries. They have also improved the user experience by making dashboards more intuitive. Moreover, by automating BI tasks, augmented analytics lowers the barrier to entry because analysts no longer rely on rigid query systems.
Database: The advantages of vector search are shuffling the deck for database incumbents and startups alike. It is worth stating and repeating that vectors were a thing well before AI effectively sucked all the air (or maybe VC funds) out of the room. That said, nearly every database vendor recognizes vector search as a paradigm shift compared with tabular data storage and retrieval approaches. Whether the market will prefer dedicated vector databases (Chroma, Marqo, Qdrant, Milvus, LanceDB, Vespa, Weaviate, Pinecone, Zilliz) or general purpose ones that support vector search (Fauna, OpenSearch, ClickHouse, Postgres, Cassandra, Elasticsearch, Redis, Rockset, SingleStore) remains to be seen. What is more, the boundaries of this upheaval in the product space have yet to be determined. Vendors are testing out which services and features will appeal most to consumers, while some tools combine capabilities, such as Qdrant, which is both a vector database and a vector search engine.
Data Lake: As the problem of querying unstructured data like images and video remains unsolved, AI has the potential to transform the data lake industry. Snowflake’s Twitter/X subheading touts its own AI chops by coining the hashtag #AIDataCloud and promising “to help leading organizations share data, build applications and power their business with AI.” As part of this campaign, Snowflake’s documentation includes a section on the subject of vector embeddings. More interestingly, Databricks’s recent acquisition of Tabular suggests a recognition that this is an area crying out for AI-enhanced improvement. Databricks has gone all-in on AI by branding itself “the world’s first data intelligence platform powered by generative AI” and promising to “Infuse AI into every facet of your business,” and although the press release frames Tabular’s acquisition around bringing data format compatibility to the lakehouse, interoperability also promises to advance Databricks’s vector search capabilities.
Observability: Search has been key to the evolution of the observability product space. Elasticsearch was designed as a search product but ended up as an observability tool; the same thing happened with Splunk. The leading lights of the observability movement have also acknowledged this domain as one for which search is key. Charity Majors, CTO at Honeycomb.io, ties the affinity between search and observability to first principles, citing the command-line basics of grep, in her post on the subject:
the difference between strings and structured data is ~basically the difference between grep and all of computer science. 😛
For these tactical and historical reasons, the observability industry’s marketing departments are poised to position their products as part of the AI-enhanced search revolution. Observability products capitalizing on the excitement abound, and include Dynatrace’s Davis Assistant, which is “powered by a more accurate and context-aware natural language processing (NLP) engine,” and Datadog’s Bits AI, intended to let users “investigate and resolve issues faster by using natural language to interact with all of your observability data.”
Wrapping-up
AI is revolutionizing search by using more context-aware solutions to overcome the limitations of traditional methods. Search powered by AI promises a future where finding information is not just about matching keywords but truly understanding queries to meet user needs. But beyond search’s aspirational and realized ability to generate accurate and useful results, search’s AI-enhanced capabilities have never been marketed more heavily than they are today. Vendors in the search space are not only leveraging AI techniques to achieve more accurate, relevant, and user-friendly experiences, they are also aware of the business case for highlighting this facet of their technology.
Disclaimer: Dynatrace, Elastic, Fauna, Google, Honeycomb, and Redis are all RedMonk clients.
Correction 8/20/2024: Fixed my translation of “Le Monde’ s’archive dans une base XML.” Thanks Thad McIlroy for pointing out my error.
Stephane Rodet says:
July 23, 2024 at 8:29 am
Thanks for the insightful article! What is also interesting in that space is the connection with summarisation (that you already approach), and also with translation. Once the models are as integrated and powerful as now, you can easily search content from other languages and get translated results back – in written but also spoken form. On the other hand, a lot of great tech has been around but underutilised for some time, remember Google Desktop Search? Local search hosting could still become much better in terms of deployments and results quality. Interesting times.
Thad McIlroy says:
August 13, 2024 at 5:32 am
You translate “’Le Monde’s’ archive dans une base XML”, as “The world is archived in an XML database”), but I believe the reference is to the French newspaper, Le Monde, and that its database is/was structured with XML.
kate holterhoff says:
August 21, 2024 at 1:06 am
Good shout! Thanks for this correction. I’ll fix my botched translation.
Ludovic Leforestier says:
August 14, 2024 at 4:41 pm
Great post. I do feel however it’s an arms race between GenAI to produce increasingly dumb content and ML-powered search to sieve through the much increased noise level > https://www.linkedin.com/posts/lludovic_archat-activity-7178056610601156608-vJpP?utm_source=share&utm_medium=member_desktop
Kyle575 says:
October 3, 2024 at 4:10 pm
I find the advancements in AI-driven search incredibly exciting! It’s refreshing to see a shift from keyword matching to a more nuanced understanding of user queries. This promises to enhance the search experience significantly and better meet our information needs.