Similarity Search Demystified
Finding Relevant Needles in the Data Haystack!
Introduction
The Bigdata.com API provides powerful retrieval capabilities, enabling you to search and analyze news articles, transcripts, corporate filings, and other documents. Notably, it supports both keyword-based searches and similarity searches, along with a range of other advanced search features.
In this notebook, we'll demonstrate how to use the Bigdata.com API to perform a similarity search effectively.
# Import required modules and classes
import html
from IPython.display import display, HTML
from bigdata_client import Bigdata
from bigdata_client.daterange import RollingDateRange
from bigdata_client.models.advanced_search_query import Similarity
from bigdata_client.models.search import DocumentType, SortBy
# Initialize the Bigdata client
# Make sure BIGDATA_USERNAME and BIGDATA_PASSWORD are set in the environment
# Alternatively, you can pass your credentials directly to the Bigdata class
bigdata = Bigdata()
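Before creating the client, it can help to confirm that both environment variables are actually set. The snippet below is a minimal sketch using only the standard library; the helper name missing_credentials is hypothetical and not part of bigdata_client.

```python
import os


def missing_credentials(env):
    """Return the names of required credential variables that are unset or empty.

    `env` is any mapping of variable names to values, e.g. os.environ.
    This helper is illustrative only and is not part of bigdata_client.
    """
    required = ("BIGDATA_USERNAME", "BIGDATA_PASSWORD")
    return [name for name in required if not env.get(name)]


missing = missing_credentials(os.environ)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Both credential variables are set.")
```

Passing the mapping in as an argument keeps the check easy to try with a plain dictionary instead of the real environment.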
Helper Functions
We define two helper functions: one escapes text for safe HTML display, and the other renders the search results as formatted HTML:
def escape_special_chars(text):
    """Escapes special characters for safe HTML display."""
    text = html.escape(text)  # Escapes HTML special characters like <, >, &
    # text = text.replace(r"$", r"\$")  # Escape the dollar sign properly
    text = text.replace("  ", "&nbsp;&nbsp;")  # Preserve double spaces
    return text
def print_results_html(results):
    """Displays search results as formatted HTML cards."""
    html_output = """
    <style>
        .results-container {
            font-family: Arial, sans-serif;
            background: #1e1e1e;
            color: white;
            padding: 20px;
            border-radius: 10px;
            max-width: 800px;
            margin: auto;
            box-shadow: 0px 4px 10px rgba(0, 0, 0, 0.5);
        }
        .result-card {
            border: 1px solid #444;
            padding: 15px;
            margin: 15px 0;
            border-radius: 8px;
            background: #2a2a2a;
            transition: transform 0.2s, box-shadow 0.2s;
        }
        .result-card:hover {
            transform: scale(1.02);
            box-shadow: 0px 4px 10px rgba(255, 255, 255, 0.1);
        }
        .rank-container {
            display: flex;
            gap: 10px; /* Space between rank bubbles */
            align-items: center;
            margin-bottom: 10px;
        }
        .rank-badge {
            font-weight: bold;
            font-size: 16px;
            padding: 6px 12px;
            border-radius: 20px;
            display: inline-block;
            color: white;
        }
        .badge-blue {
            background: #1E88E5;
        }
        .headline {
            font-size: 20px;
            font-weight: bold;
            color: #ffcc00;
        }
        .timestamp {
            font-size: 14px;
            color: #cccccc;
        }
        .text {
            font-size: 16px;
            line-height: 1.6;
            color: #dddddd;
        }
    </style>
    <div class='results-container'>
    """
    for idx, document in enumerate(results, 1):
        # Pull the display fields from the document and its first chunk
        headline = escape_special_chars(document.headline.title())
        timestamp = document.timestamp.strftime("%Y-%m-%d %H:%M:%S")
        relevance = round(document.chunks[0].relevance, 2)
        first_chunk_text = escape_special_chars(document.chunks[0].text)
        html_output += f"""
        <div class='result-card'>
            <div class='rank-container'>
                <div class='rank-badge badge-blue'>#{idx}</div>
            </div>
            <div class='headline'>{headline}</div>
            <div class='timestamp'><strong>Timestamp:</strong> {timestamp}</div>
            <div class='relevance'><strong>Relevance:</strong> {relevance}</div>
            <div class='text'>{first_chunk_text}</div>
        </div>
        """
    html_output += "</div>"
    display(HTML(html_output))
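To see why the escaping step matters, the short example below runs html.escape, the standard-library call that escape_special_chars relies on, over a made-up headline. Without escaping, characters such as < and & in article text could be interpreted as markup when the results are rendered.

```python
import html

# A made-up headline containing characters that are special in HTML
raw = 'Fed: rates <5% & "sticky" inflation'

# html.escape replaces &, <, > and (by default) quotes with HTML entities
escaped = html.escape(raw)
print(escaped)
# Fed: rates &lt;5% &amp; &quot;sticky&quot; inflation
```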
Define Search Query and Parameters
We define our search parameters, including the query, time period, and the number of documents to retrieve. In this example, we are searching for articles related to the Federal Reserve's actions on inflation and concerns about tariffs.
# Create a similarity search query
query = Similarity('Fed addresses inflation amid tariff concerns')
# Search within a specific time frame
DATE_RANGE = RollingDateRange.LAST_WEEK
# Set the rerank threshold to improve search relevance
RERANK_THRESHOLD = 0.85
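# Note: the news-only restriction is applied via `scope` in the search call below;
# chunk_relevance is left as a placeholder and is not used in this notebook
chunk_relevance = ...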
# Set the maximum number of documents to retrieve
DOCUMENT_LIMIT = 10
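As a conceptual aside, similarity search in general ranks text by how close its vector representation is to that of the query, rather than by exact keyword overlap. The sketch below illustrates the idea with made-up three-dimensional vectors and plain cosine similarity. The vectors, headlines, and function name are invented for illustration, and this is not a description of how Bigdata.com computes its scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": hand-picked numbers, not produced by any real model
query_vec = [0.9, 0.8, 0.1]
docs = {
    "Fed holds rates as tariffs cloud inflation outlook": [0.8, 0.9, 0.2],
    "Tech earnings beat expectations": [0.1, 0.2, 0.9],
}

# Rank headlines from most to least similar to the query vector
ranked = sorted(
    docs,
    key=lambda title: cosine_similarity(query_vec, docs[title]),
    reverse=True,
)
for title in ranked:
    print(f"{cosine_similarity(query_vec, docs[title]):.2f}  {title}")
```

The headline whose vector points in nearly the same direction as the query ranks first, even though similarity is computed without comparing any words directly.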
Execute Search
We now run the search using the specified parameters.
One of the key features of the Bigdata API is the ability to rerank search results with a cross-encoder, which scores each candidate against the query so that the most relevant documents surface first. The official documentation covers the reranking feature in more detail.
We activate this feature by setting rerank_threshold:
# Configure the search with the specified parameters
search = bigdata.search.new(
    query=query,
    date_range=DATE_RANGE,
    rerank_threshold=RERANK_THRESHOLD,
    scope=DocumentType.NEWS,  # Limit to news articles
    sortby=SortBy.RELEVANCE,  # Sort by relevance score
)
# Run the search and get results
results = search.run(DOCUMENT_LIMIT)
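To build intuition for what a rerank threshold does, here is a minimal sketch that assumes the threshold acts as a minimum score cutoff applied to reranker scores. The function name, candidate texts, and scores are all made up, and the real service may behave differently in its details.

```python
def rerank_and_filter(candidates, threshold):
    """Keep (text, score) pairs scoring at or above `threshold`, best first.

    The scores stand in for cross-encoder outputs; here they are invented.
    """
    kept = [(text, score) for text, score in candidates if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


candidates = [
    ("Tariffs lift import prices", 0.87),
    ("Quarterly smartphone sales", 0.42),
    ("Fed signals patience on rate cuts", 0.91),
]

top = rerank_and_filter(candidates, threshold=0.85)
for text, score in top:
    print(f"{score:.2f}  {text}")
```

Under this assumption, a higher threshold trades recall for precision: fewer documents come back, but those that do are scored as more relevant to the query.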
Display Results
Now that we have the search results, we can display them in a readable format:
print_results_html(results)
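Outside a notebook, or when a quick text dump is enough, the same fields can be printed as plain lines. The sketch below reads only the attributes used by the HTML helper above (headline, timestamp, and the first chunk's relevance and text). The function summarize_results and the stand-in objects are hypothetical, so that the example runs without API access.

```python
from datetime import datetime
from types import SimpleNamespace


def summarize_results(results, max_chars=80):
    """Return one plain-text line per document: rank, relevance, date, headline, snippet."""
    lines = []
    for rank, document in enumerate(results, 1):
        chunk = document.chunks[0]
        snippet = chunk.text[:max_chars]
        lines.append(
            f"{rank}. [{chunk.relevance:.2f}] {document.timestamp:%Y-%m-%d} "
            f"{document.headline} | {snippet}"
        )
    return lines


# Stand-in objects that mimic the attributes read above; not real API output
fake_results = [
    SimpleNamespace(
        headline="Fed holds rates steady",
        timestamp=datetime(2025, 1, 29, 19, 0),
        chunks=[SimpleNamespace(relevance=0.93, text="The Federal Reserve left rates unchanged.")],
    )
]

for line in summarize_results(fake_results):
    print(line)
```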
Conclusion
The Bigdata.com API offers many more filters for narrowing down your search results. For more details, refer to the official documentation.
Happy Searching!