Search in uploaded files¶
It will only take you 5 minutes and will guide you through:

- Install the bigdata-client package
- Authenticate to bigdata.com
- Create two sample files to upload
- Upload private files
- Query bigdata.com
Note
We recommend trying this how-to guide directly on Google Colab.
Install the bigdata-client package¶
Follow the Prerequisites instructions to set up the required environment.
Authenticate to bigdata.com¶
Because you already set your credentials in the environment during the Prerequisites step, the Bigdata constructor will read them automatically.
from bigdata_client import Bigdata
bigdata = Bigdata()
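If the constructor fails, a quick sanity check that the credentials are actually visible in the environment can help. This is a minimal sketch; the variable names below are an assumption for illustration, so substitute whichever names the Prerequisites page told you to export:

```python
import os

# NOTE: the variable names here are an assumption; replace them with the
# names from the Prerequisites page if they differ.
def check_credentials(required=("BIGDATA_USERNAME", "BIGDATA_PASSWORD")):
    """Raise a clear error if any expected credential variable is unset."""
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return True
```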
Create two sample files to upload¶
Create the following two sample files in your local directory.
File name data_science_research-2020-06.txt:
RavenPack Data Science researchers recommend the following stocks in June 2020
Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud computing, which are key growth areas for the company.
Datadog (NASDAQ: DDOG): Datadog is a leading provider of monitoring and analytics solutions for cloud-based applications.
Oracle (NYSE: ORCL): Oracle is a major player in the enterprise software and cloud computing market.
File name soup_recipes-2020-06.txt:
We recommend making chicken noodle soup with homemade chicken stock
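If you prefer to create the files from code, here is a minimal sketch (plain Python, nothing bigdata-specific) that writes both samples into the current directory:

```python
from pathlib import Path

# The two sample files described above, keyed by file name.
samples = {
    "data_science_research-2020-06.txt": (
        "RavenPack Data Science researchers recommend the following stocks in June 2020\n"
        "Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud "
        "computing, which are key growth areas for the company.\n"
        "Datadog (NASDAQ: DDOG): Datadog is a leading provider of monitoring and analytics "
        "solutions for cloud-based applications.\n"
        "Oracle (NYSE: ORCL): Oracle is a major player in the enterprise software and cloud "
        "computing market.\n"
    ),
    "soup_recipes-2020-06.txt": (
        "We recommend making chicken noodle soup with homemade chicken stock\n"
    ),
}

# Write each sample file into the current working directory.
for name, content in samples.items():
    Path(name).write_text(content, encoding="utf-8")
```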
Upload private files¶
We will upload two files and use the parameter provider_date_utc
to tell bigdata.com their creation date. This sets the document's published date, allowing us to better assign a reporting date to detected events.
file = bigdata.uploads.upload_from_disk("./data_science_research-2020-06.txt",
                                        provider_document_id='my_document_id',
                                        provider_date_utc='2020-06-10 12:00:00',
                                        primary_entity='RavenPack',
                                        skip_metadata=True)
# Check the file's processing status
file.reload_status()
print(f"File processing status: {file.status}")
# Wait for completion
file.wait_for_completion(timeout=60)
print(f"File processing status: {file.status}")
Output:
File processing status: PENDING
File processing status: COMPLETED
The first file was successfully analysed and indexed into the bigdata.com vector database.
As you might have many types of private documents, you could also assign tags to each type and use them during a search:
file.add_tags(["Data Science Research"])
print(f"File tags: {file.tags}")
Output:
File tags: ['Data Science Research']
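One lightweight way to keep tagging consistent is to derive the tag from the file name. This is a plain-Python sketch (no SDK calls; the rule table and helper name are our own invention):

```python
# Map a substring of the file name to the tag that document type should get.
TAG_RULES = {
    "data_science_research": "Data Science Research",
    "soup_recipes": "Cooking recipes",
}

def tags_for(filename: str) -> list[str]:
    """Return every tag whose key appears in the file name."""
    return [tag for key, tag in TAG_RULES.items() if key in filename]
```

You could then call file.add_tags(tags_for(path)) after each upload instead of hard-coding the tag list per file.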
Let’s upload the second file:
file = bigdata.uploads.upload_from_disk("./soup_recipes-2020-06.txt",
                                        provider_document_id='my_document_id',
                                        provider_date_utc='2020-06-10 12:00:00',
                                        primary_entity='RavenPack',
                                        skip_metadata=True)
# Check the file's processing status
file.reload_status()
print(f"File processing status: {file.status}")
# Wait for completion
file.wait_for_completion(timeout=60)
print(f"File processing status: {file.status}")
Output:
File processing status: PENDING
File processing status: COMPLETED
and tag it as Cooking recipes:
file.add_tags(["Cooking recipes"])
print(f"File tags: {file.tags}")
Output:
File tags: ['Cooking recipes']
Query bigdata.com¶
Let’s do a Similarity search with the text recommend stock in the month of June 2020:
from bigdata_client.query import Similarity
from bigdata_client.daterange import AbsoluteDateRange
from bigdata_client.models.search import DocumentType
# Similarity search
query = Similarity("recommend stock")
# Full month of June 2020
in_june_2020 = AbsoluteDateRange("2020-06-01T00:00:00", "2020-06-30T23:59:59")
# Create a bigdata search
search = bigdata.search.new(query, date_range=in_june_2020, scope=DocumentType.ALL)
# Retrieve content of four documents
documents = search.run(4)
for doc in documents:
print(f"\nDocument headline: {doc.headline}")
Output:
Document headline: Nifty outlook and stock recommendations by CapitalVia: Buy RBL Bank, ONGC
Document headline: Forget the Naysayers: 3 Top Retail Stocks You Should Own
Document headline: Here's How to Invest Like Warren Buffett
Document headline: 2 Tech Stocks to Buy Right Now
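Hard-coding month boundaries is error-prone (leap years, 30- vs 31-day months). A small stdlib helper — our own sketch, not part of bigdata-client — can build the two ISO timestamps AbsoluteDateRange expects for any month:

```python
from calendar import monthrange
from datetime import datetime

def month_bounds(year: int, month: int) -> tuple[str, str]:
    """ISO-8601 start/end timestamps covering a whole calendar month."""
    last_day = monthrange(year, month)[1]  # number of days in the month
    start = datetime(year, month, 1).isoformat()
    end = datetime(year, month, last_day, 23, 59, 59).isoformat()
    return start, end

# month_bounds(2020, 6) -> ("2020-06-01T00:00:00", "2020-06-30T23:59:59")
```

The returned strings can be passed straight to AbsoluteDateRange.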
The private files got indexed, but they compete with many other documents. Let’s narrow the date range to only 2 seconds around the publication timestamp of our private files:
# Narrow down the date range to 2 seconds
two_secs_in_june_2020 = AbsoluteDateRange("2020-06-10T11:59:59", "2020-06-10T12:00:01")
# Create a bigdata search
search = bigdata.search.new(query, date_range=two_secs_in_june_2020, scope=DocumentType.ALL)
# Retrieve content of four documents
documents = search.run(4)
for doc in documents:
print(f"\nDocument headline: {doc.headline}")
Output:
Document headline: soup_recipes-2020-06.txt
Document headline: Deutsche Post AG: Investor Meeting
Document headline: Ford Motor Co.: Deutsche Bank Global Auto Industry Conference
Document headline: data_science_research-2020-06.txt
🎉 We see them both!
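The two boundaries above can also be derived from the upload timestamp instead of typed by hand. A minimal sketch (stdlib only; the helper name is ours):

```python
from datetime import datetime, timedelta

def window_around(ts: str, seconds: int = 1) -> tuple[str, str]:
    """Return ISO timestamps `seconds` before and after `ts`."""
    center = datetime.fromisoformat(ts)
    delta = timedelta(seconds=seconds)
    return (center - delta).isoformat(), (center + delta).isoformat()

# Reproduces the range used above:
# window_around("2020-06-10T12:00:00") == ("2020-06-10T11:59:59", "2020-06-10T12:00:01")
```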
If we only want to get insights from our private files, we can set the scope to DocumentType.FILES.
# Create a bigdata search with scope "FILES"
search = bigdata.search.new(query, date_range=in_june_2020, scope=DocumentType.FILES)
# Retrieve content of four documents
documents = search.run(4)
# Read all retrieved documents and print some details
for doc in documents:
print(f"\nDocument headline: {doc.headline}")
for chunk in doc.chunks:
print(f" Chunk text: {chunk.text}")
Output:
Document headline: soup_recipes-2020-06.txt
Chunk text: We recommend making chicken noodle soup with homemade chicken stock
Document headline: data_science_research-2020-06.txt
Chunk text: RavenPack Data Science researchers recommend the following stocks in June 2020 Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud computing, which are key growth areas for the company.
We can even use tags to focus on a specific type of private file, for instance Data Science Research:
from bigdata_client.query import Similarity, FileTag
# Similarity search
query = Similarity("recommend stock") & FileTag("Data Science Research")
# Create a bigdata search
search = bigdata.search.new(query, date_range=in_june_2020, scope=DocumentType.FILES)
# Retrieve content of four documents
documents = search.run(4)
# Read all retrieved documents and print some details
for doc in documents:
print(f"\nDocument headline: {doc.headline}")
for chunk in doc.chunks:
print(f" Chunk text: {chunk.text}")
Output:
Document headline: data_science_research-2020-06.txt
Chunk text: RavenPack Data Science researchers recommend the following stocks in June 2020 Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud computing, which are key growth areas for the company.
Summary¶
Congratulations! 🎉 You have successfully uploaded private files and retrieved insights about them from amongst millions of other documents.
Next steps¶
The following pages are related to private file uploading, managing tags and searching with the tag query filter:

- Upload your own content: It describes all supported parameters and methods to manage private files.
- Batch files upload: It contains a script to help your organization quickly upload all private files.
- FileTag: It describes the FileTag query filter.
- Query operators: It describes the supported query operators: &, |, ~, All and Any.