Search in uploaded files

This guide takes about 5 minutes and walks you through:

  • Install bigdata-client package

  • Authenticate to bigdata.com

  • Create two sample files to upload

  • Upload private files

  • Query bigdata.com

Note

We recommend trying this how-to guide directly on Google Colab

Install bigdata-client package

Follow the Prerequisites instructions to set up the required environment.
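In most setups this amounts to installing the package from PyPI (a minimal sketch; see the Prerequisites page for the complete environment setup, including credentials):

```shell
pip install bigdata-client
```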

Authenticate to bigdata.com

Because you have already set your credentials in the environment during the Prerequisites step, the Bigdata constructor will read them automatically.

from bigdata_client import Bigdata

bigdata = Bigdata()

Create two sample files to upload

Create the following two sample files in your local directory.

File name data_science_research-2020-06.txt:

RavenPack Data Science researchers recommend the following stocks in June 2020

Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud computing, which are key growth areas for the company.

Datadog (NASDAQ: DDOG): Datadog is a leading provider of monitoring and analytics solutions for cloud-based applications.

Oracle (NYSE: ORCL): Oracle is a major player in the enterprise software and cloud computing market.

File name soup_recipes-2020-06.txt:

We recommend making chicken noodle soup with homemade chicken stock
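The two files above can also be created programmatically, which is convenient on Google Colab (a minimal sketch; file names and contents match the examples above):

```python
from pathlib import Path

# Sample content for each file, keyed by file name
samples = {
    "data_science_research-2020-06.txt": (
        "RavenPack Data Science researchers recommend the following stocks in June 2020\n\n"
        "Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud "
        "computing, which are key growth areas for the company.\n\n"
        "Datadog (NASDAQ: DDOG): Datadog is a leading provider of monitoring and analytics "
        "solutions for cloud-based applications.\n\n"
        "Oracle (NYSE: ORCL): Oracle is a major player in the enterprise software and cloud "
        "computing market.\n"
    ),
    "soup_recipes-2020-06.txt": (
        "We recommend making chicken noodle soup with homemade chicken stock\n"
    ),
}

# Write both files into the current working directory
for name, content in samples.items():
    Path(name).write_text(content, encoding="utf-8")
```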

Upload private files

We will upload two files and use the provider_date_utc parameter to tell bigdata their creation date. This sets the document's published date, allowing us to better assign a reporting date to detected events.

file = bigdata.uploads.upload_from_disk("./data_science_research-2020-06.txt",
                                        provider_document_id='my_document_id',
                                        provider_date_utc='2020-06-10 12:00:00',
                                        primary_entity='RavenPack',
                                        skip_metadata=True)

# Check the file's processing status
file.reload_status()
print(f"File processing status: {file.status}")

# Wait for completion
file.wait_for_completion(timeout=60)
print(f"File processing status: {file.status}")

Output:

File processing status: PENDING
File processing status: COMPLETED

The first file was successfully analysed and indexed into the bigdata.com vector database.

As you might have many types of private documents, you can also assign tags to each type and use them during searches.

file.add_tags(["Data Science Research"])
print(f"File tags: {file.tags}")

Output:

File tags: ['Data Science Research']

Let’s upload the second file

file = bigdata.uploads.upload_from_disk("./soup_recipes-2020-06.txt",
                                        provider_document_id='my_document_id',
                                        provider_date_utc='2020-06-10 12:00:00',
                                        primary_entity='RavenPack',
                                        skip_metadata=True)
# Check the file's processing status
file.reload_status()
print(f"File processing status: {file.status}")

# Wait for completion
file.wait_for_completion(timeout=60)
print(f"File processing status: {file.status}")

Output:

File processing status: PENDING
File processing status: COMPLETED

and tag it as Cooking recipes

file.add_tags(["Cooking recipes"])
print(f"File tags: {file.tags}")

Output:

File tags: ['Cooking recipes']

Query bigdata.com

Let’s do a Similarity search with the text recommend stock in the month of June 2020:

from bigdata_client.query import Similarity
from bigdata_client.daterange import AbsoluteDateRange
from bigdata_client.models.search import DocumentType

# Similarity search
query = Similarity("recommend stock")

# Date range within June 2020 (June 1 08:00 UTC to June 30 00:00 UTC)
in_june_2020 = AbsoluteDateRange("2020-06-01T08:00:00", "2020-06-30T00:00:00")

# Create a bigdata search
search = bigdata.search.new(query, date_range=in_june_2020, scope=DocumentType.ALL)

# Retrieve content of four documents
documents = search.run(4)
for doc in documents:
    print(f"\nDocument headline: {doc.headline}")

Output:

Document headline: Nifty outlook and stock recommendations by CapitalVia: Buy RBL Bank, ONGC

Document headline: Forget the Naysayers: 3 Top Retail Stocks You Should Own

Document headline: Here's How to Invest Like Warren Buffett

Document headline: 2 Tech Stocks to Buy Right Now

Our private files were indexed, but they sit among many other documents. Let's narrow the date range to just 2 seconds around the publication timestamp of our private files:

# Narrow down the date range to 2 seconds
two_secs_in_june_2020 = AbsoluteDateRange("2020-06-10T11:59:59", "2020-06-10T12:00:01")

# Create a bigdata search
search = bigdata.search.new(query, date_range=two_secs_in_june_2020, scope=DocumentType.ALL)

# Retrieve content of four documents
documents = search.run(4)
for doc in documents:
    print(f"\nDocument headline: {doc.headline}")

Output:

Document headline: soup_recipes-2020-06.txt

Document headline: Deutsche Post AG: Investor Meeting

Document headline: Ford Motor Co.: Deutsche Bank Global Auto Industry Conference

Document headline: data_science_research-2020-06.txt

🎉We see them both!

If we only want to get insights from our private files, then we can set the scope to DocumentType.FILES.

# Create a bigdata search with scope "FILES"
search = bigdata.search.new(query, date_range=in_june_2020, scope=DocumentType.FILES)

# Retrieve content of four documents
documents = search.run(4)

# Read all retrieved documents and print some details
for doc in documents:
    print(f"\nDocument headline: {doc.headline}")
    for chunk in doc.chunks:
        print(f"  Chunk text: {chunk.text}")

Output:

Document headline: soup_recipes-2020-06.txt
Chunk text: We recommend making chicken noodle soup with homemade chicken stock

Document headline: data_science_research-2020-06.txt
Chunk text: RavenPack Data Science researchers recommend the following stocks in June 2020 Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud computing, which are key growth areas for the company.

We can even use tags to focus on a specific type of private file, for instance Data Science Research

from bigdata_client.query import Similarity, FileTag

# Similarity search
query = Similarity("recommend stock") & FileTag("Data Science Research")

# Create a bigdata search
search = bigdata.search.new(query, date_range=in_june_2020, scope=DocumentType.FILES)

# Retrieve content of four documents
documents = search.run(4)

# Read all retrieved documents and print some details
for doc in documents:
    print(f"\nDocument headline: {doc.headline}")
    for chunk in doc.chunks:
        print(f"  Chunk text: {chunk.text}")

Output:

Document headline: data_science_research-2020-06.txt
Chunk text: RavenPack Data Science researchers recommend the following stocks in June 2020 Microsoft (NASDAQ: MSFT): Microsoft has been heavily investing in AI and cloud computing, which are key growth areas for the company.

Summary

Congratulations! 🎉 You have successfully uploaded private files and retrieved insights about them from among millions of other documents.

Next steps

The following pages cover private file uploads, tag management, and searching with the tag query filter:

  • Upload your own content: It describes all supported parameters and methods to manage private files.

  • Batch files upload: It contains a script to help your organization quickly upload all private files.

  • FileTag: It describes the FileTag query filter.

  • Query operators: It describes the supported query operators: &, |, ~, All, and Any.
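To see how these operators compose, here is an illustrative model of the pattern (this is not the library's implementation — bigdata_client defines its own query classes such as Similarity and FileTag — just a sketch of how &, |, and ~ build an expression tree):

```python
from dataclasses import dataclass

# Illustrative-only filter model: each node is a small expression tree,
# mirroring how expressions like Similarity(...) & FileTag(...) compose.
@dataclass
class Filter:
    kind: str
    value: object

    def __and__(self, other):   # query & query -> AND node
        return Filter("AND", (self, other))

    def __or__(self, other):    # query | query -> OR node
        return Filter("OR", (self, other))

    def __invert__(self):       # ~query -> NOT node
        return Filter("NOT", self)

def describe(f: Filter) -> str:
    """Render the expression tree as a readable string."""
    if f.kind == "AND":
        a, b = f.value
        return f"({describe(a)} AND {describe(b)})"
    if f.kind == "OR":
        a, b = f.value
        return f"({describe(a)} OR {describe(b)})"
    if f.kind == "NOT":
        return f"(NOT {describe(f.value)})"
    return f"{f.kind}={f.value!r}"

query = Filter("Similarity", "recommend stock") & ~Filter("FileTag", "Cooking recipes")
print(describe(query))
# (Similarity='recommend stock' AND (NOT FileTag='Cooking recipes'))
```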