Upload your own content¶
Bigdata not only allows you to query and analyze pre-existing data, but also to upload your own content to be analyzed and searched. The only method currently supported is to upload a file from disk:
from bigdata_client import Bigdata
bigdata = Bigdata()
file = bigdata.uploads.upload_from_disk('path/to/file')
The file object returned is a bigdata_client.models.uploads.File object, which contains:

- id: The unique identifier of the file.
- name: The name of the file. It is set to the name of the original file on disk.
- status: The status of the file. Check bigdata_client.file_status.FileStatus for the list of possible statuses.
- uploaded_at: The datetime when the file was uploaded, according to the server.
- raw_size: The size of the file in bytes.
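For example, a quick sketch that inspects these attributes right after an upload (the path is a placeholder):

file = bigdata.uploads.upload_from_disk('path/to/file')

# Inspect the metadata carried by the File object
print(file.id)           # unique identifier assigned by the server
print(file.name)         # name of the original file on disk
print(file.status)       # a FileStatus value, e.g. PENDING
print(file.uploaded_at)  # server-side upload datetime
print(file.raw_size)     # size in bytes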
Besides the path, the upload_from_disk() method also accepts the following optional parameters:

- provider_document_id: Allows you to assign a specific ID to your document, which will be available as provider_document_id in the metadata node of the annotated.json. It is useful if you want to correlate your own IDs with the ones provided by Bigdata.
- provider_date_utc: Allows you to assign a specific timestamp (a string in YYYY-MM-DD hh:mm:ss format, or a datetime) to your document. This will modify the document published date, allowing us to better assign a reporting date to detected events.
- primary_entity: You can specify a "Primary Entity" to boost entity recognition in your document. When a primary entity is set for a document, it increases the chances of detecting events even when the entity is not explicitly mentioned. Setting a primary entity is optional, and you can use either a name or the corresponding rp_entity_id.
- skip_metadata: If True, the file is uploaded but its metadata is not retrieved. Recommended for bulk uploads. False by default.
file = bigdata.uploads.upload_from_disk('path/to/file',
                                        provider_document_id='my_document_id',
                                        provider_date_utc='2022-01-01 12:00:00',
                                        primary_entity='Apple Inc.',
                                        skip_metadata=True)
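For bulk uploads, a minimal sketch that uploads a whole directory with skip_metadata=True and defers the wait until the end (the directory and glob pattern are hypothetical; wait_for_completion() is described below):

from pathlib import Path

# Hypothetical folder of documents to upload
paths = sorted(Path('path/to/documents').glob('*.pdf'))

# Upload everything first, skipping the per-file metadata retrieval
uploaded = [
    bigdata.uploads.upload_from_disk(str(path), skip_metadata=True)
    for path in paths
]

# Then wait for processing in a single pass at the end
for file in uploaded:
    file.wait_for_completion()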
Note that when a file is uploaded it is not immediately available for querying; it must be processed first. The status of the file can be checked by accessing the status attribute of the file object, but that value does not update on its own. To get the most recent status of the file, you must call the reload_status() method of the file object:
from bigdata_client.file_status import FileStatus
import time

# Poll the server until processing finishes
last_status = None
while file.status != FileStatus.COMPLETED:
    if last_status != file.status:
        print(f"\n{file.status}", end="", flush=True)
        last_status = file.status
    else:
        print(".", end="", flush=True)
    file.reload_status()
    time.sleep(0.5)
print(f"\n{file.status}", flush=True)
PENDING..........
PROCESSING..........
COMPLETED
Since waiting for a file to be processed is such a common operation, the library provides a helper method for it: wait_for_completion(). This method will block the execution of the program until the file is in a final state (COMPLETED, DELETED, or FAILED):
file.wait_for_completion()
By default, this method will wait "forever" until the file is processed. If you want to limit the time you are willing to wait, you can pass a timeout parameter to the method. After the timeout is reached, the method will raise a TimeoutError exception:
file.wait_for_completion(timeout=60)
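If you prefer to handle the timeout instead of letting it propagate, a minimal sketch:

try:
    file.wait_for_completion(timeout=60)
except TimeoutError:
    # Still processing after 60 seconds; retry or check back later
    print(f"File {file.id} not ready yet, status: {file.status}")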
Tag uploaded files¶
You can modify file tags using the add_tags(), remove_tags(), and set_tags() methods of File objects. The file object may come from the list(), get(), or upload_from_disk() methods.
Add Tag¶
To add a tag to a file, use the add_tags() method. You can add a single tag or a list of tags.
file = bigdata.uploads.get("4DC8AF5500AD4EB0A360D0C7BD6F9286")
print(file.tags)
>>> []
file.add_tags(["New Tag"])
print(file.tags)
>>> ["New Tag"]
file.add_tags(["New Tag 2", "New Tag 3"])
print(file.tags)
>>> ["New Tag", "New Tag 2", "New Tag 3"]
Remove Tag¶
To remove a tag from a file, use the remove_tags() method. You can remove a single tag or a list of tags.
file.remove_tags(["New Tag"])
print(file.tags)
>>> ["New Tag 2", "New Tag 3"]
# To remove all tags from a file
file.remove_tags(file.tags)
print(file.tags)
>>> []
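set_tags() is not demonstrated on this page; assuming from its name that it replaces the full tag list in a single call, a sketch:

# Assumption: set_tags() overwrites the current tag list
file.set_tags(["Tag A", "Tag B"])
print(file.tags)
>>> ["Tag A", "Tag B"]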
Working with your files¶
To list all the files that have been uploaded to the server, you can use the list() method:
files = bigdata.uploads.list()
for file in files:
    print(file)
If you have many files, the results are paginated and you must iterate over the pages:
import itertools

for n in itertools.count(start=1):
    files = bigdata.uploads.list(page_number=n)
    if not files:
        break
    do_stuff_with_files(files)  # placeholder for your own processing
The printed output contains the ID, file size, upload date, and name of the file:
C48410DA1AEE439ABAA0619F272B67F4 123 Jan 1 2021 My First Document.pdf
BE61DA39E0F540A599E958BBEB9BA3D5 1K Feb 10 2023 Document_2.txt
687A8B473E654416A0C19CD79EE77413 120K Jul 31 2024 Document-3.docx
F1345B07DDE145CAB30C08CC01B393D6 1.2M Dec 31 2024 Another file.docx
3A56AC4B2BCB42FEA7B0AF062FE78534 1.1G Apr 10 2024 The last file.pdf
Additionally, you can get a file by its ID:
file = bigdata.uploads.get("<document_id>")
print(file)
# C48410DA1AEE439ABAA0619F272B67F4 123 Jan 1 2021 My First Document.pdf
Once your files are processed, you can download three different versions of the file:

- The original file, by calling the download_original() method of the file object.
- The annotated version of the file, by calling the download_annotated() method of the file object. This is a JSON file containing the text together with the detections made by the system.
- The analytics version of the file, by calling the download_analytics() method of the file object. This is a JSON file containing the analytics created by the system.
file.download_original('path/to/save/original_file')
file.download_annotated('path/to/save/annotated_file.json')
file.download_analytics('path/to/save/analytics_file.json')
Additionally, you can get the annotated and analytics content directly as a Python dictionary by calling the corresponding get_<file_type>_dict() method:
annotations = file.get_annotated_dict()
print(annotations)
analytics = file.get_analytics_dict()
print(analytics)
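If you set provider_document_id at upload time, it should be present in the metadata node of the annotated JSON; the exact key path below is an assumption based on the parameter description above:

annotated = file.get_annotated_dict()

# Assumption: provider_document_id sits under the "metadata" node
print(annotated["metadata"]["provider_document_id"])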
Deleting uploaded files¶
To delete a file, you can use the delete() method of the file object, where the object may come from the list() method, the get() method, or the upload_from_disk() method:
import itertools

# Collect all uploaded files, page by page
files = []
for n in itertools.count(start=1):
    files_in_page = bigdata.uploads.list(page_number=n)
    if not files_in_page:
        break
    files.extend(files_in_page)

for i, file in enumerate(files):
    print(f"{i} {file}")

print(f"Enter the file row number to delete: [0 - {len(files)-1}]")
row = int(input())
if 0 <= row < len(files):
    file = files[row]
    file.delete()
    # The file is now deleted; bigdata.uploads.get() will raise an
    # exception since the file no longer exists
Warning
Note that deleting a file is a permanent operation and cannot be undone.
Another way to delete a file, if you know the ID, is to use the delete() method of the Uploads object. This avoids the need to get the file object first:
bigdata.uploads.delete("<document_id>")
Warning
Only files that are in the COMPLETED or FAILED state can be deleted. Attempting to delete a file that is still being processed will raise an exception. To avoid this, you can use the wait_for_completion() method:
file = bigdata.uploads.upload_from_disk('path/to/file')
# Wait for the file to be processed
file.wait_for_completion()
file.delete()