datasets

This module contains scripts to download/process all datasets available in dbcollection.

These scripts are self-contained, meaning they can be imported and used to manually set up a dataset.

Classes

BaseDataset

class dbcollection.datasets.BaseDataset(data_path, cache_path, extract_data=True, verbose=True)[source]

Base class for downloading/processing a dataset.

Parameters:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • extract_data (bool, optional) – If True, extracts the downloaded files if they are compressed.
  • verbose (bool, optional) – Be verbose.
Variables:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • extract_data (bool) – If True, extracts the downloaded files if they are compressed.
  • verbose (bool) – Be verbose.
  • urls (list) – List of URL links to download.
  • keywords (list) – List of keywords describing the dataset.
  • tasks (dict) – Dataset’s tasks.
  • default_task (str) – Default task name.
download()[source]

Download and extract files to disk.

Returns: A tuple of keywords.
Return type: tuple
get_task_constructor(task)[source]

Returns the class constructor for the input task.

Parameters: task (str) – Task name.
Returns:
  • str – Task name.
  • str – Task’s ending suffix (if any).
  • BaseTask – Constructor to process the metadata of the task.
parse_task_name(task)[source]

Parses the task string to look for key suffixes.

Parameters: task (str) – Task name.
Returns: The task name without the ‘_s’ suffix (if present).
Return type: str
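A minimal sketch of the suffix handling described above: a task name may end in the optional ‘_s’ suffix, and parse_task_name returns the name with that suffix removed. Treating ‘_s’ as the only special-cased suffix is an assumption based on the description above, not a guarantee about the library’s internals.

```python
# Stand-alone sketch of the '_s' suffix stripping described above
# (assumption: '_s' is the only suffix that is special-cased).
def parse_task_name(task):
    """Return the task name without a trailing '_s' suffix."""
    return task[:-2] if task.endswith('_s') else task

print(parse_task_name('classification_s'))  # classification
print(parse_task_name('classification'))    # classification
```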
process(task='default')[source]

Processes the metadata of a task.

Parameters: task (str, optional) – Task name.
Returns: A dictionary with the task name as key and the filename as value.
Return type: dict
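To show how the documented attributes (urls, keywords, tasks, default_task) and get_task_constructor fit together, here is a hedged sketch of defining a new dataset. The stand-in BaseDataset below only reproduces the documented behavior so the example runs without dbcollection installed; MyDataset, ClassificationTask, and the URL are placeholders, not part of the library.

```python
# Minimal stand-in for dbcollection.datasets.BaseDataset so this sketch runs
# without the library; it only stores the documented attributes and mirrors
# the documented get_task_constructor() return values.
class BaseDataset:
    urls = []
    keywords = []
    tasks = {}
    default_task = ''

    def __init__(self, data_path, cache_path, extract_data=True, verbose=True):
        self.data_path = data_path
        self.cache_path = cache_path
        self.extract_data = extract_data
        self.verbose = verbose

    def get_task_constructor(self, task):
        # Resolve 'default', split off the '_s' suffix (if any), and return
        # (task name, suffix, task constructor) as documented above.
        if task == 'default':
            task = self.default_task
        suffix = '_s' if task.endswith('_s') else None
        name = task[:-2] if suffix else task
        return name, suffix, self.tasks[name]


class ClassificationTask:
    """Placeholder task constructor (hypothetical)."""
    def __init__(self, data_path, cache_path, suffix=None, verbose=True):
        self.suffix = suffix


class MyDataset(BaseDataset):
    """Hypothetical dataset definition using the documented attributes."""
    urls = ['http://example.com/mydataset.zip']  # placeholder URL
    keywords = ['image_processing', 'classification']
    tasks = {'classification': ClassificationTask}
    default_task = 'classification'


db = MyDataset('data/', 'cache/')
name, suffix, ctor = db.get_task_constructor('default')
print(name, suffix, ctor.__name__)  # classification None ClassificationTask
```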

BaseTask

class dbcollection.datasets.BaseTask(data_path, cache_path, suffix=None, verbose=True)[source]

Base class for processing a task of a dataset.

Parameters:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • suffix (str, optional) – Suffix to select optional properties for a task.
  • verbose (bool, optional) – Be verbose.
Variables:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • suffix (str) – Suffix to select optional properties for a task.
  • verbose (bool) – Be verbose.
  • filename_h5 (str) – hdf5 metadata file name.
add_data_to_default(hdf5_handler, data, set_name=None)[source]

Add data of a set to the default group.

For each field, the data is organized into a single big matrix.

Parameters:
  • hdf5_handler (h5py._hl.group.Group) – hdf5 group object handler.
  • data (list/dict) – List or dict containing the data annotations of a particular set or sets.
  • set_name (str, optional) – Set name.
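A hedged illustration of what “organized into a single big matrix” can mean in practice: variable-length per-sample annotation lists are concatenated into one flat array plus a per-sample (start, end) index, a common HDF5-friendly layout. The field name and the index scheme below are assumptions for illustration, not the library’s exact storage format.

```python
# Illustrative data: per-image bounding-box lists of varying length
# (field name and values are hypothetical).
annotations = {
    'boxes': [[(0, 0, 10, 10)], [(5, 5, 20, 20), (1, 2, 3, 4)], []],
}

def flatten_field(samples):
    """Concatenate per-sample lists into one matrix plus per-sample ranges."""
    matrix, ranges = [], []
    for sample in samples:
        start = len(matrix)
        matrix.extend(sample)
        ranges.append((start, len(matrix)))  # (start, end) rows per sample
    return matrix, ranges

boxes, box_ranges = flatten_field(annotations['boxes'])
print(len(boxes))   # 3
print(box_ranges)   # [(0, 1), (1, 3), (3, 3)]
```

With this layout, each field becomes one contiguous dataset on disk, and a sample’s annotations are recovered by slicing with its (start, end) range.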
add_data_to_source(hdf5_handler, data, set_name=None)[source]

Store data annotations in a nested tree fashion.

It closely follows the tree structure of the data.

Parameters:
  • hdf5_handler (h5py._hl.group.Group) – hdf5 group object handler.
  • data (list/dict) – List or dict containing the data annotations of a particular set or sets.
  • set_name (str, optional) – Set name.
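A sketch of the “nested tree fashion” storage described above: recursively mirror the annotation dict’s structure, creating a sub-group per dict key. A plain dict stands in for the h5py group handler so the example runs without h5py; the h5py equivalents are noted in comments.

```python
# Recursive nested storage sketch; a plain dict stands in for the
# h5py group handler so this runs without h5py installed.
def add_data_to_source(group, data):
    """Mirror the nested structure of `data` into `group`."""
    for key, value in data.items():
        if isinstance(value, dict):
            subgroup = group.setdefault(key, {})  # h5py: group.create_group(key)
            add_data_to_source(subgroup, value)
        else:
            group[key] = value  # h5py: group.create_dataset(key, data=value)

root = {}
add_data_to_source(root, {'train': {'images': ['a.jpg'], 'labels': [0]}})
print(root)  # {'train': {'images': ['a.jpg'], 'labels': [0]}}
```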
load_data()[source]

Load data of the dataset (create a generator).

Load data from annotations and split it into the corresponding sets (train, val, test, etc.).

process_metadata()[source]

Process metadata and store it in an hdf5 file.

run()[source]

Run task processing.
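Tying the BaseTask methods together, here is a hedged sketch of defining a task: load_data() is a generator yielding one dict per set, and run() drives the processing. The stand-in base class only reproduces the documented control flow; the real class writes the metadata to an hdf5 file via process_metadata(), which is simplified here to collecting the sets in memory.

```python
# Minimal stand-in for dbcollection.datasets.BaseTask (simplified: run()
# collects each set's annotations in memory instead of writing hdf5).
class BaseTask:
    def __init__(self, data_path, cache_path, suffix=None, verbose=True):
        self.data_path = data_path
        self.cache_path = cache_path
        self.suffix = suffix
        self.verbose = verbose
        self.filename_h5 = 'task'  # hdf5 metadata file name

    def load_data(self):
        """Generator yielding one dict of annotations per set."""
        raise NotImplementedError

    def run(self):
        # Stand-in for process_metadata(): gather every set yielded
        # by load_data() into one metadata dict.
        metadata = {}
        for set_data in self.load_data():
            metadata.update(set_data)
        return metadata


class MyTask(BaseTask):
    """Hypothetical task: yields train/test sets as documented above."""
    def load_data(self):
        yield {'train': {'filenames': ['img0.jpg', 'img1.jpg'], 'labels': [0, 1]}}
        yield {'test': {'filenames': ['img2.jpg'], 'labels': [1]}}


task = MyTask('data/', 'cache/')
meta = task.run()
print(sorted(meta))  # ['test', 'train']
```

Using a generator for load_data() lets each set be loaded and processed one at a time, so the whole dataset never has to sit in memory at once.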