datasets

This module contains scripts to download/process all datasets available in dbcollection.

These scripts are self-contained, meaning they can be imported and used to manually set up a dataset.

Classes

BaseDataset

class dbcollection.datasets.BaseDataset(data_path, cache_path, extract_data=True, verbose=True)[source]

Base class for downloading/processing a dataset.

Parameters:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • extract_data (bool, optional) – If True, extracts the downloaded files if they are compressed.
  • verbose (bool, optional) – Be verbose.
Variables:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • extract_data (bool) – If True, extracts the downloaded files if they are compressed.
  • verbose (bool) – Be verbose.
  • urls (list) – List of URL links to download.
  • keywords (list) – List of keywords describing the dataset.
  • tasks (dict) – Dataset’s tasks.
  • default_task (str) – Default task name.
download()[source]

Download and extract files to disk.

Returns: A tuple of keywords.
Return type: tuple
get_task_constructor(task)[source]

Returns the class constructor for the input task.

Parameters: task (str) – Task name.
Returns:
  • str – Task name.
  • str – Task’s ending suffix (if any).
  • BaseTask – Constructor to process the metadata of the task.
parse_task_name(task)[source]

Parses the task string to look for key suffixes.

Parameters: task (str) – Task name.
Returns: The task name without the ‘_s’ suffix (if present).
Return type: str
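A minimal sketch of the suffix handling described above: a task name may end in the optional ‘_s’ suffix, and parse_task_name returns the name with that suffix removed. Treating ‘_s’ as the only special-cased suffix is an assumption based on the description above, not a guarantee about the library’s internals.

```python
# Stand-alone sketch of the '_s' suffix stripping described above
# (assumption: '_s' is the only suffix that is special-cased).
def parse_task_name(task):
    """Return the task name without a trailing '_s' suffix."""
    return task[:-2] if task.endswith('_s') else task

print(parse_task_name('classification_s'))  # classification
print(parse_task_name('classification'))    # classification
```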
process(task='default')[source]

Processes the metadata of a task.

Parameters: task (str, optional) – Task name.
Returns: A dictionary with the task name as key and the filename as value.
Return type: dict
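To show how the documented attributes (urls, keywords, tasks, default_task) and get_task_constructor fit together, here is a hedged sketch of defining a new dataset. The stand-in BaseDataset below only reproduces the documented behavior so the example runs without dbcollection installed; MyDataset, ClassificationTask, and the URL are placeholders, not part of the library.

```python
# Minimal stand-in for dbcollection.datasets.BaseDataset so this sketch runs
# without the library; it only stores the documented attributes and mirrors
# the documented get_task_constructor() return values.
class BaseDataset:
    urls = []
    keywords = []
    tasks = {}
    default_task = ''

    def __init__(self, data_path, cache_path, extract_data=True, verbose=True):
        self.data_path = data_path
        self.cache_path = cache_path
        self.extract_data = extract_data
        self.verbose = verbose

    def get_task_constructor(self, task):
        # Resolve 'default', split off the '_s' suffix (if any), and return
        # (task name, suffix, task constructor) as documented above.
        if task == 'default':
            task = self.default_task
        suffix = '_s' if task.endswith('_s') else None
        name = task[:-2] if suffix else task
        return name, suffix, self.tasks[name]


class ClassificationTask:
    """Placeholder task constructor (hypothetical)."""
    def __init__(self, data_path, cache_path, suffix=None, verbose=True):
        self.suffix = suffix


class MyDataset(BaseDataset):
    """Hypothetical dataset definition using the documented attributes."""
    urls = ['http://example.com/mydataset.zip']  # placeholder URL
    keywords = ['image_processing', 'classification']
    tasks = {'classification': ClassificationTask}
    default_task = 'classification'


db = MyDataset('data/', 'cache/')
name, suffix, ctor = db.get_task_constructor('default')
print(name, suffix, ctor.__name__)  # classification None ClassificationTask
```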

BaseTask

class dbcollection.datasets.BaseTask(data_path, cache_path, suffix=None, verbose=True)[source]

Base class for processing a task of a dataset.

Parameters:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • suffix (str, optional) – Suffix to select optional properties for a task.
  • verbose (bool, optional) – Be verbose.
Variables:
  • data_path (str) – Path to the data directory.
  • cache_path (str) – Path to the cache file.
  • suffix (str) – Suffix to select optional properties for a task.
  • verbose (bool) – Be verbose.
  • filename_h5 (str) – hdf5 metadata file name.
add_data_to_default(hdf5_handler, data, set_name=None)[source]

Add data of a set to the default group.

For each field, the data is organized into a single big matrix.

Parameters:
  • hdf5_handler (h5py._hl.group.Group) – hdf5 group object handler.
  • data (list/dict) – List or dict containing the data annotations of a particular set or sets.
  • set_name (str, optional) – Set name.
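A hedged illustration of what “organized into a single big matrix” can mean in practice: variable-length per-sample annotation lists are concatenated into one flat array plus a per-sample (start, end) index, a common HDF5-friendly layout. The field name and the index scheme below are assumptions for illustration, not the library’s exact storage format.

```python
# Illustrative data: per-image bounding-box lists of varying length
# (field name and values are hypothetical).
annotations = {
    'boxes': [[(0, 0, 10, 10)], [(5, 5, 20, 20), (1, 2, 3, 4)], []],
}

def flatten_field(samples):
    """Concatenate per-sample lists into one matrix plus per-sample ranges."""
    matrix, ranges = [], []
    for sample in samples:
        start = len(matrix)
        matrix.extend(sample)
        ranges.append((start, len(matrix)))  # (start, end) rows per sample
    return matrix, ranges

boxes, box_ranges = flatten_field(annotations['boxes'])
print(len(boxes))   # 3
print(box_ranges)   # [(0, 1), (1, 3), (3, 3)]
```

With this layout, each field becomes one contiguous dataset on disk, and a sample’s annotations are recovered by slicing with its (start, end) range.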
add_data_to_source(hdf5_handler, data, set_name=None)[source]

Store data annotations in a nested tree fashion.

It closely follows the tree structure of the data.

Parameters:
  • hdf5_handler (h5py._hl.group.Group) – hdf5 group object handler.
  • data (list/dict) – List or dict containing the data annotations of a particular set or sets.
  • set_name (str, optional) – Set name.
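A sketch of the “nested tree fashion” storage described above: recursively mirror the annotation dict’s structure, creating a sub-group per dict key. A plain dict stands in for the h5py group handler so the example runs without h5py; the h5py equivalents are noted in comments.

```python
# Recursive nested storage sketch; a plain dict stands in for the
# h5py group handler so this runs without h5py installed.
def add_data_to_source(group, data):
    """Mirror the nested structure of `data` into `group`."""
    for key, value in data.items():
        if isinstance(value, dict):
            subgroup = group.setdefault(key, {})  # h5py: group.create_group(key)
            add_data_to_source(subgroup, value)
        else:
            group[key] = value  # h5py: group.create_dataset(key, data=value)

root = {}
add_data_to_source(root, {'train': {'images': ['a.jpg'], 'labels': [0]}})
print(root)  # {'train': {'images': ['a.jpg'], 'labels': [0]}}
```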
load_data()[source]

Load data of the dataset (create a generator).

Load data from annotations and split it into the corresponding sets (train, val, test, etc.).

process_metadata()[source]

Process metadata and store it in an hdf5 file.

run()[source]

Run task processing.
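Tying the BaseTask methods together, here is a hedged sketch of defining a task: load_data() is a generator yielding one dict per set, and run() drives the processing. The stand-in base class only reproduces the documented control flow; the real class writes the metadata to an hdf5 file via process_metadata(), which is simplified here to collecting the sets in memory.

```python
# Minimal stand-in for dbcollection.datasets.BaseTask (simplified: run()
# collects each set's annotations in memory instead of writing hdf5).
class BaseTask:
    def __init__(self, data_path, cache_path, suffix=None, verbose=True):
        self.data_path = data_path
        self.cache_path = cache_path
        self.suffix = suffix
        self.verbose = verbose
        self.filename_h5 = 'task'  # hdf5 metadata file name

    def load_data(self):
        """Generator yielding one dict of annotations per set."""
        raise NotImplementedError

    def run(self):
        # Stand-in for process_metadata(): gather every set yielded
        # by load_data() into one metadata dict.
        metadata = {}
        for set_data in self.load_data():
            metadata.update(set_data)
        return metadata


class MyTask(BaseTask):
    """Hypothetical task: yields train/test sets as documented above."""
    def load_data(self):
        yield {'train': {'filenames': ['img0.jpg', 'img1.jpg'], 'labels': [0, 1]}}
        yield {'test': {'filenames': ['img2.jpg'], 'labels': [1]}}


task = MyTask('data/', 'cache/')
meta = task.run()
print(sorted(meta))  # ['test', 'train']
```

Using a generator for load_data() lets each set be loaded and processed one at a time, so the whole dataset never has to sit in memory at once.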