core

Core API methods/classes.

Dataset management

API methods for managing datasets.

This module contains several methods for easy management of datasets. These include methods for:

  • downloading datasets’ data files from online sources (URLs)
  • processing/parsing data files + annotations into an HDF5 file to store metadata information
  • loading a dataset’s metadata into a data loader object
  • adding/removing datasets to/from the cache
  • managing the cache file
  • querying the cache file for some dataset/keyword
  • displaying information about available datasets in cache or for download

These methods compose the core API for dataset management. Most functionality is available through these functions alone, allowing users to manage and query their datasets in a simple way.
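
For example, a typical workflow chains these functions together. A minimal sketch, using the CIFAR10 dataset from the examples below:

>>> import dbcollection as dbc
>>> dbc.download('cifar10')        # fetch the data files to disk
>>> dbc.process('cifar10')         # parse them into an HDF5 metadata file
>>> cifar10 = dbc.load('cifar10')  # get a DataLoader for the default task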

Methods

download

dbcollection.core.api.download.download(name, data_dir='', extract_data=True, verbose=True)[source]

Download a dataset’s data files to disk.

This method will download a dataset’s data files to disk. After download, it updates the cache file with the dataset’s name and path where the data is stored.

Parameters:
  • name (str) – Name of the dataset.
  • data_dir (str, optional) – Directory path to store the downloaded data.
  • extract_data (bool, optional) – Extracts/unpacks the data files (if true).
  • verbose (bool, optional) – Displays text information (if true).

Examples

Download the CIFAR10 dataset to disk.

>>> import dbcollection as dbc
>>> dbc.download('cifar10')

process

dbcollection.core.api.process.process(name, task='default', verbose=True)[source]

Process a dataset’s metadata and store it to file.

The data is stored as an HDF5 file for each of the dataset’s tasks.

Parameters:
  • name (str) – Name of the dataset.
  • task (str, optional) – Name of the task to process.
  • verbose (bool, optional) – Displays text information (if true).
Raises:

KeyError – If a task does not exist for a dataset.

Examples

>>> import dbcollection as dbc

Process the CIFAR10 dataset’s metadata for the classification task.

>>> dbc.process('cifar10', task='classification', verbose=False)

load

dbcollection.core.api.load.load(name, task='default', data_dir='', verbose=True)[source]

Returns a metadata loader of a dataset.

Returns a loader with the necessary functions to manage the selected dataset.

Parameters:
  • name (str) – Name of the dataset.
  • task (str, optional) – Name of the task to load.
  • data_dir (str, optional) – Directory path to store the downloaded data.
  • verbose (bool, optional) – Displays text information (if true).
Returns:

Data loader class.

Return type:

DataLoader

Raises:

Exception – If dataset is not available for loading.

Examples

Load the MNIST dataset.

>>> import dbcollection as dbc
>>> mnist = dbc.load('mnist')
>>> print('Dataset name: ', mnist.db_name)
Dataset name:  mnist

add

dbcollection.core.api.add.add(name, task, data_dir, hdf5_filename, categories=(), verbose=True, force_overwrite=False)[source]

Add a dataset/task to the list of available datasets for loading.

Parameters:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task to load.
  • data_dir (str) – Path of the stored data on disk.
  • hdf5_filename (str) – Path to the metadata HDF5 file.
  • categories (list, optional) – List of keyword strings to categorize the dataset.
  • verbose (bool, optional) – Displays text information (if true).
  • force_overwrite (bool, optional) – Forces the overwrite of data in the cache.

Examples

Add a dataset manually to dbcollection.

>>> import dbcollection as dbc
>>> dbc.add('new_db', 'new_task', 'new/path/db', 'newdb.h5', ['new_category'])
>>> dbc.query('new_db')
{'new_db': {'tasks': {'new_task': 'newdb.h5'}, 'data_dir': 'new/path/db', 'keywords':
['new_category']}}

remove

dbcollection.core.api.remove.remove(name, task='', delete_data=False, verbose=True)[source]

Remove/delete a dataset and/or task from the cache.

Removes the dataset’s information registry from the dbcollection.json cache file. By default, the dataset’s data files remain on disk; set the ‘delete_data’ input arg to ‘True’ to remove them as well.

Moreover, to remove only a single task of the dataset from the cache registry, specify the task name to be deleted. This removes only that task entry for the dataset. Note that deleting a task also removes the associated HDF5 metadata file from disk.

Parameters:
  • name (str) – Name of the dataset to delete.
  • task (str, optional) – Name of the task to delete.
  • delete_data (bool, optional) – Delete all data files from disk for this dataset if True.
  • verbose (bool, optional) – Displays text information (if true).

Examples

Remove a dataset from the list.

>>> import dbcollection as dbc
>>> # add a dataset
>>> dbc.add('new_db', 'new_task', 'new/path/db', 'newdb.h5', ['new_category'])
>>> dbc.query('new_db')
{'new_db': {'tasks': {'new_task': 'newdb.h5'}, 'data_dir': 'new/path/db',
'keywords': ['new_category']}}
>>> dbc.remove('new_db')  # remove the dataset
Removed 'new_db' dataset: cache=True, disk=False
>>> dbc.query('new_db')  # check if the dataset info was removed (retrieves an empty dict)
{}

config_cache

query
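
Although not detailed here, query’s usage appears in the add and remove examples above: it searches the cache for a given dataset name or keyword and returns the matching entries (an empty dict when nothing matches).

>>> import dbcollection as dbc
>>> dbc.query('mnist')  # dict of matching cache entries; {} when absent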

info_cache

info_datasets

fetch_list_datasets

Classes

DownloadAPI

class dbcollection.core.api.download.DownloadAPI(name, data_dir, extract_data, verbose)[source]

Dataset download API class.

This class contains methods to correctly download a dataset’s data files to disk.

Parameters:
  • name (str) – Name of the dataset.
  • data_dir (str) – Directory path to store the downloaded data.
  • extract_data (bool) – Extracts/unpacks the data files (if true).
  • verbose (bool) – Displays text information (if true).
Variables:
  • name (str) – Name of the dataset.
  • data_dir (str) – Directory path to store the downloaded data.
  • save_data_dir (str) – Data files save dir path.
  • save_cache_dir (str) – Cache save dir path.
  • extract_data (bool) – Flag to extract data (if True).
  • verbose (bool) – Flag to display text information (if true).
  • cache_manager (CacheManager) – Cache manager object.
create_dir(path)[source]

Create a directory on disk.

download_dataset()[source]

Download the dataset to disk.

get_download_data_dir_from_cache()[source]

Create a dir path from the cache information for this dataset.

run()[source]

Main method.

update_cache()[source]

Update the cache manager information for this dataset.
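
The download() function documented above presumably delegates to this class. A hedged sketch of equivalent direct use (the construct-then-run pattern is an assumption based on the run() method, and the other API classes below should follow the same pattern):

>>> from dbcollection.core.api.download import DownloadAPI
>>> api = DownloadAPI(name='cifar10', data_dir='', extract_data=True, verbose=True)
>>> api.run()  # download the data, extract it, and update the cache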

ProcessAPI

class dbcollection.core.api.process.ProcessAPI(name, task, verbose)[source]

Dataset metadata process API class.

This class contains methods to correctly process the dataset’s data files and store their metadata to disk.

Parameters:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task to process.
  • verbose (bool) – Displays text information (if true).
Variables:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task to process.
  • verbose (bool) – Displays text information (if true).
  • extract_data (bool) – Flag to extract data (if True).
  • cache_manager (CacheManager) – Cache manager object.
Raises:

KeyError – If a task does not exist for a dataset.

check_if_task_exists_in_database(task)[source]

Check if task exists in the list of available tasks for processing.

create_dir(path)[source]

Create a directory on disk.

exists_task(task)[source]

Checks if a task exists for a dataset.

get_default_task()[source]

Returns the default task for this dataset.

parse_task_name(task)[source]

Parse the input task string.

process_dataset()[source]

Process the dataset’s metadata.

run()[source]

Main method.

update_cache(task_info)[source]

Update the cache manager information for this dataset.
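
As with DownloadAPI, a direct-use sketch under the same construct-then-run assumption:

>>> from dbcollection.core.api.process import ProcessAPI
>>> ProcessAPI(name='cifar10', task='classification', verbose=False).run()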

LoadAPI

class dbcollection.core.api.load.LoadAPI(name, task, data_dir, verbose)[source]

Dataset load API class.

This class contains methods to correctly load a dataset’s metadata as a data loader object.

Parameters:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task to load.
  • data_dir (str) – Directory path to store the downloaded data.
  • verbose (bool) – Displays text information (if true).
Variables:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task to load.
  • data_dir (str) – Directory path to store the downloaded data.
  • verbose (bool) – Displays text information (if true).
  • cache_manager (CacheManager) – Cache manager object.
  • available_datasets_list (list) – List of available dataset names for download.
download_dataset()[source]

Download the dataset to disk.

get_data_loader()[source]

Return a DataLoader object.

parse_task_name(task)[source]

Validate the task name.

process_dataset()[source]

Process the dataset’s metadata.

run()[source]

Main method.
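
A direct-use sketch; that run() returns the DataLoader built by get_data_loader() is an assumption:

>>> from dbcollection.core.api.load import LoadAPI
>>> mnist = LoadAPI(name='mnist', task='default', data_dir='', verbose=True).run()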

AddAPI

class dbcollection.core.api.add.AddAPI(name, task, data_dir, hdf5_filename, categories, verbose, force_overwrite)[source]

Add dataset API class.

This class contains methods to correctly register a dataset in the cache.

Parameters:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task to load.
  • data_dir (str) – Path of the stored data on disk.
  • hdf5_filename (str) – Path to the metadata HDF5 file.
  • categories (tuple) – Tuple of keyword strings to categorize the dataset.
  • verbose (bool) – Displays text information.
  • force_overwrite (bool) – Forces the overwrite of data in the cache.
Variables:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task to load.
  • data_dir (str) – Path of the stored data on disk.
  • hdf5_filename (str) – Path to the metadata HDF5 file.
  • categories (tuple) – Tuple of keyword strings to categorize the dataset.
  • verbose (bool) – Displays text information.
  • force_overwrite (bool) – Forces the overwrite of data in the cache.
  • cache_manager (CacheManager) – Cache manager object.
run()[source]

Main method.

RemoveAPI

class dbcollection.core.api.remove.RemoveAPI(name, task, delete_data, verbose)[source]

Dataset remove API class.

This class contains methods to remove a dataset registry from cache. Also, it can remove the dataset’s files from disk if needed.

Parameters:
  • name (str) – Name of the dataset to delete.
  • task (str, optional) – Name of the task to delete.
  • delete_data (bool) – Delete all data files from disk for this dataset if True.
Variables:
  • name (str) – Name of the dataset to delete.
  • task (str) – Name of the task to delete.
  • delete_data (bool) – Delete all data files from disk for this dataset if True.
  • cache_manager (CacheManager) – Cache manager object.
exists_dataset()[source]

Returns True if the dataset name exists in the cache.

print_msg_registry_removal()[source]

Prints the success message to the screen.

remove_dataset()[source]

Removes the dataset from cache (and disk if selected).

remove_dataset_data_files_from_disk()[source]

Removes the directory containing the data files from disk.

remove_dataset_entry_from_cache()[source]

Removes the dataset registry from cache.

remove_dataset_registry()[source]

Removes the dataset registry from cache.

remove_registry_from_cache()[source]

Remove the dataset or task from cache.

remove_task_registry()[source]

Remove the task registry for this dataset from cache.

run()[source]

Main method.

ConfigAPI

QueryAPI

InfoCacheAPI

InfoDatasetAPI

Cache management

CacheManager

Data loading

Dataset’s metadata loader classes.

DataLoader

class dbcollection.core.loader.DataLoader(name, task, data_dir, hdf5_filepath)[source]

Dataset metadata loader class.

This class contains several methods to fetch data from a hdf5 file by using simple, easy to use functions for (meta)data handling.

Parameters:
  • name (str) – Name of the dataset.
  • task (str) – Name of the task.
  • data_dir (str) – Path of the dataset’s data directory on disk.
  • hdf5_filepath (str) – Path of the metadata cache file stored on disk.
Variables:
  • db_name (str) – Name of the dataset.
  • task (str) – Name of the task.
  • data_dir (str) – Path of the dataset’s data directory on disk.
  • hdf5_filepath (str) – Path of the hdf5 metadata file stored on disk.
  • hdf5_file (h5py._hl.files.File) – hdf5 file object handler.
  • root_path (str) – Default data group of the hdf5 file.
  • sets (tuple) – List of names of set splits (e.g. train, test, val, etc.)
  • object_fields (dict) – Data field names for each set split.
get(set_name, field, index=None, convert_to_str=False)[source]

Retrieves data from the dataset’s hdf5 metadata file.

This method retrieves the data at the given index from the hdf5 file for the given ‘field’ name. It is also possible to retrieve multiple values by passing a list/tuple of indexes.

Parameters:
  • set_name (str) – Name of the set.
  • field (str) – Name of the data field.
  • index (int/list/tuple, optional) – Index number of the field. If it is a list, returns the data for all the value indexes of that list.
  • convert_to_str (bool, optional) – Convert the output data into a string. Warning: output must be of type np.uint8
Returns:

Numpy array containing the field’s data. If convert_to_str is set to True, it returns a string or list of strings.

Return type:

np.ndarray/list/str

Raises:

KeyError – If set name is not valid or does not exist.
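
For instance, reusing the MNIST loader from the load() example above (the ‘labels’ field name is an assumption for illustration):

>>> labels = mnist.get('train', 'labels')            # whole field as a numpy array
>>> first = mnist.get('train', 'labels', 0)          # a single entry
>>> a_few = mnist.get('train', 'labels', [0, 2, 5])  # several entries at once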

info(set_name=None)[source]

Prints information about all data fields of a set.

Displays information of all fields of a set group inside the hdf5 metadata file. This information contains the name of the field, the size/shape of the data, the data type, and whether the field is contained in the ‘object_ids’ list.

If no ‘set_name’ is provided, it displays information for all available sets.

This method only shows the most useful information about a set’s internals, which should be enough for most users to determine how to use/handle a specific dataset with little effort.

Parameters: set_name (str, optional) – Name of the set.
Raises: KeyError – If set name is not valid or does not exist.
list(set_name=None)[source]

List of all field names of a set.

Parameters: set_name (str, optional) – Name of the set.
Returns: List of all data fields of the dataset.
Return type: list/dict
Raises: KeyError – If set name is not valid or does not exist.
object(set_name, index=None, convert_to_value=False)[source]

Retrieves a list of all fields’ indexes/values of an object composition.

Retrieves the data’s ids or contents of all fields of an object.

It works like calling the get() method for each individual field and then grouping all values into a list that follows the order of the fields.

Parameters:
  • set_name (str) – Name of the set.
  • index (int/list/tuple, optional) – Index number of the field. If it is a list, returns the data for all the value indexes of that list. If no index is used, it returns the entire data field array.
  • convert_to_value (bool, optional) – If False, outputs a list of indexes. If True, it outputs a list of arrays/values instead of indexes.
Returns:

List of indexes of the data fields available in ‘object_fields’. If convert_to_value is set to True, it returns a list of data instead of indexes.

Return type:

list

Raises:

KeyError – If set name is not valid or does not exist.
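
For example, with the same hypothetical MNIST loader:

>>> ids = mnist.object('train', 0)         # per-field indexes of the first object
>>> vals = mnist.object('train', 0, True)  # the corresponding values instead of indexes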

object_field_id(set_name, field)[source]

Retrieves the index position of a field in the ‘object_ids’ list.

This method returns the position of a field in the ‘object_ids’ object. If the field is not contained in this object, it returns a null value.

Parameters:
  • set_name (str) – Name of the set.
  • field (str) – Name of the field in the metadata file.
Returns:

Index of the field in the ‘object_ids’ list.

Return type:

int

Raises:

KeyError – If set name is not valid or does not exist.

size(set_name=None, field='object_ids')[source]

Size of a field.

Returns the number of elements of a field.

Parameters:
  • set_name (str, optional) – Name of the set.
  • field (str, optional) – Name of the field in the metadata file.
Returns:

Returns the size of a field.

Return type:

list/dict

Raises:

KeyError – If set name is not valid or does not exist.
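
For example (the 'labels' field name is again an assumption):

>>> mnist.size('train')            # size of the default 'object_ids' field
>>> mnist.size('train', 'labels')  # size of a specific field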

SetLoader

class dbcollection.core.loader.SetLoader(hdf5_group)[source]

Set metadata loader class.

This class contains several methods to fetch data from a specific set (group) in a hdf5 file, along with useful information about that group.

Parameters:

hdf5_group (h5py._hl.group.Group) – hdf5 group object handler.

Variables:
  • hdf5_group (h5py._hl.group.Group) – hdf5 group object handler.
  • set (str) – Name of the set.
  • fields (tuple) – List of all field names of the set.
  • object_fields (tuple) – List of all field names of the set contained by the ‘object_ids’ list.
  • nelems (int) – Number of rows in ‘object_ids’.
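
SetLoader objects are normally created internally by DataLoader, but one can be built straight from an h5py group. A sketch, assuming each set is stored as a group in the metadata file (the file name and group path are hypothetical):

>>> import h5py
>>> from dbcollection.core.loader import SetLoader
>>> hdf5 = h5py.File('mnist_classification.h5', 'r')  # hypothetical file
>>> train = SetLoader(hdf5['train'])                   # hypothetical group path
>>> len(train)  # number of rows in 'object_ids'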
__len__()[source]
Returns: Number of elements
Return type: int
get(field, index=None, convert_to_str=False)[source]

Retrieves data from the dataset’s hdf5 metadata file.

This method retrieves the data at the given index from the hdf5 file for the given ‘field’ name. It is also possible to retrieve multiple values by passing a list/tuple of indexes.

Parameters:
  • field (str) – Field name.
  • index (int/list/tuple, optional) – Index number of the field. If it is a list, returns the data for all the value indexes of that list.
  • convert_to_str (bool, optional) – Convert the output data into a string. Warning: output must be of type np.uint8
Returns:

Numpy array containing the field’s data. If convert_to_str is set to True, it returns a string or list of strings.

Return type:

np.ndarray/list/str

Raises:

KeyError – If the field does not exist in the list.

info()[source]

Prints information about the data fields of a set.

Displays information about all fields available in the set, such as the field name and the size and shape of the data.

This method provides the necessary information about the set’s internals to help determine how to use/handle a specific field.

list()[source]

List of all field names.

Returns: List of all data fields of the dataset.
Return type: list
object(index=None, convert_to_value=False)[source]

Retrieves a list of all fields’ indexes/values of an object composition.

Retrieves the data’s ids or contents of all fields of an object.

It works like calling the get() method for each individual field and then grouping all values into a list that follows the order of the fields.

Parameters:
  • index (int/list/tuple, optional) – Index number of the field. If it is a list, returns the data for all the value indexes of that list. If no index is used, it returns the entire data field array.
  • convert_to_value (bool, optional) – If False, outputs a list of indexes. If True, it outputs a list of arrays/values instead of indexes.
Returns:

Returns a list of indexes or, if convert_to_value is True, a list of data arrays/values.

Return type:

list

object_field_id(field)[source]

Retrieves the index position of a field in the ‘object_ids’ list.

This method returns the position of a field in the ‘object_ids’ object. If the field is not contained in this object, it returns a null value.

Parameters: field (str) – Name of the field in the metadata file.
Returns: Index of the field in the ‘object_ids’ list.
Return type: int
Raises: KeyError – If the field does not exist in the list of object fields.
size(field='object_ids')[source]

Size of a field.

Returns the number of elements of a field.

Parameters: field (str, optional) – Name of the field in the metadata file.
Returns: Returns the size of the field.
Return type: tuple
Raises: KeyError – If field is invalid or does not exist in the fields dict.

FieldLoader

class dbcollection.core.loader.FieldLoader(hdf5_field, obj_id=None)[source]

Field metadata loader class.

This class contains several methods to fetch data from a specific field of a set (group) in a hdf5 file, along with useful information about that field.

Parameters:
  • hdf5_field (h5py._hl.dataset.Dataset) – hdf5 field object handler.
  • obj_id (int, optional) – Position of the field in ‘object_fields’.
Variables:
  • data (h5py._hl.dataset.Dataset) – hdf5 dataset object handler.
  • set (str) – Name of the set.
  • name (str) – Name of the field.
  • type (type) – Type of the field’s data.
  • shape (tuple) – Shape of the field’s data.
  • fillvalue (int) – Value used to pad arrays when storing the data in the hdf5 file.
  • obj_id (int) – Identifier of the field if contained in the ‘object_ids’ list.
__getitem__(index)[source]
Parameters: index (int) – Index
Returns: Numpy data array.
Return type: np.ndarray
__len__()[source]
Returns: Number of samples
Return type: int
get(index=None, convert_to_str=False)[source]

Retrieves data of the field from the dataset’s hdf5 metadata file.

This method retrieves the data at the given index from the hdf5 file. It is also possible to retrieve multiple values by passing a list/tuple of indexes.

Parameters:
  • index (int/list/tuple, optional) – Index number of the field. If it is a list, returns the data for all the value indexes of that list.
  • convert_to_str (bool, optional) – Convert the output data into a string. Warning: output must be of type np.uint8
Returns:

Numpy array containing the field’s data. If convert_to_str is set to True, it returns a string or list of strings.

Return type:

np.ndarray/list/str

Note

When using lists/tuples of indexes, this method sorts the list and removes duplicate values. This is because the h5py API requires the indexing elements to be in increasing order when retrieving data.
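
A short illustration of this behavior, for some FieldLoader instance field (hypothetical):

>>> field.get([5, 2, 2])  # handled internally as [2, 5]: sorted, duplicates removed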

info(verbose=True)[source]

Prints information about the field.

Displays information like name, size and shape of the field.

Parameters: verbose (bool, optional) – If true, display extra information about the field.
object_field_id()[source]

Retrieves the index position of the field in the ‘object_ids’ list.

This method returns the position of the field in the ‘object_ids’ object. If the field is not contained in this object, it returns a null value.

Returns: Index of the field in the ‘object_ids’ list.
Return type: int
size()[source]

Size of the field.

Returns the number of elements of the field.

Returns: Returns the size of the field.
Return type: tuple
to_memory

Modifies how data is accessed and stored.

Accessing data from a field can be done in two ways: from memory or from disk. The user selects between the two by setting a boolean. If set to True, data is allocated to a numpy ndarray and all accesses are done in memory. Otherwise, data is kept on disk and accesses are done through the HDF5 object handler.
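
For instance, given some FieldLoader instance field (hypothetical), toggling the flag might look like:

>>> field.to_memory = True   # data is loaded into a numpy ndarray; reads now come from memory
>>> field.to_memory = False  # data is read through the HDF5 object handler again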