ccdtools.catalog.DataCatalog

class ccdtools.catalog.DataCatalog(yaml_path=None)

Bases: object

A catalog for managing and loading datasets with versioning and subdataset support.

This class loads dataset configuration from a YAML file and provides methods to list, search, and load datasets with support for multiple versions and subdatasets.

config_file

Path to the YAML configuration file.

Type:

pathlib.Path or str

config

Parsed YAML configuration content.

Type:

dict

datasets

DataFrame listing all datasets, versions, and subdatasets with their metadata.

Type:

pandas.DataFrame

_df_summary

Subset or summary of the datasets DataFrame, used for display purposes (e.g., in _repr_html_). This may reflect filtered search results, or the full catalog if no filtering has been applied. By default, it is set to self.datasets.

Type:

pandas.DataFrame

Examples

Initialise and view a catalog

>>> catalog = DataCatalog()
>>> catalog

Perform a search and view the filtered catalog

>>> filtered_catalog = catalog.search('temperature')
>>> filtered_catalog

__init__(yaml_path=None)

Initialize a DataCatalog.

This constructor initializes a DataCatalog from a given YAML file. By default, the packaged config/datasets.yaml file is used.

Parameters:

yaml_path (pathlib.Path or str, optional) – Path to the YAML configuration file. If not provided, the default packaged configuration file is used. Default is None.

Raises:

FileNotFoundError – If the provided yaml_path does not exist.

Notes

The YAML file should contain a datasets key with dataset configurations.
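
The requirement above can be checked on an already-parsed configuration before it is handed to the catalog. The following is a minimal sketch, not part of ccdtools: the helper name check_catalog_config is hypothetical, and the only fact taken from the note is that a top-level datasets key is expected.

```python
def check_catalog_config(config):
    """Return True if the parsed YAML mapping has a top-level 'datasets' key.

    Hypothetical helper for illustration; not part of ccdtools.
    """
    return isinstance(config, dict) and "datasets" in config


# Stand-ins for the result of parsing a YAML file (contents are invented).
valid_config = {"datasets": {}}
invalid_config = {"other_key": {}}

print(check_catalog_config(valid_config))    # True
print(check_catalog_config(invalid_config))  # False
```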

Methods

__init__([yaml_path])

Initialize a DataCatalog.

available_resolutions(dataset[, version, ...])

Show available resolutions for a given dataset/version/subdataset.

available_subdatasets(dataset[, version])

Show available subdatasets for a given dataset and version.

available_versions(dataset)

Show available versions for a given dataset.

help([dataset, version])

Describe available datasets and their supported options without loading data.

load_dataset(dataset[, version, subdataset])

Load any dataset by name/version/subdataset with optional directory filtering.

search(keyword)

Search datasets by keyword in dataset name, display name, or tags.

available_resolutions(dataset, version=None, subdataset=None)

Show available resolutions for a given dataset/version/subdataset.

Parameters:
  • dataset (str) – Name of the dataset.

  • version (str, optional) – Version of the dataset. If not provided, the latest version is used. Default is None.

  • subdataset (str, optional) – Name of the subdataset. Default is None.

Returns:

List of available resolutions, or None if no resolutions are defined.

Return type:

list of str or None

Raises:

KeyError – If no matching dataset entry is found.

Warns:

UserWarning – If no resolutions are defined for the specified dataset/version/subdataset.

available_subdatasets(dataset, version=None)

Show available subdatasets for a given dataset and version.

Parameters:
  • dataset (str) – Name of the dataset.

  • version (str, optional) – Version of the dataset. If not provided, the latest version is used. Default is None.

Returns:

List of available subdataset names, or None if no subdatasets are defined for the dataset.

Return type:

list of str or None

Raises:

ValueError – If no versions are found for the dataset.

Warns:

UserWarning – If no subdatasets are defined for the specified dataset and version.
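
Because this method can return None and emit a UserWarning, callers that want to iterate over the result may find a small wrapper convenient. The sketch below is an illustration only: subdatasets_or_empty and _StubCatalog are hypothetical names, and the stub stands in for DataCatalog so the example runs without ccdtools installed. It relies solely on the behaviour documented above.

```python
import warnings


def subdatasets_or_empty(catalog, dataset, version=None):
    """Return the subdataset names as a list, or [] when none are defined.

    Hypothetical wrapper; it assumes only the documented behaviour of
    available_subdatasets (a list of str, or None plus a UserWarning).
    """
    with warnings.catch_warnings():
        # Silence the documented UserWarning; the None result is handled below.
        warnings.simplefilter("ignore", UserWarning)
        names = catalog.available_subdatasets(dataset, version=version)
    return [] if names is None else list(names)


class _StubCatalog:
    """Stand-in for DataCatalog so the sketch runs without ccdtools."""

    def available_subdatasets(self, dataset, version=None):
        if dataset == "with_subs":
            return ["sub1", "sub2"]
        warnings.warn("No subdatasets defined", UserWarning)
        return None


stub = _StubCatalog()
print(subdatasets_or_empty(stub, "with_subs"))  # ['sub1', 'sub2']
print(subdatasets_or_empty(stub, "no_subs"))    # []
```

The same pattern would apply to available_resolutions, which documents the same None-plus-warning behaviour.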

available_versions(dataset)

Show available versions for a given dataset.

Parameters:

dataset (str) – Name of the dataset.

Returns:

List of available version names.

Return type:

list of str

help(dataset=None, version=None)

Describe available datasets and their supported options without loading data.

This method provides information about available datasets, versions, subdatasets, and supported loading options. When called without arguments, it lists all available datasets. When a dataset is specified, it lists available versions. When both dataset and version are specified, it provides detailed information about the dataset/version combination including available subdatasets and supported keywords.

Parameters:
  • dataset (str, optional) – Name of the dataset to describe. If not provided, lists all available datasets. Default is None.

  • version (str, optional) – Version of the dataset to describe. If not provided, lists available versions for the specified dataset. Default is None.

Returns:

This method prints information to stdout about datasets, versions, subdatasets, and supported catalog keywords. No value is returned.

Return type:

None

Raises:
  • KeyError – If the specified dataset does not exist.

  • KeyError – If the specified version does not exist for the given dataset.

Examples

List all available datasets:

>>> catalog.help()

List versions for a specific dataset:

>>> catalog.help(dataset='dataset_name')

Show detailed information for a dataset version:

>>> catalog.help(dataset='dataset_name', version='v1')

Notes

The method provides hints for further exploration and example usage based on the available metadata for the dataset/version combination.

load_dataset(dataset, version=None, subdataset=None, **kwargs)

Load any dataset by name/version/subdataset with optional directory filtering.

Parameters:
  • dataset (str) – Name of the dataset to load.

  • version (str, optional) – Version of the dataset to load. If not provided, the latest version is used. Default is None.

  • subdataset (str, optional) – Name of the subdataset to load (if applicable). Default is None.

  • **kwargs

    Additional keyword arguments to pass to the loader function. Common options include:

    • resolution (str, optional) – Resolution to load (if supported by the dataset).

    • static (bool, optional) – Whether to load static files (if supported by the dataset).

Returns:

Loaded dataset in the appropriate format determined by the loader function.

Return type:

pandas.DataFrame, geopandas.GeoDataFrame, or xarray.Dataset

Raises:
  • KeyError – If no matching dataset entry is found.

  • TypeError – If subdataset is specified for a dataset that does not define subdatasets.

  • ValueError – If multiple entries match the criteria or if multiple subdatasets exist and none is specified.

Examples

Load the latest version of a dataset:

>>> data = catalog.load_dataset('dataset_name')

Load a specific version:

>>> data = catalog.load_dataset('dataset_name', version='v1')

Load a specific subdataset:

>>> data = catalog.load_dataset('dataset_name', version='v1', subdataset='sub1')
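
The exceptions listed under Raises can be handled separately when a script should report a problem and continue. The following is a sketch under stated assumptions: try_load and _StubCatalog are hypothetical, the stub only imitates the documented exceptions, and the returned messages are invented for illustration.

```python
def try_load(catalog, dataset, version=None, subdataset=None, **kwargs):
    """Call load_dataset and return a (data, error_message) pair.

    Hypothetical helper; it catches only the exceptions documented above.
    """
    try:
        data = catalog.load_dataset(
            dataset, version=version, subdataset=subdataset, **kwargs
        )
    except KeyError as exc:
        return None, f"no matching entry: {exc}"
    except TypeError as exc:
        return None, f"subdataset not supported: {exc}"
    except ValueError as exc:
        return None, f"ambiguous request: {exc}"
    return data, None


class _StubCatalog:
    """Stand-in for DataCatalog so the sketch runs without ccdtools."""

    def load_dataset(self, dataset, version=None, subdataset=None, **kwargs):
        if dataset != "dataset_name":
            raise KeyError(dataset)
        if subdataset is None:
            raise ValueError("multiple subdatasets exist; specify one")
        return {"dataset": dataset, "subdataset": subdataset}


stub = _StubCatalog()
data, error = try_load(stub, "dataset_name", subdataset="sub1")
missing_data, missing_error = try_load(stub, "unknown_name")
ambiguous_data, ambiguous_error = try_load(stub, "dataset_name")
```

Note that a TypeError can also arise from an unsupported keyword argument, so the message attached to that branch is a guess at the cause rather than a guarantee.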

search(keyword)

Search datasets by keyword in dataset name, display name, or tags.

Accepts either a single string or a list of keywords and returns a DataFrame of datasets matching any of the provided keywords.

Parameters:

keyword (str or list of str) – Keyword(s) to search for in dataset name, display name, or tags.

Returns:

DataFrame of datasets matching any of the keywords.

Return type:

pandas.DataFrame

Examples

Search for a single keyword:

>>> results = catalog.search('temperature')

Search for multiple keywords:

>>> results = catalog.search(['temperature', 'precipitation'])
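
The "matches any keyword" behaviour described above can be illustrated with plain Python. This is a sketch, not the library's implementation: the field names (dataset, display_name, tags), the sample entries, and the use of case-insensitive substring matching are all assumptions made for the example.

```python
def matches_any(entry, keywords):
    """Return True if any keyword occurs in the entry's name, display name, or tags.

    Hypothetical helper; case-insensitive substring matching is an assumption.
    """
    if isinstance(keywords, str):
        keywords = [keywords]  # accept a single string as well as a list
    fields = [entry["dataset"], entry["display_name"], *entry["tags"]]
    fields = [text.lower() for text in fields]
    return any(kw.lower() in text for kw in keywords for text in fields)


# Invented catalog entries, used only to exercise the helper.
entries = [
    {"dataset": "era_temp", "display_name": "Surface Temperature", "tags": ["climate"]},
    {"dataset": "rain_gauge", "display_name": "Rain Gauges", "tags": ["precipitation"]},
    {"dataset": "ice_mask", "display_name": "Ice Mask", "tags": ["boundary"]},
]

single = [e["dataset"] for e in entries if matches_any(e, "temperature")]
multiple = [
    e["dataset"] for e in entries if matches_any(e, ["temperature", "precipitation"])
]
print(single)    # ['era_temp']
print(multiple)  # ['era_temp', 'rain_gauge']
```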