ccdtools.catalog.DataCatalog
- class ccdtools.catalog.DataCatalog(yaml_path=None)
Bases: object
A catalog for managing and loading datasets with versioning and subdataset support.
This class loads dataset configuration from a YAML file and provides methods to list, search, and load datasets with support for multiple versions and subdatasets.
- config_file
Path to the YAML configuration file.
- Type:
pathlib.Path or str
- config
Parsed YAML configuration content.
- Type:
dict
- datasets
DataFrame listing all datasets, versions, and subdatasets with their metadata.
- Type:
pandas.DataFrame
- _df_summary
Subset or summary of the datasets DataFrame, used for display purposes (e.g., in _repr_html_). This may reflect filtered search results or the full catalog if no filtering has been applied. By default, it is the same as self.datasets.
- Type:
pandas.DataFrame
Examples
Initialise and view a catalog
>>> catalog = DataCatalog()
>>> catalog
Perform a search and view the filtered catalog
>>> filtered_catalog = catalog.search('temperature')
>>> filtered_catalog
- __init__(yaml_path=None)
Initialize a DataCatalog.
This constructor initializes a DataCatalog from a given YAML file. By default, the packaged config/datasets.yaml file is used.
- Parameters:
yaml_path (pathlib.Path or str, optional) – Path to the YAML configuration file. If not provided, the default packaged configuration file is used. Default is None.
- Raises:
FileNotFoundError – If the provided yaml_path does not exist.
Notes
The YAML file should contain a datasets key with dataset configurations.
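As an illustration of the note above, the sketch below shows one way a parsed configuration with a top-level datasets key could be flattened into one row per dataset/version/subdataset, similar in spirit to the datasets attribute. The nested layout, the key names versions and subdatasets, and the helper flatten_config are all assumptions made for this example; the actual schema used by ccdtools is not described here.

```python
# Hypothetical parsed YAML content; only the top-level "datasets" key
# is taken from the documentation, everything below it is assumed.
config = {
    "datasets": {
        "dataset_name": {
            "versions": {
                "v1": {"subdatasets": ["sub1", "sub2"]},
                "v2": {"subdatasets": []},
            }
        }
    }
}


def flatten_config(config):
    """Return one row (dict) per dataset/version/subdataset combination."""
    if "datasets" not in config:
        raise KeyError("configuration has no 'datasets' key")
    rows = []
    for name, entry in config["datasets"].items():
        for version, details in entry.get("versions", {}).items():
            # A version without subdatasets still yields a single row.
            for sub in details.get("subdatasets") or [None]:
                rows.append(
                    {"dataset": name, "version": version, "subdataset": sub}
                )
    return rows


rows = flatten_config(config)
for row in rows:
    print(row)
```

A flat list of rows like this converts directly into a table, which is why it is a convenient intermediate form for listing and filtering.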
Methods
__init__([yaml_path]) – Initialize a DataCatalog.
available_resolutions(dataset[, version, ...]) – Show available resolutions for a given dataset/version/subdataset.
available_subdatasets(dataset[, version]) – Show available subdatasets for a given dataset and version.
available_versions(dataset) – Show available versions for a given dataset.
help([dataset, version]) – Describe available datasets and their supported options without loading data.
load_dataset(dataset[, version, subdataset]) – Load any dataset by name/version/subdataset with optional directory filtering.
search(keyword) – Search datasets by keyword in dataset name, display name, or tags.
- available_resolutions(dataset, version=None, subdataset=None)
Show available resolutions for a given dataset/version/subdataset.
- Parameters:
dataset (str) – Name of the dataset.
version (str, optional) – Version of the dataset. If not provided, the latest version is used. Default is None.
subdataset (str, optional) – Name of the subdataset. Default is None.
- Returns:
List of available resolutions, or None if no resolutions are defined.
- Return type:
list of str
- Raises:
KeyError – If no matching dataset entry is found.
Warning
- UserWarning
If no resolutions are defined for the specified dataset/version/subdataset.
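Because a missing resolution list is signalled with a UserWarning and a None return value rather than an exception, callers may want to handle both cases. The sketch below uses a stand-in function, fake_available_resolutions, that only mimics the documented behaviour; it is not part of ccdtools, and the dataset names and resolution strings are made up. The same pattern should carry over to catalog.available_resolutions.

```python
import warnings


def fake_available_resolutions(dataset):
    """Stand-in mimicking the documented contract: warn and return None."""
    known = {"dataset_name": ["1km", "5km"], "no_res_dataset": None}
    if dataset not in known:
        raise KeyError(f"no matching dataset entry for {dataset!r}")
    if known[dataset] is None:
        warnings.warn(f"no resolutions defined for {dataset!r}", UserWarning)
    return known[dataset]


def resolutions_or_empty(dataset):
    """Return a list in all non-error cases, silencing the UserWarning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        result = fake_available_resolutions(dataset)
    return result if result is not None else []


print(resolutions_or_empty("dataset_name"))
print(resolutions_or_empty("no_res_dataset"))
```

Normalising None to an empty list lets calling code iterate over the result without a separate None check, while a KeyError for an unknown dataset still propagates.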
- available_subdatasets(dataset, version=None)
Show available subdatasets for a given dataset and version.
- Parameters:
dataset (str) – Name of the dataset.
version (str, optional) – Version of the dataset. If not provided, the latest version is used. Default is None.
- Returns:
List of available subdataset names, or None if no subdatasets are defined for the dataset.
- Return type:
list of str
- Raises:
ValueError – If no versions are found for the dataset.
Warning
- UserWarning
If no subdatasets are defined for the specified dataset and version.
- available_versions(dataset)
Show available versions for a given dataset.
- Parameters:
dataset (str) – Name of the dataset.
- Returns:
List of available version names.
- Return type:
list of str
- help(dataset=None, version=None)
Describe available datasets and their supported options without loading data.
This method provides information about available datasets, versions, subdatasets, and supported loading options. When called without arguments, it lists all available datasets. When a dataset is specified, it lists available versions. When both dataset and version are specified, it provides detailed information about the dataset/version combination including available subdatasets and supported keywords.
- Parameters:
dataset (str, optional) – Name of the dataset to describe. If not provided, lists all available datasets. Default is None.
version (str, optional) – Version of the dataset to describe. If not provided, lists available versions for the specified dataset. Default is None.
- Returns:
This method prints information to stdout about datasets, versions, subdatasets, and supported catalog keywords. No value is returned.
- Return type:
None
- Raises:
KeyError – If the specified dataset does not exist.
KeyError – If the specified version does not exist for the given dataset.
Examples
List all available datasets:
>>> catalog.help()
List versions for a specific dataset:
>>> catalog.help(dataset='dataset_name')
Show detailed information for a dataset version:
>>> catalog.help(dataset='dataset_name', version='v1')
Notes
The method provides hints for further exploration and example usage based on the available metadata for the dataset/version combination.
- load_dataset(dataset, version=None, subdataset=None, **kwargs)
Load any dataset by name/version/subdataset with optional directory filtering.
- Parameters:
dataset (str) – Name of the dataset to load.
version (str, optional) – Version of the dataset to load. If not provided, the latest version is used. Default is None.
subdataset (str, optional) – Name of the subdataset to load (if applicable). Default is None.
**kwargs – Additional keyword arguments to pass to the loader function. Common options include:
resolution (str, optional) – Resolution to load (if supported by the dataset).
static (bool, optional) – Whether to load static files (if supported by the dataset).
- Returns:
Loaded dataset in the appropriate format determined by the loader function.
- Return type:
pandas.DataFrame, geopandas.GeoDataFrame, or xarray.Dataset
- Raises:
KeyError – If no matching dataset entry is found.
TypeError – If subdataset is specified for a dataset that does not define subdatasets.
ValueError – If multiple entries match the criteria or if multiple subdatasets exist and none is specified.
Examples
Load the latest version of a dataset:
>>> data = catalog.load_dataset('dataset_name')
Load a specific version:
>>> data = catalog.load_dataset('dataset_name', version='v1')
Load a specific subdataset:
>>> data = catalog.load_dataset('dataset_name', version='v1', subdataset='sub1')
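The selection rules and error cases documented for load_dataset can be pictured with a small stand-alone sketch. It is not the ccdtools implementation: the ROWS table, the resolve_entry helper, and the rule that the last listed version counts as the latest are assumptions made purely for illustration.

```python
# Hypothetical catalog rows: (dataset, version, subdataset).
ROWS = [
    ("single", "v1", None),
    ("single", "v2", None),
    ("multi", "v1", "sub1"),
    ("multi", "v1", "sub2"),
]


def resolve_entry(dataset, version=None, subdataset=None):
    """Pick exactly one row, mirroring the documented error cases."""
    matches = [r for r in ROWS if r[0] == dataset]
    if not matches:
        raise KeyError(f"no matching dataset entry for {dataset!r}")
    if version is None:
        # Assumption: the last listed version is the latest one.
        version = matches[-1][1]
    matches = [r for r in matches if r[1] == version]
    if not matches:
        raise KeyError(f"no entry for {dataset!r} version {version!r}")
    defined = [r[2] for r in matches if r[2] is not None]
    if subdataset is not None:
        if not defined:
            raise TypeError(f"{dataset!r} does not define subdatasets")
        matches = [r for r in matches if r[2] == subdataset]
        if not matches:
            raise KeyError(f"no subdataset {subdataset!r} in {dataset!r}")
    elif len(defined) > 1:
        raise ValueError(f"specify one of the subdatasets: {defined}")
    return matches[0]


print(resolve_entry("single"))
print(resolve_entry("multi", version="v1", subdataset="sub1"))
```

In this sketch, calling resolve_entry("multi") without a subdataset raises ValueError, and passing a subdataset for "single" raises TypeError, matching the two documented cases.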
- search(keyword)
Search datasets by keyword in dataset name, display name, or tags.
Accepts either a single string or a list of keywords and returns a DataFrame of datasets matching any of the provided keywords.
- Parameters:
keyword (str or list of str) – Keyword(s) to search for in dataset name, display name, or tags.
- Returns:
DataFrame of datasets matching any of the keywords.
- Return type:
pandas.DataFrame
Examples
Search for a single keyword:
>>> results = catalog.search('temperature')
Search for multiple keywords:
>>> results = catalog.search(['temperature', 'precipitation'])
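The "matches any keyword" behaviour can be sketched without the library. The example below is a simplified stand-in, not the ccdtools code: the ENTRIES list, the search_entries helper, and the choice of case-insensitive substring matching are assumptions, since the matching rule itself is not specified above.

```python
# Hypothetical catalog entries; the three searched fields follow the
# description above, the values are made up.
ENTRIES = [
    {"dataset": "era_temp", "display_name": "Air Temperature",
     "tags": ["climate"]},
    {"dataset": "rain_grid", "display_name": "Gridded Rainfall",
     "tags": ["precipitation"]},
    {"dataset": "ice_mask", "display_name": "Ice Mask",
     "tags": ["geometry"]},
]


def search_entries(keyword):
    """Return entries whose name, display name, or tags match ANY keyword."""
    # Accept a single string or a list of strings.
    keywords = [keyword] if isinstance(keyword, str) else list(keyword)
    keywords = [k.lower() for k in keywords]
    results = []
    for entry in ENTRIES:
        haystack = [entry["dataset"], entry["display_name"], *entry["tags"]]
        haystack = [text.lower() for text in haystack]
        # Assumption: case-insensitive substring match.
        if any(k in text for k in keywords for text in haystack):
            results.append(entry)
    return results


single = [e["dataset"] for e in search_entries("temperature")]
multiple = [e["dataset"] for e in search_entries(["temperature", "precipitation"])]
print(single)
print(multiple)
```

Wrapping a single string in a list up front means the same loop handles both call styles shown in the examples.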