Parquet Catalogs

IRSA provides various catalogs and tables as spatially partitioned Parquet datasets. These support efficient access for a wide range of common use cases and are especially well suited to machine learning, statistics, and other large-scale approaches. Parquet can be queried directly using common libraries in Python and other languages, and the files lend themselves to parallel processing. IRSA uses HATS and HEALPix (described below) for the partitioning, providing additional efficiency for spatially oriented use cases. The datasets can be accessed in two ways:

  1. Query the cloud copy (recommended). The Cloud Data Access page lists bucket and path information for all of IRSA's cloud holdings including HATS and HEALPix Parquet catalogs. The files can be efficiently queried in place (no need to download them first) even in highly parallel workflows. See Python Notebook Tutorials: Accessing IRSA's cloud holdings for examples of how to navigate and query Parquet datasets.
  2. Bulk download scripts are provided for those who prefer to download the files to local storage before querying.

Non-Parquet versions of these catalogs are also available and can be accessed several ways, including IRSA Viewer, Catalog Search, and APIs.

Querying Parquet datasets

Apache Parquet is a file format that includes rich metadata and supports fast, SQL-like queries on large datasets. Libraries in various languages can read Parquet files; some Python examples include pyarrow, pandas, and dask.
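
For instance, a minimal sketch of a direct read with pyarrow (the file name and column names are placeholders):

    import pyarrow.parquet as pq

    # Read only the (hypothetical) columns needed, from a single Parquet file
    table = pq.read_table("example.parquet", columns=["ra", "dec"])
    df = table.to_pandas()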

Tips for efficient queries

Query efficiency is strongly affected by dataset partitioning, and especially by a library's ability to understand and use the partitioning. For efficient queries, pay special attention to the following (both are demonstrated in the tutorials linked above):

  1. When using a method like read_parquet() or parquet.dataset(), look for a keyword argument like partitioning and pass the value "hive". Most Python libraries use this value by default, but it is important to check. Hive is a partition-naming scheme used by all datasets described on this page; essentially, the directory naming syntax is "key=value/", which identifies the partitions.
  2. When querying a dataset, for example with a method like read_parquet(columns=[...], filters=[...]), include the relevant partition key/value pair(s) in the row filters whenever possible. This allows the Parquet reader to skip all other partitions entirely. It is relevant for any query with a spatial component, since all datasets described on this page are partitioned spatially, and it is also useful when parallel processing, or any time you want to work with one or more specific partitions at a time (see the sketch after this list).
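
As an illustration, here is a minimal sketch using pyarrow.dataset; the dataset path, column names, and pixel value are placeholders:

    import pyarrow.dataset as ds

    # "hive" tells the reader that directories are named "key=value/"
    dataset = ds.dataset("catalog.parquet", format="parquet", partitioning="hive")

    # Filtering on a partition column lets the reader skip every other partition
    table = dataset.to_table(
        columns=["ra", "dec"],
        filter=ds.field("healpix_k5") == 1234,
    )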

HEALPix

HEALPix (Hierarchical Equal Area isoLatitude Pixelization; Górski et al., 2005) is a tiling of the sky into equal-area pixels. The HEALPix "order" ("k", in some contexts) determines the pixel area: the higher the order, the smaller the pixels, the better the resolution, and the more total pixels required to cover the sky (12 × 4^k pixels at order k).

Hint: A pixel index can be calculated from RA and Dec coordinates using the hpgeom Python library.
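
For example, a minimal sketch of computing an order-5 pixel index with hpgeom (the coordinates are placeholders; nest=True selects the nested numbering scheme used by the datasets on this page):

    import hpgeom

    order = 5
    nside = hpgeom.order_to_nside(order)  # nside = 2**order
    pixel = hpgeom.angle_to_pixel(nside, 210.8, 54.3, nest=True, lonlat=True, degrees=True)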

HATS partitioning and HATS Collections

Parquet products listed with an S3 prefix that ends in "/hats" are HATS Collections.

HATS (Hierarchical Adaptive Tiling Scheme) is a HEALPix-based partitioning scheme (plus metadata) for Parquet datasets. The HEALPix order at which data is partitioned is adaptive -- it varies within a given catalog based on the on-sky density of rows -- with the aim of creating partitions that hold roughly equal numbers of rows. Similarly sized partitions are important for efficient access, especially when parallel processing. HATS is designed especially to support large-scale use cases with spatial dependencies that are common in astronomy, such as cross matching. Coupled with the benefits of Parquet, HATS catalogs efficiently support a wide range of use cases.

A HATS Collection comprises a HATS Catalog plus ancillary datasets that support its use; these are described below. Note that all HATS datasets are Parquet datasets and can be accessed using any tool or library that understands Parquet. In addition, the lsdb Python library has been developed specifically to support large-scale astronomy use cases with HATS datasets. It provides simple interfaces that allow users to perform, for example, full-catalog cross matches with just a few lines of code.
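
A minimal sketch with lsdb (assuming a recent version that provides read_hats(); the S3 URL is a placeholder):

    import lsdb

    # Lazily open a HATS Catalog; nothing is read until compute()
    catalog = lsdb.read_hats("s3://example-bucket/catalog_name-hats")

    # Spatial filtering uses the partitioning to read only the relevant pixels
    cone = catalog.cone_search(ra=210.8, dec=54.3, radius_arcsec=60.0)
    df = cone.compute()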

The basic directory structure (excluding metadata) is as follows:

    .../hats/                                   # S3 prefix ending with "/hats"
    ├── catalog_name-hats/                      # HATS Catalog
    │   ├── dataset/                            # Catalog's Parquet dataset
    │       └── Norder=k/                       # 'k' (integer) = HEALPix order
    │          └── Dir=d/                       # 'd' (integer) = 10_000 * floor(n / 10_000)
    │             └── Npix=n/                   # 'n' (integer) = pixel index at HEALPix order 'k'
    │                └── part0.snappy.parquet   # Parquet file with data for this partition
    │            ... more partitions and data files ...
    ├── catalog_name-hats_index_objectid/       # HATS Index Table for column 'objectid'
    │   ├── dataset/                            # Index's Parquet dataset
    │            ... partitions and data files, depending on the index ...
    ├── catalog_name-hats_margin_10arcsec/      # 10" HATS Margin Cache
    │   ├── dataset/                            # Margin's Parquet dataset
    │            ... directories for Norder, Dir, and Npix, plus data files (as above) ...
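
The Dir level groups up to 10,000 pixel directories together, per the formula above. A minimal sketch of constructing a partition path from order k and pixel index n (the values and the ".../hats/" prefix are placeholders):

    # Hypothetical order and pixel index
    k, n = 6, 23598
    d = 10_000 * (n // 10_000)  # the Dir value; groups 10,000 pixel indices per directory
    path = f".../hats/catalog_name-hats/dataset/Norder={k}/Dir={d}/Npix={n}/"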
    

HEALPix order 5 partitioning

Parquet products listed with an S3 prefix that ends in "/healpix_k5" are partitioned by HEALPix orders 0 and 5 (nested numbering scheme).

When querying the "/healpix_k5" products, include the order 5 pixel index whenever possible, for efficiency; it is not necessary to also include order 0. The basic directory structure is as follows (a query sketch follows the listing):

    .../healpix_k5/                     # S3 prefix ending with "/healpix_k5"
    ├── catalog-name.parquet/           # Directory holding the Parquet dataset
    │   ├── healpix_k0=N/               # 'N' (integer) = pixel index at HEALPix order 0
    │       └── healpix_k5=M/           # 'M' (integer) = pixel index at HEALPix order 5
    │          └── part0.snappy.parquet # Parquet file with data for this partition
    │        ... more partitions and data files ...
    │   ├── _common_metadata            # Small Parquet file with the schema (no data)
    │   └── _metadata                   # Parquet file with all metadata (no data) collected from data files
    ├── README.txt and other ancillary files
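
Putting the pieces together, a minimal sketch of an efficient query on a "/healpix_k5" dataset (the dataset path and column names are placeholders):

    import hpgeom
    import pandas as pd

    # Pixel index (nested scheme) at order 5 for an illustrative coordinate
    pixel = hpgeom.angle_to_pixel(hpgeom.order_to_nside(5), 210.8, 54.3)

    df = pd.read_parquet(
        "catalog-name.parquet",                 # hypothetical dataset path
        columns=["ra", "dec"],                  # hypothetical column names
        filters=[("healpix_k5", "==", pixel)],  # prunes all other partitions
    )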