Parquet Catalogs
IRSA provides various catalogs and tables as spatially-partitioned Parquet datasets. These support efficient access for a wide range of common use cases and are especially good for machine learning, statistics, and other large-scale approaches. Parquet can be queried directly using common libraries in Python and other languages and the files are well-suited for parallel processing. IRSA uses HATS and HEALPix (described below) for the partitioning, providing additional efficiency for spatially oriented use cases. The datasets can be accessed in two ways:
- Query the cloud copy (recommended). The Cloud Data Access page lists bucket and path information for all of IRSA's cloud holdings including HATS and HEALPix Parquet catalogs. The files can be efficiently queried in place (no need to download them first) even in highly parallel workflows. See Python Notebook Tutorials: Accessing IRSA's cloud holdings for examples of how to navigate and query Parquet datasets.
- Bulk download scripts are provided for those who prefer to download the files to local storage before querying.
Non-Parquet versions of these catalogs are also available and can be accessed several ways, including IRSA Viewer, Catalog Search, and APIs.
Querying Parquet datasets
Apache Parquet is a file format that includes rich metadata and supports fast, SQL-like queries on large datasets. Libraries in various languages can read Parquet files. Some Python examples include:
- pandas - Commonly used. Good for basic and intermediate use cases. Supports column and row filters during the read. Demonstrated in tutorials linked above.
- dask - Parallel computing library for analytics. Good for intermediate use cases.
- pyarrow - Powerful. Good for intermediate and advanced use cases. Used under the hood by most other Python libraries that read Parquet, but with limited functionality exposed to the user. Use it directly for full access, including the ability to construct new columns on the fly (e.g., construct colors from flux columns) and use them in row filters. Demonstrated in tutorials linked above.
Query efficiency is strongly impacted by dataset partitioning, and especially by a library's ability to understand and use the partitioning. For efficient queries, pay special attention to the following (both demonstrated in tutorials linked above):
- When using a method like read_parquet() or parquet.dataset(), look for a keyword argument like partitioning and pass the value "hive". Most Python libraries use this value by default, but it's important to check. Hive is a partition-naming scheme that is used by all datasets described on this page (essentially, the directory naming syntax is "key=value/", which identifies the partitions).
- When querying a dataset, for example with a method like read_parquet(columns=[...], filter=[...]), include the relevant partition key/value pair(s) in the row filters whenever possible. This allows the Parquet reader to completely ignore all other partitions. This is relevant for any query with a spatial component, since all datasets described on this page are partitioned spatially. Additionally, this can be used when parallel processing or any time you want to load one or more specific partitions at a time.
HEALPix
HEALPix (Hierarchical Equal Area isoLatitude Pixelization; Górski et al., 2005) is a tiling of the sky into equal-area pixels. The HEALPix "order" ("k", in some contexts) determines the pixel area: higher order => smaller pixels, better resolution, and more total pixels required to cover the sky.
Hint: A pixel index can be calculated from RA and Dec coordinates using the hpgeom Python library.
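The relationship between order and pixel size can be made concrete with a few lines of standard-library Python. This sketch assumes only the standard HEALPix relations (nside = 2^k and 12 * nside^2 pixels over the full sky); the function name healpix_stats is made up for illustration.

```python
import math

# Area of the full sky in square degrees: 4*pi steradians converted to deg^2.
FULL_SKY_DEG2 = 4 * math.pi * (180 / math.pi) ** 2

def healpix_stats(order):
    """Return (nside, number of pixels, pixel area in deg^2) for a HEALPix order."""
    nside = 2 ** order
    npix = 12 * nside ** 2
    area_deg2 = FULL_SKY_DEG2 / npix
    return nside, npix, area_deg2

# Higher order: more pixels, each covering a smaller area.
for k in (0, 5, 10):
    nside, npix, area = healpix_stats(k)
    print(f"order {k}: nside={nside}, npix={npix}, area={area:.6g} deg^2")
```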
HATS partitioning and HATS Collections
Parquet products listed with an S3 prefix that ends in "/hats" are HATS Collections.
HATS (Hierarchical Adaptive Tiling Scheme) is a HEALPix-based partitioning scheme (plus metadata) for Parquet datasets. The HEALPix order at which data is partitioned is adaptive -- it varies within a given catalog based on the on-sky density of rows -- with the aim of creating partitions that have roughly equal numbers of rows. Similarly sized partitions are important for efficient access, especially when parallel processing. HATS is designed especially to support large-scale use cases with spatial dependencies that are common in astronomy, such as cross matching. Coupled with the benefits of Parquet, HATS catalogs efficiently support a wide range of use cases.
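The idea of adaptive partitioning can be illustrated with a toy sketch. This is not the HATS implementation; it is a simplified stand-in that assumes a row-count threshold and nested HEALPix numbering (where the four children of pixel p are 4p through 4p+3). The function name, the threshold, and the row counts are all made up for the example.

```python
def adaptive_partitions(counts_at_max, max_order, threshold):
    """Toy adaptive tiling: return {(order, pixel): row_count}.

    counts_at_max maps nested pixel index at max_order to a row count.
    A pixel is split into its four children while it holds more than
    `threshold` rows and has not yet reached max_order.
    """
    partitions = {}

    def visit(order, pixel):
        # In the nested scheme, dropping 2 bits per order gives the ancestor pixel.
        shift = 2 * (max_order - order)
        n = sum(c for p, c in counts_at_max.items() if p >> shift == pixel)
        if n == 0:
            return  # empty regions get no partition
        if n <= threshold or order == max_order:
            partitions[(order, pixel)] = n
        else:
            for child in range(4 * pixel, 4 * pixel + 4):
                visit(order + 1, child)

    for base_pixel in range(12):  # the 12 order-0 pixels
        visit(0, base_pixel)
    return partitions

# Made-up row counts per order-2 pixel: dense near pixels 0-3, sparse elsewhere.
counts = {0: 60, 1: 50, 2: 40, 3: 45, 4: 30, 16: 10}
parts = adaptive_partitions(counts, max_order=2, threshold=100)
for (order, pixel), n in sorted(parts.items()):
    print(f"Norder={order} Npix={pixel}: {n} rows")
```

In this toy case the dense region ends up at order 2, a moderately populated neighbor at order 1, and a sparse region at order 0, which is the qualitative behavior described above.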
A HATS Collection comprises a HATS Catalog plus ancillary datasets that support its use. These are described below. Note that all HATS datasets are Parquet datasets and can be accessed using any tool or library that understands Parquet. In addition, the lsdb Python library has been developed specifically to support large-scale astronomy use cases with HATS datasets. It provides simple interfaces that allow users to perform, for example, full-catalog cross matches with just a few lines of code.
- HATS Catalog
- Main data product. Holds all catalog data.
- Partitioned by the HATS columns 'Norder', 'Dir', and 'Npix'. These are described with the directory structure below and more information can be found at HATS Directory Scheme.
- There will be exactly one Catalog in a Collection.
- HATS Index Table
- Maps values in the column being indexed (typically an object or source ID provided by the mission or project that produced the data) to the Catalog partitions in which they reside.
- Holds the index-column values, their partitions (values of 'Norder', etc.), and possibly a few more columns from the Catalog.
- Partitioned by ranges of values in the index column.
- Useful for finding Catalog rows by index-column value rather than sky coordinates.
- There may be any number of Index Tables in a Collection, each indexing a different column.
- HATS Margin Cache
- Holds catalog data that is located in a margin (typically 5-10 arcseconds wide) around each Catalog partition.
- Partitioned by the HATS columns 'Norder', 'Dir', and 'Npix'.
- Useful for cross matching or any time you want to load a partition plus a little extra padding around the outside. For example, when parallel processing (and thus handling partitions independently), the margin ensures that spatial searches don't miss rows located just outside the given pixel/partition boundary.
- There may be any number of Margin Caches in a Collection, each using a different margin width.
The basic directory structure (excluding metadata) is as follows:
.../hats/                                     # S3 prefix ending with "/hats"
├── catalog_name-hats/                        # HATS Catalog
│   ├── dataset/                              # Catalog's Parquet dataset
│       └── Norder=k/                         # 'k' (integer) = HEALPix order
│           └── Dir=d/                        # 'd' (integer) = 10_000 * floor['n' / 10_000]
│               └── Npix=n/                   # 'n' (integer) = pixel index at HEALPix order 'k'
│                   └── part0.snappy.parquet  # Parquet file with data for this partition
│       ... more partitions and data files ...
├── catalog_name-hats_index_objectid/         # HATS Index Table for column 'objectid'
│   ├── dataset/                              # Index's Parquet dataset
│       ... partitions and data files, depending on the index ...
├── catalog_name-hats_margin_10arcsec/        # 10" HATS Margin Cache
│   ├── dataset/                              # Margin's Parquet dataset
│       ... directories for Norder, Dir, and Npix, plus data files (as above) ...
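As a small illustration of the directory scheme, the relative path of a Catalog partition can be built from its order and pixel index. This is a sketch that follows the layout and the Dir formula shown above; the function name hats_partition_path is hypothetical, and the catalog root (bucket and prefix) is deliberately left out.

```python
def hats_partition_path(norder, npix):
    """Relative path of a HATS partition directory, given Norder and Npix."""
    if not 0 <= npix < 12 * 4 ** norder:
        raise ValueError(f"pixel index {npix} is out of range for order {norder}")
    # Dir groups pixel directories in blocks of 10,000: d = 10_000 * floor(n / 10_000).
    dir_value = 10_000 * (npix // 10_000)
    return f"dataset/Norder={norder}/Dir={dir_value}/Npix={npix}"

print(hats_partition_path(3, 42))
print(hats_partition_path(5, 12287))
```

The same three key/value pairs (Norder, Dir, Npix) are what a row filter would reference in order to restrict a read to one partition.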
HEALPix order 5 partitioning
Parquet products listed with an S3 prefix that ends in "/healpix_k5" are partitioned by HEALPix orders 0 and 5 (nested numbering scheme).
When querying the "/healpix_k5" products, include the order 5 pixel index whenever possible, for efficiency. (It is not necessary to include order 0 in addition to 5.) The basic directory structure is as follows:
.../healpix_k5/                               # S3 prefix ending with "/healpix_k5"
├── catalog-name.parquet/                     # Directory holding the Parquet dataset
│   ├── healpix_k0=N/                         # 'N' (integer) = pixel index at HEALPix order 0
│       └── healpix_k5=M/                     # 'M' (integer) = pixel index at HEALPix order 5
│           └── part0.snappy.parquet          # Parquet file with data for this partition
│       ... more partitions and data files ...
│   ├── _common_metadata                      # Small Parquet file with the schema (no data)
│   └── _metadata                             # Parquet file with all metadata (no data) collected from data files
├── README.txt and other ancillary files
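One way to see why the order 5 index alone is sufficient: in the nested numbering scheme, each order 0 pixel contains 4^5 = 1024 consecutive order 5 pixels, so the order 0 index follows directly from the order 5 index. The sketch below uses that relation to build a partition directory name and a row filter. The function names are hypothetical, and the filter is written in the list-of-tuples form accepted by the filters argument of pandas and pyarrow readers.

```python
def order0_parent(pixel_k5):
    """Order-0 pixel containing a given order-5 pixel (nested numbering)."""
    if not 0 <= pixel_k5 < 12 * 4 ** 5:
        raise ValueError(f"{pixel_k5} is not a valid order-5 pixel index")
    # Each step down in order drops 2 bits; 5 orders -> shift by 10 bits (divide by 1024).
    return pixel_k5 >> 10

def k5_partition_dir(pixel_k5):
    """Relative directory of the partition holding a given order-5 pixel."""
    return f"healpix_k0={order0_parent(pixel_k5)}/healpix_k5={pixel_k5}"

# Row filter restricting a read to a single order-5 partition.
filters = [("healpix_k5", "==", 2048)]

print(k5_partition_dir(2048))
print(k5_partition_dir(12287))
```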