Cloud-Native Layouts

Last updated on 2026-03-30

Overview

Questions

  • What does “cloud-native” mean in the context of scientific data?
  • Why can NetCDF struggle in cloud environments?
  • How is Zarr different from NetCDF?
  • Which part of interoperability is affected by cloud-native layouts?

Objectives

  • Explain what makes a data format cloud-native.
  • Compare NetCDF and Zarr from a cloud-access perspective.
  • Identify how cloud-native layouts influence structural interoperability.
  • Create a virtual Zarr dataset from NetCDF using Kerchunk.

Cloud-Native Layouts


What Does “Cloud-Native” Mean?

Cloud-native data layouts are designed for:

  • Object storage (e.g., S3-compatible systems)
  • Access over HTTP
  • Parallel reads
  • Loading only the pieces of data you need (lazy access)
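The "lazy access" in the list above comes down to fetching only the bytes you need. Over HTTP, object stores support this via `Range` request headers; on a local file the equivalent is seek-and-read. A minimal sketch of that idea (the file here is a mock "remote object", not a real dataset):

```python
import os
import tempfile

# Write a mock "remote object": 1024 bytes of data.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(bytes(range(256)) * 4)  # 1024 bytes
tmp.close()

# Lazy access: fetch only the byte range we need, not the whole object.
# Over HTTP this is a "Range: bytes=512-527" request header;
# on a local file the equivalent is seek + read.
with open(tmp.name, "rb") as f:
    f.seek(512)
    piece = f.read(16)

print(len(piece))  # 16 bytes fetched, not 1024
os.unlink(tmp.name)
```

Cloud-native formats are designed so that useful subsets of the data map cleanly onto such small byte-range reads.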

In climate science, datasets such as:

  • ERA5 (ECMWF reanalysis dataset)
  • CMIP6 (climate model intercomparison dataset)

are often terabytes to petabytes in size.

Typical workflows include:

  • Reading a single variable
  • Selecting one time slice
  • Extracting a spatial subset
  • Repeating this many times (e.g., for machine learning)

A cloud-native layout makes these repeated small reads efficient.

NetCDF vs Zarr (Cloud Perspective)


NetCDF

NetCDF (Network Common Data Form) is a widely used scientific data format.

Designed for:

  • HPC systems
  • Large files on shared storage
  • Sequential or file-based access

In the cloud:

  • Stored as a single binary file
  • Harder to parallelize over HTTP
  • Repeated slicing can be inefficient

NetCDF provides strong structural interoperability in traditional computing environments,
but it is not optimized for object storage systems.

Zarr

Zarr is a chunked, cloud-optimized array storage format.

Designed for:

  • Object storage
  • Many small chunks
  • Parallel HTTP access

Data is stored as:

  • Small chunk files
  • JSON metadata
  • A directory-like structure compatible with object storage
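The JSON metadata mentioned above is what lets a reader map array indices to chunk files without opening the data itself. A sketch of a minimal Zarr v2 array metadata document (a `.zarray` file) with illustrative shape and chunk values:

```python
import json

# A minimal Zarr v2 ".zarray" metadata document (illustrative values):
# it records shape, chunk shape, dtype, and compressor, so a reader can
# map array indices to chunk files named "0.0", "0.1", ...
zarray = json.dumps({
    "zarr_format": 2,
    "shape": [720, 1440],        # full array, e.g. a 0.25-degree global grid
    "chunks": [180, 360],        # each chunk file holds one 180x360 tile
    "dtype": "<f4",
    "compressor": {"id": "zlib", "level": 5},
    "fill_value": "NaN",
    "order": "C",
    "filters": None,
})

meta = json.loads(zarray)
# Number of chunk files along each dimension: ceil(shape / chunks)
n_chunks = [-(-s // c) for s, c in zip(meta["shape"], meta["chunks"])]
print(n_chunks)  # [4, 4] -> 16 chunk files, "0.0" through "3.3"
```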

Advantages in the cloud:

  • Read only the chunks you need
  • Many workers can read simultaneously
  • Efficient for repeated slicing

Zarr is considered cloud-native because it:

  1. Reduces unnecessary data movement
  2. Enables scalable parallel processing
  3. Supports interactive analysis of very large datasets
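To see why chunking reduces data movement, consider the chunk-selection arithmetic a Zarr reader performs. The shapes below are illustrative, not from a real dataset:

```python
from itertools import product

# Which chunk files does a small read touch? A sketch of the
# chunk-selection arithmetic a Zarr reader performs.
shape = (365, 720, 1440)   # (time, lat, lon) - illustrative
chunks = (30, 180, 360)    # chunk shape

def chunks_for_slice(starts, stops):
    """Return the chunk keys (e.g. '2.0.1') overlapping the requested box."""
    ranges = [range(lo // c, -(-hi // c))
              for lo, hi, c in zip(starts, stops, chunks)]
    # Cartesian product of chunk indices along each dimension
    return [".".join(map(str, idx)) for idx in product(*ranges)]

# One time step of a small spatial subset touches a single chunk:
needed = chunks_for_slice(starts=(0, 0, 0), stops=(1, 180, 360))
total = 13 * 4 * 4  # ceil(365/30) * (720/180) * (1440/360)
print(len(needed), "of", total, "chunks")  # 1 of 208
```

Only that one chunk file is fetched; the other 207 are never touched, which is exactly the repeated-small-read pattern described earlier.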

What Changes in Interoperability?


Cloud-native layouts mainly affect structural interoperability.

They change:

  • How data is physically organized
  • How it is accessed
  • How scalable it is

They do not automatically change:

  • Variable names
  • Units
  • Coordinate conventions

That belongs to semantic interoperability, which still relies on:

  • CF conventions
  • Agreed metadata standards

So:

  • NetCDF → structural interoperability (file-based)
  • Zarr → structural interoperability (cloud-native)

Both can support semantic interoperability, but only if metadata conventions are respected.
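Semantic interoperability lives in the metadata, not the storage layout. The same CF-style attributes can be attached to a variable whether it is stored in NetCDF or Zarr; `air_temperature` is a real CF standard name, and the `long_name` here is illustrative:

```python
# CF-style variable attributes: identical in NetCDF and Zarr.
temperature_attrs = {
    "standard_name": "air_temperature",  # CF standard name
    "units": "K",                        # CF-conformant units string
    "long_name": "Near-surface air temperature",  # illustrative
}

# A tool that understands CF conventions can interpret the variable
# identically from either format.
print(temperature_attrs["standard_name"])
```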

Converting a dataset from file-based to cloud-native (hands-on session)


In this session, we will use Kerchunk to create a virtual Zarr dataset from an existing NetCDF file.

This allows us to access the dataset in a cloud-native way without modifying or copying the original data.

NetCDF → Virtual Zarr with Kerchunk

Goal

Create a cloud-native representation of an existing NetCDF file
without rewriting or duplicating the data.

What is Kerchunk?

Kerchunk creates a virtual Zarr dataset from existing formats such as NetCDF or HDF5.

Instead of converting data, Kerchunk:

  • Reads the original file structure
  • Maps byte ranges to Zarr-style chunk references
  • Writes a small JSON reference file

Result:

  • The original file remains unchanged
  • The JSON acts as a reference layer
  • No data duplication
  • No heavy conversion
  • Immediate cloud-compatible access
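The reference layer described above can be sketched as a small JSON document. The URL, offsets, and shapes below are made up for illustration; the overall shape follows the Kerchunk/fsspec reference format, where inline strings hold Zarr metadata and each chunk key maps to `[url, byte_offset, byte_length]` in the original file:

```python
import json

# Schematic Kerchunk reference file (illustrative values throughout).
ref = {
    "version": 1,
    "refs": {
        ".zgroup": "{\"zarr_format\": 2}",
        "temperature/.zarray": json.dumps({
            "shape": [24, 100], "chunks": [1, 100],
            "dtype": "<f4", "zarr_format": 2,
            "compressor": None, "fill_value": None,
            "order": "C", "filters": None,
        }),
        # chunk "0.0" = 400 bytes starting at offset 1024 of the remote file
        "temperature/0.0": ["https://example.org/data.nc", 1024, 400],
    },
}

url, offset, length = ref["refs"]["temperature/0.0"]
print(offset, length)  # a reader fetches exactly these bytes via HTTP Range
```

No array data appears in the JSON; it only tells a Zarr reader where the bytes already live.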

Different ways of accessing remote data

| Approach | What happens                                                    | Who does the work |
| -------- | --------------------------------------------------------------- | ----------------- |
| OPeNDAP  | Server interprets the dataset and sends subsets                  | Server            |
| Kerchunk | Client reconstructs dataset structure and reads chunks directly  | Client            |

Step 1: Create a Kerchunk reference

  • Open JupyterLab in your project folder
  • Continue in your notebook or create a new one

PYTHON

import json
from kerchunk.netCDF3 import NetCDF3ToZarr

In this step, we are not converting data. We are creating a JSON file that describes how to access the data.

We are transforming how data is accessed, not the data itself.

PYTHON

# Direct file endpoint (raw file access, not OPeNDAP)
file_url = "https://opendap.4tu.nl/thredds/fileServer/IDRA/2019/01/02/IDRA_2019-01-02_12-00_raw_data.nc"

# Inspect the NetCDF file and build a mapping:
# Zarr chunks → byte ranges in the original file
ref = NetCDF3ToZarr(file_url, inline_threshold=100).translate()

# Save the mapping as JSON (no data stored here)
with open("idra_ref.json", "w") as f:
    json.dump(ref, f)

Step 2: Open as a virtual Zarr dataset

xarray behaves as if it were reading a Zarr dataset, but the data still comes from the original NetCDF file.

PYTHON

import xarray as xr

# Kerchunk reads metadata locally from the JSON file, and retrieves data lazily from the original remote file only when needed.
ds_ref = xr.open_dataset(
    "idra_ref.json",
    engine="kerchunk",
    storage_options={
        "remote_protocol": "https",
        "remote_options": {
            "asynchronous": True,
        },
    },
)

ds_ref

Step 3: Inspect structure and metadata

The semantics remain the same.

PYTHON

ds_ref.dims
ds_ref.variables
ds_ref.attrs

Only the access pattern has changed, not the data meaning.

Step 4: Perform lazy slicing

PYTHON

# Select a subset (only required chunks are accessed)
subset = ds_ref.isel(time_raw_data=0)

subset

  • No full dataset is loaded
  • Only relevant chunks are accessed
  • Access is lazy and chunk-based

Conceptual comparison

| Feature             | OPeNDAP                    | Kerchunk                   |
| ------------------- | -------------------------- | -------------------------- |
| Access type         | Protocol-based             | Storage-based              |
| Endpoint            | /dodsC/                    | /fileServer/               |
| Who interprets data | Server                     | Client                     |
| Data model          | File-oriented              | Chunk-oriented             |
| Scalability         | Limited by server          | Scales with client + cloud |
| Parallel access     | Limited                    | Natural (chunk-based)      |
| Reusability         | Low                        | High (JSON reusable)       |
| Setup complexity    | Low                        | Medium                     |
| Performance         | Server + network dependent | Chunk-wise, parallel access |

When to use what?


Use OPeNDAP when:

  • You want quick remote access
  • You rely on existing data services (e.g., THREDDS)
  • You are doing exploratory analysis
  • You do not control data storage

Typical use: browsing datasets, small analyses

Use Kerchunk when:

  • You want cloud-native workflows

  • You need scalable or parallel processing

  • You work with large datasets

  • You integrate with:

    • zarr
    • dask
    • object storage (S3, etc.)

Typical use: large-scale analysis, pipelines, reproducible workflows

Key Points
  • Cloud-native layouts are optimized for object storage and HTTP access.
  • NetCDF works well on HPC systems but is not optimized for cloud-native environments.
  • Zarr stores data in chunks, enabling efficient parallel access.
  • Kerchunk enables cloud-native access to NetCDF without data duplication.
  • Kerchunk changes the access pattern, not the data itself.
  • Cloud-native layouts affect structural interoperability, while semantic interoperability depends on metadata standards such as CF conventions.