Content from Introduction


Last updated on 2026-03-26

Overview

Questions

  • Why is interoperability important when dealing with research data?
  • What are the three layers of interoperability?
  • How can you identify if a dataset is interoperable or not?

Objectives

  • Understand why interoperability matters in climate & atmospheric science
  • Recognize the 3 layers: structural, semantic, technical
  • Identify interoperable vs non-interoperable datasets

What is interoperability?


From the foundational article: The FAIR Guiding Principles for scientific data management and stewardship 1

Three guiding principles for interoperability are:

  • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • I2. (meta)data use vocabularies that follow FAIR principles
  • I3. (meta)data include qualified references to other (meta)data
Challenge

Assessing how interoperable a dataset is

You receive a dataset containing global precipitation estimates for 2010–2020. Its characteristics are:

  • Provided as a NetCDF file.
  • Variables have short, cryptic names (e.g., prcp, lat, lon).
  • Metadata uses inconsistent units (some missing).
  • Coordinates and grids are documented only in an accompanying PDF.
  • The dataset includes a DOI and references two external datasets used for validation.
  • No controlled vocabularies or community standards (e.g., CF, GCMD keywords) are used.

Based on the FAIR interoperability principles (I1–I3), how would you rate the interoperability of this dataset?

  1. High interoperability — it uses a widely supported file format and includes references to other datasets.
  2. Moderate interoperability — some technical elements exist, but semantic clarity and formal vocabularies are missing.
  3. Low interoperability — metadata and semantic descriptions do not meet I1–I3 requirements.
  4. Full interoperability — all three interoperability principles (I1, I2, I3) are clearly satisfied.

Correct answer: option 2 or 3, depending on the level of strictness; for educational clarity, choose option 3.

I1: Not satisfied (no formal/shared language for knowledge representation; heavy reliance on PDF documentation).

I2: Not satisfied (no controlled vocabularies, no standards).

I3: Partially satisfied (qualified references exist, but insufficient context). Overall, interoperability is low.

Identify the three layers of interoperability


Semantic interoperability = meaning

Semantic interoperability ensures that data carries shared, consistent meaning across institutions and tools. This is achieved through:

  • standard vocabularies
  • controlled terms
  • variable naming conventions
  • units
  • coordinate definitions

Examples include CF standard names, attributes, and controlled vocabularies. Without semantic interoperability, datasets cannot be reliably interpreted, compared, or combined.

Structural interoperability = representation

Structural interoperability ensures that data is organized, stored, and encoded in predictable, machine-actionable ways. This is achieved through:

  • common file formats

  • shared data models

  • consistent dimension and array structures

Examples include NetCDF, Zarr, and Parquet, which define how variables, coordinates, and metadata are stored. Structural interoperability allows tools across programming languages and platforms to read data consistently.

Technical interoperability = access

Technical interoperability ensures that data can be accessed, exchanged, and queried using standard, machine-readable mechanisms. This is achieved through:

  • APIs

  • remote access protocols

  • web services

  • cloud object storage interfaces

Examples include OPeNDAP, THREDDS and REST APIs. Technical interoperability enables automated workflows, cloud computing, and scalable analytics.
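The technical layer can be made concrete with a small example. In DAP (the protocol behind OPeNDAP), a client requests only a subset of a remote dataset by appending a constraint expression to the dataset URL. The sketch below builds such a URL; the endpoint and variable name are made up for illustration:

```python
# Build an OPeNDAP (DAP2) request URL with a constraint expression.
# The server returns only the requested subset, not the whole file.
base_url = "https://opendap.example.org/thredds/dodsC/precip.nc"  # hypothetical endpoint

def dap_subset_url(base, variable, *index_ranges):
    """Append a DAP2 constraint expression like ?prcp[0:9][100:200][100:200]."""
    ranges = "".join(f"[{start}:{stop}]" for start, stop in index_ranges)
    return f"{base}.dods?{variable}{ranges}"

url = dap_subset_url(base_url, "prcp", (0, 9), (100, 200), (100, 200))
print(url)
```

In practice a client library such as xarray or netCDF4 constructs these requests for you; the point is that subsetting happens on the server, through a standard, machine-readable mechanism.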

References

  • European Commission (Ed.). (2004). European interoperability framework for pan-European egovernment services. Publications Office.
  • European Commission. Directorate General for Research and Innovation. & EOSC Executive Board. (2021). EOSC interoperability framework: Report from the EOSC Executive Board Working Groups FAIR and Architecture. Publications Office. https://data.europa.eu/doi/10.2777/620649
Challenge

Reflect back on the three guiding principles for interoperability (I1–I3) (Think-Pair-Discuss):

  • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • I2. (meta)data use vocabularies that follow FAIR principles
  • I3. (meta)data include qualified references to other (meta)data

Do they represent all the three layers of interoperability (structural, semantic, technical)? Explain your reasoning.

FAIR’s interoperability principles emphasize semantic interoperability, while the structural and technical layers are insufficiently addressed.

For a domain like climate science, where structural standards such as NetCDF-CF and technical standards such as OPeNDAP matter enormously, these three guiding principles alone are not enough to guarantee practical interoperability.

Key elements of interoperable research workflows


Interoperable research workflows rely on a set of shared practices, formats, and technologies that allow data to be exchanged, understood, and reused consistently across tools and institutions. In climate and atmospheric science, these elements form the backbone of scalable, reproducible, and machine-actionable data ecosystems.

  • Community formats (NetCDF, Zarr, Parquet) provide a common structural foundation.

    These formats encode data in predictable ways, with clear rules about dimensions, variables, and internal structure. NetCDF remains the dominant community standard for multidimensional geoscience data, while Zarr offers a cloud-native representation suitable for large-scale, distributed computing. Parquet complements both by providing an efficient columnar format for tabular or metadata-rich data. Using community formats ensures that tools across languages and platforms can interpret datasets consistently.

  • Standardized metadata (CF conventions) provide the semantic layer needed for meaningful interpretation.

    CF conventions define variable names, units, coordinate systems, and grid attributes so that datasets from different sources “speak the same language.” This allows climate model output, satellite observations, and reanalysis products to be aligned and compared reliably.

  • Stable APIs enable technical interoperability by providing machine-readable access to data and metadata.

    APIs based on HTTP and JSON allow automated workflows, programmatic data publication, and integration between repositories, processing systems, and analysis tools. A stable, well-documented API ensures that downstream services and scripts continue to function even as data collections evolve.

  • Cloud-native layouts make large datasets scalable and performant.

    By storing data as independent chunks in object storage, formats such as Zarr allow parallel, lazy, and distributed access—ideal for big climate datasets, serverless workflows, and AI pipelines. This ensures that even multi-terabyte archives can be streamed efficiently without requiring full downloads.

Together, these elements work as a coordinated system: community formats provide structure, metadata provides meaning, APIs provide access, and cloud-native layouts provide scalability.

This combination is what enables truly interoperable research workflows in general, but especially in modern climate and atmospheric science. Why?

Challenge

Real world barriers to data interoperability and reuse (Think-Pair-Discuss)

Think of a time when you tried to reuse a dataset that you did not produce. (5 minutes) What was the most significant barrier you encountered?

Pair discussion (5 minutes)

Share your experiences with your partner:

  • What specific interoperability challenges did you face (structural, semantic, technical)?

  • How did you try to overcome them?

  • Would adherence to FAIR I1–I3 principles have helped? If so, how?

Group debrief (5 minutes)

Discuss as a group:

  • Common obstacles

  • Whether these were structural, semantic, or technical

  • How such challenges could be prevented if data producers had designed the dataset with interoperability in mind (e.g., CF conventions, persistent identifiers, shared vocabularies, formal metadata languages)

Specific data challenges in the climate & atmospheric sciences


  • Heterogeneous data origins: Climate research integrates satellite retrievals, weather models, climate simulations, in-situ sensors, radar, lidar, aircraft measurements, and reanalysis datasets—each with its own structure, conventions, and processing workflows.

  • Different spatial and temporal resolutions: Satellite images may be daily or hourly at 1 km resolution, while climate models may provide monthly or daily outputs on coarse grids; combining them requires consistent metadata and alignment.

  • Multiple file formats and data models: Data may come as GRIB, NetCDF, GeoTIFF, HDF5, CSV, or Zarr, each with different structural assumptions that affect processing and interpretation.

  • Inconsistent metadata quality: Missing units, inconsistent variable names, unclear coordinate systems, or non-standard attributes are frequent issues—making semantic interoperability a major challenge.

  • Large data volume and velocity: Earth observation missions (e.g., Sentinel, GOES), reanalysis products (ERA5), and high-resolution climate simulations produce terabytes to petabytes of data, making efficient, interoperable access necessary.

  • Different access mechanisms and services: Data are distributed across portals using APIs, OPeNDAP servers, cloud object storage, FTP, THREDDS catalogs, proprietary download tools, or manual interfaces—requiring technical interoperability to automate workflows.

  • Versioning and reproducibility issues: Climate datasets evolve frequently (e.g., reprocessed satellite series, new CMIP6 versions), and without stable identifiers or catalog metadata, reproducibility becomes difficult across institutions.

  • Need for multi-model and multi-dataset comparisons: Studies such as model evaluation, bias correction, and data assimilation depend on aligning diverse datasets that were never originally designed to work together.

Why interoperability is essential


Interoperability is essential in climate and atmospheric science because researchers routinely work with multiple heterogeneous datasets that were never originally designed to work together. By ensuring that data are described consistently, stored in predictable structures, and accessed through standard mechanisms, interoperability makes it possible to combine and reuse data efficiently across research workflows.

First, interoperability enables data reuse: when datasets follow shared metadata conventions and formats, researchers can easily understand what variables represent, how they were produced, and how they can be used in new contexts. This avoids redundant effort and saves time across research groups.

Second, interoperability enables integration across sources—for example, combining model output with satellite observations, radar measurements, in-situ sensors, and reanalysis datasets. These data sources differ in resolution, structure, access method, and semantics; without shared standards, aligning them becomes difficult or impossible.

Third, interoperability reduces friction in data pipelines. Standardized formats, consistent metadata, and machine-actionable APIs allow workflows to run smoothly without manual cleaning, renaming, or restructuring. This is especially critical when handling large, frequently updated datasets typical in climate research.

Finally, interoperability is required for automation, AI, dashboards, and multi-disciplinary science. Machine learning pipelines, automated monitoring systems, and interactive applications rely on consistent, accessible, and machine-readable data. Without interoperability, these tools break or require extensive custom engineering.

In short, interoperability is what makes the diverse, high-volume data ecosystem of climate and atmospheric science usable, scalable, and scientifically trustworthy.

Challenge

True/False or Agree/Disagree with discussion afterwards

  • “As long as data are open access, they are interoperable.”
  • “Metadata standards help ensure interoperability.”
  • “As long as data uses an open standard format, it is interoperable.”

False, True, False

Challenge

Discuss with your peer:

Participants should identify whether the dataset is interoperable based on the three layers discussed (structural, semantic, technical).

dataset 1: Interoperable

  • Structure: NetCDF format with clear dimensions and variables.
  • Metadata: CF-compliant attributes, standard names, units.
  • Access: OPeNDAP protocol for remote access.

dataset 2: Not interoperable

  • Structure: CSV files with ambiguous column headers.
  • Metadata: Lacks standardized metadata, unclear variable meanings.
  • Access: Manual download, no API or remote access.

dataset 3: Not interoperable

  • Structure: NetCDF format but missing CF compliance.
  • Metadata: Inconsistent or missing units, unclear variable names.
  • Access: OPeNDAP protocol
Key Points
  • Interoperability ensures that data can be understood, combined, accessed, and reused across tools, institutions, and workflows with minimal manual intervention.

  • Interoperability operates at three complementary layers: structural (how data is encoded and organized), semantic (how data is described and interpreted), and technical (how data is accessed and exchanged).

  • The FAIR interoperability principles I1–I3 primarily address the semantic layer. They provide essential guidance on shared metadata languages, vocabularies, and references, but they do not fully cover structural and technical interoperability.

  • In climate and atmospheric science, all three layers are required for practical reuse. Structural standards (e.g., NetCDF, Zarr), semantic conventions (e.g., CF), and technical mechanisms (e.g., APIs, OPeNDAP, THREDDS) must work together.

  • Many real-world barriers to dataset reuse (unclear metadata, missing units, inconsistent coordinate systems, incompatible file formats, unstable access mechanisms) are failures of one or more interoperability layers.

  • Interoperable research workflows rely on established community formats, standardized metadata conventions, stable access protocols, and scalable cloud-native layouts that allow large heterogeneous datasets to be aligned, streamed, and analysed consistently.

  • Interoperability is essential in climate science because datasets come from diverse sources (models, satellites, sensors, reanalysis) and must be combined into integrated analyses that are reproducible and machine-actionable.


  1. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., … & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 3(1), 1-9.↩︎

Content from Structural interoperability


Last updated on 2026-03-11

Overview

Questions

  • What is structural interoperability?

  • How do open standards and community governance enable structurally interoperable research data?

  • Which structural expectations must a data format satisfy to support automated, machine-actionable workflows?

  • Which open standards are commonly used in climate and atmospheric sciences to achieve structural interoperability?

  • What is NetCDF’s data structure?

Objectives

  • Explain structural interoperability in terms of data models, dimensions, variables, and metadata organization.

  • Identify the role of open standards and community-driven governance in ensuring long-term structural interoperability.

  • Describe the key structural expectations required for automated alignment, georeferencing, metadata interpretation, and scalable analysis.

  • Analyze a NetCDF file to identify its core structural elements (dimensions, variables, coordinates, and attributes).

What is structural interoperability?


Structural interoperability refers to how data are organized internally: their dimensions, attributes, and encoding rules. It ensures that different tools can parse and manipulate a dataset without prior knowledge of a custom schema or ad-hoc documentation.

In particular for the field of climate & atmospheric sciences, for a dataset to be structurally interoperable:

  • Arrays must have known shapes and consistent dimension names (e.g., time, lat, lon, height).
  • Metadata must follow predictable rules (e.g., attributes like units, missing_value, long_name).
  • Coordinates must be clearly defined, enabling slicing, reprojection, or aggregation.

Structural interoperability answers the question: Can machines understand how this dataset is structured without human intervention?
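These structural requirements can be illustrated with a small in-memory dataset. The sketch below uses xarray; the variable and coordinate names are illustrative:

```python
import numpy as np
import xarray as xr

# A small dataset with the three structural ingredients listed above:
# named dimensions, coordinate variables, and per-variable attributes.
ds = xr.Dataset(
    data_vars={
        "t2m": (
            ("time", "lat", "lon"),
            np.zeros((2, 3, 4)),
            {"units": "K", "long_name": "2 m air temperature"},
        ),
    },
    coords={
        "time": ("time", np.array([0, 6]), {"units": "hours since 2020-01-01"}),
        "lat": ("lat", np.array([-10.0, 0.0, 10.0]), {"units": "degrees_north"}),
        "lon": ("lon", np.array([0.0, 90.0, 180.0, 270.0]), {"units": "degrees_east"}),
    },
)

# Because the structure is explicit, a machine can answer structural
# questions without any external documentation:
print(ds.sizes["time"], ds.sizes["lat"], ds.sizes["lon"])
print(ds["t2m"].dims)
print(ds["t2m"].attrs["units"])
```

Any tool that understands the data model (not just this one library) can perform the same inspection, which is exactly what "without human intervention" means here.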

Structural interoperability relies on open standards


To attain structural interoperability, two key aspects are essential: machine-actionability and longevity. Open standards ensure that these two aspects are met. An open standard is not merely a published file format. From the perspective of structural interoperability, an open standard guarantees that:

  • The data model is publicly specified (arrays, dimensions, attributes, relationships)

  • The rules for interpretation are explicit, not inferred from software behavior

  • The specification can be independently implemented by multiple tools

  • The standard evolves through transparent versioning, avoiding silent breaking changes

These properties ensure that a dataset remains structurally interpretable even when:

  • The original software is no longer available
  • The dataset is reused in a different scientific domain
  • Automated agents, rather than humans, perform the analysis

Usually these standards are adopted and maintained by a non-profit organization, and their ongoing development is driven by a community of users and developers through an open decision-making process.

Formats such as NetCDF, Zarr, and Parquet emerge from broad communities like:

  • Unidata (NetCDF)
  • Pangeo (cloud-native geoscience workflows)
  • Open Geospatial Consortium (OGC) & World Meteorological Organization (WMO)

These groups define formats that encode stable, widely adopted structural constraints that machines and humans can rely on.

Structural Interoperability is about Data Models, not just data formats

Structural interoperability does not emerge from file extensions alone. It is enforced by an underlying data model that defines:

  • What kinds of objects exist (arrays, variables, coordinates)

  • How those objects relate to one another

  • Which relationships are mandatory, optional, or forbidden

Community standards such as NetCDF and Zarr succeed because they define and constrain a formal data model, rather than allowing arbitrary structure.

This is why structurally interoperable datasets can be:

  • Programmatically inspected

  • Automatically validated

  • Reliably transformed across tools and platforms

Challenge

Structural expectations encoded in open standards (Think-Pair-Discuss)

Which structural expectations are required for:

  • automated alignment of datasets along common dimensions (e.g., time, latitude, longitude)
  • georeferencing and spatial indexing
  • correct interpretation of units, scaling factors, and missing values
  • efficient analysis of large datasets using chunking and compression

Automated alignment of datasets along common dimensions (e.g., time, latitude, longitude)

  • Required structural expectation: Named, ordered dimensions

Explanation: Automated alignment requires that datasets explicitly declare:

  • Dimension names (e.g., time, lat, lon)
  • Dimension order and length

Open standards such as NetCDF and Zarr enforce explicit dimension definitions. Tools like xarray rely on this structure to automatically align arrays across datasets without manual intervention.
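A minimal illustration of such automatic alignment, assuming xarray is available (the data are made up):

```python
import numpy as np
import xarray as xr

# Two records of the same (hypothetical) variable over different,
# partially overlapping time ranges.
a = xr.DataArray(np.arange(4.0), dims="time", coords={"time": [0, 1, 2, 3]}, name="prcp")
b = xr.DataArray(np.arange(3.0), dims="time", coords={"time": [2, 3, 4]}, name="prcp")

# Named dimensions plus explicit coordinate values let xarray align the
# arrays on their shared coordinates with no manual index bookkeeping.
a2, b2 = xr.align(a, b, join="inner")
print(a2["time"].values.tolist())  # [2, 3]
```

With unnamed axes, the same operation would require a human to know which axis of which array is "time"; the declared structure is what makes it automatic.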

Georeferencing and spatial indexing

  • Required structural expectations: Coordinate variables, Explicit relationships between coordinates and data variables

Explanation: Georeferencing depends on the structural presence of coordinate variables that:

  • Map array indices to real-world locations
  • Are explicitly linked to data variables via dimensions

This enables spatial indexing, masking, reprojection, and subsetting. Without declared coordinates, spatial operations require external knowledge and break structural interoperability.

Correct interpretation of units, scaling factors, and missing values

  • Required structural expectation: Consistent attribute schema

Explanation: Open standards require metadata attributes to be:

  • Explicitly declared
  • Attached to variables
  • Machine-readable (e.g., units, scale_factor, add_offset, _FillValue)

This allows tools to correctly interpret numerical values without relying on external documentation. While the meaning of units is semantic, their presence and location are structural requirements.
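The CF packing rule that these attributes support can be sketched in plain numpy (the attribute values are illustrative; libraries such as xarray apply this decoding automatically when opening a file):

```python
import numpy as np

# Packed integer data with the three machine-readable attributes that
# open standards require to be attached to the variable itself.
packed = np.array([0, 100, -32767], dtype=np.int16)
attrs = {"scale_factor": 0.01, "add_offset": 273.15, "_FillValue": -32767}

# The unpacking rule: unpacked = packed * scale_factor + add_offset,
# with _FillValue marking missing samples.
values = packed.astype("float64") * attrs["scale_factor"] + attrs["add_offset"]
values[packed == attrs["_FillValue"]] = np.nan

print(values)  # approximately [273.15, 274.15, nan]
```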

Efficient analysis of large datasets using chunking and compression

  • Required structural expectation: Chunking and compression mechanisms defined by the standard

Explanation: Formats such as NetCDF4 and Zarr define:

  • How data is divided into chunks
  • How chunks are compressed
  • How chunks can be accessed independently

This structural organization enables parallel, lazy, and out-of-core computation, which is essential for large-scale climate and atmospheric datasets.
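The underlying idea can be sketched with plain numpy; real formats add compression and an on-disk or object-store layout, but the access pattern is the same (sizes here are made up):

```python
import numpy as np

# A 6x6 array stored as four independent 3x3 chunks, mimicking how
# Zarr and NetCDF-4 lay out data so each chunk can be read on its own.
data = np.arange(36.0).reshape(6, 6)
chunk = 3
chunks = {
    (i, j): data[i * chunk:(i + 1) * chunk, j * chunk:(j + 1) * chunk].copy()
    for i in range(2) for j in range(2)
}

# To read one region we touch only the chunk that contains it,
# never the full array: the basis of lazy, parallel access.
needed = chunks[(1, 0)]  # bottom-left 3x3 block only
print(needed.shape, needed[0, 0])
```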

NetCDF


Once upon a time, in the 1980s, researchers at the University Corporation for Atmospheric Research (UCAR) and Unidata came together to address the need for a common format for atmospheric science data. The question they asked was: how can we share data effectively, so that it is portable and easily read by both humans and machines? The answer was NetCDF.

NetCDF (Network Common Data Form) is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

A NetCDF file consists of dimensions, variables, and attributes

Self-describing structure (Structural layer):

  • One file carries data and metadata, usable on any system. This means that there is a header which describes the layout of the rest of the file, in particular the data arrays, as well as arbitrary file metadata in the form of name/value attributes. 
  • Dimensions define axes (time, lat, lon, level).
  • Variables store multidimensional arrays.
  • Coordinate variables describe spatial/temporal context.
  • Attributes are metadata per variable, and for the whole file.
  • Variable attributes encode units, fill values, variable names, etc.
  • Global attributes provide dataset-level metadata (title, creator, convention, year).
  • No external schema is required, the file contains its own structural metadata.
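This self-describing structure is what the `ncdump -h` utility renders as CDL (Common Data Language). A hypothetical header, with illustrative names and sizes, might look like:

```
netcdf precip_example {        // hypothetical file
dimensions:
    time = 12 ;
    lat = 180 ;
    lon = 360 ;
variables:
    double time(time) ;
        time:units = "days since 2010-01-01" ;
    float prcp(time, lat, lon) ;
        prcp:units = "mm/day" ;
        prcp:_FillValue = -9999.f ;
// global attributes:
    :title = "Example precipitation dataset" ;
    :Conventions = "CF-1.8" ;
}
```

Every structural element named above appears in the header: dimensions with lengths, variables with their dimensions, per-variable attributes, and global attributes.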

Other useful open standards used in the field


Zarr (cloud-native, chunked, distributed)

Zarr is a format designed for scalable cloud workflows and interactive high-performance analytics.

Structural characteristics:

  • Stores arrays in chunked, compressed pieces across object storage systems.
  • Supports hierarchical metadata similar to NetCDF.
  • Uses JSON metadata that tools interpret consistently (e.g., xarray reading a Zarr store through an fsspec mapper).
  • Works seamlessly with Dask for parallelized computation.

Why it matters:

  • Enables analysis of petabyte-scale datasets (e.g., NASA Earthdata Cloud).
  • Structural rules are community-governed (Zarr v3 specification).
  • Allows fine-grained access—no need to download entire files.

Zarr is becoming central for cloud-native structural interoperability.

Parquet (columnar, schema-driven)

Parquet offers efficient tabular storage and is ideal for:

  • Station data
  • Metadata catalogs
  • Observational time series

Structural characteristics:

  • Strongly typed schema ensures column-level consistency.
  • Supports complex types but enforces stable structure.
  • Columnar layout enables fast filtering and analytical queries.

In climate workflows:

  • Parquet is used for catalog metadata in Intake, STAC, and Pangeo Forge.
  • Structural predictability supports machine-discoverable datasets.

Parquet complements NetCDF/Zarr, addressing non-array use cases.

Challenge

Identify the structural elements in a NetCDF file

Please perform the following steps to explore the structural elements of a NetCDF file:


1. Open a NetCDF file: https://opendap.4tu.nl/thredds/dodsC/IDRA/2019/01/02/IDRA_2019-01-02_12-00_raw_data.nc.html
2. Identify variable metadata
3. Identify the global attributes
4. Identify the dimensions and coordinate variables

1. Open the NetCDF file

The dataset can be explored through the OPeNDAP interface:

https://opendap.4tu.nl/thredds/dodsC/IDRA/2019/01/02/IDRA_2019-01-02_12-00_raw_data.nc.html

This file contains raw radar measurements from the IDRA radar system for a specific time period.

2. Variable metadata

The dataset contains several variables describing the radar measurements and the coordinates.

Each variable carries metadata attributes such as:

standard_name, units, axis, long_name, comment

3. Global attributes

Examples of dataset-level attributes include:

  • title
  • institution
  • history
  • references
  • Conventions
  • location
  • source
  • example

These attributes provide general information describing the dataset as a whole, including provenance and context.

4. Dimensions and coordinate variables

Dimensions

  • time_raw_data
  • range
  • time_processed_data
  • sample_beat_signal
  • scalar

These dimensions define the structure of the dataset.

Coordinate variables

These have the same names as the dimensions:

  • time_raw_data
  • range
  • time_processed_data
  • sample_beat_signal

These variables provide the coordinate values used to interpret the data.

Data variables

There are 27 data variables in total. Some examples:

  • noise_power_horizontal: (range) float32
  • azimuth_raw_data: (time_raw_data) float32
  • radial_velocity: (time_processed_data, range) float32

These elements define the structural organization of the dataset, allowing software tools to interpret the data automatically and enabling structural interoperability.
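Programmatically, the same inspection can be done with xarray. The sketch below uses a small in-memory stand-in that borrows the structural element names from the solution above (sizes and values are made up); the commented-out line shows how the real remote file would be opened instead:

```python
import numpy as np
import xarray as xr

# Stand-in dataset mimicking the radar file's layout; sizes are illustrative.
ds = xr.Dataset(
    {"radial_velocity": (("time_processed_data", "range"), np.zeros((4, 5)))},
    coords={
        "time_processed_data": ("time_processed_data", np.arange(4)),
        "range": ("range", np.linspace(0.0, 1000.0, 5)),
    },
    attrs={"title": "IDRA-like example", "Conventions": "CF-1.6"},
)

# The identical inspection works on the remote OPeNDAP dataset:
# ds = xr.open_dataset("https://opendap.4tu.nl/thredds/dodsC/IDRA/2019/01/02/"
#                      "IDRA_2019-01-02_12-00_raw_data.nc")

print(sorted(ds.sizes))         # dimension names
print(list(ds.coords))          # coordinate variables
print(list(ds.data_vars))       # data variables
print(ds.attrs["Conventions"])  # a global attribute
```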

Key Points
  • Structural interoperability concerns how data are organized, not what they mean.
  • Open standards are essential for machine-actionability and long-term reuse.
  • Structural interoperability is enforced by data models, not file extensions.
  • Standards maintained by communities (e.g. Unidata, Pangeo, OGC/WMO) encode shared structural contracts that tools and workflows can reliably depend on.
  • NetCDF exemplifies structural interoperability for multidimensional geoscience data.

Content from Semantic interoperability


Last updated on 2026-03-11

Overview

Questions

  • What is semantic interoperability?

  • Why is structural interoperability alone insufficient for meaningful data reuse?

  • How do community metadata conventions (e.g. CF) encode shared scientific meaning?

  • What does it mean for a NetCDF file to be “CF-compliant”?

Objectives

By the end of this episode, learners will be able to:

  • Distinguish between structural and semantic interoperability.

  • Explain why shared vocabularies and conventions are required for machine-actionable meaning.

  • Describe the role of the CF Conventions in climate and atmospheric sciences.

  • Apply a CF compliance checker to evaluate how CF-compliant NetCDF files are.

What is semantic interoperability?


Semantic interoperability concerns shared meaning.

A dataset is semantically interoperable when machines and humans interpret its variables in the same scientific way, without relying on informal documentation, personal knowledge, or context outside the data itself. Semantic interoperability answers the question:

“Do we agree on what this data represents?”

This goes beyond structure. Two datasets may both contain a variable named temp, stored as a float array over time and space, yet represent:

  • air temperature at 2 m

  • sea surface temperature

  • model potential temperature

  • sensor voltage converted to temperature

Are these the same quantity? No.

Without semantic constraints, machines cannot reliably compare, combine, or reuse such data.
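This ambiguity is easy to demonstrate: the two arrays below are structurally identical and share the name temp, and only a semantic attribute can tell them apart (the values are made up):

```python
import numpy as np
import xarray as xr

# Two variables, both named "temp", structurally identical.
# Only the standard_name attribute reveals they are different quantities.
air = xr.DataArray(np.array([288.0]), dims="time", name="temp",
                   attrs={"standard_name": "air_temperature", "units": "K"})
sst = xr.DataArray(np.array([288.0]), dims="time", name="temp",
                   attrs={"standard_name": "sea_surface_temperature", "units": "K"})

same_name = air.name == sst.name
same_quantity = air.attrs["standard_name"] == sst.attrs["standard_name"]
print(same_name, same_quantity)  # True False
```

A tool comparing the two by name alone would happily average an air temperature with a sea surface temperature; a tool that checks the semantic attribute will refuse.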

Why structural interoperability is not enough


Structural interoperability ensures that:

  • dimensions are explicit,

  • arrays align,

  • metadata is machine-readable.

However, structure does not define meaning.

Example:

A NetCDF file may be perfectly readable by xarray. Variables may have dimensions (time, lat, lon).

Units may be present. Yet machines still cannot know what physical quantity is represented, at which reference height or depth, or whether values are comparable across datasets.

This gap is addressed by semantic conventions, not file formats.

Semantic interoperability via CF Conventions


The Climate and Forecast (CF) Conventions define a shared semantic layer on top of NetCDF’s structural model.

CF specifies, among others:

  • Standard names: Controlled vocabulary linking variables to formally defined physical quantities (e.g. air_temperature, sea_surface_temperature)

  • Units: Enforced through UDUNITS-compatible expressions

  • Coordinate semantics: Meaning of vertical coordinates, bounds, and reference systems

  • Grid mappings and projections: Explicit spatial reference information

  • Relationships between variables: For example, how bounds, auxiliary coordinates, or cell methods relate to data variables

By adhering to CF, datasets become semantically interoperable:

  • A NetCDF file without CF can be structurally interoperable, but it is semantically ambiguous.

  • A NetCDF file with CF becomes interpretable across tools, domains, and time.
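As a sketch of what this buys in practice, the (hypothetical) helper below selects variables by their CF standard_name rather than by their local names, which vary between datasets. The attribute values are real CF standard names; the dataset itself is illustrative:

```python
import numpy as np
import xarray as xr

# A dataset whose variables carry CF standard names.
ds = xr.Dataset({
    "t2m": ("time", np.array([288.0]),
            {"standard_name": "air_temperature", "units": "K"}),
    "sst": ("time", np.array([291.0]),
            {"standard_name": "sea_surface_temperature", "units": "K"}),
})

def find_by_standard_name(ds, standard_name):
    """Select variables by their CF meaning, not by their local names."""
    return [name for name, var in ds.data_vars.items()
            if var.attrs.get("standard_name") == standard_name]

print(find_by_standard_name(ds, "air_temperature"))  # ['t2m']
```

This is the mechanism that lets automated workflows discover "the air temperature variable" in any CF-compliant file, regardless of whether the producer called it t2m, tas, or temperature.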

Community governance and ecosystem alignment


Semantic interoperability is not achieved by individual researchers alone.

CF Conventions are:

  • Developed and maintained by a broad scientific community

  • Reviewed, versioned, and openly governed

  • Adopted by major infrastructures and workflows

This shared semantic contract enables: Cross-dataset comparison, automated discovery and filtering and large-scale synthesis and reuse.

Challenge

Semantic interoperability: True or False?

Indicate whether each statement is True or False, and justify your answer.

  1. A NetCDF file with dimensions, variables, and units is semantically interoperable by default.

  2. CF standard names allow machines to distinguish between different kinds of “temperature”.

  3. Semantic interoperability mainly benefits human readers, not automated workflows.

  4. Two datasets using the same CF standard name can be compared without manual interpretation.

  5. Semantic interoperability can be achieved without community-agreed conventions.

  1. False. Structure alone does not define meaning; semantics require controlled vocabularies and conventions.

  2. True. CF standard names explicitly encode physical meaning, not just labels.

  3. False. Semantic interoperability is essential for automated discovery, comparison, and integration.

  4. True. Shared semantics enable machine-actionable comparability (subject to resolution and context).

  5. False. Semantic interoperability depends on community agreement, not individual interpretation.

CF compliance


A NetCDF file is considered CF-compliant when it adheres to the rules and conventions defined by the CF standard.

The IOOS Compliance Checker is a Python-based tool, also available as a web application, that evaluates NetCDF files against the CF Conventions. Its source code is available on GitHub.

Try the Compliance Checker

  1. Go to the IOOS Compliance Checker.

    • Explore the interface and options.
  2. Provide a valid remote OPeNDAP URL

    • A valid URL (an endpoint to the dataset, not an HTML page): https://opendap.4tu.nl/thredds/dodsC/IDRA/year/month/day/filename.nc
    • For example, use one of these sample datasets:
      • https://opendap.4tu.nl/thredds/dodsC/IDRA/2009/04/27/IDRA_2009-04-27_06-08_raw_data.nc
      • https://opendap.4tu.nl/thredds/dodsC/IDRA/2019/01/02/IDRA_2019-01-02_12-00_raw_data.nc (used in the previous episode)
    • Wrong URL (an HTML catalog page): https://opendap.4tu.nl/thredds/catalog/IDRA/2009/04/27/catalog.html?dataset=IDRA_scan/2009/04/27/IDRA_2009-04-27_06-08_raw_data.nc
  3. Click on Submit.

  4. Review and download the report

Key Points
  • Semantic interoperability ensures that data variables have shared, machine-actionable scientific meaning, not just readable structure.

  • Structural interoperability is necessary but insufficient for reliable comparison and reuse across datasets.

  • The CF Conventions provide a community-governed semantic layer on top of NetCDF through standard names, units, and coordinate semantics.

  • CF compliance enables automated discovery, comparison, and integration in climate and atmospheric science workflows.

  • Semantic interoperability depends on community-agreed conventions, not on file formats or variable names alone.

Content from Technical interoperability: Streaming protocols


Last updated on 2026-03-26 | Edit this page

Overview

Questions

  • What is technical interoperability?
  • What is the DAP (Data Access Protocol)?
  • How does OPeNDAP enable remote access without full download?
  • What happens when we open a remote NetCDF file using xarray.open_dataset()?
  • Why are streaming protocols essential for large-scale scientific workflows?

Objectives

By the end of this episode, learners will be able to:

  • Define technical interoperability in the context of scientific data infrastructures.
  • Explain how DAP enables interoperable machine-to-machine data access.
  • Access a remote NetCDF dataset via OPeNDAP using Python.
  • Perform server-side subsetting of variables and dimensions.
  • Distinguish between metadata access and actual data transfer.

What is technical interoperability?


Technical interoperability concerns machine-to-machine communication.

A system is technically interoperable when independent systems can exchange and access data through standardized protocols without manual intervention.

If structural interoperability answers:

“Can I read this file?”

Technical interoperability answers:

“Can I access and exchange this data across systems in a scalable way?”

This layer operates below semantics.
It is about transport, protocol, and infrastructure.

Examples include:

  • HTTP
  • REST APIs
  • OPeNDAP
  • OGC services

In scientific data infrastructures, technical interoperability enables remote analysis workflows.

Why file download is not scalable


Large scientific datasets (climate reanalysis, ocean models, satellite archives) often reach:

  • Tens of gigabytes
  • Terabytes
  • Petabytes

Downloading entire files:

  • Is inefficient
  • Consumes bandwidth
  • Duplicates storage
  • Breaks reproducibility pipelines

Modern workflows require:

  • Remote access
  • Server-side filtering
  • On-demand subsetting
  • Integration into automated pipelines

This is where streaming protocols become essential.

DAP and OPeNDAP


The Data Access Protocol (DAP) is a protocol designed to enable remote access to structured scientific data.

OPeNDAP is a widely adopted implementation of DAP.

DAP allows:

  • Access to metadata without full download
  • Server-side slicing (e.g., select time range, variable subset)
  • Transmission of only requested data

In practice, this means:

You interact with a dataset hosted on a remote server as if it were local — but only the necessary data is transferred.

This is technical interoperability in action.
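Under the hood, a DAP2 client builds plain URLs: appending .dds or .das to the dataset endpoint fetches structure and attribute metadata only, while a constraint expression on a .dods request asks the server for a slice of one variable. A small sketch of this URL construction (the dataset path is the one used later in this episode; the index ranges are arbitrary examples):

```python
# Sketch of the URLs a DAP2 client generates behind the scenes.
base = ("https://opendap.4tu.nl/thredds/dodsC/IDRA/2019/01/02/"
        "IDRA_2019-01-02_12-00_raw_data.nc")

structure_url  = base + ".dds"   # dataset structure (dimensions, types)
attributes_url = base + ".das"   # attribute metadata only
# Constraint expression [start:stride:stop]: only a 10x10 slice of
# spectrum_width is transferred, not the whole variable.
subset_url = base + ".dods?spectrum_width[0:1:9][0:1:9]"

print(subset_url)
```

Tools like xarray and pydap generate these requests for you; you never have to write them by hand.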

Hands-on: Accessing NetCDF via OPeNDAP in Python


We now move from concept to practice.

We will use:

  • xarray
  • A remote OPeNDAP endpoint
  • A NetCDF dataset hosted on a THREDDS server
  • Jupyter Lab

Step 1 – Open a remote dataset

  • Choose the appropriate environment for the lesson (see Setup)

  • Open a terminal and launch Jupyter Lab:

BASH


jupyter lab
  • Open a new notebook
  • Check installed libraries

PYTHON

import xarray as xr
  • Open a dataset

PYTHON


url = "https://opendap.4tu.nl/thredds/dodsC/IDRA/2019/01/02/IDRA_2019-01-02_12-00_raw_data.nc"

ds = xr.open_dataset(url, engine="pydap")

ds

Observe:

  • The dataset structure loads immediately.

  • Dimensions and metadata are visible.

  • The file has not been fully downloaded.

What happened?

Only metadata and coordinate information were accessed.

Step 2 – Select a variable

PYTHON

ds["spectrum_width"] # still no full download, just metadata

Step 3 – Perform server-side subsetting

  • Now let's select the variable “spectrum_width” using positional indexing and take a 10×10 subset along two dimensions. The selection itself is still lazy; actual data transfer happens only when values are requested.

PYTHON


ds["spectrum_width"].isel(time_processed_data=slice(0,10),range=slice(0,10))
  • Now let's print the values of this subset (this triggers the actual data transfer)

PYTHON


ds["spectrum_width"].isel(time_processed_data=slice(0,10), range=slice(0,10)).values  # print the values to the screen
  • Slicing by the names of the dimensions

PYTHON


ds["spectrum_width"].sel(
    time_processed_data=slice("2019-01-02T12:00:00.000000000", "2019-01-02T12:00:02.097152173"),
    range=slice(0, 1000)
)
  • Using head

PYTHON


ds["spectrum_width"].head()
ds["spectrum_width"].head(time_processed_data=10)
ds["spectrum_width"].head(range=2)
ds["spectrum_width"].head(range=2).to_pandas() # tabular view

PYTHON


ds["spectrum_width"].isel(time_processed_data=0).values  # one radar profile (1D slice)

ds["spectrum_width"].isel(range=1).values  # one time series (fixed range bin)

Now actual data transfer occurs — but only for:

  • One variable

  • A limited time window

This is server-side subsetting enabled by DAP.

Step 4 – Plotting a profile

PYTHON


import matplotlib.pyplot as plt 

ds["spectrum_width"].isel(time_processed_data=0).plot()

ds["spectrum_width"].head(range=10).plot()

You have multiple equivalent ways to express the same operation:

  • .isel() → positional slicing (used above)
  • .sel() → coordinate-aware slicing
  • .head() → quick inspection
  • .values → raw data extraction
  • .plot() → visual interpretation

Relevance for research workflows


Streaming protocols enable:

  • Scalable climate analysis (ERA5, CMIP6)

  • AI/ML training pipelines

  • Reproducible notebooks

  • Cloud-based workflows

  • Data repository integration

Technical interoperability ensures that:

Data repositories are not only storage systems, they become computational infrastructure.

Challenge

Technical interoperability — True or False?

Indicate whether each statement is True or False and justify your answer.

  1. Opening a remote dataset with xarray.open_dataset() automatically downloads the entire file.

  2. DAP enables server-side filtering before data transfer.

  3. Streaming protocols replace the need for structural interoperability.

  4. OPeNDAP works independently of file formats.

  5. Technical interoperability enables automated workflows across infrastructures.

  1. False. Only metadata is accessed initially; data is transferred upon explicit selection.

  2. True. Subsetting occurs on the server before transmission.

  3. False. Technical interoperability depends on structural interoperability.

  4. False. DAP operates on structured data models (e.g., NetCDF).

  5. True. It enables scalable machine-to-machine access.

Key Points
  • Technical interoperability enables machine-to-machine data exchange through standardized protocols.

  • OPeNDAP implements the DAP protocol for remote access to structured scientific datasets.

  • Remote datasets can be explored without full download.

  • Server-side subsetting reduces bandwidth and supports scalable workflows.

  • Streaming protocols transform data repositories into interoperable computational infrastructure.

Content from Technical interoperability: API


Last updated on 2026-03-25 | Edit this page

Overview

Questions

  • What is technical interoperability in research data infrastructures?
  • What is a REST API?
  • How do APIs enable machine-to-machine workflows?
  • How do APIs depend on structural and semantic interoperability?
  • How can we programmatically manage datasets using the 4TU.ResearchData API?

Objectives

By the end of this episode, learners will be able to:

  • Define APIs as mechanisms of technical interoperability.
  • Explain core REST concepts (HTTP methods, endpoints, JSON, authentication).
  • Interact with a repository API using curl.
  • Create and manage dataset metadata programmatically.
  • Understand the lifecycle of API-driven data publication.

Technical interoperability and APIs


Technical interoperability concerns how systems communicate.

While structural interoperability ensures that data follow predictable formats (e.g. NetCDF arrays and dimensions) and semantic interoperability ensures shared meaning (e.g. CF conventions), technical interoperability ensures that software systems can reliably exchange data and metadata without human intervention. In practice, technical interoperability is achieved through standardized protocols, of which APIs are the most prominent example.

An API (Application Programming Interface) defines how one system can request services or data from another system in a precise, machine-readable way.

APIs enable:

  • Automated data retrieval

  • Programmatic publication of datasets

  • Distributed processing pipelines

  • Machine-to-machine workflows

  • Cross-institutional integration of infrastructures

In short:

APIs operationalize technical interoperability.

REST APIs: core concepts


A REST API is an application programming interface (API) that conforms to the design principles of the representational state transfer (REST) architectural style, a style used to connect distributed hypermedia systems. REST APIs are sometimes referred to as RESTful APIs or RESTful web APIs.

Most modern research data infrastructures expose REST APIs, which rely on widely adopted web standards.

An API defines:

  • Endpoints (resources)
  • Methods (actions)
  • Representations (JSON)
  • Authentication (identity)

In the case of repositories, APIs transform repositories from storage platforms into programmable infrastructure.

Key concepts include:

  • HTTP (HyperText Transfer Protocol) as the transport protocol

  • JSON (JavaScript Object Notation) as a structured, machine-readable representation of metadata

  • Stable identifiers for datasets and resources

  • Versioning to support long-term reuse and evolution of services

  • Self-describing endpoints, where responses contain sufficient metadata to be interpreted by machines

  • REST methods:

    • GET – retrieve data or metadata

    • POST – create new resources

    • PUT / PATCH – update existing resources

    • DELETE – remove resources
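
A GET request to a dataset endpoint returns a JSON document that a script can parse without human help. The record below is invented for illustration (the field names are not the exact 4TU.ResearchData schema, though the UUID is the one used later in this episode); the parsing pattern itself is generic:

```python
import json

# Hypothetical API response for one dataset record.
response_body = """
{
  "uuid": "03c249d6-674c-47cf-918f-1ef9bdafe749",
  "title": "IDRA radar raw data",
  "doi": "10.4121/example",
  "published_date": "2019-01-02"
}
"""

record = json.loads(response_body)     # JSON text -> Python dict
print(record["title"], record["doi"])  # fields are machine-addressable
```

Because the representation is structured, an automated pipeline can extract identifiers, dates, and titles from thousands of records without any manual inspection.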

Relation to structural and semantic interoperability

APIs do not operate in isolation. APIs depend on structural interoperability: JSON responses must follow well-defined schemas.

APIs depend on semantic interoperability: Metadata fields, vocabularies, and controlled terms ensure that machines interpret content consistently.

Without structural and semantic agreement, an API may be technically functional but scientifically meaningless.

Relevance of APIs for climate and atmospheric sciences

Climate and atmospheric research is inherently computational and distributed:

  • Data volumes are large and continuously growing

  • Analyses are increasingly automated

  • Workflows span institutions, models, sensors, and repositories

  • Reproducibility requires programmatic access

APIs make it possible to build end-to-end interoperable workflows, from data acquisition to publication and reuse, without manual intervention.

Challenge

API: True or False?

  1. An API guarantees semantic interoperability.

  2. A REST API always uses JSON.

  3. HTTP methods correspond to resource operations.

  4. Without stable identifiers, APIs break reproducibility.

  5. APIs replace the need for structured data formats.

  1. False — semantics depend on shared vocabularies.

  2. False — JSON is common but not required.

  3. True.

  4. True.

  5. False — APIs depend on structure.

Hands-on: 4TU.ResearchData REST API


The 4TU.ResearchData repository provides a REST API that allows programmatic access to its datasets and metadata. This enables researchers to integrate data publication and retrieval into their automated workflows.

The documentation for the 4TU.ResearchData REST API can be found at: https://djehuty.4tu.nl/

Query datasets using the 4TU.ResearchData API

Get datasets or software deposited in 4TU (via curl)

BASH


curl -X GET "https://data.4tu.nl/v2/articles" | jq

What is curl?

curl stands for Client URL.

It’s a command-line tool that allows you to transfer data to or from a server using various internet protocols, most commonly HTTP and HTTPS.

It is especially useful for making API requests — you can send GET, POST, PUT, DELETE requests, upload or download files, send headers or authentication tokens, and more.

Why curl works for APIs

REST APIs are based on the HTTP protocol, just like websites. When you visit a webpage, your browser sends a GET request and displays the HTML it gets back. When you use curl, you do the same thing, but in your terminal. For example:

curl https://data.4tu.nl/v2/articles

This sends an HTTP GET request to the 4TU.ResearchData API.

Key reasons why curl is used:

  • It’s built into most Linux/macOS systems and easily installable on Windows.

  • Scriptable: usable in bash scripts, notebooks, and automation.

  • Supports headers, query parameters, tokens, POST data, etc.

  • Can output to files (>, -o, -O) or pipe to processors like jq.

How to download a specific file using curl

Command               Behavior
curl URL              Prints file to screen (no saving)
curl -O URL           Downloads and saves with original name
curl -o filename URL  Downloads and saves with custom name
curl -L -O URL        Follows redirects and saves file
curl -C - -O URL      Resumes an interrupted download

Add parameters to the same endpoint to filter results

Challenge

Practicing API calls with curl

  1. Display on screen the metadata of 2 datasets published since May 1st, 2025, using curl and jq to format the output.
  2. Save the metadata of 2 datasets published since May 1st, 2025, to a file called data.json in the current directory using curl.
  3. Display on screen the metadata of 10 software items published since January 1st, 2025.

BASH


curl "https://data.4tu.nl/v2/articles?limit=2&published_since=2025-05-01" | jq

curl "https://data.4tu.nl/v2/articles?limit=2&published_since=2025-05-01" > data.json

curl "https://data.4tu.nl/v2/articles?item_type=9&limit=10&published_since=2025-01-01" | jq

Get information per dataset ID

BASH


curl "https://data.4tu.nl/v2/articles/03c249d6-674c-47cf-918f-1ef9bdafe749" | jq # /v2/articles/uuid 

Get all the files per dataset ID

BASH


curl "https://data.4tu.nl/v2/articles/03c249d6-674c-47cf-918f-1ef9bdafe749/files" | jq # /v2/articles/uuid/files

Search Datasets by Keyword

BASH


curl --request POST  --header "Content-Type: application/json" --data '{ "search_for": "atmospheric" }' https://data.4tu.nl/v2/articles/search | jq

BASH


curl --request POST  --header "Content-Type: application/json" --data '{ "search_for": "netcdf" }' https://data.4tu.nl/v2/articles/search | jq

The 4TU.ResearchData API also supports dataset creation, metadata updates, file upload, and submission for review. For more information, visit the documentation page djehuty.4tu.nl.

Key Points
  • APIs operationalize technical interoperability by enabling standardized machine-to-machine interaction.

  • REST APIs use HTTP methods, predictable endpoints, JSON representations, stable identifiers, and authentication mechanisms.

  • APIs depend on structural interoperability (schemas) and semantic interoperability (controlled vocabularies).

  • Command-line tools such as curl provide direct access to API functionality and enable automation.

  • The 4TU.ResearchData API supports full dataset lifecycle management: discovery, creation, metadata update, file upload, and submission for review.

Content from Cloud-Native Layouts


Last updated on 2026-03-30 | Edit this page

Overview

Questions

  • What does “cloud-native” mean in the context of scientific data?
  • Why can NetCDF struggle in cloud environments?
  • How is Zarr different from NetCDF?
  • Which part of interoperability is affected by cloud-native layouts?

Objectives

  • Explain what makes a data format cloud-native.
  • Compare NetCDF and Zarr from a cloud-access perspective.
  • Identify how cloud-native layouts influence structural interoperability.
  • Create a virtual Zarr dataset from NetCDF using Kerchunk.

Cloud-Native Layouts


What Does “Cloud-Native” Mean?

Cloud-native data layouts are designed for:

  • Object storage (e.g., S3-compatible systems)
  • Access over HTTP
  • Parallel reads
  • Loading only the pieces of data you need (lazy access)

In climate science, datasets such as:

  • ERA5 (ECMWF reanalysis dataset)
  • CMIP6 (climate model intercomparison dataset)

are often terabytes to petabytes in size.

Typical workflows include:

  • Reading a single variable
  • Selecting one time slice
  • Extracting a spatial subset
  • Repeating this many times (e.g., for machine learning)

A cloud-native layout makes these repeated small reads efficient.

NetCDF vs Zarr (Cloud Perspective)


NetCDF

NetCDF (Network Common Data Form) is a widely used scientific data format.

Designed for:

  • HPC systems
  • Large files on shared storage
  • Sequential or file-based access

In the cloud:

  • Stored as a single binary file
  • Harder to parallelize over HTTP
  • Repeated slicing can be inefficient

NetCDF provides strong structural interoperability in traditional computing environments,
but it is not optimized for object storage systems.

Zarr

Zarr is a chunked, cloud-optimized array storage format.

Designed for:

  • Object storage
  • Many small chunks
  • Parallel HTTP access

Data is stored as:

  • Small chunk files
  • JSON metadata
  • A directory-like structure compatible with object storage

Advantages in the cloud:

  • Read only the chunks you need
  • Many workers can read simultaneously
  • Efficient for repeated slicing

Zarr is considered cloud-native because it:

  1. Reduces unnecessary data movement
  2. Enables scalable parallel processing
  3. Supports interactive analysis of very large datasets
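The saving can be made concrete with a little chunk arithmetic (pure Python, no Zarr required; the array length and chunk size below are invented for illustration). A small slice touches only a few chunk files, so only those need to be fetched:

```python
import math

def chunks_touched(start, stop, chunk_size):
    """Indices of the chunks overlapped by the slice [start, stop)."""
    return list(range(start // chunk_size, math.ceil(stop / chunk_size)))

# Example: one dimension of length 8760 (hourly values, one year),
# stored in chunks of 730 values (roughly one month each).
chunk_size = 730

# Reading a 48-hour window touches a single chunk...
print(chunks_touched(100, 148, chunk_size))        # [0]
# ...while the full year would require all 12 chunks.
print(len(chunks_touched(0, 8760, chunk_size)))    # 12
```

The same arithmetic generalizes to multiple dimensions, which is why repeated small reads over HTTP become efficient when the layout is chunked.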

What changes in Interoperability?


Cloud-native layouts mainly affect structural interoperability.

They change:

  • How data is physically organized
  • How it is accessed
  • How scalable it is

They do not automatically change:

  • Variable names
  • Units
  • Coordinate conventions

That belongs to semantic interoperability, which still relies on:

  • CF conventions
  • Agreed metadata standards

So:

  • NetCDF → structural interoperability (file-based)
  • Zarr → structural interoperability (cloud-native)

Both can support semantic interoperability, but only if metadata conventions are respected.

Converting a dataset from file-based to cloud-native (hands-on session)


In this session, we will use Kerchunk to create a virtual Zarr dataset from an existing NetCDF file.

This allows us to access the dataset in a cloud-native way without modifying or copying the original data.

NetCDF → Virtual Zarr with Kerchunk

Goal

Create a cloud-native representation of an existing NetCDF file
without rewriting or duplicating the data.

What is Kerchunk?

Kerchunk creates a virtual Zarr dataset from existing formats such as NetCDF or HDF5.

Instead of converting data, Kerchunk:

  • Reads the original file structure
  • Maps byte ranges to Zarr-style chunk references
  • Writes a small JSON reference file

Result:

  • The original file remains unchanged
  • The JSON acts as a reference layer
  • No data duplication
  • No heavy conversion
  • Immediate cloud-compatible access
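
The reference file itself is plain JSON. Schematically (simplified: real Kerchunk output contains more keys, and the chunk key, byte offsets, and lengths below are invented), it maps each Zarr chunk key to a byte range in the original file:

```python
# Simplified sketch of a Kerchunk reference mapping (version 1 layout).
# Each chunk key maps to [url, byte_offset, byte_length]; Zarr metadata
# keys are inlined as JSON strings. All numbers here are invented.
reference = {
    "version": 1,
    "refs": {
        ".zgroup": "{\"zarr_format\": 2}",
        "spectrum_width/0.0": [
            "https://opendap.4tu.nl/thredds/fileServer/IDRA/2019/01/02/"
            "IDRA_2019-01-02_12-00_raw_data.nc",
            20480,   # byte offset into the original NetCDF file
            16384,   # number of bytes in this chunk
        ],
    },
}

# A Zarr-aware reader uses these entries to issue HTTP range requests.
url, offset, length = reference["refs"]["spectrum_width/0.0"]
print(f"fetch bytes {offset}-{offset + length - 1} from {url}")
```

This is why the JSON stays tiny: it stores pointers into the original file, never the data itself.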

Different ways of accessing remote data

Approach   What happens                                                      Who does the work
OPeNDAP    Server interprets the dataset and sends subsets                   Server
Kerchunk   Client reconstructs dataset structure and reads chunks directly   Client

Step 1: Create a Kerchunk reference

  • Open JupyterLab in your project folder
  • Continue in your notebook or create a new one

PYTHON

import json
from kerchunk.netCDF3 import NetCDF3ToZarr

In this step, we are not converting data. We are creating a JSON file that describes how to access the data.

We are transforming how data is accessed, not the data itself.

PYTHON

# Direct file endpoint (raw file access, not OPeNDAP)
file_url = "https://opendap.4tu.nl/thredds/fileServer/IDRA/2019/01/02/IDRA_2019-01-02_12-00_raw_data.nc"

# Inspect the NetCDF file and build a mapping:
# Zarr chunks → byte ranges in the original file
ref = NetCDF3ToZarr(file_url, inline_threshold=100).translate()

# Save the mapping as JSON (no data stored here)
with open("idra_ref.json", "w") as f:
    json.dump(ref, f)

Step 2: Open as a virtual Zarr dataset

xarray behaves as if it is reading a Zarr dataset, but data is still coming from the original NetCDF file.

PYTHON

import xarray as xr

# Kerchunk reads metadata locally from the JSON file, and retrieves data lazily from the original remote file only when needed.
ds_ref = xr.open_dataset(
    "idra_ref.json",
    engine="kerchunk",
    storage_options={
        "remote_protocol": "https",
        "remote_options": {
            "asynchronous": True,
        },
    },
)

ds_ref

Step 3: Inspect structure and metadata

The semantics remain the same.

PYTHON

ds_ref.dims
ds_ref.variables
ds_ref.attrs

Only the access pattern has changed, not the data meaning.

Step 4: Perform lazy slicing

PYTHON

# Select a subset (only required chunks are accessed)
subset = ds_ref.isel(time_raw_data=0)

subset
  • No full dataset is loaded
  • Only relevant chunks are accessed
  • Access is lazy and chunk-based

Conceptual comparison

Feature              OPeNDAP                      Kerchunk
Access type          Protocol-based               Storage-based
Endpoint             /dodsC/                      /fileServer/
Who interprets data  Server                       Client
Data model           File-oriented                Chunk-oriented
Scalability          Limited by server            Scales with client + cloud
Parallel access      Limited                      Natural (chunk-based)
Reusability          Low                          High (JSON reusable)
Setup complexity     Low                          Medium
Performance          Server + network dependent   Chunk-wise, parallel access

When to use what?


Use OPeNDAP when:

  • You want quick remote access
  • You rely on existing data services (e.g., THREDDS)
  • You are doing exploratory analysis
  • You do not control data storage

Typical use: browsing datasets, small analyses

Use Kerchunk when:

  • You want cloud-native workflows

  • You need scalable or parallel processing

  • You work with large datasets

  • You integrate with:

    • zarr
    • dask
    • object storage (S3, etc.)

Typical use: large-scale analysis, pipelines, reproducible workflows

Key Points
  • Cloud-native layouts are optimized for object storage and HTTP access.
  • NetCDF works well on HPC systems but is not optimized for cloud-native environments.
  • Zarr stores data in chunks, enabling efficient parallel access.
  • Kerchunk enables cloud-native access to NetCDF without data duplication.
  • Kerchunk changes the access pattern, not the data itself.
  • Cloud-native layouts affect structural interoperability, while semantic interoperability depends on metadata standards such as CF conventions.

Content from Interoperable Infrastructure in the AI Era


Last updated on 2026-03-30 | Edit this page

Overview

Questions

  • What does “AI-ready” mean in the context of climate data infrastructures?
  • Why is interoperability a prerequisite for trustworthy AI?
  • Which infrastructural components enable AI at scale?

Objectives

  • Explain what makes a data infrastructure AI-ready.
  • Connect AI requirements to structural, semantic, and technical interoperability.
  • Identify infrastructural components that enable scalable and reproducible AI workflows.

Why talk about AI in an interoperability course?

Artificial Intelligence and machine learning are increasingly applied to:

  • Climate simulations
  • Earth observation data
  • Extreme event prediction
  • Downscaling and bias correction
  • Environmental monitoring

However, AI systems do not operate on raw data alone.
They depend on infrastructure, and that infrastructure must be interoperable.

Without interoperability, AI pipelines become:

  • Fragile
  • Non-reproducible
  • Difficult to scale

At its core, AI requires machine-actionable data ecosystems.

What does AI need from data infrastructure?

AI workflows in climate science typically require:

1. Large-scale multidimensional datasets

  • NetCDF or Zarr
  • High spatial and temporal resolution
  • Petabyte-scale archives

2. Consistent semantic metadata

  • CF-compliant variables
  • Clear units
  • Well-defined coordinate systems
  • Machine-readable descriptions

3. Cloud-native, chunked access

  • Efficient partial reads
  • Parallel loading
  • Compatibility with distributed compute

4. Discoverability

  • Structured catalogs
  • STAC-like metadata
  • Searchable and filterable resources

5. Stable programmatic access

  • Well-documented REST APIs
  • Persistent identifiers
  • Versioned datasets

AI systems are not just consumers of data.
They are automated pipelines that depend on consistency across all layers.

Typical challenges

Many repositories were not designed with AI in mind. Common obstacles include:

  • Data fragmentation across portals
  • Non-standard variable naming
  • Missing or inconsistent metadata
  • Download-only workflows (no API access)
  • Lack of dataset versioning
  • Poor documentation

These issues break:

  • Automation
  • Reproducibility
  • Cross-dataset integration

Exercise 1 — Think–Pair–Discuss (5 min)


Discussion

Challenge

You are designing an AI model to predict extreme rainfall events using multiple datasets from different repositories.

Think (1 min):
What could go wrong if the datasets are not interoperable?

Pair (2 min):
Compare your answers with a partner. Identify:
  • One structural issue
  • One semantic issue
  • One technical issue

Discuss (2 min):
Share examples with the group.

Typical issues include:

  • Structural: incompatible formats (NetCDF vs CSV vs proprietary formats)
  • Semantic: different variable names (precip, rainfall, tp) or inconsistent units
  • Technical: no API access, requiring manual downloads

These prevent automated pipelines and introduce hidden errors in AI models.

Key elements of an AI-ready interoperable infrastructure


An AI-ready infrastructure builds on three layers of interoperability:

Structural interoperability

  • Community formats (NetCDF, Zarr)
  • Cloud-native layouts
  • Chunked multidimensional storage

Semantic interoperability

  • CF conventions
  • Controlled vocabularies
  • Standard coordinate systems
  • Clear provenance metadata

Technical interoperability

  • REST APIs
  • STAC catalogs
  • Persistent identifiers (DOI, URIs)
  • Authentication mechanisms
  • Dataset versioning

When these layers align, AI systems can:

  • Discover datasets automatically
  • Load them efficiently
  • Interpret variables correctly
  • Combine sources consistently
  • Reproduce experiments

Exercise 2 — True or False (5 min)


Discussion

Challenge

Decide whether the following statements are True or False.

  1. AI models only require large datasets; metadata is optional.
  2. NetCDF files are automatically AI-ready without additional standards.
  3. APIs are essential for scalable AI workflows.
  4. Interoperability mainly affects data sharing, not AI performance.
  5. Dataset versioning is important for reproducibility in AI.
  1. False — Metadata is critical for interpretation and correct model input.
  2. False — Standards like CF conventions are needed for semantic clarity.
  3. True — APIs enable automation and scalable access.
  4. False — Interoperability directly impacts model reliability and integration.
  5. True — Versioning ensures experiments can be reproduced.

Interoperability enables AI

Interoperability determines whether AI workflows are:

  • Efficient — scalable data loading and processing
  • Reproducible — same dataset, same version, same metadata
  • Integrable — multiple datasets combined coherently
  • Trustworthy — transparent provenance and standards

AI performance is not only about model architecture.
It is equally about data quality and infrastructure design.

Example: FAIR Earth Observation initiatives

Projects such as FAIR-EO (FAIR Open and AI-ready Earth Observation resources)
aim to align:

  • FAIR principles
  • Earth observation standards
  • AI-ready infrastructures

The focus is not just making data open, but making it machine-actionable at scale.

Key Points
  • AI-ready infrastructures require interoperable data layers.
  • Structural, semantic, and technical interoperability jointly enable AI workflows.
  • Cloud-native formats and consistent metadata are essential for scalable AI.
  • APIs, catalogs, identifiers, and versioning ensure reproducibility and automation.
  • AI reliability depends as much on infrastructure design as on model quality.