ethord

Publishing Open Metadata for Open Research Data Projects of the ETH Domain

Authors	Affiliation
Lars Schöbitz	Global Health Engineering - ETH Zurich
Nicoló Massari	Global Health Engineering - ETH Zurich
Prof. Elizabeth Tilley	Global Health Engineering - ETH Zurich

Published: January 19, 2026

# Load required packages for data analysis with ethord R data package
library(ggthemes)
library(tidyverse)
library(ggtext)
library(gt)
library(ethord)

1 Introduction

Disclaimer

This manuscript presents an analysis of ETH Board Open Research Data (ORD) Program metadata and is currently a work in progress. It serves as an internal reporting tool during the data collection and validation phase. The data is not yet complete and has not been fully validated. Results and interpretations should be considered preliminary and subject to revision as additional project data becomes available and quality checks are completed.

1.1 ETH Domain Open Research Data (ORD) Program

The ETH Domain Open Research Data (ORD) Program represents a significant investment in advancing open research practices across Swiss federal institutions. This report provides a draft analysis of data extracted from 76 out of 96 funded projects, focusing on the program’s structure, budget distribution, and metadata accessibility.

1.2 Measure 1: Calls for Field Specific Actions

The primary goal of the measure is to support ETH researchers in engaging in and developing ORD practices, and in becoming ORD leaders in their fields.

The program has funded 96 projects with a total investment of 15 million CHF.

2 Metadata Infrastructure

2.1 The ORD Portal

The ETH Domain maintains an ORD portal that showcases funded projects. While the portal provides basic information, there are significant opportunities for improvement in data accessibility and structured metadata provision.

2.1.1 Current Portal Limitations

The current portal has several limitations:

Portal shows only titles, abstracts, institutions, and applicant names
No structured bulk data or programmatic access
Limited visibility of reports and outputs
No systematic tracking of project outcomes

Portal highlights showing basic metadata

2.2 Distribution of Projects Across Institutions

The following visualization shows how projects are distributed across institutions and project categories using the ethord package. It’s a first example of how we can use the structured metadata to analyze the program’s reach and impact.

application_metadata |>
  mutate(category = case_when(
    project_category == "Contribute" ~ "Contribute (30k)",
    project_category == "Explore" ~ "Explore (150k)",
    project_category == "Establish" ~ "Establish (1.5m)"
  )) |>
  mutate(category = factor(category,
                           levels = c("Contribute (30k)",
                                      "Explore (150k)",
                                      "Establish (1.5m)"))) |>
  count(main_applicant_institution, category) |>
  mutate(main_applicant_institution = str_wrap(main_applicant_institution, width = 30)) |>
  ggplot(aes(x = fct_reorder(main_applicant_institution, n),
             y = n,
             fill = category)) +
  geom_col(position = "dodge") +
  geom_label(aes(label = n),
             position = position_dodge(width = 0.9),
             show.legend = FALSE,
             color = "white",
             fontface = "bold",
             size = 3) +
  coord_flip() +
  labs(
    title = "Open Research Data Program of the ETH Board",
    subtitle = "Number of funded projects per institution of lead applicant and project category",
    y = "Number of projects",
    x = NULL,
    fill = "Project category:"
  ) +
  scale_fill_colorblind() +
  theme_minimal(base_size = 10) +
  theme(panel.grid.major.y = element_blank(),
        axis.text.y = element_text(size = 8))

Figure 1: Number of funded projects per institution of lead applicant and project category

Data from: Massari, Schöbitz, and Tilley (2025)

3 Non-Public Metadata Challenges

3.1 Information Gap

While proposals, scientific reports, and lists of outputs contain valuable information, this data is not publicly available as open, structured, machine-readable data. This creates several challenges:

3.1.1 Key Questions We Cannot Answer

Without structured metadata, we cannot easily answer questions such as:

How were budgets distributed among their cost categories?
How many publications are derived from these projects?
How many ORD datasets have been published?

3.1.2 Limitations and Consequences

The primary limitations stem from confidentiality requirements:

Proposals, scientific reports, and lists of outputs are confidential
Only available to EPFL Research Office
Protected as intellectual property by researchers

This has several consequences:

Limits discoverability and impact assessment
Reviewers have privileged access, while the public cannot evaluate program effectiveness
Contradicts the goal of helping researchers become ORD leaders

3.1.3 The Solution

Our approach to addressing these challenges:

Ask all applicants for permission to extract metadata from project documentation
Apply FAIR data sharing principles that we advocate for
Publish FAIR-compliant, DOI-assigned metadata dataset

4 The ethord R Data Package

4.1 Overview

The ethord R data package provides structured, open access to metadata from the ETH Domain ORD Program.

Package details:

Website: https://global-health-engineering.github.io/ethord/ (Massari, Schöbitz, and Tilley 2025)
9 datasets covering application and reporting phases
Open source development (GitHub)
Permissive license (CC-BY)
Assigned DOI (Zenodo)
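
As a minimal sketch of how a reader might check which datasets a data package ships, the helper below wraps base R's `data(package = )` listing. The function name `list_package_datasets()` is hypothetical and not part of ethord; the guard simply returns an empty vector when the package is not installed, so the snippet runs either way.

```r
# List the datasets shipped with an installed data package.
# Returns character(0) when the package is not installed.
# NOTE: list_package_datasets() is a hypothetical helper, not an ethord function.
list_package_datasets <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(character(0))
  }
  # data(package = ) returns a "packageIQR" object; its $results matrix
  # has one row per dataset and an "Item" column holding the dataset name.
  as.character(utils::data(package = pkg)$results[, "Item"])
}

datasets <- list_package_datasets("ethord")
if (length(datasets) == 0) {
  message("ethord is not installed; see the package website for installation instructions.")
} else {
  print(datasets)
}
```

The same call works for any installed package, for example `list_package_datasets("datasets")`.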

4.2 Current Status

4.2.1 Data extraction

Code was developed to extract data from PDFs
Metadata were extracted into .json files and then converted to .csv files
LLMs supported the extraction process
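
The .json to .csv step can be illustrated with a small base R sketch. It is not the authors' extraction code: the records, field names, and values below are invented, and the nested lists stand in for what a JSON parser would return. The sketch flattens each nested record into one row, fills fields that a record lacks with `NA`, and writes the result to a temporary CSV file.

```r
# Hypothetical records shaped like parsed JSON (one list per project).
# All field names and values are invented for illustration.
records <- list(
  list(project_id = "P001", title = "Example project A",
       budget = list(personnel = 100000, travel = 5000)),
  list(project_id = "P002", title = "Example project B",
       budget = list(personnel = 80000))  # no travel entry
)

# Recursively flatten a nested list; nested names are joined with "_".
flatten_record <- function(rec, prefix = NULL) {
  out <- list()
  for (name in names(rec)) {
    key <- paste(c(prefix, name), collapse = "_")
    value <- rec[[name]]
    if (is.list(value)) {
      out <- c(out, flatten_record(value, prefix = key))
    } else {
      out[[key]] <- value
    }
  }
  out
}

flat <- lapply(records, flatten_record)
all_cols <- unique(unlist(lapply(flat, names)))

# Build one-row data frames with identical columns, then stack them.
rows <- lapply(flat, function(x) {
  x[setdiff(all_cols, names(x))] <- NA  # fields missing from this record
  as.data.frame(x[all_cols], stringsAsFactors = FALSE)
})
metadata <- do.call(rbind, rows)

# Write to a temporary CSV file and read it back as a round-trip check.
csv_path <- tempfile(fileext = ".csv")
write.csv(metadata, csv_path, row.names = FALSE)
roundtrip <- read.csv(csv_path, stringsAsFactors = FALSE)
print(roundtrip)
```

Keeping missing fields as explicit `NA` values, rather than dropping them, makes gaps in the extracted data visible in the resulting table.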

4.2.2 Reproducibility

Generated data may not be 100% reproducible
Scripts are being rerun to evaluate differences between runs
Data quality checks are in progress to identify issues
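
One way to evaluate differences between two extraction runs is a field-by-field comparison keyed on the project identifier. The sketch below uses base R and invented data; `compare_runs()` is a hypothetical helper, and it assumes both runs contain the same columns.

```r
# Two hypothetical extraction runs over the same projects (invented values).
run_a <- data.frame(project_id = c("P001", "P002", "P003"),
                    total_budget_travel = c(5000, 0, 1200),
                    stringsAsFactors = FALSE)
run_b <- data.frame(project_id = c("P001", "P002", "P003"),
                    total_budget_travel = c(5000, 300, 1200),
                    stringsAsFactors = FALSE)

# Return one row per (project, field) whose value differs between runs.
# ASSUMPTION: both runs share the same column names.
compare_runs <- function(a, b, key = "project_id") {
  merged <- merge(a, b, by = key, suffixes = c("_a", "_b"))
  value_cols <- setdiff(names(a), key)
  diffs <- lapply(value_cols, function(col) {
    va <- merged[[paste0(col, "_a")]]
    vb <- merged[[paste0(col, "_b")]]
    # Two NAs count as equal; NA versus a value counts as a difference.
    same <- (is.na(va) & is.na(vb)) | (!is.na(va) & !is.na(vb) & va == vb)
    differs <- !same
    data.frame(id = merged[[key]][differs],
               field = rep(col, sum(differs)),
               run_a = as.character(va[differs]),
               run_b = as.character(vb[differs]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, diffs)
}

differences <- compare_runs(run_a, run_b)
print(differences)
```

Values are converted to character in the result so that differences from numeric and text fields can be stacked in a single table.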

4.2.3 Continued Development

The data package is still a work in progress
Data can be used, but with caution and disclaimers about current status
Future updates will improve data quality and completeness once all reports are submitted

4.2.4 Approach

We will reach out to project leads to:

verify extracted data, by providing them with a data overview per project
invite them to help keep data on project outputs up to date using GitHub issues (e.g. https://github.com/Global-Health-Engineering/ethord/issues/20)

4.3 Available Datasets

4.3.1 Application Phase Datasets (6 datasets)

application_metadata - Core project info, timeline, budget
application_budget - Detailed budget breakdown by category
application_ethics - Ethics considerations and flags
application_metadata_applicants - Co-applicant information
application_metadata_keywords - Research keywords
application_metadata_work_packages - Work package structure

4.3.2 Reporting Phase Datasets (3 datasets)

report_metadata - Project reports and updates
report_output - Project outputs and user reach
report_metadata_coapplicants - Co-applicant contributions

5 Budget Analysis

5.1 Budget Distribution by Cost Category

How is the budget distributed across different cost categories? We can answer this question using the structured metadata in the ethord package.

The results in Table 1 show that the program budget was largely used for personnel costs, representing approximately 84% of the total budget across all projects. This substantial investment in human resources reflects the program’s emphasis on capacity building and expertise development. An important consideration for future program planning is whether funded staff were already employed at the institutions or hired specifically for these projects. Additionally, any follow-up program should consider sustainability mechanisms, as many staff hired for these projects may face funding gaps when projects conclude, potentially limiting the long-term impact of developed expertise and infrastructure.

# Calculate total budgets by category
budget_summary <- application_budget |>
  summarise(
    Personnel = sum(total_budget_personnel_total_direct, na.rm = TRUE),
    Travel = sum(total_budget_travel, na.rm = TRUE),
    Equipment = sum(total_budget_equipment, na.rm = TRUE),
    `Other Direct` = sum(total_budget_other_total_direct, na.rm = TRUE),
    Subcontracting = sum(total_budget_subcontracting, na.rm = TRUE)
  ) |>
  pivot_longer(cols = everything(),
               names_to = "cost_category",
               values_to = "total_chf") |>
  filter(total_chf > 0) |>
  mutate(percent = total_chf / sum(total_chf) * 100) |>
  arrange(desc(total_chf))

# Display as formatted table
budget_summary |>
  gt() |>
  tab_header(
    title = "ORD Program Budget Distribution",
    subtitle = str_glue("Cost breakdown across {nrow(application_metadata)} funded projects")
  ) |>
  cols_label(
    cost_category = "Cost Category",
    total_chf = "Total (CHF)",
    percent = "Percentage"
  ) |>
  fmt_number(
    columns = total_chf,
    decimals = 0,
    use_seps = TRUE
  ) |>
  fmt_percent(
    columns = percent,
    decimals = 1,
    scale_values = FALSE
  ) |>
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.title.font.size = px(18),
    heading.subtitle.font.size = px(14)
  )

Table 1: ORD Program Budget Distribution by Cost Category

ORD Program Budget Distribution
Cost breakdown across 76 funded projects

Cost Category	Total (CHF)	Percentage
Personnel	6,701,489	83.6%
Other Direct	689,569	8.6%
Equipment	302,240	3.8%
Travel	181,770	2.3%
Subcontracting	143,000	1.8%

Data from: Massari, Schöbitz, and Tilley (2025)
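
The percentages in Table 1 can be reproduced directly from the reported totals. The short check below uses only the figures printed in the table, so it runs without the package.

```r
# Totals in CHF as printed in Table 1.
budget <- c(
  Personnel        = 6701489,
  `Other Direct`   = 689569,
  Equipment        = 302240,
  Travel           = 181770,
  Subcontracting   = 143000
)

total <- sum(budget)                       # overall total across categories
share <- round(budget / total * 100, 1)    # percentage share per category

print(total)
print(share)
```

The shares are computed relative to the sum of the five listed categories, which is the same denominator the table code uses.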

5.2 Budget Treemap Visualization

A treemap provides an alternative view of the budget distribution, with area proportional to spending.

application_budget_long <- application_budget |>
  select(project_id, !ends_with("total_direct"), -phase, -total_budget_total_costs) |>
  pivot_longer(cols = !project_id, names_to = "budget_item", values_to = "amount") |>
  filter(!is.na(amount), amount > 0) |>
  mutate(budget_item = str_remove(budget_item, "total_budget_")) |>
  mutate(
    main_category = case_when(
      str_starts(budget_item, "personnel_") ~ "Personnel",
      str_starts(budget_item, "other_") ~ "Other Costs",
      budget_item == "travel" ~ "Travel",
      budget_item == "equipment" ~ "Equipment",
      budget_item == "subcontracting" ~ "Subcontracting",
      TRUE ~ "Other Costs"
    ),
    sub_category = str_remove(budget_item, "^(personnel|other)_") |>
      str_replace_all("_", " ") |>
      str_to_title()
  )

library(treemap)

# Prepare data for treemap
treemap_data <- application_budget_long |>
  group_by(main_category, sub_category) |>
  summarise(total = sum(amount), .groups = "drop") |>
  mutate(
    label = str_glue("{sub_category}\nCHF {scales::comma(total, accuracy = 1)}")
  )

# Create treemap with annotations
treemap(
  treemap_data,
  index = c("main_category", "sub_category"),
  vSize = "total",
  vColor = "main_category",
  type = "categorical",
  palette = "Set2",
  title = "Budget Distribution Treemap",
  fontsize.labels = c(14, 10),
  fontcolor.labels = c("white", "black"),
  fontface.labels = c(2, 1),
  bg.labels = c("transparent"),
  align.labels = list(c("center", "center"), c("center", "center")),
  overlap.labels = 0.5,
  border.col = c("white", "gray90"),
  border.lwds = c(4, 2)
)

Data from: Massari, Schöbitz, and Tilley (2025)

5.3 Detailed Budget Summary Table

Table 2 provides a detailed breakdown of the budget allocation, with particular focus on the Personnel category. Within the personnel budget of approximately CHF 7.4 million, the distribution spans different career stages: Senior Staff accounts for 40.4% (CHF 3.0 million across 25 projects), Postdocs represent 30.9% (CHF 2.3 million across 29 projects), and the “Other” category, likely encompassing technical staff, research assistants, and other support roles, comprises 23.8% (CHF 1.8 million across 43 projects). Student positions received the smallest allocation at 4.8% (CHF 356,000 across 23 projects). This distribution pattern suggests that projects prioritized experienced researchers and professional staff.

# Create detailed budget table with grouping and totals
budget_table_data <- application_budget_long |>
  group_by(main_category, sub_category) |>
  summarise(total_chf = sum(amount), .groups = "drop") |>
  group_by(main_category) |>
  mutate(
    category_total = sum(total_chf),
    percent_of_category = total_chf / category_total * 100
  ) |>
  ungroup() |>
  arrange(desc(category_total), desc(total_chf))

budget_table_data |>
  gt(groupname_col = "main_category") |>
  tab_header(
    title = "Detailed Budget Breakdown",
    subtitle = "Data of 76/95 projects"
  ) |>
  cols_label(
    sub_category = "Subcategory",
    total_chf = "Amount (CHF)",
    percent_of_category = "% of Category"
  ) |>
  # Add summary rows for each group
  summary_rows(
    groups = everything(),
    columns = total_chf,
    fns = list(
      "Category Total" = ~sum(., na.rm = TRUE)
    ),
    fmt = ~fmt_number(., decimals = 0, use_seps = TRUE)
  ) |>
  # Add grand total row
  grand_summary_rows(
    columns = total_chf,
    fns = list(
      "Grand Total" = ~sum(., na.rm = TRUE)
    ),
    fmt = ~fmt_number(., decimals = 0, use_seps = TRUE)
  ) |>
  fmt_number(
    columns = total_chf,
    decimals = 0,
    use_seps = TRUE
  ) |>
  fmt_percent(
    columns = percent_of_category,
    decimals = 1,
    scale_values = FALSE
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_row_groups()
  ) |>
  tab_style(
    style = cell_fill(color = "gray95"),
    locations = cells_body(columns = everything(), rows = percent_of_category > 50)
  ) |>
  tab_style(
    style = list(
      cell_fill(color = "lightblue"),
      cell_text(weight = "bold")
    ),
    locations = cells_summary()
  ) |>
  tab_style(
    style = list(
      cell_fill(color = "steelblue"),
      cell_text(weight = "bold", color = "white")
    ),
    locations = cells_grand_summary()
  ) |>
  tab_options(
    table.font.size = px(11),
    heading.title.font.size = px(16),
    heading.subtitle.font.size = px(12),
    row_group.background.color = "lightblue",
    row_group.font.weight = "bold",
    summary_row.background.color = "lightblue"
  ) |>
  cols_hide(columns = category_total)

Table 2: Detailed Budget Breakdown by Category and Subcategory

Detailed Budget Breakdown
Data of 76/95 projects

	Subcategory	Amount (CHF)	% of Category
Personnel
	Senior Staff	2,979,824	40.4%
	Postdocs	2,280,533	30.9%
	Other	1,752,967	23.8%
	Students	355,750	4.8%
Category Total	—	7,369,074	—
Equipment
	Equipment	302,240	100.0%
Category Total	—	302,240	—
Other Costs
	Conferences Workshops	165,455	73.3%
	Publication Fees	35,250	15.6%
	Other	24,927	11.0%
Category Total	—	225,632	—
Travel
	Travel	181,770	100.0%
Category Total	—	181,770	—
Subcontracting
	Subcontracting	143,000	100.0%
Category Total	—	143,000	—
Grand Total	—	8,221,716	—

Data from: Massari, Schöbitz, and Tilley (2025)
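
Table 1 is computed from the `*_total_direct` columns, whereas Table 2 sums the individual line items, so the two sets of category totals need not agree. As a sketch of the kind of consistency check this enables, the snippet below compares the category totals printed in both tables. It uses only the published figures and does not attempt to explain any difference.

```r
# Category totals in CHF as printed in Table 1 (from *_total_direct columns).
# "Other" stands for "Other Direct" in Table 1 and "Other Costs" in Table 2.
table1_totals <- c(Personnel = 6701489, Other = 689569, Equipment = 302240,
                   Travel = 181770, Subcontracting = 143000)

# Category totals in CHF as printed in Table 2 (sum of line items).
table2_totals <- c(Personnel = 7369074, Other = 225632, Equipment = 302240,
                   Travel = 181770, Subcontracting = 143000)

# Per-category difference between the two tables (Table 2 minus Table 1).
difference <- table2_totals - table1_totals
print(difference)

# Personnel share of the grand total under each table.
personnel_share <- c(
  table1 = round(table1_totals[["Personnel"]] / sum(table1_totals) * 100, 1),
  table2 = round(table2_totals[["Personnel"]] / sum(table2_totals) * 100, 1)
)
print(personnel_share)
```

Categories with a non-zero difference are candidates for the data quality checks described in Section 4.2.2.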

6 The ethord Package Website

The ethord package has a comprehensive website with documentation, vignettes, and examples.

Visit the package website: global-health-engineering.github.io/ethord/

7 Future Possibilities

7.1 ETHZ FAIR Coalition Open Call

Recommendations for future program iterations:

Adopt “Open by Default” policies: Establish project metadata as openly accessible by default, with clear exceptions only for privacy or security concerns, aligning with Swiss federal open data requirements
Implement structured application forms: Transition from DOCX/PDF-based proposals to structured digital forms that capture metadata in machine-readable format from the application phase, reducing manual extraction effort and improving data quality
Standardize reporting mechanisms: Deploy structured forms for project reporting to ensure consistent, comparable data collection across all funded projects and enable real-time tracking of outputs and outcomes

7.2 Making Administrative Data “Open by Default”

How could we make such administrative data “open by default” in the future?

7.3 Swiss Open by Default Policy Framework

Switzerland has established a comprehensive framework for open data publication:

Federal Foundation: The Open Government Data Strategy 2019-2023 (Swiss Federal Council 2019) established the “open by default” principle for all federal agencies.

Legal Mandate: Article 10 of the Federal Act EMBAG (Swiss Federal Assembly 2024) legally requires open data publication unless restricted by privacy or security.

Implementation: The OGD Masterplan 2024-2027 (Federal Statistical Office 2024a) operationalizes this through:

Progressive data opening with documented exceptions
Quality standards aligned with FAIR principles (Wilkinson et al. 2016)
Centralized coordination via opendata.swiss (Federal Statistical Office 2024b)
Standardized metadata using DCAT-AP CH (Federal Statistical Office 2023)

Switzerland is unique in having a legal mandate for open by default, not just policy recommendations. This creates accountability and consistency across agencies. The FAIR alignment shows how government OGD principles directly apply to research data contexts.
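
As a rough illustration of what standardized metadata could look like for the ethord dataset itself, the sketch below describes it with property names from the DCAT and Dublin Core vocabularies and checks the record against a list of required properties. The values are taken from this report. The required list is an assumption for illustration only and is not the official set of mandatory DCAT-AP CH properties; `missing_properties()` is a hypothetical helper.

```r
# Dataset description using DCAT / Dublin Core property names.
# Values are taken from this report; no description text is supplied here.
dataset_record <- list(
  "dct:title"        = "Ethord: ETH Board Open Research Data (ORD) Program Project Metadata and Report Data",
  "dct:identifier"   = "https://doi.org/10.5281/zenodo.16563064",
  "dct:license"      = "CC-BY",
  "dcat:landingPage" = "https://global-health-engineering.github.io/ethord/"
)

# ASSUMPTION: illustrative required properties, not the official
# DCAT-AP CH mandatory set.
required_properties <- c("dct:title", "dct:description", "dct:identifier")

# Return the required properties that are absent or empty in a record.
missing_properties <- function(record, required) {
  filled <- vapply(record, function(v) length(v) > 0 && any(nzchar(v)), logical(1))
  setdiff(required, names(record)[filled])
}

print(missing_properties(dataset_record, required_properties))
```

A check like this could run before publication so that incomplete records are caught before they reach a catalogue.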

8 Conclusion

The ethord package demonstrates how structured, open metadata can enhance transparency and enable deeper analysis of research funding programs. By applying FAIR principles to administrative data, we create opportunities for better understanding program outcomes and researcher engagement with open research data practices.

8.1 Data Availability

All data and analyses presented in this report are available through the ethord R package (Massari, Schöbitz, and Tilley 2025):

Website: https://global-health-engineering.github.io/ethord/
GitHub: https://github.com/global-health-engineering/ethord
DOI: https://doi.org/10.5281/zenodo.16563064 (Zenodo)

9 References

Federal Statistical Office. 2023. “DCAT Application Profile for Data Portals in Switzerland (DCAT-AP CH).” https://www.dcat-ap.ch/.

———. 2024a. “Open Government Data Masterplan 2024-2027.” https://www.bfs.admin.ch/bfs/en/home/services/ogd.html.

———. 2024b. “Opendata.swiss: Swiss Open Government Data Portal.” https://opendata.swiss/.

Massari, Nicolo, Lars Schöbitz, and Elizabeth Tilley. 2025. “Ethord: ETH Board Open Research Data (ORD) Program Project Metadata and Report Data.” https://doi.org/10.5281/zenodo.16563064.

Swiss Federal Assembly. 2024. Federal Act on the Use of Electronic Means for the Fulfilment of Government Tasks (EMBAG). https://www.fedlex.admin.ch/eli/cc/2023/682/en.

Swiss Federal Council. 2019. “Open Government Data Strategy Switzerland 2019-2023.” https://www.admin.ch/gov/en/start/documentation/media-releases.msg-id-74641.html.

Wilkinson, Mark D et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data. https://doi.org/10.1038/sdata.2016.18.