ethord

Publishing Open Metadata for Open Research Data Projects of the ETH Domain

Published

January 19, 2026

# Load required packages for data analysis with ethord R data package
library(ggthemes)
library(tidyverse)
library(ggtext)
library(gt)
library(ethord)

1 Introduction

Note: Disclaimer

This manuscript presents an analysis of ETH Board Open Research Data (ORD) Program metadata and is currently a work in progress. It serves as an internal reporting tool during the data collection and validation phase. The data is not yet complete and has not been fully validated. Results and interpretations should be considered preliminary and subject to revision as additional project data becomes available and quality checks are completed.

1.1 ETH Domain Open Research Data (ORD) Program

The ETH Domain Open Research Data (ORD) Program represents a significant investment in advancing open research practices across Swiss federal institutions. This report provides a draft analysis of data extracted from 76 out of 96 funded projects, focusing on the program’s structure, budget distribution, and metadata accessibility.

1.2 Measure 1: Calls for Field Specific Actions

The primary goal of this measure is to support ETH researchers in engaging in and developing ORD practices, and in becoming ORD leaders in their fields.

The program has funded 96 projects with a total investment of CHF 15 million.

2 Metadata Infrastructure

2.1 The ORD Portal

The ETH Domain maintains an ORD portal that showcases funded projects. While the portal provides basic information, there are significant opportunities for improvement in data accessibility and structured metadata provision.

ETH Domain ORD Portal

2.1.1 Current Portal Limitations

The current portal has several limitations:

  • The portal shows only titles, abstracts, institutions, and applicant names
  • No structured bulk data or programmatic access
  • Limited visibility of reports and outputs
  • No systematic tracking of project outcomes

Portal highlights showing basic metadata

2.2 Distribution of Projects Across Institutions

The following visualization shows how projects are distributed across institutions and project categories using the ethord package. It’s a first example of how we can use the structured metadata to analyze the program’s reach and impact.

application_metadata |>
  mutate(category = case_when(
    project_category == "Contribute" ~ "Contribute (30k)",
    project_category == "Explore" ~ "Explore (150k)",
    project_category == "Establish" ~ "Establish (1.5m)"
  )) |>
  mutate(category = factor(category,
                           levels = c("Contribute (30k)",
                                      "Explore (150k)",
                                      "Establish (1.5m)"))) |>
  count(main_applicant_institution, category) |>
  mutate(main_applicant_institution = str_wrap(main_applicant_institution, width = 30)) |>
  ggplot(aes(x = fct_reorder(main_applicant_institution, n),
             y = n,
             fill = category)) +
  geom_col(position = "dodge") +
  geom_label(aes(label = n),
             position = position_dodge(width = 0.9),
             show.legend = FALSE,
             color = "white",
             fontface = "bold",
             size = 3) +
  coord_flip() +
  labs(
    title = "Open Research Data Program of the ETH Board",
    subtitle = "Number of funded projects per institution of lead applicant and project category",
    y = "Number of projects",
    x = NULL,
    fill = "Project category:"
  ) +
  scale_fill_colorblind() +
  theme_minimal(base_size = 10) +
  theme(panel.grid.major.y = element_blank(),
        axis.text.y = element_text(size = 8))
Figure 1: Number of funded projects per institution of lead applicant and project category

Data from: Massari, Schöbitz, and Tilley (2025)

3 Non-Public Metadata Challenges

3.1 Information Gap

While proposals, scientific reports, and lists of outputs contain valuable information, this data is not publicly available as open, structured, machine-readable data. This creates several challenges:

3.1.1 Key Questions We Cannot Answer

Without structured metadata, we cannot easily answer questions such as:

  • How were budgets distributed among their cost categories?
  • How many publications are derived from these projects?
  • How many ORD datasets have been published?

3.1.2 Limitations and Consequences

The primary limitations stem from confidentiality requirements:

  • Proposals, scientific reports, and lists of outputs are confidential
  • Only available to EPFL Research Office
  • Protected as intellectual property by researchers

This has several consequences:

  • Limits discoverability and impact assessment
  • Reviewers have privileged access while the public cannot evaluate program effectiveness
  • Contradicts goal of helping researchers become ORD leaders

3.1.3 The Solution

Our approach to addressing these challenges:

  • Ask all applicants for permission to extract metadata from project documentation
  • Apply the FAIR data sharing principles that we advocate
  • Publish FAIR-compliant, DOI-assigned metadata dataset

4 The ethord R Data Package

4.1 Overview

The ethord R data package provides structured, open access to metadata from the ETH Domain ORD Program.

Package details:

4.2 Current Status

4.2.1 Data extraction

  • Code was developed to extract data from PDFs
  • Metadata is extracted into .json files and then converted to .csv files
  • Support of LLMs in the extraction process
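The conversion step from .json to .csv can be sketched as follows. This is a minimal illustration rather than the extraction code used for the package: the record fields (`requested_chf` in particular) and the project identifiers are hypothetical, and the sketch assumes each extraction result is a flat JSON array of objects.

```r
library(jsonlite)

# Hypothetical extraction result for two projects; the field names are
# illustrative and are not taken from the ethord package.
json_text <- '[
  {"project_id": "P001", "project_category": "Explore", "requested_chf": 150000},
  {"project_id": "P002", "project_category": "Contribute", "requested_chf": 30000}
]'

# Write the JSON to a temporary file to mimic one extraction output
json_file <- tempfile(fileext = ".json")
writeLines(json_text, json_file)

# fromJSON() simplifies an array of flat objects into a data frame
records <- fromJSON(json_file)

# Store the records as CSV and read them back to check the round trip
csv_file <- tempfile(fileext = ".csv")
write.csv(records, csv_file, row.names = FALSE)
records_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
print(records_csv)
```

Nested JSON (for example, a list of co-applicants per project) would need to be flattened or split into separate tables before writing to CSV.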

4.2.2 Reproducibility

  • Generated data may not be 100% reproducible
  • In the process of rerunning scripts to evaluate differences
  • Working through data quality checks to identify issues
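One way to evaluate differences between two runs of the extraction scripts is to align their outputs on a shared key and list the rows that disagree. The sketch below uses made-up values and assumes that both runs produce one row per `project_id`; it is not the comparison procedure used for the package.

```r
# Two hypothetical extraction runs of the same projects. Values are made up
# for illustration; in practice each run would be read from its .csv output.
run_1 <- data.frame(
  project_id = c("P001", "P002", "P003"),
  total_budget_travel = c(5000, 0, 1200),
  stringsAsFactors = FALSE
)
run_2 <- data.frame(
  project_id = c("P001", "P002", "P003"),
  total_budget_travel = c(5000, 250, 1200),
  stringsAsFactors = FALSE
)

# Align both runs on project_id so that rows are compared like for like
both <- merge(run_1, run_2, by = "project_id", suffixes = c("_run1", "_run2"))

# Keep only the projects where the two runs disagree
differences <- both[both$total_budget_travel_run1 != both$total_budget_travel_run2, ]
print(differences)
```

With real data, missing values would need separate handling, because comparing `NA` with `!=` returns `NA` rather than `TRUE` or `FALSE`.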

4.2.3 Continued Development

  • The data package is still a work in progress
  • Data can be used, but with caution and disclaimers about its current status
  • Future updates will improve data quality and completeness once all reports are submitted

4.2.4 Approach

Reach out to project leads to:

4.3 Available Datasets

4.3.1 Application Phase Datasets (6 datasets)

  • application_metadata - Core project info, timeline, budget
  • application_budget - Detailed budget breakdown by category
  • application_ethics - Ethics considerations and flags
  • application_metadata_applicants - Co-applicant information
  • application_metadata_keywords - Research keywords
  • application_metadata_work_packages - Work package structure

4.3.2 Reporting Phase Datasets (3 datasets)

  • report_metadata - Project reports and updates
  • report_output - Project outputs and user reach
  • report_metadata_coapplicants - Co-applicant contributions
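Because the information is split across several tables, most analyses combine two or more of them. The sketch below uses toy stand-ins rather than the package data. The column names follow those used in this report, but the assumption that the tables share a `project_id` key should be checked against the package documentation.

```r
library(dplyr)

# Toy stand-ins for two package datasets. Column names follow those used in
# this report; that both tables share a project_id key is an assumption.
metadata_toy <- tibble(
  project_id = c("P001", "P002", "P003"),
  project_category = c("Explore", "Contribute", "Establish")
)
budget_toy <- tibble(
  project_id = c("P001", "P002", "P003"),
  total_budget_travel = c(5000, 0, 12000)
)

# Attach budget figures to each project, then total them per category
travel_by_category <- metadata_toy |>
  left_join(budget_toy, by = "project_id") |>
  group_by(project_category) |>
  summarise(travel_chf = sum(total_budget_travel, na.rm = TRUE), .groups = "drop")

print(travel_by_category)
```

A `left_join()` keeps every project from the metadata table, so projects without a matching budget row would appear with missing values instead of being dropped.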

5 Budget Analysis

5.1 Budget Distribution by Cost Category

How is the budget distributed across different cost categories? We can answer this question using the structured metadata in the ethord package.

The results in Table 1 show that the program budget was used largely for personnel costs, which account for 83.6% of the total across the cost categories summarized there. This substantial investment in human resources reflects the program’s emphasis on capacity building and expertise development. An important consideration for future program planning is whether funded staff were already employed at the institutions or hired specifically for these projects. Additionally, any follow-up program should consider sustainability mechanisms, as many staff hired for these projects may face funding gaps when projects conclude, potentially limiting the long-term impact of developed expertise and infrastructure.

# Calculate total budgets by category
budget_summary <- application_budget |>
  summarise(
    Personnel = sum(total_budget_personnel_total_direct, na.rm = TRUE),
    Travel = sum(total_budget_travel, na.rm = TRUE),
    Equipment = sum(total_budget_equipment, na.rm = TRUE),
    `Other Direct` = sum(total_budget_other_total_direct, na.rm = TRUE),
    Subcontracting = sum(total_budget_subcontracting, na.rm = TRUE)
  ) |>
  pivot_longer(cols = everything(),
               names_to = "cost_category",
               values_to = "total_chf") |>
  filter(total_chf > 0) |>
  mutate(percent = total_chf / sum(total_chf) * 100) |>
  arrange(desc(total_chf))

# Display as formatted table
budget_summary |>
  gt() |>
  tab_header(
    title = "ORD Program Budget Distribution",
    subtitle = str_glue("Cost breakdown across {nrow(application_metadata)} funded projects")
  ) |>
  cols_label(
    cost_category = "Cost Category",
    total_chf = "Total (CHF)",
    percent = "Percentage"
  ) |>
  fmt_number(
    columns = total_chf,
    decimals = 0,
    use_seps = TRUE
  ) |>
  fmt_percent(
    columns = percent,
    decimals = 1,
    scale_values = FALSE
  ) |>
  tab_style(
    style = list(
      cell_text(weight = "bold")
    ),
    locations = cells_column_labels()
  ) |>
  tab_options(
    table.font.size = px(14),
    heading.title.font.size = px(18),
    heading.subtitle.font.size = px(14)
  )
Table 1: ORD Program Budget Distribution by Cost Category
ORD Program Budget Distribution
Cost breakdown across 76 funded projects
Cost Category     Total (CHF)   Percentage
Personnel           6,701,489        83.6%
Other Direct          689,569         8.6%
Equipment             302,240         3.8%
Travel                181,770         2.3%
Subcontracting        143,000         1.8%

Data from: Massari, Schöbitz, and Tilley (2025)
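The percentages in Table 1 can be recomputed from the printed totals alone, which is a quick check on the arithmetic. The sketch below copies the CHF amounts from the table and does not load the package. The five categories sum to CHF 8,018,068.

```r
# CHF totals copied from Table 1
budget <- c(
  Personnel = 6701489,
  `Other Direct` = 689569,
  Equipment = 302240,
  Travel = 181770,
  Subcontracting = 143000
)

# Each category as a share of the summed categories, rounded as in the table
total <- sum(budget)
share <- round(budget / total * 100, 1)

print(total)
print(share)
```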

5.2 Budget Treemap Visualization

A treemap provides an alternative view of the budget distribution, with area proportional to spending.

application_budget_long <- application_budget |>
  select(project_id, !ends_with("total_direct"), -phase, -total_budget_total_costs) |>
  pivot_longer(cols = !project_id, names_to = "budget_item", values_to = "amount") |>
  filter(!is.na(amount), amount > 0) |>
  mutate(budget_item = str_remove(budget_item, "total_budget_")) |>
  mutate(
    main_category = case_when(
      str_starts(budget_item, "personnel_") ~ "Personnel",
      str_starts(budget_item, "other_") ~ "Other Costs",
      budget_item == "travel" ~ "Travel",
      budget_item == "equipment" ~ "Equipment",
      budget_item == "subcontracting" ~ "Subcontracting",
      TRUE ~ "Other Costs"
    ),
    sub_category = str_remove(budget_item, "^(personnel|other)_") |>
      str_replace_all("_", " ") |>
      str_to_title()
  )

library(treemap)

# Prepare data for treemap
treemap_data <- application_budget_long |>
  group_by(main_category, sub_category) |>
  summarise(total = sum(amount), .groups = "drop") |>
  mutate(
    label = str_glue("{sub_category}\nCHF {scales::comma(total, accuracy = 1)}")
  )

# Create treemap with annotations
treemap(
  treemap_data,
  index = c("main_category", "sub_category"),
  vSize = "total",
  vColor = "main_category",
  type = "categorical",
  palette = "Set2",
  title = "Budget Distribution Treemap",
  fontsize.labels = c(14, 10),
  fontcolor.labels = c("white", "black"),
  fontface.labels = c(2, 1),
  bg.labels = c("transparent"),
  align.labels = list(c("center", "center"), c("center", "center")),
  overlap.labels = 0.5,
  border.col = c("white", "gray90"),
  border.lwds = c(4, 2)
)
Figure 2: Budget Distribution Treemap

Data from: Massari, Schöbitz, and Tilley (2025)

5.3 Detailed Budget Summary Table

Table 2 provides a detailed breakdown of the budget allocation, with particular focus on the Personnel category. Within the personnel budget of approximately CHF 7.4 million, the distribution reveals strategic investment across different career stages: Senior Staff accounts for 40.4% (CHF 3.0 million across 25 projects), Postdocs represent 30.9% (CHF 2.3 million across 29 projects), while the “Other” category (likely encompassing technical staff, research assistants, and other support roles) comprises 23.8% (CHF 1.8 million across 43 projects). Student positions received the smallest allocation at 4.8% (CHF 356,000 across 23 projects). This distribution pattern suggests that projects prioritized experienced researchers and professional staff.

# Create detailed budget table with grouping and totals
budget_table_data <- application_budget_long |>
  group_by(main_category, sub_category) |>
  summarise(total_chf = sum(amount), .groups = "drop") |>
  group_by(main_category) |>
  mutate(
    category_total = sum(total_chf),
    percent_of_category = total_chf / category_total * 100
  ) |>
  ungroup() |>
  arrange(desc(category_total), desc(total_chf))

budget_table_data |>
  gt(groupname_col = "main_category") |>
  tab_header(
    title = "Detailed Budget Breakdown",
    subtitle = "Data of 76/95 projects"
  ) |>
  cols_label(
    sub_category = "Subcategory",
    total_chf = "Amount (CHF)",
    percent_of_category = "% of Category"
  ) |>
  # Add summary rows for each group
  summary_rows(
    groups = everything(),
    columns = total_chf,
    fns = list(
      "Category Total" = ~sum(., na.rm = TRUE)
    ),
    fmt = ~fmt_number(., decimals = 0, use_seps = TRUE)
  ) |>
  # Add grand total row
  grand_summary_rows(
    columns = total_chf,
    fns = list(
      "Grand Total" = ~sum(., na.rm = TRUE)
    ),
    fmt = ~fmt_number(., decimals = 0, use_seps = TRUE)
  ) |>
  fmt_number(
    columns = total_chf,
    decimals = 0,
    use_seps = TRUE
  ) |>
  fmt_percent(
    columns = percent_of_category,
    decimals = 1,
    scale_values = FALSE
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_row_groups()
  ) |>
  tab_style(
    style = cell_fill(color = "gray95"),
    locations = cells_body(columns = everything(), rows = percent_of_category > 50)
  ) |>
  tab_style(
    style = list(
      cell_fill(color = "lightblue"),
      cell_text(weight = "bold")
    ),
    locations = cells_summary()
  ) |>
  tab_style(
    style = list(
      cell_fill(color = "steelblue"),
      cell_text(weight = "bold", color = "white")
    ),
    locations = cells_grand_summary()
  ) |>
  tab_options(
    table.font.size = px(11),
    heading.title.font.size = px(16),
    heading.subtitle.font.size = px(12),
    row_group.background.color = "lightblue",
    row_group.font.weight = "bold",
    summary_row.background.color = "lightblue"
  ) |>
  cols_hide(columns = category_total)
Table 2: Detailed Budget Breakdown by Category and Subcategory
Detailed Budget Breakdown
Data of 76/95 projects
Subcategory                 Amount (CHF)   % of Category
Personnel
  Senior Staff                 2,979,824           40.4%
  Postdocs                     2,280,533           30.9%
  Other                        1,752,967           23.8%
  Students                       355,750            4.8%
  Category Total               7,369,074
Equipment
  Equipment                      302,240          100.0%
  Category Total                 302,240
Other Costs
  Conferences Workshops          165,455           73.3%
  Publication Fees                35,250           15.6%
  Other                           24,927           11.0%
  Category Total                 225,632
Travel
  Travel                         181,770          100.0%
  Category Total                 181,770
Subcontracting
  Subcontracting                 143,000          100.0%
  Category Total                 143,000
Grand Total                    8,221,716

Data from: Massari, Schöbitz, and Tilley (2025)
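The “% of Category” column in Table 2 is a within-group share: each amount is divided by the total of its own category. The base R sketch below reproduces this for the two categories that have more than one subcategory, using amounts copied from the table. It illustrates the calculation and is not the code that produced the table.

```r
# Amounts copied from Table 2 for the two multi-row categories
detail <- data.frame(
  main_category = c(rep("Personnel", 4), rep("Other Costs", 3)),
  sub_category = c("Senior Staff", "Postdocs", "Other", "Students",
                   "Conferences Workshops", "Publication Fees", "Other"),
  total_chf = c(2979824, 2280533, 1752967, 355750, 165455, 35250, 24927)
)

# ave() returns the group total repeated on every row of that group
detail$category_total <- ave(detail$total_chf, detail$main_category, FUN = sum)

# Share of each subcategory within its own category
detail$percent_of_category <- round(detail$total_chf / detail$category_total * 100, 1)

print(detail)
```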

6 The ethord Package Website

The ethord package has a comprehensive website with documentation, vignettes, and examples.

Visit the package website: global-health-engineering.github.io/ethord/

7 Future Possibilities

7.1 ETHZ FAIR Coalition Open Call

Recommendations for future program iterations:

  • Adopt “Open by Default” policies: Establish project metadata as openly accessible by default, with clear exceptions only for privacy or security concerns, aligning with Swiss federal open data requirements
  • Implement structured application forms: Transition from DOCX/PDF-based proposals to structured digital forms that capture metadata in machine-readable format from the application phase, reducing manual extraction effort and improving data quality
  • Standardize reporting mechanisms: Deploy structured forms for project reporting to ensure consistent, comparable data collection across all funded projects and enable real-time tracking of outputs and outcomes
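To illustrate what a structured application form makes possible, the sketch below validates a single application record at submission time. The required fields and the validation rules are hypothetical and serve only as an example. The category names are those used in this report.

```r
# Hypothetical required fields for a structured application record; the names
# mirror columns used in this report but the schema itself is an assumption.
required_fields <- c("project_id", "project_category", "main_applicant_institution")
allowed_categories <- c("Contribute", "Explore", "Establish")

# Return a character vector of problems; an empty vector means the record passes
validate_application <- function(record) {
  missing <- setdiff(required_fields, names(record))
  problems <- character(0)
  if (length(missing) > 0) {
    problems <- c(problems, paste("missing field:", missing))
  }
  if (!is.null(record$project_category) &&
      !record$project_category %in% allowed_categories) {
    problems <- c(problems, paste("unknown category:", record$project_category))
  }
  problems
}

# One complete record and one with a missing field and an unknown category
ok <- list(project_id = "P001", project_category = "Explore",
           main_applicant_institution = "Example Institution")
bad <- list(project_id = "P002", project_category = "Pilot")

print(validate_application(ok))
print(validate_application(bad))
```

Checks of this kind are difficult to apply to DOCX or PDF proposals, which is one reason structured forms could reduce manual extraction effort.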

7.2 Making Administrative Data “Open by Default”

How could we make such administrative data “open by default” in the future?

7.3 Swiss Open by Default Policy Framework

Switzerland has established a comprehensive framework for open data publication:

Federal Foundation: The Open Government Data Strategy 2019-2023 (Swiss Federal Council 2019) established the “open by default” principle for all federal agencies.

Legal Mandate: Federal Act EMBAG Article 10 (Swiss Federal Assembly 2024) legally requires open data publication unless restricted by privacy or security.

Implementation: OGD Masterplan 2024-2027 (Federal Statistical Office 2024a) operationalizes through:

Switzerland is unique in having a legally mandated open by default principle, not just policy recommendations. This creates accountability and consistency across agencies. The FAIR alignment shows how government OGD principles directly apply to research data contexts.

8 Conclusion

The ethord package demonstrates how structured, open metadata can enhance transparency and enable deeper analysis of research funding programs. By applying FAIR principles to administrative data, we create opportunities for better understanding program outcomes and researcher engagement with open research data practices.

8.1 Data Availability

All data and analyses presented in this report are available through the ethord R package (Massari, Schöbitz, and Tilley 2025):

9 References

Federal Statistical Office. 2023. “DCAT Application Profile for Data Portals in Switzerland (DCAT-AP CH).” https://www.dcat-ap.ch/.
———. 2024a. “Open Government Data Masterplan 2024-2027.” https://www.bfs.admin.ch/bfs/en/home/services/ogd.html.
———. 2024b. “Opendata.swiss: Swiss Open Government Data Portal.” https://opendata.swiss/.
Massari, Nicolo, Lars Schöbitz, and Elizabeth Tilley. 2025. “Ethord: ETH Board Open Research Data (ORD) Program Project Metadata and Report Data.” https://doi.org/10.5281/zenodo.16563064.
Swiss Federal Assembly. 2024. Federal Act on the Use of Electronic Means for the Fulfilment of Government Tasks (EMBAG). https://www.fedlex.admin.ch/eli/cc/2023/682/en.
Swiss Federal Council. 2019. “Open Government Data Strategy Switzerland 2019-2023.” https://www.admin.ch/gov/en/start/documentation/media-releases.msg-id-74641.html.
Wilkinson, Mark D., et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data. https://doi.org/10.1038/sdata.2016.18.