Research Data Management Infrastructure I

Beginning in 2010, the authors of the CARL application to the Canada Foundation for Innovation (see Community Actions to Preserve Research Data in Canada) used the term Research Data Management Infrastructure (RDMI) to identify the confluence of e-Science, Cyberinfrastructure, and data-intensive science (see From National Institution to National Infrastructure).  We like to believe that we coined this term, although in the U.K. a JISC funding envelope used identical terminology at approximately the same time.  The JISC program description mentions several drivers that shaped the purpose of this funding envelope, many of which are just as relevant in Canada as in the U.K.

Higher education institutions are under increasing pressure to provide services and infrastructure for research data management. These pressures come from a variety of sources: the opportunities of more data intensive and more open, collaborative research; the requirements of research funders; the increasing concern for research transparency and integrity; institutions' concern to avoid [reputational] damage caused by poor responses to FoI requests or by data loss.  JISC website

The definition I use for RDMI builds on the context described in the JISC description:

RDMI is the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.

This discussion will focus specifically on the context served by RDMI, namely, data activities across the research lifecycle.

The Research Lifecycle

The research process is made up of a large set of activities that tend to be grouped into a series of fairly discrete stages.  Each stage typically consists of a set of related activities to accomplish a primary task, the outcomes of which are then passed to the next stage.  For example, a survey’s design stage will result in the selection of a sample and an instrument for collecting data.  The completion of these activities flows to a data collection stage where interviews are conducted and information is gathered from the sample.  While not all stages are necessarily linear, many of them do have logical dependencies that require sequential ordering.  For example, a research proposal is typically prepared before a grant application is submitted.

As with any project management operation, the granularity at which activities are described presents different views of a project. Similarly, the stages in the research lifecycle can be aggregated or disaggregated into larger or smaller groupings. Nevertheless, there is a level at which a primary task will be accomplished and its outcomes passed to another stage.  In a survey, for example, there is a point where data processing is completed and a data product is passed along for analysis.

Research workflow of a typical scholar showing the nonlinear development of research projects and the multiple stages at which data are collected

The Jahnke and Asher diagram of the workflow of a typical researcher is intended to show the nonlinear nature of the research process.  I feel that the more important message depicted in this workflow is the connection to data throughout the various stages.  Many of the activities in the research lifecycle indirectly or directly involve aspects of data management.  The above diagram shows examples of data-related tasks in the feasibility research, project design, and active research stages.

The Research Lifecycle

There is a second lifecycle that is closely interrelated with the research lifecycle.  This is the data lifecycle, which overlaps stages in the research lifecycle but also consists of some important stages independent of project-based research, including stages dealing with data dissemination, preservation, discovery, and repurposing.  While the Humphrey diagram does not identify stages specific to the data lifecycle, it does consist of more stages outside the project level shown in the Jahnke and Asher diagram, including references to knowledge transfer and repositories for data and research outputs.

Research Data Management

Research data management comprises the practices and activities across the research lifecycle that provide operational support for data through design, production, processing, documentation, analysis, preservation, discovery and reuse.  Collectively, these data-related activities span the stages of project-based research as well as the extended stages that tend to be institutionally based.  The activities are about the “what” and “how” of research data.

RDMI is the configuration of staff, services, and tools assembled to support data management across the research lifecycle and more specifically to provide comprehensive coverage of the stages making up the data lifecycle.

Data Stewardship

In contrast to Data Management, Data Stewardship is about the identity of those responsible for ensuring data management activities are performed to best practice levels and standards across the lifecycle.  Stewardship addresses “who” is responsible for a specific data activity (I’d like to acknowledge Wendy Watkins’ contribution in making this distinction between responsibility and activity).  Data policies, institutional norms, granting council requirements, and domain practices all contribute to defining the roles of those who are responsible for data at the various lifecycle stages.  Ideally, a comprehensive plan at the beginning of a research project would identify the supporting parties across the data lifecycle.  If a data management plan fails to identify who is responsible for specific data-related activities, the risk that not all activities will be completed is heightened.  Data Management Plans should be broadened to become Data Management and Stewardship plans.
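As a minimal sketch of what a combined Data Management and Stewardship plan might record, the example below maps each data activity to a named steward and reports the activities that nobody has been assigned to.  The activity list is taken from the description of research data management above; the class name, the method names, and the steward labels ("researcher", "institutional repository") are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field

# Data activities named in the description of research data management above.
ACTIVITIES = [
    "design", "production", "processing", "documentation",
    "analysis", "preservation", "discovery", "reuse",
]


@dataclass
class StewardshipPlan:
    """Hypothetical plan that maps each data activity to a responsible party."""
    stewards: dict[str, str] = field(default_factory=dict)

    def assign(self, activity: str, steward: str) -> None:
        """Record who is responsible for one activity."""
        if activity not in ACTIVITIES:
            raise ValueError(f"unknown activity: {activity}")
        self.stewards[activity] = steward

    def unassigned(self) -> list[str]:
        """Return the activities with no named steward, in lifecycle order."""
        return [a for a in ACTIVITIES if a not in self.stewards]


# Illustrative assignments only: the project-level stages go to the researcher,
# and preservation goes to an institutional service.
plan = StewardshipPlan()
for activity in ["design", "production", "processing", "documentation", "analysis"]:
    plan.assign(activity, "researcher")
plan.assign("preservation", "institutional repository")

print(plan.unassigned())  # ['discovery', 'reuse']
```

In this illustration, discovery and reuse are left without a steward, which is exactly the kind of gap a plan that names responsibilities would expose at the start of a project.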

The design of RDMI needs to enable data stewards across the data lifecycle to fulfill their responsibilities.

Project-level and Institutional-level Stewardship

A clarification needs to be made about the parties responsible for the various stages of the data lifecycle.  We are currently in a period during which data stewardship roles are under scrutiny.  Clearly, there are stages for which the researcher is the data steward.  The model for conducting research has traditionally been at the project level.  In this context, the researcher is responsible for both defining and conducting the work.  They are also often responsible for securing the funds to do the research.

However, as noted in the JISC quote above, institutions are increasingly discovering a need to take on new responsibilities for research data management, which often entails providing services and infrastructure.  University administrators are much more aware of the value of data to their institution than they were in the past.  Both operational and research data are now being treated as digital assets that need policies, practices, services, and infrastructure to secure their future.  One consequence is the willingness of some institutions to support stages in the data lifecycle that previously had fallen between the cracks.  Some of these new responsibilities for data require additional investments in services and infrastructure, while others will involve the redeployment of staff or reconfiguration of services to fulfill newly accepted data responsibilities.

Research Data Management Interventions

The Jeffreys diagram shows stages in the institutional model from a JISC-funded project to develop research data management infrastructure at the University of Oxford.  While this graphic was not necessarily intended to depict the shared responsibilities between project-level research and the institution, one can see the interplay between the two.  The left-hand stages of Project Planning, Project Setup, Data Creation, Documentation, and aspects of Local Storage are largely the researcher’s responsibilities, while the institution assumes responsibility for aspects of Local Storage, Institutional Storage, Rediscovery Mechanism, and Retrieval Mechanism.  Oxford has chosen to provide a mix of services across all of these stages even though the researcher is the primary data steward in half of the stages.  The infrastructure in these stages is intended to help researchers accomplish their data management tasks without yielding their control over them.

Working with researchers on their campus, university senior administrators have an important leadership role in developing ground-level RDMI.

Institutional Collaboration

The most innovative nations in the future will be those that best manage their research data today.  This is a meaningful incentive for institutions in Canada to collaborate in the development of RDMI, and all the more important in the absence of top-down national support.  No single institution can on its own manage the problem posed by research data.  But collectively, institutions working together can build the shared infrastructure needed by the research community.

Several successful models of collaboration across institutions attest to the viability of building national RDMI through a shared approach.  The Canadian Polar Data Network is one example of a cross-sector collaboration between the higher education and federal government sectors that provided data curation and preservation services for Canadian-funded research in the recent International Polar Year.  Collectively, this network of institutions is able to provide a greater service than any one could offer individually.

Institutional engagement in data stewardship becomes an important step in developing bottom-up national RDMI.

The next essay addresses the three pillars making up research infrastructure in Canada and compares RDMI with the support for high speed research networks and high performance computing.

[The opinions expressed in this Blog are my own and do not necessarily represent those of my institution.]