Research Data Management Infrastructure II

In the previous entry, Research Data Management Infrastructure (RDMI) was defined as the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.  The context for RDMI within the research lifecycle was described and the importance of institutional-level engagement in data stewardship was emphasized.  Finally, the position was taken that cross-institutional collaboration would allow institutions to build collectively the national RDMI that has eluded Canada in the absence of top-down design or resources.  How does this context compare with the two other pillars of Canada’s research infrastructure?

Research Infrastructure: The Three Pillars

The Canadian University Council of Chief Information Officers (CUCCIO) hosted the Digital Infrastructure Summit in June 2012 in Saskatoon to address the unclear future of research infrastructure in Canada today.  Concerns have been expressed about the lack of a vision for research infrastructure in Canada and the need for more coordinated planning.  For example, the current business models for CANARIE, the coordinating agency for Canada’s high-speed optical research network, and for Compute Canada, the organization for high performance computing, operate on funding cycles that are less than optimal and on brinksmanship review processes that seem to threaten the very existence of this critical infrastructure.  Borrowing from the National Data Summit format, the CUCCIO Summit invited around sixty leaders in research infrastructure to discuss how best to approach these concerns.  Coming out of this forum was the establishment of a Leadership Council with a mission to articulate a vision for research infrastructure and to organize a follow-up summit.

Canada's Research Infrastructure Pillars

While Canada does not have a formally recognized national organization for RDMI (Research Data Canada and CARL are working to fill part of this void), CUCCIO recognizes data infrastructure as one of three pillars constituting Canada’s research infrastructure, along with a high speed research network and high performance computing. There are some important differences between the formal support for these latter two infrastructure pillars and RDMI.  First, different forces drive these three infrastructure pillars.

  1. CANARIE provides top-down coordination and incentives, working with a group of Optical Regional Advanced Networks (ORANs) across the country.   The ORANs keep the operational delivery of the high speed network close to the researchers in their areas, while CANARIE works to weave the regional communication networks into a national research service.
  2. High Performance Computing (HPC) in Canada has a similar organizational structure of regional services (WestGrid, Compute Ontario, Calcul Quebec, Compute Atlantic) with national governance provided through Compute Canada, although the regional services tend to operate with a tradition of independence.  Nevertheless, HPC has received top-down incentives, including financial support through the Canada Foundation for Innovation.
  3. As already stated, RDMI does not have a formal national organization to represent its interests, although there are national coordinating roles for both Research Data Canada and CARL to play in data curation and infrastructure within their communities.  Unfortunately, no regional organizations for data infrastructure exist.

While RDMI has been embraced as an equal infrastructure partner by leaders in CANARIE and Compute Canada, the playing field is clearly unequal at this stage.  The good news is that Research Data Canada and CARL continue to be invited to participate in events organized by the other two infrastructure partners.

Second, the voice for RDMI is often ad hoc and diluted.  CANARIE and Compute Canada serve as single points of contact for their infrastructure.  Typically, individual researchers are called to speak on behalf of data infrastructure, even though they may represent only a narrow perspective on data management infrastructure.  A consequence is that the voice for research data often becomes haphazard.  The risks are that a data advocate may not be present at an important research infrastructure event or that the message is too narrow for today’s range of research data issues.

Third, RDMI is dependent on bottom-up initiatives, requiring a great deal of coordination and cooperation to be successful.  The organization of top-down initiatives typically depends on control and governance.  With bottom-up projects, the most important organizational factors are trust, collaboration, and cooperation.  These two different organizational structures also tend to result in different styles of internal politics.

Finally, the international peers for each of Canada’s infrastructure pillars are different.  Both CANARIE and Compute Canada see their counterpart organizations in the United States, Australia, the United Kingdom, and the rest of Europe as their peers.  The models and practices for funding and planning are also similar among these peers.  Look at what is happening to RDMI within this same group of countries: the National Science Foundation in the U.S. provides grants for data curation projects through its DataNet program; the European Union supported the Global Research Data Infrastructures 2020 project to help chart the course for developing a global data ecosystem; Australia established the Australian National Data Service to support researchers with their data curation needs; and in the U.K., JISC offers its Managing Research Data program, which funds projects in RDMI.  These examples are all top-down driven and involve incentive programs for data infrastructure.  At this stage, the development of RDMI in Canada has very little in common with CANARIE and Compute Canada’s international peers.  A subsequent Blog entry will address who the international peers currently are for Canada’s RDMI.

The next entry discusses RDMI components of technology, services and expertise and how they are organized locally or globally.

[The views expressed in this Blog are my own and do not necessarily represent those of my institution.]

Research Data Management Infrastructure I

Beginning in 2010, the authors of the CARL application to the Canada Foundation for Innovation (see Community Actions to Preserve Research Data in Canada) used the term, Research Data Management Infrastructure (RDMI), to identify the confluence of e-Science, Cyberinfrastructure, and data-intensive science (see From National Institution to National Infrastructure).  We like to believe that we coined this concept, although in the U.K. a JISC funding envelope used the identical terminology at approximately the same time.  The JISC program description mentions several drivers that shaped the purpose of this specific funding envelope, many of which are just as relevant in Canada as in the U.K.

Higher education institutions are under increasing pressure to provide services and infrastructure for research data management. These pressures come from a variety of sources: the opportunities of more data intensive and more open, collaborative research; the requirements of research funders; the increasing concern for research transparency and integrity; institutions concern to avoid [reputational] damage caused by poor responses to FoI requests or by data loss.  JISC website

The definition I use for RDMI builds on the context described in the JISC description:

RDMI is the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.

This discussion will focus specifically on the context served by RDMI, namely, data activities across the research lifecycle.

The Research Lifecycle

The research process is made up of a large set of activities that tend to be grouped into a series of fairly discrete stages.  Each stage typically consists of a set of related activities to accomplish a primary task, the outcomes of which are then passed to the next stage. For example, a survey’s design stage will result in the selection of a sample and an instrument for collecting data. The completion of these activities flows to a data collection stage where interviews are conducted and information is gathered from the sample.  While not all stages are necessarily linear, many of them do have logical dependencies that require sequential ordering.  For example, a research proposal is typically prepared before a grant application is submitted.

As with any project management operation, the granularity at which activities are described presents different views of a project. Similarly, the stages in the research lifecycle can be aggregated or disaggregated into larger or smaller groupings. Nevertheless, there is a level at which a primary task will be accomplished and its outcomes passed to another stage.  In a survey, for example, there is a point where data processing is completed and a data product is passed along for analysis.

Research workflow of a typical scholar showing the nonlinear development of research projects and the multiple stages at which data are collected

The Jahnke and Asher diagram of the workflow of a typical researcher is intended to show the nonlinear nature of the research process.  I feel that the more important message depicted in this workflow is the connection to data throughout the various stages.  Many of the activities in the research lifecycle indirectly or directly involve aspects of data management.  The above diagram shows examples of data-related tasks in the feasibility research, project design, and active research stages.

The Research Lifecycle

There is a second lifecycle that is closely interrelated with the research lifecycle.  This is the data lifecycle, which overlaps stages in the research lifecycle but also consists of some important stages independent of project-based research, including stages dealing with data dissemination, preservation, discovery, and repurposing.  While the Humphrey diagram does not identify stages specific to the data lifecycle, it does consist of more stages outside the project level shown in the Jahnke and Asher diagram, including references to knowledge transfer and repositories for data and research outputs.

Research Data Management

Research data management comprises the practices and activities across the research lifecycle that provide operational support for data through design, production, processing, documentation, analysis, preservation, discovery and reuse.  Collectively, these data-related activities span the stages of project-based research as well as the extended stages that tend to be institutionally based.  The activities are about the “what” and “how” of research data.

RDMI is the configuration of staff, services, and tools assembled to support data management across the research lifecycle and more specifically to provide comprehensive coverage of the stages making up the data lifecycle.

Data Stewardship

In contrast to Data Management, Data Stewardship is about the identity of those responsible for ensuring data management activities are performed to best practice levels and standards across the lifecycle.  Stewardship addresses “who” is responsible for a specific data activity (I’d like to acknowledge Wendy Watkins’ contribution in making this distinction between responsibility and activity).  Data policies, institutional norms, granting council requirements, and domain practices all contribute to defining the roles of those who are responsible for data at the various lifecycle stages.  Ideally, a comprehensive plan at the beginning of a research project would identify the supporting parties across the data lifecycle.  If a data management plan fails to identify who is responsible for specific data-related activities, the risk that not all activities will be completed is heightened.  Data Management Plans should be broadened to become Data Management and Stewardship plans.

The design of RDMI needs to enable data stewards across the data lifecycle to fulfill their responsibilities.

Project-level and Institutional-level Stewardship

A clarification needs to be made about the parties responsible for the various stages of the data lifecycle.  We are currently in a period during which data stewardship roles are under scrutiny.  Clearly, there are stages for which the researcher is the data steward.  The model for conducting research has traditionally been at the project level.  In this context, the researcher is responsible for both defining and conducting the work.  They are also often responsible for securing the funds to do the research.

However, as noted in the JISC quote above, institutions are increasingly discovering a need to take on new responsibilities dealing with research data management, which often entails providing services and infrastructure.  University administrators are much more aware of the value of data to their institution than they were in the past.  Both operational and research data are now being treated as digital assets that need policies, practices, services, and infrastructure to secure their future.  One consequence is the willingness of some institutions to support stages in the data lifecycle that previously had fallen between the cracks.  Some of these new responsibilities for data require additional investments in services and infrastructure, while others will involve the redeployment of staff or reconfiguration of services to fulfill newly accepted data responsibilities.

Research Data Management Interventions

The Jeffreys’ diagram shows stages in the institutional model from a JISC-funded project to develop research data management infrastructure at the University of Oxford.  While this graphic was not necessarily intended to depict the shared responsibilities between project-level research and the institution, one can see the interplay between both.  The left-hand stages of Project Planning, Project Setup, Data Creation, Documentation, and aspects of Local Storage are largely the researcher’s responsibilities, while the institution assumes responsibility for aspects of Local Storage, Institutional Storage, Rediscovery Mechanism, and Retrieval Mechanism.  Oxford has chosen to provide a mix of services across all of these stages even though the researcher is the primary data steward in half of the stages.  The infrastructure in these stages is meant to help researchers accomplish their data management tasks without yielding their control over them.

Working with researchers on their campus, university senior administrators have an important leadership role in developing ground-level RDMI.

Institutional Collaboration

The most innovative nations in the future will be those that best manage their research data today.  This is a meaningful incentive for institutions in Canada to collaborate in the development of RDMI, and all the more important in the absence of top-down national support.  No single institution can on its own manage the problem posed by research data.  But collectively, institutions working together can build the shared infrastructure needed by the research community.

Several successful models of collaboration across institutions attest to the viability of building national RDMI through a shared approach.  The Canadian Polar Data Network is one example of a cross-sector collaboration between the higher education and federal government sectors that provided data curation and preservation services for Canadian-funded research in the recent International Polar Year.  Collectively, this network of institutions is able to provide a greater service than any one could offer individually.

Institutional engagement in data stewardship becomes an important step in developing bottom-up national RDMI.

The next essay addresses the three pillars making up research infrastructure in Canada and compares RDMI with the support for high speed research networks and high performance computing.

[The opinions expressed in this Blog are my own and do not necessarily represent those of my institution.]