Research Data Management Infrastructure I

Beginning in 2010, the authors of the CARL application to the Canada Foundation for Innovation (see Community Actions to Preserve Research Data in Canada) used the term, Research Data Management Infrastructure (RDMI), to identify the confluence of e-Science, Cyberinfrastructure, and data-intensive science (see From National Institution to National Infrastructure).  We like to believe that we coined this concept, although in the U.K. a JISC funding envelope used the identical terminology at approximately the same time.  The JISC program description mentions several drivers that shaped the purpose of this specific funding envelope, many of which are just as relevant in Canada as in the U.K.

Higher education institutions are under increasing pressure to provide services and infrastructure for research data management. These pressures come from a variety of sources: the opportunities of more data intensive and more open, collaborative research; the requirements of research funders; the increasing concern for research transparency and integrity; institutions concern to avoid [reputational] damage caused by poor responses to FoI requests or by data loss.  JISC website

The definition I use for RDMI builds on the context described in the JISC description:

RDMI is the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.

This discussion will focus specifically on the context served by RDMI, namely, data activities across the research lifecycle.

The Research Lifecycle

The research process is made up of a large set of activities that tend to be grouped into a series of fairly discrete stages.  Each stage typically consists of a set of related activities to accomplish a primary task, the outcomes of which are then passed to the next stage. For example, a survey’s design stage will result in the selection of a sample and an instrument for collecting data. The completion of these activities flows to a data collection stage where interviews are conducted and information is gathered from the sample.  While not all stages are necessarily linear, many of them do have logical dependencies that require sequential ordering.  For example, a research proposal is typically prepared before a grant application is submitted.

As with any project management operation, the granularity at which activities are described presents different views of a project. Similarly, the stages in the research lifecycle can be aggregated or disaggregated into larger or smaller groupings. Nevertheless, there is a level at which a primary task will be accomplished and its outcomes passed to another stage.  In a survey, for example, there is a point where data processing is completed and a data product is passed along for analysis.

Research workflow of a typical scholar showing the nonlinear development of research projects and the multiple stages at which data are collected

The Jahnke and Asher diagram of the workflow of a typical researcher is intended to show the nonlinear nature of the research process.  I feel that the more important message depicted in this workflow is the connection to data throughout the various stages.  Many of the activities in the research lifecycle indirectly or directly involve aspects of data management.  The above diagram shows examples of data-related tasks in the feasibility research, project design, and active research stages.

The Research Lifecycle

There is a second lifecycle that is closely interrelated with the research lifecycle.  This is the data lifecycle, which overlaps stages in the research lifecycle but also consists of some important stages independent of project-based research, including stages dealing with data dissemination, preservation, discovery, and repurposing.  While the Humphrey diagram does not identify stages specific to the data lifecycle, it does consist of more stages outside the project level shown in the Jahnke and Asher diagram, including references to knowledge transfer and repositories for data and research outputs.

Research Data Management

Research data management comprises the practices and activities across the research lifecycle that provide operational support for data through design, production, processing, documentation, analysis, preservation, discovery and reuse.  Collectively, these data-related activities span the stages of project-based research as well as the extended stages that tend to be institutionally based.  The activities are about the “what” and “how” of research data.

RDMI is the configuration of staff, services, and tools assembled to support data management across the research lifecycle and more specifically to provide comprehensive coverage of the stages making up the data lifecycle.

Data Stewardship

In contrast to Data Management, Data Stewardship is about the identity of those responsible for ensuring data management activities are performed to best practice levels and standards across the lifecycle.  Stewardship addresses “who” is responsible for a specific data activity (I’d like to acknowledge Wendy Watkins’ contribution in making this distinction between responsibility and activity).  Data policies, institutional norms, granting council requirements, and domain practices all contribute to defining the roles of those who are responsible for data at the various lifecycle stages.  Ideally, a comprehensive plan at the beginning of a research project would identify the supporting parties across the data lifecycle.  If a data management plan fails to identify who is responsible for specific data-related activities, the risk that not all activities will be completed is heightened.  Data Management Plans should be broadened to become Data Management and Stewardship plans.

The design of RDMI needs to enable data stewards across the data lifecycle to fulfill their responsibilities.

Project-level and Institutional-level Stewardship

A clarification needs to be made about the parties responsible for the various stages of the data lifecycle.  We are currently in a period during which data stewardship roles are under scrutiny.  Clearly, there are stages for which the researcher is the data steward.  The model for conducting research has traditionally been at the project level.  In this context, the researcher is responsible for both defining and conducting the work.  They are also often responsible for securing the funds to do the research.

However, as noted in the JISC quote above, institutions are increasingly discovering a need to take on new responsibilities dealing with research data management, which often entails providing services and infrastructure.  University administrators are much more aware of the value of data to their institution than they were in the past.  Both operational and research data are now being treated as digital assets that need policies, practices, services, and infrastructure to secure their future.  One consequence is the willingness of some institutions to support stages in the data lifecycle that previously had fallen between the cracks.  Some of these new responsibilities for data require additional investments in services and infrastructure, while others will involve the redeployment of staff or reconfiguration of services to fulfill newly accepted data responsibilities.

Research Data Management Interventions

The Jeffreys’ diagram shows stages in the institutional model from a JISC-funded project to develop research data management infrastructure at the University of Oxford.  While this graphic was not necessarily intended to depict the shared responsibilities between project-level research and the institution, one can see the interplay between both.  The left-hand stages of Project Planning, Project Setup, Data Creation, Documentation, and aspects of Local Storage largely are the researcher’s responsibilities, while the institution assumes responsibility for aspects of Local Storage, Institutional Storage, Rediscovery Mechanism, and Retrieval Mechanism.  Oxford has chosen to provide a mix of services across all of these stages even though the researcher is the primary data steward in half of the stages.  The infrastructure in these stages is to help researchers accomplish their data management tasks without yielding their control over them.

Working with researchers on their campus, university senior administrators have an important leadership role in developing ground-level RDMI.

Institutional Collaboration

The most innovative nations in the future will be those that best manage their research data today.  This is a meaningful incentive for institutions in Canada to collaborate in the development of RDMI, and all the more important in the absence of top-down national support.  No single institution can on its own manage the problem posed by research data.  But collectively, institutions working together can build the shared infrastructure needed by the research community.

Several successful models of collaboration across institutions attest to the viability of building national RDMI through a shared approach.  The Canadian Polar Data Network is one example of a cross-sector collaboration between the higher education and federal government sectors that provided data curation and preservation services for Canadian-funded research in the recent International Polar Year.  Collectively, this network of institutions is able to provide a greater service than any one could offer individually.

Institutional engagement in data stewardship becomes an important step in developing bottom-up national RDMI.

The next essay addresses the three pillars making up research infrastructure in Canada and compares RDMI with the support for high speed research networks and high performance computing.

[The opinions expressed in this Blog are my own and do not necessarily represent those of my institution.]

From National Institution to National Infrastructure

The idea of a distributed network providing data archive functions was presented as one of three models in the 2002 report of the National Data Archive Consultation (NDAC). This was a radical departure from the concept of a national institution supporting research data. After all, preservation requires the longevity of a trusted, enduring institution. Individuals and technology come and go but an institution is needed to span the centuries. In comparison, the notion of a series of nodes connected to a network configuration seemed very ephemeral. We all know that technology is anything but static. How could a national data archive be based simply on one of today’s technology platforms?  This perception, however, was a misunderstanding about how a distributed network for digital preservation could be organized.

At the time of the NDAC final report, digital preservation as a field was starting to come into its own, having only seriously taken root in the latter part of the 1990s.  Much of the initial focus within digital preservation was on individual institutions developing practices and building infrastructure to preserve local digital collections of texts and images.  With the development of computing platforms to support institutional repositories and with the popularization of open access publishing, activity in digital preservation accelerated.  While these developments tended to focus on single institutional initiatives, the underlying infrastructure was capable of supporting a nationally distributed research data preservation network consisting of institutions collaboratively committed to the longevity of the service.

This became the backdrop to a fundamental shift in the way national research data preservation services in Canada might be established.  The introductory essay to this Blog indicated that several studies over the years proposed building a new national institution for this purpose.  This was the dominant model until approximately 2006.  Until then, implementing a national data archive was seen primarily to depend on a champion to stir up the necessary political will to build the new institution.  In addition, this vision was very much a top-down approach to accomplishing this mission.

At the time that NCASRD was underway in Canada, e-Science had established itself in Europe, while equivalent activities in the United States were called Cyberinfrastructure.  Both e-Science and Cyberinfrastructure have their origins in national funding programs supporting computationally intensive infrastructure for the management and processing of very large datasets (now commonly known as “big data”).  Of course, this included high-speed optical research networks and high-performance computing (HPC).  Around 2007, Jim Gray broadened the understanding of e-Science through his work on data-intensive science, which he characterized as data capture, data curation, and data analysis (see The Fourth Paradigm, which was dedicated to his memory).  Data-intensive research quickly unveiled the need for data interoperability across scientific domains. In fact, data interoperability has become an integral part of e-Science and Cyberinfrastructure.  The net result of e-Science, Cyberinfrastructure, and data-intensive science has been an investment in and development of new computational services built around research data.

The CARL application to the Canada Foundation for Innovation called these new computational services, Research Data Management Infrastructure (RDMI). It represents the confluence of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.  Understanding infrastructure for research data from this perspective changes the focus from a dependence on top-down initiatives to the potential for bottom-up organization.  The CARL contribution also consisted of persistent institutions dedicated to digital preservation.  Several research libraries committed to the long-term, collaborative operation of digital preservation infrastructure could replace the model of a single, national data archive.  Instead of a national institution, there is now a viable alternative of national infrastructure to support the management and preservation of data.

The next essay goes more deeply into research data management infrastructure.


Community Actions to Preserve Research Data in Canada

It takes a research community to preserve its data.

Without leadership from a national institution for research data management and preservation (see this Blog’s Introduction), communities that have interests in research data in Canada have become essential in moving forward an agenda to build this needed infrastructure.  At this stage, the strategy for data in Canada has become reliant on community-level actions.  The diversity of domains, sectors, and jurisdictions with stakes in research data complicates efforts to mobilize a grassroots, bottom-up plan for action.  There have, however, been some recent community activities around data that are encouraging.

Library and Archives Canada (LAC) tapped into the cultural, heritage, and academic sectors to achieve community engagement in identifying basic principles and goals for a national digital information strategy.  The Canadian Association of Research Libraries (CARL) undertook the drafting of an application to the Canada Foundation for Innovation (CFI) for research data management infrastructure.  As background support for its application, the steering committee for this initiative consulted widely across the scholarly research community, bringing together data interests from diverse research domains.  The Research Data Strategy Working Group (RDSWG), composed of representatives from organizations and agencies concerned about research data in Canada, sought ways to implement recommendations from the deadlocked National Consultation on Access to Scientific Research Data (NCASRD).  The work of this group contributed to the successful 2011 National Data Summit that attracted the participation of over 160 senior officials with interests in research data across sectors.

The Canadian Digital Information Strategy

LAC undertook a community-based consultation beginning in 2005 to develop a national digital information strategy.  Working with over 200 organizations from the public, private, and academic sectors, LAC held a National Summit in December 2006 that brought together representatives from these stakeholders to identify key components of a digital strategy.  Early in 2007, the Strategic Development Committee (SDC) was struck to synthesize the output of the Summit and to provide substantive input into a draft strategy.  Three sub-groups (Science and Research; Cultural Heritage; and Government Information) were formed to tackle this Committee’s workload.  The contributions of the Science and Research sub-group made the draft version of the digital strategy a valuable statement for and relevant to research data.  The resulting draft digital information strategy was released in the fall of 2007, which launched a public review that was conducted until early 2008.  In March 2010, stakeholders who contributed to the consultation were sent a copy of the final report entitled Canadian Digital Information Strategy: Final Report of Consultations with Stakeholder Communities 2005–2008, bringing closure to the process.

Federal inter-departmental politics intervened between 2008 and 2010, undermining the important community involvement that went into forming this strategy.  Industry Canada laid claim to the digital economy and perceived the national digital information strategy as treading on its turf.  The outcome of this internal political struggle surfaced in the May 10, 2010 announcement of National Consultations on a Digital Economy Strategy, made jointly by the then Minister of Industry (Tony Clement), the Minister of Canadian Heritage and Official Languages (James Moore), and the Minister of Human Resources and Skills Development (Diane Finley).  This consultation was conducted online for only one month, drawing upon only a fraction of the community involvement in the digital information strategy.

While the political contest between a digital information strategy and a digital economy strategy undermined the eventual adoption of the Canadian Digital Information Strategy, the engagement of the community in shaping the LAC-developed strategy was very successful and demonstrated common ground among the diverse group of stakeholders that have interests in managing, providing access to, and preserving digital content.

A Proposal for a National Collaborative Research Data Infrastructure

The CARL Directors launched an initiative in June 2010 to prepare a proposal for a national collaborative research data infrastructure.  While the Canada Foundation for Innovation had yet to announce the program envelope for its next funding round, there was anticipation of a program that might support applications for a national platform.  The vision was to finance the development of a national network of research data services at contributing CARL member institutions, including ingest centres to work with researchers in receiving their data, staging repositories to assist researchers with the management of their data over the life of a project, and data repositories responsible for the long-term preservation of digital research data.  From the beginning of this initiative, the CARL Directors met with many stakeholders, seeking their endorsement for the proposal.  They achieved support from organizations representing other components of Canada’s research infrastructure: Canada’s high-speed optical research network (CANARIE), Canada’s high performance computing grid (Compute Canada), and the Canadian University Council of Chief Information Officers (CUCCIO).  They also held a meeting with researchers from several domains to identify their requirements of a national research data infrastructure.  Out of these discussions with fellow stakeholders, CARL built a network of supporters within the research community.

When CFI announced its funding program, it did not include a national platform competition.  This complicated the logistics of the CARL proposal.  The CFI program that was being run would require each university to make the CARL proposal a high priority among the other proposals on its campus.  In the end, not enough support could be garnered from campuses to compete in this funding round.  The politics of funding envelopes rather than inter-departmental turf prematurely ended this effort at building national research data services.  Nevertheless, the CARL Directors were successful in communicating their ideas and in building community support for this vision.

The Research Data Strategy Working Group’s National Data Summit

The phoenix rising out of the ashes of the National Data Archive consultation and NCASRD was the Research Data Strategy Working Group.  This informal group, without having financial backing, seeks to find ways of advancing the recommendations of the two earlier consultations.  With roots in a number of organizations and agencies for which research data are important, members in the RDSWG strive to keep one another informed about projects and opportunities that will push the research data agenda forward.  When the government signaled its support for open data in 2010, the RDSWG capitalized on this new direction by proposing to host a National Data Summit that would bring together senior officials to discuss the challenges around research data in Canada.  Funding doors opened as the RDSWG promoted the idea of such a Summit and by the spring of 2011 a program was put in place for September 2011.

The outcome of the National Data Summit was the widespread recognition that research data activities need to be coordinated in Canada.  The discussions in the Summit revealed many common issues across research domains and sectors, demonstrating the value of a forum for sharing and debating data issues.  The Summit participants recommended holding a similar event within eighteen months and endorsed formalizing a secretariat to support such a forum.  In the fall of 2012, the RDSWG reorganized itself into Research Data Canada and continues to develop its role as a national forum for data stewardship issues.

Lessons for the Research Data Community

Both the experience of the Canadian Digital Information Strategy, caught in the crosshairs of inter-departmental politics, and that of CARL’s withdrawn CFI application provide important lessons.  Neither the level of community engagement in defining strategic directions nor its endorsement of such a course was exempt from an inter-departmental power grab.  Some political battles are difficult to anticipate; others fall into a consistent pattern.  After all, the Federal Minister who buried the Canadian Digital Information Strategy also dealt the deathblow to the 2011 Census mandatory long form, which would have produced one of Canada’s highly valuable digital information assets.  One lesson from this experience is to avoid turf battles between federal departments, unless Treasury Board is on your side.

Similarly, one cannot assume that innovative ideas, even ones that could accelerate Canada to the forefront of research data infrastructure, will trump local interests.  A lesson from the CARL experience is that funding to develop nationally shared services will face stiff competition from local interests, even though the national services may benefit those locally.  This is one situation where strong community intervention may be able to persuade local interests that national gains outweigh any perceived local loss.

The response to the National Data Summit and Research Data Canada shows that Canada’s research community is willing and eager to engage in activities that may shape strategies and plans around data management and preservation.  This undercurrent of support needs to be nurtured and channeled to achieve a national collaborative research data infrastructure.

The next essay looks at the strategic shift from a national institution to national infrastructure for research data in Canada.
