Who Are Canada’s Research Data Peers?

In October 2007, Kevin Schurer, who was then the Director of the U.K. Data Archive, made a presentation at the ICPSR Official Representatives Meeting at the University of Michigan about establishing a data world wide web.  He used a graphic, “Kevin’s World Data View,” to illustrate the current status of social science data curation around the globe.  Each country has been crudely scaled according to the level of its social science data services.  He noted that the U.S. and U.K. are disproportionately larger in this projection than their actual physical size because of the large volume of social science data curated in these two countries.  He went on to say that Canada in his map is much smaller than its size because, “Canada can’t get its act together [regarding research data].”  While this was a rather dismaying statement to have proclaimed about my home country in an international meeting, grounds exist for him coming to such a conclusion (see the introductory Blog item for evidence).  This observation about Canada raises two important questions:

  1. Who are Canada’s international peers in research data?
  2. How far behind is Canada in research data management infrastructure?

Canada’s International Data Peers

The Introduction to this Blog touched upon this topic.  Canadians typically view their international research peers as the United States, United Kingdom, Australia, and Germany.  In many fields of research and in some areas of research infrastructure, this is the case.  For example, CANARIE is a world-class research network that is comparable with Europe’s research network, GÉANT.  Contributing to the validity of this comparison is the level of top-down impetus both receive through government policy, programs, and funding for these networks.

Research Data Management Infrastructure (RDMI) in Canada, however, does not compare with the developments in data infrastructure in these four countries.  As mentioned previously, bottom-up actions by higher education institutions willing to collaborate with one another around cost-sharing initiatives are the driving force for RDMI in Canada, which by comparison is a very different environment.

Who then are Canada’s data peers?  Looking at Schurer’s 2007 map, Canada appears to be grouped with the rest of the world outside the United States, Europe, and Australia.  I had an opportunity to observe firsthand a few of Canada’s peers at a European Commission-sponsored workshop on “Global Research Data Infrastructures: The Big Data Challenges,” held in Brussels in October 2011.  The objective of this workshop was to further the development of a 2020 roadmap for global research data infrastructure.  There were representatives from Africa, Asia, Australia, Canada, Europe, South America, and the United States, each asked to speak about data infrastructure in their country.  I was asked to talk about data infrastructure in Canada.

The presenters from Brazil and Taiwan spoke about having to build data infrastructure from the bottom up without the top-down guidance or incentives common in the U.S., Europe, or Australia.  I was struck by how similar data infrastructure development in Brazil and Taiwan is to that in Canada.  Who are Canada’s data peers?  Nations building their RDMI from the bottom up.

How Far Behind Is Canada From the Frontrunners on the Planet?

Internationally, RDMI consists of a real patchwork of activities regardless of whether the development is top-down or bottom-up.  Looking at the various parts of the patchwork can provide different perspectives about where a country is positioned globally.  This patchwork has been characterized as a Digital Science Ecosystem in the Global Research Data Infrastructure 2020 Roadmap (GRDI2020).  Thinking of research data infrastructure as an ecosystem focuses attention on the complex relationships among important components of scientific research.  To understand these complex relationships in an environment of data-intensive, multidisciplinary research is as challenging as it is to comprehend the interdependency among species in a biological ecosystem.  The authors feel that the broader research environment is as much of a contributor to advances and transformations in scientific fields as technological progress (see p. 17).

The GRDI2020 report describes the Digital Science Ecosystem as being composed of Digital Data Libraries, Digital Data Archives, Digital Research Libraries, and Communities of Research. The relationships among these four components make up the patchwork environment in which this report envisions future scientific research to be conducted.  From both a technical and organizational standpoint, relationships in a digital ecosystem are established and maintained through interoperability mechanisms among these four components.  An earlier entry to this Blog highlighted the importance of institutions in preserving research data.  Three of the GRDI2020 components are based on institutions: digital data libraries, digital data archives, and digital research libraries.  The earlier Blog entry argued that these institutions do not have to be national, central services but can be distributed across existing institutions with a mandate to preserve research data.  The success of such a distributed inter-institutional preservation network will depend on its interoperability across the network and with the wider research environment.

This digital science ecosystem model can be used to assess the current state of research data infrastructure in a country.  Putting aside the various challenges of top-down or bottom-up development, what aspects of the four components of the GRDI2020 ecosystem does a country have?  Furthermore, what interoperability relationships have been established among these components?  Looking specifically at Canada, a strong network of data libraries exists on campuses across the country because of the Data Liberation Initiative (DLI).  Since 1996, academic libraries have provided data services to support the dissemination of standard data products from Statistics Canada.  In addition to providing access to data, DLI also conducts annual training regionally in Canada, constantly upgrading the skills of those who provide data services on their local campus.  Compared to Europe, Canada is much farther along in developing a network of data libraries that support local access to data.  Canada also has a strong network of research libraries with large and growing digital collections, including repository services for research results.  The Achilles heel for Canada is digital data archives.  This is the ecosystem component for which Canada lags far behind the U.S., U.K., Australia, and Germany, although a few research libraries are beginning developments in this area that hopefully will begin to close the gap.  The Canadian Polar Data Network is an example of a new Canadian collaborative, inter-institutional, cross-sectoral, distributed data archive that serves as a model for other Canadian institutions to emulate.

With strategic top-down investment in data preservation services, Canada could have leapfrogged to be among the frontrunners in the digital science ecosystem.  In the absence of top-down development, research libraries working collaboratively with research communities must build from the bottom up to establish data preservation services.  The engagement of senior administrators at Canadian universities in the development of research data infrastructure is critical to a bottom-up strategy.  There is a need for university policies that establish an institutional mandate to preserve research records and that identify institutional data stewardship responsibilities covering the research lifecycle.

Finally, taking on these tasks at the institutional level will help begin the conversation between universities and national funding agencies around the bigger question of who should be doing what regarding data.  Currently, both parties are at loggerheads on this topic.

[The views expressed in this Blog are my own and do not necessarily represent those of my institution.]

Research Data Management Infrastructure III

In earlier entries to this Blog, Research Data Management Infrastructure (RDMI) was defined as the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.  The context for RDMI has already been discussed in terms of the research lifecycle and of the two additional components making up research infrastructure: Canada’s high speed research network and high performance computing services.  This essay will address the elements of data infrastructure and how they are organized.

In developing its Cyberinfrastructure program, the U.S. National Science Foundation funded a project to investigate how best to build successful infrastructure.  Coming out of this study was the report, Understanding Infrastructure. The authors establish early in their work the significant connection between social organization and the use of communication technology.  Regarding cyberinfrastructure, they stress that it “is about more than just pipes and machines” (p. 5) and emphasize the importance of social organizational factors in shaping solutions.  They note that in developing cyberinfrastructure, solutions can be social, technical, or a combination.  They feel that the distribution of solutions is central to building infrastructure.  In the diagram by Millerand, solutions are portrayed as being distributed across two dimensions: technical-social and local-global.

[C]yberinfrastructure is the set of organizational practices, technical infrastructure and social norms that collectively provide for the smooth operation of scientific work at a distance. All three are objects of design and engineering; a cyberinfrastructure will fail if any one is ignored. Understanding Infrastructure (p. 6)

A Textbook Example

Earlier this year I experienced a textbook example of this conceptual model of infrastructure while visiting Bryn Mawr College just as they were changing the way they provide campus wireless services to guests.  When I arrived on campus, I was given a sheet of paper containing the name of the campus wireless service, an account ID and password to log into this service, and a set of instructions for different devices and operating systems.  I was required to obtain a separate account for each device on which I wished to use campus wireless services.

This approach to providing guests with wireless access to the campus network and the Internet falls under the social-local set of solutions in the above infrastructure model.  The procedures were organized around human intervention, i.e., having to find and speak with a person who could provide me with the information sheet, and around social norms requiring me to sign an agreement statement, confirming my acceptance of the rules for using their wireless.  The wireless technology, however, was industry-standard Wi-Fi.

On the second day of my visit, a new wireless service was launched for guests on their campus: Eduroam.  This is the international service that allows academic guests from university members of Eduroam to gain access to secure wireless networking while visiting another Eduroam site.  Because my home institution is an Eduroam member and can authenticate my credentials through this service, I simply open my wireless device, go to the list of available wireless services where I am, and if Eduroam is among them, I select it.  The system behind the scenes allows the local Eduroam host to verify my credentials with my home institution and to provide me with selective network services on their campus.  For example, if the Library has a license for a database that does not allow guests access, the local implementation of Eduroam can hide this database from my guest access.

This service approach falls under the technology-global set of solutions.  My credentials are validated through my home institution using technology, allowing me to connect to wireless services at a member Eduroam campus, without having to go through another person or having to obtain temporary authentication credentials.  Eduroam has given me easy guest access to wireless services in the United States, Germany, and Canada.  Higher education institutions in over fifty-five nations now support Eduroam.  It truly is a global solution to providing guest access to secure wireless networking.
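The contrast between the two guest-access approaches can be made concrete with a small sketch.  What follows is only an illustration of the idea described above, in which the host campus routes a guest's login to the guest's home institution and then filters local services; it is not how Eduroam is actually implemented.  The real service relies on a hierarchy of RADIUS servers and encrypted authentication exchanges, none of which is modelled here.  The class names, the realm "home.example", and the service names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class HomeInstitution:
    """Hypothetical identity provider that checks its own members' credentials."""
    realm: str
    passwords: Dict[str, str] = field(default_factory=dict)

    def verify(self, username: str, password: str) -> bool:
        return username in self.passwords and self.passwords[username] == password


@dataclass
class VisitedCampus:
    """Hypothetical host campus that holds no guest accounts of its own."""
    federation: Dict[str, HomeInstitution]  # realm -> home institution
    hidden_from_guests: Set[str] = field(default_factory=set)

    def connect(self, identity: str, password: str,
                local_services: Set[str]) -> Optional[Set[str]]:
        # Split "user@realm" so the request can be routed to the right home institution.
        username, sep, realm = identity.partition("@")
        home = self.federation.get(realm)
        if not sep or home is None or not home.verify(username, password):
            return None  # unknown realm or rejected credentials: no access
        # Verified guests see local services minus anything restricted to local users.
        return local_services - self.hidden_from_guests


# Illustrative data only: one member institution and one host campus.
home = HomeInstitution("home.example", {"visitor": "correct-password"})
host = VisitedCampus({"home.example": home}, hidden_from_guests={"licensed-database"})
services = {"wireless", "printing", "licensed-database"}

granted = host.connect("visitor@home.example", "correct-password", services)
denied = host.connect("visitor@unknown.example", "correct-password", services)
```

In this sketch the host campus never issues an account or a sheet of paper.  The only local decision it makes is which services to withhold from guests, which mirrors the licensed database example above.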

Cyberinfrastructure and RDMI

How does this particular Cyberinfrastructure (CI) model relate to Research Data Management Infrastructure?  First, the CI model provides a conceptual framework for the definition of RDMI.  The RDMI elements of technology, services, and expertise are part of CI, although not expressed in exactly the same terms.  Applied to RDMI, organizational practices and social norms are aspects of the services supporting data management across the research lifecycle.  Services embody organizational responses to data management.  For example, offering researchers assistance with data management plans requires organizing resources to deliver such a service.  Social norms and expectations are also expressed in services.  A funding agency may require data management plans to get researchers to describe how they will share the data from their project, setting an expectation to share data.  Services in the context of RDMI thus combine the CI characteristics of social norms and organization.

Expertise is another component of CI and RDMI.  Data management activities span the research lifecycle and involve many different skills, drawing upon a variety of expertise.  The demands for data management expertise depend on the scale of the research project.  A small project may involve only a couple of people, who can manage with a general set of skills.  A much larger project may require a team of experts with each team member responsible for a specific specialization.  Expertise also is aligned with responsibilities for data management activities, which were identified as aspects of data stewardship in a previous Blog discussion.

Place is significant in CI and RDMI.  Research is increasingly conducted in collaborative, inter-institutional teams that span nations.  High speed optical research networks are vital for researchers who work at a distance from one another.  Whether working together in real time or asynchronously in different places, the network allows them to organize their workflow so each can contribute.  Similarly, researchers may require access to high performance computing (HPC) but are not located at an HPC site.  Over a research network they may gain access to the computing resources they require.  Distance also comes into play with RDMI.  Data may be gathered in one location, processed at another site, analyzed at yet another place, and preserved in an institution separate from these other locations.  Through a collaborative initiative, such as the Canadian Polar Data Network, an institution may offer preservation services for research data that behind the scenes consist of a distributed dark archive shared among several institutions.  The scope of some research data infrastructure requires global solutions.  One example is the need for infrastructure that will overcome barriers in the free exchange of scientific data across national borders.

The implementations of RDMI will vary from institution to institution but the set of solutions will be distributed locally or globally across technology, services, and expertise.

The next Blog entry will focus on the question:  Who are Canada’s international peers in Research Data Management Infrastructure?

[The views expressed in this Blog are my own and do not necessarily represent those of my institution.]

Research Data Management Infrastructure II

In the previous entry, Research Data Management Infrastructure (RDMI) was defined as the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.  The context for RDMI within the research lifecycle was described and the importance of institutional-level engagement in data stewardship was emphasized.  Finally, the position was taken that cross-institutional collaboration would enable building collectively the national RDMI that has eluded Canada without top-down design or resources.  How does this context compare with the two other pillars of Canada’s research infrastructure?

Research Infrastructure: The Three Pillars

The Canadian University Council of Chief Information Officers (CUCCIO) hosted the Digital Infrastructure Summit in June 2012 in Saskatoon to address the unclear future of research infrastructure in Canada today.  Concerns have been expressed about the lack of a vision for research infrastructure in Canada and the need for more coordinated planning.  For example, the current business models for CANARIE, the coordinating agency for Canada’s high-speed optical research network, and for Compute Canada, the organization for high performance computing, operate on funding cycles that are less than optimal and on brinksmanship review processes that seem to threaten the very existence of this critical infrastructure.  Borrowing from the National Data Summit format, the CUCCIO Summit invited around sixty leaders in research infrastructure to discuss how best to approach these concerns.  Coming out of this forum was the establishment of a Leadership Council with a mission to articulate a vision for research infrastructure and to organize a follow-up summit.

While Canada does not have a formally recognized national organization for RDMI (Research Data Canada and CARL are working to fill part of this void), CUCCIO recognizes data infrastructure as one of three pillars constituting Canada’s research infrastructure, along with a high speed research network and high performance computing. There are some important differences between the formal support for these latter two infrastructure pillars and RDMI.  First, different forces drive these three infrastructure pillars.

  1. CANARIE provides top-down coordination and incentives, working with a group of Optical Regional Advanced Networks (ORANs) across the country.   The ORANs keep the operational delivery of the high speed network close to the researchers in their areas, while CANARIE works to weave the regional communication networks into a national research service.
  2. High Performance Computing (HPC) in Canada has a similar organizational structure of regional services (WestGrid, Compute Ontario, Calcul Quebec, Compute Atlantic) with national governance provided through Compute Canada, although the regional services tend to operate with a tradition of independence.  Nevertheless, HPC has received top-down incentives, including financial support through the Canada Foundation for Innovation.
  3. As already stated, RDMI does not have a formal national organization to represent its interests, although there are national coordinating roles for both Research Data Canada and CARL to play in data curation and infrastructure within their communities.  Unfortunately, no regional organizations for data infrastructure exist.

While RDMI has been embraced as an equal infrastructure partner by leaders in CANARIE and Compute Canada, the playing field is clearly unequal at this stage.  The good news is that Research Data Canada and CARL continue to be invited to participate in events organized by the other two infrastructure partners.

Second, the voice for RDMI is often ad hoc and diluted.  CANARIE and Compute Canada serve as single points of contact for their infrastructure.  Typically, individual researchers are called to speak on behalf of data infrastructure, even though they may represent only a narrow perspective on data management infrastructure.  A consequence is that the voice for research data often becomes haphazard.  The risks are that a data advocate may not be present at an important research infrastructure event or that the message is too narrow for today’s range of research data issues.

Third, RDMI is dependent on bottom-up initiatives, requiring a great deal of coordination and cooperation to be successful.  The organization of top-down initiatives typically depends on control and governance.  With bottom-up projects, the most important organizational factors are trust, collaboration, and cooperation.  These two different organizational structures also tend to result in different styles of internal politics.

Finally, the international peers for each of Canada’s infrastructure pillars are different.  Both CANARIE and Compute Canada see their counterpart organizations in the United States, Australia, United Kingdom, and the rest of Europe as their peers.  The models and practices for funding and planning are also similar among these peers.  Look at what is happening to RDMI within this same group of countries: the National Science Foundation in the U.S. provides grants for data curation projects through its DataNet program; the European Union supported the Global Research Data Infrastructures 2020 project to help chart the course for developing a global data ecosystem; Australia established the Australian National Data Service to support researchers with their data curation needs; in the U.K. JISC offers its Managing Research Data program, which funds projects in RDMI.  These examples are all top-down driven and involve incentive programs for data infrastructure.  At this stage, the development of RDMI in Canada has very little in common with CANARIE and Compute Canada’s international peers.  A subsequent Blog entry will address who the international peers currently are for Canada’s RDMI.

The next entry discusses RDMI components of technology, services and expertise and how they are organized locally or globally.

[The views expressed in this Blog are my own and do not necessarily represent those of my institution.]

Research Data Management Infrastructure I

Beginning in 2010, the authors of the CARL application to the Canada Foundation for Innovation (see Community Actions to Preserve Research Data in Canada) used the term, Research Data Management Infrastructure (RDMI), to identify the confluence of e-Science, Cyberinfrastructure, and data-intensive science (see From National Institution to National Infrastructure).  We like to believe that we coined this concept, although in the U.K. a JISC funding envelope used the identical terminology at approximately the same time.  The JISC program description mentions several drivers that shaped the purpose of this specific funding envelope, many of which are just as relevant in Canada as in the U.K.

Higher education institutions are under increasing pressure to provide services and infrastructure for research data management. These pressures come from a variety of sources: the opportunities of more data intensive and more open, collaborative research; the requirements of research funders; the increasing concern for research transparency and integrity; institutions concern to avoid [reputational] damage caused by poor responses to FoI requests or by data loss.  JISC website

The definition I use for RDMI builds on the context described in the JISC description:

RDMI is the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.

This discussion will focus specifically on the context served by RDMI, namely, data activities across the research lifecycle.

The Research Lifecycle

The research process is made up of a large set of activities that tend to be grouped into a series of fairly discrete stages.  Each stage typically consists of a set of related activities to accomplish a primary task, the outcomes of which are then passed to the next stage. For example, a survey’s design stage will result in the selection of a sample and an instrument for collecting data. The completion of these activities flows to a data collection stage where interviews are conducted and information is gathered from the sample.  While not all stages are necessarily linear, many of them do have logical dependencies that require sequential ordering.  For example, a research proposal is typically prepared before a grant application is submitted.

As with any project management operation, the granularity at which activities are described presents different views of a project. Similarly, the stages in the research lifecycle can be aggregated or disaggregated into larger or smaller groupings. Nevertheless, there is a level at which a primary task will be accomplished and its outcomes passed to another stage.  In a survey, for example, there is a point where data processing is completed and a data product is passed along for analysis.

Research workflow of a typical scholar showing the nonlinear development of research projects and the multiple stages at which data are collected

The Jahnke and Asher diagram of the workflow of a typical researcher is intended to show the nonlinear nature of the research process.  I feel that the more important message depicted in this workflow is the connection to data throughout the various stages.  Many of the activities in the research lifecycle indirectly or directly involve aspects of data management.  The above diagram shows examples of data-related tasks in the feasibility research, project design, and active research stages.

The Research Lifecycle

There is a second lifecycle that is closely interrelated with the research lifecycle.  This is the data lifecycle, which overlaps stages in the research lifecycle but also consists of some important stages independent of project-based research, including stages dealing with data dissemination, preservation, discovery, and repurposing.  While the Humphrey diagram does not identify stages specific to the data lifecycle, it does consist of more stages outside the project level shown in the Jahnke and Asher diagram, including references to knowledge transfer and repositories for data and research outputs.

Research Data Management

Research data management involves the practices and activities across the research lifecycle that involve the operational support of data through design, production, processing, documentation, analysis, preservation, discovery and reuse.  Collectively, these data-related activities span the stages of project-based research as well as the extended stages that tend to be institutionally based.  The activities are about the “what” and “how” of research data.

RDMI is the configuration of staff, services, and tools assembled to support data management across the research lifecycle and more specifically to provide comprehensive coverage of the stages making up the data lifecycle.

Data Stewardship

In contrast to Data Management, Data Stewardship is about the identity of those responsible for ensuring data management activities are performed to best practice levels and standards across the lifecycle.  Stewardship addresses “who” is responsible for a specific data activity (I’d like to acknowledge Wendy Watkins’ contribution in making this distinction between responsibility and activity).  Data policies, institutional norms, granting council requirements, and domain practices all contribute to defining the roles of those who are responsible for data at the various lifecycle stages.  Ideally, a comprehensive plan at the beginning of a research project would identify the supporting parties across the data lifecycle.  If a data management plan fails to identify who is responsible for specific data-related activities, the risk that not all activities will be completed is heightened.  Data Management Plans should be broadened to become Data Management and Stewardship plans.
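The risk described above, that a plan leaves some activity without a responsible party, can be checked mechanically once a plan records the “who” alongside the “what.”  The following is a minimal sketch under stated assumptions: the list of activities is taken from the definition of research data management given earlier, while the function name, the dictionary layout, and the steward names are hypothetical and do not come from any funder’s template.

```python
from typing import Dict, List

# Data-related activities named in the definition of research data management above.
ACTIVITIES = [
    "design", "production", "processing", "documentation",
    "analysis", "preservation", "discovery", "reuse",
]


def unassigned_activities(plan: Dict[str, str]) -> List[str]:
    """Return the activities for which the plan names no steward."""
    return [activity for activity in ACTIVITIES
            if not plan.get(activity, "").strip()]


# Hypothetical plan: "discovery" is listed but left blank, "reuse" is missing entirely.
plan = {
    "design": "principal investigator",
    "production": "principal investigator",
    "processing": "research assistant",
    "documentation": "research assistant",
    "analysis": "principal investigator",
    "preservation": "university library",
    "discovery": "",
}

gaps = unassigned_activities(plan)
```

Here `gaps` holds the two activities with no named steward, the kind of omission a combined Data Management and Stewardship plan is meant to surface at the start of a project.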

The design of RDMI needs to enable data stewards across the data lifecycle to fulfill their responsibilities.

Project-level and Institutional-level Stewardship

A clarification needs to be made about the parties responsible for the various stages of the data lifecycle.  We are currently in a period during which data stewardship roles are under scrutiny.  Clearly, there are stages for which the researcher is the data steward.  The model for conducting research has traditionally been at the project level.  In this context, the researcher is responsible for both defining and conducting the work.  They are also often responsible for securing the funds to do the research.

However, as noted in the JISC quote above, increasingly institutions are discovering a need to take on new responsibilities dealing with research data management, which often entails providing services and infrastructure.  University administrators are much more aware of the value of data to their institution than they necessarily were in the past.  Both operational and research data are now being treated as digital assets that need policies, practices, services, and infrastructure to secure their future.  One consequence is the willingness of some institutions to support stages in the data lifecycle that previously had fallen between the cracks.  Some of these new responsibilities for data require additional investments in services and infrastructure, while others will involve the redeployment of staff or reconfiguration of services to fulfill newly accepted data responsibilities.

Research Data Management Interventions

The Jeffreys diagram shows stages in the institutional model from a JISC-funded project to develop research data management infrastructure at the University of Oxford.  While this graphic was not necessarily intended to depict the shared responsibilities between project-level research and the institution, one can see the interplay between both.  The left-hand stages of Project Planning, Project Setup, Data Creation, Documentation, and aspects of Local Storage largely are the researcher’s responsibilities, while the institution assumes responsibility for aspects of Local Storage, Institutional Storage, Rediscovery Mechanism, and Retrieval Mechanism.  Oxford has chosen to provide a mix of services across all of these stages even though the researcher is the primary data steward in half of the stages.  The infrastructure in these stages is to help researchers accomplish their data management tasks without yielding their control over them.

Working with researchers on their campus, university senior administrators have an important leadership role in developing ground-level RDMI.

Institutional Collaboration

The most innovative nations in the future will be those that best manage their research data today.  This is a meaningful incentive for institutions in Canada to collaborate in the development of RDMI, and all the more important in the absence of top-down national support.  No single institution can on its own manage the problem posed by research data.  But collectively, institutions working together can build the shared infrastructure needed by the research community.

Several successful models of collaboration across institutions attest to the viability of building national RDMI through a shared approach.  The Canadian Polar Data Network is one example of a cross-sector collaboration between the higher education and federal government sectors that provided data curation and preservation services for Canadian-funded research in the recent International Polar Year.  Collectively, this network of institutions is able to provide a greater service than any one could offer individually.

Institutional engagement in data stewardship becomes an important step in developing bottom-up national RDMI.

The next essay addresses the three pillars making up research infrastructure in Canada and compares RDMI with the support for high speed research networks and high performance computing.

[The opinions expressed in this Blog are my own and do not necessarily represent those of my institution.]

From National Institution to National Infrastructure

The idea of a distributed network providing data archive functions was presented as one of three models in the 2002 report of the National Data Archive Consultation (NDAC). This was a radical departure from the concept of a national institution supporting research data. After all, preservation requires the longevity of a trusted, enduring institution. Individuals and technology come and go but an institution is needed to span the centuries. In comparison, the notion of a series of nodes connected to a network configuration seemed very ephemeral. We all know that technology is anything but static. How could a national data archive be based simply on one of today’s technology platforms?  This perception, however, was a misunderstanding about how a distributed network for digital preservation could be organized.

At the time of the NDAC final report, digital preservation as a field was starting to come into its own, having only seriously taken root in the latter part of the 1990s.  Much of the initial focus within digital preservation was on individual institutions developing practices and building infrastructure to preserve local digital collections of texts and images.  With the development of computing platforms to support institutional repositories and with the popularization of open access publishing, activity in digital preservation accelerated.  While these developments tended to focus on single institutional initiatives, the underlying infrastructure was capable of supporting a nationally distributed research data preservation network consisting of institutions collaboratively committed to the longevity of the service.

This became the backdrop to a fundamental shift in the way national research data preservation services in Canada might be established.  The introductory essay to this Blog indicated that several studies over the years proposed building a new national institution for this purpose.  This was the dominant model until approximately 2006.  Until then, implementing a national data archive was seen primarily to depend on a champion to stir up the necessary political will to build the new institution.  In addition, this vision was very much a top-down approach to accomplishing this mission.

At the time that NCASRD was underway in Canada, e-Science had established itself in Europe, while equivalent activities in the United States were called Cyberinfrastructure.  Both e-Science and Cyberinfrastructure have their origins in national funding programs supporting computationally intensive infrastructure for the management and processing of very large datasets (now commonly known as “big data”).  Of course, this included high-speed optical research networks and high-performance computing (HPC).  Around 2007, Jim Gray broadened the understanding of e-Science through his work on data-intensive science, which he characterized as data capture, data curation, and data analysis (see The Fourth Paradigm, which was dedicated to his memory).  Data-intensive research quickly unveiled the need for data interoperability across scientific domains.  In fact, data interoperability has become an integral part of e-Science and Cyberinfrastructure.  The net result of e-Science, Cyberinfrastructure, and data-intensive science has been an investment in and development of new computational services built around research data.

The CARL application to the Canada Foundation for Innovation called these new computational services Research Data Management Infrastructure (RDMI).  It represents the confluence of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.  Understanding infrastructure for research data from this perspective changes the focus from a dependence on top-down initiatives to the potential for bottom-up organization.  The CARL contribution also consisted of persistent institutions dedicated to digital preservation.  Several research libraries committed to the long-term, collaborative operation of digital preservation infrastructure could replace the model of a single, national data archive.  Instead of a national institution, there is now a viable alternative of national infrastructure to support the management and preservation of data.

The next essay goes more deeply into research data management infrastructure.

Community Actions to Preserve Research Data in Canada

It takes a research community to preserve its data.

Without leadership from a national institution for research data management and preservation (see this Blog’s Introduction), communities that have interests in research data in Canada have become essential in moving forward an agenda to build this needed infrastructure.  At this stage, the strategy for data in Canada has become reliant on community-level actions.  The diversity of domains, sectors, and jurisdictions with stakes in research data complicates efforts to mobilize a grassroots, bottom-up plan for action.  There have, however, been some recent community activities around data that are encouraging.

Library and Archives Canada (LAC) tapped into the cultural, heritage, and academic sectors to achieve community engagement in identifying basic principles and goals for a national digital information strategy.  The Canadian Association of Research Libraries (CARL) undertook the drafting of an application to the Canada Foundation for Innovation (CFI) for research data management infrastructure.  As background support for its application, the steering committee for this initiative consulted widely across the scholarly research community, bringing together data interests from diverse research domains.  The Research Data Strategy Working Group (RDSWG), composed of representatives from organizations and agencies concerned about research data in Canada, sought ways to implement recommendations from the deadlocked National Consultation on Access to Scientific Research Data (NCASRD).  The work of this group contributed to the successful 2011 National Data Summit that attracted the participation of over 160 senior officials with interests in research data across sectors.

The Canadian Digital Information Strategy

LAC undertook a community-based consultation beginning in 2005 to develop a national digital information strategy.  Working with over 200 organizations from the public, private, and academic sectors, LAC held a National Summit in December 2006 that brought together representatives from these stakeholders to identify key components of a digital strategy.  Early in 2007, the Strategic Development Committee (SDC) was struck to synthesize the output of the Summit and to provide substantive input into a draft strategy.  Three sub-groups (Science and Research; Cultural Heritage; and Government Information) were formed to tackle this Committee’s workload.  The contributions of the Science and Research sub-group made the draft version of the digital strategy a valuable statement for and relevant to research data.  The resulting draft digital information strategy was released in the fall of 2007, launching a public review that ran until early 2008.  In March 2010, stakeholders who contributed to the consultation were sent a copy of the final report entitled Canadian Digital Information Strategy: Final Report of Consultations with Stakeholder Communities 2005–2008, bringing closure to the process.

Federal inter-departmental politics intervened between 2008 and 2010, undermining the important community involvement that went into forming this strategy.  Industry Canada laid claims on the digital economy and perceived the national digital information strategy as treading on its turf.  The outcome of this internal political struggle surfaced in the May 10, 2010 announcement of National Consultations on a Digital Economy Strategy, made jointly by the then Minister of Industry (Tony Clement), the Minister of Canadian Heritage and Official Languages (James Moore), and the Minister of Human Resources and Skills Development (Diane Finley).  This consultation was conducted online for only one month, drawing upon only a fraction of the community involvement in the digital information strategy.

While the politics over strategies between digital information and a digital economy undermined the eventual adoption of the Canadian Digital Information Strategy, the engagement of the community in shaping the LAC-developed strategy was very successful and demonstrated common ground among the diverse group of stakeholders that have interests in managing, providing access to, and preserving digital content.

A Proposal for a National Collaborative Research Data Infrastructure

The CARL Directors launched an initiative in June 2010 to prepare a proposal for a national collaborative research data infrastructure.  While the Canada Foundation for Innovation had yet to announce the program envelope for its next funding round, there was anticipation of a program that might support applications for a national platform.  The vision was to finance the development of a national network of research data services at contributing CARL member institutions, including ingest centres to work with researchers in receiving their data, staging repositories to assist researchers with the management of their data over the life of a project, and data repositories responsible for the long-term preservation of digital research data.  From the beginning of this initiative, the CARL Directors met with many stakeholders, seeking their endorsement for the proposal.  They achieved support from organizations representing other components of Canada’s research infrastructure: Canada’s high-speed optical research network (CANARIE), Canada’s high performance computing grid (Compute Canada), and the Canadian University Council of Chief Information Officers (CUCCIO).  They also held a meeting with researchers from several domains to identify their requirements of a national research data infrastructure.  Out of these discussions with fellow stakeholders, CARL built a network of supporters within the research community.

When CFI announced its funding program, it did not include a national platform competition.  This complicated the logistics of the CARL proposal.  The CFI program that was being run would require each university to make the CARL proposal a high priority among the other proposals on its campus.  In the end, not enough support could be garnered from campuses to compete in this funding round.  The politics of funding envelopes rather than inter-departmental turf prematurely ended this effort at building national research data services.  Nevertheless, the CARL Directors were successful in communicating their ideas and in building community support for this vision.

The Research Data Strategy Working Group’s National Data Summit

The phoenix rising out of the ashes of the National Data Archive consultation and NCASRD was the Research Data Strategy Working Group.  This informal group, without having financial backing, seeks to find ways of advancing the recommendations of the two earlier consultations.  With roots in a number of organizations and agencies for which research data are important, members in the RDSWG strive to keep one another informed about projects and opportunities that will push the research data agenda forward.  When the government signaled its support for open data in 2010, the RDSWG capitalized on this new direction by proposing to host a National Data Summit that would bring together senior officials to discuss the challenges around research data in Canada.  Funding doors opened as the RDSWG promoted the idea of such a Summit and by the spring of 2011 a program was put in place for September 2011.

The outcome of the National Data Summit was the widespread recognition that research data activities need to be coordinated in Canada.  The discussions in the Summit revealed many common issues across research domains and sectors, demonstrating the value of a forum for sharing and debating data issues.  The Summit participants recommended holding a similar event within eighteen months and endorsed formalizing a secretariat to support such a forum.  In the fall of 2012, the RDSWG reorganized itself into Research Data Canada and continues to develop its role as a national forum for data stewardship issues.

Lessons for the Research Data Community

The experiences of the Canadian Digital Information Strategy, caught in the crosshairs of inter-departmental politics, and of CARL’s withdrawn CFI application both provide important lessons.  Neither the level of community engagement in defining strategic directions nor its endorsement of such a course was exempt from an inter-departmental power grab.  Some political battles are difficult to anticipate; others fall into a consistent pattern.  After all, the Federal Minister who buried the Canadian Digital Information Strategy also dealt the deathblow to the 2011 Census mandatory long form, which would have produced one of Canada’s highly valuable digital information assets.  One lesson from this experience is to avoid turf battles between federal departments, unless Treasury Board is on your side.

Similarly, one cannot assume that innovative ideas, even ones that could accelerate Canada to the forefront of research data infrastructure, will trump local interests.  A lesson from the CARL experience is that funding to develop nationally shared services will face stiff competition from local interests, even though the national services may benefit those locally.  This is one situation where strong community intervention may be able to persuade local interests that national gains outweigh any perceived local loss.

The response to the National Data Summit and Research Data Canada shows that Canada’s research community is willing and eager to engage in activities that may shape strategies and plans around data management and preservation.  This undercurrent of support needs to be nurtured and channeled to achieve a national collaborative research data infrastructure.

The next essay looks at the strategic shift from a national institution to national infrastructure for research data in Canada.

Canada’s Long Tale of Data

The challenges around preserving research data in Canada have reached a point where we can no longer wait for a solution to be handed down from on high. If we are to save data produced from today’s research, we are going to have to work together with “memory” institutions in Canada willing to incorporate research data into their mandates for preservation.  The essays in this Blog address different issues around preserving research data in Canada.

The “Long Tale of Data” subtitle for this Blog is a play on the word, tail.  “The Long Tail of Data” describes the distribution of the number of datasets by their file storage size.  The curve for this distribution shows relatively fewer very large datasets compared to the tens of thousands of smaller sized datasets.  Currently, “big data”, i.e., the very large datasets, are receiving a lot of attention, while the myriad of smaller datasets pose their own daunting challenges for management and preservation.  This story about the data tail is only one of many tales about research data today.

Early Attempts

The focus here is to share several data tales, beginning with the story behind earlier attempts to establish a national data archive in Canada.  These efforts span four decades.  During the 1960s and early 1970s, several countries, including the United States, United Kingdom, Australia, and Germany, established social science data archives (these same nations are often viewed as Canada’s research peers, which is discussed further in two following Blog entries: RDMI 3 and Data Peers).  The social sciences became the domain of early national developments in data access and preservation.

In the late 1970s, a Canadian initiative established a catalogue for social science research data.  Known as the Data Clearing House for the Social Sciences, this organization did not hold any data files but did produce at least one printed catalogue of social science research data before shutting down its operations rather hastily.  After the demise of the short-lived Data Clearing House, it was difficult to find a Canadian funding agency willing to take a risk in this type of national research data infrastructure.

In 1973, the Public Archives of Canada established the Machine Readable Archives Division that provided research data preservation services for federal government departments and agencies.  Unfortunately, a reorganization within the Archives in 1987 resulted in this division being disbanded and its staff being dispersed among the remaining divisions within the Archives.  As a consequence, no coordinated effort was made to replace the services provided by the Machine Readable Archives and the gap between what had been collected and what failed to be collected grew rapidly.

The demise of the Machine Readable Archives Division hurt Canada in several significant ways.  No longer was there a national proponent for data preservation in Canada.  Stakeholders were without a formal body to which they could express their concerns about research data, i.e., no national forum for data existed.  No formal structure existed to develop standards or best practices in data management and preservation.  Without a formal, national body for data, Canada was without a unified voice in the international data arena.  All of these factors have contributed to stunting the stewardship of research data in Canada.

There have been a few stopgap efforts to archive research data in the social sciences in the absence of a national institution.  Most notably, the Data Library at the University of British Columbia under the leadership of Laine Ruus began collecting data in the 1970s.  A dozen university data libraries across Canada agreed to receive data from SSHRC-funded projects beginning in 1989, but only until a national preservation service could be established.  This group became known as Appendix J, which was the appendix in the SSHRC application guide where they were listed.  Without strong incentives to submit datasets to a member of the Appendix J group, very few researchers deposited any data.

A Body of Evidence

Several studies have documented the need for a national data archive or a national institution providing data management and preservation services.  One of the earliest cases appears in the report, Survey research: report of the Consultative Group on Survey Research in 1976.  While the report emphasizes access to survey data without directly tackling preservation, the authors do recommend that “the initial preparation of the data should be done not only for immediate use but also in view of ultimate storage in a data bank [emphasis added, archaic] (p. 1.21).”  Providing long-term access without specifically naming preservation is common even today among elements of the research community.

In 1996, the Data and Information Systems Panel of the Canadian Global Change Program released a report that now serves as a benchmark against which progress in research data management in Canada can be assessed.  Data Policy and Barriers to Data Access in Canada: Issues for Global Change Research contains ten recommendations under five categories: Infrastructure, Archiving, Documentation, Access, and Standards.  Regarding the preservation of research data, this report states: “There is a lack of focus for archival standards and processes in Canada (p. 51).”  This absence of focus has been a significant obstacle in getting the appropriate attention of senior officials in Canada to address research data management and preservation.  The 2011 National Data Summit (see below) was an important, recent step in gaining the focus of a group of senior administrators.

The Canadian Association for Public Data Use (CAPDU) called for a national data archive in a submission to John English’s review of the National Library and Archives of Canada in 1998.  The final report, The Role of the National Archives of Canada and the National Library of Canada, included a recommendation calling for action to preserve research data.  An outcome of the English report was the striking of the National Data Archive Consultation (NDAC) in 2001 and 2002, which the National Archives of Canada and the Social Sciences and Humanities Research Council jointly sponsored.  This consultation produced two reports.  The first volume, Phase One: Needs Assessment Report, documented the case for national data archive services, while the second volume, Building Infrastructure for Access to and Preservation of Research Data, described various models for such services.  The momentum from these two publications was lost when the search for a senior official to champion the consultation’s findings within Government failed to produce one within a year and a half of the final report’s release.

In 2004, the National Consultation on Access to Scientific Research Data (NCASRD) was launched to address the issues of data access in the physical and life sciences.  This consultation was directed to build and expand upon the work completed two years earlier in the humanities and social sciences.  The growing interest in e-Science and the OECD Principles and Guidelines for Access to Research Data from Public Funding were instrumental in the timing of this consultation.  The Final Report of the National Consultation on Access to Scientific Research Data was released in June 2005 and called for the establishment of a national steering body, Data Canada, to help coordinate data management and preservation services.  Again a champion was sought to advance this study’s findings but no one was found within a reasonable period of time, leaving this study’s agenda sidelined like the previous efforts.

In 2008, a working group, under the guidance of Pam Bjornson, Executive Director of CISTI, began to explore ways of implementing some of the recommendations in the NCASRD final report in the absence of a national research data steering body.  Known as the Research Data Strategy Working Group (RDSWG), they conducted a study in 2008 assessing the gaps in data stewardship in Canada.  This analysis provided an update to the NDAC needs assessment from earlier in the decade.  In 2011, the gap analysis was brought up to date and incorporated into the backgrounder information disseminated in advance of the September 2011 National Data Summit organized by the RDSWG.  Approximately 160 senior managers concerned about the management of research data in Canada attended this event.  The Summit’s final report, Mapping the Data Landscape: Report of the 2011 Canadian Research Data Summit, included a set of recommendations to develop stronger community involvement in research data management and preservation (this is discussed further in the next Blog entry).

Moving Forward

By 2006, the possibility of a new national institution established specifically for research data management and preservation was clearly not in the cards for Canada.  The failure to find a senior official to champion the recommendations from either the NDAC or NCASRD studies was a clear indicator that this was not going to happen.  The requirement for such an institution had been demonstrated multiple times; however, the political will essential to make it happen could not be mobilized.  At the same time that this quest was dead-ending, new developments in e-Science and Cyberinfrastructure were taking shape internationally that opened a new strategy for research data in Canada.  This is discussed in more detail in the Blog entry, From National Institution to National Infrastructure.

Was the pursuit of a national institution dedicated to research data in Canada a foolhardy idea?  Comparing Canada to its usual peers, one finds institutions dedicated to social science research data that are now approaching sixty years of operation.  Given this, the quest for a Canadian national data archive made a great deal of sense.  Canada simply seemed to be lagging behind its peers and was in need of quickly catching up.  More recent evidence, however, suggests that many of us in Canada were mistaken about which countries we should consider as our peers in research data infrastructure, especially in light of the absence of a political will in Canada to establish a new institution for this purpose.  This topic is discussed in another Blog entry.

There are several other reasons why a national institution was far from being foolish.  A national data archive would help Canada address several important issues that require a national focus.  There is the need to identify clear mandates that define the data stewardship roles of various organizations.  These mandates span federal and provincial jurisdictions as well as the public and private sectors.  The complexity of multiple mandates in such an environment would be best handled through a national forum.  Legislation directed at the general management and use of sensitive or confidential data often conflicts with valid research uses of such data.  A national data archive could facilitate the resolution of data issues around sensitive or confidential data.  Canada is in need of national leadership to build standards and best practices for the management and preservation of research data.  Finally, without a formal institutional voice for data, Canada is disadvantaged internationally.  Representation in international data agreements and initiatives is critical for Canada’s researchers to stay competitive.  Even without a national institution for research data management and preservation, the need still exists to coordinate and manage these national data issues.

The next essay looks at a growing community of support around research data management in Canada.
