Canada’s Long Tale of Data

The challenges around preserving research data in Canada have reached a point where we can no longer wait for a solution to be handed down from on high. If we are to save data produced from today’s research, we are going to have to work together with “memory” institutions in Canada willing to incorporate research data into their mandates for preservation.  The essays in this Blog address different issues around preserving research data in Canada.

The “Long Tale of Data” subtitle for this Blog is a play on the word “tail.”
“The Long Tail of Data” describes the distribution of the number of datasets by their file storage size.  The curve for this distribution shows relatively few very large datasets compared to the tens of thousands of smaller datasets.  Currently, “big data”, i.e., the very large datasets, are receiving a lot of attention, while the myriad smaller datasets pose their own daunting challenges for management and preservation.  This story about the data tail is only one of many tales about research data today.

Early Attempts

The focus here is to share several data tales, beginning with the story behind earlier attempts to establish a national data archive in Canada.  These efforts span four decades.  During the 1960s and early 1970s, several countries, including the United States, United Kingdom, Australia, and Germany, established social science data archives (these same nations are often viewed as Canada’s research peers, which is discussed further in two following Blog entries: RDMI 3 and Data Peers).  The social sciences became the domain of early national developments in data access and preservation.

In the late 1970s, a Canadian initiative established a catalogue for social science research data.  Known as the Data Clearing House for the Social Sciences, this organization did not hold any data files but did produce at least one printed catalogue of social science research data before shutting down its operations rather hastily.  After the demise of the short-lived Data Clearing House, it was difficult to find a Canadian funding agency willing to take a risk in this type of national research data infrastructure.

In 1973, the Public Archives of Canada established the Machine Readable Archives Division that provided research data preservation services for federal government departments and agencies.  Unfortunately, a reorganization within the Archives in 1987 resulted in this division being disbanded and its staff being dispersed among the remaining divisions within the Archives.  As a consequence, no coordinated effort was made to replace the services provided by the Machine Readable Archives and the gap between what had been collected and what failed to be collected grew rapidly.

The demise of the Machine Readable Archives Division hurt Canada in several significant ways.  No longer was there a national proponent for data preservation in Canada.  Stakeholders were without a formal body to which they could express their concerns about research data, i.e., no national forum for data existed.  No formal structure existed to develop standards or best practices in data management and preservation.  Without a formal, national body for data, Canada was without a unified voice in the international data arena.  All of these factors have contributed to stunting the stewardship of research data in Canada.

There have been a few stopgap efforts to archive research data in the social sciences in the absence of a national institution.  Most notably, the Data Library at the University of British Columbia under the leadership of Laine Ruus began collecting data in the 1970s.  A dozen university data libraries across Canada agreed to receive data from SSHRC-funded projects beginning in 1989, but only until a national preservation service could be established.  This group became known as Appendix J, after the appendix in the SSHRC application guide where its members were listed.  Without strong incentives to submit datasets to a member of the Appendix J group, very few researchers deposited any data.

A Body of Evidence

Several studies have documented the need for a national data archive or a national institution providing data management and preservation services.  One of the earliest cases appears in the report, Survey research: report of the Consultative Group on Survey Research in 1976.  While the report emphasizes access to survey data without directly tackling preservation, the authors do recommend that “the initial preparation of the data should be done not only for immediate use but also in view of ultimate storage in a data bank [emphasis added, archaic] (p. 1.21).”  Providing long-term access without specifically naming preservation is common even today among elements of the research community.

In 1996, the Data and Information Systems Panel of the Canadian Global Change Program released a report that now serves as a benchmark against which progress in research data management in Canada can be assessed.  Data Policy and Barriers to Data Access in Canada: Issues for Global Change Research contains ten recommendations under five categories: Infrastructure, Archiving, Documentation, Access, and Standards.  Regarding the preservation of research data, this report states: “There is a lack of focus for archival standards and processes in Canada (p. 51).”  This absence of focus has been a significant obstacle in getting the appropriate attention of senior officials in Canada to address research data management and preservation.  The 2011 National Data Summit (see below) was an important, recent step in gaining the focus of a group of senior administrators.

The Canadian Association for Public Data Use (CAPDU) called for a national data archive in a submission to John English’s review of the National Library and Archives of Canada in 1998.  The final report, The Role of the National Archives of Canada and the National Library of Canada, included a recommendation calling for action to preserve research data.  An outcome of the English report was the striking of the National Data Archive Consultation (NDAC) in 2001 and 2002, which the National Archives of Canada and the Social Sciences and Humanities Research Council jointly sponsored.  This consultation produced two reports.  The first volume, Phase One: Needs Assessment Report, documented the case for national data archive services, while the second volume, Building Infrastructure for Access to and Preservation of Research Data, described various models for such services.  The momentum from these two publications was lost when, a year and a half after the final report’s release, the search for a senior official to champion the consultation’s findings within Government had still not succeeded.

In 2004, the National Consultation on Access to Scientific Research Data (NCASRD) was launched to address the issues of data access in the physical and life sciences.  This consultation was directed to build and expand upon the work completed two years earlier in the humanities and social sciences.  The growing interest in e-Science and the OECD Principles and Guidelines for Access to Research Data from Public Funding were instrumental in the timing of this consultation.  The Final Report of the National Consultation on Access to Scientific Research Data was released in June 2005 and called for the establishment of a national steering body, Data Canada, to help coordinate data management and preservation services.  Again a champion was sought to advance this study’s findings, but no one was found within a reasonable period of time, leaving this study’s agenda sidelined like the previous efforts.

In 2008, a working group, under the guidance of Pam Bjornson, Executive Director of CISTI, began to explore ways of implementing some of the recommendations in the NCASRD final report in the absence of a national research data steering body.  Known as the Research Data Strategy Working Group (RDSWG), the group conducted a study in 2008 assessing the gaps in data stewardship in Canada.  This analysis provided an update to the NDAC needs assessment from earlier in the decade.  In 2011, the gap analysis was brought up to date and incorporated into the background information disseminated in advance of the September 2011 National Data Summit organized by the RDSWG.  Approximately 160 senior managers concerned about the management of research data in Canada attended this event.  The Summit’s final report, Mapping the Data Landscape: Report of the 2011 Canadian Research Data Summit, included a set of recommendations to develop stronger community involvement in research data management and preservation (this is discussed further in the next Blog entry).

Moving Forward

By 2006, the possibility of a new national institution established specifically for research data management and preservation was clearly not in the cards for Canada.  The failure to find a senior official to champion the recommendations from either the NDAC or NCASRD studies was a clear indicator that this was not going to happen.  The requirement for such an institution had been demonstrated multiple times; however, the political will essential to make it happen could not be mobilized.  At the same time that this quest was dead-ending, new developments in e-Science and Cyberinfrastructure were taking shape internationally that opened a new strategy for research data in Canada.  This is discussed in more detail in the Blog entry, From National Institution to National Infrastructure.

Was the pursuit of a national institution dedicated to research data in Canada a foolhardy idea?  Comparing Canada to its usual peers, one finds institutions dedicated to social science research data that are now approaching sixty years of operation.  Given this, the quest for a Canadian national data archive made a great deal of sense.  Canada simply seemed to be lagging behind its peers and in need of catching up quickly.  More recent evidence, however, suggests that many of us in Canada were mistaken about which countries we should consider as our peers in research data infrastructure, especially in light of the absence of a political will in Canada to establish a new institution for this purpose.  This topic is discussed in another Blog entry.

There are several other reasons why the idea of a national institution was far from foolish.  A national data archive would help Canada address several important issues that require a national focus.  There is the need to identify clear mandates that define the data stewardship roles of various organizations.  These mandates span federal and provincial jurisdictions as well as the public and private sectors.  The complexity of multiple mandates in such an environment would be best handled through a national forum.  Legislation directed at the general management and use of sensitive or confidential data often conflicts with valid research uses of such data.  A national data archive could facilitate the resolution of issues around sensitive or confidential data.  Canada is in need of national leadership to build standards and best practices for the management and preservation of research data.  Finally, without a formal institutional voice for data, Canada is disadvantaged internationally.  Representation in international data agreements and initiatives is critical for Canada’s researchers to stay competitive.  Even without a national institution for research data management and preservation, the need still exists to coordinate and manage these national data issues.

The next essay looks at a growing community of support around research data management in Canada.

[The opinions expressed in this Blog are my own and do not necessarily represent those of my institution.]