The Data Repository Community Needs To Take Certification Seriously

An editorial in the 2015/01/02 issue of Science notes that the journal will be identifying data repositories to promote among its authors and readers this year. The selection criteria mentioned call for repositories that “are well managed, have long-term support, and are responsive to community needs.”

Up to this point, many librarians have been concerned about publishers establishing their own data repositories and then charging for access to these collections of research data. This recent Science editorial presents a new potential concern regarding publisher engagement in research data management. Might a publisher’s endorsement of a data repository be construed as certification for that repository? Will publishers end up setting a de facto bar for trusted status of data repositories? What will be the implications for domain or local data repositories outside Science’s scope? These are topics needing the attention of the data repository community now.

I see the value of journals being able to recommend data repositories to authors who might be unaware of the choices available to them. However, in the absence of a widely supported and independent certification process, the data repository community runs the risk of journals conducting assessments using their own yardsticks. Without a standard set of criteria, comparisons of data repositories across journal ratings become problematic. Not only are common measures necessary, but a sense of fair assessment conducted by entities at arm’s length is desirable. For example, an assessment conducted by a publisher of its own data repository carries less weight than one performed by an independent party.

Rather than see Science strike out on its own to assess data repositories, I would prefer to have the journal work collaboratively with organisations already engaged in these activities. The Standards and Interoperability Committee of Research Data Canada has a report soon to be released that presents a set of criteria used to assess a number of data repositories. The Research Data Alliance has a working group on the audit and certification of data repositories built on a partnership between the World Data System and the Data Seal of Approval. In Germany, a catalogue of criteria for trusted digital repositories (nestor) has been developed through community involvement. Journal editors and publishers should work with these organisations when preparing a list of data repositories to recommend.

Plans for Data Management or Mobilisation?

I had a recent conversation with a researcher who had just been informed that her letter of intent for a major grant competition had been accepted and that she was invited to submit a full application. As she started this process, she organised a committee to provide her advice on a knowledge mobilisation plan. Such a plan is a requirement of the funding agency to which she is applying. One person she asked to serve on her committee inquired if she had also thought about preparing a data management plan (DMP). Because Canada’s three major federal funding councils have yet to institute DMP policies, she was unfamiliar with such plans and was referred to me to learn more about them.

When we met to discuss her research, I discovered that the project consisted of an international team with data from several countries. The data will be a mix of qualitative and quantitative information, some of which will be collected by the project and some of which will be obtained from national agencies. We reviewed the types of consent required from her research participants that would enable sharing these data with other researchers and preserving them for long-term access. We talked about the technical options for making the data safely accessible to the researchers on her team from other countries. We spoke about developing a data charter for the project that addresses governance issues around the use of the data by everyone within the project. As we went through these and other data topics, she paused and said, “I now understand that I will need a data mobilisation plan as well as a knowledge mobilisation plan.” Her observation made me think that the “M” in DMP should perhaps stand for mobilisation instead of management.

Data Mobilisation Plans

For most funding councils, the administrative purpose of data management plans is to learn the steps that researchers will apply to share the data from their projects. One way to frame this data-sharing goal is to address expectations around data stewardship. In taking this direction, I believe that mobilisation is a more appropriate concept than management.

  • First, management is about controlling or administering activities and resources, while mobilisation is about organising or preparing something for use (see remarks in the previous blog entry about organising versus managing). From the perspective of data stewardship, the planning steps for sharing data have more to do with organising the custodial care of data across the lifecycle than with controlling the details of data management.

    Mark Parsons, Secretary General for the Research Data Alliance, illustrated this point in a comment that he made during the CASRAI Reconnect 2014 Conference. He noted that when his staff at the National Snow and Ice Data Center at the University of Colorado helped researchers prepare DMPs for U.S. funding agencies, the researchers had difficulty describing in advance the details around how they would manage their data. Such decisions often come later in the project and depend on the technology available at that time. My response was that assuming DMPs to be statements about the nuts and bolts of managing data misses the policy intent of the plan, which is to elicit how the data will be shared. Instead of small details, these plans should be about the strategies that researchers will follow throughout their project around managing their data.

    Preparing strategies in a DMP should draw upon the data stewards with whom solutions might be formulated. For example, if a DMP asks a researcher to identify the data repository with which she or he will deposit the project’s data, one answer might be to discuss with a liaison librarian the identity of an appropriate domain or campus data repository. Another strategy might be to contact a curator from a data repository about her or his involvement in the project from its beginning. A DMP consisting of strategies for finding solutions that can be implemented during a project directs the researcher’s focus toward mobilising data stewards and services to deal with data management requirements as they arise.

  • Second, knowledge mobilisation plans are a funding agency requirement already known to many Canadian researchers, although some agencies may identify them as knowledge transfer or translation plans. Researchers see the value of these plans, which chart the dissemination activities of research findings. The rewards of having such plans are well understood by researchers. These statements identify pathways to influence other researchers, policy makers, and practitioners that will increase the likelihood of a larger readership of the researcher’s findings and potentially more citations of the researcher’s work. These valued outcomes translate into increased prestige and greater promotion opportunities.

    Data mobilisation plans may benefit from the widely recognised value already attributed to knowledge mobilisation plans. We may soon see rewards structured around data sharing, especially if data citation takes root and the linkage between data and research articles becomes universally adopted through the use of persistent digital identifiers. The more incentives are associated with data sharing, the more data mobilisation plans will be linked to researcher rewards.

  • Third, one should not lose sight of the role that a DMP plays as an administrative tool to promote research practices supportive of an organisation’s data policy. This connection between data policy and a DMP is fundamental to its function. Whether the data policy is directed at data stewardship, data sharing, reproducible research, or a combination of these, the DMP should elicit responses that are expressive of the policy’s values. The level of abstraction called for in this context is more directed at organising than managing things. As a policy instrument, the goal of DMPs keeps our attention more centred on mobilising than managing resources.

Data: a rose by any other name (part 2)

In an earlier blog entry, I spoke about the importance of having a technical language that allows data curators to talk within their profession about the details of their work. The words they use may be part of society’s everyday vocabulary but carry a meaning specific to data curation. Confusion can arise during conversations between data curators and others outside the profession when a term is used that carries different meanings for each group. For example, I was in a meeting recently with people from a variety of technical backgrounds, including librarians and research administrators. One librarian spoke about sharing resources across libraries. For the librarian, resources meant information tools, such as library guides, while the administrator assumed that resources referred to money. The administrator was confused about why libraries would be exchanging money.

Communication problems can also arise within a campus’ research community. We encountered this with humanities researchers on our campus earlier in the year when our library hosted a week of workshops and talks on research data management. Speakers at this event consisted of researchers from all areas on the campus, including two prominent researchers from the digital humanities. One of the humanists said in reference to the title of the event, Research Data Management Week, that researchers in the humanities don’t see their research involving data. Rather, they see data as something belonging to the sciences. When the other humanist spoke, she commented on management in the event’s title, saying that in the humanities, management is seen as a topic for discussion in the business school. Of the four words in the event’s title, only research and week were acceptable concepts in the eyes of these humanists.

Subsequent to this event, a few of us in research data management services met with a humanities researcher who has a unique collection of digital video recordings of live musical performances from a Middle Eastern country. His immediate concern was about the survival of the digital content. In addition to his copy of these recordings, only one other person on the globe has a set. As we worked through the options for making secure copies of his research content, I realized that we were talking primarily about organising his research materials, which happen to be in digital format.

Those of us providing research data management services learned an important lesson from these encounters. When talking with researchers from the humanities, we need to talk about organising their digital research materials rather than managing their data. A meeting with the liaison librarians in the humanities library later confirmed this approach. As data curators, we will continue to talk about managing data with most of the researchers on our campus, but with humanists, we have a new way of talking with them that lowers communication barriers when discussing their digital research content.

Are Libraries Organized to Provide Research Data Management Services?

Where do research data management services fit into today’s organisational charts for academic libraries? I have had this discussion several times in the past couple of months with librarians from different institutions. Each of these conversations has been independent of the other, suggesting that this is becoming a topic of interest as academic libraries move to offer research data management services. To address this question, it is helpful to start with the organisation of data services already being offered in libraries and then to consider how these service areas might work together.

Over the past twenty-five years, many North American academic libraries established outstanding services for students and researchers to help them access data produced by organisations or agencies outside their institution. The staff of these services often assist with locating data, interpreting data documentation, retrieving data files, and providing the data in a format that can be directly loaded into analytic software. Data distributors typically require a licence to be signed before granting use of their data on a campus. Therefore, these services also manage data licences, educate patrons about the terms around which data may be used, and monitor these activities.

Organisationally, these data services have been located, for the most part, in a subject library associated with a particular data type. For example, social survey microdata are in the social sciences library while company and market data are in the business library. Those using these secondary data resources are often regular patrons of the subject library in which the service is located. Familiarity with the subject library has proven to be important to these services because elsewhere on the campus, awareness of their existence tends to be low. These services struggle to increase their visibility on campus and to promote the value of their service to a larger user community.

With the emergence of research data management as a service area, one obvious question is whether it should simply be amalgamated with existing data services. After all, a common skill set around providing access to data is shared by both service areas. On the other hand, existing data services are only part of the wider mandate of research data management services, which covers all stages of the research data lifecycle and applies to all research on campus. With such a widespread mandate, should a new vertical service division be created in the library to house both research data management and existing data services? Or should existing data services remain in their current organisational location but be coordinated in conjunction with a larger research data management service?

While there are undoubtedly many successful ways of organising research data management services in a library, the following list raises some important considerations about the location of these services.

Research Data Management Services and Organisational Factors

  1. Research data management involves horizontal activities that cut across the vertical organisational structure of today’s academic library.
    • Research data management touches on almost all operations of the library. Whether organised around facility, function, domain, or some combination of these, research data will span these organisational divisions.
    • Because of its ubiquitous nature, research data management needs to be part of the library’s mission and service culture.
    • The vast majority of librarians must embrace research data management as part of their responsibilities.
  2. Research data management requires personnel practices that will support flexible work assignments for both horizontal and vertical activities.
    • The mandate for research data management is large and draws upon a range of skills and knowledge. The professionals with these talents are spread across the vertical divisions of the library, making it necessary to call upon staff from the whole organisation.
    • To work horizontally in a vertical structure, flexible work assignments must be accommodated by the system.
    • To operate within a vertical reporting structure, management methods are needed to pool staff from across the library. One method that has proven successful is the use of teams that are formed on the basis of a charter defining a fixed set of objectives. Once the team completes its work, the team is disbanded.
  3. Research data management must be intentionally coordinated across the vertical organisational structure of the library.
    • Research data management requires a full-time coordinator who has been granted authority to organise this service’s activities across the vertical divisions of the library.
    • The coordinator position needs to be high enough on the organisational structure to work effectively with fellow managers.
    • The coordinator should be supported as an ambassador for research data management on campus.
  4. Through coordinated supervision, the functions supporting research data management service can be distributed across the library system.
    • An existing data service with an identity well established within a specific subject library should be allowed to stay in its location. The staff will likely be called to participate on team projects but this in itself does not require an organisational relocation.
    • Liaison and subject librarians need to incorporate research data management materials into the portfolio of resources that they maintain for students and researchers.
    • When drawing upon library system resources and services, research data management services must be given the priority attention they need to ensure the delivery of systemwide support.

DMP Rights & Responsibilities

A consultation on Open Data was conducted in the United Kingdom in February 2012, providing valuable insight into governing principles for open data.  In particular, a series of rights and responsibilities regarding researchers, public and private funders, and the public was identified in the study’s final report.  Emerging from this dialogue was a prominent policy role for data management plans (DMPs) to record agreements among stakeholders and to state clearly their rights and responsibilities associated with the data.  Viewing data management plans this way is closely associated with the position taken in the previous entry to this blog, The Value of Data Management Plans.  In this context, DMPs serve as a document of relationships and agreements.

Page 35 of this report contains a table summarizing rights and responsibilities among stakeholders.  Four items concerning DMPs stand out:

  1. Researchers have a responsibility to “develop data management plans;”
  2. Funders have a right to expect researchers to prepare and implement data management plans;
  3. Funders have a responsibility to “enforce and publish data management policies and practices,” including DMPs; and
  4. The public has a right to know about research data in the public interest, which can be partially achieved through publishing DMPs.

The discussion in this report addresses several ways in which DMPs interplay across stakeholders’ interests.  For example, a concern among some researchers about “vexatious requests for data [p. 38]” was seen as being mitigated through developing and publishing DMPs.  Furthermore, DMPs were seen as a method of communicating a timeframe for exclusive use of data by researchers prior to it being shared.  The expectation of funders to publish DMPs was seen as a transparency factor, keeping everyone informed of the agreements around the rights and responsibilities of a project’s data.

Other stakeholders can also be seen to have rights and responsibilities communicated in DMPs.  For example, a university has a right to know the cumulative demands that the data across all locally based projects place on its research data management infrastructure, including data curation services, storage, network capacity, and computational power.  On the flip side, a campus has the responsibility to support data management infrastructure that will facilitate high quality research, something to be gleaned from its researchers’ DMPs.

As Canadian institutions look to introduce DMPs as a policy tool, a wider discussion should take into account the relationships to be expressed in such plans.  We should expect to get full value out of this tool.

The Value of Data Management Plans

A big news item coming out of the Digital Infrastructure Summit held in Ottawa on January 28-29, 2014 was the announcement that Canada’s federal research councils will introduce policy changes over the next 24 months that will require applicants to include data management plans in their funding proposals. This announcement came quickly on the heels of a Fall 2013 consultation conducted by these same councils on Capitalizing on Big Data. Within the background material prepared for this study, these councils were challenged to adopt “agency-based and focused data stewardship plans (p. 8)” of which data management plans (DMPs) were seen as integral.  The push toward this policy change will now likely face some opposition, although momentum currently seems to be with those promoting policies in support of a Canadian data stewardship culture.

Some research councils in other countries have already implemented DMPs. For example, a guideline among the data principles of Research Councils UK (RCUK) specifically encourages its members to develop data management plans:

Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long-term value should be preserved and remain accessible and usable for future research.

These principles are provided as an umbrella framework; each of the seven research councils of RCUK is independently responsible for its own data policies.  For example, the Economic and Social Research Council (ESRC) describes its reasons for requiring data management plans as:

We believe that a structured approach to data management results in better quality data that is ready to deposit for further sharing.

This single sentence is very revealing about the expected returns on DMPs.  To begin, a DMP is seen to contribute structure to the handling of data within a project.  An outcome of this approach is believed to be higher quality data.  Furthermore, the data will be better prepared for deposit with an organization that will make the data available for others.

On the surface, data management plans appear to be a very straightforward policy tool. They simply lengthen current funding applications by another page or two. However, the purposes they fulfill and the processes they embody will enrich the production and custodial care of research data.  The ESRC anticipation of higher quality data for sharing also implies collaboration with data curation services and with data repositories.  Ultimately, a DMP should engage researchers in conversations with those providing such services.  In this context, a DMP becomes a document of relationships that should be shared, edited, and monitored among those contributing to a project.  From this viewpoint, a DMP functions as a dynamic document of agreements.

To serve the multiple purposes just described, DMPs should be designed for easy digital exchange across a variety of applications.  The best way to approach this in today’s complex world of information technology is through a metadata standard describing a data model of elements constituting a DMP.  CASRAI, a community-based standards body for research administrative information, is well positioned to do this.  In fact, the U.K. chapter of CASRAI has already begun work on a set of elements for a DMP data model.  In conjunction with this, it would be helpful if the Standards and Interoperability Committee of Research Data Canada would develop a fundamental flowchart representing the interplay of purposes, uses, and relationships expressed in a DMP.  This would be both informative for the CASRAI working group developing specifications for DMPs and helpful in validating the completeness of a DMP data model.
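
As a rough illustration of what a digitally exchangeable DMP could look like, the sketch below models a plan as a handful of elements and round-trips it through JSON. The element names used here (project_title, repository, exclusive_use_months, stewards, and so on) are hypothetical placeholders chosen for illustration; they are not the CASRAI element set, and the example project is invented.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Steward:
    """A person or service sharing responsibility for the data (hypothetical element)."""
    name: str
    role: str  # e.g. "liaison librarian" or "data curation"


@dataclass
class DataManagementPlan:
    """Illustrative DMP record; field names are assumptions, not a published standard."""
    project_title: str
    principal_investigator: str
    funder: str
    repository: Optional[str] = None   # where the data will be deposited, once agreed
    exclusive_use_months: int = 0      # period of exclusive use before the data are shared
    stewards: List[Steward] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the plan so that another application can read it."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "DataManagementPlan":
        """Rebuild a plan from its JSON form."""
        raw = json.loads(text)
        raw["stewards"] = [Steward(**s) for s in raw.get("stewards", [])]
        return cls(**raw)

    def missing_elements(self) -> List[str]:
        """Name the elements still to be agreed among those contributing to the project."""
        missing = []
        if not self.repository:
            missing.append("repository")
        if not self.stewards:
            missing.append("stewards")
        return missing


# An invented plan: a repository has not yet been chosen.
plan = DataManagementPlan(
    project_title="Example project",
    principal_investigator="A. Researcher",
    funder="Example funding council",
    exclusive_use_months=12,
)
plan.stewards.append(Steward(name="Campus data service", role="data curation"))

# Exchange the plan as JSON and read it back, as a second application might.
exchanged = DataManagementPlan.from_json(plan.to_json())
```

Holding the plan as structured elements rather than free text is what would allow a funder, a repository, and a campus service each to read, edit, and monitor the same document. A real data model would need a much richer set of elements agreed by the community.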

Data: a rose by any other name

Specializations in research data management are quickly multiplying. Alma Swan identified data creators or authors, data scientists, data managers, and data librarians in her 2008 report, The Skills, Roles and Career Structure of Data Scientists and Curators: an assessment of current practice and future needs. New experts in data tools, data infrastructures, data sciences, and data management were identified one year later in the report of the U.S. Interagency Working Group on Digital Data, Harnessing the Power of Digital Data for Science and Society.  This fast growth in new professional positions has taken place concurrently with an equally rapidly developing technical language to help data experts communicate among themselves and with the wider communities they serve. Unfortunately, as this professional vocabulary continues to evolve quickly, it can often confuse the very communities that data professionals are seeking to help.

An example of the changing nature of this technical language is the shift from archiving to preserving.  Those of us who today are specialists in research data management talk about preserving research data. Two decades ago, we spoke about data archiving. Moving from archiving to preserving came about in the late 1990s when digital preservation established itself as a field encompassing digital content. This change of terms carried an explicit identification of objects that are digital rather than analogue. Preservation became associated with digital content and archiving was largely left with analogue material.

Preserving or archiving. Does this distinction warrant a change in terminology? It can. Our use of common language for technical purposes can cause confusion. Sometimes it is more appropriate to adopt a lesser used term to introduce a new technical usage. For example, my initial reaction to the use of metadata in the early days of the Web was that we didn’t need a replacement term for cataloguing information. Subsequently, however, I saw the value of this new term. Metadata covers a greater range of descriptive information than a catalogue record and in the digital context, metadata can be actionable, driving automated processes. Metadata as a concept added new meaning and functionality. This term has even become a household word with Edward Snowden’s revelations about the U.S. National Security Agency’s use of metadata to identify telecommunications for snooping.

Think about how vacuous the term data has become. Everything that is digital is now called data. We have data plans with our telecom providers and Wi-Fi digital cameras that store data in the cloud. There are digital collections of texts, images, sound, and video in our libraries. All of this content is also called data. How do we distinguish research data from everything else that is digital?  I prefer to think of research data as information structured by methodology and organized in digital products that are used as evidence in the research process.  This leaves all other digital content that is not research data with the potential of becoming research data.  From this perspective, research data are a special class of digital data, allowing us to talk about the technical activities of research data management without confusing them with everything else that is digital.

Part of being a professional in research data management is ensuring that the concepts we use in a technical context are consistent. This brings me to a story that took place recently between a senior scientist with a federal government department and members of the Canadian Polar Data Network, who were providing advice on research data management infrastructure. The scientist’s training was in biology and whenever we spoke about data preservation, he would smile. Finally, he said, “When you say preserving data, I think of my mother’s preserves … canning data in jars. Wouldn’t it be more appropriate to talk about conserving data?” The question surprised us because we did not have a context for research data conservation. For biologists, the act of conserving is one of protecting something or restoring it to an earlier state. While data preservation does involve processes to protect digital content as well as the contextual and technical information describing this content, the intent is to maintain the materials in their original digital state indefinitely. Activities involved in repairing research data or its supporting metadata may be part of the curation of the data prior to processing the content for preservation. However, the act of preserving research data is one of keeping the digital content in its pristine state.

This example illustrates the communication challenge that can occur across domains when concepts differ. While biologists may more readily use conservation than preservation, we need to stay within the context of our data profession. In describing digital preservation practices within research data management, we need to convey a technical meaning that applies to activities supporting the long-term access to research data. Today, this happens to be data preservation.