The Data Repository Community Needs To Take Certification Seriously

An editorial in the January 2, 2015 issue of Science notes that the journal will be identifying data repositories to promote among its authors and readers this year. The selection criteria mentioned include repositories that “are well managed, have long-term support, and are responsive to community needs.”

Up to this point, many librarians have been concerned about publishers establishing their own data repositories and then charging for access to these collections of research data. This recent Science editorial presents a new potential concern regarding publisher engagement in research data management. Might a publisher’s endorsement of a data repository be construed as certification for that repository? Will publishers end up setting a de facto bar for trusted status of data repositories? What will be the implications for domain or local data repositories outside Science’s scope? These are topics needing the attention of the data repository community now.

I see the value of journals being able to recommend data repositories to authors who might be unaware of the choices available to them. However, in the absence of a widely supported and independent certification process, the data repository community runs the risk of journals conducting assessments using their own yardsticks. Without a standard set of criteria, comparisons of data repositories across journal ratings become problematic. Not only are common measures necessary, but a sense of fair assessment conducted by entities at arms-length is desirable. For example, an assessment conducted by a publisher of its own data repository has less face value than one performed by an independent party.

Rather than see Science strike out on its own to assess data repositories, I would prefer to see it work collaboratively with organisations already engaged in these activities. The Standards and Interoperability Committee of Research Data Canada will soon release a report that presents a set of criteria used to assess a number of data repositories. The Research Data Alliance has a working group on the audit and certification of data repositories, built on a partnership between the World Data System and the Data Seal of Approval. In Germany, a catalogue of criteria for trusted digital repositories (nestor) has been developed through community involvement. Journal editors and publishers should work with these organisations when preparing a list of data repositories to recommend.

Plans for Data Management or Mobilisation?

I had a recent conversation with a researcher who had just been informed that her letter of intent for a major grant competition had been accepted and that she was invited to submit a full application. As she started this process, she organised a committee to advise her on a knowledge mobilisation plan. Such a plan is a requirement of the funding agency to which she is applying. One person she asked to serve on her committee inquired whether she had also thought about preparing a data management plan (DMP). Because Canada’s three major federal funding councils have yet to institute DMP policies, she was unfamiliar with such plans and was referred to me to learn more about them.

When we met to discuss her research, I discovered that the project consisted of an international team with data from several countries. The data will be a mix of qualitative and quantitative information, some of which will be collected by the project and some of which will be obtained from national agencies. We reviewed the types of consent required from her research participants that would enable sharing these data with other researchers and preserving them for long-term access. We talked about the technical options for making the data safely accessible to the researchers on her team from other countries. We spoke about developing a data charter for the project that addresses governance issues around the use of the data by everyone within the project. As we went through these and other data topics, she paused and said, “I now understand that I will need a data mobilisation plan as well as a knowledge mobilisation plan.” Her observation struck me: perhaps the “M” in DMP should be mobilisation instead of management.

Data Mobilisation Plans

For most funding councils, the administrative purpose of data management plans is to learn what steps researchers will take to share the data from their projects. One way to frame this data-sharing goal is to address expectations around data stewardship. In taking this direction, I believe that mobilisation is a more appropriate concept than management.

  • First, management is about controlling or administering activities and resources, while mobilisation is about organising or preparing something for use (see remarks in the previous blog entry about organising versus managing). From the perspective of data stewardship, the planning steps for sharing data have more to do with organising the custodial care of data across the lifecycle than with controlling the details of data management.

    Mark Parsons, Secretary General of the Research Data Alliance, illustrated this point in a comment that he made during the CASRAI Reconnect 2014 Conference. He noted that when his staff at the National Snow and Ice Data Centre at the University of Colorado helped researchers prepare DMPs for U.S. funding agencies, the researchers had difficulty describing in advance the details of how they would manage their data. Such decisions often come later in the project and depend on the technology available at that time. My response was that treating DMPs as statements about the nuts and bolts of managing data misses the policy intent of the plan, which is to elicit how the data will be shared. Instead of small details, these plans should be about the strategies that researchers will follow throughout their project for managing their data.

    Preparing strategies for a DMP should involve the data stewards with whom solutions can be formulated. For example, if a DMP asks a researcher to identify the data repository in which she or he will deposit the project’s data, one strategy might be to discuss with a liaison librarian which domain or campus data repository would be appropriate. Another might be to involve a curator from a data repository in the project from its beginning. A DMP consisting of strategies for finding solutions that can be implemented during a project directs the researcher’s focus toward mobilising data stewards and services to deal with data management requirements as they arise.

  • Second, knowledge mobilisation plans are a funding agency requirement already known to many Canadian researchers, although some agencies may identify them as knowledge transfer or translation plans. Researchers see the value of these plans, which chart the dissemination of research findings, and the rewards of having them are well understood: they identify pathways to influence other researchers, policy makers, and practitioners, increasing the likelihood of a larger readership for the researcher’s findings and potentially more citations of the researcher’s work. These valued outcomes translate into increased prestige and greater promotion opportunities.

    Data mobilisation plans may benefit from the widely recognised value already attributed to knowledge mobilisation plans. We may soon see rewards structured around data sharing, especially if data citation takes root and the linkage between data and research articles becomes universally adopted through the use of persistent digital identifiers. The more incentives are associated with data sharing, the more data mobilisation plans will be linked to researcher rewards.
  • Third, one should not lose sight of the role that a DMP plays as an administrative tool to promote research practices supportive of an organisation’s data policy. This connection between data policy and a DMP is fundamental to its function. Whether the data policy is directed at data stewardship, data sharing, reproducible research, or a combination of these, the DMP should elicit responses that are expressive of the policy’s values. The level of abstraction called for in this context is more directed at organising than managing things. As a policy instrument, the goal of DMPs keeps our attention more centred on mobilising than managing resources.

Data: a rose by any other name (part 2)

In an earlier blog entry, I spoke about the importance of having a technical language that allows data curators to talk within their profession about the details of their work. The words they use may be part of society’s everyday vocabulary but carry a meaning specific to data curation. Confusion can arise during conversations between data curators and others outside the profession when a term is used that carries different meanings for each group. For example, I was in a meeting recently with people from a variety of technical backgrounds, including librarians and research administrators. One librarian spoke about sharing resources across libraries. For the librarian, resources meant information tools, such as library guides, while the administrator assumed that resources referred to money. The administrator was confused about why libraries would be exchanging money.

Communication problems can also arise within a campus’ research community. We encountered this with humanities researchers on our campus earlier in the year when our library hosted a week of workshops and talks on research data management. Speakers at this event consisted of researchers from all areas on the campus, including two prominent researchers from the digital humanities. One of the humanists said in reference to the title of the event, Research Data Management Week, that researchers in the humanities don’t see their research involving data. Rather, they see data as something belonging to the sciences. When the other humanist spoke, she commented on management in the event’s title, saying that in the humanities, management is seen as a topic for discussion in the business school. Of the four words in the event’s title, only research and week were acceptable concepts in the eyes of these humanists.

Subsequent to this event, a few of us in research data management services met with a humanities researcher who has a unique collection of digital video recordings of live musical performances from a Middle Eastern country. His immediate concern was the survival of this digital content. In addition to his copy of these recordings, only one other person on the globe has a set. As we worked through the options for making secure copies of his research content, I realized that we were talking primarily about organising his research materials, which happen to be in digital format.

Those of us providing research data management services learned an important lesson from these encounters. When talking with researchers from the humanities, we need to talk about organising their digital research materials rather than managing their data. A meeting with the liaison librarians in the humanities library later confirmed this approach. As data curators, we will continue to talk about managing data with most of the researchers on our campus, but with humanists we now have a way of talking that lowers communication barriers when discussing their digital research content.

Are Libraries Organized to Provide Research Data Management Services?

Where do research data management services fit into today’s organisational charts for academic libraries? I have had this discussion several times in the past couple of months with librarians from different institutions. Each of these conversations has been independent of the others, suggesting that this is becoming a topic of interest as academic libraries move to offer research data management services. To address this question, it is helpful to start with the organisation of data services already being offered in libraries and then to consider how these service areas might work together.

Over the past twenty-five years, many North American academic libraries established outstanding services for students and researchers to help them access data produced by organisations or agencies outside their institution. The staff of these services often assist with locating data, interpreting data documentation, retrieving data files, and providing the data in a format that can be directly loaded into analytic software. Data distributors typically require a licence to be signed before granting use of their data on a campus. Therefore, these services also manage data licences, educate patrons about the terms around which data may be used, and monitor these activities.

Organisationally, these data services have been located, for the most part, in a subject library associated with a particular data type, for example, social survey microdata are in the social sciences library while company and market data are in the business library. Those using these secondary data resources are often regular patrons of the subject library in which the service is located. Familiarity with the subject library has proven to be important to these services because elsewhere on the campus, awareness of their existence tends to be low. These services struggle to increase their visibility on campus and to promote the value of their service to a larger user community.

With the emergence of research data management as a service area, one obvious question is whether it should simply be amalgamated with existing data services. After all, a common skill set around providing access to data is shared by both service areas. On the other hand, existing data services are only part of the wider mandate of research data management services, which covers all stages of the research data lifecycle and applies to all research on campus. With such a widespread mandate, should a new vertical service division be created in the library to house both research data management and existing data services? Or should existing data services remain in their current organisational location but be coordinated in conjunction with a larger research data management service?

While there are undoubtedly many successful ways of organising research data management services in a library, the following list raises some important considerations about the location of these services.

Research Data Management Services and Organisational Factors

  1. Research data management involves horizontal activities that cut across the vertical organisational structure of today’s academic library.
    • Research data management touches on almost all operations of the library. Whether organised around facility, function, domain, or some combination of these, research data will span these organisational divisions.
    • Because of its ubiquitous nature, research data management needs to be part of the library’s mission and service culture.
    • The vast majority of librarians must embrace research data management as part of their responsibilities.
  2. Research data management requires personnel practices that will support flexible work assignments for both horizontal and vertical activities.
    • The mandate for research data management is large and draws upon a range of skills and knowledge. The professionals with these talents are spread across the vertical divisions of the library, making it necessary to call upon staff from across the whole organisation.
    • To work horizontally in a vertical structure, flexible work assignments must be accommodated by the system.
    • To operate within a vertical reporting structure, management methods are needed to pool staff from across the library. One method that has proven successful is the use of teams that are formed on the basis of a charter defining a fixed set of objectives. Once the team completes its work, the team is disbanded.
  3. Research data management must be intentionally coordinated across the vertical organisational structure of the library.
    • Research data management requires a full-time coordinator who has been granted authority to organise this service’s activities across the vertical divisions of the library.
    • The coordinator position needs to be high enough on the organisational structure to work effectively with fellow managers.
    • The coordinator should be supported as an ambassador for research data management on campus.
  4. Through coordinated supervision, the functions supporting research data management service can be distributed across the library system.
    • An existing data service with an identity well established within a specific subject library should be allowed to stay in its location. The staff will likely be called upon to participate in team projects, but this in itself does not require an organisational relocation.
    • Liaison and subject librarians need to incorporate research data management materials into the portfolio of resources that they maintain for students and researchers.
    • When drawing upon library system resources and services, the research data management service must be given the priority attention it needs to ensure the delivery of systemwide support for its services.

DMP Rights & Responsibilities

A consultation on Open Data was conducted in the United Kingdom in February 2012, providing valuable insight into governing principles for open data. In particular, a series of rights and responsibilities for researchers, public and private funders, and the public was identified in the study’s final report. Emerging from this dialogue was a prominent policy role for data management plans (DMPs): recording agreements among stakeholders and stating clearly their rights and responsibilities associated with the data. Viewing data management plans this way is closely related to the position taken in the previous entry to this blog, The Value of Data Management Plans. In this context, DMPs serve as a document of relationships and agreements.

Page 35 of this report contains a table summarizing rights and responsibilities among stakeholders. Four points about DMPs stand out:

  1. Researchers have a responsibility to “develop data management plans;”
  2. Funders have a right to expect researchers to prepare and implement data management plans;
  3. Funders have a responsibility to “enforce and publish data management policies and practices,” including DMPs; and
  4. The public has a right to know about research data in the public interest, which can be partially achieved through publishing DMPs.

The discussion in this report addresses several ways in which DMPs interact with stakeholders’ interests. For example, a concern among some researchers about “vexatious requests for data [p. 38]” was seen as being mitigated through developing and publishing DMPs. Furthermore, DMPs were seen as a method of communicating the timeframe for researchers’ exclusive use of data before the data are shared. The expectation that funders publish DMPs was seen as a transparency measure, keeping everyone informed of the agreements around the rights and responsibilities for a project’s data.

Other stakeholders can also be seen to have rights and responsibilities communicated in DMPs. For example, a university has a right to know the cumulative demands that the data from all locally based projects place on its research data management infrastructure and other campus resources, including data curation services, storage, network capacity, and computational power. On the flip side, a campus has the responsibility to support data management infrastructure that will facilitate high quality research, something to be gleaned from its researchers’ DMPs.

As Canadian institutions look to introduce DMPs as a policy tool, a wider discussion should take into account the relationships to be expressed in such plans.  We should expect to get full value out of this tool.

The Value of Data Management Plans

A big news item coming out of the Digital Infrastructure Summit held in Ottawa on January 28-29, 2014 was the announcement that Canada’s federal research councils will introduce policy changes over the next 24 months that will require applicants to include data management plans in their funding proposals. This announcement came quickly on the heels of a Fall 2013 consultation conducted by these same councils on Capitalizing on Big Data. Within the background material prepared for this study, these councils were challenged to adopt “agency-based and focused data stewardship plans (p. 8)” of which data management plans (DMPs) were seen as integral.  The push toward this policy change will now likely face some opposition, although momentum currently seems to be with those promoting policies in support of a Canadian data stewardship culture.

Some research councils in other countries have already implemented DMPs. For example, a guideline among the data principles of the Research Councils of the United Kingdom (RCUK) specifically encourages its members to develop data management plans:

Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long-term value should be preserved and remain accessible and usable for future research.

These data principles serve as an umbrella framework; each of the seven research councils within RCUK is independently responsible for its own data policies. For example, the Economic and Social Research Council (ESRC) describes its reason for requiring data management plans as follows:

We believe that a structured approach to data management results in better quality data that is ready to deposit for further sharing.

This single sentence is very revealing about the expected returns on DMPs.  To begin, a DMP is seen to contribute structure to the handling of data within a project.  An outcome of this approach is believed to be higher quality data.  Furthermore, the data will be better prepared for deposit with an organization that will make the data available for others.

On the surface, data management plans appear to be a very straightforward policy tool. They simply lengthen current funding applications by another page or two. However, the purposes they fulfill and the processes they embody will enrich the production and custodial care of research data. The ESRC’s anticipation of higher quality data for sharing also implies collaboration with data curation services and with data repositories. Ultimately, a DMP should engage researchers in conversations with those providing such services. In this context, a DMP becomes a document of relationships that should be shared, edited, and monitored among those contributing to a project. From this viewpoint, a DMP functions as a dynamic document of agreements.

To serve the multiple purposes just described, DMPs should be designed for easy digital exchange across a variety of applications. The best way to approach this in today’s complex world of information technology is through a metadata standard describing a data model of the elements constituting a DMP. CASRAI, a community-based standards body for research administrative information, is well positioned to do this. In fact, the U.K. chapter of CASRAI has already begun work on a set of elements for a DMP data model. In conjunction with this, it would be helpful if the Standards and Interoperability Committee of Research Data Canada developed a basic flowchart representing the interplay of purposes, uses, and relationships expressed in a DMP. This would be informative for the CASRAI working group developing specifications for DMPs and helpful in validating the completeness of a DMP data model.
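
To make the idea of a machine-exchangeable DMP more concrete, the sketch below shows one way a DMP data model might be expressed, using Python dataclasses that can be serialised to JSON for exchange between applications. The element names are illustrative assumptions for this post, not the CASRAI element set.

    from dataclasses import dataclass, field, asdict
    from typing import List, Optional
    import json

    @dataclass
    class DatasetDescription:
        """One dataset the project expects to produce or reuse."""
        title: str
        data_type: str           # e.g. "survey microdata", "interview transcripts"
        formats: List[str]       # file formats planned for deposit
        estimated_volume: str    # free-text estimate, e.g. "50 GB"

    @dataclass
    class SharingPlan:
        """How and when the data will be made available to others."""
        repository: Optional[str]     # intended data repository, if known
        access_conditions: str        # e.g. "open", "restricted to approved researchers"
        embargo_until: Optional[str]  # ISO date ending any period of exclusive use

    @dataclass
    class DataManagementPlan:
        """Illustrative top-level DMP record for exchange between systems."""
        project_title: str
        funder: str
        principal_investigator: str
        datasets: List[DatasetDescription] = field(default_factory=list)
        sharing: Optional[SharingPlan] = None
        responsibilities: dict = field(default_factory=dict)  # role -> person or unit

    # A DMP instance can be serialised and passed between applications.
    dmp = DataManagementPlan(
        project_title="Cross-national study of household well-being",
        funder="Example Research Council",
        principal_investigator="J. Researcher",
        datasets=[DatasetDescription("Wave 1 survey", "survey microdata", ["CSV", "SPSS"], "2 GB")],
        sharing=SharingPlan(repository="Campus data repository",
                            access_conditions="restricted to approved researchers",
                            embargo_until="2016-12-31"),
        responsibilities={"deposit": "liaison librarian", "metadata": "project data manager"},
    )
    print(json.dumps(asdict(dmp), indent=2))

A shared model along these lines is what would allow DMP tools, funder systems, and repositories to pass plans back and forth without re-entering them by hand.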

Data: a rose by any other name

Specializations in research data management are quickly multiplying. Alma Swan identified data creators or authors, data scientists, data managers, and data librarians in her 2008 report, The Skills, Roles and Career Structure of Data Scientists and Curators: an assessment of current practice and future needs. New experts in data tools, data infrastructures, data sciences, and data management were identified one year later in the report of the U.S. Interagency Working Group on Digital Data, Harnessing the Power of Digital Data for Science and Society. This fast growth in new professional positions has taken place concurrently with an equally rapidly developing technical language that helps data experts communicate among themselves and with the wider communities they serve. Unfortunately, as this professional vocabulary continues to evolve quickly, it can often confuse the very communities that data professionals are seeking to help.

An example of the changing nature of this technical language is the shift from archiving to preserving.  Those of us who today are specialists in research data management talk about preserving research data. Two decades ago, we spoke about data archiving. Moving from archiving to preserving came about in the late 1990s when digital preservation established itself as a field encompassing digital content. This change of terms carried an explicit identification of objects that are digital rather than analogue. Preservation became associated with digital content and archiving was largely left with analogue material.

Preserving or archiving: does this distinction warrant a change in terminology? It can. Our use of common language for technical purposes can cause confusion, and sometimes it is more appropriate to adopt a lesser-used term to introduce a new technical usage. For example, my initial reaction to the use of metadata in the early days of the Web was that we didn’t need a replacement term for cataloguing information. Subsequently, however, I saw the value of this new term. Metadata covers a greater range of descriptive information than a catalogue record, and in the digital context metadata can be actionable, driving automated processes. Metadata as a concept added new meaning and functionality. The term has even become a household word with Edward Snowden’s revelations about the U.S. National Security Agency’s use of metadata to identify telecommunications for snooping.
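
As a small illustration of what “actionable” metadata can mean, the sketch below uses a hypothetical metadata record to drive an automated preservation decision. The field names and format rules are assumptions invented for the example, not drawn from any particular metadata standard.

    # A hypothetical metadata record for a deposited file.
    record = {
        "identifier": "doi:10.0000/example-dataset",  # illustrative identifier only
        "format": "application/vnd.ms-excel",
        "checksum_sha256": "(recorded at ingest)",
        "access": "open",
    }

    # Formats this imaginary repository prefers for long-term preservation.
    PREFERRED_FORMATS = {"text/csv", "application/pdf", "image/tiff"}

    def preservation_action(metadata: dict) -> str:
        """Decide what an automated pipeline should do with a file,
        based solely on what its metadata says about the format."""
        if metadata["format"] in PREFERRED_FORMATS:
            return "ingest as-is and schedule periodic fixity checks"
        return "queue for format migration before ingest"

    print(preservation_action(record))  # -> "queue for format migration before ingest"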

Think about how vacuous the term data has become. Everything that is digital is now called data. We have data plans with our telecom providers and WIFI digital cameras that store data in the cloud. There are digital collections of texts, images, sound, and video in our libraries. All of this content is also called data. How do we distinguish research data from everything else that is digital? I prefer to think of research data as information structured by methodology and organized in digital products that are used as evidence in the research process. This leaves all other digital content that is not research data with the potential of becoming research data. From this perspective, research data are a special class of digital data, allowing us to talk about the technical activities of research data management without confusing research data with everything else that is digital.

Part of being a professional in research data management is ensuring that the concepts we use in a technical context are consistent. This brings me to a recent exchange between a senior scientist in a federal government department and members of the Canadian Polar Data Network, who were providing advice on research data management infrastructure. The scientist’s training was in biology, and whenever we spoke about data preservation, he would smile. Finally, he said, “When you say preserving data, I think of my mother’s preserves … canning data in jars. Wouldn’t it be more appropriate to talk about conserving data?” The question surprised us because we did not have a context for research data conservation. For biologists, the act of conserving is one of protecting something or restoring it to an earlier state. While data preservation does involve processes to protect digital content, as well as the contextual and technical information describing this content, the intent is to maintain the materials in their original digital state indefinitely. Activities involved in repairing research data or its supporting metadata may be part of curating the data prior to processing the content for preservation. The act of preserving research data, however, is one of keeping the digital content in its pristine state.

This example illustrates the communication challenge that can occur across domains when concepts differ. While biologists may more readily use conservation than preservation, we need to stay within the context of our data profession. In describing digital preservation practices within research data management, we need to convey a technical meaning that applies to activities supporting the long-term access to research data. Today, this happens to be data preservation.
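
A technical footnote to this terminology: one routine practice behind keeping digital content in its pristine state is fixity checking, that is, comparing a checksum recorded at ingest against one computed later. The sketch below is a generic illustration; the file path and recorded checksum are placeholders rather than part of any particular preservation system.

    import hashlib
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        """Compute the SHA-256 checksum of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify_fixity(path: Path, recorded_checksum: str) -> bool:
        """Return True if the file still matches the checksum recorded at ingest,
        i.e. the digital object remains in its original state."""
        return sha256_of(path) == recorded_checksum

    # Illustrative use: the recorded checksum would come from preservation metadata.
    # verify_fixity(Path("recordings/performance_01.wav"), recorded_checksum="ab12...")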

Who Are Canada’s Research Data Peers?

In October 2007, Kevin Schurer, who was then the Director of the U.K. Data Archive, made a presentation at the ICPSR Official Representatives Meeting at the University of Michigan about establishing a data world wide web. [Figure: Kevin’s World Data View] He used this graphic to illustrate the current status of social science data curation around the globe. Each country has been crudely scaled according to the level of its social science data services. He noted that the U.S. and U.K. are disproportionately larger in this projection than their actual physical size because of the large volume of social science data curated in these two countries. He went on to say that Canada in his map is much smaller than its actual size because “Canada can’t get its act together [regarding research data].” While this was a rather dismaying statement to have proclaimed about my home country at an international meeting, there are grounds for such a conclusion (see the introductory Blog entry for evidence). This observation about Canada raises two important questions:

  1. Who are Canada’s international peers in research data?
  2. How far behind is Canada in research data management infrastructure?

Canada’s International Data Peers

The Introduction to this Blog touched upon this topic.  Canadians typically view their international research peers as the United States, United Kingdom, Australia, and Germany.  In many fields of research and in some areas of research infrastructure, this is the case.  For example, CANARIE is a world-class research network that is comparable with Europe’s research network, GÉANT.  Contributing to the validity of this comparison is the level of top-down impetus both receive through government policy, programs, and funding for these networks.

Research Data Management Infrastructure (RDMI) in Canada, however, does not compare with the developments in data infrastructure in these four countries.  As mentioned previously, bottom-up actions by higher education institutions willing to collaborate with one another around cost-sharing initiatives are the driving force for RDMI in Canada, which by comparison is a very different environment.

Who then are Canada’s data peers? Looking at Schurer’s 2007 map, Canada appears to be grouped with the rest of the world outside the United States, Europe, and Australia. I had an opportunity to observe firsthand a few of Canada’s peers at a European Commission-sponsored workshop on “Global Research Data Infrastructures: The Big Data Challenges,” held in Brussels in October 2011. The objective of this workshop was to further the development of a 2020 roadmap for global research data infrastructure. There were representatives from Africa, Asia, Australia, Canada, Europe, South America, and the United States, each asked to speak about data infrastructure in their own country; I spoke about data infrastructure in Canada.

The presenters from Brazil and Taiwan spoke about having to build data infrastructure from the bottom-up without the top-down guidance or incentives common in the U.S., Europe, or Australia. I was struck by how similar data infrastructure development in Brazil and Taiwan is to that in Canada. Who are Canada’s data peers? Nations building their RDMI from the bottom-up.

How Far Behind Is Canada From the Frontrunners on the Planet?

Internationally, RDMI consists of a real patchwork of activities regardless of whether the development is top-down or bottom-up.  Looking at the various parts of the patchwork can provide different perspectives about where a country is positioned globally.  This patchwork has been characterized as a Digital Science Ecosystem in the Global Research Data Infrastructure 2020 Roadmap (GRDI2020).  Thinking of research data infrastructure as an ecosystem focuses attention on the complex relationships among important components of scientific research.  To understand these complex relationships in an environment of data-intensive, multidisciplinary research is as challenging as it is to comprehend the interdependency among species in a biological ecosystem.  The authors feel that the broader research environment is as much of a contributor to advances and transformations in scientific fields as technological progress (see p. 17).

[Figure: Digital Science Ecosystem] The GRDI2020 report describes the Digital Science Ecosystem as being composed of Digital Data Libraries, Digital Data Archives, Digital Research Libraries, and Communities of Research. The relationships among these four components make up the patchwork environment in which this report envisions future scientific research being conducted. From both a technical and an organizational standpoint, relationships in a digital ecosystem are established and maintained through interoperability mechanisms among these four components. An earlier entry to this Blog highlighted the importance of institutions in preserving research data. Three of the GRDI2020 components are based on institutions: digital data libraries, digital data archives, and digital research libraries. That earlier entry argued that these institutions do not have to be national, central services but can be distributed across existing institutions with a mandate to preserve research data. The success of such a distributed, inter-institutional preservation network will depend on its interoperability across the network and with the wider research environment.

This digital science ecosystem model can be used to assess the current state of research data infrastructure in a country. Putting aside the various challenges of top-down or bottom-up development, which aspects of the four components of the GRDI2020 ecosystem does a country have? Furthermore, what interoperability relationships have been established among these components? Looking specifically at Canada, a strong network of data libraries exists on campuses across the country because of the Data Liberation Initiative (DLI). Since 1996, academic libraries have provided data services to support the dissemination of standard data products from Statistics Canada. In addition to providing access to data, DLI also conducts annual regional training in Canada, constantly upgrading the skills of those who provide data services on their local campuses. Compared to Europe, Canada is much farther along in developing a network of data libraries that support local access to data. Canada also has a strong network of research libraries with large and growing digital collections, including repository services for research results. The Achilles heel for Canada is digital data archives. This is the ecosystem component in which Canada lags far behind the U.S., U.K., Australia, and Germany, although a few research libraries have begun work in this area that will hopefully start to close the gap. The Canadian Polar Data Network is an example of a new Canadian collaborative, inter-institutional, cross-sectoral, distributed data archive that serves as a model for other Canadian institutions to emulate.

With strategic top-down investment in data preservation services, Canada could have leapfrogged to be among the frontrunners in the digital science ecosystem.  In the absence of top-down development, research libraries working collaboratively with research communities must build from the bottom-up to establish data preservation services.  The engagement of senior administrators at Canadian universities in the development of research data infrastructure is critical to a bottom-up strategy.  There is a need for university policies that establish an institutional mandate to preserve research records and that identify institutional data stewardship responsibilities covering the research lifecycle.

Finally, taking on these tasks at the institutional level will help begin the conversation between universities and national funding agencies around the bigger question of who should be doing what regarding data.  Currently, both parties are at loggerheads on this topic.

[The views expressed in this Blog are my own and do not necessarily represent those of my institution.]

Research Data Management Infrastructure III

In earlier entries to this Blog, Research Data Management Infrastructure (RDMI) was defined as the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle.  The context for RDMI has already been discussed in terms of the research lifecycle and of the two additional components making up research infrastructure: Canada’s high speed research network and high performance computing services.  This essay will address the elements of data infrastructure and how they are organized.

In developing its Cyberinfrastructure program, the U.S. National Science Foundation funded a project to investigate how best to build successful infrastructure. [Figure: Cyberinfrastructure] Coming out of this study was the report Understanding Infrastructure. The authors establish early in their work the significant connection between social organization and the use of communication technology. Regarding cyberinfrastructure, they stress that it “is about more than just pipes and machines” (p. 5) and emphasize the importance of social and organizational factors in shaping solutions. They note that in developing cyberinfrastructure, solutions can be social, technical, or a combination of the two, and they see the distribution of solutions as central to building infrastructure. In a diagram by Millerand, solutions are portrayed as being distributed across two dimensions: technical-social and local-global.

[C]yberinfrastructure is the set of organizational practices, technical infrastructure and social norms that collectively provide for the smooth operation of scientific work at a distance. All three are objects of design and engineering; a cyberinfrastructure will fail if any one is ignored. Understanding Infrastructure (p. 6)

A Textbook Example

Earlier this year I experienced a textbook example of this conceptual model of infrastructure while visiting Bryn Mawr College, just as it was changing the way it provides campus wireless services to guests. When I arrived on campus, I was given a sheet of paper containing the name of the campus wireless service, an account ID and password to log into this service, and a set of instructions for different devices and operating systems. I was required to obtain a separate account for each device on which I wished to use the campus wireless service.

This approach to providing guests with wireless access to the campus network and the Internet falls under the social-local set of solutions in the above infrastructure model.  The procedures were organized around human intervention, i.e., having to find and speak with a person who could provide me with the information sheet, and around social norms requiring me to sign an agreement statement, confirming my acceptance of the rules for using their wireless.  The wireless technology, however, was typical industry-standard WIFI.

On the second day of my visit, a new wireless service was launched for guests on their campus: Eduroam. This is the international service that allows academic guests from member institutions of Eduroam to gain access to secure wireless networking while visiting another Eduroam site. Because my home institution is an Eduroam member and can authenticate my credentials through this service, I simply open my wireless device, go to the list of available wireless networks wherever I am, and if Eduroam is among them, I select it. Behind the scenes, the system allows the local Eduroam host to verify my credentials with my home institution and to provide me with selective network services on its campus. For example, if the library has a licence for a database that does not allow guest access, the local implementation of Eduroam can exclude that database from my guest access.

This service approach falls under the technology-global set of solutions. My credentials are validated through my home institution using technology, allowing me to connect to wireless services at a member Eduroam campus without having to go through another person or obtain temporary authentication credentials. Eduroam has easily provided me with guest access to wireless services in the United States, Germany, and Canada. Higher education institutions in over fifty-five nations now support Eduroam. It truly is a global solution for providing guest access to secure wireless networking.
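
For readers curious about the mechanics, Eduroam routes a visitor’s authentication request to her or his home institution based on the realm portion of the username (the part after the @). The sketch below is a deliberately simplified illustration of that routing idea; the real service relies on a hierarchy of RADIUS servers, and the realm and server names here are made up.

    # Hypothetical mapping from realm to a home institution's authentication server.
    HOME_SERVERS = {
        "home-university.ca": "radius.home-university.ca",
        "another-university.edu": "radius.another-university.edu",
    }

    def route_authentication(username: str) -> str:
        """Decide where a visited campus should forward an authentication request.
        The visited site does not judge the credentials itself; it relays the
        request toward the server responsible for the user's realm."""
        _, _, realm = username.partition("@")
        if realm in HOME_SERVERS:
            return "forward to " + HOME_SERVERS[realm]
        return "reject: unknown realm"

    print(route_authentication("visitor@home-university.ca"))
    # -> "forward to radius.home-university.ca"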

Cyberinfrastructure and RDMI

How does this particular Cyberinfrastructure (CI) model relate to Research Data Management Infrastructure? First, the CI model provides a conceptual framework for the definition of RDMI. The RDMI elements of technology, services, and expertise are part of CI, although they are not expressed in exactly the same terms. Applied to RDMI, organizational practices and social norms are aspects of the services supporting data management across the research lifecycle. Services embody organizational responses to data management. For example, offering researchers assistance with data management plans requires organizing resources to deliver such a service. Social norms and expectations are also expressed in services. A funding agency may require data management plans to get researchers to describe how they will share the data from their project, setting an expectation to share data. Thinking of services in the context of RDMI thus combines the CI characteristics of social norms and organization.

Expertise is another component shared by CI and RDMI. Data management activities span the research lifecycle and involve many different skills, drawing upon a variety of expertise. The demands for data management expertise depend on the scale of the research project. A small project may involve only a couple of people, who can manage with a general set of skills. A much larger project may require a team of experts, with each team member responsible for a specific specialization. Expertise is also aligned with responsibilities for data management activities, which were identified as aspects of data stewardship in a previous Blog discussion.

Place is significant in CI and RDMI. Research is increasingly conducted by collaborative, inter-institutional teams that span nations. High speed optical research networks are vital for researchers who work at a distance from one another. Whether they work together in real time or asynchronously in different places, the network allows them to organize their workflow so that each can contribute. Similarly, researchers may require access to high performance computing (HPC) but not be located at an HPC site. Over a research network, they can gain access to the computing resources they require. Distance also comes into play with RDMI. Data may be gathered in one location, processed at another site, analyzed at yet another place, and preserved in an institution separate from these other locations. Through a collaborative initiative such as the Canadian Polar Data Network, an institution may offer preservation services for research data that, behind the scenes, consist of a distributed dark archive shared among several institutions. The scope of some research data infrastructure requires global solutions. One example is the need for infrastructure that will overcome barriers to the free exchange of scientific data across national borders.

The implementations of RDMI will vary from institution to institution but the set of solutions will be distributed locally or globally across technology, services, and expertise.

The next Blog entry will focus on the question:  Who are Canada’s international peers in Research Data Management Infrastructure?

[The views expressed in this Blog are my own and do not necessarily represent those of my institution.]

Research Data Management Infrastructure II

In the previous entry, Research Data Management Infrastructure (RDMI) was defined as the mix of technology, services, and expertise organized locally or globally to support research data activities across the research lifecycle. The context for RDMI within the research lifecycle was described, and the importance of institutional-level engagement in data stewardship was emphasized. Finally, the position was taken that, without top-down design or resources, cross-institutional collaboration would enable Canada to build collectively the national RDMI that has so far eluded it. How does this context compare with the two other pillars of Canada’s research infrastructure?

Research Infrastructure: The Three Pillars

The Canadian University Council of Chief Information Officers (CUCCIO) hosted the Digital Infrastructure Summit in June 2012 in Saskatoon to address the unclear future of research infrastructure in Canada today.  Concerns have been expressed about the lack of a vision for research infrastructure in Canada and the need for more coordinated planning.  For example, the current business models for CANARIE, the coordinating agency for Canada’s high-speed optical research network, and for Compute Canada, the organization for high performance computing, operate on funding cycles that are less than optimal and on brinksmanship review processes that seem to threaten the very existence of this critical infrastructure.  Borrowing from the National Data Summit format, the CUCCIO Summit invited around sixty leaders in research infrastructure to discuss how best to approach these concerns.  Coming out of this forum was the establishment of a Leadership Council with a mission to articulate a vision for research infrastructure and to organize a follow-up summit.

Canada's Research Infrastructure PillarsWhile Canada does not have a formally recognized national organization for RDMI (Research Data Canada and CARL are working to fill part of this void), CUCCIO recognizes data infrastructure as one of three pillars constituting Canada’s research infrastructure, along with a high speed research network and high performance computing. There are some important differences between the formal support for these latter two infrastructure pillars and RDMI.  First, different forces drive these three infrastructure pillars.

  1. CANARIE provides top-down coordination and incentives, working with a group of Optical Regional Advanced Networks (ORANs) across the country.   The ORANs keep the operational delivery of the high speed network close to the researchers in their areas, while CANARIE works to weave the regional communication networks into a national research service.
  2. High Performance Computing (HPC) in Canada has a similar organizational structure of regional services (WestGrid, Compute Ontario, Calcul Quebec, Compute Atlantic) with national governance provided through Compute Canada, although the regional services tend to operate with a tradition of independence.  Nevertheless, HPC has received top-down incentives, including financial support through the Canada Foundation for Innovation.
  3. As already stated, RDMI does not have a formal national organization to represent its interests, although there are national coordinating roles for both Research Data Canada and CARL to play in data curation and infrastructure within their communities.  Unfortunately, no regional organizations for data infrastructure exist.

While RDMI has been embraced as an equal infrastructure partner by leaders in CANARIE and Compute Canada, the playing field is clearly unequal at this stage.  The good news is that Research Data Canada and CARL continue to be invited to participate in events organized by the other two infrastructure partners.

Second, the voice for RDMI is often ad hoc and diluted. CANARIE and Compute Canada serve as single points of contact for their infrastructure. Typically, individual researchers are called upon to speak on behalf of data infrastructure, even though they may represent only a narrow perspective on data management infrastructure. A consequence is that the voice for research data often becomes haphazard. The risks are that a data advocate may not be present at an important research infrastructure event or that the message is too narrow for today’s range of research data issues.

Third, RDMI is dependent on bottom-up initiatives, requiring a great deal of coordination and cooperation to be successful. The organization of top-down initiatives typically depends on control and governance. With bottom-up projects, the most important organizational factors are trust, collaboration, and cooperation. These two different organizational structures also tend to result in different styles of internal politics.

Finally, the international peers for each of Canada’s infrastructure pillars are different. Both CANARIE and Compute Canada see their counterpart organizations in the United States, Australia, the United Kingdom, and the rest of Europe as their peers. The models and practices for funding and planning are also similar among these peers. Look at what is happening in RDMI within this same group of countries: the National Science Foundation in the U.S. provides grants for data curation projects through its DataNet program; the European Union supported the Global Research Data Infrastructures 2020 project to help chart the course for developing a global data ecosystem; Australia established the Australian National Data Service to support researchers with their data curation needs; and in the U.K., JISC offers its Managing Research Data program, which funds projects in RDMI. These examples are all top-down driven and involve incentive programs for data infrastructure. At this stage, the development of RDMI in Canada has very little in common with that of CANARIE and Compute Canada’s international peers. A subsequent Blog entry will address who the international peers currently are for Canada’s RDMI.

The next entry discusses RDMI components of technology, services and expertise and how they are organized locally or globally.

[The views expressed in this Blog are my own and do not necessarily represent those of my institution.]