In today's technological world, sustaining science as a source of new knowledge and innovation has become as important to modern society as maintaining the nation's capabilities in manufacturing, trade, and defense. The extent to which public funding in the developed world supports science is testimony to society's recognition that basic as well as applied research must be carried out to advance the public interest.
Science itself is a living enterprise. With few exceptions, acquisition of scientific knowledge is a cumulative process that depends on researchers' continuing ability to collect and share data. This capability has been strengthened by the advent of information technology, which is supplying powerful new tools and enabling new styles of working. However, far-reaching changes involving complex technical, economic, and legal issues also have begun to alter the conditions for exchange of data among scientists, especially across national boundaries.
To help understand the impact of such changes and to learn what actions are needed to ensure full and open exchange of scientific data1 worldwide among researchers in the natural sciences, the Committee on Issues in the Transborder Flow of Scientific Data undertook a study responding to the following charge:
This study addresses issues in effective access to data in numerical, symbolic, and image forms by scientists for scientific research purposes, rather than to bibliographic or purely textual information. The focus is on digital rather than analog data, since practically all scientific data are now collected and stored digitally, and most older data are being transferred to digitized electronic formats. The scope of inquiry also is limited to data in the natural sciences, which is the principal subject-matter focus of CODATA.2
Because the sponsors of the study are U.S. federal government science agencies, the committee has emphasized those trends, issues, and barriers that have an impact on international access to data collected and used in publicly funded, basic research programs—that is, scientific research conducted as a public good. Despite this emphasis, the committee took into account the continua between fundamental and applied research, between raw data and processed information, and between public and private uses of scientific data. Indeed, the most vexing public policy issues facing the international scientific community in the exchange of data involve defining the appropriate balance of divergent interests.
Underlying the committee's approach, however, and informing its conclusions and recommendations, is the principle that full and open exchange of scientific data—the "bits of power" on which the health of the scientific enterprise depends—is vital for advancing the nation's progress and for maximizing the social benefits that accrue from science worldwide.
Freedom of inquiry, the full and open availability of scientific data on an international basis, and the open publication of results are cornerstones of basic research that U.S. law and tradition have long upheld. For many decades, the United States has been a leader in the collection and dissemination of scientific data, and in the discovery and creation of new knowledge. By sharing and exchanging data with the international community and by openly publishing the results of research, all countries, including the United States, have benefited. Today, however, many rapid changes portend significant consequences, some possibly adverse, for the conduct of basic research in the natural sciences.
In broad terms, the challenges of greatest import for full and open global sharing of scientific data are those associated with two quite recent trends: the transition to predominantly electronic means of generating, storing, and distributing scientific data, and the growing movement to treat those data as commercial goods subject to restrictive economic and legal controls.
The former obliges scientists to reexamine how they carry out their calling. The latter impels the scientific community to become more involved in understanding the significance of public policies and legislative activities that can have a profound impact on its work.
Chief among recent developments affecting access to scientific data is the widespread use of powerful new technologies for data acquisition, storage, and communication, as well as their inevitable consequence, the rapidly growing quantity of data that scientists are generating, preserving, and distributing. Moreover, because of increasingly diverse applications for the results of scientific research, these data are becoming ever more useful and valuable in many sectors outside the specific areas of research that generate them. Finding ways to distribute such information to all who want it—equitably, reliably, and in keeping with the principle of full and open exchange as a sine qua non of progress in science—is the greatest challenge this committee identified while conducting its study.
Although scientific interchange was an important stimulus for development of the Internet and initially represented one of its greatest uses, commercial activities and entertainment now far surpass scientific use of the network and may be expected to dominate policymaking for the electronic exchange of information. This development raises questions about the scientific community's continuing capability to utilize what has clearly become a beneficial and versatile tool for scientific exchange and interaction. The economic framework for a global information system and legal models for dealing with conflicting interests are increasingly influenced by stakeholders who have no long-term responsibilities for, or concern about, sustaining publicly funded scientific inquiry. Simultaneously, the government science agencies expected to assume long-term responsibilities for sustaining scientific inquiry are questioning their capacity to continue to invest at traditional levels in the creation, preservation, and dissemination of scientific data.
Some technical trends and developments have had a significant, largely positive impact on the management and international exchange of scientific data. These include the steadily decreasing cost of computing and communication; greatly enhanced capabilities for collecting scientific data, for example, from remote sensors; increasing exploitation of broadband networks and capabilities for transmission of video data over networks; the advent of digital wireless communication; increasing support for collaborative work by long-distance communication; growing capabilities for natural language processing; increasing recognition of the importance of standards in data structures and in networked communication; growing acceptance of the need for cooperation in monitoring and controlling network activity; and increasing use of intranets.
Associated with advances in, and increasing reliance on, information tools and infrastructure are a number of problems that present barriers to access, including the growing congestion of the Internet and consequent constraints on scientific communication and research; the storage and distribution of data that are inadequately described or indexed for significant numbers of potential users; the rapid obsolescence of electronic information-processing tools and storage media; the vulnerability of electronic networks and data repositories to accidental or deliberate damage; and the growing competition for use of currently limited network resources. Another difficulty, the current lack of adequate access to scientific data in developing countries, is nevertheless one with the potential to improve quickly.
The natural sciences—including the physical, astronomical, geological, and biological sciences—face a number of trends, opportunities, and challenges affecting researchers' capabilities for sharing data. The most obvious involves dealing with the exponentially growing volume of accumulating scientific data, which now, as a result of expanding computational power, also includes elaborate simulations that often incorporate animation as well as quantitative information. With the end of the Cold War has come declassification of some data that are now providing many new opportunities for researchers, particularly in the Earth sciences. In addition, because of the breadth and scale of major interdisciplinary, global-scale research efforts such as the International Geosphere-Biosphere Programme, the Human Genome Project, and the Hubble Space Telescope project, data from individual disciplines have become important to understanding and progress in other fields. Making data available, comprehensible, and useful across disciplinary boundaries has become a far greater imperative than before these projects existed. This task, however, is complicated by the fact that scientific data do not constitute a uniform, easily accessible body of information.
For example, scientific data may be categorized in many ways: by form or coding (numeric, symbolic, still image, animation, or other); by content; by means of generation; by level of quality and complexity; by the source of support for the data-accumulating activity; by time and space, in the case of observational, geospatial records; and by the institutional structures through which the data are distributed and stored. Certain of these characteristics, such as level of quality (including degree of review and certification) and institutional origin, have given rise to additional complications associated with the increasingly pervasive electronic distribution of scientific data.
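To make these categorization axes concrete, the following is a minimal sketch, in Python, of what a catalog record embodying them might look like. Every field name and example value is an illustrative assumption, not a scheme taken from this report or from any existing data center.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DataSetRecord:
        """Hypothetical catalog entry for a scientific data set,
        with one field per categorization axis discussed above."""
        form: str                     # e.g., "numeric", "symbolic", "still image", "animation"
        content_domain: str           # subject matter, e.g., "atmospheric science"
        generation_method: str        # e.g., "remote sensing", "laboratory measurement", "simulation"
        quality_level: str            # e.g., "raw", "reviewed", "certified"
        funding_source: str           # e.g., "public", "private", "mixed"
        temporal_coverage: Optional[str] = None  # for observational records, e.g., "1991-1996"
        spatial_coverage: Optional[str] = None   # e.g., "global", "45N-60N"
        distributor: Optional[str] = None        # institution that stores and distributes the data

    # Example: a record for a hypothetical publicly funded satellite data set.
    record = DataSetRecord(
        form="numeric",
        content_domain="atmospheric science",
        generation_method="remote sensing",
        quality_level="certified",
        funding_source="public",
        temporal_coverage="1991-1996",
        spatial_coverage="global",
        distributor="national data center",
    )

Even so simple a record suggests why cross-disciplinary access is difficult: two communities that fill these fields with different vocabularies cannot exchange data without an agreed mapping between their terms.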
Some data issues are more discipline specific. Perennial problems affecting access to data in the observational sciences, for example, include gaps in quality control, incompatibility of data streams, inadequate documentation of data sets, and difficulty in meeting the requirements for long-term retention of data. In the biological sciences, the variety of attributes and qualifiers included with each observation and differences in terminology and usage put a heavy burden on any supplier of data to identify and specify the character of the data precisely enough to prevent misinterpretation. In the laboratory physical sciences, as in many other branches, fragmentation of data into numerous, autonomous, and often incompatible databases with different formats and levels of quality is a chronic problem.
Putting scientific data to use rapidly in sectors outside the immediate discipline of origin poses additional challenges to the longer-term effort to provide full and open access. In the observational environmental sciences, for instance, massive archives and reliable institutional memory are necessary to keep the data accessible and intelligible. Simultaneously, however, data also must be available to meet the public's need for warnings of natural hazards and disasters and for commercial use by the private sector. In addition, availability of data can be affected by governmental concerns related to national security, foreign policy, and international trade. Newly adopted or proposed restrictions on previously open and unrestricted data have caused particular concern in the Earth science communities, for example.
Another significant concern regarding full and open access to scientific data is related to commercialization of electronic publication and electronic databases. Science operates according to a "market" of its own, one that has rules and values different from those of commercial markets. While protection of intellectual property may concern a scientist who is writing a textbook, that same scientist, publishing a paper in a scientific journal, is motivated by the desire to propagate ideas, with the expectation of full and open access to the results. To commercial publishers (including many professional societies), protection of intellectual property means protection of the rights to reproduce and distribute printed material. To scientists, protection of intellectual property usually signifies assurance of proper attribution and credit for ideas and achievements. Generally, scientists are more concerned that their work be read and used than that it be protected against unauthorized copying. These conflicting viewpoints pose challenging problems for science and the rest of society. Current discussions are seeking a balance between protecting publicly supported activities that advance the public welfare and strengthening individual rights to intellectual property.
Associated with the internationalization of scientific data collection and use has been the growth of data centers—dedicated, stable institutions supporting collaborative data sharing across international boundaries and providing verification, documentation, archiving, and dissemination of large, accumulating data sets. The scientific community is increasingly dependent on these data centers—on their skills in data management and distribution and on their capacity to support international scientific efforts.
Finally, an important concern in global access to scientific data is the need to improve capabilities for electronic communication by researchers working in developing countries. A two-way communication capability is needed: scientists in developing countries, like scientists everywhere, generate data that are just as important to science as the data they acquire. Finding ways to help less developed nations acquire affordable electronic network services is an effort that can and should be undertaken by concerned national and international organizations with the help of the telecommunications sector.
The constraints caused by inequalities among nations in access to scientific data are especially damaging to those sciences concerned with inherently international issues, such as food production, biodiversity, the prevention and cure of communicable diseases, global climate change, and other Earth system processes. Each of these sciences requires the generation of globally compatible, accessible, and usable data sets related to terrestrial ecosystems, the physical environment, and human activities. Collaboration among members of the scientific communities in every nation, rich and poor, in developing global observational data sets and in ensuring the subsequent full and open availability of those data is imperative; its importance cannot be emphasized too strongly.
As the quantities and uses of scientific data have expanded, and as nations' discretionary budgets have become increasingly constrained, some governments have begun to privatize activities previously delivered by the public sector and have sold some products and services on a commercial basis—including the generation and distribution of scientific data. This development has stimulated fears that scientific data may become priced beyond the means of the scientific communities, even in the more developed countries, despite the fact that the conduct of basic scientific research, like other government activities related to public health and safety, serves the public welfare and thus is appropriately supported by government funding.
Although economists may initially see privatization as a positive development for science, careful analysis suggests that a market model different from that of ordinary commerce is more appropriate for scientific activity for several reasons. First, the conduct of some scientific research is itself tightly tied to the collection, maintenance, and distribution of the data generated by that research. In particular, in the observational sciences, whose databases can be massive, separating the gathering, archiving, and maintenance of data from their distribution is likely to be more costly and inefficient than keeping them integrated.
Second, the contributors of scientific data, particularly in basic research, are frequently also the consumers of such data, and nonmonetized exchange of data may be most efficient in such cases. Third, in many situations, the market for scientific data is not large enough to support more than a single commercial supplier, if that. Finally, most basic research is necessarily funded from public sources. Privatizing the distribution of those data would mean that the funds now provided in grants to institutions supplying data would be channeled instead (if such funds were still available) to grants to individual scientists as users of data. Such funds in small grants to individuals are likely to be vulnerable to even the slightest budgetary pressure, thus potentially compromising the long-term health of science. Direct appropriation or block grant support to institutions with broad responsibilities for data management, preservation, and distribution, while not assured of continuity, is typically more stable and secure and is fortified by institutional memory that recognizes and supports the continued utility of archived data.
At issue now is whether or when the government should remove itself entirely as a distributor of scientific data. (There is no question here regarding the continued support by government of data generation; it is a part of the process of doing basic research that falls outside the charge of this study.) Largely because of the possibility of monopoly control and the potential threat to the principle of full and open availability of data, the government should not remove itself as a primary distributor of the scientific data that its funding has produced, without adequate safeguards as discussed below.3
The concern that privatization, accompanied by high prices and legal restrictions, would limit scientists' access to data needed for their work is paralleled by a similarly serious concern among economists about the possibilities for unrestricted monopolization, particularly by any party whose objectives do not include advancing the public interest. Whether they are private or governmental in nature, profit-making monopolies would endanger science, whereas privatization structured so as to encourage competition in supplying value-added data to multiple user communities could well represent good public policy.
Any pricing policies that bear on the availability of scientific data should reflect this information's characteristics as a public good—a resource that is both nondepletable (cannot be diminished by repeated use) and indivisible and nonexcludable (once having been supplied to some, cannot easily be denied to others). Because there is no social cost from repeated use, price differentiation may be justified in many situations, to ensure that the needs of the scientific community are met. Pricing of government-funded data in a differentiated system should ensure that data are available at no cost to those who provide them or otherwise contribute substantively to any given data set; for others, including commercial users, prices for data should cover the costs of serving those users. Because there is a cost associated with repeated distribution, marginal pricing has been the policy in many of the sciences. It allocates the smallest nonzero cost to users and thus is consistent with the principle of full and open exchange of data.
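As an illustration of how such a differentiated schedule might be expressed, consider the following sketch in Python. The user classes, the function name, and the dollar figures are hypothetical assumptions introduced here for clarity, not prescriptions from this report.

    def price_per_copy(user_class: str,
                       marginal_cost: float,
                       cost_of_serving_user: float) -> float:
        """Price for one copy of a publicly funded data set under a
        differentiated schedule: contributors pay nothing, scientific
        users pay at most the marginal cost of distribution, and other
        users pay the cost of serving them."""
        if user_class == "contributor":
            return 0.0
        if user_class == "scientific":
            return marginal_cost
        return cost_of_serving_user

    # Hypothetical figures: reproducing and shipping one copy costs $10
    # (the marginal cost); serving a commercial user, including support
    # and documentation, costs $75.
    for user in ("contributor", "scientific", "commercial"):
        print(user, price_per_copy(user, marginal_cost=10.0, cost_of_serving_user=75.0))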
Internet congestion, a growing problem for transnational exchange of scientific data, has obvious economic aspects and will be resolved only if participating nations and network providers work together. For the scientific community, a partial solution may involve the creation of separate intranets.
The emergence of a new intellectual property rights model that protects the contents of electronic databases as well as those in print has the potential to significantly affect the international flow of scientific data. The problem has come to a head with current attempts, national and international, to establish a legal framework that threatens to subordinate the needs of data users working in the public interest to the desires of those seeking protection of investments in creating and maintaining databases. Unfortunately, until very recently, input into this legislative process by the scientific and educational communities has been all but nonexistent at any level. Sustained action by those sectors is needed to avert possible restrictions on the full and open exchange of scientific data.
The U.S. Constitution articulates the legal protection of technological inventions and of literary and artistic works through the patent and copyright systems, which attempt to balance incentives to create against the public interest in free competition. Any publicly disclosed technology or information that does not meet the eligibility requirements for protection under U.S. patent and copyright laws becomes public domain matter that anyone can appropriate freely. Moreover, the special needs of libraries, educators, and researchers for access to the copyrighted literature have been recognized under the concept of fair use.4
But this traditional balancing of private and public rights has become more complex in the information age. Many information goods with commercial value, notably the contents of most electronic databases, are not eligible for patent or copyright protection, and database producers consequently face the threat of rapid duplication by free-riding competitors who do not contribute to the costs of collecting, managing, or disseminating the relevant data. In its 1991 decision in Feist Publications, Inc. v. Rural Telephone Service Co.,5 the U.S. Supreme Court raised the threshold of eligibility for copyright protection, requiring significant original and creative authorship in the selection and arrangement of contents and not simply industrious compiling efforts. Earlier, the Commission of the European Communities (CEC) had started to develop a new protection framework for databases to encourage their commercialization in Europe. This effort culminated in the formal adoption of a new European Directive on Databases by the CEC in March 1996, which reflected the influence of the Feist decision as well as other concerns in Europe. In May 1996, legislation similar to the final European Directive, but even more protective, was introduced in the U.S. House of Representatives (H.R. 3531), and in August 1996, a proposal almost identical to the proposed U.S. legislation was placed before a Diplomatic Conference under the auspices of the World Intellectual Property Organization (WIPO), with a view to adopting a new protocol to the Berne Convention that would protect non-copyrightable databases in a tailor-made legal regime. Action on this proposal has been postponed until later in 1997.
Scientific data already largely compiled and distributed in electronic form constitute one of many types of data and information that will be affected by the legal framework now evolving in response to conflicting needs. Although new forms of legal protection may be needed to attract private investment to finance the creation and maintenance of electronic databases, including those for use in science and technology, current European and U.S. initiatives would confer a monopoly on database developers far broader and stronger than is needed to avert market failure. The pending legislation would create exclusive, monopolistic property rights of virtually unlimited duration, largely unconstrained by public policy limitations. If adopted in their current form, these legal proposals could jeopardize basic scientific research and education, eliminate competition in the markets for value-added products and services, and convert existing thresholds to market entry into insuperable legal barriers.
If put into practice, such measures could restrict the full and open access to data on which scientists and educators have depended. Neither the already adopted European Directive on Databases nor the proposed WIPO protocol and pending U.S. legislation would provide adequate fair use safeguards that recognize the needs of the scientific and educational communities for unrestricted access to data at affordable prices. They take little or no cognizance of the public-good character of scientific data for research and educational purposes.
More generally, such an approach ignores the contribution of basic science to the ability of U.S. firms to predominate in markets for technology and information goods. Despite a general consensus on the need for sustained levels of investment in research and development, the proposed database laws could change the status quo—without anyone's wanting it to happen—by elevating the price of the one raw material to which U.S. researchers have always had ready access. If less available scientific information were to translate into fewer applications of economic importance, the end result would be a loss of U.S. technological competitiveness in an integrated world market.
It is therefore essential to retain a "fair use" zone in cyberspace and in other media to protect the strong public interest in ensuring that certain uses and certain users, including the scientific and educational communities, are neither priced out of the market nor forced to cut back the basic research that has played a crucial role as a public good in the economic and technological growth of the United States. The pending legislative proposals, which the committee considers to be precipitous and radical attempts to alter the terms and conditions under which scientific data may be accessed and used on a worldwide basis, have the potential to do severe damage to the scientific enterprise. The scientific community and its defenders must step in quickly to insist on further, open debate before these changes reach implementation.
Based on its deliberations and understanding of the issues involved, the committee believes that the following overarching principle should guide all policy decisions concerning the management and international exchange of scientific data in the natural sciences: The value of data lies in their use. Full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research. The public-good interests in the full and open access to and use of scientific data need to be balanced against legitimate concerns for the protection of national security, individual privacy, and intellectual property.
The committee recommends that the economic aspects of facilities for storage and distribution of scientific data generated by publicly funded research be evaluated according to the following criteria:
The appropriate price ceiling for nonscientific users of scientific data generated through government research is incremental cost, as defined in the section titled "Pricing Publicly Funded Scientific Data" in Chapter 4. The price of scientific data to the contributing scientific community should be zero, or at most marginal cost.
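The distinction between the two ceilings can be made concrete with a worked example, under the assumption, adopted here only for illustration, that "incremental cost" denotes the additional cost of serving nonscientific users over and above distribution to the contributing community. All figures below are hypothetical.

    # Sunk cost of generating the data: publicly funded, and never
    # recovered through prices under either ceiling.
    data_generation_cost = 1_000_000.0
    # Marginal cost of reproducing and delivering one more copy.
    reproduction_cost_per_copy = 5.0
    # Added staff and infrastructure needed to serve nonscientific users.
    added_service_cost = 20_000.0
    expected_nonscientific_requests = 400

    # Ceiling for the contributing scientific community: zero, or at most marginal cost.
    scientific_ceiling = reproduction_cost_per_copy  # $5.00 per copy; $0 for contributors

    # Ceiling for nonscientific users: the incremental cost of serving them.
    incremental_ceiling = (reproduction_cost_per_copy
                           + added_service_cost / expected_nonscientific_requests)  # $55.00

    print(f"scientific ceiling:    ${scientific_ceiling:.2f}")
    print(f"nonscientific ceiling: ${incremental_ceiling:.2f}")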
The new proposals supporting an overly protectionist property rights regime for the contents of databases and for on-line transmissions of data and other scientific information have reached an advanced stage of legislative consideration at both the national and the international levels. The committee believes that these legislative changes do not reflect adequate consideration of the potential negative impacts on scientific research and education and that they have been proposed for implementation at an unnecessarily precipitous pace. The committee therefore recommends that the Office of Science and Technology Policy, leaders from the science agencies and professional societies, and all those concerned with sustaining the health of the scientific enterprise should immediately take the following actions:
NOTES

1. Office of Science and Technology Policy, DOE/EP-0001P, Washington, D.C., and National Research Council, Committee on Geophysical and Environmental Data (1995), On the Full and Open Exchange of Scientific Data, National Academy Press, Washington, D.C., p. 2.

2. Throughout this report, the term "scientific data" refers to data in the natural sciences.

3. The Landsat privatization effort, described in Chapter 4, is one example of unrestricted monopolistic data distribution under which the scientific community suffered loss of access. Nevertheless, there may be situations in which the scientific community would benefit if a body of data were distributed either by a competitive set of private firms or by a single, adequately constrained private source.

5. Feist Publications, Inc. v. Rural Telephone Service Co., 111 S. Ct. 1282 (1991).