The Office of Data and Informatics is effectively meeting its mission to provide leadership and expertise to meet the data challenges and opportunities for the rest of the Material Measurement Laboratory (MML) and National Institute of Standards and Technology (NIST) research data infrastructure. Its mission also includes providing services to MML and to NIST through its expertise, guidance, and the resources it offers in order to enhance the discoverability, usability, and interoperability of data collected by other NIST laboratories and, occasionally, outside sources. The office is similarly meeting this portion of its mission through the programs that it offers, including data infrastructure workflows, modern and robust data repositories, and thought leadership for the use and dissemination of research data beyond the federal government (i.e., the Research Data Framework). There is potential for expansion and replication of these initiatives throughout MML and the rest of NIST.
Overall, the capabilities within the Office of Data and Informatics to increase the usability, interoperability, and discoverability of NIST’s and MML’s outputs are absolutely vital to meeting NIST’s mission to “promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life” and MML’s mission to “conduct measurement science research to benefit industries based in the chemical, biological and material sciences.” Furthermore, these critical goals can only continue to be met if the initiatives and projects of the office are widely implemented across MML. Fortunately, this imperative is achievable owing to the foundations that have been laid over the past decade by the staff at the Office of Data and Informatics. Nevertheless, it will take concerted effort and planning to apply the lessons, tools, and workflows more broadly.
The Office of Data and Informatics has a team that can bring consideration of experimental metadata into every division at an early stage. Experimental metadata describe the conditions under which the data were collected and allow later investigators to use those data for additional research or verification. This skill and perspective are too easily overlooked in common practice because many of the benefits are realized only in hindsight, by future users or projects or by external community members when they access and make use of the results of past work. Because the benefits of high-quality metadata are being considered and implemented in the projects that this office spearheads, future reuse will improve, and research practice may also improve through better data management. These efforts benefit the wider scientific community and the nation because the data outputs from MML’s work are preserved, curated, and disseminated, allowing others to reuse them and increasing the transparency, credibility, and impact of MML. This is enhanced by the consideration given to openness and transparency, which require technical skills to manage and to create workflows that enable both.
A major success has been the data pipeline created through the NexusLIMS1 infrastructure. This networked system automatically captures scientific data as it is generated throughout the electron microscopy working groups and automatically applies relevant metadata. This benefits individual researchers, who are better able to access, store, manage, analyze, and retrieve the original experimental data. Colleagues and supervisors can see and support the research as it progresses and as results are generated. The automatic addition of important metadata and tags about the equipment, its settings, and the context in which the images were created addresses key shortcomings that affect reproducibility in scientific settings. The benefits to society multiply as this system feeds into a truly FAIR2 pipeline that eventually allows for public discoverability through NIST’s Public Data Repository, which provides both human- and machine-readable data access.
___________________
1 A LIMS is a laboratory information management system.
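To make the idea of automatic metadata capture concrete, the following is a minimal sketch in Python of attaching instrument context to a newly acquired file. It is illustrative only and does not represent the NexusLIMS implementation; the `Instrument` type, the `build_record` function, and all field names are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json


@dataclass(frozen=True)
class Instrument:
    """Hypothetical instrument description; not a NexusLIMS type."""
    name: str
    settings: dict


def build_record(filename: str, payload: bytes,
                 instrument: Instrument, operator: str) -> dict:
    """Assemble a metadata record at capture time, without user input."""
    return {
        "file": filename,
        # A checksum lets later users verify the data are unaltered.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        # Equipment and settings are the context needed for reproducibility.
        "instrument": asdict(instrument),
    }


if __name__ == "__main__":
    tem = Instrument(name="TEM-1", settings={"accelerating_voltage_kV": 200})
    record = build_record("img_001.tif", b"raw image bytes", tem, "jdoe")
    print(json.dumps(record, indent=2))
```

Because the record is created when the file is captured rather than entered later by hand, the equipment settings and acquisition time cannot be forgotten or misremembered, which is the reproducibility benefit noted above.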
This office’s staff possess expertise in assessing technical solutions to determine if the reuse, modification, or creation of new custom tools is required. Prior to creating new software or workflows, the staff effectively evaluates commercial and open-source solutions for the degree of fit and an overall cost-benefit analysis. They possess the capability to create novel solutions that are effective and valuable where suitable commercial and open-source solutions do not exist.
The Materials Science and Engineering Division, with the support of the Office of Data and Informatics, effectively uses virtual machines to increase the longevity of aging equipment that is controlled by computers running obsolete operating systems (e.g., Windows XP or 2000). This extends the return on the original investment in the equipment.
The Office of Data and Informatics, in collaboration with world experts, helped to create the Research Data Framework. This framework is a useful tool for researchers, librarians, research support staff, and external stakeholders such as data repositories and scientific publishers because it provides specific recommendations for practices at each stage of the research process. Success in implementing this organizational infrastructure is establishing NIST’s credibility and leadership in data information practices.
Recommendation 9-1: The Material Measurement Laboratory should periodically and frequently engage with industry and academia through activities such as workshops, meetings, conferences, and roundtables to standardize data management frameworks that fully support U.S. science advancement and industry competitiveness via open tools and instruments alike.
Recommendation 9-2: The Material Measurement Laboratory should disseminate broadly the Materials Science and Engineering Division’s protocols for the use of virtual machines to increase the longevity of aging equipment controlled by obsolete computers, effectively empowering industry, academia, and government laboratories to extend the return on capital equipment investments.
The Office of Data and Informatics has 23 total staff, comprising 14 scientists, 2 support staff, and 7 contract staff. Building, maintaining, and growing the capability to properly structure useful data repositories requires sufficient scientific expertise. This understanding is particularly important for appropriately incorporating the needs of a broad technical community. These diverse communities at MML comprise fundamental and applied sciences like chemistry, physics, biology, materials, and engineering. The Office of Data and Informatics meets these requirements with a strong core team possessing scientific backgrounds in astrophysics, biology, and materials. Their ability to leverage other resources is exemplified through internal and external collaborations like their engagement with faculty from the University of North Florida. This relationship yielded a recent paper in Nature (Hanisch et al. 2022), in which staff from and peers of the Office of Data and Informatics call for machine-readable units of measurement. The paper stands out as evidence of the strength of these collaborations and of their potential to enhance the office’s positive effect on the NIST brand.
___________________
2 FAIR is a set of principles to ensure that data are findable, accessible, interoperable, and reusable; see GOFAIR, “FAIR Principles,” https://www.go-fair.org/fair-principles, for more information.
The importance of incorporating internal scientific domain expertise from within MML into the Office of Data and Informatics is demonstrated by the significant impact made by a fully embedded materials scientist who has extensive knowledge of electron microscopy. This inclusion of technical knowledge into the office’s work provides the team with deep knowledge of the data requirements when using this critical tool for MML. Similarly, the epidemiological modeling work presented is another example of the impact domain experts have on the division’s work as it grows its collaborations in data-intensive methodologies applied to biology (e.g., collaboration with the Jet Propulsion Laboratory). Despite these first-rate examples of collaborative efforts, a potential challenge is that the necessary scientific breadth is conspicuously shallow, in some cases only one person deep. For example, a significant capability gap would occur immediately within the Office of Data and Informatics if its only materials scientist were to pursue other career opportunities.
A home-grown capability described by the division’s staff, and mentioned in several other assessment groups, was the development of virtual machines. This clever construct allows MML researchers to maintain instruments running proprietary software necessary for using equipment with proper security and technical support even after their original computers or operating systems are obsolete. This enables and extends the data capture that is instrumental for the Office of Data and Informatics. It is worth noting that the individual who built this capability is not officially a member of the office’s staff. This situation demonstrates the team’s ability to leverage resources available throughout MML and represents a kernel of an idea to scale up the office in a low-cost fashion while providing a significant opportunity for NIST’s postdoctoral cohort as described below.
One possibility to bring additional resources to the Office of Data and Informatics in a way that will increase its sphere of influence and improve its ability to maintain the needed scientific breadth is to develop a model that would create sustainable connections to the various laboratories. A hybrid of the two models described in the last two paragraphs would afford a sustainable solution: namely, 6-month rotations of postdocs from other groups through the office. It is possible that groups outside the Office of Data and Informatics may find this idea to be a distraction to the postdoctoral cohort. However, this model would bring about several other tangible and impactful benefits to both the Office of Data and Informatics and postdocs such as:
Conclusion 9-1: Embedding outside team members, such as postdocs from other groups rotating through the Office of Data and Informatics, to work on projects would benefit the divisions for which those researchers work. This would allow for technical expertise to be embedded with the workflow expertise in the office so that solutions for the MML divisions would be more specifically tailored to the language, tools, and other nuances of those divisions.
The Office of Data and Informatics has a budget in fiscal year 2023 of $5.615 million, with $4.825 million coming from appropriations and $790,000 coming from providing standards reference data services. It has 23 staff, comprising 14 scientists, 2 support staff, and 7 contract staff.
MML’s creation of the Office of Data and Informatics laboratory with world experts who are truly committed to data governance and FAIR data practices, and who are regarded as leaders in scientific data management, is commendable. These experts are in a unique position to lead U.S. industry, government, and academic laboratories in establishing standardized best-in-class data practices. However, their impact could be increased if they were in a position to apply their expertise in additional laboratories within MML, or indeed across NIST where data infrastructure is also likely needed (though not directly assessed during this review process). As mentioned in previous sections on technical expertise, the Office of Data and Informatics is able to expand the useful life of existing equipment through the use of virtual machines and is able to increase research efficiency through the creation of high-quality data workflows. This efficiency gain has long-term benefits for other laboratories that implement these workflows, allowing them to conduct more research on the same budget, although it requires an upfront investment to establish. A key example of this is the NexusLIMS that connects electron microscopy equipment to a centralized data management system. The investment that was made to create this system is worth replicating in additional research groups throughout MML.
Recommendation 9-3: To facilitate the deployment and adoption of state-of-the-art data practices in the Material Measurement Laboratory (MML) and across National Institute of Standards and Technology (NIST), the Office of Data and Informatics and MML should:
The Office of Data and Informatics has evidenced a true commitment to its mission to be the premier, pioneering resource specializing in the large and information-rich data sets now common in many disciplines focusing on researchers and institutions in the life and physical sciences (i.e., biological, chemical, and materials sciences), areas that data science and informatics can support. The office’s research and implementation programs are driven by stakeholder needs both from within MML and from external constituencies such as users of the external-facing data repositories and patterns in the Research Data Framework, discussed in more detail below. As such, the NIST Public Data Repository3 has been an innovative and effective mechanism for NIST researchers to disseminate their efforts since it allows others to explore or reuse data and the related tools and resources supporting science, engineering, and technology. Researchers in NIST labs communicated their excitement about how easy it is now to make their data public. This demonstrates that the Office of Data and Informatics serves not only other MML entities but all of NIST. The NIST Public Data Repository includes fields for data set metrics that display usage statistics on total file downloads, total data set downloads, total bytes downloaded, total unique users, and last downloaded. Nine of 10 randomly selected data sets showed only “Metrics not available” for the resource. The reader cannot tell whether metrics cannot be counted for that data set (perhaps because it is hosted externally) or whether there has been no usage of that data set.
Recommendation 9-4: The Material Measurement Laboratory should enter more information in the NIST Public Data Repository field to explain why data set metrics are not available where that is the case.
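One way to picture the distinction this recommendation asks for is a metrics record that carries an explicit status, so that "no usage" and "usage cannot be counted" are never displayed the same way. The Python sketch below is a hypothetical illustration; the `MetricsStatus` values, the `DatasetMetrics` type, and the `describe` function are assumptions and do not reflect the actual NIST Public Data Repository schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MetricsStatus(Enum):
    """Hypothetical reasons a metrics field may or may not have a value."""
    TRACKED = "tracked"
    HOSTED_EXTERNALLY = "hosted externally; downloads cannot be counted"
    NOT_YET_COLLECTED = "metrics collection has not started"


@dataclass
class DatasetMetrics:
    status: MetricsStatus
    total_downloads: Optional[int] = None


def describe(metrics: DatasetMetrics) -> str:
    """Return display text that separates zero usage from missing metrics."""
    if metrics.status is MetricsStatus.TRACKED:
        count = metrics.total_downloads or 0
        return f"{count} downloads" if count else "No downloads recorded yet"
    # Any other status explains why no number can be shown.
    return f"Metrics not available: {metrics.status.value}"


if __name__ == "__main__":
    print(describe(DatasetMetrics(MetricsStatus.TRACKED, 0)))
    print(describe(DatasetMetrics(MetricsStatus.HOSTED_EXTERNALLY)))
```

With a record of this kind, a reader of the repository page would see either a count (including zero) or a stated reason, rather than a single ambiguous message.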
The NIST Alloy Data4 application and the Metals Alloy User Interface are two examples of the quality and depth of the data in this public repository. In the past 3 years, the growth in the number of data sets (now more than 30,000) and in the amount of data stored (now more than 20 TB) is a testament to the value of the repository to NIST scientists and to the public. Similarly, the Office of Data and Informatics has taken the lead in creating an open-source code repository, the NIST Opensource Contributions Portal.5 This is a well-used hub for NIST open-source projects and has a full catalog that is updated regularly as repositories are added or modified. Its features include private or restricted access that enables data sharing with appropriate protections.
A new achievement has been the launch of the Research Data Framework version 1.5, which provides the research community with a structured approach to develop a customizable strategy for the management of research data. The State University of New York system’s use of this framework is a testament to the awareness of and trust that large institutions have put in the NIST-led framework.
The Research Data Framework can become the data framework to guide NIST as a premier research organization in standards. Any other research and development organizations (in academia, industry, government, or other public or private institutes) can also rely on this framework as they become digitized and data centric. Its adoption will help standardize fragmented efforts, agree on a common data vocabulary, and provide necessary data governance and roles.
The Office of Data and Informatics’ outreach includes working closely with world-class organizations on data science and engineering. For example, together with the Research Data Alliance, an external association of governments and researchers that promotes best practices in data management, they now provide tools for data discovery within the Materials Resource Registry by using and providing a materials science metadata schema (Medina-Smith et al. 2021). Staff from this office are also active in the FAIR Digital Objects Forum Steering Group, the Committee on Data of the International Science Council Task Group, the Certified Information Privacy Manager Expert Group, and many others.
The Office of Data and Informatics has also taken leadership in the area of data models, tools, and infrastructure at the national level. Its staff are active in CENDI,6 which focuses on improving the productivity of U.S. federal research and development efforts through collaborative agency participation; in the Department of Commerce Data Governance Board; and in the National Science Foundation Materials Research Data Alliance.
___________________
3 See NIST, “Science Data Portal,” https://data.nist.gov, for more information.
4 See NIST, “NIST Alloy Data,” https://www.nist.gov/mml/acmd/trc/nist-alloy-data, for more information.
5 See the NIST Opensource Portal website at https://code.nist.gov for more information.
6 See the CENDI website at https://www.cendi.gov for more information.
The Office of Data and Informatics also fosters strong collaborations with academia, notably with the NUANCE Center at Northwestern University, Johns Hopkins University, the City University of New York, and the University of Pennsylvania.
This office’s staff have made efforts to include the private sector, for example, through 3 conferences and 15 workshops in 2023. These efforts go beyond industry presence at meetings and could also include funded projects with lead companies within particular chemical or materials industrial sectors to adopt and implement standardized data repositories, tools, and approaches. Such projects can help to ensure the true adoption of data repositories or data management practices demonstrated by the Office of Data and Informatics. Efforts toward data standardization by equipment manufacturers and vendors of data solutions can be catalyzed by the Office of Data and Informatics’ leadership. The basis of centralized data engineering, infrastructure, analytics methodologies, and tools has already been developed by the office.
The Office of Data and Informatics has developed a common ecosystem for data management and sharing that has already been shown to be successful with the electron microscopy community in various sites and divisions as well as with the additive manufacturing community. This ecosystem can be generalized to all MML divisions and can also be applied to any research organization in the physical sciences, whether in industry, academia, or government.
Understanding the need for inspiring the next generation of data savvy researchers, the Office of Data and Informatics has created a NIST Educational STEM Resource Registry,7 or NEST-R, which is used by students and teachers and serves as an additional mechanism for creating awareness of the work done in MML and at NIST more broadly.
The Office of Data and Informatics collects metrics that measure the use of all of the above programs. These metrics show an increase in use in recent years. They also indicate that adoption is limited. Adoption could be enhanced by conducting use analysis jointly with key stakeholders.
Conclusion 9-2: The Office of Data and Informatics is positioned as a national and an international leader in data infrastructure and standardization because of its commitment to enabling open data and efficient data workflows and because of its work with equipment manufacturers and vendors of data solutions. The office has already piloted and successfully demonstrated the usefulness of centralized data engineering, infrastructure, analytics methodologies, and tools.
Conclusion 9-3: The Office of Data and Informatics’ efforts to inspire the next generation of physical scientists to be data natives are commendable. The office does this through the National Institute of Standards and Technology (NIST) Educational STEM Resource Registry, or NEST-R, while simultaneously creating awareness of the work done at the Material Measurement Laboratory and at NIST more broadly. The office is also uniquely positioned to offer NEST-R to all of the scientific community.
Recommendation 9-5: The Material Measurement Laboratory should strengthen its commitment to the Office of Data and Informatics by establishing clear data governance and data policies. It should establish the National Institute of Standards and Technology (NIST) Public Data Repository as the place to publish all open-source projects and take advantage of its restricted access functionality for any NIST confidential or internal project.
Recommendation 9-6: The Material Measurement Laboratory managers and researchers should adopt the materials science schema developed jointly with the Research Data Alliance.
___________________
7 For more information, see the NIST Educational STEM Resource Registry (NEST-R) website at https://nestr.nist.gov.
Recommendation 9-7: The Office of Data and Informatics should collect and make more visible the usage metrics to measure the impact of its repositories. Numbers that report access, download, and reuse (although that is the hardest to measure) would showcase how data and teaching materials are used in and out of government.
Hanisch, R., S. Chalk, R. Coulon, et al. 2022. “Stop Squandering Data: Make Units of Measurement Machine-Readable.” Nature 605(7909):222–224. https://doi.org/10.1038/d41586-022-01233-w.
Medina-Smith, A., C.A. Becker, R.L. Plante, et al. 2021. “A Controlled Vocabulary and Metadata Schema for Materials Science Data Discovery.” Data Science Journal 20:18. https://doi.org/10.5334/dsj-2021-018.