The demand for data storage is projected to continue its explosive growth at a compound annual growth rate of approximately 27.5 percent. Projections built from TrendFocus growth trends forecast worldwide installed storage capacity to reach ~26.3 zettabytes (ZB) by 2030 and ~251.8 ZB by 2040.32 The yottabyte era may be reached by the 2050s. At the same time, magnetic storage scaling is slowing, as evidenced by the industry HDD roadmap (see Figure 1). This ultimately limits the scaling roadmap for tape as well, because tape intentionally follows the HDD areal density scaling roadmap in order to leverage amortized technology. The outcome is that supply is projected to fall short of demand, with Gartner projecting a zettabyte-scale supply gap in the second half of this decade. Increasing demand for media diversity (given that tape is the only true archive storage medium today), the requirement to migrate data typically every 7–10 years, and growing data storage sustainability considerations are all driving demand for new long-term data storage media technologies.
The perishability of technologies in today's world presents great challenges to any archival storage that demands data accessibility for longer than a decade. Periodic data migration to newer storage systems might be necessary, and the IC may need to adapt as new technologies become available.
The decadal plan for semiconductors published in 2020 by the Semiconductor Research Corporation (SRC)33 highlighted five seismic shifts for this decade, noting that the growth of memory demand will outstrip global silicon supply, presenting opportunities for radically new memory and storage solutions. It concludes with a storage grand goal: to discover storage technologies with more than 100 times today's storage density capability and new storage systems that can leverage these new technologies. The SRC Microelectronics and Advanced Packaging Technologies Roadmap34 evolves this goal further, stating that storage density must improve by one to two orders of magnitude by 2037, and targets that the industry should aim to store and recover 1 EB of binary data in DNA in a standard data center rack by 2037.
There are only two candidate technology domains that hold promise to meet these scaling goals: organic storage, with the lead technology being DNA data storage, and inorganic storage, with the lead technologies being optical or particle data storage, such as Microsoft's Project Silica and Cerabyte's Ceramic Nano Memory. Additionally, it would be desirable for the U.S. government to fund disruptive new data storage technologies. One current such initiative is the Intelligence Advanced Research Projects Activity's (IARPA's) Molecular Information Storage Technologies (MIST)35 program, which is exploring the use of scalable sequence-controlled polymers for long-term data storage.
Surprisingly, the idea of using DNA to store data originated in Richard Feynman’s 1959 lecture “There’s Plenty of Room at the Bottom.”36 Scientists have been storing digital data in DNA since
__________________
32 Horison Information Strategies, 2023, "The Storage Stack Evolution: From the Zettabyte to the Yottabyte Era," Boulder, CO.
33 Semiconductor Research Corporation, 2020, Decadal Plan for Semiconductors: A Pivotal Roadmap Outlining Research Priorities, https://www.src.org/about/decadal-plan.
34 Semiconductor Research Corporation, 2023, Microelectronics and Advanced Packaging Technologies Roadmap, https://srcmapt.org.
35 Intelligence Advanced Research Projects Activity, “MIST: Molecular Information Storage,” https://www.iarpa.gov/research-programs/mist, accessed November 22, 2023.
36 R. Feynman, 1959, “There’s Plenty of Room at the Bottom,” lecture, annual meeting of the American Physical Society at the California Institute of Technology, December 29, https://calteches.library.caltech.edu/47/2/1960Bottom.pdf.
2012.37 Initially, a 52,000-word book was encoded into DNA, followed by tens of megabytes (MB) of compressed video in 2016, 200 MB of data in 2018, and 16 GB of Wikipedia in 2019. In addition to traditional DNA chemistry, enzymatic or water-based DNA chemistry evolved starting in 2020. In 2018, the SRC published a SemiSynBio roadmap for molecular information storage technologies.38 In 2020, the DNA Data Storage Alliance was formed with the mission to create and promote an interoperable storage ecosystem based on manufactured DNA as a data storage medium.
In general, the workflow for DNA data storage can be divided into the following six steps39: (1) encoding the binary data into nucleotide sequences, (2) synthesizing (writing) the corresponding DNA, (3) storing the DNA, (4) retrieving and preparing the stored DNA, (5) sequencing (reading) the DNA, and (6) decoding the readout back into the original binary data.
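To make the encoding and decoding endpoints of this workflow concrete, the following minimal sketch (in Python) uses a simple two-bits-per-base mapping; this is a common textbook scheme for illustration, not the coding of any particular vendor or of the cited survey:

```python
# Minimal sketch of the encode/decode endpoints of the DNA storage
# workflow (steps 1 and 6), using an illustrative 2-bits-per-base map.
BASE_FOR_BITS = {"00": "A", "01": "C", "10": "G", "11": "T"}
BITS_FOR_BASE = {b: bits for bits, b in BASE_FOR_BITS.items()}

def encode(data: bytes) -> str:
    """Map each pair of bits in `data` to one nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Invert encode(): map each nucleotide back to two bits."""
    bits = "".join(BITS_FOR_BASE[b] for b in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"IC")        # 2 bytes -> 16 bits -> 8 nucleotides
assert decode(strand) == b"IC"
print(strand)                 # prints CAGCCAAT
```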
The storage mechanism and environment must protect the DNA from deterioration through oxidation, ultraviolet (UV) light, radiation, and high temperature. For long-term storage, DNA is typically dehydrated, which allows it to be stored for a very long time under proper conditions. Retrieving the data requires rehydration and preparation so the DNA can be read by a sequencer, which generates a readout that must be post-processed to recover the originally encoded data. To allow for non-destructive readout, the polymerase chain reaction (PCR) is used to generate many copies of a DNA strand, and only the single copy being read is destroyed by the readout process. DNA synthesis and sequencing are implemented through molecular writers and readers built on semiconductor chip technology, creating a cross-linkage of advanced semiconductors and the life sciences that is not necessarily obvious. Some companies are pursuing solutions to address challenges in the above workflow through a lab-on-a-chip approach, while at least one company is pursuing an end-to-end chip-scale solution.
With current DNA data storage techniques, all writing mechanisms are biological and chemical in nature, so the writing process is quite slow compared with that of conventional technologies such as HDD and magnetic tape. The writing processes are often error-prone, but techniques for adding error-correcting bits have been developed to compensate for these deficiencies, as sketched below. On the reading side, nanopore sequencing, a technology that combines electronics with biological methods, can be used for fast reading of DNA sequences.40
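The principle behind such error-correcting bits can be illustrated with a classic Hamming(7,4) code, in which three parity bits protect four data bits and allow any single flipped bit to be corrected. This is a minimal sketch only; production DNA codecs use far stronger codes, such as Reed-Solomon or fountain codes:

```python
# Hamming(7,4) sketch: 3 parity bits protect 4 data bits, so a single
# flipped bit (e.g., one misread base) can be located and corrected.

def encode(d):                       # d = [d1, d2, d3, d4]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # positions 1..7

def correct(c):                      # c = received 7-bit codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # parity check over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # parity check over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # parity check over positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3       # 0 means no single-bit error detected
    if pos:
        c[pos - 1] ^= 1              # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]  # recover the original data bits

word = encode([1, 0, 1, 1])
word[4] ^= 1                         # simulate one synthesis/read error
assert correct(word) == [1, 0, 1, 1]
```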
DNA data storage has unique properties that make it an ideal choice for archival storage, especially for time-capsule use cases, also referred to as "frozen" data. The tremendous density of DNA is several orders of magnitude higher than that of any current mainstream technology, which makes it one of the leading candidates for long-term storage. DNA can remain intact for thousands of years at room temperature in a dry atmosphere. No energy is required to retain the data, and no bit rot is expected, assuming proper storage conditions. The use of water-based DNA chemistry, in conjunction with this negligible energy consumption, makes DNA data storage a green, sustainable storage
__________________
37 R.F. Service, 2017, "DNA Could Store All of the World's Data in One Room," Science, March 2, https://doi.org/10.1126/science.aal0852.
38 Semiconductor Research Corporation and National Institute of Standards and Technology, 2018, 2018 Semiconductor Synthetic Biology Roadmap, https://www.src.org/program/grc/semisynbio/ssb-roadmap-2018-1st-edition-final.pdf.
39 A. Doricchi, C.M. Platnich, A. Gimpel, F. Horn, M. Earle, G. Lanzavecchia, A.L. Cortajarena, et al., 2022, "Emerging Approaches to DNA Data Storage: Challenges and Prospects," ACS Nano 16(11):17552–17571.
40 Oxford Nanopore Technologies plc, “How Nanopore Sequencing Works,” https://nanoporetech.com/platform/technology, accessed December 12, 2023.
medium. Lastly, DNA is truly software-defined storage: because the media is created as the data are written, there is no blank DNA media; the data are the media, and the media are the data.
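A back-of-envelope calculation illustrates the density claim above, assuming the textbook figures of 2 bits per base pair and roughly 650 g/mol per base pair, and ignoring indexing, copy, and error-correction overheads:

```python
# Theoretical density of double-stranded DNA (illustrative sketch only).
AVOGADRO = 6.022e23          # base pairs per mole
GRAMS_PER_MOL_BP = 650.0     # approximate molar mass of one base pair
BITS_PER_BP = 2.0            # one of four bases = 2 bits per position

bits_per_gram = BITS_PER_BP * AVOGADRO / GRAMS_PER_MOL_BP
eb_per_gram = bits_per_gram / 8 / 1e18
print(f"~{eb_per_gram:.0f} EB per gram of DNA (theoretical)")  # ~232 EB/g
```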
DNA data storage spans several use cases, including smart/secure storage, cold/archival storage, and computational storage. DNA data storage can be embedded in materials or objects in the form of molecular tokens, allowing for traceability, authentication, and provenance. Certain DNA data storage technologies enable secure storage solutions that are quantum-unhackable. The most public use case is the storage of massive amounts of cold or archival data that are either never or rarely retrieved. Lastly, work is under way on molecular computational storage, where the compute as well as the storage is realized in the molecular domain.
David Markowitz from IARPA presented a status update on DNA data storage at the 2022 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), citing DNA synthesis costs at more than $100k/GB and sequencing costs at more than $500/GB. The largest published DNA data archive to date is 200 MB in size and required nine separate synthesis runs. He outlined the IARPA MIST program's goal for 2025: DNA data storage at 1 TB/system at $1/GB for enterprise archival use, with end-to-end workflows on a tabletop.
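Simple arithmetic on these figures illustrates the size of the cost gap the MIST goal implies:

```python
# Scale of the cost gap cited above (illustrative arithmetic only).
current_usd_per_gb = 100_000     # >$100k/GB synthesis (SC22 figure)
target_usd_per_gb = 1            # IARPA MIST 2025 goal
tb = 1_000                       # 1 TB = 1,000 GB (decimal units)

print(f"Writing 1 TB today:  ${current_usd_per_gb * tb:,}")   # $100,000,000
print(f"Writing 1 TB target: ${target_usd_per_gb * tb:,}")    # $1,000
print(f"Required reduction:  {current_usd_per_gb / target_usd_per_gb:,.0f}x")
```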
The innovation opportunities for DNA data storage are in four major areas: (1) scaling synthesis, or writing, to terabyte per chip and terabyte per day using water-based DNA chemistry; (2) scaling sequencing, or reading, to terabyte per chip and terabyte per day using a non-destructive read mechanism; (3) scaling storage and retrieval systems to petabyte and subsequently exabyte scale, enabling fully automated workflows that provide an easy mechanism to copy and store DNA; and (4) seamless integration of DNA data storage systems into the information technology infrastructure hardware and software stack, avoiding a lengthy and costly process of building and scaling a new ecosystem.
With component validation occurring in a laboratory environment, DNA data storage has an estimated TRL of 4.
Unlike charge- or magnetism-based data storage technologies, inorganic data storage technologies use atomic storage mechanisms. These technologies seek to address long-term data storage using mechanisms that range from fluorescent multi-layered film to multi-color silver halide film, phase-change metal alloy tape, ceramics on glass, and voxels in glass. The two candidate technologies with the most advanced public workflow demonstrations are Microsoft's Project Silica and Cerabyte's Ceramic Nano Memory. Project Silica writes data "into" pure silica glass, while Ceramic Nano Memory writes data "onto" the surface of inexpensive, widely available glass media.
Silica is a future storage technology that uses glass as its media and is expected to become available for large-scale deployment in the next decade. It is an optical storage technology that uses femtosecond lasers to write data in glass and polarization-sensitive microscopy with regular light to read it.
Silica takes advantage of glass, which is a low-cost, durable, write-once-read-many (WORM) medium that is immune to electromagnetic fields and offers lifetimes of thousands of years. Silica is currently being developed by Microsoft Research as a ground-up, cloud-scale storage technology with long-term preservation, airgaps for data security, and sustainability as its core principles. The disaggregation of media, writers, and readers makes it a very scalable technology that can address multiple workloads within the archive segment, without the media refresh cycles required every few years by magnetic storage technologies.
Silica offers volumetric density higher than current magnetic tape (raw capacity upward of 7 TB in a square glass platter of 120 mm × 120 mm × 2 mm). Unlike most other technologies, which write data on the surface, Silica writes the data inside the glass into voxels, making the storage resistant to bit rot. Data can be written in layers, which helps with density and durability. Silica can support exabytes and zettabytes of archival data storage in the coming decades.
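The volumetric density implied by these figures can be checked with simple arithmetic, using only the capacity and platter dimensions quoted above:

```python
# Volumetric density implied by the quoted Silica figures (a sketch).
capacity_tb = 7.0                       # raw capacity per platter
volume_cm3 = 12.0 * 12.0 * 0.2          # 120 mm x 120 mm x 2 mm = 28.8 cm^3
gb_per_cm3 = capacity_tb * 1_000 / volume_cm3
print(f"~{gb_per_cm3:.0f} GB/cm^3")     # ~243 GB per cubic centimeter
```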
Silica is being developed by Microsoft, which plans to make it available for small-scale deployments by the end of the decade and ready for large-scale volume deployments at the beginning of the next decade.41 Silica will have a significant advantage over existing technologies, given its complete disaggregation model and 100 percent passive storage in glass, which mitigates the critical environmental concerns associated with managing vast quantities of data. This will enable Silica to achieve a lower total cost of ownership (TCO) across archive applications irrespective of capacity and workload requirements. Silica media requires no power while in storage, which reduces the operational cost of storage and enables long-term durability, avoiding media refresh cycles.
Silica is a new technology that is now in the prototype phase, with Microsoft solving engineering challenges and engaging in industry collaborations where necessary. Silica data storage has an estimated TRL of 6.
Cerabyte's Ceramic Nano Memory uses a glass substrate with a ceramic nano-coating as the data storage medium. Nanoscale writing and reading are enabled using laser or particle beam technology, leveraging QR code-like data patterns. Tempering during the production process creates a strong bond between the coating and substrate, giving the media superior resilience to extreme temperatures and adverse environmental conditions. Ceramic is harder and more temperature resistant than steel, and the media is electromagnetic pulse–proof as well as UV and radiation resistant, with excessive physical force being the only threat to guard against. The data are virtually retained forever, eliminating the need for data migration and offering an immutable data storage medium that does not suffer from bit rot or data corruption.
Ceramic Nano Memory uses an elaborate write/read system to write and read media assembled in cartridges. The cartridges can be selected and loaded into the system for writing or reading, similar to a tape library system. Ceramic Nano Memory thus fills a role similar to that of current magnetic tape storage systems, with similar objectives and system concepts, even though the storage mechanisms are quite different.
The initial data storage media size is 9 × 9 cm, with up to 100 media fitting within a cartridge that has the same form factor as LTO tape cartridges, enabling the use of the existing LTO tape automation ecosystem. Using femtosecond laser technology and a 100 nm bit size, initial densities of 125 GB per data medium and data center rack densities of up to 100 PB are achievable. This can be scaled to 1 TB+ per data medium with a 30 nm bit size, translating to data center rack densities of up to 1 EB using particle beam technology, which has been demonstrated in proof-of-concept studies. Ultimately,
__________________
41 P. Anderson, E.B. Aranas, Y. Assaf, R. Behrendt, R. Black, M. Caballero, P. Cameron, et al., 2023, “Project Silica: Towards Sustainable Cloud Archival Storage in Glass,” paper presented at the 29th ACM Symposium on Operating Systems Principles, https://www.microsoft.com/en-us/research/uploads/prod/2023/09/ProjectSilica-SOSP23.pdf.
helium ion beam technology holds promise to scale the bit size down to 3 nm, which would enable data center rack storage capacities of up to 100 EB by the middle of this century. The capacity scaling implied by these bit sizes is sketched below.
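The progression of these per-medium capacities follows from the roughly quadratic relationship between bit size and areal density. The sketch below ignores format and error-correction overheads, which the quoted figures may or may not include:

```python
# Areal density scales roughly with 1/(bit size)^2, so per-medium
# capacity grows quadratically as the bit size shrinks (a sketch).
base_gb, base_nm = 125.0, 100.0          # 125 GB per medium at 100 nm bits

for nm in (100.0, 30.0, 3.0):
    gb = base_gb * (base_nm / nm) ** 2
    print(f"{nm:>5.0f} nm bits -> ~{gb:,.0f} GB per medium")
# 100 nm -> 125 GB; 30 nm -> ~1,389 GB (1 TB+); 3 nm -> ~138,889 GB
```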
The data are written with a laser matrix of up to 2 million bits in one shot, derived from a single laser source using digital micromirror device technology. Data are then read back with a microscope combined with high-speed post-processing electronics. Cerabyte is targeting write and read speeds of 100 MB/s, with hopes of scaling to 1 GB/s+ within this decade. The high signal-to-noise ratio allows for precise reading, requiring minimal post-processing. Data can be specifically erased or sanitized through additional ablation of the ceramic layer. Access time for the media is dominated mainly by the choice of automation. The technology is backward or downward compatible because larger structures can always be read by more advanced microscopes.
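Given a 2-million-bit matrix per shot, the shot rates implied by these throughput targets are straightforward to estimate (illustrative arithmetic only, ignoring error-correction and formatting overheads):

```python
# Shot rate implied by the stated throughput targets (a sketch).
bits_per_shot = 2_000_000                         # one DMD laser matrix

for target_mb_s in (100, 1_000):                  # 100 MB/s and 1 GB/s
    shots_per_s = target_mb_s * 1e6 * 8 / bits_per_shot
    print(f"{target_mb_s:>5} MB/s -> {shots_per_s:,.0f} shots per second")
# 100 MB/s -> 400 shots/s; 1 GB/s -> 4,000 shots/s
```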
Ceramic Nano Memory requires minimal power to write because it simply ablates small portions of a 50–100 atom-thick absorbent ceramic layer; no power is required to retain the data on the medium; and the media is fully recyclable, making it an ideal sustainable long-term storage choice. Cerabyte taps into existing high-volume mass production supply chains, for example, for its glass media. Going forward, it expects to benefit from the foundational innovation power of the $1 trillion-plus semiconductor industry, adapting semiconductor manufacturing technologies for long-term storage applications as it did when leveraging maskless lithography technology for its initial write/read unit.
A first demo system showcasing the entire workflow from writing to storing to reading was developed using commercial off-the-shelf components and is currently operational. Cerabyte plans to conduct evaluations with research laboratories, including the European Organization for Nuclear Research (CERN), in 2024. The company plans to introduce a petabyte-scale data center rack system in 2024, with a roadmap scaling to 100 PB per rack, GB/s+ write and read speeds, access to first byte in less than 10 seconds, and a media cost of less than $1 per terabyte by the end of this decade. A multi-sourcing supply chain is intended to enable a new long-term data storage ecosystem that supports scaling to address the multibillion-dollar market opportunity ahead. Use cases range from accessible long-term storage leveraging robotic libraries to storage of "frozen" data in vaults or warehouses, providing cybersecure, air-gap-protected data storage solutions.
With prototypes beginning to be deployed in a relevant environment, ceramic data storage has a TRL of 6.
Despite the growth outlook for data storage, some industry analysts project a supply shortage or even a potential vertical market failure on the horizon.42
__________________
42 Trendfocus, 2022, “Sustainable Growth of Archive Storage Will Require New Secondary Storage Solutions,” Cupertino, CA.
The projected shortage is rooted in the expectation that current mainstream data storage technologies will not scale areal density as needed, as well as in resource availability, energy limitations, and sustainability and affordability considerations.
The fear of vertical market failure is driven by the slowing rate of improvements to HDD and tape storage and the need for storage providers to continue to compete on storage pricing.
This leads to smaller profit margins on storage, which disincentivizes manufacturers from assuming innovation risks when investing in new technology. With suppliers and customers unwilling to make significant investments in major technology advancements, current data storage technologies achieve modest gains that increasingly fall short of market demand.
At the same time, a small number of companies have come to dominate the zettabyte-scale archival storage market.43 This consolidation on both the supply and demand sides increases the risk of a single point of failure that could disrupt the entire archival data storage media marketplace. Suppliers and customers must find new business models to share the innovation risk reasonably, and the government should consider allocating funding to help drive innovation in technology areas deemed strategic.
Preceding sections of this REC have been primarily focused on storage technologies. However, while technology is a necessary consideration, it is not sufficient to ensure the IC’s goals of safely stored, searchable, accessible, and retrievable information. A survey addressing corporate “archiving in the wild”44 reports that a majority of privately held organizations in the United States, both for-profit and not-for-profit, expressed dissatisfaction with their information management and archiving strategies. Unsurprisingly, few of the organizations surveyed had hired library and/or archiving professionals to design, implement, and support a long-term corporate information strategy.
With that in mind, this REC emphasizes the need for ODNI to consider how technology and policy can beneficially reinforce each other to enable the preservation, accessibility, and security of critical national security information, above and beyond narrow consideration of the pros and cons of various storage technologies. Over the more than 50-year timespan indicated in the ODNI request, tens of thousands of employees are likely to create and interact with IC data and information. Long-term information management is therefore as much a problem of organizational policies and practices as it is a technological one and is a topic for which considerable guidance exists.
For example, the National Archives and Records Administration (NARA) provides extensive guidance regarding the long-term storage and management of federal records, while the National Institutes of Health and the National Library of Medicine (NLM) are responsible for stewarding the world’s largest compilation of biomedical data and information. It may be worth reaching out to NARA, the NLM, and similar federal-level institutions (e.g., the Smithsonian, the Library of Congress) to ascertain the “best-in-class” technology investment models and accompanying archival practices for data storage and archiving at the scale the IC is likely to require.
__________________
43 F. Moore, 2023, “Storage Outlook for the Zettabyte Era and the Rise of Tape for Secondary Storage,” paper presented at Ultrium LTO MGM75, Napa Valley, November 15–16, 2023.
44 N. Jinfang, 2023, “Corporate Archives in the Wild,” Global Knowledge, Memory and Communication 6/7, https://doi.org/10.1108/GKMC-12-2022-0283.