So much tape, so little time
The existential threat to videotape has long been a hot topic within the audiovisual archival community. Richard Wright gave this stark description of this magnetic media mass extinction:
“…there is just no time to wait any longer. Analogue videotape needs to be under a preservation plan, now, or it has little hope of survival…So that’s the message: do it now, as there is no expectation of affordable or even unaffordable digitisation beyond the next decade.”
Richard Wright, No Time To Wait 3. BFI Southbank, October 2018.
The twin issues can be summed up in the portmanteau ‘degralescence’, a blend of degradation and obsolescence:
First, wear and tear and imperfect storage conditions have in some cases led to irreparable videotape degradation over time, rendering the material unwatchable or irretrievable.
Second, replacement parts are no longer commercially available, and the technical skills and manuals needed to maintain playback equipment are rare and dwindling, rendering videotape essentially obsolete.
With less durable formats believed to have a high volume digitisation lifespan of less than ten years, the audiovisual preservation community are faced with a race against time.
The digitisation project in context
Videotape Digitisation is a major strand of the BFI’s Heritage 2022 lottery-funded five-year programme. Its aim is to digitise and digitally preserve 100,000 works from at-risk videotapes.
The tapes were selected from the UK’s regions and nations archives (RNAs) and from the BFI National Archive’s own collections.
The project encompasses a wide gamut of videotape formats, ranging from the first commercially available analogue format in the 1950s, the open reel 2-inch Quadruplex, to the twenty-first century HDCAM SR cassette. Each format brings its own complexities and calls for rare equipment and specialist technical expertise.
Following the rigorous process of selection, the archives’ tapes were distributed to a framework of trusted suppliers in the UK and mainland Europe for digitisation.
The tapes are digitised by the suppliers’ technicians and the video content is encoded as digital files that meet the project’s file specifications. These files are in turn supplied to the BFI National Archive for ingest and ongoing preservation in our Digital Preservation Infrastructure (DPI).
In DPI our workflows are managed by a suite of scheduled Bash and Python scripts built and maintained in-house by the Data and Digital Preservation department. The scripts harness free and open source software (F.O.S.S.) where possible. With its spirit of openness and community collaboration, open source mitigates some of the operational and financial risks inherent in more closed proprietary software.
In the Data and Digital Preservation department our concern lies solely with the bits and bytes; the 0s and 1s if you will. As a data specialist, my role within our team is principally to shepherd the digital files through various automated delivery, validation, documentation and segmentation steps, ensuring the files are ready for ingest to the preservation system.
We will walk through each step of the workflow in turn, but let us first consider our choice of preservation file.
FFV1 Matroska – small but perfectly formed
At the BFI National Archive the preferred preservation video file format is FFV1 Matroska. FFV1 is an ideal format for archival requirements. As a lossless compression codec, FFV1 yields files of around half the size of their uncompressed counterparts – though this varies depending on the type of video content. In real terms this means increased processing efficiency and a reduction in storage costs to the archive, without any loss of data. When dealing in petabytes of storage these savings are sizable.
FFV1 is also fully supported by the open source FFmpeg, the powerful toolkit for managing audiovisual files, and has other major benefits for archival preservation, including the ability it affords to video decoders to detect errors in the bitstream.
Once captured, video streams are wrapped in a Matroska (MKV) container, allowing for sidecar files to be attached if needed. FFV1 Matroska adheres to the Archive’s preference for open source, and it has been undergoing standardisation through the Internet Engineering Task Force’s Cellar working group.
In the formative years of the project the digitisation suppliers delivered uncompressed QuickTime V210 MOV files, and we managed the transcode to FFV1 Matroska using FFmpeg. Following a period of upskilling and R&D, most of the project’s digitisation suppliers can now manage this process directly. This development has seen a significant upturn in throughput for the project, as well as generating FFV1 Matroska awareness and skills in the commercial videotape digitisation sector.
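As an illustration of that transcode step, here is a minimal Python sketch that assembles an FFmpeg command for converting a V210 MOV to FFV1 Matroska. The encoder settings shown (FFV1 version 3, every frame a keyframe, 24 slices, slice CRCs) are assumptions based on commonly recommended archival settings, not the project's confirmed specification, and the function and file names are hypothetical.

```python
import shlex


def build_ffv1_command(source_mov, target_mkv):
    """Return an FFmpeg argument list for a lossless FFV1/Matroska transcode."""
    return [
        "ffmpeg",
        "-i", source_mov,
        "-map", "0:v",        # keep all video streams
        "-map", "0:a?",       # keep audio streams if any are present
        "-c:v", "ffv1",       # lossless FFV1 video
        "-level", "3",        # FFV1 version 3 (assumed setting)
        "-g", "1",            # every frame is a keyframe (assumed setting)
        "-slices", "24",      # slice count (assumed setting)
        "-slicecrc", "1",     # per-slice CRCs so decoders can detect errors
        "-c:a", "copy",       # pass audio through untouched
        target_mkv,
    ]


# Hypothetical file names, printed rather than executed.
print(shlex.join(build_ffv1_command("ABC_T00123.mov", "ABC_T00123.mkv")))
```

The command is built as a list so that it could be handed to `subprocess.run` without any shell quoting concerns.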
Big pipes for big data
Files are delivered over a high-performance, scalable transfer solution, which allows for multiple simultaneous transfer sessions, targeting a NAS (network-attached storage) server housed in the BFI National Archive in Hertfordshire.
As you can imagine, file delivery at this scale inevitably brings problems. Issues encountered include delivery session drop-outs and the question of how to manage an equitable distribution of network bandwidth. The transfer solution provides real time monitoring and management of both of these, as well as providing templates for personalisable report creation. As any seasoned project manager will attest, detailed reports are invaluable to efficient tracking, as well as for evidencing KPIs.
Through the transfer admin console, each supplier has access to their own Windows folder hierarchy within the NAS. This includes a transfer target folder and a file validation hot folder (or watched folder). This gives the supplier autonomy to control the volume, frequency and sizes of the batches they deliver. The supplier moves files to the validation folder only when they are satisfied the transfer is complete.
Video files are delivered with the following naming convention: ArchiveInitials_TapeIdentifier.mkv
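A naming convention like this lends itself to an automated check. The sketch below is one way it might look in Python. The exact character rules (letters only for the archive initials; letters, digits and hyphens for the tape identifier) are assumptions, as are the function name and the example filenames.

```python
import re

# Assumed pattern: archive initials, an underscore, a tape identifier, ".mkv".
# The real rules for each part may well differ.
DELIVERY_NAME = re.compile(
    r"^(?P<archive>[A-Za-z]+)_(?P<tape>[A-Za-z0-9-]+)\.mkv$"
)


def parse_delivery_name(filename):
    """Return (archive_initials, tape_identifier), or None if the name does not fit."""
    match = DELIVERY_NAME.match(filename)
    if match is None:
        return None
    return match.group("archive"), match.group("tape")


print(parse_delivery_name("ABC_T00123.mkv"))  # ('ABC', 'T00123')
print(parse_delivery_name("T00123.mkv"))      # None
```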
These identifiers are mapped in tracking spreadsheets available to all project stakeholders, to monitor progress and identify problems.
From the validation hot folder, files are picked up and enter our workflows.
These are not the file headers you’re looking for
Video files are subject to a two-step validation process, ensuring absolute compliance with project specs.
The first step utilises MediaArea’s compliance checker MediaConch. MediaConch has proven an indispensable F.O.S.S. tool for the project, examining video files’ technical metadata and comparing the data against user-defined specification policies. We use both MediaConch’s graphical user interface for manual checking of individual files and command-line interface for implementation into our scripting. Where the file conforms it passes, and progresses to the second stage. Should it fail to comply on even a solitary count, the file is invalid and pushed to a failures folder. Here the offending failure reasons are logged for investigation. This logging has proven invaluable in refining our specification in discussions with suppliers.
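To give a flavour of the command-line integration, here is a hedged Python sketch of a policy check. It assumes MediaConch is installed, that the policy is passed with the `-p` option, and that the default text report begins with `pass!` for a conforming file; the function names are hypothetical and this is not necessarily how our own scripts are written.

```python
import subprocess


def policy_passed(mediaconch_output):
    """Interpret a MediaConch text report: True only if it starts with 'pass!'."""
    return mediaconch_output.strip().startswith("pass!")


def check_file(video_path, policy_path):
    """Run MediaConch against one file and return (passed, raw_report)."""
    result = subprocess.run(
        ["mediaconch", "-p", policy_path, video_path],
        capture_output=True,
        text=True,
    )
    # The raw report is returned so that failure reasons can be logged.
    return policy_passed(result.stdout), result.stdout
```

Keeping the interpretation of the report separate from the subprocess call makes the pass or fail logic easy to test without MediaConch installed.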
The next validation step ensures that the FFV1 Matroska file in our possession is a bit-perfect representation of its uncompressed source file. The supplier provides two frameMD5 text files along with the video file: one for their source V210 MOV and a second for the supplied FFV1 MKV. Should these match exactly, we can be confident that the transcode was bit-perfect.
The frameMD5 is a simple text file generated by FFmpeg, containing a stream of MD5 checksums. One hash relates to one video or audio frame, not one per file. The MD5 is a digital fingerprint, from which even the smallest change to a byte of data will transform the alphanumeric hash returned entirely. MD5s are a widely used method of fixity checking. With fixity, in the preservation sense, we can be assured that a digital file has remained unchanged, or fixed.
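The fingerprint property is easy to demonstrate with Python's standard `hashlib` module. In this small sketch the 16 bytes stand in for a frame's worth of data (an illustrative assumption), and a single bit is flipped before hashing again.

```python
import hashlib

frame = bytes(range(16))
# Flip a single bit in the first byte and leave everything else unchanged.
altered = bytes([frame[0] ^ 1]) + frame[1:]

digest_original = hashlib.md5(frame).hexdigest()
digest_altered = hashlib.md5(altered).hexdigest()

print(digest_original)
print(digest_altered)
print(digest_original == digest_altered)  # False
```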
For our purposes, we use a simple Linux tool called diff to perform the comparison. Should the two frameMD5s differ the transcode is presumed to have been imperfect or lossy. Another benefit of the humble frameMD5 is in aiding error identification, namely pinpointing in which frame(s) the issue resides. Some more on FrameMD5 as an archivist’s tool.
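The comparison described above can also be sketched in Python, which makes the frame-pinpointing benefit explicit. This is an illustrative alternative to diff rather than the tool we use: it assumes header lines begin with `#` and ignores them, and it reports line positions within the report rather than timecodes. The function names are hypothetical.

```python
def frame_lines(framemd5_text):
    """Return the per-frame lines of a frameMD5 report, skipping '#' headers."""
    lines = (line.strip() for line in framemd5_text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def mismatched_frames(source_text, transcode_text):
    """Return 0-based positions of frame lines that differ between two reports."""
    source = frame_lines(source_text)
    transcode = frame_lines(transcode_text)
    differing = [
        index
        for index, (a, b) in enumerate(zip(source, transcode))
        if a != b
    ]
    # Lines present in only one report also count as mismatches.
    shorter, longer = sorted((len(source), len(transcode)))
    differing.extend(range(shorter, longer))
    return differing
```

An empty list means the two reports agree on every frame line; any other result lists where to start looking.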
Now that we are sure that the FFV1 Matroska files are compliant, the journeys of RNA files and BFI files diverge. RNA files progress to database documentation and BFI files move to segmentation.
Self-describing files (with some help)
Heritage 2022 was conceived primarily as a preservation project, and thus the descriptive metadata for the videotapes is minimal. We automate the creation of records in our database CID for the files being preserved, using Python scripts that gather info from the folder and filename (which we standardise rigorously).
The data standard employed for moving image records in CID is EN 15907, which uses a Work – Manifestation – Item hierarchy.
Title and archive name are gathered from the filename, and digitisation supplier is gathered from the parent folder. Basic technical metadata – including duration and file type – is extracted from the file. An Item record in CID is then created with this data, giving us a unique identifier that is essential in our subsequent workflows.
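A minimal sketch of this gathering step, assuming the filename splits at its first underscore and the supplier folder is the file's immediate parent. The function name, dictionary keys and example path are hypothetical, and neither the technical metadata extraction nor the creation of the CID record is shown.

```python
from pathlib import PurePosixPath


def describe_delivery(path):
    """Derive basic descriptive fields from a delivered file's path."""
    file_path = PurePosixPath(path)
    # Everything before the first underscore is taken as the archive name,
    # everything after it as the title (an assumption about the convention).
    archive, _, title = file_path.stem.partition("_")
    return {
        "archive": archive,
        "title": title,
        "supplier": file_path.parent.name,
    }


print(describe_delivery("/deliveries/supplier_a/ABC_T00123.mkv"))
# {'archive': 'ABC', 'title': 'T00123', 'supplier': 'supplier_a'}
```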
The documented file is then pushed through to the next workflow step.
To split, or not to split?
Following validation and documentation, files are moved from their supplier folder to segmentation folders in a separate NAS using another Linux application called rsync. Here scripts triage the correct course of action, dependent on data from CID.
One Item per videotape source: these cases get ingested as they are, no splitting workflow required.
Multiple Items per videotape source: this is more complex, and invokes an automated segmentation process. Each videotape Item record includes start and end timings, added to the database by the engineer who recorded the content to videotape, in some cases as long as 30 years ago. The splitting script first models the tape as a set of Items with start and end times, and adds a safety handle to each start and end to avoid data loss.
The script then uses FFmpeg’s stream copy function, supplying the start and end time of each section, to copy the video and audio streams to a new Matroska container file. It uses FrameMD5 to confirm that each section matches the source perfectly, and it names each section for the relevant Item record, using our filenaming convention – Identifier_PartofWhole.extension. For example: N_12345_01of01.mkv.
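The splitting logic can be sketched as follows. This is a simplified illustration under stated assumptions, not our production script: the ten-second handle, the two-digit zero padding (inferred from the single example above), the Item identifiers and the function names are all hypothetical, and the FrameMD5 verification of each section is not shown.

```python
import shlex


def add_handles(items, tape_duration, handle=10.0):
    """Pad each (start, end) pair in seconds, clamped to the tape's bounds."""
    return [
        (max(0.0, start - handle), min(tape_duration, end + handle))
        for start, end in items
    ]


def part_filename(identifier, part, whole, extension="mkv"):
    """Build a name in the Identifier_PartofWhole.extension convention."""
    return f"{identifier}_{part:02d}of{whole:02d}.{extension}"


def build_split_command(source, start, end, target):
    """Return an FFmpeg stream copy command for one section of the source."""
    return [
        "ffmpeg",
        "-ss", str(start),   # section start, in seconds
        "-to", str(end),     # section end, in seconds
        "-i", source,
        "-map", "0",         # carry every stream across
        "-c", "copy",        # stream copy: no re-encoding
        target,
    ]


# Two hypothetical Items recorded on one tape of 1510 seconds.
items = {"N_12345": (0.0, 600.0), "N_12346": (620.0, 1500.0)}
padded = add_handles(list(items.values()), tape_duration=1510.0)
for identifier, (start, end) in zip(items, padded):
    target = part_filename(identifier, 1, 1)
    print(shlex.join(build_split_command("source_tape.mkv", start, end, target)))
```

Because stream copy does not re-encode, the cut points depend on where keyframes fall in the source, which is one reason a handle on either side is a sensible precaution.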
Following this segmentation each new MKV file progresses to the ingest processing queue, and the source file gets deleted.
At this stage an H.264 MP4 viewing proxy file is created – again using FFmpeg – and ingested along with the FFV1 MKV. Full technical metadata for the MKV (extracted using MediaInfo) and checksum data are stored in a newly created digital media record.
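Preservation files are large, so a whole-file checksum is best computed in chunks rather than by reading the file into memory. The sketch below shows one way to do that with Python's standard library. The choice of MD5 as the default and the one-mebibyte chunk size are assumptions for illustration, and the function name is hypothetical.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read one chunk at a time."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() keeps calling read() until it returns the empty bytes sentinel.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting string is what would be stored alongside the technical metadata and recomputed later to confirm fixity.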
Although ingested into our Digital Preservation Infrastructure we are careful not to describe the file as ‘digitally preserved’ – instead we tend to say ‘under preservation’. Digital preservation is an ongoing process, never finished!
The state of playback
As with most major projects of this scale, Heritage 2022 videotape digitisation has not been without its ups-and-downs.
The pandemic caused considerable set-backs and delays. The BFI National Archive was impacted by some furloughing of staff, as were most of the digitisation suppliers. As we moved beyond Covid-19 lockdowns and all got used to working from home, Zoom and Skype replaced our offices and boardrooms as the ‘new normal’.
More suppliers have needed onboarding, as the complexities of dealing with ever more niche formats became apparent.
Despite these obstacles I feel we have a lot to be proud of. At time of writing, with a year of the project to go, we are approaching 90% of our target. Coupled with this, we are facilitating file return to the regions and nations archives, so that the preservation files are held by those archives as well as in the BFI National Archive.
The BFI National Archive has also made significant progress on developing the capacity to digitise videotape directly to FFV1 Matroska, after investing in development of FFV1-support with our video capture card manufacturer. The ability to skip any transcoding is a significant leap forward.
The BFI is on the cusp of rolling out our new video player platform, BFI Replay, to a consortium of public lending libraries throughout the United Kingdom. This will provide public access to thousands of hours of regional and national archive video content.
Lastly, in the spirit of openness engendered by World Digital Preservation Day, may I invite you to visit our BFI Data and Digital Preservation GitHub repositories. Here you will find access to many of the scripts which fuel the workflows described.