HILT trip report « Eira Tansey

HILT trip report

October 9, 2014

In early August, I attended the Humanities Intensive Learning and Teaching institute (HILT) at the University of Maryland. I attended the Digital Forensics course, and kept a daily trip report during my week. It took me a while to clean it up, and while there’s still some informal language (and possible tense-switches!), I didn’t want to procrastinate any longer on getting the report up.

DAY 1:

Our instructors: Kam Woods (UNC-Chapel Hill) and Porter Olsen (UMD PhD student, Community Lead for BitCurator).

Our group: seven archivists/librarians/students

The first part of our workshop was based loosely on the 2-day SAA workshop on Digital Forensics. We covered lecture content for the first few days, then moved on later in the week to hands-on digital forensics work with the disks we had brought.

We started by reviewing the general concepts behind digital forensics, and how they apply to archival workflows. Digital forensics originated in the law enforcement community as a way to obtain legally admissible digital evidence. The methods have been adopted by the digital archives community in order to establish the authenticity of a record and to demonstrate what interactions took place with a record over time. Digital forensics techniques are used to capture a larger package of files (e.g. a disk image) than what can be seen with “the naked eye” through the GUI. Capturing a disk image ensures that digital archivists have access to metadata, file structures, and hidden files critical to preserving the archival qualities of electronic records.

The second half of Day 1 dug into the many technical challenges associated with digital forensics. This was primarily about understanding how data is stored on disks. This took us on a whirlwind tour of thinking about different levels of digital representation (data as part of a group of digital objects, data as a single digital object, data in a GUI, data through the file system, data’s physical manifestation). As our instructors pointed out, even electronic records have some form of physical representation because of the method in which the data is recorded (e.g., pits on a CD).

Kam then gave us a long lesson on counting in binary. If you’ve only ever been used to counting in base 10 (e.g., each “place” in a number represents a power of 10 — 152 is 1 one hundred, 5 tens, and 2 ones), learning to count in binary feels like a real brain teaser at first. Data can be compressed by identifying the duplicative parts of the bitstream and making substitutions for their representation.
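For anyone who wants to try the binary lesson at a keyboard, here is a minimal Python sketch. It is not something from the course materials, and the function names are made up for illustration. It converts a base 10 number to binary by repeated division, lists the powers of 2 behind each 1 bit, and shows run-length encoding as one very simple way of substituting a shorter representation for the repetitive parts of a bitstream.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next bit, least significant first
        n //= 2
    return "".join(reversed(digits))


def place_values(bits: str) -> list:
    """List the power of 2 contributed by each 1 bit, reading left to right."""
    width = len(bits)
    return [2 ** (width - 1 - i) for i, bit in enumerate(bits) if bit == "1"]


def run_length_encode(bits: str) -> list:
    """Replace each run of repeated symbols with a (symbol, count) pair."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((bit, 1))              # start a new run
    return runs


bits = to_binary(152)             # the same 152 used in the base 10 example
powers = place_values(bits)       # which powers of 2 add up to 152
runs = run_length_encode(bits)    # a compact description of the repeated bits
```

Just as 152 in base 10 is 1 hundred, 5 tens, and 2 ones, the binary form here is the sum of the powers of 2 in `powers`. Real compression formats are far more sophisticated than run-length encoding, but the underlying idea of replacing duplication with a shorter description is the same.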

At the end of the day, we had a lecture by Tara McPherson, who spoke about some of the digital humanities projects on her radar. It was during this talk that it really struck me how much the digital humanities and librarian/archivist communities need to talk to each other. McPherson talked about her regret that they did not talk to librarians sooner when starting Vectors, and admonished the audience to work with their librarians more. This caused some discussion on Twitter. I guess I am still a bit shocked that someone starting a big project involving questions of open access and data management would not think to consult librarians. This says a lot about the gap between what we know we can provide and others’ willingness to use our services (or even their knowledge of them!).

DAY 2:

On Day 2 of our workshop, we finished off the lecture content. For a long time I’ve understood that taking disk images is a best practice for working with digital archives, but I’m not sure I could really articulate why until today. A disk image involves making an exact replica of the entire bitstream of a disk. Not having a computer science background, I never really thought about the ways in which data is stored on disks, much less what happens when you delete data.

I understand now that when data is deleted, it typically isn’t wiped clean automatically. When you “delete” something, that space on the disk is marked as available to be written over in the future, and the data sits there until it’s written over. This is probably CS101 for many, but was a big revelation for me. In addition to diving into the deep end on file system architecture and file allocation, we talked about the nature of files themselves. For example, I did not know that the name of a file is not inherently part of the file itself, but essentially a directory entry. This crash course in computer science was something that I don’t think many archivists and librarians are exposed to on a regular basis, as was clear from our group’s discussion on the way to lunch, when we kept trying to remember how we would explain slack space to someone else.
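A toy model can make deletion and slack space easier to explain to someone else. The Python sketch below is a deliberately simplified illustration, not how any real file system is implemented: `ToyDisk`, its one-cluster-per-file rule, and the file names are all invented for the example. It shows that a file name lives only in the directory, that deleting removes the directory entry but leaves the bytes in place, and that a shorter file written into the same cluster leaves remnants of the old one behind.

```python
class ToyDisk:
    """A simplified model: fixed-size clusters plus a separate directory table."""

    def __init__(self, clusters: int = 4, cluster_size: int = 8):
        self.cluster_size = cluster_size
        self.data = bytearray(clusters * cluster_size)  # the raw bytes on the "disk"
        self.directory = {}                             # file name -> (cluster, length)
        self.free = list(range(clusters))               # clusters available for writing

    def write(self, name: str, content: bytes) -> None:
        if len(content) > self.cluster_size:
            raise ValueError("toy model: each file must fit in one cluster")
        cluster = self.free.pop(0)
        start = cluster * self.cluster_size
        # Only len(content) bytes are overwritten; anything after them is left alone.
        self.data[start:start + len(content)] = content
        self.directory[name] = (cluster, len(content))

    def delete(self, name: str) -> None:
        cluster, _ = self.directory.pop(name)  # forget the name and location
        self.free.insert(0, cluster)           # mark the space reusable; bytes untouched

    def read(self, name: str) -> bytes:
        cluster, length = self.directory[name]
        start = cluster * self.cluster_size
        return bytes(self.data[start:start + length])

    def slack(self, name: str) -> bytes:
        """Bytes between the end of the file and the end of its cluster."""
        cluster, length = self.directory[name]
        start = cluster * self.cluster_size
        return bytes(self.data[start + length:start + self.cluster_size])


disk = ToyDisk()
disk.write("letter.txt", b"SECRETS!")
disk.delete("letter.txt")
still_there = b"SECRETS!" in bytes(disk.data)  # "deleted", yet the bytes remain
disk.write("memo.txt", b"hi")                  # the freed cluster gets reused
leftover = disk.slack("memo.txt")              # what survives of the old file
```

Copying `memo.txt` through the GUI would only ever give you `hi`. The leftover bytes are only visible to something that reads the whole disk, which is the case for imaging.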

When you begin learning about what’s really on a disk, whether it’s a USB thumb drive or a laptop’s hard drive, you quickly learn that what is seen through the GUI is only a small portion of what’s actually there. This is why digital archivists have widely embraced making disk images: much of the data needed to prove the authenticity of files over time, and to support the metadata and preservation needs of the material, is simply not visible from the GUI. A disk image captures the information that normally sits under the hood.

A point that our instructors made over and over was that while taking a complete disk image might not be necessary for all projects all the time, it is the best way to capture all the essential information of a potentially archival nature one might want now or in the future from digital materials. In addition, using digital forensics methods allows interaction with material without inadvertently writing over it. We heard many examples of how access/modification dates, file names, and file content can be significantly altered when directly interacting with materials. When using digital forensics methods, one would use some kind of write-blocker, whether hardware, software, or both. A write-blocker is a physical device or piece of software that allows the archivist to read the source material, but not to write over it.

The best analogy our instructor shared was this — simply copying and pasting files off a drive without a disk image would be like accepting a box of photocopies of manuscripts instead of getting the original documents. This means that when we do disk imaging, we are not making “copies” per se, we are getting the originals — not some kind of partial material. We were encouraged to image materials first, and do analysis later. This really flips the archival process of appraisal and accessioning on its head — appraisal is usually done prior to accessioning in traditional archival workflows. In a model of digital forensics, the archivist makes an early appraisal decision when deciding which disks are worthy of imaging, imaging (i.e. accessioning) the data, and then doing additional appraisal post-accession to decide how to handle various files comprising the disk image.

We ended the day by taking an early journey with BitCurator and learning about forensic disk image file formats (AFF is being phased out; E01 is the most commonly used, and while it remains proprietary, it has been reverse-engineered). We made some disk images and learned how to compare checksums.
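Comparing checksums is easy to reproduce outside of BitCurator. The sketch below uses only Python’s standard `hashlib` module and is an illustration rather than the workflow we used in class; the helper names, the file names, and the choice of SHA-256 are assumptions made for the example. It hashes files in chunks so that a large image never has to fit in memory, then checks a faithful copy and a copy with a single changed byte against the original.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(path_a: str, path_b: str, algorithm: str = "sha256") -> bool:
    """True when both files produce the same digest."""
    return file_checksum(path_a, algorithm) == file_checksum(path_b, algorithm)


# Demonstration with stand-in "images" written to a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "disk.img")
    copy = os.path.join(workdir, "copy.img")
    altered = os.path.join(workdir, "altered.img")

    payload = bytes(range(256)) * 16
    contents = {
        original: payload,
        copy: payload,                      # a bit-for-bit copy
        altered: payload[:-1] + b"\x00",    # the last byte differs
    }
    for path, content in contents.items():
        with open(path, "wb") as handle:
            handle.write(content)

    copy_ok = checksums_match(original, copy)
    altered_ok = checksums_match(original, altered)
```

A matching checksum is evidence that the copy is identical to the original. A single changed byte anywhere in the file produces a different digest.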

HILT Ignite was the final event of the day. It was similar to a set of lightning talks. Here were some of the presentations:

French pamphlets translations and digitization at UMD College Park, funded by http://mith.umd.edu/research/project/digital-humanities-incubator/

@keenera of Northwestern on Digital Apparatus for Renaissance texts

Nabil Kashyap of Swarthmore on translation of Russian texts and creating middleware to visualize translation activities

Arden Kirkland (@ardeninred) of Vassar on various projects — she started with Vassar costume collection — she’s building CostumeCore for a metadata profile for historic clothing http://www.ardenkirkland.com/costumecore/

George Williams @georgeonline on Accessibility in digital environments — will be having a series of workshops, the next two are in Nebraska and Atlanta http://www.accessiblefuture.org/

Jim McGrath on the Our Marathon Boston marathon “crowdsourced archive” — will eventually become part of Northeastern’s SpecColl. Uses Neatline, add-on tool for Omeka http://marathon.neu.edu/bca This looks super duper awesome too — http://www.northeastern.edu/nulab/

Chip Oscarson from BYU — ecological networks and linkages — “topic modeling” — this seems to be a form of text cluster analysis (to what degree are words and terminology showing up in text?)

Priscilla Pena Ovalle of University of Oregon: idea for a pedagogical tool on hair (how does the appearance of a person’s hair influence their agency in media depictions?). SO AWESOME! This was in the idea stage; she is considering a phone or website app

Raffaele Viglianti from UMD “Performing the Digital Edition” — performing a digital edition of a music score — scores that can listen to you to figure out where to turn the page automatically. Music Encoding Initiative — like TEI. Plotting breath marks over a digital score.

DAY 3:

Today we worked with bulk_extractor to get a glimpse of what was going on inside our disk images. By doing this, we were able to see what sorts of URLs, emails, PII, or other sensitive information might be in our files. In a large disk image, these things turn up with surprising frequency. Knowing where on the disk this information resides allows archivists to make redaction or embargo decisions regarding content or files that might otherwise be made public.
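The general idea of scanning raw bytes for sensitive strings and recording where they sit can be sketched in a few lines of Python. To be clear, this is not bulk_extractor and does not reproduce its scanners or output format. The regular expressions are rough, made-up patterns for illustration, and the sample bytes stand in for a disk image.

```python
import re

# Rough illustrative patterns, applied directly to bytes rather than decoded text.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
URL = re.compile(rb"https?://[^\s<>\x00]+")


def find_features(image: bytes) -> list:
    """Return (byte offset, kind, text) for each match, ordered by offset."""
    hits = []
    for kind, pattern in (("email", EMAIL), ("url", URL)):
        for match in pattern.finditer(image):
            text = match.group().decode("ascii", "replace")
            hits.append((match.start(), kind, text))
    return sorted(hits)


# A stand-in for a disk image: readable strings surrounded by non-text bytes.
sample = (
    b"\x00" * 16
    + b"write to jane@example.org"
    + b"\xff" * 8
    + b"see http://example.com/a"
    + b"\x00" * 4
)
features = find_features(sample)
```

Because each hit carries a byte offset, it can be traced back to a location on the disk, which is what makes redaction or embargo decisions possible.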

This probably says a lot about my priorities while traveling away from home, but one of the highlights of the week was eating at the Maryland Food Cooperative — aka the cooperatively owned sandwich shop in the UMD student center. It was everything I hoped it would be — funky and delicious. Yum. Definitely check them out if you’re ever in College Park.

During Wednesday afternoon, we attended a number of field trips to check out local cultural heritage organizations. I went to the Holocaust Museum — during Q&A with the curators, I was able to ask a question that has often been on my mind at various points during my career, which is how cultural heritage professionals deal with intensely disturbing and traumatic materials encountered in their work. At my last institution, I often handled plantation records that had evidence of profoundly violent things done to enslaved people and their families, as well as vivid descriptions of scenes during the Civil War. More than once while working with these materials I had horrifying nightmares. I’ve always wondered how others handle these issues, and I’m very grateful that the USHMM staff shared their thoughts about this with me.

DAY 4:

This was the day we really got in the weeds and put all the legacy media we brought with us to work. We started off by thinking about a very common question digital archivists might encounter: if we get a big stack of floppy disks, where do we start? Floppy disks come in a variety of formats, sizes, and encodings, and were written by a variety of operating systems. There is no single source that can tell you exactly what you have in hand, so it’s important to look for whatever clues are available on the disk itself. Wikipedia has an extensive list of disk formats, which is critical information when making disk images. Some of the forensics tools require the user to tell them how to read the disk, meaning you must have the disk information, including capacity, number of tracks, density, and so on.
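The disk information hangs together arithmetically: capacity is tracks times sides times sectors per track times bytes per sector. The Python sketch below works that out for two common IBM PC layouts. The class and helper names are invented for illustration, and other platforms used different geometries, so treat this as a sanity check rather than a way to identify an unknown disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloppyGeometry:
    tracks: int                  # tracks per side
    sides: int
    sectors_per_track: int
    bytes_per_sector: int = 512

    def capacity_bytes(self) -> int:
        return self.tracks * self.sides * self.sectors_per_track * self.bytes_per_sector


# Two common IBM PC layouts.
PC_525_DD = FloppyGeometry(tracks=40, sides=2, sectors_per_track=9)   # "360 KB" 5.25-inch
PC_35_HD = FloppyGeometry(tracks=80, sides=2, sectors_per_track=18)   # "1.44 MB" 3.5-inch


def matches_geometry(image_size: int, geometry: FloppyGeometry) -> bool:
    """A complete image should be exactly as large as the geometry predicts."""
    return image_size == geometry.capacity_bytes()
```

For example, a finished image of a PC-formatted high-density 3.5” disk should be exactly `PC_35_HD.capacity_bytes()` bytes long. An image of a different size suggests the wrong geometry was chosen or the read was incomplete.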

Much of this day we spent in the UMD MITH lab, in the basement of Hornbake Library. MITH is a great space, and our group set up at several computers containing all manner of drives. The middle of this article on building a forensics workstation has a good picture of the set-up. I brought a massive bag of legacy media, and a partner and I tried imaging the following items: a 5.25” floppy with a finding aid, several 3.5” floppies, an optical disk (a CD-R), and a USB drive. CD-Rs are deceptively easy to image, though I didn’t appreciate how susceptible they are to damage until I tried to image one: thousands of sectors were identified as damaged, though the disk was still readable. Where necessary, we used write blockers to prevent accidentally writing over the data. It is still pretty easy to image most 3.5” floppy disks (new and cheap 3.5” floppy to USB drives are available online), but 5.25” drives are no longer made. This means the archivist must buy a used one, and many libraries use a device called the FC5025, a controller that allows connection of a 5.25” floppy drive to USB. This will likely reveal my age, but I had never handled an 8” floppy until today, though apparently they’re still quite popular for US nuclear capability. Unfortunately, finding a way to image these has proven a significant challenge for the profession.

The Special Collections at University of Maryland has a FRED machine that we also visited, though we did not use it. FRED machines are widely used for law enforcement purposes, though an increasing number of cultural heritage institutions are buying them. Although they do many cool things, the machine still requires purchase of external floppy drives, and FRED machines are expensive. Many institutions choose to create their forensics workstations iteratively by starting with a DIY workstation, and adding on components gradually with an eventual purchase of a FRED if the situation warrants it.

After our class, I spoke with Trevor Munoz from MITH about their efforts at beginning a Digital Humanities incubator that targeted librarians for its first rounds of programming. A major topic in general at the conference has been how digital humanities projects factor into RPT criteria. I believe a related concern for librarians is how they acquire new skills, and receive the required support, to be successful in these new areas of digital work.

DAY 5:

On our last day, we reviewed a few of the additional tools in BitCurator. One of the pretty cool ones that can be used from the command line is sdhash, which essentially compares the content of two files that have different hashes to assess the similarity between the file contents.
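The difference between an ordinary checksum and a similarity measure can be shown with a small Python sketch. This is not how sdhash computes its scores. It swaps in a much simpler technique, Jaccard similarity over overlapping byte sequences (“shingles”), purely to illustrate the idea, and the function names and sample text are made up.

```python
import hashlib


def shingles(data: bytes, size: int = 8) -> set:
    """The set of every overlapping run of `size` bytes in the data."""
    return {data[i:i + size] for i in range(len(data) - size + 1)}


def similarity(a: bytes, b: bytes, size: int = 8) -> float:
    """Jaccard similarity of the shingle sets: 0.0 (nothing shared) to 1.0 (same set)."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


draft = b"Finding aid for the papers of an imaginary donor, 1950-1990. " * 4
revised = draft.replace(b"1990", b"1991", 1)   # change one character in one place

# A cryptographic hash only answers "identical or not".
same_digest = hashlib.sha256(draft).hexdigest() == hashlib.sha256(revised).hexdigest()

# A similarity score says how much content the two versions share.
score = similarity(draft, revised)
```

The two digests differ completely even though only one character changed, while the similarity score stays close to 1.0. That is the kind of question a tool like sdhash is meant to answer: not “are these the same file?” but “how much do these files have in common?”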

One of the highlights of this day was our discussion with Matt Kirschenbaum. Kirschenbaum is a UMD faculty member, Associate Director of MITH, and co-PI on BitCurator. Kirschenbaum has done significant work with digital forensics in born-digital archives. We discussed the nature of this changing set of skills, and how access to digital archives may change over time. I really respect the connections Matt has actively cultivated among archivists, and it was a great way to cap off our coursework.

At the final event of the week, all the groups were asked to prepare a brief (5 minute!) show and tell.

Concluding thoughts:

I went into this course not feeling very confident about the hands-on work associated with digital forensics. As a result of the several days we spent together, I feel much more comfortable returning to my institution and doing some early groundwork on recovering material. Of course, this means that I will be assembling a proposal to build a DIY digital forensics workstation. Even though this course exposed me to a lot, it’s clear I still have a lot to learn in this area. We covered so much of the “first steps” work of accessioning materials, but it seems there are still a lot of open questions about the access aspect of born-digital archives.

Overall, the entire HILT MITH experience was phenomenal. Many thanks to our instructors, Kam Woods and Porter Olsen for a superb job. The folks at HILT put together a hell of a week for us, and next year it sounds like the show will be on the road towards my neck of the woods — look for the next round to be held at IUPUI. This was a wonderful event that had many librarians and archivists in attendance — I hope y’all will consider putting it in your conference and travel budget requests for the upcoming year.

Bibliography

The recommended pre-readings for our course

Garfinkel, Simson, and David Cox. “Finding and Archiving the Internet Footprint.” Paper presented at the First Digital Lives Research Conference: Personal Digital Archives for the 21st Century, London, UK, February 9-11, 2009.
Kirschenbaum, Matthew G., Richard Ovenden, and Gabriela Redwine. “Digital Forensics and Born-Digital Content in Cultural Heritage Collections.” Washington, DC: Council on Library and Information Resources, 2010. http://www.clir.org/pubs/reports/pub149/pub149.pdf
Lee, Christopher A., Kam Woods, Matthew Kirschenbaum, and Alexandra Chassanoff. “From Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting Institutions.” September 30, 2013. http://www.bitcurator.net/docs/bitstreams-to-heritage.pdf
Ross, Seamus, and Ann Gow. “Digital Archaeology: Rescuing Neglected and Damaged Data Resources.” London: British Library, 1999. http://eprints.erpanet.org/47/
Woods, Kam, Christopher A. Lee, and Simson Garfinkel. “Extending Digital Repository Architectures to Support Disk Image Preservation and Access.” In JCDL ’11: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, 57-66. New York, NY: ACM Press, 2011.

Readings that were directly or indirectly referenced during the Digital Forensics workshop — some recommended, some I found Googling around on my own

For counting in binary: http://en.wikipedia.org/wiki/Hexadecimal and http://en.wikipedia.org/wiki/Binary_system_(numeral)
The way data is stored on disks was eye-opening for me; this is a light overview of what’s going on: http://en.wikipedia.org/wiki/Fragmentation_(computing)
For thinking about some of my challenges involved with setting up a forensics workstation: http://www.spellboundblog.com/2011/07/25/rescuing-5-25-floppy-disks-from-oblivion/ and http://www.bitcurator.net/2013/08/02/building-a-digital-curation-workstation-with-bitcurator-update/ and http://practicaltechnologyforarchives.org/issue2_goldman/ and http://rbm.acrl.org/content/12/1/11.full.pdf+html?sid=dd598f8a-3823-4b29-82d5-0fbe83be5997
Martin J. Gengenbach “The Way We Do it Here”: Mapping Digital Forensics Workflows in Collecting Institutions, 2012: http://digitalcurationexchange.org/system/files/gengenbach-forensic-workflows-2012.pdf
Matthew J. Farrell. Born-Digital Objects in the Deeds of Gift of Collecting Repositories: A Latent Content Analysis, 2012: https://cdr.lib.unc.edu/indexablecontent/uuid:385c4fd9-a403-4ba3-85ac-2ea128400ddb
Simson Garfinkel came up over and over again in our course. His personal webpage looks like an awesome resource http://simson.net/page/Main_Page and I might read Database Nation eventually: http://monoskop.org/images/3/3f/Garfinkel_Simson_Database_Nation_The_Death_of_Privacy_in_the_21st_Century.pdf
This talks about making raw dd images from FTK, but similar to our workflow for making E01 images http://shoestringforensics.wordpress.com/2009/09/11/imaging-using-ftk-imager/
CDs not as stable as we thought: http://www.theatlantic.com/technology/archive/2014/05/the-library-of-congress-wants-to-destroy-your-old-cds-for-science/370804/
Repository of software packages and hash data sets, can be used to evaluate stuff against “known” data that you may not want: http://en.wikipedia.org/wiki/National_Software_Reference_Library
Everything you ever wanted to know about floppy disks: http://en.wikipedia.org/wiki/Floppy_disk and http://en.wikipedia.org/wiki/Floppy_disk_format and http://en.wikipedia.org/wiki/List_of_floppy_disk_formats and http://archiveteam.org/index.php?title=Rescuing_Floppy_Disks
I’m really down with the work in general at UMD MITH. Here’s a fun Reddit AMA with Trevor Munoz of MITH http://www.reddit.com/r/IAmA/comments/22mbrj/we_are_an_npr_librarian_and_a_digital/
Digital Forensics, hacking, and social justice: http://en.wikipedia.org/wiki/FinFisher
Thinking about similarity of hashing: http://en.wikipedia.org/wiki/Bloom_filter and http://roussev.net/journals.html who developed sdHash http://roussev.net/sdhash/sdhash.html

Things I got to through internet rabbit holes that have me thinking about the intersection of DH and archives/libraries

Posner, Miriam. (2013). No Half Measures: Overcoming Common Challenges to Doing Digital Humanities in the Library. Journal of Library Administration, 53(1), 43 – 52. UCLA: 10.1080/01930826.2013.756694. Retrieved from: http://www.escholarship.org/uc/item/6q2625np
Nowviskie, Bethany. (2013). Skunks in the Library: a Path to Production for Scholarly R&D. Journal of Library Administration, 53(1), 53 – 66. http://libra.virginia.edu/catalog/libra-oa:2745
Trevor Muñoz. (2012). Digital humanities in the library isn’t a service. http://trevormunoz.com/notebook/2012/08/19/doing-dh-in-the-library.html
Posner, Miriam. (2012). What are some challenges to doing DH in the library? http://miriamposner.com/blog/what-are-some-challenges-to-doing-dh-in-the-library/

Interesting things I saw on the #hilt2014 Twitter feed

Succession planning project management: http://journal.code4lib.org/articles/6393
Data Curation as Publishing: http://journalofdigitalhumanities.org/2-3/data-curation-as-publishing-for-the-digital-humanities/ and http://trevormunoz.com/notebook/2013/05/30/data-curation-as-publishing-for-dh.html
Educational BaseCamp accounts for educators: https://basecamp.com/teachers
