Smithsonian Collections Blog

Highlighting the hidden treasures from over 2 million collections


Tuesday, December 17, 2019

The Smithsonian’s Journey of Computerized Library and Archives (1994-2009)

Read Part I: The First Integrated Library System


PART II: STEPPING OUTSIDE OF THE BOX

Jump starting and Supporting Digitization


In 1994, OIRM SIRIS began a new venture in the field of library and archives automation: supporting online media files. At the time, the Smithsonian had several Collection Information Systems, including the library’s system, but no catalog records were linked to image or video files, which prevented public access to those media.

One of the NAA images digitized during early digitization
With internet access newly in place, we modified the configuration of a new WebPac application so that images could display with catalog records online, demonstrating the technical potential to library and archives staff. This new and exciting feature required Smithsonian staff to digitize images and then link the image files to catalog records by referencing the image URL in the MARC 856 field. It was a challenge to get started because no one knew how this would work, so we had to lead by example.
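The linking step described above can be illustrated with a minimal Python sketch. It is not the original WebPac configuration: the record is modeled as a plain dictionary, and the title, URL, and helper names (`add_image_link`, `image_urls`) are invented for illustration. The only parts taken from the MARC 21 format are the 856 tag (Electronic Location and Access), subfield $u for the URL, and subfield $z for a public note.

```python
# Minimal sketch: a MARC-like record held as a plain dict.
# The record content, URL, and function names are hypothetical.

def add_image_link(record, url, note=None):
    """Append an 856 field that points at a digitized image."""
    subfields = [("u", url)]            # $u = Uniform Resource Identifier
    if note:
        subfields.append(("z", note))   # $z = public note
    # Indicator 1 = "4" (HTTP access), indicator 2 = "0" (the resource itself)
    field = {"tag": "856", "ind1": "4", "ind2": "0", "subfields": subfields}
    record.setdefault("fields", []).append(field)
    return record

def image_urls(record):
    """Return every URL stored in the record's 856 $u subfields."""
    return [value
            for field in record.get("fields", [])
            if field["tag"] == "856"
            for code, value in field["subfields"]
            if code == "u"]

record = {"fields": [{"tag": "245", "ind1": "0", "ind2": "0",
                      "subfields": [("a", "Portrait of a sculptor")]}]}
add_image_link(record, "http://example.org/images/0001.jpg",
               note="Digitized photograph")
print(image_urls(record))
```

An online catalog that understands this convention can read the 856 $u value and display the image next to the rest of the record.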

By 1995, OIT (Office of Information Technology, successor to OIRM) had purchased a couple of image scanners. SIRIS helped the NMAA Art Inventory project digitize about 200 photographs of sculptures and linked them to their catalog records. The first Smithsonian public online system that could display object records with images was born! In 1996, at the Smithsonian Institution’s 150th Anniversary Event on the National Mall, we showcased the brand-new functionality to the public. The online demonstration, using the Netscape Navigator web browser, even included a few cephalopod video clips from NMNH. The excitement over the new functionality energized archives staff. Although more archives professionals accelerated their image digitization efforts, most of them did not have the resources to host images online. The digitized image files accumulated on hard drives, CD-ROMs, and laser discs. Many of these storage devices sat on bookshelves or under desks; their contents were not accessible to the public online.

In 1996, OIT SIRIS created the first Smithsonian central “Multi-Media Server” that hosted online images for SIRIS members. This service included online storage and web server support, image maintenance support, digitization training programs, image-linking training, and usage statistics reporting. By 2014, this multimedia server hosted over 900,000 image, video and sound files for 18 SI units. Jim Felley (SIRIS senior system administrator) provided critical support and management of the service until it was retired in 2014, after all images were migrated to a new Digital Asset Management System (DAMS).


Leadership in Data Standard and Vocabulary Control

In 1999, the Smithsonian library system was upgraded to the Ameritech (now SirsiDynix) Horizon system. This new system came with flexible system-configuration capability and a strong authority (vocabulary) control function. Most importantly, it allowed the Smithsonian to establish many locally defined fields, link related records, and build specialized indexes that met the Smithsonian’s nontraditional needs. SIRIS had grown to support eight databases: Library, Archives, Art Inventories, SAAM Photo Archives, Art Exhibition, Research Bibliography, History of Smithsonian, and Directory of Airplanes.
Over the years, 14 archives, 20 library branches and several museum research departments depended on SIRIS to perform a wide variety of collection management functions. More and more data sets were added to the eight databases using custom programming and data importing. By 2006, nearly 50% of the 955,000 non-library records had been transferred from local sources such as DBASE, MS Access, Excel, C-Quest, FileMaker Pro, WordPerfect, and plain text files.
Library of Congress Subject Headings Catalog

Mapping these different datasets into the MARC format was a big challenge, but dealing with data inconsistencies was an even bigger one! Much of the data from these assorted databases lacked consistency from record to record and across datasets, and very few datasets followed national data standards. So, our priority shifted to cleaning up the records created by the staff at 14 Smithsonian archives. Our goal was to follow national data standards and cataloging guidelines. This approach proved to be a wise decision on multiple levels. First, we avoided internal disagreements as to how to standardize the data among several archival units. Second, we were able to hire professionals whose knowledge was applicable to our goal. Finally, standardizing the data in different databases across the Smithsonian made building the Smithsonian-wide Collections Search Center platform much easier, although we did not know the benefit of this final point at the time.

We used a few main approaches that were very productive:
  • Conducted extensive data analysis, created reports using thousands of programming scripts, looked for exceptions and patterns in the data, and listed them for catalogers to review or change. This approach took advantage of both human intelligence and computer speed to handle complex data issues.
  • Conducted several thousand global data modifications based on catalogers’ requests. This allowed us to change thousands of records at a time, which sped up progress and improved efficiency.
  • Prioritized access points and authority records for Names, Subjects, Form & Genre, Geographical, and Culture terms, which greatly improved searchability and discovery.
  • Sent authority records to professional vendors for authority heading matching, had incorrect terms flipped to Library of Congress standards, and reloaded the records into our system. While expensive, this provided high-quality data.
  • Conducted regular cataloging and metadata training and encouraged collaboration among cataloging units to maintain high-quality cataloging practices. The regular face-to-face meetings reinforced the importance of data quality and improved interactions among staff across the Institution.
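The first two approaches in the list above, exception reporting followed by cataloger-approved global changes, can be sketched in a few lines of Python. This is only an illustration of the general technique, not the scripts that were actually used: the sample headings, the crude `normalize` matching key, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Invented sample of name headings as they might appear across records.
headings = [
    "Smith, John, 1900-1980",
    "Smith, John, 1900-1980",
    "Smith, John, 1900-1980",
    "smith, john 1900-1980",
    "Smith, J.",
]

def normalize(heading):
    """Crude matching key: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch for ch in heading.lower()
                   if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def exception_report(headings):
    """Group headings by matching key; keep keys with more than one spelling."""
    groups = {}
    for heading in headings:
        groups.setdefault(normalize(heading), Counter())[heading] += 1
    return {key: forms for key, forms in groups.items() if len(forms) > 1}

def global_modify(headings, approved):
    """Apply cataloger-approved replacements to every matching heading."""
    return [approved.get(heading, heading) for heading in headings]

# Step 1: the computer lists the suspicious variants for human review.
report = exception_report(headings)
for key, forms in report.items():
    print(key, dict(forms))

# Step 2: a cataloger approves a replacement, which is applied everywhere.
approved = {"smith, john 1900-1980": "Smith, John, 1900-1980"}
cleaned = global_modify(headings, approved)
```

The report never changes data by itself. It only narrows thousands of records down to the handful of variants a cataloger needs to judge, and "Smith, J." is left alone because no rule can tell whether it is the same person.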
For more than ten years, we continued to transform and standardize metadata within the eight Horizon databases. We established methodologies for handling chaotic situations and developed creative solutions to problems. The result of our persistent efforts became the solid foundation for the next phase: creating a centralized searching system for the Smithsonian Institution and fulfilling the goal and wish from 1980.


Pushing Beyond the Norm and Changing Culture - First Large-Scale Library, Archives and Museum Online Search Center

By 2005, the Smithsonian’s library, archives and museum collection records had been growing rapidly across the Institution thanks to the advancement and wide use of database technology. Large numbers of computer records were created and maintained in highly specialized commercial and local database systems. However, collection records were scattered across more than 100 disparate websites, which made them difficult for the public to use.

In 2006, OCIO LASSB (Library Archives System Support Branch, successor of SIRIS) began to design a one-stop discovery platform that would include all Smithsonian collection data regardless of data format, professional discipline or data-owning organization. We decided that this Cross Search Center should support simple keyword searching and be able to filter search results by data categories such as Name, Topic, Place, Culture, Date, and Media type. Since no one had done this at a large scale before, we had to innovate and find the best solutions to problems as they arose.
We started with the eight SIRIS Horizon datasets. Our first challenge was to address the diverse data types and find ways to make the data consistent in the Cross Search Center. We reviewed technology platforms, data standard options and data mapping possibilities. We identified common data elements in records from across different disciplines, including art, science, culture and history, and defined a new metadata format that supports a wide range of material and object types (e.g., books, journals, bibliographies, photographs, art objects, and archival materials).

Andrew Gunther (senior software developer) took the lead in selecting an open-source technology platform (Solr) that supported easy searching, faceted filtering and fast indexing. The platform also offered automatic stemming for word matching, configurable relevancy ranking of search results, positive and negative limit options, and scalability for large data sets.
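To make these features concrete, here is a hedged Python sketch that builds a Solr search request combining a keyword query, facet counts, and positive and negative limits. The `q`, `fq`, `facet`, `facet.field` and `wt` parameters are standard Solr query parameters; the host, the core name (`collections`), and the field names (`name`, `topic`, `place`, `media_type`) are assumptions for illustration and not the Smithsonian's actual schema. Stemming and relevancy ranking are configured on the Solr side, so they do not appear in the request.

```python
from urllib.parse import urlencode

def build_search_url(base_url, keywords, include=None, exclude=None,
                     facet_fields=("name", "topic", "place")):
    """Build a Solr select URL with facets and positive/negative filters."""
    params = [("q", keywords), ("facet", "true"), ("wt", "json")]
    # Ask Solr to return value counts for each facet field.
    for field in facet_fields:
        params.append(("facet.field", field))
    # Positive limit: keep only records whose field matches the value.
    for field, value in (include or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    # Negative limit: a leading "-" excludes matching records.
    for field, value in (exclude or {}).items():
        params.append(("fq", f'-{field}:"{value}"'))
    return base_url + "?" + urlencode(params)

url = build_search_url(
    "http://localhost:8983/solr/collections/select",  # hypothetical core
    "cephalopod",
    include={"place": "Pacific Ocean"},
    exclude={"media_type": "Video"},
)
print(url)
```

Each `fq` filter narrows the result set without affecting relevancy scoring, which is what makes clickable facet limits cheap to add and remove.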

Insisting on consistent metadata standards was the key to our success. After evaluating several existing metadata standards (MARC, VRA, MEDS, CDWA Lite, CCO), we identified the most common data elements and created the Smithsonian Index Metadata Model. George Bowman (senior system administrator) took the lead in designing this flexible metadata model, which accommodated many specific use cases. The LASSB team consulted the OCLC FAST (Faceted Application of Subject Terminology) schema and used it to break up the LCSH subject headings in our MARC records by subfield, thus allowing faceted searching and filtering in the Cross Search Center.
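Breaking a subject heading up by subfield can be sketched as follows. The subfield codes are the standard ones for a MARC 650 field ($a topical term, $x general subdivision, $y chronological subdivision, $z geographic subdivision, $v form subdivision). The facet names, the sample heading, and the mapping table are assumptions for illustration; this is a simplification and does not reproduce FAST's actual rules or the Smithsonian Index Metadata Model.

```python
# Hypothetical mapping from MARC 650 subfield codes to search facets.
FACET_BY_SUBFIELD = {
    "a": "topic",       # topical term
    "x": "topic",       # general subdivision
    "z": "place",       # geographic subdivision
    "y": "date",        # chronological subdivision
    "v": "form_genre",  # form subdivision
}

def split_heading(subfields):
    """Turn one pre-coordinated heading into separate facet values."""
    facets = {}
    for code, value in subfields:
        facet = FACET_BY_SUBFIELD.get(code)
        if facet is None:
            continue  # ignore subfields that do not feed a facet
        # Simplification: drop surrounding spaces and trailing periods.
        value = value.strip().rstrip(".")
        terms = facets.setdefault(facet, [])
        if value not in terms:
            terms.append(value)
    return facets

# Invented example heading, given as (subfield code, value) pairs.
heading = [
    ("a", "Indians of North America"),
    ("z", "Arizona"),
    ("x", "Social life and customs"),
    ("y", "20th century"),
    ("v", "Photographs."),
]
facets = split_heading(heading)
print(facets)
```

Once the string is split this way, "Arizona" can be offered as a Place filter and "Photographs" as a form or genre filter, regardless of which subject string they originally appeared in.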

The system was designed to aggregate data from multiple databases into a central Solr index. Jim Felley (senior system administrator) led our team in extracting data from the Horizon databases. All data was mapped to follow the Smithsonian Index Metadata Model, and each dataset required custom extraction programs to support the necessary data mapping. We carefully tracked the highly complex data mapping requirements in a spreadsheet, which allowed us to update the data and refresh it daily in the data repository for the Cross Search Center. Randy Arnold (system administrator) ensured that all systems were integrated and monitored multiple servers and system operations.
Data Aggregated from different databases into EDAN for Collections Search Center

In 2007, the Cross Search Center (http://collections.si.edu) went live with nearly two million records from the Smithsonian libraries, 14 archival units, and several other research offices from two museums. For the first time, the public was able to search all library and archival collection records at once on one platform. These search capabilities were the result of Smithsonian staff’s diligent work on metadata and authority control over the preceding ten years. The public and the reference staff loved the new user-friendly system. Anne Spire, Director of the Office of the Chief Information Officer, advised that the Cross Search Center (CSC) be expanded to support all Smithsonian museum collections. The Cross Search Center was renamed the Smithsonian Collections Search Center (http://collections.si.edu), and the back-end data indexing and data repository platform was named the Enterprise Digital Access Network (EDAN).

Getting more museums to contribute data to EDAN and the Collections Search Center required building relationships between OCIO LASSB and the museums. Even though the technology and system design were fully ready to take on the wide variety of data, changing institutional culture took a lot more time and work. Smithsonian collection staff had not traditionally worked together across the institution, and letting go of carefully curated data that had been compiled over many years required a new way of thinking.

LASSB made sure that this collaborative work with the museum staff created mutual benefits. 
  • To make sure the museums could control their own data in the centralized system, we allowed them to decide which data elements to contribute and which display labels to use for those elements.
  • In the Collections Search Center, museum names were prominently displayed and every record had a link to the hosting museum’s website, which greatly increased online traffic to the museums’ collection websites.
LASSB first approached smaller museums that were more willing to participate and had more to gain from the project. Some of the early participating museums included the National Portrait Gallery, the National Postal Museum and the Freer Sackler Gallery. Mike Trigonoplos (system administrator) extracted most museum data in this phase. These museums’ holdings were seemingly unrelated, but in the Collections Search Center, search results produced surprising connections. The positive feedback and testimonies from staff helped to propel the project forward. The message was clear: collaboration among units produced powerful results.

Screen shot of Collections Search Center in 2009

By December 2009, the Collections Search Center had become the first large-scale LAM (library, archives and museum) system in the United States, with more than two million catalog records from several SI museums. The system added data from more museums over the next few years. Today, this system includes 15.5 million records and five million online image, audio and video files from all major Smithsonian libraries, archives, museums, blogs and YouTube websites. Once again, we had to tackle consistency issues in the data submitted by the different museums. Capitalizing on our previous experience in vocabulary control, we quickly developed a systematic method to address these issues. George Bowman created a sophisticated data mapping database system that defines exception terms and enables their replacement with controlled vocabulary terms and data categories. This database contains about 50,000 specific use cases and instructions. The standardized terms significantly improved the performance of the Collections Search Center and the accuracy of search results.
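The idea behind such a mapping database can be shown with a small Python sketch. It is not George Bowman's system: the table entries, category names, and the `standardize` function are invented for illustration. Each rule maps an exception term, as submitted under some category, to the controlled term and the category it should be indexed under.

```python
# Hypothetical exception table:
# (submitted category, lowercased term) -> (target category, controlled term)
MAPPING = {
    ("object_type", "photo"): ("object_type", "Photographs"),
    ("object_type", "photograph"): ("object_type", "Photographs"),
    # A place name submitted in the wrong category is moved as well as fixed.
    ("object_type", "washington, d.c."): ("place", "Washington (D.C.)"),
}

def standardize(category, term):
    """Return the controlled (category, term) pair for a submitted value."""
    # Matching key: lowercase with runs of whitespace collapsed.
    key = (category, " ".join(term.lower().split()))
    # Terms without a rule pass through unchanged (apart from trimming).
    return MAPPING.get(key, (category, term.strip()))

print(standardize("object_type", " Photo "))
print(standardize("object_type", "Washington,  D.C."))
print(standardize("topic", "Cephalopods"))
```

Keeping the rules in a table rather than in code means new exceptions can be added as museums contribute data, without changing the indexing program.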

In 2012, a public tagging function was added to the Collections Search Center. It allowed the public to add keywords to catalog records online, with those tags searchable within ten minutes. During the trial period, 1.6 million records from nine Smithsonian units were released for tagging. In just six months, the public entered more than 1,000 tags. Public users filled in blanks for creator names, classified object types, identified historical events, individuals, ethnic groups, genders, aesthetic characteristics and styles, characterized film clips, and pointed out mistakes.
A Tagging Screenshot from the Collections Search Center in 2009
The tagging function improved searchability and increased public participation. However, Smithsonian staff did not have the resources to sift through all the tags and add them to catalog records, so the project ended after five years.




Ching-hsien Wang,  Branch Manager
Library and Archives Systems Support Branch (LASSB)
Office of the Chief Information Officer


