PART II: STEPPING OUTSIDE OF THE BOX
Jump-Starting and Supporting Digitization
In 1994, OIRM SIRIS began a new venture in the field of library and archives automation: the support of online media files. At the time, the Smithsonian had several Collection Information Systems, including the library’s system, but no catalog records were linked to image or video files, which prevented public access to them.
One of the NAA images digitized during early digitization
With internet access newly in place, we modified the configuration of a new WebPac application so that images could display with catalog records online, demonstrating the technical potential to library and archives staff. This new and exciting feature required Smithsonian staff to digitize images and then link the image files to catalog records by referencing the image URL in the MARC 856 field. It was a challenge to get started because no one knew how this would work, so we had to lead by example.
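For readers unfamiliar with MARC, the linking idea can be sketched in a few lines of Python. This is only an illustration and does not reproduce the WebPac configuration: the tuple-based record layout, the function names, and the example.org URL are all assumptions. What it does rely on is standard MARC usage, where field 856 (Electronic Location and Access) carries the URL in subfield u.

```python
# Hypothetical in-memory record: a list of (tag, indicators, subfields) tuples.
# This layout is an assumption for illustration, not a real MARC library API.

def add_image_link(record, url, note="Digital image"):
    """Return a copy of the record with an 856 field pointing at the image URL."""
    # Indicators "40": access via HTTP, link is to the resource itself.
    field = ("856", "40", [("u", url), ("z", note)])
    return record + [field]

def image_urls(record):
    """Collect every URL stored in subfield u of an 856 field."""
    return [
        value
        for tag, _indicators, subfields in record
        if tag == "856"
        for code, value in subfields
        if code == "u"
    ]

record = [("245", "10", [("a", "Photograph of a sculpture")])]
linked = add_image_link(record, "http://example.org/images/0001.jpg")
print(image_urls(linked))
```

An online catalog can then show the image next to the record simply by reading the 856 field, which is why linking required no change to the rest of the cataloging data.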
By 1995, OIT (Office of Information Technology, successor to OIRM) had purchased a couple of image scanners.
SIRIS helped the NMAA Art
Inventory project digitize about 200 photographs of sculptures and linked them
to their catalog records. The first Smithsonian
public online system that could display object records with images was born! In 1996
at the Smithsonian Institution’s 150th Anniversary Event on the
National Mall, we showcased the brand-new functionality to the public. The online demonstration using the Netscape
Navigator web browser even included a few cephalopod video clips from NMNH. The
excitement for the new functionality energized archives staff. Although more archives professionals accelerated their image
digitization efforts, most of them did not have the resources to host images
online. The digitized image files
accumulated on hard drives, CD-ROMs, and laser disks. Many of these storage devices sat on bookshelves
or under desks; they were not accessible to the public online.
Leadership in Data Standard and Vocabulary Control
In 1999, the Smithsonian library system was upgraded to the Ameritech (now SirsiDynix) Horizon system. This new system came with flexible system-configuration capability and a strong authority (vocabulary) control function. Most importantly, it allowed the Smithsonian to establish many locally defined fields, supported linking between related records, and supported specialized indexes that met the Smithsonian’s nontraditional needs. SIRIS had grown to support eight
databases: Library, Archives, Art
Inventories, SAAM Photo Archives, Art Exhibition, Research Bibliography,
History of Smithsonian, and Directory of Airplanes.
Over the years, 14 archives, 20 library branches and several
museum research departments depended on SIRIS to do a wide variety of
collection management functions. More
and more data sets were added to the eight databases using custom programming
and data importing. By 2006, nearly 50% of the 955,000 non-library records had been transferred from local databases and files such as DBASE, MS Access, Excel, C-Quest, FileMaker Pro, WordPerfect, and plain text.
Mapping these different datasets into the MARC format was a big challenge, but dealing with data inconsistencies was an even bigger one! Much of the data from these assorted sources lacked consistency from record to record and across datasets, and very few datasets followed national data standards.
So, our priority shifted to data cleanup of the records created by the staff at 14 Smithsonian archives. Our goal was to follow national data standards and cataloging guidelines. This approach proved to be a wise decision on
multiple levels. First, we avoided
internal disagreements as to how to standardize the data among several archival
units. Second, we were able to hire professionals whose knowledge was applicable to our goal. Finally, standardizing the data in different databases across the Smithsonian made building the Smithsonian-wide Collections Search Center platform much easier. We didn’t know the benefit of this final point at the time.
A few main approaches proved very productive:
- Conducted extensive data analysis, created reports using thousands of programming scripts, looked for exceptions and patterns in the data, and listed them for catalogers to review or change. This approach took advantage of both human intelligence and computer speed to handle complex data issues.
- Conducted several thousand global data modifications based on catalogers’ requests. This allowed us to change thousands of records at a time, which sped up progress.
- Prioritized access points and authority records for Names, Subjects, Form & Genre, Geographical, and Culture terms, which greatly improved searchability and discovery.
- Sent authority records to professional vendors for authority heading matching, had incorrect terms flipped to Library of Congress standards, and reloaded the records into our system. While expensive, this provided high-quality data.
- Conducted regular cataloging and metadata training and encouraged collaboration among cataloging units to maintain high-quality cataloging practices. The regular face-to-face meetings reinforced the importance of data quality and improved interactions among staff across the Institution.
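The first two approaches, exception reports and global modifications, can be illustrated with a minimal Python sketch. It is not one of the original scripts; the record layout, the field name `place`, the function names, and the sample values are assumptions chosen for the example.

```python
from collections import Counter

def exception_report(records, field, min_count=2):
    """List values of a field that occur fewer than min_count times.

    Rare values are often typos or local variants worth a cataloger's review.
    """
    counts = Counter(r.get(field, "") for r in records)
    return sorted(value for value, n in counts.items() if n < min_count)

def global_modify(records, field, old, new):
    """Return updated copies of the records plus the number of records changed."""
    changed = 0
    updated = []
    for r in records:
        if r.get(field) == old:
            r = {**r, field: new}  # copy, so the input records are not mutated
            changed += 1
        updated.append(r)
    return updated, changed

records = [
    {"id": 1, "place": "Washington (D.C.)"},
    {"id": 2, "place": "Washington (D.C.)"},
    {"id": 3, "place": "Washington DC"},
]
rare = exception_report(records, "place")
fixed, n = global_modify(records, "place", "Washington DC", "Washington (D.C.)")
print(rare, n)
```

The division of labor is the point: the script finds and lists the outliers quickly, a cataloger decides which ones are really errors, and the approved change is then applied to every affected record at once.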
For more than ten years, we continued to transform and standardize metadata within the eight Horizon databases. We established methodologies for handling chaotic situations and developed creative solutions to problems. The result of our persistent efforts became the solid foundation for the next phase: creating a centralized searching system for the Smithsonian Institution and fulfilling the goal and wish from 1980.
Pushing Beyond the Norm and Changing Culture - First Large-Scale Library, Archives and Museum Online Search Center
By 2005, the Smithsonian’s library, archives and museum collection records had been growing rapidly across the Institution thanks to the advancement and wide use of database technology. Large numbers of computer records were created and maintained in highly specialized commercial and local database systems. However, collection records were scattered across more than 100 disparate websites, which made them difficult for the public to use.
In 2006, OCIO LASSB (Library and Archives Systems Support Branch, successor to SIRIS) began to design a one-stop discovery platform that would include all Smithsonian collection data regardless of data format, professional discipline, or data-owning organization. We decided that this Cross Search Center should support simple keyword searching and be able to filter search results by data categories such as Name, Topic, Place, Culture, Date, and Media type. Since no one had done this at a large scale before, we had to innovate and find the best solutions to problems as they arose.
We started with the eight SIRIS Horizon datasets. Our first
challenge was to address the diverse data types and find ways to make the data consistent
in the Cross Search Center. We reviewed
technology platforms, data standard options and data mapping
possibilities. We identified common data
elements in records from across different disciplines including art, science,
culture and history, and defined a new metadata format that supports a wide range of material and object types (e.g., books, journals, bibliographies, photographs, art objects, and archival materials).
Andrew Gunther (senior software developer) took the lead in selecting an open-source technology platform (Solr) that supported easy searching, faceted filtering and fast indexing. The platform also allowed searching with automatic stemming for word matching, configurable relevancy ranking of search results, positive and negative limit options, and scalability for large data sets.
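To make the keyword, facet, and limit features concrete, here is a minimal Python sketch that builds a Solr select request. The parameters `q`, `facet`, `facet.field`, and `fq` are standard Solr query parameters; the host, the core name `collections`, the field names, and the helper function are assumptions for illustration and do not describe the actual Cross Search Center configuration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def build_search_url(base, keywords, facet_fields, include=None, exclude=None):
    """Build a Solr select URL with facet counts and positive/negative limits."""
    params = [("q", keywords), ("facet", "true")]
    # One facet.field parameter per category to be counted in the results.
    params += [("facet.field", name) for name in facet_fields]
    # Positive limit: keep only records whose field matches the value.
    for field, value in (include or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    # Negative limit: a leading minus sign drops matching records.
    for field, value in (exclude or {}).items():
        params.append(("fq", f'-{field}:"{value}"'))
    return base + "?" + urlencode(params)

url = build_search_url(
    "http://localhost:8983/solr/collections/select",  # hypothetical core
    "cephalopod",
    ["topic", "place"],
    include={"media_type": "Images"},
    exclude={"place": "Antarctica"},
)
query = parse_qs(urlsplit(url).query)
print(query["fq"])
```

Because each limit is sent as a separate filter query, a user can add or remove one category at a time without retyping the keyword search.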
Insisting on consistent metadata standards was the key to our success. After evaluating several existing metadata standards (MARC, VRA, MEDS, CDWA Lite, CCO), we identified the most common data elements and created the Smithsonian Index Metadata Model. George Bowman (senior system administrator) took the lead in designing this flexible metadata model, which accommodated many specific use cases. The LASSB team consulted the OCLC FAST (Faceted Application of Subject Terminology) schema and used it to break up the LCSH subject headings in our MARC records by subfield, thus allowing faceted searching and filtering in the Cross Search Center.
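Breaking a subject heading up by subfield can be sketched as follows. The subfield codes are the standard ones for a MARC 650 topical heading (a and x for topics, y for chronological, z for geographic, v for form); the facet names, the function, and the sample heading are assumptions, and the actual mapping rules in the Smithsonian Index Metadata Model may well differ.

```python
# Assumed mapping from MARC 650 subfield codes to facet names.
SUBFIELD_TO_FACET = {
    "a": "topic",       # main topical term
    "x": "topic",       # general subdivision
    "y": "date",        # chronological subdivision
    "z": "place",       # geographic subdivision
    "v": "form_genre",  # form subdivision
}

def split_heading(subfields):
    """Split one precoordinated heading into separate facet values."""
    facets = {}
    for code, value in subfields:
        facet = SUBFIELD_TO_FACET.get(code)
        if facet is None:
            continue  # ignore control subfields such as $2 or $0
        values = facets.setdefault(facet, [])
        value = value.strip()
        if value not in values:
            values.append(value)
    return facets

heading = [
    ("a", "Sculpture, American"),
    ("z", "Washington (D.C.)"),
    ("y", "20th century"),
    ("v", "Photographs"),
]
print(split_heading(heading))
```

Once the parts are stored separately, a filter on Place or Date can match a record even though the cataloger entered a single combined heading.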
The system was designed to aggregate data from multiple databases into a central Solr index. Jim Felley (senior system administrator) led our team in extracting data from the Horizon databases. All data was mapped to follow the Smithsonian Index Metadata Model, and each dataset required custom extraction programs to support the necessary data mapping. We carefully tracked the highly complex data mapping requirements in a spreadsheet, which allowed us to update the data and refresh it daily in the data repository for the Cross Search Center. Randy Arnold (system administrator) ensured that all systems were integrated and monitored multiple servers and system operations.
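The per-dataset mapping step might look something like the Python sketch below. It is a simplified illustration under stated assumptions: the dataset keys, source field names, index field names, and function name are all hypothetical, and the real mapping requirements tracked in the spreadsheet were far more complex.

```python
# Hypothetical mapping tables: source field -> common index field, per dataset.
FIELD_MAPS = {
    "archives": {"title": "title", "creator": "name", "subject_place": "place"},
    "art_inventories": {"artwork_title": "title", "artist": "name", "location": "place"},
}

def to_index_record(dataset, source):
    """Map one source record into the common index layout."""
    doc = {"data_source": dataset}
    for source_field, index_field in FIELD_MAPS[dataset].items():
        value = source.get(source_field)
        if value:
            # Index fields are multi-valued so several source fields can share one.
            doc.setdefault(index_field, []).append(value)
    return doc

doc = to_index_record(
    "art_inventories",
    {"artwork_title": "Untitled", "artist": "Doe, Jane", "medium": "bronze"},
)
print(doc)
```

Keeping the mappings in a table rather than in the extraction code means a change in one dataset's layout only requires updating that dataset's entries.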
In 2007, the Cross Search Center (http://collections.si.edu) went live with nearly two million
records from the Smithsonian libraries, 14 archival units, and several other research
offices from two museums. For the first time, the public was able to search all library and archival collection records on one platform at once. These search capabilities were the result of Smithsonian staff’s diligence in working on metadata and authority control over the previous ten years. The public and the reference staff
loved the new user-friendly system. Anne
Spire, Director of the Office of the Chief Information Officer, advised that
the Cross Search Center (CSC) be expanded to support all Smithsonian museum
collections. The Cross Search Center was renamed the Smithsonian Collections
Search Center (http://collections.si.edu),
and the back-end data indexing and data repository platform was named the Enterprise
Digital Access Network (EDAN).
Getting more museums to contribute data to EDAN and the Collections Search Center required effort to build relationships between OCIO LASSB
and the museums. Even though the
technology and system design were fully ready to take on the wide variety of
data, changing institutional culture took a lot more time and work. Smithsonian collection staff had not
traditionally worked together across the institution, and letting go of their
carefully curated data that was compiled over many years required a new way of
thinking.
LASSB made sure that this collaborative work with the museum staff created mutual benefits.
- To make sure the museums could control their own data in the centralized system, we let them decide which data elements to contribute and which display labels to use for those elements.
- In the Collections Search Center, museum names were prominently displayed and every record had a link to the hosting museum’s website, which greatly increased online traffic to the museums’ collection websites.
Screenshot of the Collections Search Center in 2009
By December 2009, the Collections Search Center had become the first large-scale LAM system in the United States, with more than two million catalog records from several SI museums. The system added data from more museums over the next few years. Today, this system includes 15.5 million records and five million online images, audio and video files from all major Smithsonian libraries, archives, museums, blogs and YouTube websites. Once again, we had to tackle consistency issues in the data submitted by the different museums. Capitalizing on our previous experience in vocabulary control, we quickly developed a systematic method to address these issues. George Bowman created a sophisticated data mapping database system that defines exception terms and replaces them with controlled vocabulary terms and data categories. This database contains about 50,000 specific use cases and instructions. The standardized terms significantly improved the performance of the Collections Search Center and the accuracy of search results.
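The idea behind such a mapping database can be shown with a tiny Python sketch. It is not George Bowman's system: the rule table, the category names, the sample terms, and the function name are assumptions, standing in for a database of roughly 50,000 rules.

```python
# Hypothetical rule table: exception term -> (data category, controlled term).
TERM_RULES = {
    "photo": ("object_type", "Photographs"),
    "photograph": ("object_type", "Photographs"),
    "b&w print": ("object_type", "Photographs"),
    "wash., d.c.": ("place", "Washington (D.C.)"),
}

def normalize_term(raw, default_category):
    """Replace an exception term with its controlled form and category.

    Terms without a rule pass through unchanged in their original category.
    """
    key = " ".join(raw.lower().split())  # fold case and collapse whitespace
    return TERM_RULES.get(key, (default_category, raw.strip()))

print(normalize_term("  Photo ", "object_type"))
print(normalize_term("Wash., D.C.", "topic"))
print(normalize_term("Bronze", "material"))
```

Note that a rule can move a term to a different category as well as respell it, so a place name that a museum entered as a topic still lands in the Place filter.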
In 2012, public tagging functionality was added to the Collections Search Center. It allowed the public to add keywords to catalog records online, with those tags searchable within ten minutes. During the trial period, 1.6 million records from nine Smithsonian units were released for tagging. In just six months, the public entered more than 1,000 tags. Public users filled in blanks for creator names, classified object types, identified historical events, individuals, ethnic groups, genders, aesthetic characteristics and style, characterized film clips and pointed out mistakes.
A Tagging Screenshot from the Collections Search Center in 2009
The tagging function improved searchability and increased public participation. However, the Smithsonian staff did not have the resources to sift through all the tags and add them to catalog records, so the project ended after five years.
Ching-hsien Wang, Branch Manager
Library and Archives Systems Support Branch (LASSB)
Office of the Chief Information Officer