Languages in Libraries: How We Code, Catalog, Transliterate, and Translate

The three of us agreed to research different approaches to accessing foreign language resources in libraries and museums. Initially we looked into how translation is performed, whether technologies like Google’s visual translation tools are reliable, and how information is validated by people who may not speak the languages they are trying to catalog and describe.

Our website aims to give an overview of the current scholarship and practice in managing and translating foreign language materials in libraries, archives, and museums.

Our project focuses on the various challenges presented by working with foreign language materials in libraries, archives, and museums and their possible solutions. We broke our research down into three categories based on our interests: cataloging, crowdsourcing, and user experience.


Cataloging

Daisy Tainton looks at how metadata is created, shared, and standardized in the library world, particularly with regard to non-Latin scripts on Latin-based platforms. How are items described, and how do users discover and access them? This section includes pages about Unicode, OCLC, searching for non-Latin items, and interviews with professionals in the field.

For cataloging, OCLC Connexion macros (such as those created by Joel Hahn and Terry Reese), Unicode, and Google Translate were the most-used technologies. OCLC's tools generally allow users to transliterate scripts back and forth with minimal errors, though small libraries may not be able to afford access. Unicode is a standard for encoding scripts that has been in development since 1991 and currently has the capacity to encode over a million characters. Google Translate is used constantly during cataloging, in large and small organizations alike, to match incoming items with their existing entries in WorldCat or any other accessible system.
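As a rough illustration of what that capacity means in practice, the short Python sketch below uses only the standard library to compute the size of the Unicode code space and to look up a few non-Latin characters. The sample characters and the helper name describe are chosen here for illustration and do not come from any catalog record.

```python
import sys
import unicodedata

# Unicode's code space runs from U+0000 to U+10FFFF:
# 17 planes of 65,536 code points each (not all are assigned characters).
CODE_SPACE = sys.maxunicode + 1


def describe(char: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    print(f"Code points available: {CODE_SPACE:,}")
    # Illustrative samples: one Hebrew, one Arabic, one Chinese character.
    for sample in ["א", "ب", "中"]:
        print(describe(sample))
```

Because every character has a single numeric code point, a catalog record that stores text as Unicode can carry Hebrew, Arabic, and Chinese side by side with Latin letters.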

OCLC transliteration tools appear to work fairly well, but checking that the results are correct still requires an expert, or at least a speaker of the language. This is why top-level catalogers at major institutions (like Harvard) create definitive metadata entries that become the gold standard and spread through WorldCat across the globe. Duplicate entries with small mistakes, perhaps created by a cataloger who could not locate an existing entry, confuse users and clog library catalogs wherever they are copied or shared.
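To picture how near-duplicate entries might be flagged for a human cataloger to review, here is a minimal Python sketch. It is not how OCLC or WorldCat detect duplicates. The functions match_key and likely_duplicates are hypothetical helpers, and the sample headings are invented. The idea is only that headings differing in diacritics, case, or punctuation can be reduced to one comparison key.

```python
import unicodedata


def match_key(heading: str) -> str:
    """Reduce a heading to a rough comparison key (hypothetical helper)."""
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", heading)
    # Drop the combining marks so that "ĭ" compares equal to "i".
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Ignore case, and turn punctuation into spaces.
    kept = "".join(c if c.isalnum() else " " for c in stripped.casefold())
    # Collapse runs of whitespace.
    return " ".join(kept.split())


def likely_duplicates(headings):
    """Group headings that share a match key; return groups of two or more."""
    groups = {}
    for heading in headings:
        groups.setdefault(match_key(heading), []).append(heading)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Invented sample headings, for illustration only.
    sample = ["Tolstoĭ, Lev", "TOLSTOI, LEV.", "Chekhov, Anton"]
    for group in likely_duplicates(sample):
        print("Possible duplicates:", group)
```

A key like this only suggests candidates. Deciding whether two entries truly describe the same item still takes someone who can read the language.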

The internet as we understand it has forced libraries to adapt and grow, and has raised expectations for discovery. International, borderless finding means using searchable multi-language text, including latinized versions of texts, and making those descriptive texts available to external researchers.

As the population of foreign language readers with access to our collections continues to grow, both in person and online, libraries have been slowly expanding their physical collections to meet those users’ needs. These foreign language materials must be cataloged and described, ideally in English as well as in the language in which they were written (the language of the agency and the language of origin). Materials in non-Roman scripts (Arabic, Chinese, Japanese, Tibetan, Vietnamese, etc.) may also need to be described in a latinized form of the language in which they were written, which makes it easier to type and copy the information across platforms and to work with older computer systems that lack or cannot support keyboard programs in those scripts.
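The latinized form mentioned above is usually produced from a character-by-character table. The Python sketch below shows the general idea with a deliberately tiny, simplified Cyrillic table. The table, the function name latinize, and the sample strings are assumptions made for illustration and do not reproduce ALA-LC or any other published romanization standard.

```python
# Deliberately simplified, illustrative table. It is NOT the ALA-LC
# romanization or any other published standard, and it covers only the
# letters needed for the samples below.
TOY_TABLE = {
    "а": "a", "в": "v", "д": "d", "и": "i", "й": "i",
    "к": "k", "л": "l", "м": "m", "н": "n", "о": "o",
    "р": "r", "с": "s", "т": "t",
}


def latinize(text: str) -> str:
    """Replace each known character with its Latin equivalent."""
    out = []
    for char in text:
        latin = TOY_TABLE.get(char.lower())
        if latin is None:
            # Leave spaces, punctuation, and unknown characters untouched.
            out.append(char)
        elif char.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(latinize("Толстой"))
    print(latinize("Война и мир"))
```

Real romanization tables are much larger and use diacritics to keep distinct letters apart, which is one reason the results still need checking by someone who knows the language.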

In museums and archives, shared metadata projects will help reduce difficulties in this area and will presumably allow good data to be reinforced through consensus and improving algorithms (increasing personal name recognition, clarifying fine points in translated Latin names). One source, Don Wheeler, Collection Development/Technical Services Librarian at the New York Botanical Garden (NYBG), has mentioned that catalogers in specialized collections (plant-related items at the NYBG) benefit greatly from access to existing catalogs for comparison when faced with foreign language items. Visual translators can be extremely helpful in reaching that existing metadata, since they reduce the time spent looking up each word, and a visual match would be excellent verification. But those entries have to be made first, and then found.

Given more time, it would be great to look into regional variations and developments under way for new transliteration or translation tools. It would also be fascinating to learn more about small archives with unique items, and how they handle items for which entries can’t be matched in WorldCat.


Crowdsourcing

Meg Edison

One way libraries, archives, and museums (LAMs) can tackle the challenges presented by foreign language materials (particularly archival objects) is by turning to the crowd to help translate.

The first section, What is Crowdsourcing?, gives a brief overview of the topic. I explain how Jeff Howe coined the term in 2006 and how crowdsourcing often focuses on “microtasks.” I conclude by describing some of the tasks for which libraries, archives, and museums use crowdsourcing. The second section, Crowdsourcing and Translation, discusses how researchers are trying to find practical ways to deliver translation services through internet crowdsourcing. I explain the basics of how translation crowdsourcing works and describe the projects of two teams of researchers. The third section, LAMs and Crowdsourcing, explains how libraries, archives, and museums can use crowdsourcing for their translation projects. The final section, Current Projects, gives several examples of LAM crowdsourcing projects that are either ongoing or recently completed. The National Archives and Records Administration Citizen Archivist, the Smithsonian Transcription Center, and New York Public Library Labs are all crowdsourcing projects that either include, or could include, translation features.

User Experience

Leah Constantine 

The gap between users and the results they seek lies in cataloging, which depends on the cataloger’s accurate description, subject analysis, and classification of resources. A catalog fails its users when the institution does not meet these description standards and the public cannot reach the system’s data through standard research practices.

Through the successful use of new technology and digital research practices, the British Library was able to catalog records from its Hebrew manuscripts without compromising user experience or translated descriptions. The British Library displayed its XML with a digital tool called TEI Viewer, which shows all of the code attributed to the description of the manuscripts, and later the images, at once (Keinan-Schoonbaert). This gave the cataloger the ability to analyze the data as a whole and create enhanced standards of searching for the end user, even in a foreign language. In contrast, the Silvermuseet was unable to translate and digitize its linguistic materials successfully because it lacked the skill set needed to document them to XML standards (Wilbur, 2014). This is a great disadvantage to the museum’s community and to users who could benefit from learning about the history of the culture and language.
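To give a sense of what a manuscript description encoded in XML looks like and how software can read it as a whole, here is a minimal Python sketch using the standard library. It is not the British Library's workflow or the TEI Viewer tool. The record is an invented, heavily trimmed fragment that borrows TEI element names (msDesc, msIdentifier, msItem, textLang) and is not schema-valid, and the shelfmark, titles, and the function name summarize are placeholders.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, and the built-in namespace used by xml:lang.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample record: shelfmark and titles are placeholders.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <msDesc xml:lang="en">
          <msIdentifier>
            <idno>EXAMPLE MS 1</idno>
          </msIdentifier>
          <msContents>
            <msItem>
              <title xml:lang="he">ספר לדוגמה</title>
              <title xml:lang="en">Example book</title>
              <textLang mainLang="he">Hebrew</textLang>
            </msItem>
          </msContents>
        </msDesc>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def summarize(tei_xml: str) -> dict:
    """Pull the shelfmark, titles by language, and main language of a record."""
    root = ET.fromstring(tei_xml)
    item = root.find(".//tei:msItem", TEI_NS)
    # Map each title's xml:lang value to its text.
    titles = {t.get(XML_LANG): t.text for t in item.findall("tei:title", TEI_NS)}
    return {
        "shelfmark": root.findtext(".//tei:idno", namespaces=TEI_NS),
        "titles": titles,
        "main_language": item.find("tei:textLang", TEI_NS).get("mainLang"),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Because the original-script title, the English title, and the language code all sit in one structured record, a catalog interface can index each of them for searching.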

Libraries are constantly challenged to make resources accessible to users through their online catalogs as a public service. The British Library met this challenge, and its user experience succeeds in delivering accurate information on foreign language materials. However, as the data in library systems expands, so do the challenges of maintaining public access to descriptions. This is often addressed by structuring records in languages like XML and making them available through open source software like OpenRefine and OPACs (Martin, 2014). These open source systems support more accurate searching and so help bridge the gap between user and catalog.

With more time to use these services and to research the institutions that apply them in their catalogs, it would have been helpful to see which challenges specific to foreign language materials were overcome, and to study how successful the translations have been for users rather than just for catalogers. Even in the time available, seeing successful and unsuccessful examples of user experience provides a general understanding of the challenges institutions face in achieving public access to catalogs.

All of our citations are organized by section in our References page.