31 March 2009

Searching OCR? Be flexible

OCR (optical character recognition) is technology used to scan printed words into a computer and convert them into machine-editable text. It is what brings many newspapers and books to the internet on sites such as ancestry.com, footnote.com and more.

This technology allows family history researchers to access materials quickly, without having to wait for a person to transcribe books and newspapers page by page. However, because the process is fully automated, it is prone to errors. To find good results, I've found that I have to be flexible and imaginative.

For example, I'm currently researching my brother-in-law's family tree on newspaperarchive.com. A particular family group consists of the surnames Finlan and Engstrom. I searched for "Engstrom" and found some great obituary results. Once I felt I'd run out of Engstrom results, I moved on to searching for "Finlan." One of my results:
Those present from a distance Harold strom from New York Merle strom of liia Wayne strom of Pasadena Miss Ruth Snowdon of and Mr and Mrs Geo Finlan of Pa
("Youngsville," Warren Morning Mirror, Warren, Warren, PA, 26 Aug 1927, p12)

Ah-ha, more Engstrom results! Only, the OCR technology has recorded the surname as "strom." As transcribed, the sentence looks like it was written by a nearly illiterate journalist. In reality, the sentence reads:
Those present from a distance were: Harold Engstrom from New York City, Merle Engstrom of Philadelphia, Wayne Engstrom, of Pasadina Calif., Miss Ruth Snowdon of Philidelphia, and Mr. and Mrs. Geo Finlan of Pittsburgh, Pa.

A big difference! If the Finlan family members hadn't been listed, I wouldn't have found this article, because of the mangled transcription. Another common error I've noticed is that "r" is often read as "i." This makes a big difference when searching for my Craft ancestors, whom I've found listed as "Ciaft" more than once.

So, when searching results that were transcribed using OCR, be flexible. Try cutting surnames in half, swapping similar-looking letters (e.g., o for e) or, when available, using wild-card characters. It'll make a big difference in results.
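If you'd rather not work out the variations by hand, the tips above can be roughly sketched in a few lines of Python. This is only an illustration, not anything the search sites provide: the function names are my own invention, the list of look-alike letters is a small assumed sample (only "r" read as "i" and "o"/"e" come from this post), and the "?" wild-card character is an assumption, since each site has its own wild-card rules.

```python
# Letters OCR may confuse. "r" -> "i" and the o/e pair come from this post;
# "i" -> "l" is an extra assumed example. Extend the table as you spot more.
CONFUSIONS = {"r": ["i"], "i": ["l"], "o": ["e"], "e": ["o"]}


def ocr_variants(surname):
    """Return spellings with exactly one look-alike letter swapped."""
    variants = set()
    for pos, ch in enumerate(surname):
        for sub in CONFUSIONS.get(ch.lower(), []):
            if ch.isupper():
                sub = sub.upper()  # keep the capital at the start of a name
            variants.add(surname[:pos] + sub + surname[pos + 1:])
    return sorted(variants)


def halves(surname):
    """Cut a surname in half, e.g. to catch a name that lost its first letters."""
    mid = len(surname) // 2
    return surname[:mid], surname[mid:]


def wildcard_pattern(surname, wildcard="?"):
    """Replace every confusable letter with a single-character wild card.

    The "?" default is an assumption; check what your search site supports.
    """
    return "".join(wildcard if ch.lower() in CONFUSIONS else ch for ch in surname)


def search_terms(surname):
    """Collect the original name, its one-letter variants, and usable halves."""
    terms = {surname}
    terms.update(ocr_variants(surname))
    terms.update(h for h in halves(surname) if len(h) >= 3)  # skip tiny pieces
    return sorted(terms)


if __name__ == "__main__":
    for name in ("Craft", "Engstrom"):
        print(name, search_terms(name), wildcard_pattern(name))
```

For "Craft" the one-letter variants come out as just "Ciaft," which is exactly the misspelling I kept running into. The halves are a blunt tool: they only help when the site matches partial words, so treat them as a starting point for manual searching.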

1 comment:

Apple said...

Also try i for l. My best search for Hollington becomes Holllngton.

Another tip - if you have an address, search that way. I turned up daughter's married names that way and discovered other relatives.
