Monday, March 24, 2008

more information retrieval thoughts

If you were asked to compare certain aspects of information retrieval, like full-text vs. surrogate record searches, or retrieval through different methods such as MARC records (surrogate), the internet, or aggregated databases, what would you talk about?

There are pros and cons for each method. How accurate you need your answer to be might determine which avenue you choose, because each has a different rate of precision (how much of what you retrieve is actually right/relevant) and recall (how much of the relevant information that exists you actually manage to retrieve).
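To make precision and recall a little more concrete, here is a minimal sketch of how the two numbers could be computed for a single search. The function name and the item ids are made up for illustration, and it assumes you already know which items in the collection are relevant, which in real life is the hard part.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, given collections of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # items that were both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 results came back, 5 items in the collection were relevant.
retrieved = ["a", "b", "c", "d"]
relevant = ["a", "b", "e", "f", "g"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Two of the four results were relevant (precision 0.5), but only two of the five relevant items were found (recall 0.4). A search can do well on one number and poorly on the other.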

Let's take full-text searching, for instance. Aggregated databases and most web-based texts offer "full-text" searching. That sounds great, but there are some drawbacks. With databases you're usually only able to search the citation or abstract - leaving out the biggest portion of the article. You may miss the very articles that are most relevant to your needs because synonyms and other helpful associated terms weren't included in the abstract/citation. The same can be said for web searches - it's a jumble of information in a far less organized environment, so many relevant pieces just aren't found (Google's kick-ass page ranking algorithm can increase your chances of success, but it's still not a controlled vocabulary). This kind of "free-form" searching, as I sometimes think of it, can also, and does, easily produce a good number of false hits.

This last point, however, isn't always so bad. How often have you done a Google search (or used another search engine of your choice) and come across information that may or may not be directly related to what you were looking for but proved to be important and interesting nonetheless? Some of this success has to do with "natural language" and the way web content is indexed, both of which we'll get to in a bit.

Then there are the surrogate records. I always think of them as MARC records, which I access a lot in my current position. This individual item inventory has a list of fields that will tell you, in a controlled manner, a host of relevant facts like title, publisher, preceding or succeeding titles, publication date, and so on. Any one of these fields can be both an access point and a source of retrieval. The problem is that you'll only get the location of the item, not the item itself (unless, perhaps, a PURL is included - a persistent URL that will take you directly to a digitized object). That's become a point of serious disappointment in the Google age - people not only want to find the information they're looking for, they want it now. They don't generally want to schlep to a library to look at a book or article.
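Here is a rough sketch of the "every field is an access point" idea. It is not real MARC (real records use numbered tags and subfields); the field names, titles, and locations below are invented for illustration only.

```python
# Two made-up surrogate records. Note that neither holds the item's full text,
# only facts about the item and where to find it.
records = [
    {
        "title": "Journal of Example Studies",
        "publisher": "Example Press",
        "pub_date": "1998",
        "location": "Stacks, 3rd floor",
        "purl": None,  # no digitized copy, so a trip to the library it is
    },
    {
        "title": "Sample Quarterly",
        "publisher": "Sample House",
        "pub_date": "2004",
        "location": "Periodicals room",
        "purl": "http://purl.example.org/sample-quarterly",  # hypothetical PURL
    },
]


def search_field(records, field, term):
    """Return the records whose given field contains the term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if term in str(r.get(field) or "").lower()]


hits = search_field(records, "publisher", "example")
for record in hits:
    # A hit gives you a location, and the item itself only if a PURL exists.
    print(record["title"], "|", record["location"], "|", record["purl"])
```

Searching by title, publisher, or date all work the same way, which is the appeal. What you get back, though, is a pointer to the item rather than the item.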

Far more than most web content, surrogate records are derived from a controlled vocabulary. Unlike natural language, which lets users build the bulk of web information, a controlled vocabulary takes skill and time to generate. You can pull synonyms together and sort out homonyms and polysemes in ways that enhance the aboutness of the item, thus increasing the precision of a hit. But this comes with significant drawbacks: it's expensive to maintain and, because it is labor intensive, it may very quickly lag behind current and new information.

Natural language indexing doesn't suffer this problem. The content is indexed automatically by the computer, so the information is searchable right away. Granted, what comes back may not be right or relevant - it may be so far off the mark you walk away scratching your head going "I asked for checks for my bank account, not checks in cotton material." A computer cannot effectively process natural language. It doesn't know the difference between being serious and being sarcastic (though the idea of the semantic web aims to make computers intelligent enough to read and know the difference between this and all other language quirks).
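The "checks" problem is easy to reproduce. Below is a toy sketch, not how any real search engine works: a bare-bones automatic word index next to a hand-assigned subject heading lookup. The two documents and the subject headings are made up for the example.

```python
from collections import defaultdict

# Two tiny made-up documents that share the word "checks".
docs = {
    1: "order personal checks for your bank account",
    2: "cotton fabric in checks and plaids",
}


def build_index(docs):
    """Automatically index every word: word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


index = build_index(docs)
print(sorted(index["checks"]))  # [1, 2] - the fabric page is a false hit

# Controlled vocabulary: a person assigned these headings by hand (hypothetical).
subjects = {
    1: {"Checks (Banking)"},
    2: {"Checked fabrics"},
}


def search_subject(subjects, heading):
    """Return ids of the documents that carry the given subject heading."""
    return sorted(doc_id for doc_id, heads in subjects.items() if heading in heads)


print(search_subject(subjects, "Checks (Banking)"))  # [1]
```

The word index costs nothing to build and is ready immediately, but it can't tell the two meanings apart. The subject headings can, but somebody had to sit down and assign them.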

Jesus - this subject is just huge when you think about it, trying to connect the index to the controlled vocabulary to the natural language vocab to the internet versus commercial databases versus a surrogate record and all the stuff that goes into each one. For everything you say, there's 2x as much you'll leave out. Poo - I hate that kind of stuff.
