Saturday, March 22, 2008

meta metadata data

Can you have a conversation about metadata without discussing information retrieval? For that matter, can you talk about information retrieval without discussing metadata? Prolly not in a modern library context, since they're so intertwined. So let's not waste our time: let's talk about both (this may take two days)!

Metadata - data about data - sounds redundant on the outside, but it is, and always has been, at the core of what libraries do. Let's consider the old card catalogue for a moment. Those little cards held information (metadata) about books: author, title, Dewey decimal number, publisher, etc. It was data about... data. And each card was a point of access, a way to find the book you wanted amongst the thousands on the shelves (this becomes important when we talk about information retrieval in a bit).
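To make "data about data" concrete, here's a minimal sketch (in Python, with made-up values) of what one of those cards carries, and how its fields serve as access points:

```python
# A catalogue card is just metadata: a small record describing a book,
# not the book itself. Every field is a potential access point.
card = {
    "author": "Melville, Herman",
    "title": "Moby-Dick; or, The Whale",
    "publisher": "Harper & Brothers",
    "dewey": "813.3",  # Dewey decimal number: where it sits on the shelf
    "subjects": ["Whaling -- Fiction", "Sea stories"],
}

# Finding the book means searching the metadata, not the text of the book:
if "whaling" in " ".join(card["subjects"]).lower():
    print(f"Go to shelf {card['dewey']} for {card['title']}")
```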

So why should it be such a stretch to think of metadata as anything other than what it is: information about a book, a way to find the book you're looking for? And in today's electronic world, it's a way to find a book anywhere in the world, à la the web! People seem to struggle with it for two reasons, I think.

1. Because it's called something fancy-sounding - metadata - a name synonymous with nothing in layman's terms (the word "information" is infinitely more universal), and

2. The electronic nature of metadata, and the schemas under which metadata is generally governed, is relatively new (in the big scope of things) and ever-changing. It's damn hard to keep up with if you're not knee-deep in the conversation all the time. To be perfectly honest, it's hard even if you are - it changes a lot, standards are in flux, and the language is not always intuitive.

Plus, there are so many different kinds of metadata collected and preserved for so many different reasons that knowing one schema does not guarantee across-the-board understanding. It's been my experience that the more traditional librarians struggle with the concept more than younger people who are used to navigating the computer world. New monikers aren't as likely to scare the bejeezus out of them, whereas traditional librarians are feeling the technological crunch more and more, and any new ingredient can increase anxiety. (Obviously, a topic for another discussion, anyway...)

METS, MODS, Dublin Core, DACS, TEI, EAD - the list goes on. They're all schemas, or standards, meant to record descriptive, structural, administrative, and preservation information about an object, whether it be analog or digital in nature. These schemas are generally aimed at particular kinds of collections.

METS, for instance, is used with the Library of Congress' National Digital Newspaper Program (NDNP). Written in XML (eXtensible Markup Language), METS acts as a wrapper (don't you love how real life translates into the virtual realm?) in that it harbors a lot of information concerning a large group of files (in the case of NDNP, a large group of containers full of files) and arranges it in a meaningful way, so that many things about the files (objects) can be read and understood, in whole or in part, without ever opening the files. Take newspaper images, for example: things like title, publisher, microfilming date and microfilming agency, page number, issue number, date of publication, reel sequence number, etc. are all collected and stored in the METS files, associated with the reel, the issue, or the image.

Too, the images themselves are embedded with certain metadata, like file format, scanner and software version used to capture the image, and more - and different file formats are capable of carrying more or less embedded metadata. DACS, by contrast, arranges information about archival collections according to its own guidelines, at the object level, collection level, folder level - any number of ways one might choose to get at the collection(s). Each schema is designed for the kinds of collections it aims to describe, preserve, and provide access to.
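To illustrate the wrapper idea, here's a minimal sketch using Python's standard ElementTree. The XML is a drastically simplified, hypothetical stand-in - real METS uses its own namespaced vocabulary - but it shows how one file can describe a whole batch of newspaper images, readable without ever opening them:

```python
import xml.etree.ElementTree as ET

# A simplified, made-up stand-in for a METS wrapper: one XML document
# describing a batch of page images, so title, date, page numbers, and
# reel sequence can all be read without touching the image files.
wrapper = ET.fromstring("""
<newspaperBatch>
  <issue title="The Daily Bugle" date="1908-03-22" number="71">
    <page number="1" reelSequence="0001" file="bugle_19080322_p1.tif"/>
    <page number="2" reelSequence="0002" file="bugle_19080322_p2.tif"/>
  </issue>
</newspaperBatch>
""")

for issue in wrapper.iter("issue"):
    print(issue.get("title"), issue.get("date"))
    for page in issue.iter("page"):
        print("  page", page.get("number"), "->", page.get("file"))
```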

It should be noted that using XML to mark up the metadata goes a long way toward making the electronic collection information interoperable, and bodes well for future transitions and improvements.
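A quick sketch of what that interoperability buys you: because the information is XML, "crosswalking" a record from one schema's fields into another's is a small, mechanical step. Here the hypothetical issue record from above is re-expressed with real Dublin Core element names (dc:title, dc:date):

```python
import xml.etree.ElementTree as ET

# The dc: namespace URI is Dublin Core's real one; the <issue> record
# itself is the invented example from the sketch above.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

src = ET.fromstring('<issue title="The Daily Bugle" date="1908-03-22"/>')

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}title").text = src.get("title")
ET.SubElement(record, f"{{{DC}}}date").text = src.get("date")

# Prints the crosswalked record with dc:title and dc:date elements.
print(ET.tostring(record, encoding="unicode"))
```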

This metadata language, then, no matter the schema, enables certain kinds of access - through a browser interface, for instance. This point is important when we think about information retrieval. In some ways, the whole point of metadata - or perhaps I should say the joy of it - is that it provides for wider access in a digital environment. What we choose to collect and embed in our metadata schemas (or our image metadata) directly impacts how well our objects can be found by users. I'm talking purely from an electronic standpoint now; forget the card catalogue of yore.

It's possible for every data point to be an access point, if one chose to make it so. At first glance one might say, "Well, of course, every point should be an access point," but that's not always the best approach, depending on the circumstances. Here is where librarians can really be a huge help, not only in designing metadata standards but in the retrieval of the information gleaned from them. As information professionals, we're perhaps in the best position to know how users tend to look for their information (some may challenge this assumption, what with the rise of folksonomies and other Web 2.0 ideas, but let's set that aside for right now).
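Here's a minimal sketch of that design choice (records and fields invented for illustration): a tiny inverted index built over only the fields we elect to make access points - anything left out simply can't be searched on:

```python
# Which metadata fields become access points is a design decision.
# Only the fields listed in ACCESS_POINTS get indexed.
records = [
    {"id": 1, "title": "Whaling Quarterly", "publisher": "Harper"},
    {"id": 2, "title": "Harper's Weekly",   "publisher": "Harper"},
]

ACCESS_POINTS = ["title"]  # publisher deliberately NOT an access point

index = {}
for rec in records:
    for field in ACCESS_POINTS:
        for word in rec[field].lower().split():
            index.setdefault(word, set()).add(rec["id"])

print(index.get("harper's", set()))  # {2}: found via the indexed title
print(index.get("harper", set()))    # set(): publisher was never indexed
```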

There are a few ways one might find information. One is a library's OPAC: a database of surrogate records (MARC records, most likely). Perhaps the most widely known is OCLC's WorldCat union catalogue, which acts as a sort of do-all and be-all of catalogues because it can easily connect to records within its 10,000 member libraries' catalogues. Mind you, it doesn't get you anything more than the location of the object - or, in some instances now, you might find a PURL (persistent URL) field in a record that will take you to a digital object (sometimes a book or an audio/video stream) - but if it's an analog object (such as an undigitized book), you're not going to actually get the book, only its location. Though it's difficult to gain direct access to the object you want, these record databases are vital to a library's inventory, so it's not as if they're unnecessary. They just don't do everything users demand today.
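The convention for that digital link is real: in MARC, field 856 ("Electronic Location and Access"), subfield $u, is where a PURL typically lives. Here's a toy sketch, with a dict standing in for a parsed record and an invented PURL - a real record would come from MARCXML or a library like pymarc:

```python
# Toy stand-in for a parsed MARC record: 245 $a is the title proper,
# 856 $u the electronic location. The PURL itself is invented.
record = {
    "245": {"a": "Annual report of the Surgeon General."},
    "856": {"u": "http://purl.example.org/net/12345"},
}

location = record.get("856", {}).get("u")
if location:
    print("Digital object available at:", location)
else:
    print("Analog only - the record tells you where it sits, not what it says.")
```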

Another way to find information is through commercial databases. These contain aggregations of journals, periodicals, articles, books, etc. Personally, I find these avenues very frustrating. Even if you can gain access to them (if you're a student at a university, say, you may need to go through a proxy server to do research off campus or outside the library itself; and if you're not a student, you may not have access anywhere BUT the library), they're not easy to navigate, and they may or may not offer full-text access. You may only find a simple citation.

This is a personal pet peeve of mine, and I don't think I'm alone on this. Here's how these things work: some vendor decides to provide digital imaging of, let's say, scientific journals, and they lump a bunch of these journals together. They go off and sell these packaged journals to libraries. The good thing for libraries is that less physical storage is necessary and, in some cases, they may even get a break in the deal - say, X package of journals at a discount if they also buy Y package. It can be vastly cheaper on many fronts than buying the paper products. Sounds good, right?

The problem - and this is where the peeve comes in - is that these vendors can suddenly decide to drop any of their journals, even though you, the library, bought the rights to the electronic copies for X number of years. Not only does this cause a business dilemma, it causes a great ethical dilemma concerning your users. What if you're a library in a major medical university setting? Do you think a disease is going to just stop happening because your doctors and internists can't get the information they need? Of course not. So it becomes, to me at least, a way for the commercial vendors to hold information hostage.

But libraries have little choice - nobody has the resources to digitize the stuff themselves, space is a rare commodity for everyone it seems, and the content is massive in scale - so they're actually better off sticking with the vendor and hoping the information remains available. In those instances where something is discontinued, the vendor usually supplies a copy of the digital files to the library - after all, they've paid for it. Then, lo and behold, the journals show up on CDs, even though a library generally isn't equipped to store or manage them, OR to provide the internal interface that would give its users access through a browser. It's a catch-22, and I look forward to some whiz kid coming up with a way to circumvent this hold vendors have over our information.

I might mention that both of the information retrieval methods above offer some amount of keyword searching - more so the commercial databases - though both employ controlled vocabularies, especially OPAC records. But the aggregated databases don't generally offer keyword searches of the full-text documents, only of the abstracts or citations. As such, a lot of relevant information may go unfound by the user. And, of course, with MARC records, full-text keyword searching does little good, since the records are succinct, distilled information bits - well-informed and carefully crafted bits that come at the price of skilled labor, i.e. $$$, but bits nonetheless.
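A quick sketch of why abstract-only indexing stings (the article is invented for illustration): the keyword lives in the body but not in the abstract, so an abstract-only search comes up empty:

```python
# The keyword appears in the full text but not in the abstract, so a
# search limited to abstracts misses a relevant article entirely.
article = {
    "abstract":  "A survey of respiratory illness in urban populations.",
    "full_text": "...we observed a sharp rise in influenza admissions...",
}

def search(art, keyword, fields):
    """Return True if the keyword appears in any of the given fields."""
    return any(keyword.lower() in art[f].lower() for f in fields)

print(search(article, "influenza", ["abstract"]))               # False: missed
print(search(article, "influenza", ["abstract", "full_text"]))  # True: found
```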

The other method of information retrieval - and this is where metadata plays a huge and developing role - is the internet. The good things are: it's free to anyone who has access; it provides wide-ranging content in a number of languages; the goods sit on millions of servers, so the content isn't so vulnerable to loss; and it has become very user-friendly, what with tag clouds, folksonomies, and the like. Users generate the information, so users may be better able to find the information. Of course, the problems are everything I just mentioned. Because it's user-generated, it may not always be reliable information, and there's tons of redundancy in it. How many times have you done a Google search and had 10+ pages of hits returned to you? And a lot of those hits come from commercial sites - hardly the harbingers of reliable, unbiased information. Too, a search engine won't always be able to search the "deep web," so there's a lot of hidden information that won't be returned to you.
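A folksonomy in miniature (tags invented): users tag the same object freely, and the raw counts become a tag cloud - useful, and redundant by nature:

```python
from collections import Counter

# Freely applied user tags for one object; duplicates and near-duplicates
# are the norm, which is exactly the redundancy complained about above.
user_tags = ["jazz", "Jazz", "music", "jazz", "1950s", "miles davis", "music"]

cloud = Counter(tag.lower() for tag in user_tags)
for tag, weight in cloud.most_common():
    print(f"{tag:12} {'*' * weight}")  # bigger weight = bigger font in a real cloud
```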

This is not to say you can't find good stuff on the Web - you can - but the information-illiterate among us will just take the first thing that comes to them and think it's the gospel truth - like the email circular claiming Obama is a Muslim - hello, my mother sent it to me, so it must be true, right? My mother would never lie. Right. But what we face as information professionals is the speed and convenience the web offers to everyone. In many ways, it levels the playing field between the haves and the have-nots. Information isn't locked in a dusty archive or ivory-tower university anymore - and because it's not, because that playing field is open to each and every one with the wherewithal to go looking, our older, trusty avenues of information, like our OPACs and commercial databases, are lagging behind the expectations of our users. And, so, there ya go.

Stick a fork in me, I'm done.....
