Definition and Terminology of Information Retrieval (IR)
The term “IR”, as used in this paper, refers to the retrieval of unstructured records, that is, records
consisting primarily of free-form natural language text. Of course, other kinds of data can also be
unstructured, e.g., photographic images, audio, video, etc. However, IR research has focused on
retrieval of natural language text, a reasonable emphasis given the importance and immense volume
of textual data on the internet and in private archives.
Some points of terminology should be clarified here. The records that IR addresses are often
called “documents”. That term will be used here. IR often addresses the retrieval of documents
from an organized (and relatively static) repository, most commonly called a “collection”. That
term will also be used here. (The words “archive” and “corpus” are also used, and the term
“digital library” is becoming very common, but the generic term “collection” is still the one
most commonly used in the research literature.) However, it should be understood that IR is not
restricted to static collections. The collection may instead be a stream of messages, e.g., e-mail
messages, faxes, or news dispatches, flowing over the internet or some private network.
Structured vs. Unstructured vs. Semi-Structured Documents
Records may be structured, unstructured, semi-structured, or a mixture of these types. A record is
structured if it consists of named components, organized according to some well-defined syntax.
Typically, a structured database will have multiple record types such that all records of a given
type have the same syntax, e.g., all rows in a table of a relational database will have the same columns
[Date, 1981; Salton, 1983]. Moreover, each component of a record will have a definite
“meaning” (“semantics”) and a given component of a given record type will have the same
semantics in every record of that type. The practical effect is that given the name of a component,
a search and retrieval engine (such as a DBMS) can use the syntax to find the given component in
a given record and retrieve its contents (its “value”). Similarly, given a component and a value, the
search engine can find records in which the given component contains (“has”) the given value.
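The two operations just described can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular DBMS is implemented: the record type, its component names, the sample data, and the helper functions get_component and find_by_value are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Employee:
    # Hypothetical record type: every record has the same named components.
    name: str
    age: int
    department: str


def get_component(record, component):
    """Given a component name, return its value in one record."""
    return getattr(record, component)


def find_by_value(records, component, value):
    """Given a component and a value, return the records that have that value."""
    return [r for r in records if getattr(r, component) == value]


# Made-up sample data, for illustration only.
employees = [
    Employee("Ada", 36, "Research"),
    Employee("Grace", 45, "Systems"),
    Employee("Alan", 36, "Research"),
]

# The record syntax is known in advance: the component names are fixed.
print([f.name for f in fields(Employee)])

# Name -> value, for a single record.
print(get_component(employees[0], "age"))

# (Name, value) -> matching records.
print([e.name for e in find_by_value(employees, "age", 36)])
```

Both lookups rely only on the component name, which is possible because every record of the type shares the same syntax and the same semantics for each component.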
For example, a relational DBMS can be asked to retrieve the contents of the “age” column of an
“Employee” table in a “Personnel” database. The DBMS knows how to find the “Employee” table
within the “Personnel” database, and how to find the “age” column within each record of the
“Employee” table. And every “age” column within the “Employee” table will have the same
semantics, i.e., the age of some employee. The column name “age” may not be sufficient to identify
the column; an “Equipment” table in the same or a different database may also have an “age”
column. Hence in general, it may be necessary to specify a path, e.g., database name, table name,
column name, to uniquely identify the syntactic component to the search engine. However, the
syntax of a well-structured database is such that it is always possible to specify a given syntactic
component uniquely and hence it is always possible for the search engine to find all occurrences
of a given component. If the given component has a definite semantics, then it is always possible
for the search engine to find data with that semantics, e.g., to find the ages of all employees.
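The need for a qualifying path can be illustrated with a minimal sketch that uses Python's standard sqlite3 module. The in-memory database stands in for the “Personnel” database; the extra columns and the sample rows are assumptions made up for the illustration. In SQLite, “main” is the built-in schema name of the primary database, so main.Employee.age plays the role of the database name, table name, column name path.

```python
import sqlite3

# An in-memory database stands in for the "Personnel" database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee  (name  TEXT, age INTEGER);
    CREATE TABLE Equipment (label TEXT, age INTEGER);
    INSERT INTO Employee  VALUES ('Ada', 36), ('Grace', 45);
    INSERT INTO Equipment VALUES ('Printer', 7), ('Server', 3);
""")

# A full path (schema, table, column) identifies the component uniquely.
employee_ages = [
    row[0]
    for row in conn.execute(
        "SELECT main.Employee.age FROM main.Employee ORDER BY main.Employee.age"
    )
]
equipment_ages = [
    row[0]
    for row in conn.execute(
        "SELECT main.Equipment.age FROM main.Equipment ORDER BY main.Equipment.age"
    )
]

# The bare name "age" is not sufficient when both tables are in scope:
# SQLite rejects the query because the column name is ambiguous.
try:
    conn.execute("SELECT age FROM Employee, Equipment")
    ambiguous = False
except sqlite3.OperationalError:
    ambiguous = True

print(employee_ages, equipment_ages, ambiguous)
conn.close()
```

Each qualified query returns only values with one semantics (employee ages or equipment ages), while the unqualified query is refused rather than answered with a mixture of the two.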