Indexing Databases

A Concordance Desktop database must be indexed before it can be searched. When indexing runs during the creation of a new database, it creates and populates the Concordance Desktop dictionary and index files for the first time. How your database records are indexed ultimately affects how reviewers search for information:

Full-text searching works only on indexed fields

Relational searching is for non-indexed fields or keyed fields

Paragraph fields are indexed by default

You can, however, index just about anything in your database, depending on how you create the database structure. When you are building your database, plan which fields to include in the dictionary and index. The smaller the dictionary and index, the faster your searching and indexing speeds will be.

Indexing versus Reindexing:

Indexing is performed when the initial database is built, and needs to be performed again after every database modification, including changes to fields, punctuation, or the stopwords list. Indexing is an exclusive process.

Reindexing is performed when new records are added to the database or when there are new annotations and modifications to record content. Reindexing appends new information to both the index and dictionary files.

Indexing Considerations

Indexing and reindexing are integral to keeping your database updated with current review information, free of unnecessary and obsolete files, and processing efficiently for full-text searches by the review team. Indexing large datasets is time consuming, but it is a standard process and should be part of your database maintenance schedule.

When Concordance Desktop databases are built, the index and dictionary are generated from the document contents. The dictionary contains a list of every word or string of characters in the database's record collection. The index contains directions to every word or string of characters in the database.

Example: airforce1 or ABC00001

Both airforce1 and ABC00001 qualify as index entries and are searchable. Concordance Desktop treats each example, letters and numbers together, as a single word because there is no space between the characters. A space would split it: airforce 1 would be read as two words.

Please consider the following before you create a database or index it and create the dictionary:

Avoid indexing serial and Bates numbers (unique value fields)

Punctuation needs to be set only once and only pertains to indexed fields

Punctuation is indexed only if embedded between alphanumeric characters. All leading and trailing punctuation is trimmed.

Update the database's stopwords list to exclude additional words that you want ignored during indexing and searching

A well-defined stopwords list keeps your dictionary and index lean

If indexing or reindexing processing speeds seem slow, you may want to increase your Concordance Desktop server's RAM and check your Indexing cache settings on the Settings tab in the Admin Console.

Index Files

Indexing scans the database and notes where each word occurs. These occurrences are stored in two files created during the indexing process, the dictionary file (.dct) and the inverted text file (.ivt). When you perform a search, Concordance Desktop looks in these two files for your words, not in the actual text of the database. Due to the structure of these files, the search is performed very rapidly, much faster than searching each document one-by-one for every word.

The dictionary file is stored in the database directory folder. The .dct file contains all dictionary words and their hit and document counts.

The inverted text file is accessed along with the .dct file when reviewers perform searches. For each word in the .dct file, the .ivt file records where the word occurs: the record, field, line, and word number of every occurrence.

Both the .dct and .ivt files contain a B-tree data structure, and the size of each file is important because full-text searches access only these two files.
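The relationship between the two files can be illustrated with a small in-memory sketch. This is not the actual .dct or .ivt file format, which this topic does not describe; it only mirrors the information each file is said to hold. The names build_index and search are hypothetical, plain Python dictionaries stand in for the on-disk B-trees, and words are split on whitespace only for brevity.

```python
from collections import defaultdict


def build_index(records):
    """Build two lookup structures from {record_id: {field_name: text}}.

    dictionary[word] -> hit count and document count (the role of the .dct file)
    postings[word]   -> (record, field, line, word number) per occurrence
                        (the role of the .ivt file)
    """
    postings = defaultdict(list)
    for record_id, fields in records.items():
        for field_name, text in fields.items():
            for line_no, line in enumerate(text.splitlines(), start=1):
                # Simplification: split on whitespace only.
                for word_no, word in enumerate(line.split(), start=1):
                    postings[word.upper()].append(
                        (record_id, field_name, line_no, word_no)
                    )
    dictionary = {
        word: {"hits": len(locs), "docs": len({loc[0] for loc in locs})}
        for word, locs in postings.items()
    }
    return dictionary, dict(postings)


def search(word, dictionary, postings):
    """Answer a query from the two structures alone; record text is not rescanned."""
    key = word.upper()
    if key not in dictionary:
        return []
    return postings[key]


records = {
    1: {"TEXT": "contract signed\nby the vendor"},
    2: {"TEXT": "vendor contract renewal"},
}
dictionary, postings = build_index(records)
print(dictionary["CONTRACT"])                   # {'hits': 2, 'docs': 2}
print(search("vendor", dictionary, postings))   # [(1, 'TEXT', 2, 3), (2, 'TEXT', 1, 1)]
```

Because a search consults only these two structures, its cost depends on their size rather than on the volume of record text, which is why a lean dictionary and index matter.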

Building the initial index for a large database can take many hours. Reindexing a day’s worth of new documents could take a few hours, so it is better to reindex after entering a few records so that the new content becomes searchable in a timely manner.

Indexing Process

Concordance Desktop follows several rules when indexing the database. Words must begin with an alphabetic or numeric character. Once the beginning of a word is found, Concordance Desktop scans until it finds the first non-alphanumeric character. This character is compared against the list of embedded punctuation characters. If the character, such as a decimal point, is found in the list and the following character is alphanumeric, then that punctuation is included in the word. Otherwise, the first non-alphanumeric character will mark the end of the word.
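The scanning rules above can be sketched as a small tokenizer. This is an illustration of the rules as described, not Concordance Desktop's actual code. The function name tokenize is hypothetical, the punctuation set is the default listed under Embedded Punctuation, and Python's str.isalnum() is assumed to be an acceptable stand-in for "alphabetic or numeric".

```python
DEFAULT_PUNCTUATION = "'.,/"  # default embedded punctuation set from this topic


def tokenize(text, punctuation=DEFAULT_PUNCTUATION):
    """Split text into words using the scanning rules described above."""
    words = []
    i, n = 0, len(text)
    while i < n:
        # A word must begin with an alphabetic or numeric character.
        if not text[i].isalnum():
            i += 1
            continue
        start = i
        while i < n:
            if text[i].isalnum():
                i += 1
            elif text[i] in punctuation and i + 1 < n and text[i + 1].isalnum():
                # Embedded punctuation: in the list and followed by an
                # alphanumeric character, so it stays inside the word.
                i += 1
            else:
                # Any other non-alphanumeric character ends the word.
                break
        words.append(text[start:i])
    return words


print(tokenize("airforce1 cost $3.50, don't stop."))
# ['airforce1', 'cost', '3.50', "don't", 'stop']
print(tokenize("air force 1"))
# ['air', 'force', '1']
```

The comma after 3.50 and the final period are dropped because neither is followed by an alphanumeric character, which matches the rule that leading and trailing punctuation is trimmed.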

Embedded Punctuation

User-definable embedded punctuation is provided so that hyphenated words, dates, decimal numbers, and contractions are not split into two or more words. By default, Concordance Desktop uses the ' . , / characters for embedded punctuation. Note that the hyphen is not included in the default set. You can add it, but it is recommended that you leave it out: if hyphens were used as embedded punctuation, a proper name such as Mary Smith-Jones would be indexed as the single word Smith-Jones and would be searchable only under the Smith prefix, not under Jones. Use the Punctuation field in the Modify dialog box to change the default characters.
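The effect of adding the hyphen can be seen in a short sketch. It assumes the word rules described under Indexing Process and expresses them as a regular expression limited to ASCII letters and digits. The function name index_words is hypothetical and is not part of Concordance Desktop.

```python
import re


def index_words(text, punctuation):
    """Return upper-cased words, keeping punctuation only when embedded."""
    joiner = "[" + re.escape(punctuation) + "]"
    # One or more alphanumeric runs, joined by single embedded punctuation marks.
    pattern = re.compile(r"[A-Za-z0-9]+(?:" + joiner + r"[A-Za-z0-9]+)*")
    return [word.upper() for word in pattern.findall(text)]


default_set = "'.,/"
with_hyphen = default_set + "-"
name = "Mary Smith-Jones"

print(index_words(name, default_set))   # ['MARY', 'SMITH', 'JONES']
print(index_words(name, with_hyphen))   # ['MARY', 'SMITH-JONES']

# With the hyphen embedded, JONES is no longer a dictionary word of its own.
print("JONES" in index_words(name, with_hyphen))   # False
```

With the default set, both halves of the name are separate dictionary entries. With the hyphen added, only a search that starts from Smith can reach the combined entry.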

Case Sensitivity

Concordance Desktop is not case sensitive. All words are converted to upper case letters when placed into the dictionary. All searches are likewise converted to upper case before being processed. This upper case conversion does not affect the original text that exists in your documents.

Word Length

A word can be any length, but only the first 64 characters are considered significant. Longer words are truncated to 64 characters when they are stored in the dictionary. When you search for a word longer than 64 characters, your search word is truncated in the same way before being looked up in the dictionary. The source text is not affected.
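The case and length rules amount to applying the same normalization to both indexed words and search terms. A minimal sketch, assuming upper-casing happens before truncation (the order is not stated in this topic) and using the hypothetical name dictionary_key:

```python
MAX_SIGNIFICANT = 64  # only the first 64 characters are significant


def dictionary_key(word):
    """Normalize a word the same way at index time and at search time."""
    return word.upper()[:MAX_SIGNIFICANT]


# Case differences disappear in the dictionary.
print(dictionary_key("Contract") == dictionary_key("CONTRACT"))   # True

# Words that differ only after character 64 share one dictionary entry.
long_a = "x" * 64 + "alpha"
long_b = "x" * 64 + "beta"
print(dictionary_key(long_a) == dictionary_key(long_b))           # True
print(len(dictionary_key(long_a)))                                # 64
```

A consequence of the second rule is that a search for one very long word also retrieves any other word sharing the same first 64 characters.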

Stopwords

Words that occur frequently, such as the, and, and or, have little search value. Such words are commonly referred to as noise words. Concordance Desktop stores these types of words in a stopword list. Words that occur in the stopword list are not stored in the dictionary. Excluding them from the dictionary saves time in the indexing process and significant disk space on your computer without impairing the database's ability to retrieve data. The list of stopwords is user defined and can be printed or changed in the Stopwords dialog box. If you have not specified any stopwords for your database, Concordance Desktop uses the default list in the concordance_[version #].stp stopwords file.

The stopword list is consulted only during the indexing process. Adding or deleting stopwords does not affect the existing database dictionary; you must run a complete index of the database for the changes to take effect.
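The following sketch illustrates why a stopword change has no effect until the database is indexed again. It is a simplified model, not the product's implementation: build_dictionary is a hypothetical name, and the dictionary is reduced to a word-to-hit-count mapping.

```python
def build_dictionary(words, stopwords):
    """Count each word, skipping any word found in the stopword list."""
    stop = {w.upper() for w in stopwords}
    counts = {}
    for word in words:
        key = word.upper()
        if key in stop:
            continue  # stopwords are never stored in the dictionary
        counts[key] = counts.get(key, 0) + 1
    return counts


text = "the vendor signed the contract and the amendment".split()
stopwords = ["the", "and", "or"]

dictionary = build_dictionary(text, stopwords)
print(dictionary)
# {'VENDOR': 1, 'SIGNED': 1, 'CONTRACT': 1, 'AMENDMENT': 1}

# Editing the stopword list does not touch the dictionary that already exists.
stopwords.append("vendor")
print("VENDOR" in dictionary)   # True

# Only a complete index applies the new list.
dictionary = build_dictionary(text, stopwords)
print("VENDOR" in dictionary)   # False
```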

For more information about stopwords, see Updating the stopwords list.