Indexing Databases

A Concordance database must be indexed prior to searching. When you index your database for the first time, you are actually creating and populating the Concordance dictionary and index files. How your database records are indexed ultimately affects how reviewers search for information:

Full-text searching works only on indexed fields

Relational searching is for non-indexed fields or keyed fields

Paragraph fields are indexed by default

You can index just about anything in your database depending on how you create the database structure. When you are building your database, you want to plan which fields to include in the dictionary and index. The smaller the dictionary and index, the faster your searching and indexing speeds will be.

Indexing and reindexing databases is important for keeping your database updated with current review information, free of unnecessary and obsolete files, and efficient for processing full-text searches. Indexing large datasets is time-consuming, but it is a standard process and part of database maintenance.

When Concordance databases are built, the index and dictionary are generated from the document contents. The dictionary contains a list of every word or string of characters in the database's record collection. The index contains directions to every word or string of characters in the database.

Please consider the following before you index a database and create the dictionary:

Avoid indexing serial and Bates numbers (these are unique value fields)

Punctuation needs to be set only once and only pertains to indexed fields

Punctuation is indexed only if embedded between alphanumeric characters. All leading and trailing punctuation is trimmed.

Update the database's stopwords list to exclude additional words that you want ignored during indexing and searching

A well-defined stopwords list keeps your dictionary and index lean

Note

If indexing or reindexing speeds seem slow, you may want to increase your computer's RAM and check your cache settings on the Indexing tab in Preferences.

Index Files

Indexing scans the database and notes where each word occurs. These occurrences are stored in two files created during the indexing process, the dictionary file (.dct) and the inverted text file (.ivt).  When you perform a search, Concordance looks in these two files for your words, not in the actual text of the database.  Due to the structure of these files, the search is performed very rapidly, much faster than searching each document one-by-one for every word.

The dictionary file is stored in the database directory and contains all dictionary words with their hit and document counts. The inverted text file contains a path to every word: for each word in the .dct file, it records the applicable record, field, line, and word numbers. Both the dictionary file and the inverted text file use a B-tree data structure, and the size of each file affects searching and indexing speed. Full-text searches use only these two files.
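The relationship between the two files can be illustrated with a minimal in-memory sketch. This is not Concordance's on-disk format (which uses B-trees); the function name `build_index`, the record layout, and the use of plain Python dictionaries are assumptions made only for illustration.

```python
from collections import defaultdict

def build_index(records):
    """Build a toy dictionary and inverted list.

    records: {record_number: {field_name: text}} (hypothetical layout).
    Returns (dictionary, postings), loosely mirroring the .dct and .ivt roles.
    """
    # word -> list of (record, field, line, word position) locations
    postings = defaultdict(list)
    for rec_no, fields in records.items():
        for field, text in fields.items():
            for line_no, line in enumerate(text.splitlines(), start=1):
                for word_no, word in enumerate(line.split(), start=1):
                    postings[word.upper()].append((rec_no, field, line_no, word_no))

    # word -> hit count and document count
    dictionary = {
        word: {"hits": len(locs), "docs": len({loc[0] for loc in locs})}
        for word, locs in postings.items()
    }
    return dictionary, dict(postings)

records = {
    1: {"TEXT": "contract signed\nrevised contract"},
    2: {"TEXT": "contract draft"},
}
dictionary, postings = build_index(records)
```

In this sketch a search for a word consults only `dictionary` and `postings`; the record text itself is never scanned again, which is the reason an indexed search is fast.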

Note

Building the initial index for a large database can take many hours.  Reindexing a day’s worth of new documents could take a few hours, so it is better to reindex after entering a few records so that new content becomes searchable in a timely manner.

Indexing Process

Concordance follows several rules when indexing the database. Words must begin with an alphabetic or numeric character. Once the beginning of a word is found, Concordance scans until it finds the first non-alphanumeric character. This character is compared against the list of embedded punctuation characters.  If the character is found in the punctuation list and the following character is alphanumeric, then that punctuation is included in the word. Otherwise, the first non-alphanumeric character will mark the end of the word.
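The scanning rules above can be sketched as a small tokenizer. This is an approximation written from the description, not Concordance's actual code; the function name `tokenize` and the default punctuation string (an apostrophe, period, comma, and slash) are assumptions.

```python
def tokenize(text, embedded="'.,/"):
    """Split text into words using the embedded-punctuation rule.

    embedded: characters kept inside a word when the character that
    follows them is alphanumeric (assumed default set).
    """
    words = []
    i, n = 0, len(text)
    while i < n:
        # A word must begin with an alphabetic or numeric character.
        if not text[i].isalnum():
            i += 1
            continue
        start = i
        while i < n:
            if text[i].isalnum():
                i += 1
            elif text[i] in embedded and i + 1 < n and text[i + 1].isalnum():
                i += 1  # embedded punctuation stays in the word
            else:
                break   # any other character ends the word
        words.append(text[start:i])
    return words

sample = tokenize("Paid $3.50 on 1/2/2020, don't wait.")
```

Here the decimal number, the date, and the contraction each stay whole, while the comma after the date and the final period are dropped because no alphanumeric character follows them.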

Embedded Punctuation

User-definable embedded punctuation is provided so that hyphenated words, dates, decimal numbers, and contractions are not split into two or more words. By default, Concordance uses the ‘ . , / characters for embedded punctuation. Note that the hyphen is not included in the default set. You may want to include it, but it is recommended that you leave it out: proper names, such as Mary Smith-Jones, would only be searchable under the Smith prefix if hyphens were used as embedded punctuation. Use the Punctuation field in the Modify dialog to change the default characters.

Case Sensitivity

Concordance is not case sensitive. All words are converted to upper case letters when placed into the dictionary. All searches are likewise converted to upper case before being processed. This upper case conversion does not affect the original text that exists in your documents.

Word Length

A word can be any length, but only the first 64 characters are considered significant. Longer words are truncated to 64 characters when they are stored in the dictionary. When you search for a word longer than 64 characters, your search word is truncated before being looked up in the dictionary. The source text is not affected.
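The case and length rules amount to one normalization step applied both to indexed words and to search terms. The following is a minimal sketch; the names `MAX_WORD_LEN` and `dictionary_key` are hypothetical, and the order of the two operations is an assumption.

```python
MAX_WORD_LEN = 64  # only the first 64 characters are significant

def dictionary_key(word):
    """Return the form under which a word is stored and looked up."""
    return word.upper()[:MAX_WORD_LEN]

stored = dictionary_key("Contract")
long_a = dictionary_key("x" * 64 + "tail")
long_b = dictionary_key("X" * 64 + "other")
```

Because the same function is applied at index time and at search time, two long words that share their first 64 characters (here `long_a` and `long_b`) resolve to the same dictionary entry, and the original text is never modified.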

Stopwords

Words that occur frequently, such as the, and, and or, have little search value. Such words are commonly referred to as noise words. Concordance stores these types of words in a stopword list. Words that occur in the stopword list are not stored in the dictionary. Excluding them from the dictionary saves time in the indexing process and significant disk space on your computer without impairing the database's ability to retrieve data. The list of stopwords is user defined and can be printed or changed. If you have not specified any stopwords for your database, Concordance uses the default list in the concordance_<version>.stp stopwords file.

The stopword dictionary is used during the indexing process. Adding or deleting words from the stopword dictionary does not affect the existing database dictionary. Editing the stopword dictionary requires a complete index of the database for the changes to take effect.
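The reason an edited stopword list has no immediate effect can be shown with a short sketch: stopwords are consulted only while the dictionary is being built. The function name `build_dictionary` and the sample data are assumptions for illustration, not part of Concordance.

```python
def build_dictionary(words, stopwords):
    """Count words, skipping any that appear in the stopword list."""
    stop = {w.upper() for w in stopwords}
    counts = {}
    for w in words:
        key = w.upper()
        if key in stop:
            continue  # stopwords never reach the dictionary
        counts[key] = counts.get(key, 0) + 1
    return counts

words = "the memo and the contract".split()
stopwords = ["the", "and"]
dictionary = build_dictionary(words, stopwords)

# Editing the list does not touch the dictionary that already exists.
stopwords.append("memo")
unchanged = dict(dictionary)

# Only rebuilding the dictionary applies the new list.
rebuilt = build_dictionary(words, stopwords)
```

In this sketch `unchanged` still contains MEMO after the list is edited, and the word disappears only in `rebuilt`, which corresponds to running a complete index after changing the stopwords.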