Reviewing load files


Load files, or delimited text files, typically have a .dat, .csv, or .txt extension. Each file contains record metadata, and some also include document-level text, sometimes referred to as OCR text.

Although Concordance Desktop can import OCR text that is contained in a load file (DAT file), we recommend as a best practice that you separate your OCR text into individual text files and import it using an OPT file that references the location of each OCR text file.

As an administrator, make it a practice to open and review delimited text files when you receive them. The files are not always prepared perfectly and may need to be modified.

To Review a Load File

1. Open the load file in any text editor program.

Delimited text files can be opened with any text editor program, such as Notepad. We recommend using an advanced text editor program like TextPad® or UltraEdit®.

2. Review the load file for the following elements:

The file must be in a text-based format with a .dat or .csv extension.

If there is no header row containing field names, open the associated .tif file for the first record and match the data in the record to the content of the .tif file.

Note the delimiters used in the file. Concordance Desktop can handle any standard text delimiters.

Note the date format used in the file. Concordance Desktop can load dates containing slashes in any order with either 2- or 4-digit years, up to a maximum of 8 digits. The only date formats Concordance Desktop can load without slashes are the universal date format, YYYYMMDD, and the mm-dd-yyyy date format with dashes.

Check the line number of your last record to note the number of records you expect to be added to the database. Subtract the header row if the Skip first line option is selected during import.

Is there a carriage return at the end of the last record? If not, add one. Concordance Desktop will not load the last record if the carriage return at the end of the record is missing.

Concordance Desktop database field names do not support special characters. Make sure that the header row does not contain any special characters (e.g., #, $). Special characters prevent the wizard from creating the fields properly.
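Several of the checks above can be scripted before you open the import wizard. The following is a minimal Python sketch, not a Concordance Desktop feature. The function names, the default delimiters (ASCII 20 between fields and ASCII 254 around values), and the definition of a special character are assumptions made for illustration. Adjust them to match the file you received. The date check only tests the shape of a value and does not confirm that it is a real calendar date.

```python
import re

# Assumed delimiters for illustration only: ASCII 20 between fields and
# ASCII 254 around values. Pass the delimiters your load file actually uses.
FIELD_DELIMITER = "\x14"
TEXT_QUALIFIER = "\xfe"


def review_load_file(text, delimiter=FIELD_DELIMITER, qualifier=TEXT_QUALIFIER):
    """Summarize the header, record count, and final line break of a load file.

    `text` is the decoded content of the file. Assumes one record per line
    and a header row on the first line.
    """
    ends_with_line_break = text.endswith(("\r", "\n"))
    # Split on CR LF, CR, or LF only, so other control characters inside
    # field values are not mistaken for record breaks.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final line break
    header = [name.strip(qualifier) for name in lines[0].split(delimiter)] if lines else []
    # Assumption: anything other than a letter, digit, or underscore counts
    # as a special character in a field name.
    bad_names = [name for name in header if re.search(r"[^A-Za-z0-9_]", name)]
    return {
        "header": header,
        "bad_field_names": bad_names,
        "record_count": max(len(lines) - 1, 0),  # header row subtracted
        "ends_with_line_break": ends_with_line_break,
    }


def looks_like_loadable_date(value):
    """Loosely check a value against the date shapes listed above."""
    if re.fullmatch(r"[0-9]{8}", value):  # YYYYMMDD
        return True
    if re.fullmatch(r"[0-9]{2}-[0-9]{2}-[0-9]{4}", value):  # mm-dd-yyyy
        return True
    if re.fullmatch(r"[0-9]{1,4}/[0-9]{1,4}/[0-9]{1,4}", value):
        return len(value) - 2 <= 8  # at most 8 digits once the slashes are removed
    return False
```

With this sketch, a header field named DOC# is listed under bad_field_names, and a file whose last record has no closing line break is reported with ends_with_line_break set to False.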

Load files that include OCR text

If the OCR text is included in the load file (DAT file), Concordance Desktop provides a check box that you can select to have the import process read the text from the load file and write it to the designated OCR field in the database records. Before creating a database from a load file that includes OCR text, we recommend that you ensure there is a designated OCR field for each OCR text entry.

Note that an OCR field can contain up to 12 million characters. When the import process reaches the 12 million character limit, the text overflows into the next OCR field. If one does not exist in the database, the import process creates the field. At that point, a '1' is added to the name of the original OCR field (OCR1), and the second OCR field takes the same name as the original OCR field with a '2' added to the end (OCR2). The database then has two OCR fields: OCR1 and OCR2 (or TEXT1 and TEXT2). All characters following the 12 millionth character are placed into the second OCR field (OCR2) of the record. If the character limit is reached a second time, the import process creates a third and final OCR field, giving it the same name as the other two with a '3' added to the end (OCR3). All characters following the 24 millionth character are placed into the third OCR field (OCR3).
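The overflow naming described above can be illustrated with a short Python sketch. This is a model of the documented behavior, not the import process itself. The function name and its parameters are hypothetical, and the sketch counts characters the way Python does, which may differ from how Concordance Desktop counts them.

```python
OCR_FIELD_LIMIT = 12_000_000  # characters per OCR field, as stated above


def split_into_ocr_fields(text, base_name="OCR", limit=OCR_FIELD_LIMIT, max_fields=3):
    """Return a dict mapping OCR field names to the text each would hold."""
    chunks = [text[i:i + limit] for i in range(0, len(text), limit)] or [""]
    if len(chunks) > max_fields:
        # The description above stops at a third and final field, so this
        # sketch refuses to guess what happens beyond it.
        raise ValueError("text does not fit in the available OCR fields")
    if len(chunks) == 1:
        return {base_name: chunks[0]}  # no overflow, so the field keeps its name
    # On overflow, every field gets a numeric suffix, starting with 1.
    return {f"{base_name}{n}": chunk for n, chunk in enumerate(chunks, start=1)}
```

Passing a small limit makes the behavior easy to see: with limit=3, the text "abcdefg" is split across OCR1, OCR2, and OCR3, while the text "abc" stays in a single field named OCR.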

When you create a new database from a load file that contains OCR text (document-level text), the import process places the text for each record into the designated OCR field, usually named OCR, TEXT, or something similar. When the document-level text exceeds the capacity of that OCR field (12 million characters), Concordance Desktop automatically overflows the text into the next available field, regardless of whether that field is set up for OCR text. To prevent the overflow text from filling a field not designed for it, you need to ensure that there are multiple OCR fields to contain the document-level text (e.g., OCR1, OCR2, OCR3), and that after 12 million characters have been reached, the next character starts in the next incremented OCR field. With OCR text that is external to the load file, this isn't a problem, because you can create additional OCR fields using the Customize feature in the import process. However, when the OCR text is included in the load file, the field name is included with each line of text. When the document-level text reaches the 12 million character limit, you need to ensure that the next character begins in the next incremented OCR field.

When your OCR text (document-level text) is included in the delimited text file, the process is the same, except that the text comes from the field in the delimited text file that contains the document-level text. This field is usually named OCR, but may be named TEXT or something similar. During the import process, the field needs to be identified so that the document-level text can be copied into each record. When checking the delimited text file, keep in mind that if the document-level text exceeds the 12 million character limit, Concordance Desktop automatically overflows the text into the next available field. For this reason, numbered extensions of the OCR field need to be defined for the overflow (for example, OCR2, OCR3, OCR4), and the document-level text in the delimited file needs to be broken up after every set of 12 million characters, with the OCR field incremented so that the next character set begins in the next OCR field. This means that you need to change the field name every 12 million characters, incrementing the number by 1. For example, at character 12,000,001 (12 million + 1) you need to split the text between two lines and ensure that character 12,000,001 is placed into the next OCR field.
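The arithmetic behind the split can be sketched in a few lines of Python. The function names are hypothetical, and the only figure taken from the text above is the limit of 12 million characters per OCR field.

```python
def ocr_fields_needed(char_count, limit=12_000_000):
    """Number of OCR fields required to hold char_count characters."""
    return max(1, -(-char_count // limit))  # ceiling division, at least one field


def field_index_for_character(position, limit=12_000_000):
    """1-based number of the OCR field that holds the character at a 1-based position."""
    return (position - 1) // limit + 1
```

Under these assumptions, a record with exactly 12,000,000 characters needs one OCR field, and a record with 12,000,001 characters needs two. Character 12,000,001 falls in the second field, and character 24,000,001 falls in the third.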

When your OCR text is placed into individual document-level text files, the text is imported into each corresponding record based on fields that you create in the database to contain that text. You create these fields (usually OCR1, OCR2, OCR3, etc.) during the import process by using the Customize feature in the Load File window.