About the Unicode Standard

The Unicode Standard provides a consistent way to digitally represent the characters used in the written languages of the world. As an accepted universal standard in the computer industry, the Unicode Standard assigns each character a unique numeric value and name. This encoding standard provides a uniform basis for processing, storing, searching, and exchanging text data in any language.
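The idea of a unique numeric value and name per character can be illustrated outside Concordance. The following is a minimal Python sketch using the standard `unicodedata` module; the helper name `describe` is hypothetical and is not part of Concordance.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# One Latin, one Arabic, and one Chinese character, written as escapes
# so the listing reads the same in any editor.
samples = ["A", "\u0628", "\u56FD"]
for ch in samples:
    print(describe(ch))
```

Each character maps to exactly one code point and one name, which is what allows text in different languages to be stored and searched in a uniform way.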

Concordance version 10.x supports the Unicode Standard for Arabic, Chinese, English, Hebrew, Japanese, Korean, Russian, and other languages.

Note

Some Adobe PDF files with Arabic text do not display the Arabic text in the proper right-to-left order in Concordance. These PDF files display the text in reverse order (left-to-right) because the files report the language incorrectly or are not in the standard format.

Note

When sending data to a third-party software program using the Send To command, only ANSI text is sent.

Installing Language Packs

To display Unicode characters in Concordance, the appropriate language packs must be installed on the computer.

Issues and Tips

Currently, the Unicode Standard is supported when importing, searching, printing, and exporting documents in the supported languages. The following issues and tips are important to know when working with non-English documents.

Right-to-Left Documents

When importing documents in Right-to-Left (RTL) languages, such as Arabic, the imported text may be incorrectly justified to the left side. To change the justification to the right side, select the text and press the right [Ctrl] + right [Shift] keys.

Microsoft Excel Files

When importing Microsoft Excel files in Right-to-Left (RTL) languages, the spreadsheet cells may be displayed Left-to-Right instead of Right-to-Left.

File Names

File names containing Unicode characters are supported in Concordance.

Delimiters

The delimiters available from the drop-down lists in the Import Wizard, the Import Delimited Text dialog, the Export Wizard, the Export Delimited ASCII dialog, and the Overlay Database dialog may appear as square symbols or may not be displayed. How the lists are displayed depends on the computer's language environment.

Delimiters use the Tahoma font, which displays the characters regardless of the language environment. Any of the delimiter characters can be selected as a delimiter, even if the symbol it represents does not appear in the drop-down lists.

Removing Kashida Characters

Kashida characters are used in Arabic text to lengthen a word by elongating characters at certain points. The added Kashida characters change the word.

For example, the word for Term in Arabic is [image: no_kashida]. When Kashida characters are added, the word changes to [image: kashida].

Searching for the word Term with Kashida characters produces incomplete results because the search does not match the word Term without Kashida characters.

To prevent inaccurate searches, the Concordance administrator can remove the Kashida characters from the searchable text in the current database. To do this in Concordance, go to the File menu, select Administration, and then select Remove Kashida characters.
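To see why removing Kashida characters makes searches match, the effect can be sketched in Python. This is only an illustration and not how Concordance implements the command. It assumes the Kashida is the Unicode character U+0640 (ARABIC TATWEEL); the function name `remove_kashida` and the sample word (the Arabic word for "book", chosen for this sketch) are hypothetical.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the Kashida elongation character

def remove_kashida(text: str) -> str:
    """Return text with every Kashida (tatweel) character stripped."""
    return text.replace(TATWEEL, "")

# Sample word written without and with a Kashida after the first letter.
plain = "\u0643\u062A\u0627\u0628"
stretched = "\u0643\u0640\u062A\u0627\u0628"

print(plain == stretched)                  # the two spellings differ as strings
print(plain == remove_kashida(stretched))  # they match once the Kashida is removed
```

Because the two spellings are different character sequences, a literal search for one will not find the other until the Kashida characters are stripped from the searchable text.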

Words That Sound Like the Selected Word

When doing an Advanced Search from the Search task pane, selecting Display a list of words that sound like the selected word (also known as Fuzzy Search) only works with English language words. Using this option with words in other languages will display a list of words that do not sound like the selected word.

Navigating Search Results for Ideographic Languages

A character in an ideographic language, like Chinese, can represent a word. When navigating search results, each character is considered a separate hit. Clicking the Next hit and Previous hit buttons jumps to the next character in the search results.

For example, if your search term is the Chinese word for Mandarin Language School (国語学院), which has four characters, you will need to click Next hit four times to move through each occurrence of the word.
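The arithmetic behind this behavior can be sketched in a few lines of Python. The function name `clicks_to_traverse` is hypothetical, and the sketch assumes, as described above, that every character of every occurrence counts as one hit.

```python
def clicks_to_traverse(text: str, term: str) -> int:
    """Estimate Next hit clicks when each character of a match is its own hit."""
    occurrences = text.count(term)
    return occurrences * len(term)

term = "国語学院"              # four ideographic characters
text = "国語学院, 国語学院"    # sample text with two occurrences of the term

print(len(term))                       # characters (hits) per occurrence
print(clicks_to_traverse(text, term))  # total hits across both occurrences
```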

Data Validation Options

Database fields can be assigned data validation options from the Data Entry Attributes dialog box. However, certain validation options are only supported with English text. These include:

Upper case

Lower case

Alphabetic only

Numeric only

Match Whole Word Only

When searching for text using the Find or Replace commands, the Match whole word only check box does not work with ideographic languages such as Chinese. Clear the Match whole word only check box before searching for text in these languages.
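A general reason whole-word matching tends to fail for ideographic text is that words are not separated by spaces, so there is no word boundary to anchor on. The Python sketch below illustrates this with regular-expression word boundaries. It is an analogy only; the function name `find` is hypothetical, and Concordance's own matching logic is not documented here.

```python
import re

def find(term: str, text: str, whole_word: bool) -> bool:
    """Search for term in text, optionally requiring word boundaries around it."""
    pattern = re.escape(term)
    if whole_word:
        pattern = r"\b" + pattern + r"\b"
    return re.search(pattern, text) is not None

# English: spaces create word boundaries, so whole-word matching works.
print(find("school", "language school", whole_word=True))

# Chinese: the term is embedded in a run of characters with no boundary
# before it, so the whole-word search fails while the plain search succeeds.
print(find("学院", "国語学院", whole_word=True))
print(find("学院", "国語学院", whole_word=False))
```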

Additional Options for Hit Highlighting

When printing documents with ideographic text, like Chinese, a character underlined for hit highlighting can easily be confused with other characters.

To make hit highlighting clearer in these languages, additional options have been added to the Formatting tab in the Print documents dialog box. You can use underline, bold, italics, color formatting, or a combination of these options to highlight the search hits in your reports.

Exporting to ANSI or ASCII Format

You can export data from Concordance version 10.x to ANSI or ASCII format. The file can then be imported into an application that does not support the Unicode Standard; for example, into Concordance 2007 or earlier versions.

This option is available for delimited text files in the Export Wizard dialog box and the Export Delimited ASCII dialog box. It is also available when exporting database transcripts.

Warning

When exporting to ANSI or ASCII format, characters that cannot be represented as a single-byte character will be lost in the export. As a result, exporting documents with double-byte characters, such as Chinese, to ANSI or ASCII format will result in data loss.
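The kind of loss described in this warning can be previewed with a short Python sketch. It assumes the ANSI code page is Windows-1252 (`cp1252`), which may differ on your system, and it replaces unrepresentable characters with `?`; the exact behavior of the Concordance export may differ. The function names `export_ansi` and `unencodable` are hypothetical.

```python
def export_ansi(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text to a single-byte code page, replacing characters that do not fit."""
    return text.encode(codepage, errors="replace")

def unencodable(text: str, codepage: str = "cp1252") -> list[str]:
    """List the characters in text that the code page cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(codepage)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost

print(export_ansi("Term: 国語学院"))  # the four Chinese characters are replaced
print(unencodable("café 国"))         # only the Chinese character is reported
```

Running a check like `unencodable` on sample field values before exporting gives a rough idea of which characters would not survive the conversion.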