
DVX Segmentation Exceptions for Dummies

Modifying sentence delimitation rules
Troubleshooting entries

Déjà Vu X is widely acclaimed for its sophisticated ease of use. But as with any many-featured software, there are bound to be some aspects that are a little complex, even downright arcane. The sentence delimiter settings could fairly be thought to fall into this category. These are the rules that, if correctly configured, prevent a sentence from being broken into two segments when "approx." or "Mr." or some other abbreviation appears in it. These settings are configured individually for each language, and unfortunately the defaults fall far short of the mark for ordinary use in many languages. As a result, one often wastes time laboriously joining segments, and TM matches are often missed in analyses unless the segmentation is cleaned up.

Modifying sentence delimitation rules

1 - Select Tools>Options>Delimiters
2 - The Delimiters tab is displayed with the source language of your current project.

Fig. 1 Default segmentation rules and exceptions for US English

3 - In the left part of the tab you can see the Rules, and in the right part the Exceptions to the rules. The meaning of the individual characters is shown in Table 1 and the interpretation of some of the entries in Figure 1 is shown in the examples in Table 2.

To define rules or exceptions you can use any actual character plus these symbols:

| Symbol | Meaning |
|--------|---------|
| ^w | space |
| ^# | a digit (1, 2, 3...) |
| ^$ | a letter (upper-case, lower-case, or any case) |
| ^a | a lower-case letter |
| ^A | an upper-case letter |
| ^? | any character |
| ^^ | the caret character (^) itself |

Table 1 Symbols for DVX segmentation rules and definitions
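The symbols in Table 1 behave much like single-character wildcards, so it can help to think of each entry as a small regular expression. The following Python sketch translates an entry into a regex; the mapping and the function name `dvx_to_regex` are illustrative assumptions and do not reflect how DVX evaluates its rules internally. Letters are limited to ASCII here for brevity.

```python
import re

# Assumed regex equivalents of the Table 1 symbols (ASCII letters only).
SYMBOLS = {
    "w": " ",         # ^w  space
    "#": "[0-9]",     # ^#  a digit
    "$": "[A-Za-z]",  # ^$  a letter of any case
    "a": "[a-z]",     # ^a  a lower-case letter
    "A": "[A-Z]",     # ^A  an upper-case letter
    "?": ".",         # ^?  any character
    "^": r"\^",       # ^^  the caret itself
}


def dvx_to_regex(pattern: str) -> str:
    """Translate a DVX-style entry such as 'Mr.^w' into a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "^":
            # A caret must be followed by one of the known symbol characters.
            if i + 1 >= len(pattern) or pattern[i + 1] not in SYMBOLS:
                raise ValueError(f"bad symbol at position {i} in {pattern!r}")
            out.append(SYMBOLS[pattern[i + 1]])
            i += 2
        else:
            # Every other character stands for itself.
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


# Usage: check what an entry would match.
print(bool(re.fullmatch(dvx_to_regex("Mr.^w"), "Mr. ")))   # True
print(bool(re.fullmatch(dvx_to_regex(".^?^w"), ".) ")))    # True
```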

| Before split | After split | Meaning / Comment | Example | Type |
|--------------|-------------|-------------------|---------|------|
| .^w | | a period followed by a space | The dog is mean. It bit me. | rule |
| :^w | | a colon followed by a space | There is a suspect: John. | rule |
| .^?^w | | a period followed by any character then a space | (dogs, cats, etc.) Consult your veterinarian if you have questions. | rule |
| !^w | ^a | an exclamation point followed by a space and a lower-case letter | Hark! the herald angels sing | exception |
| .^?^w | ^a | a period followed by any character then a space and a lower-case letter | Warm-blooded creatures with fur (dogs, cats, etc.) are all mammals. | exception |
| Mr.^w | | the abbreviation "Mr." followed by a space. Without specification of the space, the .^w rule will take precedence and the sentence will be split! | The new head of the firm is Mr. Johnson. | exception |

Table 2 Explanation of selected rule and exception examples
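To see how the rules and exceptions in Table 2 interact, it can be useful to simulate them outside DVX. The sketch below is a simplified model and not DVX's actual algorithm: it assumes that a split happens wherever a rule's "before split" text ends (and its "after split" text, if any, begins), unless an exception matches at exactly the same position. The rule `!^w` is not listed in Table 2 and is assumed here to be among the defaults. All function names are hypothetical.

```python
import re

# Assumed regex equivalents of the Table 1 symbols (ASCII letters only).
SYMBOLS = {"w": " ", "#": "[0-9]", "$": "[A-Za-z]", "a": "[a-z]",
           "A": "[A-Z]", "?": ".", "^": r"\^"}


def dvx_to_regex(pattern):
    """Translate a DVX-style entry such as 'Mr.^w' into a regex string."""
    out, i = [], 0
    while i < len(pattern):
        if pattern[i] == "^":
            out.append(SYMBOLS[pattern[i + 1]])
            i += 2
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)


def matches_at(text, pos, before, after):
    """True if `before` ends exactly at pos and `after` starts exactly at pos."""
    before_rx = re.compile("(?:" + dvx_to_regex(before) + r")\Z", re.DOTALL)
    after_rx = re.compile(dvx_to_regex(after), re.DOTALL)
    return bool(before_rx.search(text[:pos])) and bool(after_rx.match(text[pos:]))


def segment(text, rules, exceptions):
    """Split text at every position where a rule matches and no exception does."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        rule_hit = any(matches_at(text, pos, b, a) for b, a in rules)
        exc_hit = any(matches_at(text, pos, b, a) for b, a in exceptions)
        if rule_hit and not exc_hit:
            segments.append(text[start:pos])
            start = pos
    segments.append(text[start:])
    return segments


# (before split, after split) pairs from Table 2; "!^w" as a rule is an assumption.
RULES = [(".^w", ""), (":^w", ""), (".^?^w", ""), ("!^w", "")]
EXCEPTIONS = [("!^w", "^a"), (".^?^w", "^a"), ("Mr.^w", "")]

print(segment("The dog is mean. It bit me.", RULES, EXCEPTIONS))
print(segment("The new head of the firm is Mr. Johnson.", RULES, EXCEPTIONS))

# Dropping the trailing ^w from the exception makes it end one character
# before the split position, so in this model the sentence is split again.
BROKEN = [("Mr.", "")]
print(segment("The new head of the firm is Mr. Johnson.", RULES, BROKEN))
```

Under this model the first call yields two segments, the second yields one, and the third yields two, which mirrors the comment on the "Mr." entry in Table 2.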

4 - Select the desired language (see Fig. 1)
5 - Type the desired character(s) and symbol(s) in the Before Split and After Split fields.
or
Type the desired character(s). To enter the desired symbols, right-click and select them from the shortcut menu.

Fig. 2 Definition fields context menu

6 - Once you have entered or modified your rule, click Add.

Troubleshooting entries

To test the validity of your rule and exception entries, it is a very good idea to create a test file with examples of all the relevant cases you can think of. It is also a good idea to allow for variations; for example, when entering "e.g." as an exception, consider that the author of the text might type it as "e. g." with an extra space. In your test samples, you should also vary the capitalization after the element, for example "There are many cities with such facilities, e.g. Seattle, San Francisco and San Diego." If the entry for "e.g." is incorrect (i.e. it doesn't have "^w" at the end), it will still appear to work when a lower-case letter follows, provided the exception ".^w" with "^a" after the split is defined; only a sample with a capital letter after the abbreviation exposes the error.

The point that initially causes many people difficulty when entering exceptions is the need to place "^w" at the end of the exception text. This is "obvious" when one looks at the examples of the defaults, but even the obvious can be pretty hard to notice sometimes. Remembering to add the definition for the space at the end of your exception will usually fix the problem.

Note also that the definitions are sensitive to capitalization! In German, for example, one finds numerous variations (thanks, among other things, to sloppy typing of the source texts) of the abbreviation for "zum Beispiel", including

z.B.
z. B. (with extra space)
z. B (with extra space, forgotten period after the B)
Z.B.
Z. B. (with extra space)
z.b.

These would have to be entered individually with the following "before split" settings (and no "after split"):

z.B.^w
z.^wB.^w
z.^wB^w
Z.B.^w
Z.^wB.^w
z.b.^w
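When many variants of an abbreviation turn up in the source texts, typing each entry by hand invites exactly the mistake described above (a forgotten "^w"). As a minimal sketch, the conversion from an observed variant to its "before split" entry can be automated; the helper name `to_exception_entry` is hypothetical, and the entries still have to be added in the Delimiters tab by hand.

```python
def to_exception_entry(variant: str) -> str:
    """Build a 'before split' entry from an abbreviation as typed in the source text."""
    escaped = variant.replace("^", "^^")   # a literal caret is written ^^
    with_spaces = escaped.replace(" ", "^w")  # inner spaces become ^w
    return with_spaces + "^w"              # the trailing space that is easy to forget


variants = ["z.B.", "z. B.", "z. B", "Z.B.", "Z. B.", "z.b."]
for v in variants:
    print(to_exception_entry(v))
```

The printed entries correspond one-to-one to the list above. Capitalization is deliberately left untouched, since each case variant needs its own entry.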
