Duali

From ويكي عربآيز
Revision as of 15:13, 8 October 2006 by Djihed (talk | contribs)

Duali, named after the legendary founder of Arabic grammar (Abul Aswad al Du'ali, d. 688), is a spell-checker designed for the Arabic language (and extensible to other, non-Arabic-based languages as well).

The Duali project page can be found here.

How It Works

Duali's dictionary data comes in six files (from the Buckwalter Morphological Analyzer). These files are:

  • prefixes
  • suffixes
  • stems
  • tableab (compatibility table prefix+stem)
  • tableac (compatibility table prefix+suffix)
  • tablebc (compatibility table stem+suffix)

Currently, compatibility support is not implemented. As a result, some incorrectly spelled words are flagged as correct, but no correctly spelled word is flagged as incorrect.
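As a rough illustration of what a compatibility check could add, here is a minimal Python sketch. It is not part of Duali. It assumes that, as in the Buckwalter data, every prefix, stem and suffix carries a morphological category and that each table lists the allowed category pairs. The category names (P_NONE, P_DEF, S_NOUN, X_NONE, X_VERB), the table contents and the function compatible are all made up for the example.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]

# Hypothetical toy tables: each one is a set of allowed category pairs.
TABLE_AB: Set[Pair] = {("P_NONE", "S_NOUN"), ("P_DEF", "S_NOUN")}   # prefix+stem
TABLE_AC: Set[Pair] = {("P_NONE", "X_NONE"), ("P_DEF", "X_NONE"),
                       ("P_NONE", "X_VERB")}                        # prefix+suffix
TABLE_BC: Set[Pair] = {("S_NOUN", "X_NONE")}                        # stem+suffix


def compatible(prefix_cat: str, stem_cat: str, suffix_cat: str) -> bool:
    """Return True only if all three category pairs are listed as allowed."""
    return (
        (prefix_cat, stem_cat) in TABLE_AB
        and (prefix_cat, suffix_cat) in TABLE_AC
        and (stem_cat, suffix_cat) in TABLE_BC
    )
```

Without such a check, a segmentation is accepted as soon as its three parts exist in the dictionary files, which is why some misspelled words currently pass.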

The data files are encoded in UTF-8. However, the Python version of Duali lets the user generate the dictionary data files in CP-1256 instead. Choosing an encoding other than UTF-8 makes Duali slower: the look-ups are done in UTF-8, so with CP-1256 a character encoding conversion has to happen on every look-up.
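The extra cost can be pictured with a small Python sketch. This is not Duali's actual code. It assumes the incoming word arrives as UTF-8 bytes and the dictionary is held as a set of byte strings; the names STEMS_UTF8, STEMS_CP1256, lookup_utf8 and lookup_cp1256 are hypothetical.

```python
# Toy one-word dictionary, stored once per encoding (illustrative data only).
STEMS_UTF8 = {"كتاب".encode("utf-8")}
STEMS_CP1256 = {"كتاب".encode("cp1256")}


def lookup_utf8(word: bytes) -> bool:
    """Dictionary and input share an encoding: a plain set membership test."""
    return word in STEMS_UTF8


def lookup_cp1256(word: bytes) -> bool:
    """Dictionary is CP-1256: the UTF-8 input must be converted on every call."""
    try:
        converted = word.decode("utf-8").encode("cp1256")
    except UnicodeEncodeError:
        # The word contains a character that CP-1256 cannot represent.
        return False
    return converted in STEMS_CP1256
```

Both functions give the same answer; the second simply pays for a decode and an encode each time it is called.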

Those files come with a lot of data, but Duali uses only a small subset of it. The process is as follows:

  1. Duali parses the file.
  2. An Arabic word is recognized.
  3. The word is segmented into every possible prefix+stem+suffix combination.
  4. Each of those combinations is checked against the prefixes, stems and suffixes.
  5. Once a match is found, Duali moves on to the next word; otherwise the word is flagged as incorrectly spelled.
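The segmentation and look-up steps can be sketched in Python as follows. This is a minimal illustration, not Duali's implementation. The function names segment and is_spelled_ok, the length limits max_prefix=4 and max_suffix=6, and the tiny PREFIXES, STEMS and SUFFIXES sets are assumptions made for the example; the real data comes from the dictionary files.

```python
from typing import Iterator, Set, Tuple

# Toy data (illustrative only). The empty string stands for "no prefix/suffix".
PREFIXES: Set[str] = {"", "و", "ال", "وال"}
STEMS: Set[str] = {"كتاب"}
SUFFIXES: Set[str] = {"", "ان"}


def segment(word: str, max_prefix: int = 4,
            max_suffix: int = 6) -> Iterator[Tuple[str, str, str]]:
    """Yield every (prefix, stem, suffix) split that leaves a non-empty stem.

    max_prefix and max_suffix are assumed limits, not values taken from Duali.
    """
    n = len(word)
    for p in range(0, min(max_prefix, n - 1) + 1):
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


def is_spelled_ok(word: str, prefixes: Set[str], stems: Set[str],
                  suffixes: Set[str]) -> bool:
    """Accept the word as soon as one split has all three parts listed."""
    return any(
        prefix in prefixes and stem in stems and suffix in suffixes
        for prefix, stem, suffix in segment(word)
    )
```

Like the approach described above, the sketch performs no compatibility check, so any listed prefix is accepted together with any listed stem and suffix.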

Because Arabic words can be written in different forms (i.e. the spelling of a word is sometimes simplified), some correctly spelled words may be flagged as incorrect. For this reason, Duali has a feature to 'normalize' words. The normalization process does the following:

  1. Replaces ALEF_MADDA, ALEF_HAMZA and ALEF_HAMZA_BELOW with a plain ALEF.
  2. Combines a YEH and HAMZA into a YEH_HAMZA.
  3. Replaces an ALEF_MAKSURA with a YEH.
  4. Replaces a TEH_MARBUTA with a HEH.
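The four normalization rules can be sketched in Python on Unicode strings as below. This is an illustration under assumptions, not Duali's code: the rules are applied in the order listed, "YEH and HAMZA" is read as a YEH immediately followed by a standalone HAMZA (U+0621), and the function name normalize is hypothetical.

```python
# Unicode code points for the letters named in the rules.
HAMZA = "\u0621"
ALEF_MADDA = "\u0622"
ALEF_HAMZA = "\u0623"
ALEF_HAMZA_BELOW = "\u0625"
YEH_HAMZA = "\u0626"
ALEF = "\u0627"
TEH_MARBUTA = "\u0629"
HEH = "\u0647"
ALEF_MAKSURA = "\u0649"
YEH = "\u064A"

# Rule 1: every decorated ALEF becomes a plain ALEF.
_ALEF_VARIANTS = str.maketrans(
    {ALEF_MADDA: ALEF, ALEF_HAMZA: ALEF, ALEF_HAMZA_BELOW: ALEF}
)
# Rules 3 and 4: single-letter replacements.
_LETTER_SWAPS = str.maketrans({ALEF_MAKSURA: YEH, TEH_MARBUTA: HEH})


def normalize(word: str) -> str:
    """Apply the four normalization rules in the order they are listed."""
    word = word.translate(_ALEF_VARIANTS)            # rule 1
    word = word.replace(YEH + HAMZA, YEH_HAMZA)      # rule 2 (assumed reading)
    return word.translate(_LETTER_SWAPS)             # rules 3 and 4
```

For example, normalize maps a word ending in TEH_MARBUTA to the same word ending in HEH, so both spellings compare equal after normalization.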

Internally, all of this operates on Unicode strings. In the C++ version of Duali, however, this has been changed to UTF-8, mainly due to the lack of regex engines that support Unicode properly.

How It Should Work

Everything described above, however, is not what Duali is really meant to do. This method is a "second best" alternative. The real goal of Duali is to produce a very compact, root-based dictionary. That is to say, the dictionary would hold the following information:

  • root word
  • possible variations (derivatives) of the root word
  • the variations would be represented numerically, as references into a table of the possible derivatives of the root forms

The spell checker would then:

  1. Parse the file.
  2. Recognize an Arabic word.
  3. Strip the prefix and suffix, if any, from the word.
  4. Get the root from the stem (if the stem is not already a root).
  5. Look up the root word in the dictionary.
  6. Verify that the derivative is a valid variation of the root.
  7. Flag the word accordingly.
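To make the intended flow more concrete, here is a toy Python sketch of steps 3 to 7. It is purely hypothetical: no such data set exists, and the pattern notation (digits 1 to 3 for the root consonants), the PATTERNS table, the ROOTS entries, the prefix and suffix lists, and the function names extract_root and check are all invented for illustration and make no linguistic claims.

```python
from typing import Dict, List, Optional, Set

ALEF, MEEM, WAW = "\u0627", "\u0645", "\u0648"

# Hypothetical table of derivative patterns: the digits 1, 2, 3 mark the root
# consonants; every other character is a literal letter of the pattern.
PATTERNS: List[str] = ["123", "1" + ALEF + "23", MEEM + "12" + WAW + "3"]

# Hypothetical compact dictionary: root -> indexes of its valid patterns.
ROOTS: Dict[str, Set[int]] = {"كتب": {0, 1, 2}, "درس": {0, 1}}

PREFIXES = ("", "ال")
SUFFIXES = ("", "ان")


def extract_root(stem: str, pattern: str) -> Optional[str]:
    """Return the root if the stem fits the pattern, otherwise None."""
    if len(stem) != len(pattern):
        return None
    slots: Dict[str, str] = {}
    for letter, slot in zip(stem, pattern):
        if slot in "123":
            slots[slot] = letter          # a root consonant position
        elif letter != slot:
            return None                   # a literal pattern letter mismatched
    return "".join(slots[k] for k in "123")


def check(word: str) -> bool:
    """Strip affixes, derive the root, and verify the derivative is allowed."""
    for prefix in PREFIXES:
        for suffix in SUFFIXES:
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            for index, pattern in enumerate(PATTERNS):
                root = extract_root(stem, pattern)
                if root is not None and index in ROOTS.get(root, set()):
                    return True
    return False
```

The dictionary stays small because each root stores only a handful of integers, while the pattern table is shared by all roots.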

Unfortunately, this is not currently possible due to a lack of data. The data that would form the ideal dictionary data set is not available, and is unlikely to become available without a massive effort.