Talk:Semantic Arabic Encoding and Format

From the Arabeyes wiki

File Problems

Currently, users are unable to access the zip file with the demo code here.

The Basic Concept Needs Reviewing

This suggestion was posted by Amru Gameel, 07:43, 9 July 2007 (PDT)

A Quick Overview of the Problem

After reviewing the page here, one user suggested that this encoding will not solve the problem but will instead create a new one. The encoding separates morphology so thoroughly that the way a word is read and the way it is encoded no longer match. The average user tends to write an Arabic word as it is read. Consider the following example:

Original Arabic, the way the user reads and writes:

"الاستخدام المتكرر"

Arabic via the Tarmeez Semantic Arabic Encoding Format:

"خدم/6ال كرر/5م"

Thus, this will only lead to a new problem: text input. The format requires another program that can correctly interpret input written this way. The problem becomes even more complex if the user tries to enter a word derived from an unknown verbal root. All of this ignores the different lexicons for the different varieties of Arabic, from Classical to Modern Standard to dialectal Arabic. We do not have the ability to build such large and complex dictionaries.

So Where Do We Go From Here?

Instead of looking at the encoding system at the lexical (i.e. word) level, we should look at it at the level of the letters in the word. The following picture illustrates the improved encoding system:

[Image: Ar new encoding diagram.gif]

The idea here is to focus on the naked letter, meaning the letter without its diacritic markers (i.e. the dots you see above or below certain Arabic letters). The possible letters are as follows:

The Basic Alphabet in the Encoding:

ا ب ح د ر س ص ط ع ف ك ل م ن ه null

If you ignore the dot underneath the ba'a (treating it as a base form for ba'a, ta'a, and tha'a), you will see that the other divisions of the encoding add the diacritics, the grammatical vowelling (tashkeel), and so on.
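As a rough illustration of the base-letter idea, the sketch below splits a letter into a bare base form plus a dot marker. Only the ba'a/ta'a/tha'a grouping is stated in this discussion; the other groupings, the marker names, and the function names `decompose` and `bare_form` are assumptions made for the example, not part of the proposal.

```python
# Hypothetical sketch: split an Arabic letter into (bare base letter, dot marker).
# Only the ba'a / ta'a / tha'a grouping comes from the discussion; every other
# grouping and all marker names below are assumptions for illustration.

DECOMPOSITION = {
    "ب": ("ب", "one_below"),
    "ت": ("ب", "two_above"),
    "ث": ("ب", "three_above"),
    "ج": ("ح", "one_below"),
    "ح": ("ح", "none"),
    "خ": ("ح", "one_above"),
    "د": ("د", "none"),
    "ذ": ("د", "one_above"),
    "ر": ("ر", "none"),
    "ز": ("ر", "one_above"),
    "س": ("س", "none"),
    "ش": ("س", "three_above"),
    "ص": ("ص", "none"),
    "ض": ("ص", "one_above"),
    "ط": ("ط", "none"),
    "ظ": ("ط", "one_above"),
    "ع": ("ع", "none"),
    "غ": ("ع", "one_above"),
    "ف": ("ف", "one_above"),
    "ق": ("ف", "two_above"),
    "ن": ("ن", "one_above"),
}

# Letters from the basic alphabet that carry no dots at all.
UNDOTTED = {"ا", "ك", "ل", "م", "ه"}


def decompose(letter):
    """Return (base, marker) for one letter, or raise if it is not covered."""
    if letter in DECOMPOSITION:
        return DECOMPOSITION[letter]
    if letter in UNDOTTED:
        return (letter, "none")
    raise ValueError(f"no decomposition defined for {letter!r}")


def bare_form(word):
    """Replace every letter of the word with its bare base letter."""
    return "".join(decompose(ch)[0] for ch in word)
```

With this table, `decompose("ت")` yields the base ب with a "two above" marker, and `bare_form` collapses a word to its undotted skeleton. Letters the table does not cover raise an error rather than being guessed.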

A more thorough explanation of this proposed encoding is to follow.

Pros of the New Encoding

There are several benefits to this new encoding:

  • The grammatical vowelling (tashkeel) is linked directly to the letter and therefore independent letters are not permissible. This is truer to the combination of morphology and syntax in the original Arabic script.
  • This leads to the added bonus of reducing bitspace needed for the field.
  • Although they are linked, the letters and grammatical vowelling are still separate. This means it will be possible to have bare forms, without any diacritic markers. These bare forms could be extremely useful.
  • This allows for automatic vowelling systems that take text entered by the user and vowel it completely.
  • There is limitless potential for the new characters and bare forms, for instance, allowing all short vowels above or underneath a single character. These new characters and their derivatives can have very useful applications.
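The bit-space and bare-form points can be sketched with a small bit-field layout. The basic alphabet listed earlier has 16 entries, so 4 bits are enough to index it; the widths chosen here for the dot marker and the tashkeel fields, the field order, and all function names are assumptions for illustration, since the discussion does not fix them.

```python
# Hypothetical bit layout: [ base letter | dot marker | tashkeel ]
BASE_BITS = 4    # 16 entries in the basic alphabet fit in 4 bits
DOT_BITS = 3     # assumed width for the dot (diacritic) marker
VOWEL_BITS = 4   # assumed width for the tashkeel field


def pack(base, dots, vowel):
    """Pack the three fields into one integer, checking each field's range."""
    for value, bits in ((base, BASE_BITS), (dots, DOT_BITS), (vowel, VOWEL_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field value out of range")
    return (base << (DOT_BITS + VOWEL_BITS)) | (dots << VOWEL_BITS) | vowel


def unpack(code):
    """Split a packed integer back into (base, dots, vowel)."""
    vowel = code & ((1 << VOWEL_BITS) - 1)
    dots = (code >> VOWEL_BITS) & ((1 << DOT_BITS) - 1)
    base = code >> (DOT_BITS + VOWEL_BITS)
    return base, dots, vowel


def strip_vowel(code):
    """Clear the tashkeel field, keeping the base letter and its dots."""
    return code & ~((1 << VOWEL_BITS) - 1)


def strip_to_bare(code):
    """Clear both the dot and tashkeel fields, leaving only the base letter."""
    return code & ~((1 << (DOT_BITS + VOWEL_BITS)) - 1)
```

Because the tashkeel and the dots live in their own fields, removing them is a single mask operation. An automatic vowelling system would do the reverse: start from codes with an empty tashkeel field and fill it in.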

Cons of the New Encoding

Although there are many good uses for the proposed encoding, it also has several problems:

  • There will be a tripartite input process. The user will have to input 1) the bare letter, 2) the diacritic marker (if necessary), and 3) the tashkeel vowel (if applicable).
  • Foreign character input (e.g. numbers, symbols, the English alphabet) has not been fully discussed, and there have been no suggestions on how to implement it. A simple solution is to designate the first four bits to encode whether the unit is an Arabic letter or something else, as long as we can agree upon an encoding scheme for the different situations that might arise.
  • This is trying to reinvent the wheel, as the expression goes in English. Why should we? If this proposed encoding is really successful and beneficial for everyone, it is possible we can have it added to Unicode or have Unicode modified to include our ideas on the matter.
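The four-bit prefix suggested for foreign characters might look something like the sketch below. The tag values, the 12-bit payload width, and the names `tag`, `untag`, and `encode_foreign` are all hypothetical; the discussion only says that the first four bits would mark whether a unit is an Arabic letter or something else.

```python
KIND_BITS = 4       # from the suggestion: first four bits mark the kind of unit
PAYLOAD_BITS = 12   # assumed payload width, not specified in the discussion

# Hypothetical tag values; any agreed-upon assignment would work.
KIND_ARABIC, KIND_DIGIT, KIND_LATIN, KIND_SYMBOL = 0, 1, 2, 3


def tag(kind, payload):
    """Prefix a payload with its 4-bit kind tag."""
    if not 0 <= kind < (1 << KIND_BITS):
        raise ValueError("kind out of range")
    if not 0 <= payload < (1 << PAYLOAD_BITS):
        raise ValueError("payload out of range")
    return (kind << PAYLOAD_BITS) | payload


def untag(unit):
    """Return (kind, payload) for a tagged unit."""
    return unit >> PAYLOAD_BITS, unit & ((1 << PAYLOAD_BITS) - 1)


def encode_foreign(ch):
    """Tag a single ASCII character as a digit, Latin letter, or symbol."""
    if not ch.isascii():
        raise ValueError("only ASCII characters are handled in this sketch")
    if ch.isdigit():
        return tag(KIND_DIGIT, ord(ch) - ord("0"))
    if ch.isalpha():
        return tag(KIND_LATIN, ord(ch))
    return tag(KIND_SYMBOL, ord(ch))
```

A decoder would read the tag first and then interpret the payload accordingly: as a packed Arabic letter when the tag is `KIND_ARABIC`, or as a digit, Latin letter, or symbol otherwise.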

What Is Wrong with Unicode?

What is wrong with Unicode exactly?

Comments

Dr. Amru's ideas are very smart. They will be very useful in production systems.

A Review of the Basic Concept

Notes Concerning the Suggestions

I wonder if this works if I at least have text under here.

Extending the Short Vowelling

I would hope so.