Talk:Semantic Arabic Encoding and Format
File Problems
Currently, users are unable to access the zip file with the demo code here.
The Basic Concept Needs Reviewing
This suggestion was posted by Amru Gameel 07:43, 9 July 2007 (PDT)
A Quick Overview of the Problem
After reviewing the page here, one user suggested that this encoding will not solve the problem, but will instead create a new one. This encoding system separates morphology to the point that the way a word is read and the way it is encoded no longer match. The average user tends to write a given Arabic word as it is read. Consider the following examples:
Original Arabic, the way the user reads and writes:
"الاستخدام المتكرر"
Arabic via the Tarmeez Semantic Arabic Encoding Format:
"خدم/6ال كرر/5م"
Thus, this will only lead to a new problem: text input. The format requires another program that can correctly interpret input written this way. The problem becomes even more complex if the user tries to enter a word derived from an unknown verbal root. And this ignores the different lexicons for the different varieties of Arabic, from Classical to Modern Standard to dialectal Arabic. We do not have the ability to build such large and complex dictionaries.
So Where Do We Go From Here?
Instead of looking at the encoding system at the lexical (i.e. word) level, we should look at it according to the letters in the word. The following picture will illustrate the improved encoding system:
The idea here is that we are focusing on the naked letter, which means no diacritic markers (the technical name for the dots you see above certain Arabic letters). So the possible letters are as follows:
The Basic Alphabet in the Encoding:
ا ب ح د ر س ص ط ع ف ك ل م ن ه null
If you ignore the dot underneath the ba'a (treating it as the base form for ba'a, ta'a, and tha'a), then you will see that the other divisions of the encoding add the diacritics, the grammatical vowelling (tashkeel), and so on.
A more thorough explanation of this proposed encoding is to follow.
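Until that fuller explanation is posted, here is a minimal Python sketch of how a letter might be split into a bare base, a dot field, and a tashkeel field. Only the 16-entry base alphabet and the ba'a/ta'a/tha'a grouping come from the discussion above. The field widths (4 + 3 + 4 bits), the dot codes, the tashkeel codes, and the names `pack`, `unpack`, and `render` are assumptions made purely for illustration.

```python
# Base letters as listed in the discussion (ALEF, BEH, HAH, DAL, REH, SEEN, SAD,
# TAH, AIN, FEH, KAF, LAM, MEEM, NOON, HEH); None stands for the "null" entry.
BASE = ["\u0627", "\u0628", "\u062D", "\u062F", "\u0631", "\u0633", "\u0635",
        "\u0637", "\u0639", "\u0641", "\u0643", "\u0644", "\u0645", "\u0646",
        "\u0647", None]

# ASSUMPTION: these dot codes are invented for this sketch.
NO_DOTS, ONE_BELOW, ONE_ABOVE, TWO_ABOVE, THREE_ABOVE = range(5)

# ASSUMPTION: these tashkeel codes are invented; MARKS holds the real
# Unicode combining marks (fatha, damma, kasra, sukun) they stand for.
NO_MARK, FATHA, DAMMA, KASRA, SUKUN = range(5)
MARKS = ["", "\u064E", "\u064F", "\u0650", "\u0652"]

BA = "\u0628"

# Only the ba'a family is named in the discussion, so only it is mapped here.
DOTTED = {
    (BA, ONE_BELOW): "\u0628",    # ba'a
    (BA, TWO_ABOVE): "\u062A",    # ta'a
    (BA, THREE_ABOVE): "\u062B",  # tha'a
}


def pack(base, dots, mark):
    """Pack one letter into 11 bits: 4 (base) + 3 (dots) + 4 (tashkeel)."""
    return (BASE.index(base) << 7) | (dots << 4) | mark


def unpack(code):
    """Split a packed letter back into (base, dots, mark)."""
    return BASE[code >> 7], (code >> 4) & 0b111, code & 0b1111


def render(code):
    """Turn a packed letter into ordinary Unicode text for display."""
    base, dots, mark = unpack(code)
    return DOTTED[(base, dots)] + MARKS[mark]


# ta'a with a fatha: base ba'a + two dots above + fatha
code = pack(BA, TWO_ABOVE, FATHA)
print(render(code))
```

Because the base alphabet has exactly 16 entries, it fits in 4 bits, which is why the sketch gives the base field that width.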
Pros of the New Encoding
There are several benefits to this new encoding:
- The grammatical vowelling (tashkeel) is linked directly to the letter and therefore independent letters are not permissible. This is truer to the combination of morphology and syntax in the original Arabic script.
- This has the added bonus of reducing the number of bits needed for the field.
- Although they are linked, the letters and grammatical vowelling are still separate. This means it will be possible to have bare forms, without any diacritic markers. These bare forms could be extremely useful.
- This allows for the possibility of automatic vowelling systems that will take a text that the user inputted and vowel it for him or her completely.
- There is limitless potential for the new characters and bare forms, for instance, allowing all short vowels above or underneath a single character. These new characters and their derivatives can have very useful applications.
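The first and third points can be approximated on today's Unicode text. This is a rough sketch, not the proposed encoding itself: it attaches each tashkeel mark to the letter before it, rejects a mark that has no letter to attach to, and derives a form without tashkeel. The function names `group_letters` and `bare_form` are hypothetical. Note that in Unicode the dots are part of the letter's code point, so "bare" here means only "without tashkeel", not "without dots".

```python
import unicodedata


def group_letters(text):
    """Return (letter, marks) pairs; a mark with no letter before it is an error."""
    groups = []
    for ch in text:
        if unicodedata.combining(ch):  # nonzero for tashkeel such as fatha or kasra
            if not groups or groups[-1][0].isspace():
                raise ValueError("tashkeel mark without a letter to attach to")
            groups[-1][1] += ch
        else:
            groups.append([ch, ""])
    return [(letter, marks) for letter, marks in groups]


def bare_form(text):
    """Drop all combining marks, keeping only the letters."""
    return "".join(letter for letter, _ in group_letters(text))


# kaf + fatha, ta + fatha, ba + fatha
vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(group_letters(vowelled))
print(bare_form(vowelled))
```

An automatic vowelling system, as described in the fourth point, would do the reverse: start from the letters alone and fill in the second element of each pair.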
Cons of the New Encoding
Although there are many good uses for the proposed encoding, it also has several problems:
- There will be a tripartite input process. This means the user will have to input 1) the bare letter, 2) the diacritic marker (if necessary), and 3) the tashkeel vowel (if applicable).
- Foreign character input (e.g. numbers, symbols, English alphabet) has not been fully discussed, and there have been no suggestions on how to implement it. A simple solution would be to designate the first four bits to encode whether the character is an Arabic letter or something else, as long as we can agree upon an encoding scheme for the different situations that might arise.
- This is trying to reinvent the wheel, as the expression goes in English. Why should we? If this proposed encoding is really successful and beneficial for everyone, it is possible we can have it added to Unicode or have Unicode modified to include our ideas on the matter.
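The four-bit tag mentioned in the second point could look something like the sketch below. Everything beyond "the first four bits say what kind of character this is" is an assumption: the 16-bit unit size, the tag values, and the names `pack_unit` and `unpack_unit` are hypothetical and would need to be agreed upon.

```python
# ASSUMPTION: tag values are invented; four bits leave room for 16 kinds.
TAG_ARABIC, TAG_DIGIT, TAG_LATIN, TAG_SYMBOL = 0, 1, 2, 3

# ASSUMPTION: a 16-bit unit, so 4 tag bits + 12 payload bits.
PAYLOAD_BITS = 12
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack_unit(tag, payload):
    """Put the 4-bit type tag in front of the payload."""
    if not 0 <= tag < 16:
        raise ValueError("tag must fit in 4 bits")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 12 bits")
    return (tag << PAYLOAD_BITS) | payload


def unpack_unit(unit):
    """Return (tag, payload) so a reader can dispatch on the tag first."""
    return unit >> PAYLOAD_BITS, unit & PAYLOAD_MASK


unit = pack_unit(TAG_LATIN, ord("A"))
print(hex(unit))          # 0x2041
print(unpack_unit(unit))  # (2, 65)
```

A decoder would read the tag first and only then interpret the payload, for example as a packed Arabic letter when the tag is `TAG_ARABIC`.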
What Is Wrong with Unicode?
What is wrong with Unicode exactly?
Comments
Dr. Amru's ideas are very smart. They will be very useful in production systems. -يوسف 04:47, 10 July 2007 (PDT)
There was a subproject dealing with how to classify verbs that might be of help. Check this archived message from the Arabeyes General List.
Reviewing The Suggestion
Some Thoughts
Having reviewed my suggestion again, I realized the problem is really the input of Arabic text itself. The focus on correcting the encoding scheme was not correct. Using my encoding scheme brings in a few problems as well, and we already neglect tashkeel enough without the ability to have multiple short vowels on one letter. The same is true, to a lesser extent, for diacritic markers. At any rate, the input problem is not a problem with the encoding in the first place. د.عمرو 09:05, 11 July 2007 (PDT)
Comments
I think the problem in the first place is font technology itself, which has never been able to meet the demands of the Arabic script in general. Even OpenType, and the majority of contemporary innovations in font technology, are not really changing anything. Graphite rendering, however, looks to be a promising free technology that is currently being developed. خالد حسني 10:39, 13 July 2007 (PDT)
Notes Concerning the Suggestions
I wonder if this works if I at least have text under here.