Talk:Semantic Arabic Encoding and Format

From the Arabeyes wiki

File Problems

Currently, users are unable to access the zip file with the demo code here.

The Basic Concept Needs Reviewing

This suggestion was posted by Amru Gameel 07:43, 9 July 2007 (PDT)

A Quick Overview of the Problem

After reviewing the page here, one user suggested that this encoding will not solve the problem, but just lead to a new problem altogether. This encoding system tries to separate morphology to the point that the way it is read and the way it is encoded become separate. The average user tends to write a given Arabic word as it is read. Consider the following examples:

Original Arabic, the way the user reads and writes:

"الاستخدام المتكرر"

Arabic via the Tarmeez Semantic Arabic Encoding Format:

"خدم/6ال كرر/5م"

Thus, this will only lead to a new problem: text input. This format means that we need another program that can properly understand input in such a format. The problem becomes even more complex if the user tries to enter a word derived from a verbal root that the system does not know. And this still ignores the different lexicons for the different varieties of Arabic, from Classical to Modern Standard to dialectal Arabic. We do not have the ability to build such large and complex dictionaries.

So Where Do We Go From Here?

Instead of looking at the encoding system at the lexical (i.e. word) level, we should look at it according to the letters in the word. The following picture illustrates the improved encoding system:

[Image: Ar new encoding diagram.gif]

The idea here is that we are focusing on the naked letter, which means no diacritic markers (i.e. the dots you see above or below certain Arabic letters). So the possible letters are as follows:

The Basic Alphabet in the Encoding:

ا ب ح د ر س ص ط ع ف ك ل م ن ه null

If you ignore the dot underneath the ba'a (treating it as a base form for ba'a, ta'a, and tha'a), then you will see that the other divisions of the encoding add the diacritics, the grammatical vowelling (tashkeel), and so on.

A more thorough explanation of this proposed encoding is to follow.
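
As a rough illustration of the letter-level idea, the following Python sketch splits ordinary Unicode Arabic text into (bare letter, dots, tashkeel) triples. The grouping table `DOTTED`, the dot labels, and the function names are assumptions made for this example and are not taken from the diagram. Letters the table does not cover are passed through unchanged.

```python
import unicodedata

# The 15 bare letters plus "null" listed above: 16 values, so 4 bits suffice.
BASE_ALPHABET = ["ا", "ب", "ح", "د", "ر", "س", "ص", "ط",
                 "ع", "ف", "ك", "ل", "م", "ن", "ه", None]

# Hypothetical table: dotted letter -> (bare letter, dot description).
DOTTED = {
    "ب": ("ب", "1 below"), "ت": ("ب", "2 above"), "ث": ("ب", "3 above"),
    "ج": ("ح", "1 below"), "خ": ("ح", "1 above"),
    "ذ": ("د", "1 above"), "ز": ("ر", "1 above"),
    "ش": ("س", "3 above"), "ض": ("ص", "1 above"),
    "ظ": ("ط", "1 above"), "غ": ("ع", "1 above"),
    "ف": ("ف", "1 above"), "ق": ("ف", "2 above"),
    "ن": ("ن", "1 above"),
}


def decompose(text):
    """Split text into (bare letter, dots, tashkeel) triples."""
    triples = []
    for ch in text:
        # Unicode tashkeel marks are combining characters (non-zero class),
        # so they are attached to the letter that precedes them.
        if unicodedata.combining(ch) and triples:
            triples[-1][2] += ch
        else:
            bare, dots = DOTTED.get(ch, (ch, ""))
            triples.append([bare, dots, ""])
    return [tuple(t) for t in triples]


def bare_form(text):
    """Return the text with dots and tashkeel removed."""
    return "".join(bare for bare, _dots, _marks in decompose(text))
```

Under this table, a fully vowelled word such as ثَبَتَ reduces to three copies of the bare ب, with the dots and the fatha carried in the other two fields of each triple.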

Pros of the New Encoding

There are several benefits to this new encoding:

  • The grammatical vowelling (tashkeel) is linked directly to the letter and therefore independent letters are not permissible. This is truer to the combination of morphology and syntax in the original Arabic script.
  • This leads to the added bonus of reducing bit space needed for the field.
  • Although they are linked, the letters and grammatical vowelling are still separate. This means it will be possible to have bare forms, without any diacritic markers. These bare forms could be extremely useful.
  • This allows for the possibility of automatic vowelling systems that will take a text that the user inputted and vowel it for him or her completely.
  • There is limitless potential for the new characters and bare forms, for instance, allowing all short vowels above or underneath a single character. These new characters and their derivatives can have very useful applications.

Cons of the New Encoding

Although there are many good uses for the proposed encoding, it also has several problems:

  • There will be a tripartite input process. This means the user will have to input 1) the bare letter, 2) the diacritic marker (if necessary), and 3) the tashkeel vowel (if applicable).
  • Foreign character input (e.g. numbers, symbols, the English alphabet) has not been fully discussed, and there have been no suggestions on how to implement it. The simplest solution is to designate the first four bits to encode whether the character is an Arabic letter or something else, as long as we can agree upon an encoding scheme for the different situations that might arise.
  • This is trying to reinvent the wheel, as the expression goes in English. Why should we? If this proposed encoding is really successful and beneficial for everyone, it is possible we can have it added to Unicode or have Unicode modified to include our ideas on the matter.
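
The fields discussed in the two lists above could be packed into one fixed-width code. The sketch below shows only one possible layout: the 4-bit type tag follows the suggestion above, but the tag values, the widths of the other fields, and every name here are assumptions made for illustration.

```python
# Hypothetical 16-bit layout, most significant bits first:
#   4 bits type tag | 4 bits bare letter | 3 bits dots | 5 bits tashkeel
TAG_ARABIC, TAG_DIGIT, TAG_LATIN, TAG_SYMBOL = 0, 1, 2, 3  # assumed values

TAG_BITS, LETTER_BITS, DOTS_BITS, TASHKEEL_BITS = 4, 4, 3, 5  # assumed widths


def pack(tag, letter, dots, tashkeel):
    """Combine the three input steps plus the type tag into one integer."""
    fields = ((tag, TAG_BITS), (letter, LETTER_BITS),
              (dots, DOTS_BITS), (tashkeel, TASHKEEL_BITS))
    code = 0
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"value {value} does not fit in {bits} bits")
        code = (code << bits) | value
    return code


def unpack(code):
    """Recover (tag, letter, dots, tashkeel) from a packed code."""
    tashkeel = code & ((1 << TASHKEEL_BITS) - 1)
    code >>= TASHKEEL_BITS
    dots = code & ((1 << DOTS_BITS) - 1)
    code >>= DOTS_BITS
    letter = code & ((1 << LETTER_BITS) - 1)
    code >>= LETTER_BITS
    return code, letter, dots, tashkeel
```

How a non-Arabic tag would reuse the remaining twelve bits is left open here, since that is exactly the part that has not been agreed upon.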

What Is Wrong with Unicode?

What is wrong with Unicode exactly?

Comments

Dr. Amru's ideas are very smart. They will be very useful in production systems. -يوسف 04:47, 10 July 2007 (PDT)

There was a subproject dealing with how to classify verbs that might be of help. Check this archived message from the Arabeyes General List.

Reviewing The Suggestion

Some Thoughts

Having reviewed my suggestion again, I realized the problem is really the input of Arabic text itself. The focus on correcting the encoding scheme was not correct. Using my encoding scheme brings in a few problems as well, and we already neglect tashkeel enough without the ability to have multiple short vowels on one letter. The same is true, to a lesser extent, for diacritic markers. At any rate, the input problem is not a problem with the encoding in the first place. د.عمرو 09:05, 11 July 2007 (PDT)

Comments

I think the problem in the first place is font technology itself, which has never been able to meet the demands of the Arabic script in general. Even OpenType, and the majority of contemporary innovations in font technology, are not really changing anything. Graphite, however, seems to be a promising free technology that is currently being developed. خالد حسني 10:39, 13 July 2007 (PDT)

Notes Concerning the Suggestions

Thanks, Doctor Amru. I am sorry for the long delay in my response, but I wanted to review everything thoroughly. It is important that you, all other contributors and supporters, and those just plain interested in the project understand the semantic component. The word-level distinction is very important here because the derivational word structure of Arabic, which in turn influences morphology such as the tashkeel that represents its grammar, is the focal point of the system. I want to take full advantage of this organizational principle in linguistics. The bottom line is that form and function are linked in Semitic languages. This means that the focus should be on a) Arabic (not languages written in the Arabic script that do not share the same grammatical paradigm, like Urdu, Farsi, etc.), and b) the relationship between form and function. These suggestions steer away from the function (syntax, semantics, etc.) and focus on the form (morphology).

Extending the Short Vowelling

If we really wanted to improve upon the suggested encoding system for Arabic, it would make sense to get rid of the short vowels completely, and extend them. This means that the encoding will treat them like long vowels where alif equals fatha, waaw equals damma, and ya'a equals kesra. If we look at it like this there is no need to encode the sukoon because it is merely the absence of a short vowel.

So, this means that there will only be seven potential values for the harakaat, with the long vowels and short vowels becoming equal in the encoding as vowelling, and not separate entities that waste bit space on redundant components of Arabic morphology. This also comes back to the issue of the duality of the alif and the hamza.

I think the remaining bits should be used to encode information about the tanween and the sheddah. This allows for a spell checker to work on areas never previously covered, such as automatic correction of sun and moon letters, unnecessary nunation and extension of short vowels, so on and so forth.
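
One way to read the proposal above is sketched below in Python. The seven values are not enumerated in this comment, so the list used here (no vowel, three short, three long) is a guess, as are the flag names, the bit positions, and the rule that tanween requires a short vowel.

```python
from enum import IntEnum


class Vowel(IntEnum):
    """Assumed reading of the seven vowelling values (fits in 3 bits)."""
    NONE = 0    # sukoon is simply the absence of a vowel
    FATHA = 1
    DAMMA = 2
    KASRA = 3
    LONG_A = 4  # alif treated as an extended fatha
    LONG_U = 5  # waaw treated as an extended damma
    LONG_I = 6  # ya'a treated as an extended kesra


VOWEL_MASK = 0b00111   # low 3 bits hold the vowel
TANWEEN = 0b01000      # assumed flag bit for tanween
SHADDA = 0b10000       # assumed flag bit for sheddah
SHORT_VOWELS = (Vowel.FATHA, Vowel.DAMMA, Vowel.KASRA)


def encode_vowelling(vowel, tanween=False, shadda=False):
    """Pack a vowel plus the tanween and sheddah flags into 5 bits."""
    if tanween and vowel not in SHORT_VOWELS:
        # The kind of error a spell checker could flag automatically.
        raise ValueError("tanween is only expected on a short vowel")
    return int(vowel) | (TANWEEN if tanween else 0) | (SHADDA if shadda else 0)


def decode_vowelling(code):
    """Return (vowel, tanween, shadda) from a packed 5-bit value."""
    return Vowel(code & VOWEL_MASK), bool(code & TANWEEN), bool(code & SHADDA)
```

Because the flags sit in fixed bits, a checker could test for stray nunation or an unexpected extension with simple mask operations, without consulting a dictionary.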

There also seem to be a lot of diffuse ideas here. A MediaWiki discussion page might not be the best place for this discussion. Keep in mind that we are in the process of revamping the website.

Amru Gharbiyya 06:53, 5 September 2007 (PDT)