Optical Character Recognition
OCR is the process of taking a scanned document (or a PDF file) and running an OCR program on it to generate, through optical recognition and approximation, an editable text file. For an introduction to OCR, see http://www.students.cs.uu.nl/people/mjkammer/Work/intro_2_OCR.html
Current Status of Open Source Arabic OCR software
The only FOSS OCR system with Arabic support is Tesseract; help is needed in testing and training it.
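For anyone who wants to help with testing, a minimal sketch of running Tesseract on a scanned page from Python might look like the following. It assumes the `tesseract` command-line program is installed together with its Arabic language data (language code `ara`). The function names and file names are hypothetical and chosen only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="ara"):
    """Return the argument list for a Tesseract command-line call.

    Tesseract writes the recognized text to "<output_base>.txt".
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_arabic_ocr(image_path, output_base):
    """Run Tesseract on an image and return the recognized text.

    Returns None when the tesseract binary is not found on the PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `run_arabic_ocr("page.png", "page")` on a scanned Arabic page would, under these assumptions, produce `page.txt` and return its contents. Comparing that text against a hand-typed version of the same page is one simple way to judge recognition quality.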
Arabic OCR Links
- Automatic Recognition Using Zernike Moments As A Feature Extractor (Paper)
- Graph Based Segmentation .. (Paper)
- Structural Features Of Cursive Arabic Scripts (Paper)
- Multilingual Machine Printed OCR (Paper)
- Test of two Arabic OCR programs
- Performance Evaluation of two Arabic OCR products
- Tesseract is an open source OCR engine, initially developed by HP and released under the Apache License. The 3.x versions have Arabic support.
- GOCR - included in Debian and other distributions. No Arabic support.
- GNU Ocrad "is an OCR [...] program based on a feature extraction method". No Arabic support.
- How to encode image produced by a recognition system (mailing list thread) http://lists.arabeyes.org/archives/general/2002/March/msg00001.html
- Rapidly Retargetable Translingual Detection http://tides.umiacs.umd.edu/description.html