Arabic Language Technology International Conference (ALTIC) 2011

Track 1

Large Vocabulary Printed Arabic Optical Character Recognition Competition

Motivation:

Many printed Arabic OCR products are on the market, and many research groups worldwide are actively working to improve Arabic OCR performance, in particular for degraded document quality and for new capturing devices. However, no standard dataset is available for training and evaluating Arabic OCR systems. The Arabic Language TEchnology Center (ALTEC) is producing a large dataset (14,000 pages) of printed Arabic documents, intended for training omni-font OCR systems and covering different qualities and different acquisition systems. We invite all companies and research groups to participate in this competition, which has four main benefits:

1- To provide a standard test dataset against which all systems can be compared, as well as a large training corpus that can be used by industry and academia.

2- To measure the state of the art of Arabic OCR technology.

3- To compare different approaches and ideas, which would ultimately benefit the Arabic OCR industry.

4- To give research groups and developing companies access to well-designed, large training data, which would also benefit the industry.

Data Sets:

1- General description:

The ALTEC Arabic Printed Text Corpus consists of more than 14,000 pages. The dataset has two main parts:

(I) Text documents that are printed then scanned (about 6500 pages).

(II) Arabic Books (4500 pages) and Theses (3000 pages).

The text part of the dataset has more than 100,000 unique Arabic words, taken from the Arabic Gigaword database.

These words were carefully chosen to represent each character and ligature shape for all targeted fonts. Each shape is represented at least 100 times in the clean version. The font sizes covered evenly are 10, 12, 14, 16, 18, 20 and 22. Smaller and larger sizes also occur in the books and theses. The resolutions covered are 200, 300 and 600 dots per inch (dpi).

The fonts covered evenly are:

a. For Windows Platform

1) Simplified Arabic

2) Arabic Transparent

3) Traditional Arabic

b. For MAC Platform

1) Dahab

2) Riadh

3) Naskh

Each of these fonts is covered twice, in Normal and in Bold.

c. Manual Typewriter (fixed mode and font)
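
As a rough illustration of the coverage described above, the evenly covered fonts, styles, sizes and resolutions can be enumerated as a grid. This is only a sketch: it assumes that every combination is present in the corpus, which the description does not state explicitly, and it leaves out the manual typewriter pages, whose mode and font are fixed.

```python
from itertools import product

# Values taken from the corpus description above.
FONTS = [
    "Simplified Arabic", "Arabic Transparent", "Traditional Arabic",  # Windows
    "Dahab", "Riadh", "Naskh",                                        # Mac
]
STYLES = ["Normal", "Bold"]
SIZES = [10, 12, 14, 16, 18, 20, 22]
RESOLUTIONS_DPI = [200, 300, 600]

# Assumption: every font/style/size/resolution combination occurs in the corpus.
configurations = list(product(FONTS, STYLES, SIZES, RESOLUTIONS_DPI))

print(len(configurations))  # 6 * 2 * 7 * 3 = 252
```

A participant could use such a grid to check that a training subset contains examples of each configuration before training an omni-font engine.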

The text documents were produced in two qualities before scanning: a clean quality and a copier-output quality. The capturing systems were scanners and digital/mobile cameras for the text documents, and book digitizers and digital/mobile cameras for the books and theses.

2- Training dataset

The training dataset will tentatively contain 3000 pages from the text documents, 2000 pages from the books and 1000 pages from the theses. These pages will evenly represent all the different fonts, sizes, shapes, resolutions, qualities and capturing systems. Word-level transcription files will be included for all of the data, and character-level segmentation will be available for about 10% of it. A list of all the words in the corpus will also be available.

3- Test Dataset

Participants should run their recognition engines on the test data pages and produce the corresponding text output for each line in the given pages. A transcription file will be given for each page. Each participating entity may enter up to three different engines.

No adaptation on test data is allowed.

No training on test data is allowed.

Using a dictionary that contains the list of the words in the corpus is allowed.

The results should be delivered to the competition committee within 72 hours of the test data becoming available.

A draft paper is expected from each participant, describing the engines used and the results obtained.
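
The 72-hour submission window can be checked against the dates listed under "Important dates" (test data released on August 16, 2011 at 5pm Egypt time). The snippet below is a minimal sketch of that arithmetic. It uses naive datetimes that are understood to be Egypt local time, and the helper `is_on_time` is a hypothetical name introduced here for illustration; treating a submission at exactly 5pm as on time is also an assumption.

```python
from datetime import datetime, timedelta

# Release time from the "Important dates" list, in Egypt local time.
test_data_release = datetime(2011, 8, 16, 17, 0)
submission_window = timedelta(hours=72)

results_deadline = test_data_release + submission_window
print(results_deadline.strftime("%B %d, %Y %H:%M"))  # August 19, 2011 17:00


def is_on_time(submitted_at: datetime) -> bool:
    """Hypothetical helper: True if results arrive within the 72-hour window."""
    return test_data_release <= submitted_at <= results_deadline


print(is_on_time(datetime(2011, 8, 19, 16, 30)))  # True
```

The computed deadline matches the August 19, 5pm date given for the recognition results.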

Evaluation Process:

The test data has four main categories:

(a) Clean data from scanned text documents.

(b) Noisy data produced by using a copier machine before the scanning.

(c) Documents captured by digital/mobile cameras.

(d) Books and theses documents captured by book digitizers and digital/mobile cameras.

The results for each category and the average result will be considered; the winner in each category and the overall winner will be announced.

Each participating group may submit up to three results using three different engines.

Participating groups will be anonymous. A code number will be given to each candidate engine, and the results will be announced by code number only.

For the sake of fairness and authenticity, each participant will be asked to bring their engines to ALTEC on a specific date (August 20th, 2011) so that random tests can be run on sample documents and compared with the submitted results.

If a group cannot come to the ALTEC premises (for example, because it is based abroad), it is asked to send its engine to ALTEC so that the authenticity test can be run.
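
The selection of per-category winners and an overall winner could be sketched as follows. This is an illustration under several assumptions: the scoring metric is not specified in this announcement, so a generic "higher is better" score is used; the overall result is taken as the plain mean over the four categories; and the engine codes and score values are invented placeholders, not real results.

```python
from statistics import mean

# Short labels for categories (a) to (d); the labels themselves are assumptions.
CATEGORIES = ["clean", "copier", "camera", "books_theses"]

# Invented placeholder scores (higher is better), keyed by anonymous engine code.
scores = {
    "E01": {"clean": 0.97, "copier": 0.90, "camera": 0.80, "books_theses": 0.85},
    "E02": {"clean": 0.95, "copier": 0.93, "camera": 0.84, "books_theses": 0.82},
}


def category_winners(scores):
    """Return the best-scoring engine code for each category."""
    return {c: max(scores, key=lambda e: scores[e][c]) for c in CATEGORIES}


def overall_winner(scores):
    """Return the engine code with the highest mean score over all categories."""
    return max(scores, key=lambda e: mean(scores[e][c] for c in CATEGORIES))


print(category_winners(scores))
print(overall_winner(scores))
```

Because engines are identified only by code number, the same structure also fits the anonymous reporting described above.
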

The training and test datasets are composed of pages. Each page has a name. For example, the image of the clean-version, 200 dpi page that came from the 5th list, 3rd page, in the Traditional Arabic font, size 16, normal style, is named: WIN_05_03_TA_16_N_C_2.JPG

The result file must have exactly the same base name, for example:

WIN_05_03_TA_16_N_C_2.res
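
The naming convention can be handled programmatically. The sketch below splits a page name into fields and derives the matching result file name. The field meanings (platform, list, page, font code, size, style, quality, resolution code) are inferred from the single example above and are assumptions; the official format description on the ALTEC web site takes precedence. The names `PageName`, `parse_page_name` and `result_name` are hypothetical.

```python
from pathlib import Path
from typing import NamedTuple


class PageName(NamedTuple):
    platform: str    # e.g. "WIN"
    list_no: int     # e.g. 5
    page_no: int     # e.g. 3
    font: str        # e.g. "TA" (Traditional Arabic)
    size: int        # e.g. 16
    style: str       # e.g. "N" (normal)
    quality: str     # e.g. "C" (clean)
    resolution: str  # e.g. "2" (200 dpi in the example)


def parse_page_name(filename: str) -> PageName:
    """Split a page file name into its underscore-separated fields."""
    parts = Path(filename).stem.split("_")
    if len(parts) != 8:
        raise ValueError(f"unexpected page name: {filename!r}")
    platform, list_no, page_no, font, size, style, quality, resolution = parts
    return PageName(platform, int(list_no), int(page_no), font,
                    int(size), style, quality, resolution)


def result_name(filename: str) -> str:
    """Return the result file name: same base name, .res extension."""
    return Path(filename).with_suffix(".res").name


page = parse_page_name("WIN_05_03_TA_16_N_C_2.JPG")
print(page.font, page.size)                      # TA 16
print(result_name("WIN_05_03_TA_16_N_C_2.JPG"))  # WIN_05_03_TA_16_N_C_2.res
```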

Each page (image) will have a number of lines. A transcription file will be given to participants with each image, stating the number of lines in the image and the coordinates of each line. The results file is required to be in exactly the same format (which will also be available with the training data). The participant's engine should fill in the missing transcriptions in each test transcription file, thereby producing the .res file.

Important dates:

1- All participants should register for the competition by June 15th, 2011.

2- The training data will be available to registered participants on July 15th, 2011, after signing copyright and confidentiality forms.

3- The test data will be available on August 16th, 2011 at 5pm (Egypt time).

4- The recognition results should be sent to the committee by 5pm (Egypt time) on August 19th, 2011.

5- The competition results and winners will be announced by September 7th, 2011.

6- The draft paper should be submitted on September 20th, 2011.

7- Participants are encouraged to attend the workshop that will be held during the ALTIC conference (October 9-10, 2011; see www.altec-center.org) to discuss the competition results.

8- More details about the formats of the training and test data, and about the submission format, will be announced on the ALTEC web site (www.altec-center.org) on May 15th, 2011.

Contact Information:

Dr. Mohamed Waleed Fakhr

Waleed.Fakh[at]altec-center[dot]org

Dr. Sherif Abdou

Sherif.Abdou[at]altec-center[dot]org