Arabic Discourse Treebank

Semi-Automatic Annotation Approach for Annotating Discourse Relations

 

Welcome

The Arabic Discourse Treebank is a research project at the College of Computer Sciences and Technology, Imam Muhammad ibn Saud Islamic University, funded by KACST (13-INF2246-08) to improve AI technologies for Arabic NLP.

 

Description

Natural language processing (NLP) is a field of artificial intelligence concerned with interaction between computers and humans in natural language. NLP is a multidisciplinary field that draws on knowledge of phonology, morphology, syntax, semantics, and discourse. This project deals with discourse, a higher level of text analysis that goes beyond the single sentence. Discourse-level analysis can be used in automatic summarization, translation, text generation, and authorship identification.

The project aims to enhance Arabic semantic and discourse resources using best practices from manual and semi-automatic machine learning (ML) annotation approaches. The Arabic Treebank (ATB), produced by the LDC, has four parts, each with gold-standard morphological and syntactic annotation. The project will enhance the ATB by annotating semantic relations, discourse relations, named entities, and attribution relations. We will also produce well-developed annotation guidelines and annotation tools for these tasks. In more detail, the project focuses on three tasks:

  1. Enhance the semantic relations of Arabic words in Arabic WordNet by importing words from the Arabic Treebank (Arabic TB) as lemmas, synsets, and relations.
  2. Identify discourse relations between arguments, which can be implicit or explicit, as well as attribution relations and named-entity relations between clauses and nouns. The first effort to annotate discourse relations for Arabic, which covered explicit relations only, was carried out in 2011 by the primary investigator of this project (Dr. Amal Alsaif). [Ref: The Leeds Arabic Discourse Treebank].
  3. Identify the three basic elements of attribution (source, cue, and content, i.e. the linguistic material) together with any supplements. Attribution can also carry other features such as polarity, factuality, source attitude, and determinacy.
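The attribution elements listed in task 3 can be pictured as a simple record over character offsets. The following is a minimal sketch only, not the project's annotation format: the class name `AttributionRelation`, the helper `find_span`, the field layout, the choice to make the source optional, and the label values "positive" and "asserted" are all assumptions made for illustration. An English sentence is used purely for readability.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A span is a pair of character offsets (start, end), with end exclusive.
Span = Tuple[int, int]


def find_span(text: str, phrase: str) -> Span:
    """Return the offsets of the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"phrase not found: {phrase!r}")
    return (start, start + len(phrase))


@dataclass
class AttributionRelation:
    """Hypothetical record for one attribution annotation."""
    cue: Span                               # the word(s) signalling attribution
    content: Span                           # the attributed linguistic material
    source: Optional[Span] = None           # assumption: the source may be absent
    supplements: List[Span] = field(default_factory=list)
    polarity: Optional[str] = None          # illustrative label, not a real tag set
    factuality: Optional[str] = None        # illustrative label, not a real tag set

    def covered_text(self, text: str, span: Span) -> str:
        """Return the substring of text covered by span."""
        start, end = span
        return text[start:end]


sentence = "The minister said yesterday that the budget will increase."
relation = AttributionRelation(
    source=find_span(sentence, "The minister"),
    cue=find_span(sentence, "said"),
    content=find_span(sentence, "the budget will increase"),
    supplements=[find_span(sentence, "yesterday")],
    polarity="positive",
    factuality="asserted",
)

print(relation.covered_text(sentence, relation.source))
print(relation.covered_text(sentence, relation.cue))
print(relation.covered_text(sentence, relation.content))
```

Storing offsets rather than copied strings keeps each element tied to its position in the annotated text, which is a common choice for stand-off annotation; the project's own tools may represent attribution differently.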

 

People

  • Dr. Amal Alsaif (Primary Investigator)
    • Interests: Arabic NLP, Machine translation, Assistive technology for disabled people, Artificial learning and mobile applications for language technology. (E-mail: asmalsaif@imamu.edu.sa, amal.sm.alsaif@gmail.com)
  • Dr. Areeb Alowisheq (Co-Investigator)
    • Interests: Natural language processing, Creating resources for Arabic, Text mining, Deep learning, Language modelling, Information extraction, Sentiment analysis. (E-mail: aaalowisheq@imamu.edu.sa)
  • Abeer Alsheddi (Project Manager)
    • Interests: Natural language processing. (E-mail: asalsheddi@imamu.edu.sa, abeer.alsheddi@gmail.com)

 

Project Outcomes

 

Corpus


1. Imam-DTB for implicit discourse relations and attribution

 
  • Author(s):       
  • Language(s):
  • Data Type:      
  • Size:                
  • Format:                       
  • Data source:
  • Source:
  • License:

2. Imam-Friday Katieb Corpus

 
  • Author(s):       
  • Language(s):
  • Data Type:      
  • Size:                
  • Format:                       
  • Data source:
  • Source:                       
  • License:                      

 

Publications

 

Book

  • Amal Alsaif, Katja Markert (Book Chapter). The Leeds Arabic Discourse Treebank: Guidelines for Annotating Discourse Connectives and Relations. The Book: Arabic Corpus Linguistics, 2018, Edinburgh University Press.

 

Paper

  • A. Al-Saif, T Alyahya, M Alotaibi, H Almuzaini, A Algahtani. Annotating Attribution Relations in Arabic. LREC 2018. Japan.
  • Al-Saif, A. and Markert, K. Modelling Discourse Relations for Arabic. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, July 2011, Edinburgh, Scotland, UK. pp. 736-747.
  • Amal Al-Saif, Katja Markert. Building the First Arabic Semantic Content and Its Contributions to Developing Automatic Language Processing (in Arabic). The Arabic Journal of Computer Science and Engineering.
  • Al-Saif, A. and Markert, K. Building the first Discourse Corpus for Arabic and its Contribution on developing Arabic NLP applications. The International Journal of Computer Science and Engineering in Arabic, 2011, Volume 4, Number 1, ISSN 19360525.
  • Alsaif, A.; Markert, K. 'The Leeds Arabic Discourse Treebank: Guidelines for Annotating Discourse Connectives and Relations'. A workshop on Arabic Corpus Linguistics. Lancaster University, April 2011.
  • Al-Saif, Amal (2010). 'How would LADTB Provide a New Generation of Advanced NLP Applications for Arabic?'. The Saudi International Conference, Manchester 2010, UK.
  • Amal Al-Saif and Katja Markert. 2010. The Leeds Arabic Discourse Treebank: Annotating discourse connectives for Arabic. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC-2010). Pages 2046-2053. Valletta, Malta. May, 2010.
  • Alsaif, A.; Markert, K. and Abdul-Raof, H. Corpus-based Study: Extensive Collection of Discourse Connectives for Arabic. The Saudi International Conference (SIC 2009). Surrey, UK. 2009.

 ​