Program

Courses
Course Instructors
Course Topics
Final Certificate
Social and Extracurricular Program
Venue

Courses

The program comprises two graduate-level courses, each divided into two modules, for a total of four modules.
The two courses are:

  1. Advanced Syntactic and Semantic Modeling
  2. Machine Learning Applications in Language Engineering

The two courses will run in parallel over four weeks, with each module lasting two weeks. Each module will be taught in daily 3-hour slots: 1 ½ hours of lecture and 1 ½ hours of hands-on lab time spent on exercises and problems set in the course. The instructors will supervise the lab time.

Course 1: Advanced Syntactic and Semantic Modeling
Module A: Grammar Engineering for South Asian Languages
(Prof. Miriam Butt, University of Konstanz & Dr. Martin Forst, NetBase)
Module B: Resources and Methods in Corpus-Based Lexical Semantics
(Prof. Stefan Evert, Friedrich-Alexander-Universität Erlangen-Nürnberg)

Course 2: Machine Learning Applications in Language Engineering
Module A: Information Extraction
(Dr. Alex Fraser, Ludwig-Maximilians-Universität München)
Module B: Optical Character Recognition
(Prof. Sarmad Hussain, University of Engineering and Technology)

We have divided each course into two modules for two reasons: 1) each course covers a large topical area, and having more than one instructor per course ensures that a leading expert is available for every sub-topic addressed during the summer school; 2) from a practical point of view, it is very difficult to persuade high-quality instructors to commit to more than two weeks of teaching at a stretch, since their schedules tend to be very full. We have been fortunate in finding very high-quality instructors for our courses.

Course Instructors

Professor Dr. Miriam Butt is professor of theoretical and computational linguistics in the Department of Linguistics at the University of Konstanz. Professor Butt specializes in South Asian languages. She is one of just a handful of experts internationally on the grammatical structure of Urdu and has conducted a DFG-funded project on grammar engineering for Urdu. For over 20 years she has also held a leadership position in the parallel grammar (ParGram) initiative, in which grammar engineering solutions are developed for typologically very different languages. She is one of the very few researchers internationally with expertise in the grammatical structure of South Asian languages, in particular as needed for computational applications, and will teach part of the module on Grammar Engineering. Professor Butt has taught at 7 different summer schools internationally (in Bordeaux, Konstanz, Kathmandu, Lahore, Malaga, Trondheim, Utrecht) and has held 9 international tutorials (in Barcelona, Legon (Ghana), Geneva, Hyderabad, Kathmandu, Istanbul, Lahore, Zadar).

Prof. Stefan Evert is at the Friedrich-Alexander-Universität Erlangen-Nürnberg. He has a background in mathematics, physics and English linguistics, and holds a PhD degree in computational linguistics. His research interests include the statistical analysis of corpus frequency data (significance tests in corpus linguistics, statistical association measures, Zipf’s law and word frequency distributions), quantitative approaches to lexical semantics (collocations, multiword expressions and distributional semantics), as well as processing large text corpora (IMS Open Corpus Workbench, data model and query language of the Nite XML Toolkit, tools for the Web as corpus).

Dr. Alex Fraser is a senior lecturer at the Center for Information and Language Processing at the Ludwig-Maximilians-Universität München. Dr. Fraser has industrial experience (he worked for the companies Language Weaver and BBN) and is internationally acknowledged as an expert on statistical machine translation. He is currently working on projects together with Prof. Hinrich Schütze, a leading expert on information extraction. Dr. Fraser has previously taught at summer schools in Germany and Nepal and has received very positive evaluations for his teaching methods. He will be conducting the course on statistical and rule-based methods in information extraction.

Professor Dr. Sarmad Hussain is a Professor of Computer Science and heads the Center for Language Engineering (www.cle.org.pk) at the University of Engineering and Technology in Lahore, Pakistan. He received his PhD in Speech Science from Northwestern University (USA). He currently holds the IDRC Research Chair on Multilingual Computing. His research interests encompass linguistics and enabling computing in local languages, including localization, Human Language Technologies, and the use of these technologies. He is involved with multiple projects in the areas of script, speech and language processing, including the PAN Localization project, a regional initiative to develop language capacity across developing Asia. He has significant experience in organizing and conducting summer schools in Asia and in developing courses for training researchers in South Asia.

Course Topics

We have identified two areas that we believe are crucial for advancing the state of the art in NLE, both regionally and in Sri Lanka. Given the summer schools and training programs that have already taken place in the region (section 2.4), the types of computational applications that are likely to be within reach in the region and in Sri Lanka, and the training courses offered in the previous summer school, we have decided both to go deeper in some areas and to cover more breadth in others.

The first course, Advanced Syntactic and Semantic Modeling, adds depth: it presents a theoretical framework for syntactic analysis that gives a deeper understanding of language structures and of the challenges encountered in modeling these structures computationally. The course also looks at automatically extracting larger amounts of semantic data for the creation of lexical resources and electronic dictionaries, since building large linguistic resources manually is slow and cumbersome and in many cases not viable. The significance of lexical semantic resources for South Asian languages, and the manual process of developing them, were already discussed in the previous summer school.

The second course adds breadth by introducing two application areas for machine learning in language engineering. Information Extraction from texts is an important area of research that is used by search engines (such as Google) and in many other fields, including text summarization, and it is becoming increasingly important for languages other than English and for scripts other than the Latin alphabet. A variety of cursive and complex scripts are used to write languages, including in Sri Lanka. Optical Character Recognition is an essential technology for efficiently converting printed and handwritten texts into editable digital text. This area of research is becoming increasingly important in the digital humanities, e.g. in the digitization of old texts, of which Sri Lanka can boast many, in several different scripts, from its written history of about 2000 years.

Course 1: Advanced Syntactic and Semantic Modeling
Module A: Grammar Engineering for South Asian Languages
This course introduces powerful state-of-the-art grammar development software that includes an interface to a finite-state morphological analyzer, the integration of large-scale lexica, an interaction between a rule-based syntax and statistical disambiguation methods, a parser and a generator (XLE), and a very powerful and flexible rewriting system (XFR), which has been used for machine translation as well as semantic construction. The first half of the course introduces the student to basic grammar writing techniques within XLE on the basis of the architecture assumed by Lexical-Functional Grammar (LFG; Bresnan and Kaplan 1982, Butt et al. 1999, Dalrymple 2001, Butt et al. 2006). The second half of the course focuses on the specific needs of South Asian languages (free word order, complex morphology, complex predication, pronoun omission, etc.) and presents existing grammar engineering solutions that address these needs.

Module B: Resources and Methods in Corpus-Based Lexical Semantics
This course provides an overview of methods and existing resources in corpus-based lexical semantics. The course will consist of: a) an introduction to fundamental theoretical concepts; b) a look at existing lexicographically oriented digital resources (e.g., WordNet, OntoBank, PropBank, FrameNet); c) an introduction to computational linguistic and, more generally, statistical methods that can be used for the automatic acquisition of lexical resources (e.g., the automatic induction of noun classes from raw corpus data).

Course 2: Machine Learning Applications in Language Engineering

Module A: Information Extraction
In this course, students will acquire in-depth knowledge of information extraction tasks and methods and will become familiar with the relevant literature and with an open source toolkit incorporating information extraction components. The course will cover the following content: a basic introduction to information extraction, information extraction using rules and models, evaluation of information extraction tasks, detecting named entities, coreference, template extraction, citations, CV mining, Wikipedia.
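To give a flavor of two of the listed topics (extraction using rules, and evaluation), the following is a minimal sketch and not course material. The rule (a title such as "Prof." or "Dr." followed by capitalized words), the function names and the sample text are all illustrative assumptions; the toolkit used in the course is not shown here.

```python
import re

# Hypothetical hand-written rule: an academic title followed by one or
# more capitalized words is taken to be a person name.
PERSON_RULE = re.compile(
    r"\b(?:Prof\.|Dr\.)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)


def extract_persons(text):
    """Return the set of names matched by the title rule."""
    return {match.group(1) for match in PERSON_RULE.finditer(text)}


def precision_recall_f1(predicted, gold):
    """Score a set of predicted entities against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up sample sentence and gold annotation for illustration only.
sample = "Dr. Alex Fraser teaches with Prof. Sarmad Hussain. Stefan Evert teaches too."
gold_names = {"Alex Fraser", "Sarmad Hussain", "Stefan Evert"}

found = extract_persons(sample)
precision, recall, f1 = precision_recall_f1(found, gold_names)
print(sorted(found), precision, recall, f1)
```

The rule finds only the names that carry a title, so it is precise but misses the untitled name; this kind of trade-off between precision and recall is what evaluation of an extraction system is meant to expose.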

Module B: Optical Character Recognition
The module on Optical Character Recognition is organized in three phases. The first phase covers image pre-processing, which includes converting color images to black and white, removing noise and skew, and segmenting the page into columns, text areas, lines, figures, etc. The second phase covers how the segmented text can be recognized using machine learning systems; the Tesseract engine released by Google will be introduced for practical implementation. The third phase covers how to use the recognition results to re-create the text printed in the input document images using statistical language processing techniques, including word segmentation models and part-of-speech tagging.
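As an illustration of the first pre-processing step, converting a color image to black and white, here is a minimal sketch in plain Python. It assumes the image is given as rows of (r, g, b) tuples and chooses the threshold with Otsu's method; both the data layout and the choice of Otsu's method are assumptions made for this example, not a description of how the module or Tesseract performs binarization.

```python
def to_grayscale(pixels):
    """Convert rows of (r, g, b) tuples to rows of gray levels 0-255."""
    # Weighted sum using the common ITU-R BT.601 luma coefficients.
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def otsu_threshold(gray):
    """Return the gray level that maximizes between-class variance."""
    histogram = [0] * 256
    for row in gray:
        for value in row:
            histogram[value] += 1

    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels):
    """Map each pixel to 0 (dark, e.g. ink) or 1 (light, e.g. paper)."""
    gray = to_grayscale(pixels)
    threshold = otsu_threshold(gray)
    return [[1 if value > threshold else 0 for value in row] for row in gray]


# Tiny made-up 2x2 "image": dark ink on light paper.
ink, paper = (10, 10, 10), (240, 240, 240)
print(binarize([[ink, paper], [paper, ink]]))
```

A global threshold like this works for evenly lit scans; unevenly lit or stained pages, which are common among old documents, usually call for locally adaptive thresholds.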

Final Certificate
Upon completion of the summer school, participants will be awarded a certificate listing the grades for each of the modules and each of the courses covered. For University of Colombo School of Computing (UCSC) graduate students, the summer school will count as two full 3-credit courses on “Natural Language Processing”, and the grades achieved will be transferred as the final grades.

Social and Extracurricular Program
To balance academics with leisure time and to expose visitors to aspects of Sri Lankan history and culture, UCSC will fund and organize a sightseeing tour of places of historical and cultural significance around Colombo and will also host a cultural program introducing Sri Lankan folk songs, dances and acts.

Venue
University of Colombo School of Computing,
No. 35, Reid Avenue,
Colombo 00700,
Sri Lanka.