Language Technology

This course is part of the programme
Bachelor's study programme Slovene Studies (1st Level)

Objectives and competences

The course provides an overview of language technology and related topics in information theory, of text corpora for the Slovenian language and the corresponding tools, and a basic understanding of the structure of web pages and of the relevant markup languages such as HTML and XML.
Students gain competence in evaluating electronic language resources and in preparing language-related reports for the web environment.
They learn a new approach to solving language problems, one made possible by the contemporary, web-based era.

Prerequisites

The course does not require any special skills or knowledge beyond what is covered by the previous education of a future linguist. All that is needed is basic computer literacy, some experience in using web resources and, last but not least, a reasonable command of English.

Content

• Overview of the field of language technology
• Basic web skills
• Overview of markup languages such as HTML and XML
• Text corpora and related tools, especially for the Slovenian language
• Term paper in the form of a web page with statistical analysis of a chosen Slovenian or English fiction text, including its lemmatization and preparation of a dictionary of open-class words.

Intended learning outcomes

Students learn how to use a modern text analysis tool and how to apply it in testing linguistic hypotheses. They understand the inner structure of simple and machine-generated web pages and get an overview of Slovenian language corpora and their use. Students learn how to make a statistical description of a given text, including the preparation of a frequency dictionary of open-class words.
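As an illustration of what a frequency dictionary of open-class words involves, here is a minimal Python sketch. It is not the tool used in the course: the `CLOSED_CLASS` word list and the `frequency_dictionary` function are hypothetical names introduced for this example, and the sketch counts word forms rather than lemmas, since proper lemmatization and part-of-speech tagging would need a dedicated tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of closed-class (function) words.
# A real analysis would rely on a lemmatizer and part-of-speech tags instead.
CLOSED_CLASS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "was", "it"}

def frequency_dictionary(text):
    """Count lowercase word forms in text, skipping closed-class words."""
    # [^\W\d_]+ matches runs of Unicode letters, so č, š and ž are handled too.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t not in CLOSED_CLASS)

sample = "The cat sat on the mat. The cat saw a dog, and the dog saw the cat."
freq = frequency_dictionary(sample)

# Print the words from most to least frequent.
for word, count in freq.most_common():
    print(word, count)
```

For a Slovenian text the stop list would have to be replaced, and word forms would first need to be mapped to their lemmas, because inflection spreads one lexeme over many surface forms.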

Readings

  • D. Jurafsky, J. H. Martin, 2009. Speech and Language Processing, 2nd edition, Prentice Hall, 1024 pp. Catalogue
  • C. D. Manning and H. Schütze, 1999. Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 620 pp. Catalogue
  • A. Witt and D. Metzing (Eds.), 2010. Linguistic Modeling of Information and Markup Languages, series Text, Speech and Language Technology, Vol. 40, Springer, 266 pp. E-version
  • G. Leech, P. Rayson, A. Wilson, 2001. Word Frequencies in Written and Spoken English: Based on the British National Corpus. Longman, London, 320 pp. E-version
  • Papers from the conferences of the Association for Computational Linguistics (ACL) E-version
  • ACL wiki E-version
  • V. Gorjanc, 2005. Uvod v korpusno jezikoslovje. Izolit, Domžale, 163 pp. Catalogue
  • P. Jakopin, 2002. Entropija v slovenskih leposlovnih besedilih. Založba ZRC, Ljubljana, 208 pp. Catalogue

Assessment

Term paper in the form of a web page and its presentation (60%), oral exam (40%).