Computer-based text analysis

This course is part of the programme
Digital Humanities, interdisciplinary programme

Objectives and competences

The objective of the course is to give students a basic knowledge of computational approaches to text analysis. This knowledge covers both the theoretical foundations and the practical aspects of corpus linguistics, computational linguistics and text mining that are needed for autonomous computer-assisted text analysis.

Students will master the basics of preparing text for computer processing and will learn to access and search digital text collections, other resources and tools for computational text analysis.

Students will acquire the knowledge needed to understand typical applications and to make their first independent attempts at computational text analysis.

Prerequisites

Basic knowledge of programming.

Content

• Introduction to text analysis
◦ Corpora and corpus linguistics
◦ Computational linguistics
◦ Text mining

• Available resources (available corpora, libraries, semantic resources such as WordNet and ConceptNet)

• Preparation for text analysis
◦ Text selection (representativeness of corpora)
◦ Text preprocessing (automatic segmentation, lemmatization, morphological, syntactic and semantic annotation)
◦ Annotation with metadata (XML)

• Methods for computational text analysis
◦ Statistical text analysis
◦ Basic commands and regular expressions for corpus manipulation
◦ Information extraction
◦ Clustering and classification of text documents (e.g. by content, genres, authors)
◦ Evaluation of performance of automatic approaches

• Application areas and examples
◦ Computer-assisted discourse analysis
◦ Computational stylistics (genre analysis, authorship attribution, document similarity)
◦ Computer-assisted terminology and lexicography
◦ Multilingual text analysis
◦ Computational creativity

• Practical use of selected tools
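
As a rough illustration of the kind of simple program the topics on regular expressions and statistical text analysis point to, the Python sketch below tokenizes a short text with a regular expression and counts word frequencies. The function names and the sample sentence are assumptions made for this example and are not taken from the course materials; the pattern `\w+` is a deliberately crude tokenizer, not a substitute for proper segmentation or lemmatization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens with a regular expression."""
    # \w+ matches runs of letters, digits and underscores; punctuation is dropped.
    return re.findall(r"\w+", text.lower())


def word_frequencies(text):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokenize(text))


# Hypothetical sample text, used only to show the output format.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)

# Print the two most frequent tokens with their counts.
print(freqs.most_common(2))  # [('the', 3), ('mat', 2)]
```

On a real corpus the same counting step would normally follow the preprocessing stages listed above (segmentation, lemmatization, annotation) rather than a single regular expression.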

Intended learning outcomes

At the end of the course, the students will be able to:
• understand and use basic concepts from corpus linguistics, computational linguistics and text mining,
• understand and use the available resources and computational tools for text analysis,
• understand the principles of corpus construction,
• write or use simple programs for statistical text analysis or information extraction.

General competencies:
• capacity for analysis and synthesis,
• proficiency in research methods, procedures and processes,
• development of critical and self-critical assessment,
• the ability to use knowledge in practice,
• autonomous and ethical professional work,
• teamwork.

Subject-specific competencies:
• the ability to develop original solutions in the field of computer-assisted text analysis for the needs of digital humanities,
• the ability to evaluate the performance of selected methods and tools,
• the ability to critically evaluate and interpret the results.

Readings

• Jurafsky, D., Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 2nd Edition. Prentice-Hall, 2008.
• McEnery, T., Hardie, A. Corpus Linguistics: Method, Theory and Practice. Cambridge University Press, 2011.
• Feldman, R., Sanger, J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2006 (printed)/2007 (electronic).

Recommended:
• Natural Language Processing with Python http://www.nltk.org/book/
• Love, Harold (2002). Attributing Authorship: An Introduction. Cambridge: Cambridge University Press.

Assessment

50% continuous assessment, 50% final written exam

Lecturer's references

JURŠIČ, Matjaž, CESTNIK, Bojan, URBANČIČ, Tanja, LAVRAČ, Nada. HCI empowered literature mining for cross-domain knowledge discovery. In: Third International Workshop, HCI-KDD 2013, Held at SouthCHI 2013, Maribor, Slovenia, July 1-3, 2013. HOLZINGER, Andreas (ed.), PASI, Gabriella (ed.). Human-computer interaction and knowledge discovery in complex, unstructured, big data : proceedings, (Lecture notes in computer science, ISSN 0302-9743, Lecture notes in artificial intelligence, vol. 7947). Berlin; Heidelberg: Springer, 2013, vol. 7947, pp. 124-135.

PETRIČ, Ingrid, CESTNIK, Bojan, LAVRAČ, Nada, URBANČIČ, Tanja. Outlier detection in cross-context link discovery for creative literature mining. The Computer journal, ISSN 0010-4620, 2012, vol. 55, no. 1, pp. 47-61.

PETRIČ, Ingrid, URBANČIČ, Tanja, CESTNIK, Bojan, MACEDONI-LUKŠIČ, Marta. Literature mining method RaJoLink for uncovering relations between biomedical concepts. Journal of biomedical informatics, ISSN 1532-0464, Apr. 2009, vol. 42, no. 2, pp. 219-227.

LAVRAČ, Nada, LJUBIČ, Peter, URBANČIČ, Tanja, PAPA, Gregor, JERMOL, Mitja, BOLLHALTER, Stefan. Trust modeling for networked organizations using reputation and collaboration estimates. IEEE trans. syst. man cybern., Part C Appl. rev. [Print ed.], May 2007, vol. 37, no. 3, pp. 429-439, illus. [COBISS.SI-ID 645883]

SOMEREN, Maarten W. van, URBANČIČ, Tanja. Applications of machine learning : matching problems to tasks and methods. Knowl. eng. rev., 2006, vol. 20, no. 4, pp. 363-402. [COBISS.SI-ID 506107]

GUBIANI, Donatella, PETRIČ, Ingrid, FABBRETTI, Elsa, URBANČIČ, Tanja. Mining scientific literature about ageing to support better understanding and treatment of degenerative diseases. In: MLADENIĆ, Dunja (ed.), GROBELNIK, Marko (ed.). Izkopavanje znanja in podatkovna skladišča (SiKDD 2015) : zbornik 18. mednarodne multikonference Informacijska družba - IS 2015, 5. oktober 2015, [Ljubljana, Slovenia] : zvezek E = Data mining and data warehouses (SiKDD 2015) : proceedings of the 18th International Multiconference Information Society - IS 2015, October 5th, 2015, Ljubljana, Slovenia : volume E. Ljubljana: Institut Jožef Stefan, 2015, 4 pp.

MARTINS, Pedro, URBANČIČ, Tanja, POLLAK, Senja, LAVRAČ, Nada, CARDOSO, Amilcar. The good, the bad, and the AHA! blends. In: TOIVONEN, Hannu (ed.). Proceedings of the Sixth International Conference on Computational Creativity, ICCC 2015, June 29 - July 2, 2015, Park City, UT, USA. Provo: Brigham Young University, 2015, pp. 166-173.

POLLAK, Senja, MARTINS, Pedro, CARDOSO, Amilcar, URBANČIČ, Tanja. Automated blend naming based on human creativity examples. In: Twenty-Third International Conference on Case-Based Reasoning, ICCBR 2015, 28-30 September 2015, Frankfurt, Germany. KENDALL-MORWICK, Joseph (ed.). Workshop proceedings. [S. l.: s. n.], 2015, pp. 93-102.

University course code: 2DH017

Year of study: 2nd year

Course principal:

prof. dr. Tanja Urbančič

Lecturer:

prof. dr. Tanja Urbančič

ECTS: 6

Workload:

Lectures: 30 hours
Exercises: 15 hours
Seminar: 15 hours
Individual work: 120 hours

Course type: elective (computer science)

Languages: Slovene, English

Learning and teaching methods:
• lectures • homework • exercises • seminar work