Intelligent Data Crawler of Unstructured Open Data

Innovation Development
Thesis Code: 

Thesis Type: Thesis in Computer Science, Data Engineering, Computer Engineering, Mathematical Engineering, Data Science

• Experience with Python and/or Java and/or Node.js
• Basic knowledge of modular development
• Beginner-level knowledge of machine learning and natural language processing, or willingness to learn them quickly
• Curiosity-driven mindset

Italian public administration websites publish many resources as open data. However, administrations run multiple websites, each with its own semantic structure, which makes it harder for autonomous crawlers to retrieve the necessary information. The aim of this thesis project is to develop an intelligent data crawler able to fetch specific types of resources, across multiple file formats, from selected sources. Relatively low precision and high recall are expected: the crawler should miss few relevant resources, even at the cost of returning some irrelevant ones. The crawler should detect relevant resources using state-of-the-art machine learning and natural language processing techniques, which are the core of the Artificial Intelligence stack.

The undergraduate will study and experiment with technologies for:
• extracting semantics from web resources
• understanding the content
• assessing the value of the retrieved content.

The thesis will be structured as follows:
• state-of-the-art analysis of information retrieval
• problem formulation: objective function, data structures and resources to be used
• algorithm design and prototyping
• in-lab testing and verification with real data, and measurement of the performance of the approach.
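
As an illustration of the performance measurement in the last step, the following is a minimal sketch of how precision and recall could be computed for one crawl. It is not part of the project specification: the function name precision_recall and the file names are hypothetical, and it assumes that a manually labelled set of relevant resources is available for comparison.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of the retrieved items against the relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of retrieved resources that are actually relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant resources that the crawler managed to retrieve.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the crawler finds every relevant file
# but also returns three irrelevant ones.
relevant = {"tenders_2016.csv", "contracts.xml", "suppliers.json"}
retrieved = relevant | {"privacy.pdf", "index.html", "news.html"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case all three relevant files are found among six retrieved ones, which matches the "relatively low precision, high recall" profile described above.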

The thesis will be co-tutored with Synapta Srl, a spin-off of Politecnico di Torino. It will also be an opportunity to work with the Synapta team and experiment with real data. The undergraduate will benefit from being immersed in an existing start-up environment while applying the scientific experimental practices learned at ISMB. At the end of the thesis, the undergraduate will be familiar with machine learning and natural language processing techniques, and she/he will acquire an understanding of the public-procurement domain. As an additional benefit, she/he will become proficient with version control systems, continuous integration systems, and remote deployment and monitoring techniques.

Contact: send a resume, with the list of exams attached, specifying the thesis code and title.