“You get really exciting problems by seeing what people in the real world are struggling with.” Paul Groth, chair of the group.
The Intelligent Data Engineering Lab (INDElab) is a research group at the Informatics Institute of the University of Amsterdam (UvA). INDElab works on intelligent systems that help people with the preparation, management, integration, and reuse of data.
In today’s society people are confronted with complex information all the time: figures and statistics in the news, search results when looking for a new apartment or trying to identify the right paper or dataset for a research project. INDElab tries to help people manage all this data and make sure that it is correct, transparent and usable. The central question that follows: How can we design and build systems to help people understand and work with data?
INDElab focuses on three research areas. The first area is ‘large-scale data integration from multiple sources’. The lab researches new methods that can combine structured data (e.g. databases), semi-structured data (e.g. spreadsheets) and even unstructured data (e.g. text) into high-quality datasets. This also includes approaches for helping people understand such complex large-scale datasets. The second research area is ‘data management for machine learning’. Current machine learning applications are very sensitive to their input data, which leads to concerns about their reliability. The researchers work on techniques to help validate data and understand how that data was used. The third area focuses on ‘causality-inspired machine learning’. The researchers use techniques from causal inference to improve existing machine learning methods, so that they can handle changes in the data and make decisions on new policies.
One real-world application the group is working on is machine unlearning. Many organizations use personal user data to train their machine learning models. With recent laws like the GDPR, users have the ‘right to be forgotten’ and can ask the organization to delete their data. INDElab is working on several custom machine learning models that support very fast deletion of specific user data from such trained models. Another application example is mlinspect. If anything goes wrong with a machine learning application, you need data provenance information to trace back the data in time to see the source of errors or issues. Because it is expensive, people do not usually collect this data upfront. INDElab researchers develop new methods to lower the threshold for people to collect this data and inspect their machine learning pipelines in an efficient way.
INDElab is one of the few research groups to conduct research at the intersection of data management, data science and machine learning. The group works on projects funded by the NWO and the US Air Force Research Lab, as well as on company collaborations.
The group draws inspiration from insights gained by working with real-world data science teams. For example, they worked with data analysts at bol.com to see exactly how they debug their data. They have also run a study with colleagues to see how researchers actually search for data. From these observations, they formulate their research questions. For example: What are the general problems in data search? And what would a next-generation data debugger look like? The researchers work in an interdisciplinary fashion. They are engaged with social scientists, biologists and humanities scholars.
Collaborations with external partners are numerous. Together with NEN (Nederlands Normalisatie Instituut) and humanities scholars, the group works on the IN-SIGHT project to design data systems that help include public values in the standardization process. Researchers in the group also lead two projects with the MIT-IBM Watson AI Lab: a project with the MIT Bioengineering department and another project with the MIT CSAIL department. Members of the group fill leading roles in the European COST Action CA18230, Interactive Narrative Design for Complexity Representations (INDCOR). This network includes 140 researchers who approach complexity as a societal challenge by representing, experiencing and comprehending complex phenomena.
INDElab contributes to two ICAI (Innovation Center of Artificial Intelligence) labs. Within AIRLab Amsterdam, the researchers work together with retail multinational Ahold Delhaize on recommender systems and data validation. And with the scientific publisher Elsevier and Vrije Universiteit the group collaborates in the Discovery Lab. Within this lab they work on large databases called knowledge graphs to make scientific literature and data searchable.
The INDElab researchers teach courses like Introduction to Databases, Big Data Management, Causal Data Science, Information Organization and Research Methodology within the computer science and information studies bachelor's programmes as well as the information studies master's programme of the UvA. Many students work on projects with companies. In the Data Systems Project, for example, master's students work as a team over a whole year on a real data science problem with an external partner such as the Dutch Police, the Municipality of Amsterdam or ING. That is a unique approach of the UvA. One of the goals of INDElab is to help the students find their place within the Amsterdam Data Science community.
In the coming years the group will focus on the automated construction of datasets and on improving data science pipelines. They strive to build high-quality datasets from multiple data sources with minimal effort. Besides structured data and text, they have also started to do this with video, in collaboration with other research groups at the UvA. In the longer term, the researchers hope to introduce the notion of cause-effect relations as a unified way of dealing with the vulnerability of data science pipelines to changes in the data, biased data and errors.
INDElab positions itself primarily in the Data Science and AI research themes of the Informatics Institute.