Labelled datasets are crucial for machine learning research, both in academia and in industry.
- Why are labelled data important to machine learning?
In an age of abundant data and widespread use of machine learning across many domains, it is crucial to make proper use of the enormous amount of available data, which is essential for training machine learning models. We, as humans, perceive the real world by first observing variables in our environment and then classifying them into categories according to certain properties or characteristics. Machine learning models are trained on available datasets to solve a plethora of problems in a similar way. One of the main challenges of machine learning is the lack of labelled data.
Labelled data are datasets in which each item of interest is tagged with one or more labels that identify and classify its properties. Their importance lies in the training process of a machine learning model. Models are trained on a dataset whose known labels serve as the ground truth, that is, the true value or answer. The models are then tested on held-out data with the same characteristics, whose labels are hidden from the model, and the predictions are compared with the ground truth to determine whether the model predicts the correct labels. The result is a model that can make accurate predictions on new data without manual labelling.
- How to obtain all these labels?
Obtaining high-quality annotated datasets has become much faster since the introduction of crowdsourcing services such as Amazon Mechanical Turk and CrowdFlower. Crowdsourcing has revolutionised the gathering of labelled data by letting crowds of workers (humans or algorithms) annotate items in an efficient, low-cost and time-saving way.
However, the quality of the labelled items is often inadequate, and the resulting labels are noisy. Workers may lack knowledge of a particular topic and therefore annotate items incorrectly, or they may deliberately favour quantity over quality, given that they receive a monetary reward for each item they label.
Most existing studies on quality control and de-noising of crowdsourced labels use probabilistic graphical models to infer the true label from noisy annotations. Whitehill et al. (2009) introduce the probabilistic GLAD model, which infers the latent true label more accurately by also taking into account the expertise of each worker and the difficulty of each item.
In this post, we extend the GLAD model by leveraging the wealth of additional information contained in the correlation between items and workers. That is, we model the correlations between items and workers in addition to the expertise of each worker and the difficulty of each item.
- Why use crowdsourced data?
Crowdsourcing has revolutionised the gathering of labelled data by letting crowds of workers (humans or algorithms) annotate items at a very low cost. Platforms such as Amazon Mechanical Turk and CrowdFlower are prominent examples of how massive numbers of labels can be acquired from crowds. Despite the increased efficiency and speed, a common issue with this technique is the compromised quality of the labels. This is because many different workers can label the same items, whether or not they are subject experts. It is a particular problem in specialised domains, where classifying items is more difficult and requires expertise. Moreover, due to the anonymous nature of crowdsourced labelling and competing incentives, we observe cases of spam workers or workers with conflicting interests. Consequently, the labels obtained for items that require domain expertise may be very noisy and of low quality. Thus, acquiring accurate labels from crowdsourcing platforms has become a bottleneck for progress in machine learning.
- What is label aggregation?
To overcome the obstacle of poor labelling, the labels given to each item by multiple workers can be aggregated, and the true label of each item is then inferred from the aggregate. The simplest method is Majority Voting, in which the label assigned to an item is the one that received the most votes from the workers. A worker’s rate of agreement with the majority, and the level of disagreement on an item, can also serve as rough indicators of the worker’s expertise and the item’s difficulty.
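To make Majority Voting concrete, here is a minimal Python sketch. The `(worker, item, label)` triple format, the function names and the toy annotations are our own assumptions for illustration, not part of any crowdsourcing platform's API. The second function shows one crude way to use the consensus as a proxy for worker reliability.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Return the most frequent label for each item.

    `annotations` is an iterable of (worker_id, item_id, label) triples
    (a hypothetical input format chosen for this sketch). Ties go to the
    label that was seen first for that item.
    """
    votes = defaultdict(Counter)
    for _worker, item, label in annotations:
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def worker_agreement(annotations, consensus):
    """Fraction of each worker's labels that match the consensus label."""
    hits, totals = Counter(), Counter()
    for worker, item, label in annotations:
        totals[worker] += 1
        hits[worker] += int(label == consensus[item])
    return {worker: hits[worker] / totals[worker] for worker in totals}


# Toy data: three workers label two images.
annotations = [
    ("w1", "img1", "cat"), ("w2", "img1", "cat"), ("w3", "img1", "dog"),
    ("w1", "img2", "dog"), ("w2", "img2", "dog"), ("w3", "img2", "cat"),
]
consensus = majority_vote(annotations)
agreement = worker_agreement(annotations, consensus)
```

Note that every vote counts equally here, which is exactly the limitation that the probabilistic models discussed in this post try to address.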
When it comes to modelling worker expertise and item difficulty, there are several approaches. The first advanced work on label aggregation was presented by Dawid & Skene (1979), who assume a global item difficulty for all workers and a global worker expertise for all items. In other words, the method assumes that a worker’s expertise is the same for every item they label, and that all items have the same level of difficulty, which is not the case in most real-life tasks.
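As an illustration of this kind of model, the sketch below implements a simplified EM procedure in the spirit of Dawid & Skene. It assumes the commonly used formulation in which each worker is described by one confusion matrix shared across all items, with no item difficulty term. The array layout, the `smoothing` constant, the fixed iteration count and the toy data are our own choices for the example.

```python
import numpy as np


def dawid_skene(labels, n_classes, n_iter=50, smoothing=1e-2):
    """Simplified Dawid & Skene style EM (illustrative sketch).

    labels: int array of shape (n_items, n_workers); -1 marks a missing label.
    Returns (posterior, confusion) where
      posterior[j, k]    = P(true label of item j is k)
      confusion[i, k, l] = P(worker i answers l | true label is k).
    """
    labels = np.asarray(labels)
    n_items, n_workers = labels.shape
    # One-hot votes: votes[j, i, l] = 1 if worker i gave item j the label l.
    votes = np.zeros((n_items, n_workers, n_classes))
    for l in range(n_classes):
        votes[:, :, l] = labels == l

    # Initialise the posterior with a soft majority vote.
    posterior = votes.sum(axis=1) + smoothing
    posterior /= posterior.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per worker.
        prior = posterior.mean(axis=0) + smoothing
        prior /= prior.sum()
        confusion = np.einsum("jk,jil->ikl", posterior, votes) + smoothing
        confusion /= confusion.sum(axis=2, keepdims=True)

        # E-step: posterior over the true label of each item (in log space).
        log_post = np.log(prior) + np.einsum("jil,ikl->jk", votes, np.log(confusion))
        log_post -= log_post.max(axis=1, keepdims=True)
        posterior = np.exp(log_post)
        posterior /= posterior.sum(axis=1, keepdims=True)
    return posterior, confusion


# Toy data: workers 0 and 1 are always right, worker 2 alternates 1, 0, 1, ...
true_labels = np.array([0, 0, 0, 1, 1, 1])
labels = np.column_stack([true_labels, true_labels, [1, 0, 1, 0, 1, 0]])
posterior, confusion = dawid_skene(labels, n_classes=2)
inferred = posterior.argmax(axis=1)
```

Because the confusion matrix of a worker does not depend on the item, the model cannot express that the same worker is strong on some items and weak on others.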
To address this issue, Whitehill et al. (2009) propose that labels are generated by a probability distribution over all labels, workers and items. However, this model still assumes that an item’s difficulty is the same for all workers and that a worker’s expertise is the same for all items, so it fails to capture the correlation between items and workers.
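In GLAD, the probability that a worker labels a binary item correctly is the logistic function of the product of the worker's expertise (often written alpha) and the item's inverse difficulty (beta, which is positive). The sketch below evaluates that likelihood and the resulting posterior over the true label of a single item. It assumes the parameters are already known, whereas the actual model learns them with EM; the function names, dictionary inputs and numbers are hypothetical.

```python
import math


def p_correct(alpha, beta):
    """GLAD-style probability of a correct label.

    alpha: worker expertise (any real number; negative means adversarial).
    beta:  inverse item difficulty (> 0; small values mean a hard item).
    """
    return 1.0 / (1.0 + math.exp(-alpha * beta))


def posterior_positive(labels, alphas, beta, prior=0.5):
    """P(true label = 1 | observed binary labels) for one item.

    labels: dict worker_id -> 0 or 1; alphas: dict worker_id -> expertise.
    Parameters are taken as given; this sketch does no learning.
    """
    log_like = {1: math.log(prior), 0: math.log(1.0 - prior)}
    for worker, label in labels.items():
        p = p_correct(alphas[worker], beta)
        for z in (0, 1):
            # The worker is correct when the given label equals the candidate z.
            log_like[z] += math.log(p if label == z else 1.0 - p)
    top = max(log_like.values())
    w1 = math.exp(log_like[1] - top)
    w0 = math.exp(log_like[0] - top)
    return w1 / (w0 + w1)


# One expert says 1, two near-random workers say 0.
alphas = {"expert": 3.0, "novice_a": 0.3, "novice_b": 0.3}
labels = {"expert": 1, "novice_a": 0, "novice_b": 0}
easy_item = posterior_positive(labels, alphas, beta=1.0)
hard_item = posterior_positive(labels, alphas, beta=0.1)
```

With these numbers the expert outweighs the two-to-one majority, and the posterior moves closer to 0.5 for the hard item. Note, however, that each worker still has a single alpha for all items and each item a single beta for all workers.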
In practice, workers who are experts on a specific subject tend to label items belonging to that subject more accurately, i.e. the labels they give to these items are closely related to the true labels.
Similarly, items that are considered easy are usually labelled accurately by the workers, whereas items of high difficulty receive a wider range of labels, which adds noise to the aggregated label.
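A small calculation shows how much item difficulty matters for the aggregated label. The sketch assumes binary labels and independent workers who are all correct with the same probability `p` on a given item, which is a simplification we introduce only for this example.

```python
from math import comb


def majority_correct_probability(p, n_workers):
    """P(a strict majority of n independent workers is correct).

    Assumes binary labels, identical per-worker accuracy `p` on this item,
    and an odd number of workers so that ties cannot occur.
    """
    if n_workers % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    needed = n_workers // 2 + 1
    return sum(
        comb(n_workers, k) * p**k * (1 - p) ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


easy = majority_correct_probability(0.9, 5)  # workers are right 90% of the time
hard = majority_correct_probability(0.6, 5)  # workers are right 60% of the time
```

Under these assumptions, five workers almost always agree on the right label for the easy item (about 0.99), while the majority is right only about 68% of the time for the hard item.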
- How to improve it?
One way to improve on the work of Whitehill et al. (2009) is to encode the correlation between workers and items. We can model worker-wise item difficulty and task-wise worker expertise; by incorporating this information, we aim to achieve better performance both in inferring the true label and in learning the parameters of interest.
More specifically, by formulating a probabilistic model of the labelling process, we can infer the true label of each item more precisely. We aim to infer the most likely label for each item, as well as each worker’s expertise parameter, each item’s difficulty parameter, and the correlation between the worker and the item. The results proved accurate and stable, so the approach solves the problem of inferring the true label of items more effectively!
The results of this project will appear in a forthcoming publication by Sanida et al., so watch out for further exciting details of this work!
About Paul Verhaar
Data scientist with a passion for Natural Language Processing. Loves complex problems that kindle creativity and out-of-the-box thinking and projects with social impact. Background in linguistics and New Media. Always in for a chat on data science and/or the impact of technology on civilization. Pro-musician, avid motorcycle rider and single speed bike builder in my spare time.