by 변재현 Jul 20. 2024

Paper review: Data collection and quality challenges in deep learning

Whang, S.E., Roh, Y. et al., 2023

Whang, S.E., Roh, Y., Song, H. et al. Data collection and quality challenges in deep learning: a data-centric AI perspective. The VLDB Journal 32, 791–813 (2023).

https://link.springer.com/article/10.1007/s00778-022-00775-9

Keywords

Data collection · Data quality · Deep learning · Data-centric AI

Abstract

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here, software engineering needs to be re-thought so that data become a first-class citizen on par with code.

A significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream.

Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications.

Recent deep learning approaches need large amounts of data more than they need feature engineering. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques.

Overview

We are going through a fundamental paradigm shift in software engineering where machine learning becomes the new software (referred to as Software 2.0 [134]). According to the IDC [41], the amount of data worldwide is projected to grow exponentially to 175 zettabytes (ZB) by 2025.

Conventional software engineering involves designing, implementing, and debugging code. In comparison, machine learning starts with data and trains a function on the data. It is known that data preparation is an expensive step in end-to-end machine learning.

In particular, collecting data, cleaning it, and making it suitable for machine learning training takes 45% [43] or even 80-90% [24,153] of the entire time. Solving data issues is increasingly becoming critical in machine learning research.

While data collection and quality issues are important, machine learning research has mainly focused on training algorithms instead of the data. According to [153], a common complaint in the industry is that research institutions spend 90% of their machine learning efforts on algorithms and 10% on data preparation, although based on the amounts of time actually spent, the ratio should be reversed: 10% on algorithms and 90% on data preparation.

At the same time, many companies are promising to use responsible and data-centric AI practices. For example, Google [120] says that AI has significant potential to help solve challenging problems, but that it is important to develop it responsibly.

Microsoft [121] pledges to advance AI using ethical principles that put people first. Other companies make similar statements [109,154]. More recently, data-centric AI [42] is becoming critical, where the primary goal is not to improve the model training algorithm, but to improve the data pre-processing for better model accuracy.

Unfortunately, many industries do not adopt deep learning simply because of the lack of data and the lack of explainability of the trained models. We believe it is important to have a bird's-eye view of data issues in the entire deep learning process in order to advance data-centric AI.

Fig. 1 Deep learning challenges from a data-centric AI perspective. Data collection and quality issues cannot be resolved in a single step, but must be addressed throughout the entire machine learning process.

While the coverage of this survey is broad, we believe it is important to have a bird's-eye view of data issues in the entire deep learning process in order to advance data-centric AI. Each subtopic is not only substantial, but also studied by a different community.

Data collection, cleaning, and validation have been traditionally studied in the data management community. Robust model training is a central topic in the machine learning and security communities, while fair model training is a popular topic in the machine learning and fairness communities. Both fairness and robustness topics are increasingly being researched in the data management community as well because they are closely related to the input data.

Data-centric AI is a nascent field that cannot be covered by solving just one of these areas, but instead will ultimately need an orchestration within a holistic framework. Our contribution is thus to connect these related topics together at a high level with a focus on recent and significant works.

Table 1 Taxonomy of techniques covered in this survey

Fig. 2 shows a decision tree of how the techniques connect with each other in one workflow. Our work targets researchers and practitioners who need a starting point for understanding how data plays a key role in data-centric AI.

Fig. 2 Decision tree on how data-centric AI techniques connect with each other in one workflow

In summary, deep learning is becoming prevalent thanks to big data and fast computation, and software engineering is going through a new paradigm shift. Big data for deep learning, however, has been relatively understudied, even though it is becoming critical in data-centric AI.

We cover the following topics in the next sections:

• Data collection techniques for machine learning (Sect. 2).

• Data validation, cleaning, and integration techniques for machine learning (Sect. 3).

• Robust training techniques for coping with noisy and poisoned data (Sect. 4).

• Fair training techniques for coping with biased data (Sect. 5).

• Overall findings and future directions (Sect. 6).

Data collection

There are three main approaches for data collection. First, data acquisition is the problem of discovering, augmenting, or generating new datasets. Second, data labeling is the problem of adding informative annotations to data so that a machine learning model can learn from them. Since labeling is expensive, there is a variety of techniques to use including semi-supervised learning, crowdsourcing, and weak supervision. Finally, if one already has data, improving existing data and models can be done instead of acquiring or labeling from scratch.

Data acquisition

If there is not enough data, the first option is to perform data acquisition, which is the process of finding datasets that are suitable for training machine learning models. In this survey, we cover three approaches: data discovery, data augmentation, and data generation. Data discovery is the problem of indexing and searching datasets. Data augmentation takes labeled examples and distorts or combines them to generate synthetic examples. If there is not enough data around, the last resort is to take matters into one's own hands and create datasets using crowdsourcing or synthetic data generation techniques.

Data discovery

Data discovery is the problem of indexing and searching datasets that exist either in corporate data lakes [54,157] or on the Web [25].

Data augmentation

For data augmentation, a popular method for generating data in the machine learning community is generative adversarial networks (GANs) [59,60,89]. One limitation of a GAN is that it cannot generate data that is completely different from the existing data.
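A GAN is too large to sketch here, but the basic idea of augmentation described earlier, distorting or combining labeled examples, can be shown with much simpler operations. The following is a minimal sketch under assumptions: images are small 2-D lists, labels are one-hot lists, and the function names (`horizontal_flip`, `mix_examples`, `augment`) are hypothetical. The flip stands in for "distort" and a convex combination of two examples stands in for "combine"; neither is the method of the cited papers.

```python
import random

def horizontal_flip(image):
    """Distort one example: mirror each row of a 2-D image (list of lists)."""
    return [list(reversed(row)) for row in image]

def mix_examples(image_a, label_a, image_b, label_b, weight):
    """Combine two examples with a pixel-wise and label-wise convex combination."""
    mixed_image = [
        [weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_label = [weight * a + (1.0 - weight) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label

def augment(dataset, rng):
    """Return the original examples plus one flipped and one mixed example for each."""
    augmented = list(dataset)
    for image, label in dataset:
        augmented.append((horizontal_flip(image), label))
        other_image, other_label = rng.choice(dataset)
        weight = rng.uniform(0.0, 1.0)
        augmented.append(mix_examples(image, label, other_image, other_label, weight))
    return augmented

# Two toy 2x2 "images" with one-hot labels (illustrative data only).
dataset = [
    ([[0.0, 1.0], [0.0, 1.0]], [1.0, 0.0]),
    ([[1.0, 0.0], [1.0, 0.0]], [0.0, 1.0]),
]
augmented = augment(dataset, random.Random(0))
```

Like a GAN, these operations only produce examples that stay close to the existing data, so they do not remove the limitation noted above.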

Data generation

Another option for collecting or acquiring new data is to generate data. A popular option is to use crowdsourcing platforms like Amazon Mechanical Turk [1] where one can create tasks and pay human workers to create or find data. Domain randomization [158,159] is an effective technique for generating a wide range of realistic data from a simulator by varying its parameters. We note that GANs also generate new data, but they require sufficient amounts of real data for model training.
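To make "varying the simulator's parameters" concrete, here is a toy sketch of domain randomization. Everything in it is an assumption made for illustration: the one-dimensional "simulator", its three parameters (brightness, object size, noise level), and their ranges are hypothetical and far simpler than the simulators used in [158,159].

```python
import random

def simulate_scene(brightness, object_size, noise_level, rng):
    """Hypothetical toy simulator: a 1-D 'image' with a bright object on a dark background."""
    width = 16
    start = rng.randrange(0, width - object_size + 1)
    pixels = []
    for i in range(width):
        base = brightness if start <= i < start + object_size else 0.0
        pixels.append(base + rng.gauss(0.0, noise_level))
    # The label (object position) comes for free because the simulator knows it.
    return pixels, start

def domain_randomized_dataset(n, seed=0):
    """Draw fresh simulator parameters for every sample so the data cover many conditions."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        brightness = rng.uniform(0.5, 1.0)
        object_size = rng.randint(2, 5)
        noise_level = rng.uniform(0.0, 0.1)
        data.append(simulate_scene(brightness, object_size, noise_level, rng))
    return data

samples = domain_randomized_dataset(100)
```

The design point is that labels are exact and free, and diversity comes from the parameter ranges rather than from collecting more real data.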

Data labeling

Utilize existing labels. The traditional approach for labeling is semi-supervised learning [160,188] where the idea is to use existing labels to predict the other labels. One can utilize existing machine learning benchmarks [50,79] that provide labeled data for a variety of tasks. The simplest form is Self-training [174] where a model is trained on the available labeled data and then applied to the unlabeled data. Then, the predictions with the highest confidence values are trusted and added to the training set. This approach assumes that we can trust the high confidence, but there are other techniques including Tri-training [186], Co-learning [185], and Co-training [20] that do not rely on this assumption.
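The self-training loop described above can be sketched as follows. This is a minimal illustration, not the procedure of [174]: the nearest-centroid classifier, the softmax-over-distances confidence score, and the 0.8 threshold are all assumptions chosen to keep the example self-contained.

```python
import math

def fit_centroids(points, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict_with_confidence(centroids, point):
    """Return (label, confidence), with confidence from a softmax over negative distances."""
    scores = {label: math.exp(-math.dist(point, c)) for label, c in centroids.items()}
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly add confidently predicted unlabeled points to the training set."""
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(points, labels)
        confident, remaining = [], []
        for point in pool:
            label, confidence = predict_with_confidence(centroids, point)
            (confident if confidence >= threshold else remaining).append((point, label))
        if not confident:
            break  # nothing left that the model trusts
        for point, label in confident:
            points.append(point)
            labels.append(label)
        pool = [point for point, _ in remaining]
    return fit_centroids(points, labels), pool

labeled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabeled = [(1.0, 0.5), (9.0, 9.5), (0.5, 1.5), (5.0, 5.0)]
model, still_unlabeled = self_train(labeled, unlabeled)
```

The point halfway between the two classes never reaches the threshold and stays unlabeled. The sketch also makes the assumption visible: a confidently wrong prediction would be added to the training set and reinforce itself.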

Improving existing data

In addition to searching and labeling datasets, one can also improve the quality of existing data and models. This approach is useful in several scenarios. Suppose the target application is novel or non-trivial so that there are no relevant datasets outside, or collecting more data no longer benefits the model's accuracy due to its low quality. Here, a better option may be to improve the existing data. One effective approach is to improve the labels through re-labeling. Sheng et al. [146] demonstrate the importance of improving labels by showing the model accuracy trends against more training examples for datasets with different qualities.

Data validation, cleaning, and integration

A recent survey [75] mentions that robust training is considered more effective than data cleaning before model training. Yet another issue is incorporating AI ethics like model fairness [18]: data may be biased, which may cause the trained model to be discriminatory.

Data validation

Data visualization is a widely used and effective way to validate data for machine learning (see a tutorial [117] and survey [118]).

Data cleaning

Data cleaning has a long history of removing various well-defined errors by satisfying integrity constraints including key constraints, domain constraints, referential integrity constraints, and functional dependencies. For an introduction, see the book Data Cleaning [74]. There is also a recent survey on data cleaning techniques for machine learning and vice versa [75].
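As a small illustration of what checking integrity constraints means in practice, the sketch below detects violations of a key constraint and of a functional dependency in a list of records. The table, the column names, and the helper functions are hypothetical; real cleaning systems also repair the violations, which this sketch does not attempt.

```python
from collections import defaultdict

def key_violations(rows, key):
    """Key constraint: return key values that appear in more than one row."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return sorted(value for value, count in counts.items() if count > 1)

def functional_dependency_violations(rows, determinant, dependent):
    """Functional dependency determinant -> dependent: return determinant values
    that map to more than one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    return {value: sorted(deps) for value, deps in seen.items() if len(deps) > 1}

# Toy table (illustrative data only): "zip" should determine "city", "id" should be unique.
rows = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "NYC"},
    {"id": 3, "zip": "94105", "city": "San Francisco"},
    {"id": 3, "zip": "94105", "city": "San Francisco"},
]
duplicate_ids = key_violations(rows, "id")
inconsistent_zips = functional_dependency_violations(rows, "zip", "city")
```

Here the duplicated `id` 3 violates the key constraint, and zip code 10001 violates the dependency because it maps to two spellings of the same city.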

Unfortunately, only focusing on fixing the data does not necessarily guarantee the best model accuracy. At first glance, it seems that perfectly cleaning the data would be most useful for the model training. However, the notion of clean data is not always clear cut, and removing all possible errors is not always feasible.

CleanML [96] is a framework that evaluates various data cleaning techniques and checks whether they actually help model accuracy. The authors show that data cleaning does not necessarily improve downstream machine learning models. In fact, the cleaning may sometimes have a negative effect on the models. However, by selecting an appropriate machine learning model, one can eliminate the negative effects of data cleaning. Moreover, many data cleaning primitives have high-impact parameters like thresholds that need to be tuned, similar to machine learning hyperparameter tuning. Hence, data cleaning techniques that are not originally designed for machine learning must be used carefully.

Recently, there are data cleaning techniques with the specific purpose of improving model accuracy [48]. ActiveClean [88] is a seminal framework that iteratively cleans samples of dirty data and updates the model. Another branch of research is to clean the labels for the purpose of improving model accuracy.

TARS [47] is a system that predicts model accuracy from noisy labels produced by crowdsourcing. More recently, there are more systematic approaches to support data cleaning for machine learning. One study [128] shows how data quality issues affect MLOps and proposes various solutions to tackle them. For example, CPClean [82] is proposed to analyze how missing data impacts the certainty of predictions.

Data sanitization

Data poisoning has recently become a serious issue because changing a fraction of training data, which may come from an untrusted source, may alter the model's behavior. Compared to dirty data, there is a malicious intent to make the model fail. Data poisoning is a real problem because data are now easier to publish through dataset search engines. A dataset owner can simply post metadata to the public, which will be automatically crawled by the search engine. Then, one can simply harvest that data using web crawlers without knowing that the data are poisoned.

Data sanitization [39] is the problem of defending against such poisoning attacks and can be viewed as an extreme version of data cleaning. Recently, data poisoning techniques have become much more sophisticated and therefore harder to defend against [143,187].

We illustrate a state-of-the-art data poisoning technique for deep learning [187]. Data sanitization techniques [39,72,114] have been proposed throughout the years, and a recent study [87] evaluates various defenses by developing attacks and checking whether the defenses are still effective. Unfortunately, the conclusion is that no technique can adequately defend against carefully designed attacks. We suspect that data poisoning and sanitization techniques will continue to evolve and compete with each other.

Multimodal data integration

Another dimension of data management to consider is the issue of multimodal data integration [11]. So far, we implicitly assumed single-source datasets, but in practice, data scientists often deal with multimodal data from multiple sources. For example, autonomous vehicles can generate a wide range of data including multiple video streams, radar and lidar data, and thousands of irregular time series from the Controller Area Network (CAN) of the vehicle. Analyzing all of this data together requires some form of data integration.

In machine learning, two relevant integration techniques are alignment and co-learning. Alignment finds relationships between sub-components of instances that have multiple modalities. For example, if there are multi-view time series, one can perform subsampling, forward or backward filling, or aggregation in time windows so that the time series can be better integrated.
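Two of the alignment operations named above, forward filling and aggregation in time windows, can be sketched as follows. The sketch assumes integer timestamps and two hypothetical sensor streams (`radar` and `can_bus`); the function names are illustrative and not taken from any library.

```python
def forward_fill(series, timestamps):
    """Resample (time, value) pairs onto `timestamps`, carrying the last seen value forward."""
    series = sorted(series)
    result, index, last = [], 0, None
    for t in timestamps:
        while index < len(series) and series[index][0] <= t:
            last = series[index][1]
            index += 1
        result.append(last)  # None until the first observation arrives
    return result

def window_mean(series, start, end, width):
    """Average (time, value) pairs inside windows [start + k*width, start + (k+1)*width)."""
    n_windows = (end - start) // width
    buckets = [[] for _ in range(n_windows)]
    for t, value in series:
        if start <= t < start + n_windows * width:
            buckets[(t - start) // width].append(value)
    return [sum(b) / len(b) if b else None for b in buckets]

# Two irregular streams sampled at different times (illustrative data only).
radar = [(0, 1.0), (3, 2.0), (7, 4.0)]
can_bus = [(0, 10.0), (1, 12.0), (2, 14.0), (5, 20.0), (6, 22.0)]

grid = [0, 2, 4, 6, 8]
radar_on_grid = forward_fill(radar, grid)
can_on_grid = window_mean(can_bus, start=0, end=10, width=2)
aligned = list(zip(grid, radar_on_grid, can_on_grid))
```

After alignment, every row of `aligned` refers to the same time slot in both streams, which is what lets a downstream model consume them together. Empty windows are left as `None` rather than guessed.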

Co-learning uses one modality to train better on another. For example, if there are embeddings from different modalities, one approach is to concatenate them together for a multimodal representation. In general, data integration is by itself a large research area that has been studied for decades [46,152], although not all techniques are relevant to machine learning.

Robust model training

Even after collecting the right data and cleaning it, data quality may still be an issue during model training. It is widely agreed that real-world datasets are dirty and erroneous despite the data cleaning process. As summarized in Table 2, these flaws in datasets can be categorized depending on whether data values are noisy or missing and depending on whether these flaws exist in data features (attributes) or labels.

The problem of data poisoning has been studied in theory (i.e., robust statistics) and practice for over fifty years and has gained a lot of attention in the machine learning community [73,161]. It starts with a basic question, 'can the machine learning model learn and predict as if the data were not corrupted?' and aims to develop machine learning algorithms robust to the worst-case corruptions where we cannot recover the entire clean information from the data. It mainly considers the corruptions in data features, which include outliers and adversarial examples.

Statistical approaches like robust mean estimation [97] aim to recover the mean of the distribution in the presence of data flaws. Convex programming [44] and filtering [33] address the problem by assigning a score to each data point based on the degree to which the sample is considered corrupted. This line of work has inspired many robust optimization techniques in machine learning, such as loss reweighting and sample selection. In addition, robust machine learning involves many problems depending on what sorts of damage we consider. For example, privacy-preserving machine learning aims to respect the privacy of the users providing the data [119].
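The idea of scoring each point by how corrupted it looks and then discounting the worst ones can be shown with a deliberately simple one-dimensional filter. This is not the filtering algorithm of [33]: the score (distance from the median) and the assumed known corruption fraction are simplifications chosen for illustration.

```python
import statistics

def filtered_mean(values, corruption_fraction):
    """Toy filter: score points by distance from the median, drop the
    highest-scoring fraction, and average what is left."""
    center = statistics.median(values)
    scored = sorted(values, key=lambda v: abs(v - center))
    n_keep = len(values) - int(corruption_fraction * len(values))
    kept = scored[:n_keep]
    return statistics.fmean(kept), kept

# Clean measurements around 10, plus two injected outliers (illustrative data only).
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
poisoned = clean + [1000.0, 950.0]

naive = statistics.fmean(poisoned)
robust, kept = filtered_mean(poisoned, corruption_fraction=0.2)
```

The naive mean is pulled far away from 10 by two bad points, while the filtered mean stays close to the clean value. Sample selection in robust training follows the same pattern, with per-example loss typically playing the role of the score.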

Fair model training

We now focus on the issue of model fairness where biased data may cause a model to be discriminatory and thus unfair. This problem is closely related to robust model training where instead of addressing noise in the training data, the goal is to address bias. An extensive discussion on fairness and ethics can be found in the recent fair ML book [12], and here, we only focus on fairness issues with technical solutions.

In particular, we discuss how to measure fairness and how to mitigate unfairness. In addition, we discuss a recent trend of how fair and robust techniques are converging. This trend is natural, as bias and noise can affect each other, and only addressing fairness may negatively affect robustness and vice versa. This section extends recent tutorials [94,169] by the authors.

Fairness measures

Fairness cannot be described by one notion, and there are tens of possible definitions summarized in various surveys [12,36,100,165] used for predicting crime, hiring, giving loans, and more. We use the following notations: Y denotes the label of a sample, \hat{Y} the prediction of a model, and Z a sensitive attribute like race or gender. Choosing a sensitive attribute depends on what is considered sensitive in the application. For example, if a company may run into trouble by discriminating based on age, then an attribute that is related to age can be considered sensitive.
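To show how the notation Y, \hat{Y}, and Z is used, the sketch below computes two commonly used group fairness measures. They are chosen here for illustration and are not necessarily the ones this review's source emphasizes: demographic parity compares P(\hat{Y}=1 | Z=z) across groups, and equal opportunity compares P(\hat{Y}=1 | Y=1, Z=z). The function names and the data are hypothetical.

```python
def positive_rate(y_hat, z, group):
    """P(Y_hat = 1 | Z = group): share of positive predictions within one group."""
    selected = [p for p, g in zip(y_hat, z) if g == group]
    return sum(selected) / len(selected)

def true_positive_rate(y, y_hat, z, group):
    """P(Y_hat = 1 | Y = 1, Z = group): share of actual positives predicted positive."""
    selected = [p for t, p, g in zip(y, y_hat, z) if g == group and t == 1]
    return sum(selected) / len(selected)

def demographic_parity_gap(y_hat, z):
    """Largest difference in positive prediction rate between any two groups."""
    rates = [positive_rate(y_hat, z, g) for g in set(z)]
    return max(rates) - min(rates)

def equal_opportunity_gap(y, y_hat, z):
    """Largest difference in true positive rate between any two groups."""
    rates = [true_positive_rate(y, y_hat, z, g) for g in set(z)]
    return max(rates) - min(rates)

# Y: true labels, Y_hat: model predictions, Z: sensitive attribute (illustrative data only).
y = [1, 1, 0, 0, 1, 1, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0]
z = ["a", "a", "a", "a", "b", "b", "b", "b"]

dp_gap = demographic_parity_gap(y_hat, z)
eo_gap = equal_opportunity_gap(y, y_hat, z)
```

A gap of 0 means the measure is satisfied exactly. In this toy data group "a" receives positive predictions three times as often as group "b", so both gaps are 0.5.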

Unfairness mitigation

Although there are many ways to measure fairness, one would ultimately like to perform unfairness mitigation [12,14]. Data bias can be addressed either before, during, or after model training. These approaches are referred to as pre-processing, in-processing, and post-processing approaches, respectively. Pre-processing approaches can be viewed as data cleaning, but with a focus on improving fairness. For each approach, we cover representative techniques.

Pre-processing mitigation

The goal of pre-processing mitigation is to fix the unfairness before model training by removing data bias. The advantage is that we may be able to solve the root cause of unfairness within the data. A disadvantage is that it may be tricky to ensure that the model fairness actually improves when we only operate on the data.

A naïve approach that does not work is to remove sensitive attributes (also referred to as unawareness) because they are usually correlated with other attributes. For example, removing sensitive attributes like race, income, and gender does not ensure fairness because their values can be inferred using correlated attributes like zip code, credit score, and browsing history, respectively. We cover three natural approaches for pre-processing: repairing data, generating data, and acquiring data.
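Why unawareness fails can be checked on a small synthetic table: even after the sensitive column is dropped, a correlated column lets anyone guess it back. The records, column names, and helper functions below are hypothetical and the data are made up for illustration.

```python
from collections import Counter, defaultdict

def infer_from_proxy(records, proxy, sensitive):
    """Learn the most common sensitive value for each value of the proxy attribute."""
    by_proxy = defaultdict(Counter)
    for record in records:
        by_proxy[record[proxy]][record[sensitive]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_proxy.items()}

def recovery_accuracy(records, proxy, sensitive):
    """Fraction of records whose sensitive value is recovered from the proxy alone."""
    mapping = infer_from_proxy(records, proxy, sensitive)
    hits = sum(mapping[record[proxy]] == record[sensitive] for record in records)
    return hits / len(records)

# Synthetic data: zip code "A" is mostly group "x", zip code "B" is mostly group "y".
records = (
    [{"zip": "A", "group": "x"}] * 4 + [{"zip": "A", "group": "y"}] * 1 +
    [{"zip": "B", "group": "y"}] * 4 + [{"zip": "B", "group": "x"}] * 1
)
accuracy = recovery_accuracy(records, proxy="zip", sensitive="group")
```

With two equally sized groups, guessing blindly would be right about half of the time; using the zip code alone is right 80% of the time here, so a model trained without the `group` column can still treat the groups differently.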

If there is not enough data to satisfy fairness, an alternative is to generate new data using the available data. A recent method [34] is to generate unbiased data using weak supervision. As data are increasingly available, acquiring data from external data sources is also becoming a viable option [9,31].

In-processing mitigation

We now cover representative in-processing techniques for unfairness mitigation, which modify the model training itself. The advantage is that one can directly optimize accuracy and fairness.

Overall findings and future directions

We summarize our findings. In Sect. 2, we explained that data collection techniques consist of data acquisition, data labeling, and improving existing data and models. Some of the techniques have been studied by the data management community while others by the machine learning community. In Sect. 3, we covered key approaches in data validation, data cleaning, data sanitization, and data integration. Data validation can be performed using visualizations and schema information. Data cleaning has been heavily studied, and recent techniques are more tailored to improving model accuracy. Data sanitization has the different flavor of defending against poisoning attacks. Data integration is challenging due to multimodal data.

In Sect. 4, we explained that noisy or missing labels incur poor generalization on test data. Existing work for noisy labels suffers from either (i) accumulated noise or (ii) partial exploration of training data. Hybrid (e.g., SELFIE) and semi-supervised techniques (e.g., DivideMix) can achieve very high accuracy with noisy training data. Semi-supervised (e.g., MixMatch) and self-supervised (e.g., JigsawNet) techniques are actively developed to exploit abundant unlabeled data.

In Sect. 5, we covered fairness measures, unfairness mitigation techniques, and convergence with robustness techniques. The mitigation can be done before, during, or after model training. Pre-processing is useful when training data can be modified. In-processing is useful when the training algorithm can be modified. Post-processing can be used when we cannot modify the data and model training. The convergence with robustness techniques can be categorized into fair-robust techniques, robust-fair techniques, and equal mergers.

As data-centric AI becomes more established, we believe there will be various convergences of these research areas. Our list is certainly not exhaustive, but we attempt to identify the major trends.

Data cleaning and robust training

Currently, data cleaning is becoming more machine learning oriented, but is considered less effective than robust training. We believe that the two techniques should continue integrating for the best results.

Data validation and model fairness

The recent works in data validation point to AI ethics as one of the challenging aspects to validate. We believe that model fairness will eventually be merged into the data validation process.

Data collection

So far, most of the machine learning literature assumes that the input data are already given. At the same time, data collection for accurate machine learning is now an active research direction in the data management community. We believe this trend will continue to expand where data collection needs to also consider fairness and robustness.

Model training and testing

Improving model training and testing protocols is becoming another solution for dealing with data quality issues. The output of the model on data samples provides useful knowledge for evaluating the data, helping to develop accurate and robust inference pipelines. We believe that the learning dynamics of models provide new perspectives for interpreting robustness and fairness.

Model fairness and robustness

Trustworthy AI is becoming increasingly critical in the machine learning community, and we believe its various aspects including fairness and robustness will have to be addressed together instead of one at a time. There are other elements of Trustworthy AI including privacy and explainability that should eventually be part of data-centric AI as well.

Concluding remark

In the data-centric AI era, collecting data and improving its quality will only become more critical for deep learning. We covered four major topics (data collection; data cleaning, validation, and integration; robust model training; and fair model training), which have been studied by different communities, but need to be used together. We believe all the data techniques will eventually converge with the robust and fair training techniques as data-centric AI matures, and hope that our survey plays a catalyst role.

References

1. Amazon Mechanical Turk. https://www.mturk.com/. Accessed 13 July 2022

2. Amazon SageMaker Ground Truth. https://aws.amazon.com/sagemaker/groundtruth/. Accessed 13 July 2022

3. Amazon scraps secret AI recruiting tool that showed bias against women. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G. Accessed 13 July 2022

4. Agarwal, A., Beygelzimer, A., Dudík, M., Langford, J., Wallach, H.M.: A reductions approach to fair classification. In: ICML, pp. 60–69 (2018)

18. Biessmann, F., Golebiowski, J., Rukat, T., Lange, D., Schmidt, P.: Automated data validation in machine learning systems. IEEE Data Eng. Bull. 44(1), 51–65 (2021)

24. CrowdFlower Data Science Report. https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

25. Cafarella, M.J., Halevy, A.Y., Lee, H., Madhavan, J., Cong, Y., Wang, D.Z., Wu, E.: Ten years of webtables. PVLDB 11(12), 2140–2149 (2018)

34. Choi, K., Grover, A., Singh, T., Shu, R., Ermon, S.: Fair generative modeling via weak supervision. In: ICML, pp. 1887–1898 (2020)

39. Cretu, G.F., Stavrou, A., Locasto, M.E., Stolfo, S.J., Keromytis, A.D.: Casting out demons: sanitizing training data for anomaly sensors. In: IEEE S&P, pp. 81–95 (2008)

41. Data age 2025. https://www.seagate.com/our-story/data-age-2025/

42. Data-centric AI resource hub. https://datacentricai.org/

43. Data prep still dominates data scientists’ time, survey

46. Doan, A., Halevy, A.Y., Ives, Z.G.: Principles of Data Integration. Morgan Kaufmann, Burlington (2012)

54. Fernandez, R.C., Abedjan, Z., Koko, F., Yuan, G., Madden, S., Stonebraker, M.: Aurum: a data discovery system. In: ICDE, pp. 1001–1012 (2018)

59. Goodfellow, I.J.: NIPS 2016 tutorial: generative adversarial networks. CoRR arXiv:1701.00160 (2017)

60. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)

72. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22(2), 85–126 (2004)

73. Huber, P.J.: Robust estimation of a location parameter. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics, pp. 492–518. Springer, Berlin (1992)

75. Ilyas, I.F., Rekatsinas, T.: Machine learning and data cleaning: Which serves the other? J. Data Inf. Qual. (2021). Just Accepted

87. Koh, P.W., Steinhardt, J., Liang, P.: Stronger data poisoning attacks break data sanitization defenses. CoRR arXiv:1811.00741 (2018)

89. Kurach, K., Lucic, M., Zhai, X., Michalski, M., Gelly, S.: The GAN landscape: losses, architectures, regularization, and normalization. CoRR arXiv:1807.04720 (2018)

109. Principles for AI ethics. https://research.samsung.com/artificial-intelligence. Accessed 13 July 2022

110. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE TKDE 22(10), 1345–1359 (2010)

114. Paudice, A., Muñoz-González, L., György, A., Lupu, E.C.: Detection of adversarial training examples in poisoning attacks through anomaly detection. CoRR arXiv:1802.03041 (2018)

117. Polyzotis, N., Roy, S., Whang, S.E., Zinkevich, M.: Data management challenges in production machine learning. In: SIGMOD, pp. 1723–1726 (2017)

120. Responsible AI practices. https://ai.google/responsibilities/responsible-ai-practices. Accessed 13 July 2022

121. Responsible AI principles from Microsoft. https://www.microsoft.com/en-us/ai/responsible-ai. Accessed 13 July 2022

128. Ricci, F., Rokach, L., Shapira, B. (eds.): Recommender Systems Handbook. Springer, Berlin (2015)

134. Software 2.0. https://medium.com/@karpathy/software-2-0-a64152b37c35. Accessed 13 July 2022

143. Shalev, L.: Denoising natural images with a model of sparse coding and overcomplete. Appl. Math. Comput. 205(2), 883–889 (2008)

146. Sheng, V.S., Provost, F.J., Ipeirotis, P.G.: Get another label? improving data quality and data mining using multiple, noisy labelers. In: KDD, pp. 614–622 (2008)

153. Stonebraker, M., Zaniolo, C.: Machine learning and big data: what is important? IEEE Data Eng. Bull. 42, 3–7 (2019)

157. Thirumuruganathan, S., Yang, S., Zhang, N., Joglekar, M., Nirkhiwale, A., Ouzzani, M., Tang, N., Paulson, N., Cao, J., Chiu, D.K.W.: Data curation with deep learning and weak supervision: wrangling the challenge of journey from the lab to the lake. In: DEB (2015)

158. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain randomization for transferring deep neural networks from simulation to the real world. In: IROS, pp. 23–30 (2017)

159. Tremblay, J., Prakash, A., Acuna, D., Brophy, M., Jampani, V., Anil, C., To, T., Cameracci, E., Boochoon, S., Birchfield, S.: Training deep networks with synthetic data: bridging the reality gap by domain randomization. In: CVPR Workshops, pp. 969–977 (2018)

161. Tukey, J.W.: A survey of sampling from contaminated distributions. In: Contributions to Probability and Statistics, pp. 448–485 (1960)

174. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: ACL, pp. 189–196, Stroudsburg, PA, USA (1995). Association for Computational Linguistics

186. Zhu, X.: Semi-supervised learning literature survey. Technical report, Computer Sciences, University of Wisconsin-Madison (2005)

188. Triguero, I., García, S., Herrera, F.: Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 42(2), 245–284 (2015)
