by 라인하트 Nov 10. 2020

Andrew Ng's Machine Learning (11-5): Data for Machine Learning

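Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free as an online course on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that anyone studying artificial intelligence and machine learning on their own will naturally come across.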

Advice for Applying Machine Learning

Using Large Data Sets

Data for Machine Learning

In the previous video, we talked about evaluation metrics. In this video, I'd like to switch tracks a bit and touch on another important aspect of machine learning system design that often comes up: the issue of how much data to train on. In some earlier videos, I cautioned against blindly going out and spending lots of time collecting lots of data, because only sometimes does that actually help. But it turns out that under certain conditions, and I will say in this video what those conditions are, getting a lot of data and training a certain type of learning algorithm on it can be a very effective way to get very good performance. This arises often enough that if those conditions hold for your problem and you're able to get a lot of data, this could be a very good way to get a very high performance learning algorithm. So in this video, let's talk more about that.

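The previous lecture covered evaluation metrics. This lecture covers another important issue in machine learning system design: how much data to train on. Earlier lectures cautioned against spending a lot of time collecting a lot of data, but there are cases where more data really does help. Under certain conditions and with a certain type of learning algorithm, a large amount of data is a very effective way to improve performance. This lecture covers those cases.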

Let me start with a story. Many, many years ago, two researchers that I know, Michele Banko and Eric Brill, ran the following fascinating study. They were interested in studying the effect of using different learning algorithms versus trying them out on different training set sizes. They were considering the problem of classifying between confusable words. So, for example, in the sentence "For breakfast I ate ___ eggs," should it be to, two, or too? Well, for this example, it is "For breakfast I ate two eggs." So this is one example of a set of confusable words, and then versus than is a different set. So they took machine learning problems like these, sort of supervised learning problems, to try to categorize what is the appropriate word to go into a certain position in an English sentence.

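Here is a story. Long ago, two researchers I know, Michele Banko and Eric Brill, ran an interesting study. They studied the effect of using different kinds of learning algorithms versus using training sets of different sizes. The task was to choose the appropriate word among confusable words in a sentence: for example, choosing between the like-sounding two (the number 2) and too (excessively), or between then and than. The learning algorithm picks the word that fits a given position in an English sentence.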

They took a few different learning algorithms which were sort of considered state of the art back in the day, when they ran the study in 2001. So they took a variant, roughly a variant of logistic regression, called the Perceptron. They also took some other algorithms that were fairly popular back then but are somewhat less used now: the Winnow algorithm, which is also very similar to logistic regression but different in some ways, and what's called a memory-based learning algorithm, again used somewhat less now. But I'll talk a little bit about that later. And they used a naive Bayes algorithm, which is something we'll actually talk about in this course. The exact details of these algorithms aren't important. Think of this as just picking four different classification algorithms. What they did was vary the training set size and try out these learning algorithms on a range of training set sizes.

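In their 2001 study, they used four learning algorithms that were state of the art at the time. The first was the Perceptron, a variant of logistic regression. The second, the Winnow algorithm, was widely used back then but is used less now. The third, a memory-based learning algorithm, is also used less now. Finally, they used a Naive Bayes algorithm. The details of the four algorithms are not important here. They tested the four learning algorithms while varying the training set size.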

That's the result they got. And the trends are very clear. First, most of these algorithms give remarkably similar performance. And second, as the training set size increases (the horizontal axis is the training set size in millions, going from a hundred thousand up to a thousand million, that is, a billion training examples), the performance of the algorithms all pretty much monotonically increases. In fact, if you pick any algorithm, maybe an "inferior algorithm," and give that "inferior algorithm" more data, then from these examples it looks like it will most likely beat even a "superior algorithm."

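The graph shows the results of the study, and the trend is clear. The four learning algorithms perform remarkably similarly, and all of them improve as the training set size increases. The horizontal axis is the training set size, from 100,000 to 1 billion examples, and the vertical axis is the performance of the algorithm. The graph suggests that even the worst algorithm, given enough data, is likely to outperform the best algorithm trained on less.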
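The shape of such an experiment can be sketched in a few lines of Python. This is only a toy illustration, not the setup of the 2001 study: the data is synthetic (Gaussian features with a hidden linear rule), and the names `make_data`, `train_averaging`, and `learning_curves`, the dimension of 20, and the training set sizes are all assumptions made for the example. It trains two simple linear classifiers at several training set sizes and scores each on the same held-out test set.

```python
import random


def make_data(n, true_w, rng):
    """Draw n examples: Gaussian features, label = sign of true_w . x."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        score = sum(w * xi for w, xi in zip(true_w, x))
        data.append((x, 1 if score >= 0 else -1))
    return data


def train_perceptron(data, epochs=5):
    """Classic perceptron: add y * x to the weights on every mistake."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def train_averaging(data):
    """Simpler baseline: weights are the sum of y * x over all examples."""
    w = [0.0] * len(data[0][0])
    for x, y in data:
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def accuracy(w, data):
    """Fraction of examples whose predicted sign matches the label."""
    hits = sum(1 for x, y in data if y * sum(wi * xi for wi, xi in zip(w, x)) > 0)
    return hits / len(data)


def learning_curves(sizes=(20, 200, 2000, 5000), dim=20, seed=0):
    """Train both classifiers at each training set size; score on one test set."""
    rng = random.Random(seed)
    true_w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    test = make_data(2000, true_w, rng)
    trainers = {"perceptron": train_perceptron, "averaging": train_averaging}
    curves = {name: {} for name in trainers}
    for n in sizes:
        train = make_data(n, true_w, rng)
        for name, fit in trainers.items():
            curves[name][n] = accuracy(fit(train), test)
    return curves


if __name__ == "__main__":
    for name, curve in learning_curves().items():
        print(name, {n: round(acc, 3) for n, acc in curve.items()})
```

On this toy problem, either classifier trained on thousands of examples should comfortably beat the other one trained on twenty, which is the pattern the lecture describes.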

Since this original study, which was very influential, there has been a range of different studies showing similar results: many different learning algorithms can sometimes, depending on details, give pretty similar ranges of performance, but what can really drive performance is giving the algorithm a ton of training data. Results like these have led to a saying in machine learning that often it's not who has the best algorithm that wins, it's who has the most data. So when is this true and when is this not true? Because if we have a learning problem for which this is true, then getting a lot of data is often maybe the best way to ensure that we have an algorithm with very high performance, rather than debating and worrying about exactly which of these algorithms to use. Let's try to lay out a set of assumptions under which we think having a massive training set will be able to help.

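Since that original study, many studies on a variety of algorithms have produced similar results: a huge amount of data improves an algorithm's performance. Hence the saying that in machine learning the winner is not whoever has the best algorithm but whoever has the most data. So when is this true, and when is it not? For learning problems where it holds, collecting a lot of data may be the best way to ensure high performance. Let's lay out the assumptions under which a massive training set helps.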

Let's assume that in our machine learning problem, the features x have sufficient information with which we can predict y accurately. For example, take the confusable words that we had on the previous slide. Let's say that the features x capture the surrounding words around the blank that we're trying to fill in. So if the features capture "For breakfast I ate ___ eggs," then that is pretty much enough information to tell me that the word I want in the middle is two, and not the word to, and not the word too. So if the features capture these surrounding words, that gives me enough information to pretty unambiguously decide what the label y is, or in other words, which word I should be using to fill in that blank out of this set of three confusable words. So that's an example where the features x have sufficient information to predict a specific y.

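Assume that in our machine learning problem the features x carry enough information to predict y accurately. For example, the blank in "For breakfast I ate ( ) eggs" could be confused among to, too, and two. The features x capture the words around the blank. With those, the word to fill in is clearly two, not to or too. If the features capture enough information about the surrounding words, the label y, that is, which of the three confusable words belongs in the blank, can be decided. This is a case where the features x carry enough information to predict y.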
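One possible way to turn "the words around the blank" into features x is sketched below. The lecture does not specify a feature encoding, so the window of two words on each side, the `<pad>` token, and the names `context_features` and `labeled_examples` are assumptions made for illustration.

```python
CONFUSABLE = {"to", "two", "too"}


def context_features(tokens, position, window=2):
    """Binary features naming the words within `window` of `position`."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the blank itself; it is the label, not a feature
        i = position + offset
        word = tokens[i].lower() if 0 <= i < len(tokens) else "<pad>"
        feats[f"w[{offset:+d}]={word}"] = 1
    return feats


def labeled_examples(sentence, confusable=CONFUSABLE):
    """Yield (features x, label y) for every confusable word in a sentence."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in confusable:
            yield context_features(tokens, i), tok.lower()


if __name__ == "__main__":
    for x, y in labeled_examples("For breakfast I ate two eggs"):
        print(y, x)
```

For the breakfast sentence, the features record that "ate" precedes the blank and "eggs" follows it, which is the kind of context a fluent reader would use to pick two.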

For a counterexample, consider the problem of predicting the price of a house from only the size of the house and from no other features. So imagine I tell you that a house is 500 square feet but I don't give you any other features. I don't tell you whether the house is in an expensive part of the city, or the number of rooms in the house, or how nicely furnished the house is, or whether the house is new or old. If I don't tell you anything other than that this is a 500 square foot house, well, there are so many other factors that affect the price of a house that if all you know is the size, it's actually very difficult to predict the price accurately. So that would be a counterexample to this assumption that the features have sufficient information to predict the price to the desired level of accuracy.

One way I often think about testing this assumption is to ask myself: given the input features x, the same information that is available to the learning algorithm, if we were to go to a human expert in this domain, could that expert confidently predict the value of y?

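Here is a counterexample. Consider predicting the price of a house from its size alone, ignoring every other feature. Can we really predict the price of a 500-square-foot house? We don't know whether it is in an expensive part of the city, how many rooms it has, how it is furnished, or whether it is new or old. So it is hard to predict the price accurately from the size alone. In this case, the features do not carry enough information to predict the house price accurately.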
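A small numeric sketch makes the point concrete. The four listings below are entirely hypothetical, as is the helper `best_case_mse`; it computes the squared error of the best predictor that sees only the chosen features, which is the mean price of each feature group. With size alone the error stays large however much data of this kind is collected, while adding the neighborhood removes it.

```python
from collections import defaultdict

# Hypothetical listings: (size in sq ft, neighborhood, price in $1000s).
HOUSES = [
    (500, "central", 300),
    (500, "suburb", 120),
    (1000, "central", 550),
    (1000, "suburb", 250),
]


def best_case_mse(houses, key):
    """MSE of the best predictor that only sees key(house): the group mean."""
    groups = defaultdict(list)
    for house in houses:
        groups[key(house)].append(house[2])
    total = 0.0
    for house in houses:
        prices = groups[key(house)]
        mean = sum(prices) / len(prices)
        total += (house[2] - mean) ** 2
    return total / len(houses)


mse_size_only = best_case_mse(HOUSES, key=lambda h: h[0])
mse_size_and_area = best_case_mse(HOUSES, key=lambda h: (h[0], h[1]))

if __name__ == "__main__":
    print("size only:", mse_size_only)
    print("size + neighborhood:", mse_size_and_area)
```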

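There is a way to test whether the input features x carry enough information. Give a human expert, here a real estate expert, exactly the same features x that the algorithm gets. Could the expert confidently predict the house price y?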

For this first example, if we go to an expert human English speaker, someone who speaks English well, then that person, and in fact most people like you and me, would probably be able to predict what word should go in here. A good English speaker can predict this well, and so this gives me confidence that x allows us to predict y accurately. In contrast, suppose we go to an expert in housing prices, like maybe an expert realtor, someone who sells houses for a living. If I just tell them the size of a house and ask them what the price is, well, even an expert in pricing or selling houses wouldn't be able to tell me. So this is a sign that for the housing price example, knowing only the size doesn't give me enough information to predict the price of the house.

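For the first example, ask fluent English speakers which word belongs in the blank. They can predict it accurately, which gives confidence that the features x allow y to be predicted accurately. In contrast, for the second example, ask a realtor for the price of a house given only its size. They cannot give an accurate price, so the features do not provide enough information.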

So, let's say this assumption holds. Let's see, then, when having a lot of data could help. Suppose the features have enough information to predict the value of y. And let's suppose we use a learning algorithm with a large number of parameters, so maybe logistic regression or linear regression with a large number of features. Or one thing that I often do, actually, is use a neural network with many hidden units. That would be another learning algorithm with a lot of parameters. These are all powerful learning algorithms with a lot of parameters that can fit very complex functions, so I'm going to think of these as low-bias algorithms. Chances are, if we run these algorithms on the data set, they will be able to fit the training set well, and so hopefully the training error will be small.

Now let's say we use a massive, massive training set. In that case, even though we have a lot of parameters, if the training set is much larger than the number of parameters, then hopefully these algorithms will be unlikely to overfit, because we have such a massive training set. By unlikely to overfit, what I mean is that the training error will hopefully be close to the test error. Finally, putting these two together, that the training set error is small and the test set error is close to the training error, what these two together imply is that hopefully the test set error will also be small.

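If the features carry enough information, more data can improve the algorithm's performance. Suppose we use a learning algorithm with many parameters, such as logistic regression or linear regression with many features, or a neural network with many hidden units. These are powerful learning algorithms whose many parameters can fit very complex functions, so we call them low-bias algorithms. Run on a data set, a low-bias algorithm will likely fit the training set well, so the training error Jtrain(θ) will be small.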

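Now use a massive training set. Even with many parameters, the algorithm is unlikely to overfit if the training set is much larger than the number of parameters. Unlikely to overfit means the training error will be close to the test error, and since the training error is small, the test error will be small as well.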
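The two-step argument can be written as a simple decomposition. The symbol for the test error is an assumption added here to parallel the Jtrain(θ) notation, and the annotations restate the lecture's informal reasoning rather than a formal bound.

```latex
J_{\text{test}}(\theta)
  = \underbrace{J_{\text{train}}(\theta)}_{\text{small: many parameters, low bias}}
  + \underbrace{\bigl(J_{\text{test}}(\theta) - J_{\text{train}}(\theta)\bigr)}_{\text{small: training set much larger than the number of parameters}}
```

If both terms on the right are small, their sum, the test error, is small too.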

Another way to think about this is that in order to have a high performance learning algorithm, we want it not to have high bias and not to have high variance. The bias problem we address by making sure we have a learning algorithm with many parameters, which gives us a low-bias algorithm. And by using a very large training set, we make sure that we don't have a variance problem. So hopefully our algorithm will have low variance, and by putting these two together we end up with a low-bias and low-variance learning algorithm, and this allows us to do well on the test set.

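A high performance learning algorithm must have neither high bias nor high variance. We address the bias problem by using a learning algorithm with many parameters, and the variance problem by training it on a massive training set. Combining the two gives a learning algorithm with low bias and low variance.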
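The effect of training set size on the gap between training and test error can be seen with a deliberately extreme low-bias model. The sketch below is a toy under stated assumptions: the labeling rule `true_label`, the three integer features, and the lookup-table model `fit_table` are all invented for illustration. The table has one parameter per feature combination (up to 1,000), so it fits any training set perfectly, and its test error depends on how much of the feature space the training set covers.

```python
import random
from collections import Counter, defaultdict


def true_label(x):
    """Hypothetical ground truth: y is fully determined by the features."""
    a, b, c = x
    return (a * b + c) % 2


def sample(n, rng):
    """Draw n examples with three features, each an integer from 0 to 9."""
    xs = [(rng.randrange(10), rng.randrange(10), rng.randrange(10)) for _ in range(n)]
    return [(x, true_label(x)) for x in xs]


def fit_table(train):
    """Low-bias model: one majority-vote parameter per feature combination."""
    votes = defaultdict(Counter)
    for x, y in train:
        votes[x][y] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in votes.items()}


def error(table, data, default=0):
    """Misclassification rate; unseen feature combinations get `default`."""
    wrong = sum(1 for x, y in data if table.get(x, default) != y)
    return wrong / len(data)


def train_test_errors(m, seed=0, test_size=5000):
    """Return (training error, test error) for a training set of size m."""
    rng = random.Random(seed)
    test = sample(test_size, rng)
    train = sample(m, rng)
    table = fit_table(train)
    return error(table, train), error(table, test)


if __name__ == "__main__":
    for m in (100, 1000, 50000):
        print(m, train_test_errors(m))
```

With 100 training examples the table memorizes the training set but guesses on most test inputs, so the gap is large. With 50,000 examples, far more than the number of parameters, the training and test errors should nearly coincide.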

Fundamentally, the key ingredients are these: assuming that the features have enough information and that we have a rich class of functions is what gives us low bias, and having a massive training set is what gives us low variance. So this gives us a set of conditions, and hopefully some understanding, of the sort of problem where, if you have a lot of data and you train a learning algorithm with a lot of parameters, that might be a good way to get a high performance learning algorithm. Really, the key tests that I often ask myself are, first, can a human expert look at the features x and confidently predict the value of y? That's sort of a certification that y can be predicted accurately from the features x. And second, can we actually get a large training set and train a learning algorithm with a lot of parameters on it? If you can do both, then that will more often give you a very high performance learning algorithm.

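A massive training set is effective when the features carry enough information and the class of functions is rich enough to give low bias. For that kind of machine learning problem, training a learning algorithm with many parameters on a lot of data can produce a high performance learning algorithm. In practice, the way to recognize such a problem is to ask yourself two questions. First, can a human expert look at the features x and confidently predict y? This tells you whether the features x carry enough information. Second, can you actually obtain a massive training set? If so, training a learning algorithm with many parameters on that massive training set can improve performance substantially.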

Andrew Ng's Machine Learning video lecture

Summary

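The study tested four learning algorithms while varying the training set size. As the training set grows, all four learning algorithms improve. The graph suggests that even the worst algorithm, given more data, is likely to beat the best one. Giving an algorithm a huge amount of data can improve its performance. Hence the saying that in machine learning the final winner is not whoever has the best algorithm but whoever has the most data.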

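When is a massive data set worth having? First, put the question you are asking the algorithm to a human expert. If the expert can answer, the features are sufficient; if not, the features are lacking. Second, it matters whether a massive data set can actually be obtained.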

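A learning algorithm with many features, and therefore many parameters, is prone to overfitting during training. But if the training set is large enough, it is unlikely to overfit, and the training and test errors will be close. In other words, the bias problem can be addressed with many parameters and the variance problem with a very large training set.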

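Therefore, a massive data set can improve the performance of a learning algorithm. Learning algorithms with many parameters, such as logistic regression and linear regression with many features or neural networks with many hidden units, can keep improving as the amount of data grows.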