by 라인하트, Oct 31, 2020

Andrew Ng's Machine Learning (10-1): Evaluating an Algorithm

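Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in artificial intelligence. The machine learning course he taught to beginners at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for anyone starting out in machine learning, and one you will naturally run into when studying AI and machine learning on your own.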

Advice for Applying Machine Learning

Evaluating a Learning Algorithm

Deciding What to Try Next

By now you have seen a lot of different learning algorithms. And if you've been following along with these videos, you should consider yourself an expert on many state-of-the-art machine learning techniques. But even among people who know a certain learning algorithm, there's often a huge difference between someone who really knows how to powerfully and effectively apply that algorithm and someone who's less familiar with some of the material that I'm about to teach, who doesn't really understand how to apply these algorithms and can end up wasting a lot of time trying things that don't really make sense.

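So far we have covered a variety of learning algorithms. If you have studied the lectures diligently, you can consider yourself well versed in machine learning techniques. Even so, skill levels differ among practitioners. There is a big gap between those who know how to apply an algorithm powerfully and effectively and those who are less familiar with the material and cannot apply it properly. The latter waste a lot of time on work that does not make sense.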

What I would like to do is make sure that, if you are developing machine learning systems, you know how to choose one of the most promising avenues to spend your time pursuing. In this and the next few videos, I'm going to give a number of practical suggestions, advice, and guidelines on how to do that. Concretely, what we'll focus on is this problem: suppose you are developing a machine learning system or trying to improve the performance of a machine learning system. How do you go about deciding which promising avenues to try next?

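The goal is to make sure you spend your time well when developing a machine learning system. This lecture and the next few offer practical suggestions, advice, and guidelines. The central question is this: when you are building a machine learning system or improving its performance, how do you choose what to work on next?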

To explain this, let's continue using our example of learning to predict housing prices. Let's say you've implemented regularized linear regression, thus minimizing the cost function J. Now suppose that after you take your learned parameters and test your hypothesis on a new set of houses, you find that it is making huge errors in its predictions of the housing prices. The question is: what should you try next in order to improve the learning algorithm? There are many things one can think of that could improve the performance of the learning algorithm.

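Continuing the housing-price example, suppose you have implemented regularized linear regression, which minimizes the cost function J(θ). When the hypothesis with the learned parameters θ predicts the prices of new houses, it makes large errors. What should you do to improve the learning algorithm? In practice, there are many things that might improve its performance.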
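As a rough illustration of the setup described above, the following Python sketch computes a regularized linear regression cost J(θ) on a training set and then the plain squared error on houses that were not used for training. The function names, the toy data, and the θ values are assumptions made for this example and do not come from the lecture. The 1/(2m) scaling and the choice to leave θ0 unpenalized are likewise assumed conventions here.

```python
def predict(theta, x):
    """Hypothesis h(x) = theta[0] + theta[1]*x[0] + theta[2]*x[1] + ... for one example."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def regularized_cost(theta, X, y, lam):
    """J(theta) = (1 / 2m) * (sum of squared errors + lam * sum of theta_j^2 for j >= 1)."""
    m = len(y)
    squared_errors = sum((predict(theta, x) - yi) ** 2 for x, yi in zip(X, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is not penalized (assumed convention)
    return (squared_errors + penalty) / (2 * m)


def prediction_error(theta, X, y):
    """Unregularized (1 / 2m) squared error, used to judge predictions on unseen houses."""
    return regularized_cost(theta, X, y, lam=0.0)


# Hypothetical toy data: one feature (house size) and a price, in arbitrary units.
X_train = [[1.0], [2.0], [3.0]]
y_train = [100.0, 200.0, 300.0]
X_new = [[4.0]]
y_new = [400.0]

# Hypothetical parameters, standing in for the output of a training run.
theta = [0.0, 90.0]

print("training cost J:", regularized_cost(theta, X_train, y_train, lam=1.0))
print("error on new houses:", prediction_error(theta, X_new, y_new))
```

Comparing the error on new houses with the error on the training set is the kind of check that reveals the "huge errors" mentioned above; the sketch does not say which fix to try, which is the question the rest of this section raises.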

One thing you could try is to get more training examples. Concretely, you can imagine setting up phone surveys or going door to door to try to get more data on how much different houses sell for. The sad thing is that I've seen a lot of people spend a lot of time collecting more training examples, thinking, "Oh, if we have twice as much or ten times as much training data, that is certainly going to help, right?" But sometimes getting more training data doesn't actually help. In the next few videos we will see why, and we will see how you can avoid spending a lot of time collecting more training data in settings where it is just not going to help.

Another thing you might try is a smaller set of features. If you have some set of features such as x1, x2, x3 and so on, maybe a large number of features, you may want to spend time carefully selecting some small subset of them to prevent overfitting.

Or maybe you need to get additional features. Maybe the current set of features isn't informative enough and you want to collect more data in the sense of getting more features. Once again, this is the sort of project that can scale up into a huge one: you can imagine running phone surveys to find out more about the houses, or extra land surveys to find out more about the pieces of land, and so on. And once again, it would be nice to know in advance whether this is going to help before we spend a lot of time doing something like this.

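The first option is to collect more training examples. You might run phone surveys or go door to door to gather more data on house sale prices. Many practitioners spend a great deal of time collecting more training data, assuming that two or ten times as much data will certainly help. Sometimes, however, more training data does not actually help. The next few lectures look at why that happens and at how to avoid spending a lot of time collecting training data in settings where it will not help.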

The second option is to use a smaller set of features. You spend time carefully selecting a small subset of the features to prevent overfitting.

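The third option is to add new features. If the current features are not informative enough, you collect more of them. This can grow into a huge project, with phone surveys to learn more about the houses or land surveys to learn more about the lots.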

It would be a great help to know in advance whether any of these tasks is worth doing before you spend a lot of time on it.

We can also try adding polynomial features, things like x1 squared, x2 squared, and product features like x1x2. We can still spend quite a lot of time thinking about that. And we can also try other things, like decreasing lambda, the regularization parameter, or increasing lambda.

Given a menu of options like these, some of which can easily scale up to six-month or longer projects, the most common method that people use to pick one is, unfortunately, to go by gut feeling. What many people will do is sort of randomly pick one of these options and say, "Oh, let's go and get more training data," and easily spend six months collecting more training data. Or maybe someone else would say, "Well, let's go collect a lot more features on these houses in our data set." Sadly, I have many times seen people spend literally six months pursuing one of these avenues that they picked more or less at random, only to discover six months later that it really wasn't a promising avenue to pursue.

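The fourth option is to add polynomial features. You can spend a lot of time building higher-order terms such as x1^2, x2^2, and x1x2 from the existing features. The fifth option is to increase or decrease the regularization parameter λ (lambda).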
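To make the fourth option concrete, here is a minimal Python sketch that expands two existing features into the polynomial terms named above. The function name and the sample values are hypothetical and chosen only for illustration; the lecture does not prescribe a particular implementation.

```python
def add_polynomial_features(x):
    """Map a two-feature example [x1, x2] to [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return [x1, x2, x1 ** 2, x2 ** 2, x1 * x2]


# Hypothetical examples with two features each.
X = [[2.0, 3.0], [1.0, 5.0]]
X_poly = [add_polynomial_features(x) for x in X]
print(X_poly)
```

The expanded rows can be fed to the same linear regression as before. Adding such terms makes the model more flexible, which is why this option is usually weighed together with the choice of λ.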

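Some of these five options can easily turn into projects of six months or longer. Unfortunately, most people pick one by gut feeling. They choose more or less at random, say "Let's get more training data," and spend six months on it. Or they say "Let's collect more features for these houses" and spend literally six months doing that. Only six months later do they discover that it was not a promising avenue.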

Fortunately, there is a pretty simple technique that can let you very quickly rule out half of the things on this list as not being promising things to pursue, and potentially save you a lot of time pursuing something that's just not going to work. In the next two videos, I'm going to first talk about how to evaluate learning algorithms. In the few videos after that, I'm going to talk about these techniques, which are called machine learning diagnostics. A diagnostic is a test you can run to get insight into what is or isn't working with an algorithm, and it will often give you insight as to what the promising things to try are in order to improve a learning algorithm's performance. We'll talk about specific diagnostics later in this video sequence. But I should mention in advance that diagnostics can sometimes take quite a lot of time to implement and understand. Doing so can still be a very good use of your time when you are developing learning algorithms, because diagnostics can often save you from spending many months pursuing an avenue that you could have found out much earlier was not going to be fruitful.

So in the next few videos, I'm going to first talk about how to evaluate your learning algorithms, and after that I'm going to talk about some of these diagnostics, which will hopefully let you much more effectively select the useful things to try next if your goal is to improve the machine learning system.

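Fortunately, there is a simple technique that can rule out about half of these options. It is effective in practice and can save a lot of time. The next lectures cover how to evaluate a learning algorithm and then how to run machine learning diagnostics. A diagnostic is a test you run to gain insight into what is and is not working in an algorithm, and it often points to what is worth trying in order to improve performance. Diagnostics take time to implement, but they can shorten the overall development time. They can keep you from spending months on an avenue that was never going to be fruitful.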

So, the next lectures first explain how to evaluate a learning algorithm, and then cover diagnostics that help you choose effectively how to improve a learning system.