by 라인하트 Dec 19. 2020

Andrew Ng's Machine Learning (18-1): Photo OCR Overview

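Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The introductory machine learning course he taught at Stanford University is available for free on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that people studying AI and machine learning on their own naturally come across.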

Application Example

Photo OCR

Problem Description and Pipeline

In this and the next few videos, I want to tell you about a machine learning application example, or a machine learning application history, centered around an application called Photo OCR. There are three reasons why I want to do this. First, I wanted to show you an example of how a complex machine learning system can be put together. Second, I want to introduce the concept of a machine learning pipeline and how to allocate resources when you're trying to decide what to do next. This can be in the context of you working by yourself on a big application, or in the context of a team of developers trying to build a complex application together. And finally, the Photo OCR problem also gives me an excuse to tell you about just a couple more interesting ideas for machine learning. One is some ideas of how to apply machine learning to computer vision problems, and the second is the idea of artificial data synthesis, which we'll see in a couple of videos.

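This topic is a machine learning application. It explains how to approach a machine learning application, using one called Photo OCR (recognizing text in images) as the running example. There are three reasons for choosing Photo OCR. First, it shows how a complex machine learning system can be put together. Second, it introduces the concept of a machine learning pipeline and how to allocate resources when deciding what to work on next, whether you are building a complex application alone or as part of a team. Third, the Photo OCR problem is an opportunity to introduce a couple more interesting ideas: how to apply machine learning to computer vision problems, and artificial data synthesis.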

So, let's start by talking about what the Photo OCR problem is. Photo OCR stands for Photo Optical Character Recognition. With the growth of digital photography, and more recently the growth of cameras in our cell phones, we now have tons of digital pictures that we take all over the place. And one of the things that has interested many developers is how to get our computers to understand the content of these pictures a little bit better. The Photo OCR problem focuses on how to get computers to read the text that appears in the images we take. Given an image like this, it might be nice if a computer can read the text in the image, so that if you're trying to look for this picture again you type in the words "LULA B's" and have it automatically pull up this picture, so that you're not spending lots of time digging through a photo collection of maybe hundreds of thousands of pictures. The Photo OCR problem does exactly this, and it does so in several steps.

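Now for what the Photo OCR problem is. Photo OCR is short for Photo Optical Character Recognition, a technique for recognizing text in images. With the growth of digital photography and the spread of smartphones, enormous numbers of photos have piled up, and many developers have worked on how to get computers to understand the content of those photos a little better. The Photo OCR problem focuses on getting a computer to read the text that appears in an image or photo. If a computer can read the text in an image, it can tell photos apart: when you later search for a photo by typing "LULA B's", the computer can find it and show it to you. In other words, you can find a photo easily instead of spending a lot of time digging through an album that may hold hundreds of thousands of pictures. A Photo OCR application recognizes the text in an image in several steps.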

First, given the picture it has to look through the image and detect where there is text in the picture.

First, Photo OCR looks through the image and detects where the text is located.

And after it has done that, or if it successfully does that, it then has to look at these text regions and actually read the text in those regions. And hopefully, if it reads it correctly, it'll come up with these transcriptions of the text that appears in the image. Whereas OCR, or optical character recognition, of scanned documents is a relatively easier problem, doing OCR from photographs today is still a very difficult machine learning problem. If you can do this, not only can it help our computers understand the content of our images better, there are also applications like helping blind people. For example, you could provide a blind person with a camera that can look at what's in front of them and just tell them the words that may be on the street sign in front of them. Another is car navigation systems. For example, imagine if your car could read the street signs and help you navigate to your destination.

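Once the Photo OCR application has successfully located the text regions, it has to look at those regions and actually read the text in them. If it reads the characters correctly, it produces a transcription of the text that appears in the image. OCR on scanned documents is a relatively easy problem. Photo OCR, which reads text from photographs, is still a difficult machine learning problem. Solving it helps computers understand the content of images better and opens up a range of applications. A camera given to a blind person could look at what is in front of them and, for example, read out the words on a street sign. In a car navigation system, OCR could read street signs to help guide you to your destination.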

In order to perform Photo OCR, here's what we can do. First, we can go through the image and find the regions where there's text in the image. So, shown here is one example of text in the image that the Photo OCR system may find. Second, given the rectangle around that text region, we can then do character segmentation, where we might take this text box that says "Antique Mall" and try to segment it out into the locations of the individual characters. And finally, having segmented it out into individual characters, we can then run a classifier, which looks at the images of the individual characters and tries to figure out that the first character's an A, the second character's an N, the third character is a T, and so on. By doing all this, hopefully you can then figure out that this phrase is "LULA B's Antique Mall", and similarly for some of the other words that appear in that image.

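Photo OCR works in the following order. First, it goes through the image and finds the regions that contain text. The figure shows an example of a text region that the Photo OCR system finds. Second, it splits the text inside the red box around that region into individual characters. For example, the "Antique Mall" text box is segmented into its separate characters. Third, it looks at the image of each character and classifies it as a letter: the first character is A, the second is n, the third is t, and so on. From this it works out that the string is "Antique Mall".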
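To make the three steps concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration rather than the method used in the lecture: the "image" is a grid of '#' and '.' characters, text detection is a bounding box around the ink, segmentation splits on blank columns, and classification is nearest-template matching against a hypothetical three-letter template set. A real system would use learned classifiers at each stage.

```python
from typing import Dict, List, Tuple

Image = List[str]  # each row is a string of '#' (ink) and '.' (background)

# Hypothetical 3x3 glyph templates; a real system would learn a classifier.
TEMPLATES: Dict[str, Image] = {
    "L": ["#..", "#..", "###"],
    "T": ["###", ".#.", ".#."],
    "O": ["###", "#.#", "###"],
}


def detect_text(image: Image) -> Tuple[int, int, int, int]:
    """Stage 1: bounding box (top, bottom, left, right) around all ink."""
    rows = [r for r, row in enumerate(image) if "#" in row]
    cols = [c for c in range(len(image[0])) if any(row[c] == "#" for row in image)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def segment_characters(region: Image) -> List[Image]:
    """Stage 2: split the region wherever a column is entirely background."""
    glyphs: List[Image] = []
    start = None
    width = len(region[0])
    for c in range(width + 1):
        blank = c == width or all(row[c] == "." for row in region)
        if not blank and start is None:
            start = c  # a new character begins here
        elif blank and start is not None:
            glyphs.append([row[start:c] for row in region])
            start = None
    return glyphs


def classify_character(glyph: Image) -> str:
    """Stage 3: choose the template with the fewest mismatched pixels."""
    def distance(a: Image, b: Image) -> float:
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")  # different size: never the best match
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))


def photo_ocr(image: Image) -> str:
    """Run the three stages in order and return the transcription."""
    top, bottom, left, right = detect_text(image)
    region = [row[left:right] for row in image[top:bottom]]
    return "".join(classify_character(g) for g in segment_characters(region))


IMAGE: Image = [
    ".............",
    ".#...###.###.",
    ".#...#.#..#..",
    ".###.###..#..",
    ".............",
]

print(photo_ocr(IMAGE))  # prints "LOT"
```

Each function only sees the output of the one before it, which is what lets the stages be built and improved separately.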

I should say that there are some photo OCR systems that do even more complex things, like a bit of spelling correction at the end. So if, for example, your character segmentation and character classification system tells you that it sees the word c 1 e a n i n g. Then, you know, a sort of spelling correction system might tell you that this is probably the word 'cleaning', and your character classification algorithm had just mistaken the l for a 1. But for the purpose of what we want to do in this video, let's ignore this last step and just focus on the system that does these three steps of text detection, character segmentation, and character classification.

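Some OCR systems do something more complex at the end of the process: spelling correction. For example, if the character segmentation and character classification systems read the word "c1eaning", a character has been misrecognized and the spelling needs to be corrected. The word is probably "cleaning", with the letter 'l' mistaken for the digit '1'. This lecture sets aside that final spelling correction step and focuses on the three steps of text detection, character segmentation, and character classification.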
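One simple way such a correction step could work is sketched below. It assumes a known vocabulary and a small table of characters the classifier tends to confuse. Both the `CONFUSABLE` table and the `VOCABULARY` set are hypothetical, and the lecture does not describe how the correction is actually done.

```python
from itertools import product
from typing import Dict, Set

# Hypothetical confusion table: recognized character -> likely intended letter.
CONFUSABLE: Dict[str, str] = {"1": "l", "0": "o", "5": "s"}

# Hypothetical vocabulary of words the system expects to see.
VOCABULARY: Set[str] = {"cleaning", "antique", "mall"}


def correct_spelling(word: str, vocabulary: Set[str]) -> str:
    """Return a vocabulary word reachable by swapping confusable characters.

    Falls back to the input unchanged when no candidate is in the vocabulary.
    The search is exponential in the number of confusable characters, which
    is acceptable for short words in a sketch like this.
    """
    if word in vocabulary:
        return word
    # For each position, allow the original character and its substitute.
    options = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,) for ch in word]
    for candidate in product(*options):
        joined = "".join(candidate)
        if joined in vocabulary:
            return joined
    return word


print(correct_spelling("c1eaning", VOCABULARY))  # prints "cleaning"
```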

A system like this is what we call a machine learning pipeline. In particular, here's a picture showing the Photo OCR pipeline. We have an image, which is then fed to the text detection system to find the text regions. We then segment out the individual characters in the text, and then finally we recognize the individual characters. In many complex machine learning systems, these sorts of pipelines are common, where you can have multiple modules (in this example, the text detection, character segmentation, and character recognition modules), each of which may be a machine learning component, or sometimes may not be a machine learning component. The idea is to have a set of modules that act one after another on some piece of data in order to produce the output you want, which in the Photo OCR example is the transcription of the text that appeared in the image. If you're designing a machine learning system, one of the most important decisions will often be what exactly is the pipeline that you want to put together. In other words, given the Photo OCR problem, how do you break this problem down into a sequence of different modules and design the pipeline? And the performance of each of the modules in your pipeline will often have a big impact on the final performance of your algorithm.

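A system like Photo OCR is called a machine learning pipeline. The figure shows the Photo OCR pipeline, which is divided into image input, text detection, character segmentation, and character recognition modules. Many complex machine learning systems are built as pipelines of this kind. Each module may or may not be a machine learning component. Together they form a set of modules that process a piece of data one after another to produce the desired output, which for a Photo OCR application is the transcription of the text that appears in the image. When designing a machine learning system, one of the most important decisions is exactly what pipeline to put together. In other words, when designing a Photo OCR application, you decide how to break the problem down into a sequence of modules. The performance of each module in the pipeline has a big impact on the final performance of the algorithm.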
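The idea of modules acting one after another on a piece of data can be sketched generically. In the sketch below, `run_pipeline` and the three string-based stand-in modules are hypothetical. They are not real OCR components and only show how one module's output becomes the next module's input, and how keeping each intermediate output makes it possible to inspect modules individually.

```python
from typing import Any, Callable, List, Tuple

Module = Tuple[str, Callable[[Any], Any]]  # (module name, function)


def run_pipeline(modules: List[Module], data: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Apply each module in order; also return every intermediate output."""
    trace: List[Tuple[str, Any]] = []
    for name, module in modules:
        data = module(data)  # the output of one module feeds the next
        trace.append((name, data))
    return data, trace


# Toy stand-ins: the "photo" is a string whose '~' border is non-text.
modules: List[Module] = [
    ("text detection", lambda photo: photo.strip("~")),
    ("character segmentation", list),
    ("character recognition", lambda chars: "".join(c.upper() for c in chars)),
]

output, trace = run_pipeline(modules, "~~antique mall~~")
print(output)  # prints "ANTIQUE MALL"
```

Because the trace records what each module produced, a weak final result can be traced back to the module whose output first went wrong.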

If you have a team of engineers working on a problem like this, it is also very common to have different individuals work on different modules. So I could easily imagine text detection being the responsibility of anywhere from 1 to 5 engineers, character segmentation maybe another 1 to 5 engineers, and character recognition another 1 to 5 engineers. So having a pipeline like this often offers a natural way to divide up the workload among different members of an engineering team as well. Although, of course, all of this work could also be done by just one person if that's how you want to do it. In complex machine learning systems, the idea of a machine learning pipeline is pretty pervasive. And what you just saw is a specific example of how a Photo OCR pipeline might work.

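If you work with an engineering team, different people will typically be assigned to different modules. The text detection module might have 1 to 5 engineers, the character segmentation module another 1 to 5, and the character recognition module another 1 to 5. A pipeline offers a natural way to divide the work among the members of an engineering team. Alternatively, one person could do all of the work. The pipeline concept is very widely used in complex machine learning systems. So far, we have seen how a Photo OCR pipeline works.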

In the next few videos I'll tell you a little bit more about this pipeline, and we'll continue to use this as an example to illustrate a few more key concepts of machine learning.

The next lectures explain this pipeline in a little more detail and keep using the Photo OCR pipeline to illustrate a few more key concepts of machine learning.

Andrew Ng's Machine Learning video lecture

Summary

Photo OCR is short for Photo Optical Character Recognition, a technique for recognizing text in images. OCR on scanned documents is a relatively easy problem. Photo OCR, which reads text from photographs, is still a difficult machine learning problem.

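Complex machine learning systems and applications such as Photo OCR use the concept of a machine learning pipeline. A pipeline lays out the components needed to produce the desired result and diagrams how they connect. For example, the Photo OCR pipeline is divided into image input, text detection, character segmentation, and character recognition modules. Each module has its own input and output.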

Designing a complex machine learning system starts with drawing the pipeline concretely and in detail.

When designing a complex machine learning system, one of the most important decisions is exactly what pipeline to put together.