by 라인하트 Dec 19. 2020

Andrew Ng's Machine Learning (18-2): OCR Pipeline

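Professor Andrew Ng, co-founder of the online course platform Coursera, is a leading figure in the artificial intelligence field. The machine learning course he taught to beginners at Stanford University is available for free as an online course on Coursera (Coursera.org). It is an essential course for machine learning beginners, and one that people studying AI and machine learning on their own naturally come across.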

Application Example

Photo OCR

Sliding Windows

In the previous video, we talked about the photo OCR pipeline and how it worked: we take an image and pass it through a sequence of machine learning components in order to read the text that appears in the image. In this video I'd like to tell you a little bit more about how the individual components of the pipeline work. In particular, most of this video will center on a discussion of what's called sliding windows.

The previous lecture explained the Photo OCR pipeline and how it works. To read the text that appears in an image, the image is passed through a series of machine learning components. This lecture explains in more detail how each component of the pipeline works, focusing on sliding windows.

The first stage of the pipeline was text detection, where we look at an image like this and try to find the regions of text that appear in it. Text detection is an unusual problem in computer vision, because depending on the length of the text you're trying to find, the rectangles you're looking for can have different aspect ratios. So in order to talk about detecting things in images, let's start with a simpler example, pedestrian detection, and later go back and apply the ideas developed for pedestrian detection to text detection. In pedestrian detection you want to take an image that looks like this, and the idea is to find the individual pedestrians that appear in the image. So there's one pedestrian that we found, there's a second one, a third one, a fourth one, a fifth one, and a sixth one. This problem is maybe slightly simpler than text detection, just for the reason that the aspect ratios of most pedestrians are pretty similar, so we can use a fixed aspect ratio for the rectangles we're trying to find. By aspect ratio I mean the ratio between the height and the width of these rectangles. It is the same for different pedestrians, but for text detection the height-to-width ratio is different for different lines of text. For pedestrian detection, the pedestrians can be at different distances from the camera, so the height of these rectangles can differ depending on how far away they are, but the aspect ratio is the same.

The first stage of the pipeline is text detection, which looks at an image and finds the text regions. Text detection is an unusual case in computer vision: depending on the length of the text, the rectangular text regions can have different aspect ratios. To explain how objects are detected in images, we first look at pedestrian detection, because the ideas developed for pedestrian detection are then applied to text detection. Here is an image, and the task is to find each pedestrian that appears in it: a first pedestrian, a second, a third, a fourth, and a fifth. Because most pedestrians have very similar aspect ratios, this is slightly simpler than text detection, and a fixed aspect ratio can be used for the rectangles that enclose the objects we want to find. The aspect ratio is the ratio between the height and the width of a rectangle. Pedestrians all have a similar ratio, whereas the height-to-width ratio of text differs from line to line. In pedestrian detection the size of the rectangle can vary with the distance between the camera and the pedestrian, but the aspect ratio is the same.

In order to build a pedestrian detection system, here's how you can go about it. Let's say that we decide to standardize on this aspect ratio of 82 by 36. We could have chosen some rounded number like 80 by 40, but 82 by 36 seems alright. What we would do is then go out and collect large training sets of positive and negative examples. Here are examples of 82 X 36 image patches that do contain pedestrians, and here are examples of image patches that do not. On this slide I show 12 positive examples with y = 1 and 12 negative examples with y = 0. In a more typical pedestrian detection application, we may have anywhere from 1,000 training examples up to maybe 10,000 training examples, or even more if you can get even larger training sets. What you can then do is train a neural network or some other learning algorithm to take as input an image patch of dimension 82 by 36 and to classify y, that is, to classify that image patch as either containing a pedestrian or not. So this gives you a way of applying supervised learning in order to take an image patch and determine whether or not a pedestrian appears in that image patch.

Here is how to build a pedestrian detection system. We standardize on an aspect ratio of 82 X 36 pixels. A rounded figure such as 80 X 40 would also have worked, but 82 X 36 is fine. Next, we collect a large training set of positive and negative examples. On the left are 82 X 36 pixel images that contain a pedestrian, and on the right are 82 X 36 pixel images that do not. The 12 images on the left are positive examples with y = 1, and the 12 images on the right are negative examples with y = 0. A typical pedestrian detection system may have anywhere from 1,000 to 10,000 training examples, or even more. A neural network or another learning algorithm is trained to take an 82 X 36 pixel image patch as input and classify y according to whether a pedestrian is present. This is how supervised learning can be applied to decide whether an image patch contains a pedestrian.
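As a rough illustration of the input format described above, the sketch below flattens an 82 X 36 patch into a feature vector and pairs it with a label. The function name `patch_to_features` and the all-ones and all-zeros placeholder patches are assumptions made for illustration; real patches would be pixel data cropped from photographs.

```python
PATCH_H, PATCH_W = 82, 36  # patch size used in the lecture


def patch_to_features(patch):
    """Flatten a PATCH_H x PATCH_W grid of pixel intensities into one list."""
    if len(patch) != PATCH_H or any(len(row) != PATCH_W for row in patch):
        raise ValueError("expected an 82 x 36 patch")
    return [pixel for row in patch for pixel in row]


# Placeholder data (hypothetical): one "positive" and one "negative" patch.
positive_patch = [[1.0] * PATCH_W for _ in range(PATCH_H)]
negative_patch = [[0.0] * PATCH_W for _ in range(PATCH_H)]

X = [patch_to_features(positive_patch), patch_to_features(negative_patch)]
y = [1, 0]  # y = 1: pedestrian, y = 0: no pedestrian

print(len(X[0]))  # number of input features per example
```

Any supervised learner, such as a neural network, could then be trained on `X` and `y`.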

Now, let's say we get a new image, a test set image like this, and we want to try to find the pedestrians in it. What we would do is start by taking a rectangular patch of this image, like that shown up here, so that's maybe an 82 X 36 patch of this image, and run that image patch through our classifier to determine whether or not there is a pedestrian in it. Hopefully our classifier will return y = 0 for that patch, since there is no pedestrian.

Here is a new image. At the top left there is a green rectangular patch. The green box is an 82 X 36 pixel image patch. We run the classifier to check whether there is a pedestrian inside the box. Since there is no pedestrian in this patch, the classifier should return y = 0.

Next, we take that green rectangle and slide it over a bit, and then run that new image patch through our classifier to decide if there's a pedestrian there. Having done that, we slide the window further to the right and run that patch through the classifier again. The amount by which you shift the rectangle over each time is a parameter that's sometimes called the step size and sometimes the stride. If you step one pixel at a time, that is, use a step size or stride of 1, that usually performs best but is more computationally expensive. So using a step size of maybe 4 pixels at a time, or 8 pixels at a time, or some larger number of pixels might be more common, since you're then moving the rectangle a little bit more each time.

Next, we move the green rectangle slightly to the right and use the classifier to check whether the new image patch contains a pedestrian. We then move the green rectangle a little further to the right and run the classifier again. The distance the rectangle moves each time is a parameter called the step size or the stride. Moving one pixel at a time, a step size or stride of 1, usually gives the best performance but is computationally more expensive. A step size of 4 pixels, 8 pixels, or more at a time is therefore more common, so that the rectangle moves a little further each time.
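To make the cost trade-off concrete, the sketch below counts how many window positions fit in an image for a given stride. The 480 X 640 image size and the function name `count_windows` are assumptions chosen for illustration; only the 82 X 36 window and the strides 1, 4, and 8 come from the lecture.

```python
def count_windows(image_h, image_w, win_h=82, win_w=36, stride=1):
    """Number of window positions that fit entirely inside the image."""
    if image_h < win_h or image_w < win_w:
        return 0
    rows = (image_h - win_h) // stride + 1
    cols = (image_w - win_w) // stride + 1
    return rows * cols


# Hypothetical 480 x 640 image, one scale only.
for stride in (1, 4, 8):
    print(stride, count_windows(480, 640, stride=stride))
```

Under these assumptions a stride of 1 needs 241,395 classifier calls per scale, a stride of 4 needs 15,200, and a stride of 8 needs 3,800, which is why larger strides are often preferred in practice.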

So, using this process, you continue stepping the rectangle over to the right a bit at a time and running each of these patches through the classifier. You slide the window over the different locations in the image, starting with the first row and then moving on to further rows, until you have run all of these image patches, at some step size or stride, through your classifier.

So the green rectangle moves to the right a little at a time, and at each position the classifier decides whether a pedestrian is present. After the first row is finished, the next row begins, until every part of the image has been covered.
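The row-by-row scan described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the image is a plain list of lists, and `toy_classifier` is a hypothetical stand-in for a trained classifier.

```python
def sliding_windows(image, win_h, win_w, stride):
    """Yield (top, left, patch) for every window position, row by row."""
    height, width = len(image), len(image[0])
    for top in range(0, height - win_h + 1, stride):
        for left in range(0, width - win_w + 1, stride):
            patch = [row[left:left + win_w] for row in image[top:top + win_h]]
            yield top, left, patch


def detect(image, classifier, win_h, win_w, stride):
    """Return the (top, left) corner of every patch classified as y = 1."""
    return [(top, left)
            for top, left, patch in sliding_windows(image, win_h, win_w, stride)
            if classifier(patch) == 1]


def toy_classifier(patch):
    """Hypothetical stand-in: predicts 1 only if every pixel is 1."""
    return 1 if all(all(p == 1 for p in row) for row in patch) else 0


# Toy 6 x 8 image with a 2 x 2 "object" at rows 2-3, columns 4-5.
image = [[0] * 8 for _ in range(6)]
for r in (2, 3):
    for c in (4, 5):
        image[r][c] = 1

print(detect(image, toy_classifier, win_h=2, win_w=2, stride=2))
```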

Now, that was a pretty small rectangle, which would only detect pedestrians of one specific size. What we do next is start to look at larger image patches. So now let's take larger image patches, like those shown here, and run those through the classifier as well. By the way, when I say take a larger image patch, what I really mean is that you take that image patch and resize it down to 82 X 36, say. So you take this larger patch and resize it to be a smaller image, and it is the smaller image that you pass through your classifier to try to decide if there is a pedestrian in that patch.

The small green rectangle used so far only detects pedestrians of one specific size. As shown, we also run the classifier on larger image patches. In practice this means taking the larger image patch and shrinking it to 82 X 36. The large patch is resized to a smaller image, and the classifier is then used to decide whether it contains a pedestrian.

And finally, you can do this at even larger scales and run the sliding window to the end.

Finally, the same procedure is carried out with an even larger green rectangle.
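Resizing a larger window down to the classifier's input size might look like the sketch below. Nearest-neighbor sampling is an assumption made to keep the example dependency-free; the lecture does not say which resizing method is used, and the name `resize_nearest` is hypothetical.

```python
def resize_nearest(patch, out_h, out_w):
    """Resize a 2D list to out_h x out_w by nearest-neighbor sampling."""
    in_h, in_w = len(patch), len(patch[0])
    return [[patch[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


# A window twice the base size (164 x 72), filled with identifiable values.
large_patch = [[r * 1000 + c for c in range(72)] for r in range(164)]

small_patch = resize_nearest(large_patch, 82, 36)
print(len(small_patch), len(small_patch[0]))
```

The resized patch has the 82 X 36 shape the classifier was trained on, so the same classifier can be reused at every scale.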

After this whole process, hopefully your algorithm will detect whether pedestrians appear in the image. So that's how you train a classifier and then use a sliding windows classifier, or sliding windows detector, in order to find pedestrians in the image.

Once the whole process is complete, the algorithm detects the regions of the image where pedestrians appear. This is how a classifier is trained and then used as a sliding windows classifier, or sliding windows detector, to find pedestrians in an image.

Let's return to the text detection example and talk about that stage in our photo OCR pipeline, where our goal is to find the text regions in the image. Similar to pedestrian detection,

Let's return to the text detection example and look at the text detection stage of the photo OCR pipeline. The goal is to find the text regions, in a way similar to pedestrian detection.

you can come up with a labeled training set with positive examples and negative examples, where the positive examples correspond to regions where text appears. So instead of trying to detect pedestrians, we're now trying to detect text. Positive examples are going to be patches of images where there is text, and negative examples are going to be patches of images where there isn't text. Having trained this classifier, we can now apply it to a new image, a test set image.

We create a labeled training set containing positive and negative examples, corresponding to regions where text does or does not appear. Instead of detecting pedestrians, we now detect text. Positive examples are image patches that contain text, and negative examples are image patches that do not. Once trained, the classifier can be applied to a new image from the test set.

So here's the image that we've been using as an example. For this example we are going to run sliding windows at just one fixed scale, purely for the purpose of illustration, meaning that I'm going to use just one rectangle size. Let's say I run my little sliding windows classifier on lots of little image patches like this. If I do that, what I'll end up with is a result like this, where the white regions show where my text detection system has found text. The axes of these two figures are the same, so a region up here corresponds to a region up here. The fact that it is black up here represents that the classifier does not think it has found any text up there, whereas the fact that there's a lot of white stuff here reflects that the classifier thinks it has found a bunch of text over there in the image. What I have done in this image on the lower left is use white to show where the classifier thinks it has found text, and different shades of grey correspond to the probability that was output by the classifier. So the shades of grey correspond to where it thinks it might have found text but has lower confidence, and the bright white corresponds to where the classifier output a very high estimated probability of there being text in that location.

Here is the image we have already been using. We will use a sliding window at a single fixed scale, meaning one rectangle size. Suppose we run a small sliding windows classifier over many small image patches. The white regions in the lower-left image show where the text detection system found text. The top image and the lower-left image have the same axes, so the blue arrow and the red arrow point to the same location. Black regions are where the text detection system judged that there is no text. Where the classifier finds text, the region is shown in white. Grey regions reflect the probability, as estimated by the classifier, that text is present: grey means a lower probability than white and therefore lower confidence. This is analogous to estimating the probability that a pedestrian is present at a given location.
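One way to picture the grey-scale output is as a map with one classifier probability per window position, scaled to 0-255. The sketch below is illustrative only: `mean_brightness` is a hypothetical stand-in for the classifier's probability output, and the function names are assumptions.

```python
def probability_map(image, prob_fn, win_h, win_w, stride):
    """Return one probability per window position, as a 2D list."""
    height, width = len(image), len(image[0])
    return [[prob_fn([row[left:left + win_w]
                      for row in image[top:top + win_h]])
             for left in range(0, width - win_w + 1, stride)]
            for top in range(0, height - win_h + 1, stride)]


def to_gray(prob_map):
    """Map probabilities in [0, 1] to grey levels 0 (black) to 255 (white)."""
    return [[int(255 * p) for p in row] for row in prob_map]


def mean_brightness(patch):
    """Hypothetical stand-in for a classifier's probability output."""
    pixels = [p for row in patch for p in row]
    return sum(pixels) / len(pixels)


image = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
gray = to_gray(probability_map(image, mean_brightness, 2, 2, 1))
print(gray)  # black, mid-grey, white from left to right
```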

We aren't quite done yet, because what we actually want to do is draw rectangles around all the regions where there is text in the image. So we're going to take one more step: we take the output of the classifier and apply to it what is called an expansion operator. What that does is take the image here, take each of the white blobs, each of the white regions, and expand that white region. Mathematically, the way you implement that is this: to create the image on the right, for every pixel we ask whether it is within some distance of a white pixel in the left image.

So if a specific pixel is within, say, five pixels or ten pixels of a white pixel in the leftmost image, then we'll also color that pixel white in the rightmost image. The effect of this is that we take each of the white blobs in the leftmost image and expand them a bit, grow them a little bit, by checking which pixels are near white pixels and then coloring those nearby pixels white as well.

We are not done yet, because the goal is to draw rectangles around every region of the image that contains text. One more step is applied to the classifier's output: it is passed through an expansion operator. The expansion operator grows each white region in the image. Mathematically, the right-hand image is produced from the left-hand image by checking, for every pixel, whether it lies within some distance of a white pixel. If a pixel is within 5 or 10 pixels of a white pixel in the left image, it is colored white in the right image. As a result, each white blob in the left image is expanded into its neighboring pixels.
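A minimal sketch of the expansion step on a binary mask follows. It assumes "within some distance" means a square neighborhood (Chebyshev distance), which is one possible reading; the lecture does not specify the distance measure, and the name `expand` is hypothetical.

```python
def expand(mask, radius):
    """Color a pixel white (1) if it lies within `radius` pixels,
    horizontally and vertically, of a white pixel in `mask`."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                # Paint the square neighborhood around each white pixel.
                for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                        out[rr][cc] = 1
    return out


# Toy 5 x 5 mask with a single white pixel in the center.
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1

for row in expand(mask, radius=1):
    print(row)
```

With a radius of 1 the single white pixel grows into a 3 X 3 white block; nearby blobs that grow into each other merge into one region.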

Finally, we are just about done. We can now look at this rightmost image, look at the connected components, the white regions, and draw bounding boxes around them. In particular, we look at all the white regions, like this one, this one, this one, and so on, and use a simple heuristic to rule out rectangles whose aspect ratios look funny, because we know that boxes around text should be much wider than they are tall. So we ignore the thin, tall blobs like this one and this one, discarding them because they are too tall and thin, and then draw rectangles around the ones whose aspect ratio, that is, the height-to-width ratio, looks right for text regions. We can then draw bounding boxes around this text region, this text region, and that text region, corresponding to the Lula B's Antique Mall logo, the Lula B's, and this little open sign over there. This example, by the way, actually misses one piece of text. This is very hard to read, but there is actually one piece of text there. That says [xx] are corresponding to this but the aspect ratio looks wrong so we discarded that one. So it's OK on this image, but in this particular example the classifier actually missed one piece of text. It's very hard to read because it is a piece of text written against a transparent window. So that's text detection using sliding windows. Having found these rectangles with text in them, we can now just cut out these image regions and use the later stages of the pipeline to try to read the text.

Finally, in the lower-right image we look at the connected white regions and draw bounding boxes around them. We examine all the white regions and rule out rectangles whose aspect ratio looks odd, since a box around text should be much wider than it is tall. The two tall, thin white regions at the bottom of the right image are therefore discarded, and rectangles are drawn around the regions whose aspect ratio looks right for text. This gives bounding boxes around the text regions: the large Lula B's Antique Mall region and the smaller region with the Lula B's logo. This example actually misses one piece of text. It is very hard to read, but there is one more piece of text, and it was discarded because the aspect ratio of its region looked wrong. The classifier missed a piece of text written on a transparent window. That is text detection using sliding windows. Having found the rectangles that contain text, we cut out those image regions and pass them to the next stage of the pipeline.
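The bounding-box and aspect-ratio steps can be sketched as below. This is an illustration under stated assumptions: regions are 4-connected, and the threshold `min_width_to_height=1.5` and the function names are hypothetical choices, not values from the lecture.

```python
from collections import deque


def bounding_boxes(mask):
    """Return (top, left, bottom, right) for each 4-connected white region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one connected component.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def keep_text_like(boxes, min_width_to_height=1.5):
    """Keep boxes that are clearly wider than tall (assumed threshold)."""
    kept = []
    for top, left, bottom, right in boxes:
        box_h = bottom - top + 1
        box_w = right - left + 1
        if box_w / box_h >= min_width_to_height:
            kept.append((top, left, bottom, right))
    return kept


# Toy mask: one wide blob (text-like) and one tall, thin blob.
mask = [[1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 0, 1],
        [0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 1]]

boxes = bounding_boxes(mask)
print(boxes)
print(keep_text_like(boxes))
```

The wide blob survives the filter and the tall, thin one is discarded, mirroring the heuristic in the lecture.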

Now, you recall that the second stage of the pipeline was character segmentation. So given an image like that shown on top, how do we segment out the individual characters in this image? What we can do is again use a supervised learning algorithm with some set of positive and some set of negative examples. What we're going to do is look at an image patch and try to decide if there is a split between two characters right in the middle of that image patch. Consider the positive examples first. In this first example, the middle of the image patch is indeed a split between two characters, and the second example again looks like a positive example, because if I split two characters by putting a line right down the middle, that's the right thing to do. So these are positive examples, where the middle of the image represents a gap or a split between two distinct characters. The negative examples are ones where you don't want to split right in the middle, because the middle does not represent the midpoint between two characters.

The second stage of the pipeline is character segmentation. Given the text box at the top, how do we split it into individual characters? Once again we use a supervised learning algorithm with a set of positive and negative examples. We look at an image patch and decide whether there is a split between two characters in the middle of it. Consider the positive examples first. The first image patch should be split into two characters. The second image patch is also a positive example, because splitting down the middle separates two characters. In a positive example the middle of the patch is a gap or split between two characters. In a negative example the middle of the patch should not be split, because it is not the midpoint between two characters.

So what we will do is train a classifier, maybe using a neural network, maybe using a different learning algorithm, to try to distinguish between the positive and negative examples. Having trained such a classifier, we can then run it on this sort of text that our text detection system has pulled out. We start by looking at that rectangle and ask, "Gee, does the middle of that green rectangle look like the midpoint between two characters?" Hopefully the classifier will say no. Then we slide the window over. This is a one-dimensional sliding window classifier, because we're going to slide the window only in one straight line from left to right. There are no different rows here. There's only one row.

Next, we train a classifier, using a neural network or another learning algorithm, to distinguish positive from negative examples. Once the classifier is trained, we run it on the text extracted by the text detection system. We look at the rectangle and ask, "Does the middle of the green rectangle look like the midpoint between two characters?" The classifier should say no. This is a one-dimensional sliding window classifier, because the green window slides only along a single straight line from left to right. There are no other rows here.

But now, with the classifier in this position, we ask: should we split those two characters, that is, should we put a split right down the middle of this rectangle? Hopefully the classifier will output y = 1, in which case we will decide to draw a line down there to split the two characters.

When the classifier window sits between the letters A and N, we ask whether the two characters should be split, that is, whether a split should go down the middle of the rectangle. Hopefully the classifier outputs y = 1, and a blue dividing line is drawn to separate the two letters.

Then we slide the window over again, and hopefully the classifier says no, don't split there. We slide over again, and the classifier says yes, do split there, and so on. We slowly slide the classifier over to the right, and hopefully it will classify this as another positive example, and so on.

The classifier window slides to the right again. The process repeats without stopping, placing a split wherever the window falls between two characters.

We slide this window over to the right, running the classifier at every step, and hopefully it will tell us the right locations at which to split this image up into individual characters. So that's 1D sliding windows for character segmentation.

We keep sliding the classifier window to the right, running the classifier at every step. This identifies the correct locations at which to split the characters, and it is how the image is divided into individual characters. That is one-dimensional sliding windows for character segmentation.

So, here's the overall photo OCR pipeline again. In this video we've talked about the text detection step, where we use sliding windows to detect text. We also use one-dimensional sliding windows to do character segmentation, to segment this text image into individual characters. The final step of the pipeline is the character classification step, and that step you might already be much more familiar with from the earlier videos on supervised learning. You can apply a standard supervised learning algorithm, maybe a neural network or maybe something else, to take as input an image like that and classify which of the 26 characters A to Z it is, or maybe 36 characters if you have the numerical digits as well. It is a multi-class classification problem where you take as input an image containing a character and decide which character appears in that image.

Here is the overall Photo OCR pipeline. This lecture described the text detection step, which uses sliding windows to detect text. For character segmentation, a one-dimensional sliding window is used to split the text image into individual characters. The last step of the pipeline is character classification. Here you can apply supervised learning with a neural network or another algorithm you are already familiar with. The image is classified into one of the 26 characters A to Z, or one of 36 characters if the digits 0 to 9 are included. It is a multi-class classification problem that decides which character appears in the image.
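The 36-class output can be illustrated by mapping a score vector to a character. The ordering of the classes (A to Z followed by 0 to 9) and the name `predict_char` are assumptions made for this sketch; the scores would come from whatever multi-class classifier is trained.

```python
import string

# 26 letters + 10 digits = 36 classes (ordering is an assumption).
CLASSES = string.ascii_uppercase + string.digits


def predict_char(scores):
    """Return the character whose class score is highest."""
    if len(scores) != len(CLASSES):
        raise ValueError("expected one score per class")
    best = max(range(len(scores)), key=scores.__getitem__)
    return CLASSES[best]


# Hypothetical score vector in which class index 26 (the digit 0) wins.
scores = [0.0] * len(CLASSES)
scores[26] = 0.9

print(len(CLASSES), predict_char(scores))
```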

So that was the photo OCR pipeline and how you can use ideas like sliding windows classifiers to put these different components together and develop a photo OCR system. In the next few videos we will keep using the problem of photo OCR to explore some interesting issues surrounding building an application like this.

So far we have covered how ideas such as the pipeline and sliding windows classifiers are used to put together the components of a Photo OCR system. The next lectures use the Photo OCR problem to study interesting issues that arise when building a complex application like this.

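Andrew Ng's machine learning video lecture

Summary

Photo OCR stands for Photo Optical Character Recognition, a technology for recognizing text in images. Complex machine learning systems and applications such as Photo OCR use the concept of a machine learning pipeline. A pipeline lists the individual components needed to reach the desired result and lays out how they relate to one another. For example, Photo OCR is divided into image input, text detection, character segmentation, and character recognition modules.

The first module of the Photo OCR pipeline is text detection. To understand the text detection system, the lecture first explains a system that detects pedestrians in an image. Depending on the distance between the camera and the pedestrian, only the size of the pedestrian in the image changes, while the aspect ratio stays the same. The smallest image patch is 82 X 36 pixels. The image patch is used to scan the entire image. The distance moved each time is called the step size or the stride, and a step size of 4 pixels, 8 pixels, or more at a time can be used. This is how sliding windows work. The text detection system scans the image with image patches, painting text regions white and background regions black. Grey is also used to represent the probability that an image patch is text. An expansion operator then colors white any pixel within 5 or 10 pixels of a white text region, and bounding rectangles are drawn around the resulting text regions.

The second stage of the pipeline is character segmentation. A one-dimensional sliding window is used to split the text image into characters.

The last stage of the pipeline is character classification. The image is classified into one of the 26 characters A to Z, or one of 36 characters if the digits 0 to 9 are included. It is a multi-class classification problem that decides which character appears in the image.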