
by Kyle Do (카일 도), Apr 16, 2018

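The Keytalk Era and Voice-UX

[ Declaring the Keytalk Era, and Voice-UX ]

I am Korean, but today I would like to cautiously put forward something that I hope will become an important concept in the world's digital history.

It is the concept of Keytalk. As far as I can tell, Google and Wikipedia currently return only a Japanese rock band for "Keytalk", so I intend to be the one who first proposes this concept.

I have never opened any of my books or articles this grandly, but I judged that, as we enter the Voice-UX (or GUI) era, this concept is a way of solving problems as important as the arrival of keyword search in the early days of the internet.

Accordingly, I am presenting the Keytalk concept for the first time and, together with my colleagues at Mycelebs ((주)마이셀럽스), the company I founded, releasing '말해' (Malhae, "say it"), a Keytalk search service that reflects it.

(Malhae, a lifestyle portal where you find things by speaking, is already in the app stores, but I strongly recommend applying the update patch, due around Wednesday for iPhone and around Monday afternoon for Android phones.)

Various patents related to this concept have been filed in Korea, the US, Japan, and China. I plan to release the full paper (in Korean and English) piece by piece as a series through my Facebook and other columns, and I do not yet plan to present it to academia as a journal paper.

I am a businessperson, so rather than publishing a paper I decided to put the service on the market first and let consumers be the judge.

What follows is rather long: out of some 70 pages introducing the various mechanisms behind Keytalk, I am sharing roughly the first five.

Because I wrote it in a somewhat simplified paper style, it is not 100% the style I like.

=== Below ===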

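1.
Can humankind finally hold a natural conversation with computers? For better or worse, a computer capable of 'conversation' in the ordinary sense does not seem to have appeared yet.

Of course, thanks to great advances in ASR (Automated Speech Recognition) and in Natural Language Understanding (NLU), the technology that lets machines understand the language people use, there are many smart speakers on the market clever enough to make you think, "Huh, it understands even when I put it like this?" Even so, they remain far from natural conversation.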
.
.

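2. Limits of current voice recognition products: the legacy of the keyword era

To use the smart speakers on the market today, you must first summon the service with a call such as "Siri~", "O.K. Google~", "Alexa~", "Clova~", or "Aria~". There is no visible screen, but otherwise nothing has changed from a terminal or an MS-DOS screen with a cursor blinking on a black display. [Figure 1 - MS-DOS prompt screen]

After the call, you must enter a command the computer can understand, following a fixed grammar. A product like Amazon's Echo Show goes as far as always showing example phrases on its built-in display so that users learn the patterns.

Most products introduced in Korea also provide help along the lines of 'Try saying this'. [Figure 2 - Amazon's Alexa Show]

Seen as a conversation between two people, this is quite unnatural.

You have to call the other party every time you say something. Because you cannot tell whether they will understand, you have to check your words and sentences as you speak. When they fail to understand, you cannot add just the missing part but must start over from the beginning. And when you do start over, they fail on some other word, again and again.

If you then lose your temper and swear, the machine cuts the conversation off at once and even warns you to "watch your mouth".

This is also why many people still use voice-based products only for listening to music or for home automation, or feel no particular need for them.

At a recent meeting with Korea's largest telecom company, the person in charge even said, "We are targeting lazy users."

Voice-based platforms and services are the field drawing the most attention and, technically, the most advanced one. So why do these inconveniences arise?

Among the various reasons, when it comes to linking with information and making recommendations, it is because these products are still trapped in the 'keyword'-centered way of thinking and interface, a relic of the text era.

People express their thoughts in speech in as many ways as their thoughts and tastes vary. And conversations between people also carry context that is never put into words.

But computers, which understand the world only in 0s and 1s, do not yet have the ability to understand people's natural expressions of taste and interest and find answers for them.

With chatbots recently becoming a hot topic, technical concepts such as 'Dialogue Management' make it possible to teach machines the context of a conversation to some extent through training.
This is one of the topics being actively researched in artificial intelligence, specifically in the field called 'deep learning', but moving beyond the current level to understand so-called natural, everyday language well will take considerable time, and a snapshot-style method will have to emerge.

Even Google's AlphaGo and IBM's Watson, said to be the most advanced AIs, may have been able to compete with humans in games with fixed rules, like a quiz asked and answered in keywords, but conversing with a user by voice is an entirely different matter.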
.
.

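3. Changes in recommendation methods: changes in directory- and keyword-based classification systems

Even so, simply waiting until AI capable of human-level conversation appears is surely not the answer.
Solutions do not exist only in the realm of technology and algorithms. Rather, this problem can be solved through a new recognition system, or classification system. In other words, the answer can emerge if we trace how recommendation methods have changed over time.

Today we can access countless kinds of product information through equally countless channels, but before the internet the shop display shelf was the commonly used basis for recommendation. As industry developed, however, a wide variety of products began to appear, and it became impossible to display them all in a limited space.

What emerged in response was the 'catalog'. The post office mail-order catalogs that appeared every holiday season until not long ago are an easy example. Even after the internet arrived, the first way of exploring information online was not much different from a catalog.

When the search service Yahoo first appeared, its initial interface did not start with typing in search terms; it first offered directory-style search organized by category. At the time, its job postings even said that library and information science majors would be given preference.

This kind of information search and product recommendation was possible thanks to traditional business-centered classification systems. A shopping mall, for example, divides the home appliance category into subcategories such as TVs, refrigerators, and washing machines. [Figure 3 - Early Yahoo directory search screen]

Keyword search, the most common interface today, is not very different. Services such as Google, which followed soon after, and Naver, which now accounts for nearly all of Korea's internet traffic, offered 'keyword search' based either on the relationships between documents or on editing that reflects an administrator's 'intent', and this has remained the mainstream to this day.

In the end, however, the recognition system itself has not gone beyond the limits of the directory approach built on industrial classification systems. We are simply too used to it to feel the inconvenience; apply it to the newly emerging conversation-based products and it quickly feels unnatural.

The directory approach may have been an alternative that improved user convenience in the internet's infancy, but it inevitably shows severe limits when it comes to reflecting users' diverse tastes.

If we think about how we actually look for information, it is closer to a 'problem-solving' process centered on circumstances and tastes grounded in T.P.O (Time, Place, Occasion): 'music that is good to listen to while getting ready for work', 'a drink that lifts my spirits when I'm down', 'a good restaurant for the first formal meeting between two families before a wedding'.

A traditional recognition system such as 'jazz, classical, metal, hip hop' alone is limited in finding and recommending the music the user really wants.

In this way, a keyword-centered interface built on existing classification systems cannot properly capture the countless patterns of human utterance.

Having to know the keyword in order to reach the information I want means we are still trapped in 'answer-finding' thinking.

With thinking that hunts for the right answer, it is also impossible to capture the context and intent that are naturally recognized and shared in conversation between people.

That is why existing voice-based products take a 'scenario'-centered approach when designing the user interface. Designers anticipate as many of the conversation patterns users are likely to say as they can, build stories from them, and on that basis prepare in advance the commands that carry the user's intent and the answers to them.

In fact, voice product design teams at many companies hire scenario writers in large numbers and design their products by producing and entering expected scenarios in bulk.

Adding users' expected utterance patterns one by one as exception cases, and faulting a user's speech habits whenever they stray from the fixed patterns, with something like "I didn't understand. Please say it again in the prescribed way so that I can understand", is not only far from natural conversation but also no different from display-centered interface design, which proceeds on a fixed screen with visual guidance.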
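To make the scenario-based approach described above concrete, here is a minimal sketch of how such a design behaves. It is an illustration only, not any vendor's actual implementation: the `SCENARIOS` table, its patterns, and the `respond` function are all hypothetical names invented for this example.

```python
import re

# Hypothetical scenario table: each pattern is an utterance that a
# scenario writer anticipated in advance, paired with a scripted answer.
SCENARIOS = [
    (re.compile(r"^play (?P<genre>jazz|classical|metal|hip hop) music$"),
     "Playing {genre} music."),
    (re.compile(r"^what is the weather in (?P<city>\w+)$"),
     "Here is the weather for {city}."),
]

FALLBACK = "Sorry, I didn't understand. Please say it again in the expected form."


def respond(utterance: str) -> str:
    """Return the scripted answer for the first matching pattern, else the fallback."""
    text = utterance.strip().lower()
    for pattern, template in SCENARIOS:
        match = pattern.match(text)
        if match:
            return template.format(**match.groupdict())
    # Anything the writers did not anticipate ends up here.
    return FALLBACK
```

An utterance that fits a prepared pattern, such as `respond("Play jazz music")`, gets its scripted answer, while an everyday remark such as `respond("I'm bored, got anything fun?")` falls through to the fallback. Covering more ways of speaking means adding more patterns by hand, which is the limitation this section points at.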

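Must we keep using a stiff tone, and only polite, perfect sentences completely unlike our everyday speech, in order to enter commands that fit the scenarios and patterns set by the smart speaker?
It is an experience that is uncomfortable even to think about, like speaking to a foreigner in a language we are not used to.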
.
.

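4. A new classification system: Dynamic Ontology and Keytalk

People's tastes, and the ways they express them, are enormously varied, and it is right to say that no correct answer or ranking exists for them. Yet we are always talking about whether a recommendation was right or wrong.

Search once dealt with fields that had exact answers, such as ages and years, but already in the keyword-search era, as ideas like restaurant recommendation appeared, the notions of ranking and a single right answer began to be considerably diluted.

Along the way, location-based services appeared and made recommendations possible to some degree by location, in categories such as travel and lodging, but the categories in which you can find what you want through keywords are still far too limited.

That is because, for lodging alone, people have started looking for motels 'with a nice view', 'good for taking photos', 'clean', and even 'with strong water pressure'.

Of these, the motel with strong water pressure can be handled by inference, because water pressure data exists by neighborhood (dong), but the other requirements cannot be solved with today's methods.

In broadcasting, too, fewer than 10% of a year's video clips are ever selected by consumers.
The system only lets a clip be selected when someone searches a keyword such as 'Infinite Challenge' (무한도전) or 'Song Joong-ki' (송중기), so most video clips are structurally unable to be selected even after hundreds of years.

With the arrival of conversation-based recommendation, we are suddenly faced with an enormous variety of tastes.

That is, my 'romantic' differs from your 'romantic', artistic merit in a film differs from artistic merit in a webtoon, and all of these change in real time.

That concepts change in real time is something the big data era has made it possible to infer.
When dramas featuring couples in which the woman is older than the man are in the spotlight, as they are now, various data show that the concept of "romantic these days", at least as Korean women currently think of it, reflects that sentiment and circumstance to a considerable degree. In the same way, the public's tastes are always fluid, and no right answer exists. [Figure 4 - Answer-finding approach vs. problem-solving approach]

So, in a conversational service, who should answer a request like "Got any romantic movies?", and how, without a right answer or a ranking?

We need to move away from the answer-finding 'keyword' and build and apply a new classification system centered on the natural tastes and circumstances we ordinarily use to solve problems.

As noted earlier, there is no right answer to people's thoughts and tastes.

For one person a romantic comedy may be the best option when bored, while for another a thriller may be the best way to pass the time.

Instead of finding a limited number of 'right answers' that contain the 'keyword' I typed, among countless pieces of information, could we not find a way to focus on how to solve the problem in a world where different people live with their own tastes, discussing together and arriving at a result?

Starting from the language users use in everyday life, Mycelebs conceived a new recognition system along many dimensions, including human emotions and tastes, T.P.O (Time, Place, Occasion), and the characteristics, colors, and textures particular to each category, and implemented it by applying machine learning and natural language understanding technology.

This new recognition system, called Dynamic Ontology, is not a simple matching method expressed in keywords. It applies a vector-model-based inference method that draws not only on the relatedness between words but also on data that can be extracted specifically for each category, such as movies, celebrities, and paintings.

It is an extension of the existing word-to-vector (word2vec) approach, which scores simply on the relationships between words.

In the course of inferring results, Dynamic Ontology extracts the 'Keytalk' that carries the user's intent and context.

A 'Keytalk' is expressed in the form of a 'word', much like a keyword, but rather than simple matching it is designed to capture all the varied tastes and intents a single word carries, based on the proximity and similarity among the various expressions found within a document.
[Figure 5 - The Dynamic Ontology concept] [Figure 6 - Malhae Keytalk]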
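The text does not disclose how Dynamic Ontology computes its vectors, so the following is only a generic sketch of what "vector-model-based inference" can look like: items and a Keytalk are placed in the same vector space and ranked by cosine similarity. The three-dimensional vectors, the item names, and the functions `cosine` and `rank_items` are invented for illustration and are not taken from the service.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors with made-up values; in a real system they would be learned from data.
KEYTALK_VECTORS = {"romantic": [0.9, 0.1, 0.2]}
ITEM_VECTORS = {
    "movie A": [0.8, 0.2, 0.1],
    "movie B": [0.1, 0.9, 0.3],
}


def rank_items(keytalk):
    """Order items from most to least similar to the given Keytalk vector."""
    query = KEYTALK_VECTORS[keytalk]
    return sorted(ITEM_VECTORS,
                  key=lambda name: cosine(query, ITEM_VECTORS[name]),
                  reverse=True)
```

Unlike keyword matching, nothing here requires the word "romantic" to appear in an item's description; closeness in the vector space is what produces the ordering, and the ordering shifts whenever the underlying vectors are re-learned.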

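What made this approach possible in the first place is the arrival of the so-called 'DT (Data Technology) era'. More than 90% of the data that exists on Earth was generated within the last three years, and most of it is 'life-log data' left online by the public. Recently hashtags have been added as well, greatly widening the range of life-log data that machines can learn from.

In addition, various technologies that can extract data, such as color analyzers and image recognizers, were used, and an inference engine modeled by learning from many papers long studied offline played an important role.

With the rapid growth of social media over the past several years, this data has become more likely, in both quantity and quality, to yield hints about users' 'tastes'. Clues to problems that were hard to solve with existing speech recognition or natural language understanding technology alone can now be found in this rapidly growing data.

This makes it possible to classify objects in various categories (or industries) on the basis of language as people actually think it: 'on the way home from work', 'when feeling down', 'at the library', 'romantic'.

Of course, this does not mean it operates entirely apart from the existing classification systems users are familiar with. If we simply evaluated a product by 'extracting the documents from the past year's purchase reviews and blogs in which it is directly mentioned and counting how many times the word "charm" appears in them', it would be no different from existing methods; what matters is the difference in the logic and underlying technology for reflecting tastes that differ from person to person.

For example, when choosing an 'attractive wine' and an 'attractive dress', the same word 'attractive' carries different meanings.

For wine, 'attractive' is probably made up of elements such as 'sweetness, aroma, and color', the subjective evaluations of people who have actually drunk it, the label design, and even the backstory of the producer;
for a dress, elements such as 'color, material, pattern, and length' matter, of course, along with which brand it is, which model recently appeared wearing it, and how well it has sold. [Figure 7 - The difference between 'attractive' for a dress and for wine]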
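The wine and dress example can be sketched as one Keytalk that is scored with a different set of features in each category. This is a minimal illustration under stated assumptions: the feature names follow the examples in the text, but the weights, the `KEYTALK_WEIGHTS` table, and the `keytalk_score` function are hypothetical, and real weights would presumably be derived from data rather than written by hand.

```python
# Hypothetical weights: what "attractive" is made of in each category.
KEYTALK_WEIGHTS = {
    ("wine", "attractive"): {"sweetness": 0.3, "aroma": 0.4, "label_design": 0.3},
    ("dress", "attractive"): {"color": 0.4, "material": 0.3, "popularity": 0.3},
}


def keytalk_score(category, keytalk, features):
    """Weighted sum of an item's features, using the weights of its own category.

    Features are assumed to be normalized to the range 0..1; a feature the
    item does not have counts as 0.
    """
    weights = KEYTALK_WEIGHTS[(category, keytalk)]
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())
```

The same word therefore produces different scores depending on where it is asked: a wine's aroma contributes to an 'attractive wine' but contributes nothing when the question is about an 'attractive dress'.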
.
.

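5. A new Keytalk-based voice service: Malhae

What difference does it make when Keytalk is applied to voice-based products? At the very least, far more varied and rich information about taste and intent can be extracted from utterances made in the user's natural speech habits, and from that information quite plausible results, even if not exact matches, can be inferred and presented to the user.

For example, if a user asks, "I'm feeling down, isn't there some exciting movie worth watching?", an existing voice-based product can, if lucky, give a fixed answer that fits a prepared scenario pattern.

Otherwise it has no option but to use the user's utterance as a search query as is, pull out the search results document by document, and read them aloud.

(In practice, most voice services in Korea just play music the whole time when you say you are bored, even though there must be countless ways to solve the problem of boredom.)

With the Keytalk approach, however, the sentiment and taste information 'gloomy', 'worth watching', and 'exciting' is extracted from the same utterance, and Dynamic Ontology is then used to infer the 'movies', 'wines', 'webtoons', and 'restaurants' most closely related to that information, thereby solving 'the user's problem'.

Not only is far richer information captured from the same user command, but results can also be provided by applying relatively more varied criteria to the request.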
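The flow just described, from a free-form utterance to Keytalks to recommendations across several categories, can be sketched as follows. This is a toy under loud assumptions: the phrase lexicon, the catalog, the association scores, and the functions `extract_keytalks` and `recommend` are all invented for this example, and simple substring matching stands in for whatever language understanding the real service uses.

```python
# Hypothetical lexicon: everyday phrases mapped to the Keytalk they signal.
KEYTALK_LEXICON = {
    "feeling down": "gloomy",
    "depressed": "gloomy",
    "exciting": "exciting",
    "worth watching": "worth watching",
    "bored": "fun",
}

# Hypothetical catalog: per category, each item's association score per Keytalk.
CATALOG = {
    "movie": {
        "Film X": {"exciting": 0.9, "worth watching": 0.7},
        "Film Y": {"calm": 0.8},
    },
    "webtoon": {"Toon Z": {"exciting": 0.6, "fun": 0.9}},
    "wine": {"Wine W": {"gloomy": 0.5}},
}


def extract_keytalks(utterance):
    """Collect the Keytalks whose trigger phrases occur in the utterance."""
    text = utterance.lower()
    found = []
    for phrase, keytalk in KEYTALK_LEXICON.items():
        if phrase in text and keytalk not in found:
            found.append(keytalk)
    return found


def recommend(utterance):
    """Per category, list items with a positive total score, best first."""
    keytalks = extract_keytalks(utterance)
    results = {}
    for category, items in CATALOG.items():
        scored = [(sum(tags.get(k, 0.0) for k in keytalks), name)
                  for name, tags in items.items()]
        ranked = [name for score, name in sorted(scored, reverse=True) if score > 0]
        if ranked:
            results[category] = ranked
    return results
```

The point of the sketch is that the utterance is never treated as one search string and never has to match a prepared sentence pattern: filler words are ignored, several Keytalks are picked up at once, and candidates come back from every category that has something related to offer.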

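Figure 8 shows how information is explored in Malhae, the voice-based lifestyle portal to which the Keytalk approach has actually been applied. [Figure 8 - Share of smartphone voice interface usage]

When using Malhae, users need not worry about fixed grammar or patterns; they can simply state their intent comfortably, the way they normally talk.

Even if I stretch out the microphone's lead time with my own verbal habits, 'uh~ that~ well~ um~ you know~ what was it...', it has no effect at all.

And what does it matter if I speak informally, or even swear?

If I say "Got anything fun?", it first shows every category associated with the Keytalk 'fun' and lets me choose.

If I say I'm feeling kind of off, it recommends various categories that include the Keytalk 'mood-lifting'.

Within a category I can also add further Keytalks by voice. We removed the old inconvenience of uttering one sentence and then having to start again from the beginning, and chose a way of conversing directly.

And for anything it did not catch properly, saying "Hey, why did you leave out 'fun'!" within the category adds it; conversely, I can switch Keytalks on and off to explore the results according to my taste, and even the various sort options are based on circumstances.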
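The on-off refinement described above can be pictured as a set of active Keytalks that the user adds to or removes from without restarting the query. The sketch below is only an assumption about how such a filter might behave; `ITEMS`, `toggle`, and `results` are hypothetical names, and a real service would presumably rank results rather than apply a strict filter.

```python
# Hypothetical items within one category, each tagged with Keytalks.
ITEMS = {
    "Film X": {"fun", "romantic"},
    "Film Y": {"fun"},
    "Film Z": {"romantic", "sad"},
}


def toggle(active, keytalk):
    """Return a new selection with the Keytalk switched on if it was off, off if it was on."""
    return frozenset(active) ^ {keytalk}


def results(active):
    """Items that carry every currently active Keytalk, in name order."""
    return sorted(name for name, tags in ITEMS.items() if active <= tags)
```

Starting from an empty selection, switching 'fun' on narrows the list to Film X and Film Y, adding 'romantic' narrows it to Film X, and switching 'fun' back off widens it to Film X and Film Z. Each step builds on the previous one, which is the contrast with having to repeat the whole sentence from the start.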

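With Malhae, at least, we never have to watch the machine's mood and wait anxiously, wondering, "Did I ask the question it wanted?"

In fact, it is said that most people using the voice-based products now on the market have gradually begun to say everyday things to the machine, such as 'Ah, I'm bored' or 'I'm feeling down', rather than uttering fixed commands.

Rather than forcing an answer-finding game of twenty questions on such users, wouldn't it be better to just let them speak freely and comfortably and offer an alternative: "Then how about this?" Technology exists to make things convenient for people; people should not have to adapt to technology.

We should choose not an answer-finding approach but one that minimizes resistance to recommendations. I will cover this in the next installment.

** Quoting without citing the source really makes you lose face.
** To the many entrepreneurs in information retrieval and voice services, and especially to the staff who are more preoccupied with the task at hand than their leaders are: it took us three years to finish filing 70 patents. We are currently aware of a number of issues, including ones involving the person in charge of movies at company N and the person in charge of its voice service, so we ask the leaders of each company to bring this to their staff's attention. We are a small startup, and we want to cooperate. **

[Photo 1 - Kyle Do (도준웅), who proposed Keytalk]
- Thank you for reading this long post -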




I. The Era of Keytalks and Voice UX

Have we finally entered an era where humankind can have a natural conversation with computers? Webster defines conversation as ‘an oral exchange of sentiments, observations, opinions or ideas’ (https://www.merriam-webster.com/dictionary/conversation?utm_campaign=sd&utm_medium=serp&utm_source=jsonld).

By that definition, whether you’re an optimist and are dreaming of a day when we can talk to our computers like we see in Star Trek, or a pessimist fearing SkyNet and judgement day, we can safely say that we have not yet entered the age of true conversation with computers.

Of course, in recent years we have seen huge advances in Automated Speech Recognition (ASR) technology as well as Natural Language Understanding (NLU), which have led to the development and adoption of smart speakers and other smart products that can passably understand human utterances and then carry out actions.

However, despite these advances, we still find ourselves a long way off from true, natural conversation with machines.

• The limits of current voice recognition products: the legacy of the keyword era

Currently, you need to use some kind of action keyword, such as “Siri”, “OK Google”, or “Alexa”, to get your smart speaker’s attention. But if you think about it, this is really no different from the old days of MS-DOS and typing commands into a terminal; we’re just not looking at a screen when we do it.

Once the speaker has been called, the user has to input commands that the speaker has been preprogrammed to understand, using proper grammar. Speakers such as Amazon’s Echo Show use their display to help educate users on how to speak so that the speaker understands their inputs. They will offer guidance like ‘ask questions like this’ or ‘try saying this’.

If we think about a conversation between two people, this is extremely unnatural. Every time we want to advance the conversation, we have to say the name of the person we are talking to, and because we don’t know whether the person we’re speaking with will understand what we say, we have to think carefully about what we say before we say it.

It’s not enough that we have to start from the beginning when the voice speaker doesn’t understand what we say, but we also have to rethink what we are going to say and use different words in an effort to get the speaker to understand us.

During this exchange, if we get upset or frustrated, we are liable to make some exclamation, or God forbid, utter a curse word. As soon as this happens, the voice speaker doesn’t understand, and our conversation ends.

Some voice speakers will even chastise us for our foul mouths if we swear in their presence! Most voice speakers are currently used for simple tasks like playing music or automating the home, for example turning the lights on and off, so once users learn the correct things to say, they may not run into these kinds of problems.

However, users that want to maximize their use of voice speakers, to actively make use of these speakers throughout their daily lives, find themselves quickly frustrated and exhausted. How do voice speaker makers feel about this? Recently in Korea, a marketing manager for the AI speaker developed by one of Korea’s top telecommunication companies said ‘We are targeting the lazy users.’

Voice-based platforms are one of the most promising fields in AI, a field that is leading the way with major technological breakthrough after breakthrough, so how is it that we have these kinds of discomforts when using these services? These inconveniences arise from an overreliance on keyword-based classification structures that have carried over from the text era and remain ingrained in voice interfaces.

The ways we express our thoughts through speech are as varied and many as people’s thoughts and tastes. And, in real, person-to-person conversation, meaning can be conveyed through nonverbal manners or understood through context. However, computers, which understand the world in 1s and 0s, are not yet able to fully understand people’s natural preferences and interests.

Of course, with the arrival of conversational interfaces in the form of chatbots and the concept of dialogue management, major players in tech talk about how we can teach conversational context to machines to some extent.

They are talking about deep learning, the most exciting and technologically advanced field in AI that is currently being researched. Currently, deep learning shows a capacity for understanding natural conversation on a small scale, but the computing power and data required to train these systems are quite large and demand a massive time investment. As a result, what we see right now is more of a ‘snapshot’ of what deep learning is capable of.

When we think of leading AI today, the first projects that come to mind are Google’s AlphaGo and IBM’s Watson.

These AIs wow us with their ability to win complex games and respond accurately in quiz shows, but in these cases also we see that they are restricted to a rules-based environment where questions are asked through keywords (even in the game context, a move can be considered a registered keyword), and answers are given. Regardless of how successful they are in these specific, context-limited situations, true conversation is on a completely other level.

• Changing recommendation methodology: the transition from directory-based search to keyword-based search

So how do we resolve this problem? How do we arrive at true conversation? Saying that the answer is to wait until a true general AI is invented is not really an answer at all.

Furthermore, we don’t need to restrict ourselves to solving every problem with more and more powerful algorithms.

There is a space we can occupy in the now that takes existing AI technological breakthroughs and pairs them with more intelligent, more versatile recognition/classification systems.

That is to say, if we follow the diachronic change in recommendation methodology, we can find an answer.

Today we can find and peruse more information through more channels than we could ever hope to count, but before the invention of the internet, consumers mainly received information – that is to say, purchase recommendations – through the display stands in and outside of shops. These display stands were compiled into magazine format creating the catalogs that millennials have only heard of and Gen Xers and baby boomers survived and thrived on. Sears is famous for getting its start through mail-order catalogs – soon becoming known as ‘the Consumer’s Bible’. (citation needed).

With the introduction of the internet, the first search engines followed a similar model. There was not yet the technology for keyword-based searches, so information was organized into directories and subdirectories to point users in the direction they wanted to go and help them find the information they sought.

The representative example of this directory-based search is Yahoo back in its heyday. In fact, Yahoo was so wedded to its directory-based search structure that it hired library science majors and specialists to manage and refine its search directories.

This search and content recommendation methodology was made possible thanks to classification systems based on traditional businesses.

In just the same way that Sears published a magazine catalog, Yahoo took that existing architecture and digitized it, improving on the old system in that users could click and immediately be taken to the subdirectory they wanted. So for example, if I wanted to purchase a television, I could go to Yahoo, choose the home appliance category, which would contain the television, refrigerator, and washing machine subcategories, and then click on the television subcategory to find the information and recommendations I was seeking.

The keyword-based search engines we use today are, in reality, not so different from the Yahoo Directory of old.

Both Google and Korea’s number 1 search engine, Naver, are based on the relationship between documents and utilize a keyword-based search methodology that scours the internet for relevant information based on the keywords the user inputs.

While keyword-based search led to huge breakthroughs in search convenience and power – Google’s global dominance makes an easy case in point – it has not succeeded in fully surpassing directory-based search, which relied on classifying data by industry. Most of us have grown up and lived in a keyword-search world and have no experience with other search methods, and therefore may not even notice the inconveniences when they occur in our typed search experience.

However, with the introduction of voice user experiences and voice applications, we begin to see quite clearly the limitations of current popular search methods as well as just how unnatural it is to use keywords when using voice to search.

When we think about search, then, it is important to empty our mind of our habits and biases – intentional or otherwise – and ask ourselves what really goes on in our minds when we think about searching for information.

What form do those thoughts take? You may be looking for ‘some good songs to pump me up in the morning so I can wake up and get to work,’ or ‘I’m feeling pretty down, what are some movies that will make me feel better,’ or you may have uncommon situations pop up that are no less important, like ‘I’m meeting my girlfriend’s parents for the first time, what’s a good restaurant to take them to?’

These search queries adhere more closely to a ‘problem solving’ process that is led by context and taste based on time, place and occasion (TPO).

Sure, traditional-style classification systems, such as searching music by genre or year, do narrow down and focus your search, but do they actually provide the user with the information they are searching for? If the goal of voice is to imitate or replicate real human-to-human conversation, then we should be able to say or type the kinds of real requests we have made to friends and family in the past. Keyword-based search interfaces, however, are unable to fully accomplish this task.

Compare the problem solving example outlined above to the keyword search, where we need to know the right keywords to type in to get the search results we want to receive – this is what we call a ‘finding answer’ perspective.

With a finding answer approach to search, it’s impossible to fully cover all the possible responses, all the variables that we see in natural conversation.

This is why we see current voice-based products, such as voice speakers and assistants, following scenario-based design principles.

Voice designers are tasked with coming up with all the scenarios that could occur when the user engages with their app and then from that baseline create scripts and populate dictionaries with the words, phrases, questions, answers and actions the designer wants the voice app to process. In reality, voice apps today are designed to already have specific answers and it falls upon the user to ask the right questions.

Using the existing voice app construction framework, voice designers must think up and manually add any exceptions that the user may utter and in the event that the user does come up with an utterance that the voice app has not been prepped for, we get the dreaded “I’m sorry, I didn’t understand that,” and we come to realize ever so clearly that we are still quite far away from natural, organic conversation.

But do we really have to speak in perfect grammar, with perfect enunciation, using only courteous and proper words?

Do we have to always pre-think our utterances to be sure that they follow the script, lest the app fail to understand us and we have to start over?

When we combine perfect utterances with perfect grammar and perfect language, is the resulting utterance even remotely natural – or have we completely lost what we originally set out to accomplish?

What we are left with is an unnatural, uncomfortable and, perhaps most importantly, inconvenient user experience.

• A New Data Classification System: Dynamic Ontology and Keytalks

People have so many different tastes and so many different ways of expressing them, and when it comes to people’s tastes, preferences or opinions, there is no ‘right answer’.

But despite there being no right answer, we still find ourselves looking over the recommendations/search results given to us by our favorite apps and judging whether those results are right or not. In recent years, we’ve seen the introduction of location-based services, promoted within social media and review recommendation sites, such as Yelp or TripAdvisor, yet these sites and services are unable to quickly, easily, immediately find what we want owing to the restrictive nature of keyword-based search.

Let’s look at an example. You want to find a romantic hotel that specializes in romantic programs like spas and romantically lit restaurants, but also won’t break the bank. If you were to go into a traditional trip / hotel recommendation app, you would have to sort by location and price, and then look through the hotels one by one until you find the reviews / information that fit all of your desired specifications. Wouldn’t it be nice to search just once and immediately be supplied with the results you want?

Consumers aren’t looking for just name and location anymore when they search for hotels. Of course, this information helps, but they want to know more specific information and they want this information fast and provided in a convenient, easy-to-read manner.

Once we transcend the keyword paradigm, we can start solving all sorts of problems. Maybe shower water pressure is really important to you. Using publicly available data sources, we can find out hotel/motel water pressure and draw inferences, which can then be organized, classified and presented in a user-friendly search feature.

Now let’s take a look at TV on demand / video streaming services. Most television/video content providers readily admit that their user base only accesses around 10%, or less, of available content (citation needed).

Yet, these services have to maintain massive content libraries and deal with all the associated management/upkeep costs.

If you were to take a popular network like HBO, of course consumers know about their big hits: their Game of Thrones, their Westworlds, but what about their other content?

What about older content? What about networks that have even larger libraries than HBO, but consumers only really consume their top show of the season/year? If consumers don’t already know the keyword for what they are looking for (‘Game of Thrones’ for example), it can be hard to find content. And that’s a real shame when you think that there may be TV shows, books, streams, etc. that match your tastes but you’ll never know because you don’t know how to find them.

We’re now able to get a better grasp on user tastes/preferences as people search and communicate online in a more conversation-like manner.

Not only can we classify data in new ways, like ‘hotels with high water-pressure showers,’ or ‘movies with character-driven plots,’ but we are finding that words themselves can have different meanings in different contexts.

The criteria by which we judge a movie to be ‘creative’ may be different from the criteria by which we judge a restaurant to be ‘creative’.

And to complicate matters, what we, in our broad collective consciousness, define as ‘creative’ changes over time and in real time.

If you think about it, it makes some logical sense that tastes and preferences are continuously changing and evolving.

Humans, as we grow and mature, like different things than we used to when we were younger. Things that were popular in the 90s were not popular in the 2000s.

So, this idea is not some new or novel thing. However, with the advent of big data and big data processing technologies, we can now, more than ever before, understand these trends as they develop and evolve.

To take celebrity relationships as an example, as soon as the next power couple à la Brangelina starts dating, there is an enormous amount of data created about them as people gossip, comment and follow the relationship.

This in turn ends up influencing the way the public looks at these celebrities and can change their tastes/opinions about these celebrities and the productions they show up in.

This is just one more reason why we need to move away from keyword-based searches, which seek to connect keywords to specific right answers. We need to develop and apply new data classification systems that take into account natural language, tastes, and context, and do so in a manner that moves away from an ‘answer finding’ methodology and towards a ‘problem solving’ one.

OK, so we know that people’s tastes are different, and that they change over time… but let’s take one more step back.

Take a concept like boredom. Some people, when they are bored are looking for an easy-to-watch, feel-good romantic comedy movie, while other people want something more exciting, like a thriller – or they might not want to watch a movie at all, they may want to go out and do something.

With so much information out there, in a world where everyone has their own unique tastes and preferences and each question has a different answer for different people in different situations, can we move away from a keyword ‘right answer/wrong answer’ approach and move towards a problem solving approach that takes into account all of these variables at the same time?

Mycelebs developed a new data classification system that uses machine learning techniques and natural language understanding technologies to classify data along three axes: people’s tastes, preferences and sentiments; context (time, place and occasion); and categorical characteristics (color, quality, etc.). All of this is drawn from the real natural language that people use as they go about their lives.

We call this classification system Dynamic Ontology. This is not simply dressed-up keyword matching; Dynamic Ontology uses word-to-vector modeling to find the correlation between words as its basis for making inferences.
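To make the idea of "correlation between words" concrete, here is a minimal sketch of how similarity between word vectors is commonly measured, using cosine similarity. The vectors, the words and the function names below are invented for illustration; this is not Mycelebs' actual model or data, only the general technique that word-to-vector approaches tend to rely on.

```python
import math

# Toy 3-dimensional word vectors. The values are made up for illustration;
# a real word-to-vector model learns hundreds of dimensions from text.
TOY_VECTORS = {
    "romantic": [0.9, 0.1, 0.2],
    "candlelit": [0.8, 0.2, 0.1],
    "budget": [0.1, 0.9, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_related(word, vectors):
    """Return (other_word, score) pairs, most similar to `word` first."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_related("romantic", TOY_VECTORS):
        print(f"{other}: {score:.3f}")
```

With these toy values, 'candlelit' ranks closer to 'romantic' than 'budget' does, which is the kind of signal an inference layer could build on.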

But this is just the basis of Mycelebs’ technology foundation. Mycelebs also makes use of word embedding, color analysis, image recognition and other high-level machine/deep learning technologies that, when put together, create Dynamic Ontology.

Dynamic Ontology creates Keytalks that understand user intent and context.

Keytalks are similar to keywords in that they are made up of singular words or phrases that can be expressed in a keyword/hashtag manner, but Keytalks are loaded with so much more information than simple keywords are.

This is not just another way of matching words to documents; Keytalks are imbued with a huge wealth of meaning. Keytalks may appear simple, but they are composed of a structure consisting of closeness and similarity between words and expressions, as well as attribute data such as color aesthetics, texture and material.

This new approach to data classification and the vast wealth of meaning found in each Keytalk is only possible thanks to the advent of the Data Technology era. More than 90% of all data that exists was created within the last 3 years and most of that data takes the form of ‘life-log data’ – the data people upload on their social media accounts, for example. Due to all of this new data being created every day, we are accumulating larger and larger data sets that lead to better machine learning outcomes and expand the scope of what we are capable of accomplishing.

Thanks to social media and the uploading of people’s thoughts, opinions, sentiments, reviews, pictures and more, we are able to gain insights into people’s tastes and preferences, while at the same time using this data to develop and continuously improve voice recognition and natural language understanding technologies.

Dynamic Ontology takes language as it’s actually being used, such as ‘on my way home from work’, ‘when I’m feeling gloomy’, ‘romantic’, ‘at the library’, understands it, and then classifies it according to the context it is used in or according to specific categories (or industries).

Many existing data science companies can, for example, take all of the reviews for a company’s products that have been uploaded over the past year, count the number of times the word ‘racy’ is used, and then use this number as a metric for data analysis. But this only scratches the surface of natural language analysis.

Let’s go deeper. ‘Racy’ can be used to describe clothing, like a ‘racy’ dress, but it can also be used to describe wines. There will be some overlap in meaning, but the characteristics inherent in a ‘racy’ dress may be different from those of a ‘racy’ wine. A ‘racy’ wine may be a bit on the sweeter, fruitier side, while a ‘racy’ dress probably hints more at how scandalous the dress is, or how proper it is to wear in public. These descriptors, ‘sweet’ and ‘fruity’ versus ‘scandalous’ and ‘proper’, are quite different.

Going even deeper, these descriptors, or taste attributes as we call them at Mycelebs, combine to create tastes/preferences such as ‘racy’, and the taste attributes themselves are made up of characteristics such as color, pattern and length for clothing, or body and flavor for wine. Our data analysis also inherently picks up the influence of other factors, such as brand name/worth, whether the product was worn/drunk/used by a celebrity, and how many units were sold.
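The 'racy' example can be sketched as a lookup keyed by both the descriptor and the category, so that the same word resolves to different taste attributes depending on what is being described. The table and function names here are hypothetical, and the attribute sets are copied from the example above rather than from any real Mycelebs data; a real system would presumably learn these associations rather than hard-code them.

```python
# Hypothetical hand-written table for illustration only: the same descriptor
# maps to different taste attributes depending on the category.
TASTE_ATTRIBUTES = {
    ("racy", "wine"): {"sweet", "fruity"},
    ("racy", "dress"): {"scandalous", "proper"},
}


def attributes_for(descriptor, category):
    """Return the taste attributes of `descriptor` within `category`."""
    return TASTE_ATTRIBUTES.get((descriptor, category), set())


def shared_attributes(descriptor, category_a, category_b):
    """Attributes the descriptor has in common across two categories."""
    return attributes_for(descriptor, category_a) & attributes_for(
        descriptor, category_b
    )


if __name__ == "__main__":
    print(sorted(attributes_for("racy", "wine")))
    print(sorted(attributes_for("racy", "dress")))
    print(sorted(shared_attributes("racy", "wine", "dress")))
```

Counting occurrences of 'racy' alone would collapse both rows into one number; keeping the category in the key is what preserves the difference. In this toy table the two attribute sets happen not to overlap at all, although in practice some overlap would be expected.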

Malhae: the first Keytalk-based voice service

What kind of changes do we see when we apply Keytalks to voice-based products and applications?

We can develop a product that understands the real natural language patterns that people use every day, and infer rich taste, intent and content information from those utterances, which we use to provide better, higher relevancy individual recommendations.

Voice-based product/application developers sell images of extremely intelligent AI that serves at your beck and call as you recline on your couch after a hard day at work.

The man/woman gets home, sits back in their recliner, and says ‘play me a movie to pick me up because I’m feeling down’ and the smart speaker, which is linked to his/her television, immediately shows accurate results.

What we, the consumer, don’t know is that the test we see in these ads is a carefully chosen and vetted example that is designed to look perfect.

This is because existing smart speakers/applications follow a scenario-based approach – product managers try to predict how their app will be used and write down all the scenarios and then likewise, write in the answers for these scenarios.

That is why we call this method a ‘finding answer approach’.

The AI doesn’t truly understand what ‘pick me up’ or ‘feeling down’ mean; it just knows to give a certain answer when it sees those phrases.

And terms like ‘pick me up’ may not be in the app/service dictionary at all, or the natural language processing may see this expression as three separate words and deduce an overly literal meaning from them.
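The difference between reading ‘pick me up’ as three separate words and reading it as one expression can be illustrated with a small sketch. It compares plain whitespace tokenization with a greedy longest-match against a phrase lexicon. The lexicon and function names are hypothetical, and this is only a simple illustration of the failure mode, not a description of how Keytalks or any particular smart speaker work.

```python
# Hypothetical phrase lexicon: multi-word expressions that should be kept
# together, mapped to a single token.
PHRASES = {
    ("pick", "me", "up"): "pick_me_up",
    ("feeling", "down"): "feeling_down",
}


def tokenize_words(utterance):
    """Naive approach: every whitespace-separated word is its own token."""
    return utterance.lower().split()


def tokenize_with_phrases(utterance, phrases=PHRASES):
    """Greedy longest-match: prefer known multi-word expressions."""
    words = tokenize_words(utterance)
    max_len = max((len(p) for p in phrases), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest possible phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if candidate in phrases:
                tokens.append(phrases[candidate])
                i += n
                break
        else:
            # No phrase starts here, so keep the single word as-is.
            tokens.append(words[i])
            i += 1
    return tokens


if __name__ == "__main__":
    utterance = "play a movie to pick me up"
    print(tokenize_words(utterance))
    print(tokenize_with_phrases(utterance))
```

Even this version still depends on someone having listed the phrase in advance, which is exactly the maintenance burden the scenario-based approach runs into.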

However, with Mycelebs’ Keytalks, we know the actual meaning, context and intent that exist inherently in these words and can therefore provide much better results. And Keytalks never need a manager/administrator working behind the scenes to add/subtract ‘correct’ answers and scenarios.

After deducing sentiment and taste information, such as ‘depressing’, ‘lame’, ‘exciting’, from user utterances, Mycelebs uses Dynamic Ontology to find the highest correlations in this information for categories such as restaurants, movies, wine, TV and more to create Keytalks.

Through this methodology, Mycelebs moves away from the old question-answer dichotomy and towards the outlook that ‘we solve user problems.’ Users are thereby able to access much more information that takes into account their individual preferences and standards.

When using Malhae, users can simply speak as they always do without worrying about speaking with proper grammar or using proper language. Users can pause as they’re speaking and use filler terms like ‘um’ or ‘uh’, and Malhae will understand those terms for what they are.

Users don’t need to worry about the way they will speak to Malhae. Mycelebs knows that people have different ways of talking to each other – the way you speak with your friends is different from the way you speak with your parents, your colleagues or your boss.

With Malhae, users can speak colloquially and mix in slang or swear words without any negative impact on search results. It’s Mycelebs’ philosophy that we shouldn’t have to speak up or speak properly to a machine – the machine should adapt to the way we are speaking.

When a user says a Keytalk like ‘fun’, Malhae shows all the relevant categories correlated with ‘fun’, and the user can then choose the specific category they want to see results in, such as movies.

Categories are listed in order of relevance. If users want to add in more Keytalks to narrow down their search options, they can do so without starting their search over from the beginning.

They can also use Malhae’s touch interface to peruse Keytalk hints and add them in that way, or type them in directly if they are in a location that does not allow them to speak.

There’s no need to worry about starting over from the beginning any time users want to modify their search.

In the event that the user added too many Keytalks and over-refined their search, they can toggle the selected Keytalks on or off as they choose.

Users can also modify their searches by more basic meta parameters such as demographic information, location, or price.

Malhae gives full coverage so that users can add, remove, and modify their searches and focus on what is important: getting highly relevant, immediately usable recommendations without worrying about how to use the app itself.

And in the event that the voice recognition doesn’t pick up what was said, users can easily add Keytalks and categories that they would like to see recommendations in using natural language (‘hey, I said find me something fun!’), all without losing their previous search queries. For example, if the user originally searched for ‘fun motels in Seoul’ but the voice recognition software failed to hear ‘fun’ at the beginning, the user can simply add ‘fun’ while keeping their ‘motel’ and ‘Seoul’ Keytalk queries. Likewise, if there are any Keytalks that the user would like removed from their search query, they can easily have those Keytalks removed or greyed out using either the touch interface or the voice interface.

Once a voice search has been initiated, the user can also refine their search using more classic filters, such as listing by price.
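The add, toggle and remove behaviour described above can be sketched as a small piece of session state that survives between utterances. The class and method names are hypothetical and this is not Malhae's actual implementation; it only shows one simple way a search could be modified without starting over.

```python
from dataclasses import dataclass, field


@dataclass
class KeytalkSearch:
    """Hypothetical search session that keeps Keytalks between utterances."""

    # Maps each Keytalk to True (active) or False (greyed out).
    keytalks: dict = field(default_factory=dict)

    def add(self, keytalk):
        """Add a Keytalk without discarding the ones already present."""
        self.keytalks[keytalk] = True

    def toggle(self, keytalk):
        """Grey out an active Keytalk, or reactivate a greyed-out one."""
        if keytalk in self.keytalks:
            self.keytalks[keytalk] = not self.keytalks[keytalk]

    def remove(self, keytalk):
        """Drop a Keytalk from the search entirely."""
        self.keytalks.pop(keytalk, None)

    def active(self):
        """Keytalks currently applied to the search, in the order added."""
        return [k for k, is_on in self.keytalks.items() if is_on]


if __name__ == "__main__":
    search = KeytalkSearch()
    # Voice recognition caught 'motel' and 'Seoul' but missed 'fun'.
    search.add("motel")
    search.add("Seoul")
    # The user adds the missing Keytalk; nothing else is lost.
    search.add("fun")
    print(search.active())
    # The user decides the search is too narrow and greys out 'Seoul'.
    search.toggle("Seoul")
    print(search.active())
```

Keeping greyed-out Keytalks in the session rather than deleting them is what lets the user switch them back on later without repeating the whole query.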

All of this combines to free the user from the worry, “did I ask my question in the perfect manner for the machine to understand me?”

According to a recent survey, people have begun to speak to voice-based products the same way they would with a friend, making open-ended statements such as ‘I’m bored’ or ‘I’m feeling down right now’.

Rather than playing twenty questions trying to find the answer, with each ‘turn’ in the conversation another minefield of misspoken utterances, pauses and fillers that can kill the conversation, wouldn’t it be better to allow people to speak naturally and comfortably and provide them with recommendations they can take advantage of right away? It is time to move away from forcing the user to learn how to use the technology, and move towards having the technology understand users and provide better outcomes for them.

Kyle (Junwoong) Doh, Mycelebs