
Magazine: 비전공자의 데이터 공부 (Data Study for Non-Majors)


by Sand Apr 23. 2022

Which word appears most often in Shakespeare's sonnets?

Exploring a corpus with Python's NLTK

STEP 1

Import the required functions from the NLTK library
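The import lines for this step are not shown in the post. Judging from the aliases used in the later steps (`T`, `S`, `Counter`, `pd`), they were probably something like the sketch below. This is an assumption reconstructed from the code, not the author's original lines, and the commented-out download calls are only needed if the NLTK data is not yet on your machine.

```python
# Assumed imports, inferred from the aliases used in STEP 3 to STEP 6
from collections import Counter          # word counting in STEP 6

import pandas as pd                      # DataFrame in STEP 4 and STEP 6
from nltk import tokenize as T           # T.word_tokenize in STEP 3
from nltk.corpus import stopwords as S   # S.words("english") in STEP 3

# One-time setup, if the tokenizer models and stopword list are missing.
# Newer NLTK releases may ask for "punkt_tab" instead of "punkt".
# import nltk
# nltk.download("punkt")
# nltk.download("stopwords")
```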

STEP 2

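## Function that reads a text file
def opener(title):
    # Path points at the author's Downloads folder; adjust it for your machine.
    # encoding="utf-8" is an assumption about how the text file was saved.
    with open("/Users/Name/Downloads/" + title + ".txt", "r", encoding="utf-8") as f:
        return f.read()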

STEP 3

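## Replace line breaks, lowercase every word, and drop English stopwords
def tokenize_words(title):
    words = T.word_tokenize(opener(title).replace('\n', ' ').lower())
    # Build the stopword set once instead of reloading the list for every word
    stop_words = set(S.words("english"))
    return [w for w in words if w not in stop_words]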

STEP 4

## Look up the counts of specific words in the corpus
def word_count(words, title):
    # all_unique_count is defined in STEP 6
    text = all_unique_count(title)
    analyzed_text = text[text['word'].isin(words)].sort_values(by='count', ascending=False)
    return analyzed_text
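To see what the `isin` filter in `word_count` returns, here is a small sketch on an in-memory table. The counts are made up for illustration and stand in for the output of `all_unique_count(title)`; they are not results from the sonnets.

```python
import pandas as pd

# Made-up counts, standing in for the output of all_unique_count(title)
counts = pd.DataFrame(
    {"word": ["love", "time", "beauty", "eye"], "count": [5, 3, 2, 1]}
)

wanted = ["eye", "love", "you"]
selected = counts[counts["word"].isin(wanted)].sort_values(by="count", ascending=False)
print(selected["word"].tolist())  # ['love', 'eye']

# A word that is not in the table gets no row at all, rather than a row with 0
missing = [w for w in wanted if w not in set(selected["word"])]
print(missing)  # ['you']
```

A word that never made it into the table is simply absent from the result, so a "zero" here means "no row", not a stored count of 0.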

STEP 5

## Clean the text data (remove punctuation)
def clean_words(title):
    # Punctuation tokens to drop, including curly quotes
    punctuation = {".", ",", "?", "!", "-", "“", "”", "--", "’", "‘", ":", ";", "(", ")"}
    return [w for w in tokenize_words(title) if w not in punctuation]
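The hand-written punctuation list above could also be built from the standard library. The sketch below is one possible variant, not the author's method: it starts from `string.punctuation` (ASCII only) and adds the curly quotes and the double hyphen that the list in `clean_words` covers. The name `drop_punctuation` and the sample tokens are made up for illustration.

```python
import string

# ASCII punctuation from the standard library, plus the curly quotes and
# the two-character "--" token, which string.punctuation does not cover.
PUNCTUATION = set(string.punctuation) | {"“", "”", "‘", "’", "--"}

def drop_punctuation(tokens):
    """Keep only tokens that are not pure punctuation marks."""
    return [t for t in tokens if t not in PUNCTUATION]

sample_tokens = ["love", ",", "time", "--", "beauty", "!", "’"]  # made-up tokens
print(drop_punctuation(sample_tokens))  # ['love', 'time', 'beauty']
```

Depending on the input, `word_tokenize` may emit other quote-like tokens as well, so it can be worth printing the most frequent tokens once and extending the set as needed.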

STEP 6

## Count the words in the cleaned corpus
def all_unique_count(title):
    words_unique = Counter(clean_words(title))
    # One row per distinct word, with named columns
    title_df = pd.DataFrame(list(words_unique.items()), columns=['word', 'count'])
    return title_df.sort_values(by='count', ascending=False)
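If only the top few words are needed, `Counter` can produce them directly without a DataFrame. A minimal sketch, using made-up tokens rather than the sonnet text:

```python
from collections import Counter

tokens = ["love", "time", "love", "beauty", "love", "time"]  # made-up tokens
counts = Counter(tokens)

# most_common(n) returns the n highest counts as (word, count) pairs
top = counts.most_common(2)
print(top)  # [('love', 3), ('time', 2)]
```

The DataFrame version in this step is still handy when the counts need to be filtered by word afterwards, as `word_count` does in STEP 4.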

Results

How many times do the words "eye", "love", and "you" appear across all of Shakespeare's sonnets?

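Surprisingly, the count for "you" comes out as zero.

One reason is that the sonnets often use the archaic forms thy, thou, and thee. Note also that "you" is in NLTK's English stopword list, so the filter in STEP 3 removes every occurrence before counting; the zero therefore does not show that the word is absent from the text.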

The most frequently used nouns are "love", "beauty", and "time".
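One thing worth checking about the zero count for "you": the filter in STEP 3 drops every word in NLTK's English stopword list, and "you" is on that list, while thou, thee, and thy are not. The sketch below shows the effect with a tiny hand-written stand-in for the stopword list and a sample line adapted from the openings of Sonnets 13 and 18. Both are assumptions for illustration; the real list is much longer.

```python
from collections import Counter

# Tiny stand-in for NLTK's English stopword list (assumption for illustration)
SAMPLE_STOPWORDS = {"i", "you", "yourself", "that", "to", "a", "but", "were", "are"}

# Sample text adapted from the opening lines of Sonnets 13 and 18
sample = "o that you were yourself but love you are shall i compare thee to a summer's day"

tokens = sample.split()
kept = [w for w in tokens if w not in SAMPLE_STOPWORDS]

raw_counts = Counter(tokens)
filtered_counts = Counter(kept)

print(raw_counts["you"])        # 2: present in the raw text
print(filtered_counts["you"])   # 0: removed by the stopword filter
print(filtered_counts["thee"])  # 1: archaic form survives the filter
```

To count pronouns such as "you", one option is to run the count on tokens before the stopword filter, or to remove those words from the stopword set first.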
