2. Neural Networks / L6. Sentiment Analysis - Sentiment Classification Projects : Curate a Dataset

chrisysl 2018. 7. 26. 21:25

Sentiment Classification & How To "Frame Problems" for a Neural Network

· Curate a Dataset

def pretty_print_review_and_label(i):
    # Print the label and the first 80 characters of review i.
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line.
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased.
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

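- Ultimately, the goal of deep learning is to derive the information we want from the information we already have.

- Set up the dataset as shown above before starting.

- Raw data alone is not enough to train a network; we have to provide two meaningful pieces of data.

- Here we train the network on data split into two parts, reviews and labels.

- reviews is the data set (what we know), and labels is the information we want to know.

- So we feed reviews to the network as input and train it to predict labels.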
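The loading step can be tried without the real files. The sketch below is only an illustration, assuming the same one-item-per-line layout: the sample reviews, the read_lines helper, and the use of io.StringIO in place of reviews.txt and labels.txt are all hypothetical stand-ins.

```python
import io


def read_lines(handle, upper=False):
    """Return one entry per line of an open text file (hypothetical helper)."""
    lines = [line.rstrip("\n") for line in handle]
    return [line.upper() for line in lines] if upper else lines


# Made-up stand-ins for reviews.txt and labels.txt.
reviews_file = io.StringIO("this movie was excellent\nwhat a terrible waste of time\n")
labels_file = io.StringIO("positive\nnegative\n")

reviews = read_lines(reviews_file)            # what we know (input)
labels = read_lines(labels_file, upper=True)  # what we want to know (target)

# Each review must line up with exactly one label.
assert len(reviews) == len(labels)

pairs = list(zip(reviews, labels))
for review, label in pairs:
    print(label + "\t:\t" + review[:80] + "...")
```

Pairing the two lists with zip makes the input/target relationship explicit: the network sees the review and is trained to predict the label next to it.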

· Lesson : Develop a Predictive Theory

- To teach the network to tell NEGATIVE from POSITIVE, what should we look at in the reviews?

- NEGATIVE reviews contain words such as terrible, trash, and impossible,

- while POSITIVE reviews contain words such as excellent and fascinating.
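One quick way to probe this theory is to check, for a few hand-picked words, how many reviews of each label contain them. The sketch below uses invented toy reviews and a hypothetical helper, label_counts_for; it is not taken from the original project.

```python
# Toy data, invented for illustration only.
reviews = [
    "an excellent and fascinating story",
    "terrible acting and a trash script",
    "excellent cast but an impossible plot",
    "fascinating from start to finish",
]
labels = ["POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]


def label_counts_for(word):
    """Count how many reviews of each label contain the given word."""
    counts = {"POSITIVE": 0, "NEGATIVE": 0}
    for review, label in zip(reviews, labels):
        if word in review.split(" "):
            counts[label] += 1
    return counts


for word in ["excellent", "fascinating", "terrible", "trash", "impossible"]:
    print(word, label_counts_for(word))
```

In this toy set "excellent" turns up once under each label, which hints at why a single occurrence proves little and why counts over the whole dataset are needed.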

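from collections import Counter

# The counters are not initialised in the original snippet;
# collections.Counter is assumed here, matching its later use.
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

# Count every word per label, plus an overall total.
for i in range(len(reviews)):
    if labels[i] == 'POSITIVE':
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1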

- Next, the code above counts how often each word appears.

- Then we inspected the output.

- Words such as "the" and "a" appear in those results, yet they are not the words we are after.

- So the counts need to be normalized.
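The problem with raw counts shows up even on a handful of sentences. The following sketch uses invented toy reviews (not the real dataset) and looks at the most frequent words per label.

```python
from collections import Counter

# Toy reviews, invented for illustration only.
positive_reviews = ["the film is a delight", "the cast is excellent"]
negative_reviews = ["the film is a mess", "the plot is terrible"]

# Count every word across all reviews of each label.
positive_counts = Counter(w for r in positive_reviews for w in r.split(" "))
negative_counts = Counter(w for r in negative_reviews for w in r.split(" "))

print(positive_counts.most_common(2))
print(negative_counts.most_common(2))
```

Both labels end up with the same two top words, "the" and "is", so frequency alone says nothing about sentiment. Comparing a word's positive count with its negative count removes that shared background.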

import numpy as np
from collections import Counter

# Create Counter object to store positive/negative ratios
pos_neg_ratios = Counter()

# Calculate the ratios of positive and negative uses of the most common words.
# Consider words to be "common" if they've been used more than 100 times.
for term, cnt in total_counts.most_common():
    if cnt > 100:
        # +1 in the denominator avoids division by zero
        pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)

# Convert the ratios to logs so that neutral words sit near zero.
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # +0.01 keeps the log finite when the ratio is 0
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))

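- The code above computes the positive-to-negative ratio for each word, and we then inspected the result.

- As a result, it extracts the data we want.

- The rest of the walkthrough is omitted here.

- https://github.com/thisisyuunsung/sentiment_network_basic

- See the repository above for the details.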
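The ratio and log steps can be traced by hand on a few numbers. The sketch below wraps both steps in a hypothetical helper, log_ratio, and uses math.log instead of np.log so that it needs no third-party package; the counts are invented for illustration.

```python
import math


def log_ratio(pos_count, neg_count):
    """Mirror the two steps: smoothed ratio, then a signed log."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))


# Invented (positive, negative) counts, for illustration only.
examples = {
    "mostly_positive": (400, 99),   # ratio 4.0
    "balanced": (100, 99),          # ratio 1.0
    "mostly_negative": (100, 399),  # ratio 0.25
}

scores = {name: log_ratio(p, n) for name, (p, n) in examples.items()}
for name, score in scores.items():
    print(name, round(score, 2))
```

The raw ratios 4.0 and 0.25 are lopsided around 1, but their log scores (about 1.39 and about -1.35) are nearly mirror images around zero, and the balanced word lands close to zero. That makes strongly positive and strongly negative words directly comparable.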
