Getting Started with NLP in Python (Part 2)

Published: 2019-05-26

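Last time we looked at counting word frequencies and handling stop words.

This time we look at the parts of NLP that deal with meaning.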

Tokenizing Text with NLTK

Suppose we have the following text:

Python can be easy to pick up whether you're a first time programmer or you're experienced with other languages. The following pages are a useful first step to get on your way writing programs with Python!

Use the sentence tokenizer to split the text into sentences:

from nltk.tokenize import sent_tokenize

mytext = "Python can be easy to pick up whether you're a first time programmer or you're experienced with other languages. The following pages are a useful first step to get on your way writing programs with Python!"

print(sent_tokenize(mytext))

Output:

["Python can be easy to pick up whether you're a first time programmer or you're experienced with other languages.", 'The following pages are a useful first step to get on your way writing programs with Python!']

It identifies each sentence correctly.

Next, try the word tokenizer:

from nltk.tokenize import word_tokenize

mytext = "Python can be easy to pick up whether you're a first time programmer or you're experienced with other languages. The following pages are a useful first step to get on your way writing programs with Python!"

print(word_tokenize(mytext))

Output:

['Python', 'can', 'be', 'easy', 'to', 'pick', 'up', 'whether', 'you', "'re", 'a', 'first', 'time', 'programmer', 'or', 'you', "'re", 'experienced', 'with', 'other', 'languages', '.', 'The', 'following', 'pages', 'are', 'a', 'useful', 'first', 'step', 'to', 'get', 'on', 'your', 'way', 'writing', 'programs', 'with', 'Python', '!']

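It identifies each word correctly as well. Note that punctuation marks become tokens of their own, and the contraction you're is split into you and 're.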
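To see what a tokenizer has to decide, here is a rough sketch that uses only the standard library's re module. It is not how NLTK works internally; the function names naive_sentences and naive_words are made up for this illustration, and the rules are deliberately simplistic.

```python
import re

text = ("Python can be easy to pick up whether you're a first time programmer "
        "or you're experienced with other languages. The following pages are a "
        "useful first step to get on your way writing programs with Python!")

def naive_sentences(s):
    # Split at whitespace that follows ., ! or ?.
    # This breaks on abbreviations such as "Dr." which a trained model handles.
    return re.split(r"(?<=[.!?])\s+", s.strip())

def naive_words(s):
    # Either a word (optionally followed by an apostrophe part, so "you're"
    # stays in one piece) or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", s)

sentences = naive_sentences(text)
words = naive_words(text)

print(len(sentences))  # 2
print(words[8:11])     # ["you're", 'a', 'first']
```

Unlike word_tokenize, this sketch keeps you're as one token, which shows that tokenization involves choices rather than one obviously correct answer.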

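You can specify the language when tokenizing, for example sent_tokenize(mytext, "french"). The name must be written in lowercase and must match one of the Punkt models that NLTK provides (such as english, french, or german); as far as we know there is no Chinese model, so "Chinese" does not work here.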

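Working with Synonyms

WordNet is a lexical database built for natural language processing. It contains groups of synonyms (called synsets) together with short definitions.

Here is the code: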

from nltk.corpus import wordnet

syn = wordnet.synsets("happiness")

print(syn[0].definition())

Output:

state of well-being characterized by emotions ranging from contentment to intense joy

It returns the definition of the word.

In the same way, we can collect the synonyms of a word, here happy:

from nltk.corpus import wordnet

synonyms = []
# Every synset of the word contributes the names of its lemmas.
for syn in wordnet.synsets('happy'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

print(synonyms)

Output:

['happy', 'felicitous', 'happy', 'glad', 'happy', 'happy', 'well-chosen']
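The list contains duplicates because the same lemma name appears in several synsets. A minimal way to remove them while keeping the original order, starting from the output shown above (the variable name unique_synonyms is ours):

```python
# The list printed above, copied here so the snippet runs without WordNet.
synonyms = ['happy', 'felicitous', 'happy', 'glad', 'happy', 'happy', 'well-chosen']

# dict keys are unique and keep insertion order, so this drops repeats
# without shuffling the remaining items.
unique_synonyms = list(dict.fromkeys(synonyms))

print(unique_synonyms)  # ['happy', 'felicitous', 'glad', 'well-chosen']
```

Using set(synonyms) would also remove duplicates, but the order of the result would no longer be predictable.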

Antonyms can be collected in the same way:

from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("happy"):
    for l in syn.lemmas():
        # Not every lemma has an antonym; take the first one when present.
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(antonyms)

Output:

['unhappy']

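Stemming

In linguistic morphology and information retrieval, stemming is the process of removing affixes to obtain the stem of a word. For example, the stem of working is work.

Search engines use this technique when indexing pages, because people write the same word in many different forms.

There are many stemming algorithms that deal with this; the most common is the Porter stemming algorithm. NLTK has a class named PorterStemmer that implements it: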

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('working'))

print(stemmer.stem('worked'))

Output:

work

work
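To illustrate why a search index benefits from stemming, here is a small sketch that groups documents under the stem of each word. The function toy_stem is a deliberately crude stand-in written for this example, not the Porter algorithm, and the documents are invented.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words like "sing" stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain a form of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

docs = {
    1: "she is working late",
    2: "he worked yesterday",
    3: "they work remotely",
}
index = build_index(docs)

# A query for any form of the word finds all three documents.
print(sorted(index[toy_stem("works")]))  # [1, 2, 3]
```

With NLTK installed, passing PorterStemmer().stem in place of toy_stem should give the same grouping for these three documents.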

Lemmatization

Lemmatization is similar to stemming, except that its result is a real word. A stemmer, by contrast, can return a string that is not a word at all:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('increases'))

Output:

increas

Lemmatizing the same word with NLTK's WordNet lemmatizer gives the correct result:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('increases'))

Output:

increase

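Because the lookup goes through WordNet, the result can be a different-looking word with the same meaning rather than a truncated string.

Sometimes lemmatizing a word returns the word unchanged.

This is because the default part of speech is noun. To lemmatize a word as a verb, pass pos explicitly: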

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('playing', pos="v"))

Output:

play

In practice this also works as a crude form of text compression: the resulting text can shrink to roughly 50% to 60% of the original.
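The figure above is not derived in this article. One rough way to check the effect on your own data is to compare the number of distinct word forms before and after normalization. Note that this measures vocabulary size, not text length, and that vocabulary_ratio and the toy_lemmas table are hypothetical names introduced only for this sketch.

```python
def vocabulary_ratio(tokens, normalize):
    """Distinct forms after normalization divided by distinct forms before.

    Assumes tokens is not empty.
    """
    original = set(tokens)
    reduced = {normalize(token) for token in original}
    return len(reduced) / len(original)

# Hypothetical lookup table standing in for a real lemmatizer call.
toy_lemmas = {
    "playing": "play",
    "played": "play",
    "plays": "play",
    "increases": "increase",
}

tokens = ["play", "playing", "played", "plays", "increase", "increases"]
ratio = vocabulary_ratio(tokens, lambda t: toy_lemmas.get(t, t))

print(round(ratio, 2))  # 0.33
```

With NLTK available, the same function can be called with a lemmatizer instead of the lookup table, for example lambda t: lemmatizer.lemmatize(t, pos="v").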

The pos argument can be verb (v), noun (n), adjective (a), or adverb (r):

print(lemmatizer.lemmatize('playing', pos="n"))

print(lemmatizer.lemmatize('playing', pos="a"))

print(lemmatizer.lemmatize('playing', pos="r"))

Output:

playing

playing

playing
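Since the result depends on pos, it helps to pick the value per word instead of hard-coding it. Part-of-speech taggers such as nltk.pos_tag return Penn Treebank tags (VBG, NNS, JJ, ...), so a small mapping is needed. The sketch below is one possible approach; penn_to_wordnet and lemmatize_tagged are helper names invented here, and the tagged pairs are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the pos letter expected by the lemmatizer."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, the default above

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object offering lemmatize(word, pos=...)."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]

# Hand-written (word, tag) pairs in the shape a tagger would return.
tagged = [("playing", "VBG"), ("games", "NNS"), ("quickly", "RB"), ("happy", "JJ")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('playing', 'v'), ('games', 'n'), ('quickly', 'r'), ('happy', 'a')]
```

With NLTK and its tagger data installed, the intended use would be lemmatize_tagged(nltk.pos_tag(word_tokenize(mytext)), WordNetLemmatizer()).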

That's all for this time.

See you in the next part.

Reposted from: http://ianqi.baihongyu.com/
