
## Overview

### SyntaxNet

#### Installation

```bash
echo 'Bob brought the pizza to Alice.' | syntaxnet/demo.sh
```

```
Input: Bob brought the pizza to Alice .
Parse:
brought VBD ROOT
 +-- Bob NNP nsubj
 +-- pizza NN dobj
 |   +-- the DT det
 +-- to IN prep
 |   +-- Alice NNP pobj
 +-- . . punct
```

SyntaxNet ships with a pre-trained English parser called Parsey McParseface, which we can use to analyze sentences. As described in How to Install and Use SyntaxNet and Parsey McParseface, Parsey McParseface's output is actually a CoNLL table. The format of this table is defined in models/syntaxnet/syntaxnet/text_formats.cc:

```cpp
// CoNLL document format reader for dependency annotated corpora.
// The expected format is described e.g. at http://ilk.uvt.nl/conll/#dataformat
//
// Data should adhere to the following rules:
//   - Data files contain sentences separated by a blank line.
//   - A sentence consists of one or more tokens, each one starting on a new line.
//   - A token consists of ten fields described in the table below.
//   - Fields are separated by a single tab character.
//   - All data files will contain these ten fields, although only the ID
//     column is required to contain non-dummy (i.e. non-underscore) values.
// Data files should be UTF-8 encoded (Unicode).
//
// Fields:
// 1  ID:      Token counter, starting at 1 for each new sentence and increasing
//             by 1 for every new token.
// 2  FORM:    Word form or punctuation symbol.
// 3  LEMMA:   Lemma or stem.
// 4  CPOSTAG: Coarse-grained part-of-speech tag or category.
// 5  POSTAG:  Fine-grained part-of-speech tag. Note that the same POS tag
//             cannot appear with multiple coarse-grained POS tags.
// 6  FEATS:   Unordered set of syntactic and/or morphological features.
// 7  HEAD:    Head of the current token, which is either a value of ID or '0'.
// 8  DEPREL:  Dependency relation to the HEAD.
// 9  PHEAD:   Projective head of current token.
// 10 PDEPREL: Dependency relation to the PHEAD.
```

```
INFO:tensorflow:Processed 1 documents
1       What    _       PRON    WP      _       0       ROOT    _       _
2       is      _       VERB    VBZ     _       1       cop     _       _
3       a       _       DET     DT      _       5       det     _       _
4       control _       NOUN    NN      _       5       nn      _       _
5       panel   _       NOUN    NN      _       1       nsubj   _       _
```

The meaning of every tag abbreviation used in the CoNLL table is documented here: Universal Dependency Relations.
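Since each CoNLL token line is just ten tab-separated fields, reading the output back programmatically is straightforward. Below is a minimal sketch; the field names follow the text_formats.cc comment above, but `parse_conll_line` is a hypothetical helper, not part of SyntaxNet:

```python
# Field names taken from the CoNLL format description in text_formats.cc.
CONLL_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                "feats", "head", "deprel", "phead", "pdeprel"]

def parse_conll_line(line):
    """Split one tab-separated CoNLL token line into a field dict."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(CONLL_FIELDS, values))

# First token line of the "What is a control panel" example above.
token = parse_conll_line("1\tWhat\t_\tPRON\tWP\t_\t0\tROOT\t_\t_")
# token["head"] == "0" means this token is the root of the sentence.
```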

### NLTK

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

#### Stemming vs. Lemmatization

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.
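The contrast can be made concrete with a toy suffix-stripping stemmer (purely illustrative, and deliberately much cruder than the Porter algorithm): it looks only at the word itself, so "meeting" is stripped to "meet" regardless of whether the sentence uses it as a noun or a verb, a distinction only a POS-aware lemmatizer could make.

```python
# A toy context-free stemmer: strip the first matching suffix, provided
# a reasonably long stem remains. Not a real stemming algorithm.
SUFFIXES = ("ing", "ly", "ies", "es", "s", "ed")

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# "meeting" stems to "meet" whether it was the noun or the verb form.
print(toy_stem("meeting"))  # meet
print(toy_stem("cats"))     # cat
```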

NLTK supports several stemmers, including but not limited to the Porter stemmer, the Lancaster stemmer, and the Snowball stemmer.

```python
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
u'maximum'
>>> snowball_stemmer.stem('presumably')
u'presum'
>>> snowball_stemmer.stem('multiply')
u'multipli'
```

Lemmatization in NLTK:

```python
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
u'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
u'church'
>>> wordnet_lemmatizer.lemmatize('is', pos='v')
u'be'
>>> wordnet_lemmatizer.lemmatize('are', pos='v')
u'be'
```
- `pos` = part of speech
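In practice the `pos` argument is usually derived from an automatic tagger rather than supplied by hand. A common pattern is to collapse Penn Treebank tags (such as those produced by `nltk.pos_tag`) down to the single letters `WordNetLemmatizer` accepts; the helper below is a sketch of that mapping, not an NLTK API:

```python
# Map a Penn Treebank POS tag to the single-letter pos argument that
# WordNetLemmatizer.lemmatize() expects: 'a', 'v', 'r', or 'n'.
def treebank_to_wordnet_pos(tag):
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBZ, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # default to noun, as the lemmatizer itself does
```

With this, `lemmatize('is', pos=treebank_to_wordnet_pos('VBZ'))` reduces "is" to "be", matching the session above.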
