Links: PROGRAMMING - TECHNOLOGY - PYTHON
Rel:
Ref:
Tags: #public
Natural Language Toolkit
TERMS:
lexicon: words and their meanings, which can differ by context (e.g. investor-speak vs. regular english-speak)
______
______
Stemming: a form of data pre-processing that reduces a word to its root stem,
e.g. "riding" stem = "rid-"
where "rid-" covers "ride, riding, ridden, ..."
Useful where the meaning of the word is unchanged:
- "I was taking a ride in the car."
- "I was riding in the car."
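A rough sketch with NLTK's PorterStemmer (the stemmer choice and the example sentence are assumptions, not from the note):

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize   # requires nltk.download("punkt") once

ps = PorterStemmer()

for word in word_tokenize("I was riding in the car while he rides his bike."):
    # e.g. "riding" -> "ride", "rides" -> "ride"
    print(word, "->", ps.stem(word))
```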
______
POS tag list: (part of speech)
- CC coordinating conjunction
- CD cardinal digit
- DT determiner
- EX existential there (like: "there is" ... think of it like "there exists")
- FW foreign word
- IN preposition/subordinating conjunction
- JJ adjective 'big'
- JJR adjective, comparative 'bigger'
- JJS adjective, superlative 'biggest'
- LS list marker 1)
- MD modal could, will
- NN noun, singular 'desk'
- NNS noun plural 'desks'
- NNP proper noun, singular 'Harrison'
- NNPS proper noun, plural 'Americans'
- PDT predeterminer 'all the kids'
- POS possessive ending parent's
- PRP personal pronoun I, he, she
- PRP$ possessive pronoun my, his, hers
- RB adverb very, silently,
- RBR adverb, comparative better
- RBS adverb, superlative best
- RP particle give up
- TO to go 'to' the store.
- UH interjection errrrrrrrm
- VB verb, base form take
- VBD verb, past tense took
- VBG verb, gerund/present participle taking
- VBN verb, past participle taken
- VBP verb, sing. present, non-3d take
- VBZ verb, 3rd person sing. present takes
- WDT wh-determiner which
- WP wh-pronoun who, what
- WP$ possessive wh-pronoun whose
- WRB wh-adverb where, when
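A quick sketch of tagging a sentence with these tags via nltk.pos_tag (the sentence and the sample output are illustrative):

```python
import nltk
from nltk.tokenize import word_tokenize
# one-time: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")

words = word_tokenize("Harrison quickly rode his bigger bike to the store.")
print(nltk.pos_tag(words))
# e.g. [('Harrison', 'NNP'), ('quickly', 'RB'), ('rode', 'VBD'), ('his', 'PRP$'),
#       ('bigger', 'JJR'), ('bike', 'NN'), ('to', 'TO'), ('the', 'DT'), ('store', 'NN'), ('.', '.')]
```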
______
Next step: find the words that modify or affect that(/those) noun(s).
Chunk into "noun phrases": a noun plus the modifiers around it.
Downside = the chunk grammar can only use regular expressions.
see re identifiers and modifiers @ regularexpressions.py
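A minimal chunking sketch with nltk.RegexpParser; the noun-phrase grammar below is just one assumed pattern (optional determiner, adjectives, noun):

```python
import nltk

# an already POS-tagged sentence: list of (word, tag) tuples
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

grammar = r"NP: {<DT>?<JJ>*<NN>}"          # "noun phrase" chunk rule (regex over tags)
chunked = nltk.RegexpParser(grammar).parse(tagged)
print(chunked)
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```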
______
NE (named entity) TYPE - Examples:
ORGANIZATION - Georgia-Pacific Corp., WHO
PERSON - Eddy Bonte, President Obama
LOCATION - Murray River, Mount Everest
DATE - June, 2008-06-29
TIME - two fifty a m, 1:30 p.m.
MONEY - 175 million Canadian Dollars, GBP 10.40
PERCENT - twenty pct, 18.75 %
FACILITY - Washington Monument, Stonehenge
GPE - South East Asia, Midlothian
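A sketch of pulling these NE types out with nltk.ne_chunk (the sentence is made up; needs the maxent_ne_chunker and words downloads):

```python
import nltk

sentence = "President Obama visited the Washington Monument in June."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# binary=False keeps the NE types (PERSON, FACILITY, DATE, ...); binary=True just marks "NE"
tree = nltk.ne_chunk(tagged, binary=False)
print(tree)
```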
______
e.g. the "lemma" for run, runs, ran, and running = run
default for pos="n"
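A minimal sketch with WordNetLemmatizer showing the pos="n" default (the example words are assumptions):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("runs"))              # "run"  (pos defaults to "n" = noun)
print(lemmatizer.lemmatize("running", pos="v"))  # "run"
print(lemmatizer.lemmatize("ran", pos="v"))      # "run"
print(lemmatizer.lemmatize("better", pos="a"))   # "good"
```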
______
[ NLTK’s corpora ] can be found in Users/name/nltk_data/corpora/
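For example, listing the search path and reading one of the stock corpora (assumes the Gutenberg corpus has already been downloaded):

```python
import nltk
from nltk.corpus import gutenberg

print(nltk.data.path)            # directories NLTK searches, including .../nltk_data
print(gutenberg.fileids()[:3])   # e.g. ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt']
emma = gutenberg.words("austen-emma.txt")
print(len(emma), emma[:8])
```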
______
WordNet is more of a giant lexicon than a corpus.
Synsets are: sets of synonyms grouped by a shared meaning.
wup_similarity = "Wu and Palmer" similarity; they wrote a paper on semantic similarity
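A small sketch of synsets and wup_similarity (the example words are assumptions):

```python
from nltk.corpus import wordnet

# a synset = one set of synonyms sharing one meaning
syns = wordnet.synsets("program")
print(syns[0].name())              # e.g. 'plan.n.01'
print(syns[0].lemmas()[0].name())  # e.g. 'plan'
print(syns[0].definition())

# Wu-Palmer semantic similarity between two synsets (0..1, higher = more similar)
ship = wordnet.synset("ship.n.01")
boat = wordnet.synset("boat.n.01")
print(ship.wup_similarity(boat))   # roughly 0.9
```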
______
Binary classification: the label must be one of two choices:
- positive or negative,
- spam or not,
- etc.
______
Words as Features for Learning:
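A sketch of the usual approach: turn each document into a {word: present?} dict over the most frequent words (the movie_reviews corpus and the find_features helper name are assumptions; the top-3000-words idea matches the notes further down):

```python
import random
import nltk
from nltk.corpus import movie_reviews

# each document = (list of words, "pos" or "neg")
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# feature vocabulary = the 3000 most frequent words across the whole corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(3000)]

def find_features(document):
    """Map a document (list of words) to {feature_word: appears in document?}."""
    words = set(document)
    return {w: (w in words) for w in word_features}

featuresets = [(find_features(doc), category) for doc, category in documents]
```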
______
posterior ~= how probable a label is, given the evidence (Bayes' theorem)
posterior = prior_occurrences x likelihood / current_evidence
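Toy numeric check of that formula; all three numbers are made up:

```python
prior = 0.40       # P(negative review)                  -- made-up
likelihood = 0.30  # P(word "awful" | negative review)   -- made-up
evidence = 0.15    # P(word "awful" in any review)       -- made-up

posterior = prior * likelihood / evidence  # P(negative review | word "awful")
print(posterior)                           # 0.8 (modulo float rounding)
```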
1) create a classifier from a set of shuffled documents.
2) feed it (slightly) different documents and test accuracy.
3) see the most important "features" (words) in determining whether a review is neg or pos
- don't tell the machine the category; ask the machine to tell us, after training on featuresets built from the top 3000 words and whether they appear in neg or pos reviews
** high occurrence in one category likely means that word is an important feature for it
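Steps 1-3 as a sketch, continuing from the featuresets built in the "Words as Features" sketch above (the NaiveBayesClassifier choice follows the Bayes formula; the 1900/100 train/test split is an assumption):

```python
import nltk

# featuresets = [({word: appears?}, "pos"/"neg"), ...] from the earlier sketch
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

# 1) train a classifier on the shuffled documents
classifier = nltk.NaiveBayesClassifier.train(training_set)

# 2) test accuracy on documents the classifier has not seen
print("Accuracy:", nltk.classify.accuracy(classifier, testing_set))

# 3) the words that most strongly separate pos from neg reviews
classifier.show_most_informative_features(15)
```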