Links: PROGRAMMING - TECHNOLOGY - PYTHON
Rel:
Ref:
Tags: #public

Natural Language Toolkit


"A paragraph typically contains a main idea"

TERMS:

investor-speak vs. regular English lexicon:

______

______

Stemming: a form of data pre-processing that reduces a word to its root stem,
e.g. "riding" stem = "rid-"
where "rid-" covers "ride, riding, ridden, ..."

Useful where the meaning of the word is unchanged:
- I was taking a ride in the car.
- I was riding in the car.
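A stemming sketch with NLTK's PorterStemmer (note its actual output normalizes to "ride" rather than a bare "rid-", and irregular forms like "ridden" are left untouched — stemmers only strip suffixes):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The same stem maps related inflections together.
stems = {w: stemmer.stem(w) for w in ["ride", "riding", "rides", "ridden"]}
print(stems)
```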

______

POS tag list: (part of speech)
- CC coordinating conjunction
- CD cardinal digit
- DT determiner
- EX existential there (like: "there is" ... think of it like "there exists")
- FW foreign word
- IN preposition/subordinating conjunction
- JJ adjective 'big'
- JJR adjective, comparative 'bigger'
- JJS adjective, superlative 'biggest'
- LS list marker 1)
- MD modal could, will
- NN noun, singular 'desk'
- NNS noun plural 'desks'
- NNP proper noun, singular 'Harrison'
- NNPS proper noun, plural 'Americans'
- PDT predeterminer 'all the kids'
- POS possessive ending parent's
- PRP personal pronoun I, he, she
- PRP$ possessive pronoun my, his, hers
- RB adverb very, silently,
- RBR adverb, comparative better
- RBS adverb, superlative best
- RP particle give up
- TO to go 'to' the store.
- UH interjection errrrrrrrm
- VB verb, base form take
- VBD verb, past tense took
- VBG verb, gerund/present participle taking
- VBN verb, past participle taken
- VBP verb, sing. present, non-3d take
- VBZ verb, 3rd person sing. present takes
- WDT wh-determiner which
- WP wh-pronoun who, what
- WP$ possessive wh-pronoun whose
- WRB wh-adverb where, when

______

Next step is finding the words that modify or affect that(/those) noun(s).
Chunk into "noun phrases": a noun plus the modifiers around it.

Downside = can only use regular expressions.
see re identifiers and modifiers @ regularexpressions.py
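A chunking sketch with `nltk.RegexpParser`; the chunk grammar is a regular expression over POS tags, not words. The NP pattern here (optional determiner, any adjectives, then a noun) is one common choice, and the tagged sentence is hand-made:

```python
import nltk

# Hypothetical noun-phrase pattern: DT? JJ* NN (any noun variant).
grammar = r"NP: {<DT>?<JJ>*<NN.*>}"
parser = nltk.RegexpParser(grammar)

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
          ("dog", "NN"), ("barked", "VBD")]
tree = parser.parse(tagged)
print(tree)

# Extract just the noun-phrase chunks.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(subtree.leaves())
```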

______

NE TYPE - Examples:

ORGANIZATION - Georgia-Pacific Corp., WHO
PERSON - Eddy Bonte, President Obama
LOCATION - Murray River, Mount Everest
DATE - June, 2008-06-29
TIME - two fifty a m, 1:30 p.m.
MONEY - 175 million Canadian Dollars, GBP 10.40
PERCENT - twenty pct, 18.75 %
FACILITY - Washington Monument, Stonehenge
GPE - South East Asia, Midlothian

______

e.g. the "lemma" for run, runs, ran, and running = run
the lemmatizer's default part of speech is pos="n" (noun)

______

[ NLTK's corpora ] can be found in /Users/<name>/nltk_data/corpora/ (i.e. ~/nltk_data/corpora/)
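The search locations can be checked from Python; `nltk.data.path` lists every directory NLTK will look in, with the per-user default under `~/nltk_data`:

```python
import nltk

# Directories NLTK searches (in order) for downloaded corpora and models.
for path in nltk.data.path:
    print(path)
```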

______

Must be one of two choices/labels:
- positive or negative,
- spam or not,
- etc.

______

Words as Features for Learning:
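One common scheme: represent each document as a dict of word-presence features. A minimal sketch; `word_features` here is a tiny hypothetical list standing in for the top-N most frequent words:

```python
# Hypothetical stand-in for the top-N frequent words across the corpus.
word_features = ["great", "terrible", "boring", "fun"]

def find_features(document_words):
    """Map a document (list of words) to a feature dict of word presence."""
    words = set(document_words)
    return {f"contains({w})": (w in words) for w in word_features}

print(find_features(["a", "great", "fun", "movie"]))
```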

______

posterior = prior_occurrences x likelihood / current_evidence
(with a uniform prior and fixed evidence, posterior ~= likelihood)
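The formula can be tried with toy numbers (all probabilities below are assumed purely for illustration):

```python
# Assumed toy values: class priors and the likelihood of seeing the word
# "great" in each class.
p_pos = 0.5
p_neg = 0.5
p_great_given_pos = 0.30
p_great_given_neg = 0.05

# Evidence P("great") by total probability, then Bayes' rule.
evidence = p_great_given_pos * p_pos + p_great_given_neg * p_neg
posterior_pos = p_great_given_pos * p_pos / evidence
print(round(posterior_pos, 3))
```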

1) create classifier based on set of shuffled documents.
2) feed (slightly) different documents and test accuracy.
3) see most important "features" (words) in determining accuracy of neg and pos reviews

- don't tell the machine the category; ask the machine to tell us, after training on a feed of featuresets built from the top 3000 words and their frequency in neg or pos reviews

** high occurrence likely means importance for each
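The train/test steps above can be sketched with `nltk.NaiveBayesClassifier`; the tiny hand-made featuresets below stand in for real shuffled review documents:

```python
import nltk

# Assumed toy data: (feature_dict, label) pairs in place of the
# movie-reviews corpus.
train_set = [
    ({"contains(great)": True,  "contains(boring)": False}, "pos"),
    ({"contains(great)": True,  "contains(boring)": False}, "pos"),
    ({"contains(great)": False, "contains(boring)": True},  "neg"),
    ({"contains(great)": False, "contains(boring)": True},  "neg"),
]
test_set = [
    ({"contains(great)": True,  "contains(boring)": False}, "pos"),
    ({"contains(great)": False, "contains(boring)": True},  "neg"),
]

# 1) train on the (shuffled) documents, 2) test accuracy on held-out ones,
# 3) inspect the most informative features.
classifier = nltk.NaiveBayesClassifier.train(train_set)
acc = nltk.classify.accuracy(classifier, test_set)
print("accuracy:", acc)
classifier.show_most_informative_features(2)
```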