(Bayesian) Linguistic Analysis

I’ve found some interesting libraries (Classifier, Bishop) that will
help you classify a given phrase once you’ve trained it with a suitable
number of guided test phrases, but I’m looking for something a little
different:

I’d like to be able to train a class to return (let’s say) how happy a
phrase is. So I could train it with 100 phrases that were between -100%
happy (i.e. sad), 0% happy (neutral) and 100% happy, and then on entering
a new phrase it would return the percentage happy that phrase was.

Am I looking for Bayesian analysis? Am I missing some feature of the
Classifier class? Should I be looking elsewhere for this functionality?

Any information will be gratefully received!

Jp Hastings-spital wrote:

I’d like to be able to train a class to return (let’s say) how happy a
phrase is. So I could train it with 100 phrases that were between -100%
happy (i.e. sad), 0% happy (neutral) and 100% happy, and then on entering
a new phrase it would return the percentage happy that phrase was.

Am I looking for Bayesian analysis? Am I missing some feature of the
Classifier class? Should I be looking elsewhere for this functionality?

Well, pretty much. This is how Bayesian spam filters work: by training
against a set of messages (decomposed into words) to learn what junk
email is made (and not made) of.

What you get out at the end, when you point the classifier at a new
text, is a probability that it belongs to class ‘x’.
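
For illustration, here is a minimal sketch of that spam-filter idea
using the Classifier library mentioned at the top (the category names
and training phrases are just made up for the example):

    require 'classifier'                      # the Classifier gem

    filter = Classifier::Bayes.new 'Spam', 'Ham'

    # Train against a (tiny) set of messages; the gem decomposes them into words.
    filter.train 'Spam', 'cheap pills, buy now, limited time offer'
    filter.train 'Ham',  'the meeting has been moved to Tuesday afternoon'

    # Point the classifier at a new text and get back the most likely class.
    puts filter.classify 'buy cheap pills now'   # => "Spam"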

But in the standard classifiers you don’t do the training with ‘scores’,
merely the absence or presence of a class. How you would decide at the
outset, for training, that a phrase is 57% and not 56% happy is hard to
see - if you knew that, you’d already have your algorithm.

What you might do is train the classifier to know about a number of
emotional classes, eg:

‘ecstatic’
‘cheerful’
‘sad’
‘despairing’

They would obviously overlap, but the resulting scores (probabilities)
might then help you better distinguish everyday happy from very very
very happy.
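
A rough sketch of that idea, again assuming the Classifier gem’s Bayes
class (with far too little training data here to be meaningful):

    require 'classifier'

    mood = Classifier::Bayes.new 'Ecstatic', 'Cheerful', 'Sad', 'Despairing'

    mood.train 'Ecstatic',   'best day of my life, absolutely over the moon'
    mood.train 'Cheerful',   'had a lovely lunch with friends in the sunshine'
    mood.train 'Sad',        'feeling a bit down and lonely today'
    mood.train 'Despairing', 'everything is hopeless and nothing will improve'

    # Per-category scores for a new phrase; the classes overlap, but comparing
    # the scores may help separate everyday happy from very very happy.
    p mood.classifications 'what a lovely, wonderful day'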

a

Thanks for your reply!

I guess I’m just looking to see if it’s been done before; I’m not
worried about sitting down and working out new algorithms to do
precisely this - in fact it could be quite fun - but there’s nothing
worse than building your own and then finding there’s a Ruby class that
does just that and more!

Classifier has a ‘#categories’ method that will return the probabilities
for each category you’ve specified (in fact, the way a category is
chosen is by picking the one with the highest probability) but I don’t
see how to make it work with binary terms. With the standard ‘spam’
example you train to two categories (spam, not spam) but the
probabilities for ‘spam’ and ‘not spam’ will almost never total 100%
because of the way the algorithm works.

Of course I could always scale the results; if Phrase A is 25%
‘neutral’, 36% ‘unhappy’ and 44% ‘happy’ I can do
(0.25 * 0 + 0.36 * -1 + 0.44 * 1) / (0.25 + 0.36 + 0.44) and I’ll have
a reasonable percentage for the ‘happiness’ of that phrase.
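
As a small sketch of that scaling step (the weights and scores below are
just the hypothetical numbers from the paragraph above):

    # Happiness weight per category: unhappy = -1, neutral = 0, happy = 1.
    WEIGHTS = { 'unhappy' => -1.0, 'neutral' => 0.0, 'happy' => 1.0 }

    # Hypothetical per-category scores for Phrase A, as in the example above.
    scores = { 'neutral' => 0.25, 'unhappy' => 0.36, 'happy' => 0.44 }

    weighted = scores.inject(0.0) { |sum, (cat, score)| sum + score * WEIGHTS[cat] }
    total    = scores.inject(0.0) { |sum, (_, score)|   sum + score }

    happiness = weighted / total
    puts "#{(happiness * 100).round}% happy"   # prints "8% happy"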

Looks like it’s testing time!

Thanks again

Alex F. [email protected] writes:

Well, pretty much. This is how Bayesian spam filters work: by training
[…]
might then help you better distinguish everyday happy from very very
very happy.

The problem is that Bayes classifiers will work only at the syntactic
level.

Death was ecstatic that day and cheerfully reaped John.
His nephew had been despairing for years, longing for this sad event.

The first sentence alone would denote a very sad event (the more so if
it was preceded by a paragraph about John and how likeable he was).
The second sentence, on the other hand, denotes great happiness in the
nephew. You might not empathize with him, but that’s what he feels…

So would you be happy with a purely syntactical approach, or do you
want to extract the real meaning of the sentences?

The example I used above was a simplified one - really I’m trying to
analyze the style with which a person writes: briefly, whether I rite 2 u
like vis, or whether I’m loquacious with my vocabulary.

Although I do also hope to be able to apply what I find to a
happy/neutral/sad analysis. I will have the exact problem you’ve
described: modern slang is actually a pain to monitor, and so many words
are contranyms! If you have any advice for deeper analysis I’m all ears!

Pascal J. Bourguignon wrote:

Well, not exactly a ruby class. It’s an industrial grade application
written in Common Lisp running on distributed cluster computers (plus
an Oracle backend database and some Java code for the user interface
(including implementing a scheme in Java to let the user customize
their processing of the synthesized data)).

Wow! I appreciate your research there - it’s a bit much for my
requirements, but linguistic analysis is obviously far deeper than
counting the occurrences of different words! Fascinating field, I can
see myself getting lost in it!

Jp Hastings-spital [email protected] writes:

Thanks for your reply!

I guess I’m just looking to see if it’s been done before,

Yes, it has been done. http://www.ravenpack.com sells a commercial
product based on extracting sentiment from financial news…

I’m not worried
about sitting down and working out new algorithms to do precisely this -
in fact it could be quite fun - but there’s nothing worse than building
your own and then finding there’s a Ruby class that does just that and more!

Well, not exactly a ruby class. It’s an industrial grade application
written in Common Lisp running on distributed cluster computers (plus
an Oracle backend database and some Java code for the user interface
(including implementing a scheme in Java to let the user customize
their processing of the synthesized data)).

(Information inferred from
http://www.ravenpack.com/aboutus/employment.htm)

On Mon, Sep 14, 2009 at 9:31 AM, Jp Hastings-spital
[email protected] wrote:

see myself getting lost in it!
This “happiness” finder may be a bit of a unicorn. Good luck :)

As an example, a poster from ages – a few years – ago (for real)
sadly sang a ballad of how he (or she) gave up on trying to understand
sentiment logically through internet groups. I believe I remember
this happening over some very heated battle/thread on this list.

Actually, his/hers was more of a statement than poetry, but take it
how you want.

I’ll admit it would be fun to apply something like that to different
film dialogs.

Todd