Sorting text in a file

scudco · March 26, 2008, 6:59am

Hi, I’ve been hacking away at this all morning and getting nowhere fast.
I’m relatively new to Ruby and I’m not so hot at regex.

I’m trying to grab text data from a website that shows events and then
put each event into its own class. I figured out how to get the
screen-scraped stuff into a clean state. It’s just processing it into my
class that I’m having problems with.

Here are a few events in their natural format:

----start of file----
Toto and Boz Scaggs

Seminal American rock band with the talented blues-rock musician. Mar
21, 7pm, ¥13,000. JCB Hall, Suidobashi. Tel: Udo 03-3402-5999.

Kreva

Hip-hop track maker. Mar 21, 7pm, ¥5,000. Akasaka Blitz.

Tel: Disk Garage 03-5436-9600.

Blood Red Shoes

Rock duo from the UK. Mar 21, 7pm, ¥5,000. Shibuya Club Quattro.
Tel: Creativeman 03-3462-6969.

etcetcetc
----end----

First I grab the file into a string. As all the concerts are separated
by 4 newlines, I use

concertevents = filetext.split(/\n\n\n\n/)

to get an array of events.
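As a rough sketch of that split step: the sample text below is a shortened, hypothetical stand-in for the scraped file, and `split_events` is an illustrative helper name, not something from the post. It uses `/\n{4,}/` rather than the exact `/\n\n\n\n/`, on the assumption that a stray extra newline between events should not leave empty or newline-prefixed entries.

```ruby
# Hypothetical sample: three events, each separated by four newlines,
# as described in the post (details shortened for brevity).
SAMPLE_TEXT = "Toto and Boz Scaggs\n\nSeminal American rock band. Mar 21, 7pm." \
              "\n\n\n\n" \
              "Kreva\n\nHip-hop track maker.\n\nTel: Disk Garage 03-5436-9600." \
              "\n\n\n\n" \
              "Blood Red Shoes\n\nRock duo from the UK. Mar 21, 7pm.\n"

# Split on runs of four or more newlines, trim each chunk, and drop empties
# so a trailing newline at the end of the file does not create a bogus event.
def split_events(text)
  text.split(/\n{4,}/).map(&:strip).reject(&:empty?)
end

concertevents = split_events(SAMPLE_TEXT)
puts concertevents.length
concertevents.each { |conevt| puts conevt.lines.first }
```

Each element of the resulting array is one event, still containing its internal blank lines, ready for further processing.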

I’d then like to process these further by keeping the group name separate
from the rest of the details. So I thought I’d do

artist = conevt.slice(/[^\n]*/) #get artist info

which assumes the group name will only be on one line. Fine for this
prototype.

The details are a bit trickier, as some spill onto a second paragraph
(separated by a blank line), as in the second event. I tried

description = conevt.slice(/.*\n\n(.*\n\n.*)/,1) #get desc

Although my RegexCoach program says it works with the first event, when
I run the program, slice returns nil for description. It
definitely works for the second event, which takes up 3 lines.

So the first question is: how should I alter the above regex to make it work
for the cases above? Any hints, tips or (if you feel like it) answers are welcome.
At this stage I’m up for easier, longer ways rather than shorter, more
cryptic ones.
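One possible "easier, longer" route, sketched under assumptions: a pattern that contains two `\n\n` separators can only match an event with at least three paragraphs, which would explain the nil for two-paragraph events. The sketch assumes the intended pattern was `.*\n\n(.*\n\n.*)`, and it sidesteps the problem by splitting at the first blank line only. `parse_event` and the two sample strings are hypothetical names and data for illustration.

```ruby
# Hypothetical samples: one event with two paragraphs, one with three.
TWO_PART   = "Toto and Boz Scaggs\n\nSeminal American rock band. Mar\n21, 7pm, JCB Hall."
THREE_PART = "Kreva\n\nHip-hop track maker. Akasaka Blitz.\n\nTel: Disk Garage 03-5436-9600."

# The assumed original pattern needs two blank-line breaks, so it fails here.
p TWO_PART.slice(/.*\n\n(.*\n\n.*)/, 1)   # nil: only one "\n\n" in this event

# Split at the first blank line only (limit of 2), so everything after the
# artist line lands in the description no matter how many paragraphs follow.
# Assumes, as in the post, that the artist name occupies a single line.
def parse_event(conevt)
  artist, description = conevt.strip.split(/\n\n/, 2)
  # Collapse wrapped lines and blank lines into single spaces.
  [artist, description.to_s.gsub(/\s+/, " ")]
end

p parse_event(TWO_PART)
p parse_event(THREE_PART)
```

The limit argument to `split` is what keeps the number of paragraphs in the description from mattering.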

Second, am I going about this the right way? Should I have just avoided
regex and simply read the file line by line, using if structures to
figure out which lines belong to which event?

Does anyone know of any good resources, e.g. tutorials, on this subject,
i.e. screen scraping, cleaning the grabbed text and then processing it
into your own classes?

Wow, it’s a long post… I’ll leave it at that.

scudco · March 26, 2008, 7:41am

Adam A. wrote:

> The details are a bit trickier as some spill onto a second line (but
> separated by a blank line).

Then you should have posted an example file with all the possibilities.

> Second, am I going about this the right way?

Probably not.

> Should I have just avoided
> regex and simply read the file line by line using if structures to
> figure out which lines are with which event???

That is one way. On the website, the information is probably contained
in different html tags. So scraping the website, then joining all the
data together, then trying to separate the data is not a good plan.
You should be able to pick the pieces from the website directly.
However, you have to know html is written and how it is structured.
Ruby has several gems, e.g. Hpricot, that make it easy to pick out
pieces of information on a website, but you sort of have to know how
html in order to pick out the data you want.

If that sounds too confusing, then just deal with the text file you
have, and YES you should avoid regex’s whenever possible. So reading
the file line by line would be much easier.

scudco · March 26, 2008, 7:49am

Ack! Let’s try that again:

That is one way. On the website, the data is probably contained in
different html tags. So scraping the website, then joining all the data
together, then trying to separate the data back out again is not a very
good plan. You should be able to pick out the pieces of the data you
want directly from the html. However, you have to know how html is
written and how html is structured. Ruby has several gems, e.g.
Hpricot, that make it easy to pick out pieces of information from a page
of html.

If that sounds too confusing, then just deal with the text file you have
already, and YES you should avoid regex’s whenever possible. Reading
the file line by line would be better and probably easier.
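A minimal sketch of the line-by-line approach, under stated assumptions: events are separated by four newlines (so three blank lines in a row), the first non-blank line of each event is the artist, and everything else is detail. The `Event` struct and `read_events` are hypothetical names; the original post mentions a class but never shows it.

```ruby
# Hypothetical container for one event.
Event = Struct.new(:artist, :details)

# Hypothetical sample with two events separated by four newlines.
LINE_SAMPLE = "Toto and Boz Scaggs\n\nSeminal American rock band.\n\n\n\n" \
              "Kreva\n\nHip-hop track maker.\n\nTel: Disk Garage 03-5436-9600.\n"

# Walk the source line by line. `source` can be a String or an open File,
# since both respond to each_line.
def read_events(source)
  events = []
  current = nil
  blank_run = 0
  source.each_line do |line|
    line = line.strip
    if line.empty?
      blank_run += 1
      next
    end
    # A gap of three or more blank lines closes the current event.
    current = nil if blank_run >= 3
    blank_run = 0
    if current.nil?
      current = Event.new(line, [])   # first line of an event: the artist
      events << current
    else
      current.details << line         # any later line: part of the details
    end
  end
  events
end

read_events(LINE_SAMPLE).each do |event|
  puts "ARTIST: #{event.artist}"
  puts "DESC: #{event.details.join(' ')}"
end
```

With a real file, something like `File.open("events.txt") { |f| read_events(f) }` should work the same way, where the file name is only a placeholder.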

scudco · March 26, 2008, 7:52am

Hpricot info:

http://code.whytheluckystiff.net/hpricot/

scudco · March 27, 2008, 4:58am

Thanks very much for your replies, 7stud and Zaki. I am tinkering with
Hpricot now. I’ll see how working with the html tags in place goes.

scudco · March 26, 2008, 8:29am

Adam A. wrote:

> Hi, I’ve been hacking away at this all morning and getting nowhere fast.
> I’m relatively new to Ruby and I’m not so hot at regex.

Hi,

How about something like this quick script?
Don’t forget the /m modifier for multiline matching mode.
(It assumes that there is no newline in the artist name part, though.)

File.open('events.txt', 'r') { |f|
  contents = f.read
  # Events are separated by four newlines.
  contents.split(/\n\n\n\n/).each { |conevt|
    # First line is the artist; /m lets "." match newlines in the rest.
    if conevt =~ /([^\n]*)\n\n(.*)/m
      artist = $1
      description = $2
      print "ARTIST: #{artist}\nDESC: #{description}\n\n"
    end
  }
}
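To illustrate why the /m modifier matters here, a small sketch assuming a pattern of the form `([^\n]*)\n\n(.*)`. In Ruby, `.` does not match a newline unless /m is given, so without it the second group stops at the first line break. `description_of` and `M_SAMPLE` are hypothetical names used only for this comparison.

```ruby
# Hypothetical three-paragraph event.
M_SAMPLE = "Kreva\n\nHip-hop track maker.\n\nTel: Disk Garage 03-5436-9600."

# Return the second capture group (the description), or nil if no match.
def description_of(conevt, pattern)
  match = conevt.match(pattern)
  match && match[2]
end

# Without /m, "." stops at the first newline after the description starts.
p description_of(M_SAMPLE, /([^\n]*)\n\n(.*)/)
# With /m, "." also matches newlines, so the group runs to the end of the event.
p description_of(M_SAMPLE, /([^\n]*)\n\n(.*)/m)
```

The first call yields only the first description line; the second yields the full remainder of the event, blank line included.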

> Second, am I going about this the right way? Should I have just avoided
> regex and simply read the file line by line using if structures to
> figure out which lines are with which event???

From what I can see, I guess this is the format you have to deal with…
In this case, I believe regexps are the way to go, and you will only hurt
yourself in the long term with switch-case spaghetti.

In case you can get your hands on other formats, or if you are in
charge of creating the data in the first place, I wouldn’t recommend
using plain text at all; use a structured format instead (yaml, xml, ini,
whichever you like best). But I think that is not an option for you.
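For comparison, a sketch of what the same data might look like in a structured format, using YAML and Ruby's standard `yaml` library. The layout and field names (`artist`, `description`, `date`, `venue`) are assumptions made up for illustration; nothing in the thread specifies them.

```ruby
require "yaml"

# Hypothetical YAML layout for two of the events; field names are illustrative.
EVENTS_YAML = <<~YAML
  - artist: "Kreva"
    description: "Hip-hop track maker."
    date: "Mar 21, 7pm"
    venue: "Akasaka Blitz"
  - artist: "Blood Red Shoes"
    description: "Rock duo from the UK."
    date: "Mar 21, 7pm"
    venue: "Shibuya Club Quattro"
YAML

# Parse the YAML text into an array of hashes with string keys.
def load_events(yaml_text)
  YAML.safe_load(yaml_text)
end

load_events(EVENTS_YAML).each do |event|
  puts "#{event['artist']} at #{event['venue']} (#{event['date']})"
end
```

With the fields already separated, no regex or blank-line counting is needed to get from the file to an object per event.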

Zaki