Finding string matches, in order, in a file

bodikp · September 18, 2007, 2:41pm

Hi,
I’ve got files I want to parse. I’m using a string scan routine that
populates an array. I need to pull the entries of that array out, in
order, eventually. I’m getting an array all right, but, I don’t
understand its order. The first instance in the string, meaning the
whole file, is way down the list in the array. The first entry in the
array is an entry that’s 300 lines deep into the file. Why isn’t the
first instance in the string, the file, the first entry in the array?

xmlfile.scan(/\n<issue code="[A-Z]{3}">(.*)<\/issue>\n?/) do |match|
  codes = $1
  puts codes
end

Thanks,
Peter

bodikp · September 18, 2007, 3:16pm

Peter B. wrote:

Hi,
I’ve got files I want to parse. I’m using a string scan routine that
populates an array. I need to pull the entries of that array out, in
order, eventually. I’m getting an array all right, but, I don’t
understand its order. The first instance in the string, meaning the
whole file, is way down the list in the array. The first entry in the
array is an entry that’s 300 lines deep into the file. Why isn’t the
first instance in the string, the file, the first entry in the array?

The file may have multiple copies of some entries, and
your regexp may be botched.

xmlfile.scan(/\n<issue code="[A-Z]{3}">(.*)<\/issue>\n?/)

I don’t like the looks of that regular expression. Try this one.

/\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m

bodikp · September 18, 2007, 3:28pm

William J. wrote:

Peter B. wrote:

Hi,
I’ve got files I want to parse. I’m using a string scan routine that
populates an array. I need to pull the entries of that array out, in
order, eventually. I’m getting an array all right, but, I don’t
understand its order. The first instance in the string, meaning the
whole file, is way down the list in the array. The first entry in the
array is an entry that’s 300 lines deep into the file. Why isn’t the
first instance in the string, the file, the first entry in the array?

The file may have multiple copies of some entries, and
your regexp may be botched.

xmlfile.scan(/\n<issue code="[A-Z]{3}">(.*)<\/issue>\n?/)

I don’t like the looks of that regular expression. Try this one.

/\n<issue code="[A-Z]{3}">(.*?)<\/issue>\n?/m

Thanks, William. I tried your regex, but, I’m still getting the first
entry as one that’s 300 lines deep into the file. In fact, the results
look exactly the same to me.

bodikp · September 18, 2007, 4:31pm

Robert K. wrote:

2007/9/18, Peter B.:

Thanks, William. I tried your regex, but, I’m still getting the first
entry as one that’s 300 lines deep into the file. In fact, the results
look exactly the same to me.

Still William’s regexp is significantly better than the original one.
You seem to be processing XML files. It may be that there is some
white space between and that you are not prepared
for. You can handle that by replacing \n with \s*.

A completely different approach is to use REXML or another XML tool
and use XPath search. This is way less error prone - but usually also
slower. If you just want to extract these codes then a SAX parser
approach might still be pretty fast.

Kind regards

robert

Same old output. I’ll look into REXML. I downloaded it. But, it’s enough
for me to just learn Ruby. I don’t know if I can handle yet another
scripting language. Anyway, thanks a lot.
-Peter

bodikp · September 18, 2007, 3:56pm

2007/9/18, Peter B.:

Thanks, William. I tried your regex, but, I’m still getting the first
entry as one that’s 300 lines deep into the file. In fact, the results
look exactly the same to me.

Still William’s regexp is significantly better than the original one.
You seem to be processing XML files. It may be that there is some
white space between and that you are not prepared
for. You can handle that by replacing \n with \s*.

A completely different approach is to use REXML or another XML tool
and use XPath search. This is way less error prone - but usually also
slower. If you just want to extract these codes then a SAX parser
approach might still be pretty fast.
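
For anyone following along, here is a minimal sketch of the REXML/XPath route. The sample document, its `filing` root element, and the three-letter codes are invented for illustration, since the real file was never posted. Only `REXML::Document.new` and `REXML::XPath.match` are actual library calls.

```ruby
require "rexml/document"

# Hypothetical stand-in for the real XML file.
xml = <<~XML
  <filing>
    <issue code="TRD">Trade</issue>
    <issue code="IMM">Immigration</issue>
    <issue code="HCR">Health Issues</issue>
  </filing>
XML

doc = REXML::Document.new(xml)

# XPath.match returns the matching elements in document order.
issues = REXML::XPath.match(doc, "//issue").map { |el| el.text.strip }

puts issues
```

Because the parser handles whitespace and attribute spacing itself, none of the `\n` versus `\s*` questions come up here.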

Kind regards

robert

bodikp · September 18, 2007, 5:16pm

On Sep 18, 8:28 am, Peter B. wrote:

entry as one that’s 300 lines deep into the file. In fact, the results
look exactly the same to me.

Don’t give up yet. A regular expression is a very concentrated
piece of code, and it very often requires tweaking.

Can you show us the first entry in the file that should
be matched? That would enable us to test our reg.exps.

Some tricky points. A . won’t match a newline unless
the m modifier is at the end of the regexp.
.* will often match too much unless you make it
non-greedy by appending ? (i.e., .*?).
Sometimes it’s best to make the regexp case-insensitive
by using the i modifier.
You may assume that your text will always have
<issue code=
but perhaps it has
<issue code =

Try this:

%q{

I’m Issue XIV,
who are you?

I’m Issue XX, are you?

}.scan(
/\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
p $1
}
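
The greedy versus non-greedy point and the effect of the m modifier are easier to see in isolation. This is only a sketch on a made-up two-entry string (the codes AAA and BBB are placeholders), not the real file:

```ruby
# Two hypothetical entries; the second one spans two lines.
text = %q{<issue code="AAA">first</issue>
<issue code="BBB">second
entry</issue>}

# Greedy .* with /m runs on to the LAST </issue>, so everything is one match.
greedy = text.scan(/<issue code="[A-Z]{3}">(.*)<\/issue>/m).flatten

# Non-greedy .*? stops at the first </issue> it can reach.
non_greedy = text.scan(/<issue code="[A-Z]{3}">(.*?)<\/issue>/m).flatten

# Without /m, "." cannot cross a newline, so the two-line entry is skipped.
no_m = text.scan(/<issue code="[A-Z]{3}">(.*?)<\/issue>/).flatten

p greedy.size      # 1
p non_greedy       # ["first", "second\nentry"]
p no_m             # ["first"]
```

In all three cases `scan` reports matches in the order they occur in the string; the patterns only change which matches are found.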

bodikp · September 18, 2007, 5:27pm

William J. wrote:

On Sep 18, 8:28 am, Peter B. wrote:

entry as one that’s 300 lines deep into the file. In fact, the results
look exactly the same to me.

Don’t give up yet. A regular expression is a very concentrated
piece of code, and it very often requires tweaking.

Can you show us the first entry in the file that should
be matched? That would enable us to test our reg.exps.

Some tricky points. A . won’t match a newline unless
the m modifier is at the end of the regexp.
.* will often match too much unless you make it
non-greedy by appending ? (i.e., .*?).
Sometimes it’s best to make the regexp case-insensitive
by using the i modifier.
You may assume that your text will always have
<issue code=
but perhaps it has
<issue code =

Try this:

%q{

I’m Issue XIV,
who are you?

I’m Issue XX, are you?

}.scan(
/\s*<issue +code *= *"[A-Z]{3}" *>(.*?)<\/issue>/m){
p $1
}

Believe me, I haven’t given up. I need this to work! I really appreciate
your perseverance, though. Here’s what I have now:

xmlfile.scan(/\s*<issue +code *="[A-Z]{3}
">(.*?)<\/issue>\n?/mi) do |match|
codes = $1
puts codes
end

My xml file that I’m testing is 2087 lines deep. The first entry in this
file is on lines 21-23. Here they are:

Trade (Domestic & Foreign)

So, these words, “Trade (Domestic & Foreign)” should be my first
entry in my array. But, it continues to come up with the word
“Immigration” as the first entry in the array, and that’s way down on
line 358.

Thanks,
Peter

bodikp · September 18, 2007, 5:22pm

2007/9/18, Peter B.:

white space between and that you are not prepared

Same old output. I’ll look into REXML. I downloaded it.

It’s part of the standard distribution.

But, it’s enough
for me to just learn Ruby. I don’t know if I can handle yet another
scripting language. Anyway, thanks a lot.

Well, as William said: can you show a piece of the document you are
trying to match?

Kind regards

robert

bodikp · September 18, 2007, 6:02pm

On Sep 18, 10:27 am, Peter B. wrote:

be matched? That would enable us to test our reg.exps.
<issue code =

Trade (Domestic & Foreign)

So, these words, “Trade (Domestic & Foreign)” should be my first
entry in my array. But, it continues to come up with the word
“Immigration” as the first entry in the array, and that’s way down on
line 358.

Thanks,
Peter

During the posting process, your regexp was broken into
2 lines; when I corrected that, it worked.

Here I’ve slightly shortened it.

%q{

I’m Issue XIV,
who are you?

I’m Issue XX, are you?

Trade (Domestic & Foreign)

}.scan(
/\s*<issue +code="[A-Z]{3}">(.*?)<\/issue>/m){
p $1
}

==== output ====
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
==== end of output ====

If this still won’t work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren’t?
Perhaps this would be worth a try:
/[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m
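
One way to check for characters that look like whitespace but aren't is to list everything outside printable ASCII. The line below is a made-up example with a non-breaking space in front and a stray carriage return at the end; the real file's contents are unknown, so treat this as a sketch of the technique only.

```ruby
# Hypothetical line: U+00A0 (non-breaking space) before the tag, "\r\n" at the end.
line = "\u00A0<issue code=\"TRD\">Trade</issue>\r\n"

# Keep every character that is neither printable ASCII nor a plain "\n".
suspects = line.each_char.with_index.select { |ch, _i| ch.match?(/[^\x20-\x7E\n]/) }

suspects.each { |ch, i| puts format("offset %d: U+%04X", i, ch.ord) }

# p uses String#inspect, which also makes escapes such as \r visible.
p line
```

Note that Ruby's `\s` only covers ASCII whitespace, so a non-breaking space like the one above would not be matched by `\s*`.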

bodikp · September 18, 2007, 8:26pm

If this still won’t work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren’t?
Perhaps this would be worth a try:
/[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m

Still no go, William. I tried your last phrase there, too.

bodikp · September 19, 2007, 2:03pm

William J. wrote:

On Sep 18, 1:26 pm, Peter B. wrote:

If this still won’t work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren’t?
Perhaps this would be worth a try:
/[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m

Still no go, William. I tried your last phrase there, too.

You’ve got to track down what’s going on.
Copy and paste the code below into a file.
(Don’t even think about typing it in.)
Run the file. Is this the output?

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"

If it is, open both the Ruby file and the xml
file with the same editor; copy the first desired
entry from the xml file and paste it at the bottom
of the big string in the Ruby file. Even though
it looks like an exact duplicate of what’s already
in the string, maybe it differs somehow.
The output should now be:

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
"Trade (Domestic & Foreign)"

If it isn’t, edit the entry that you just pasted
into the big Ruby string; delete spaces and
line-endings and replace them with new spaces and
line-endings. (Perhaps the xml file has some
bizarre invisible characters.)

%q{

I’m Issue XIV,
who are you?

I’m Issue XX, are you?

Trade (Domestic & Foreign)
}.scan(
# Using extended-mode regular expression for clarity.
# Whitespace and comments are ignored.
%r{
\s*
<issue[ \t]+code[ \t]*=[ \t]*"[^"]*"[ \t]*>
(.*?)
</issue>
}xmi
){ p $1 }

I’m getting exactly what you predict. And, . . ., perhaps I haven’t made
this clear, but, I am getting healthy output from my script. I’m getting
208 lines of data. Each line is an entry between the entries
I’ve described. They’re just totally out of order! That’s my problem.
Here are the first 5 entries I get back:
Immigration
Health Issues
Copyright/Patent/Trademark
Budget/Appropriations
Health Issues
…

bodikp · September 19, 2007, 1:12am

On Sep 18, 1:26 pm, Peter B. wrote:

If this still won’t work on your file, could the file
be contaminated with some non-displaying characters
that appear to be whitespace but aren’t?
Perhaps this would be worth a try:
/[^<>]*<issue\W+code="[A-Z]{3}">(.*?)<\/issue>/m

Still no go, William. I tried your last phrase there, too.

You’ve got to track down what’s going on.
Copy and paste the code below into a file.
(Don’t even think about typing it in.)
Run the file. Is this the output?

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"

If it is, open both the Ruby file and the xml
file with the same editor; copy the first desired
entry from the xml file and paste it at the bottom
of the big string in the Ruby file. Even though
it looks like an exact duplicate of what’s already
in the string, maybe it differs somehow.
The output should now be:

"\nI'm Issue XIV,\nwho are you?\n"
"\nI'm Issue XX, are you?\n"
"Trade (Domestic & Foreign)"
"Trade (Domestic & Foreign)"

If it isn’t, edit the entry that you just pasted
into the big Ruby string; delete spaces and
line-endings and replace them with new spaces and
line-endings. (Perhaps the xml file has some
bizarre invisible characters.)

%q{

I’m Issue XIV,
who are you?

I’m Issue XX, are you?

Trade (Domestic & Foreign)

}.scan(
# Using extended-mode regular expression for clarity.
# Whitespace and comments are ignored.
%r{
\s*
<issue[ \t]+code[ \t]*=[ \t]*"[^"]*"[ \t]*>
(.*?)
</issue>
}xmi
){ p $1 }

bodikp · September 19, 2007, 4:40pm

On Sep 19, 7:03 am, Peter B. wrote:

…
Very odd. “scan” will return the strings in the order that
they are found.

How did your program read the file? Could its contents have
been disordered somehow? After your program reads the file
into a string, have it write the string to a temp file and
then use fc or diff to compare the 2 files.

“Health Issues” appears twice; I presume the file contains
two entries.

You’ve probably had your editor do a string search to verify
that “Immigration” isn’t the first entry in the file.

bodikp · September 19, 2007, 8:11pm

On Sep 19, 11:20 am, Peter B. wrote:

then use fc or diff to compare the 2 files.
Before I do this scan, I do a sweep where I delete all the extra white
space at the beginning of each line. There’s a lot of it there. I’ve
tried this without that sweep, but, I get the same results, especially
after, with your help, I was more generic in my definition of the white
space around these entries.

I did a write of this array as a string to a file. When I pulled up the
file, it looks exactly like the output of my script.

“Health Issues” appears twice in your output. Returning to my original
hypothesis, are you certain “Trade” doesn’t occur twice in the file?
If it occurs twice, and the reg.ex. fails to match the first instance
but matches the second, it may appear to you that “Trade” is being
output out of order, when in fact the first occurrence is simply
missing.
If there are more copies of each entry in the file than you suspect,
a faulty reg.ex. will make it seem that the output is out of order.

bodikp · September 19, 2007, 6:20pm

William J. wrote:

On Sep 19, 7:03 am, Peter B. wrote:

…
Very odd. “scan” will return the strings in the order that
they are found.

How did your program read the file? Could its contents have
been disordered somehow? After your program reads the file
into a string, have it write the string to a temp file and
then use fc or diff to compare the 2 files.

“Health Issues” appears twice; I presume the file contains
two entries.

You’ve probably had your editor do a string search to verify
that “Immigration” isn’t the first entry in the file.

Yes, as I said yesterday, the word “Immigration” is on line 300+, and,
the first entry that should be seen is the “Trade” one.

Before I do this scan, I do a sweep where I delete all the extra white
space at the beginning of each line. There’s a lot of it there. I’ve
tried this without that sweep, but, I get the same results, especially
after, with your help, I was more generic in my definition of the white
space around these entries.

I did a write of this array as a string to a file. When I pulled up the
file, it looks exactly like the output of my script.

bodikp · September 19, 2007, 8:20pm

On Sep 19, 1:06 pm, William J. wrote:

been disordered somehow? After your program reads the file
the first entry that should be seen is the “Trade” one.
“Health Issues” appears twice in your output. Returning to my original
hypothesis, are you certain “Trade” doesn’t occur twice in the file?
If it occurs twice, and the reg.ex. fails to match the first instance
but matches the second, it may appear to you that “Trade” is being
output out of order, when in fact the first occurrence is simply
missing.
If there are more copies of each entry in the file than you suspect,
a faulty reg.ex. will make it seem that the output is out of order.

A way to see if the reg.ex. is matching all entries.
grep -c '<issue ' thefile.xml
If this is more than the 208 lines output by the Ruby program,
then the reg.ex. is probably failing in some cases.
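
The same count check can be done from inside Ruby, which also shows which lines the reg.ex. skips. This is a rough sketch on a made-up three-entry string; the codes and the spacing variations are assumptions chosen to trip up a strict pattern.

```ruby
# Hypothetical sample: only the first entry fits the strict pattern.
xml = %q{
<issue code="TRD">Trade</issue>
<issue code = "IMM">Immigration</issue>
<issue code="hcr">Health Issues</issue>
}

strict = /<issue code="[A-Z]{3}">(.*?)<\/issue>/m

opening_tags = xml.scan(/<issue\b/).size   # what grep -c would roughly count
matched      = xml.scan(strict).size       # what the script actually extracts

# Lines that contain an opening tag but are not matched by the strict pattern.
missed = xml.lines.grep(/<issue\b/).reject { |l| l.match?(strict) }.map(&:strip)

puts "opening tags: #{opening_tags}, matched: #{matched}"
puts missed
```

The line-by-line `missed` list only works when each entry sits on one line; for entries that span lines, comparing the two counts is the more reliable signal.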

bodikp · September 20, 2007, 2:01pm

William,
So, I found out what my problem was. And, yes, it’s kind of embarrassing.
It turns out that, at the top of my script, which I hadn’t looked at in
days, I was actually parsing through multiple files, not one file. I was
looking at multiple xml files, not just the one.

So, I apologize to you. I really appreciate your doggedness in helping
me. You have the patience of Job. This forum’s generosity astounds me,
and, you’re a perfect example of why.

Cheers,
Peter
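
For anyone who hits the same thing: when a script loops over several files, tagging each match with the file it came from makes this kind of mix-up visible right away. Below is a minimal sketch. The helper name `issues_by_file`, the file names, and the codes are all made up, and the temp directory just stands in for the real folder of xml files.

```ruby
require "tmpdir"

ISSUE_RE = /<issue\s+code\s*=\s*"[A-Z]{3}"\s*>(.*?)<\/issue>/m

# Returns [file name, entry text] pairs, files in sorted order,
# entries in the order scan finds them within each file.
def issues_by_file(paths)
  paths.sort.flat_map do |path|
    File.read(path).scan(ISSUE_RE).flatten.map { |text| [File.basename(path), text.strip] }
  end
end

results = Dir.mktmpdir do |dir|
  # Two hypothetical input files.
  File.write(File.join(dir, "a.xml"), %q{<issue code="IMM">Immigration</issue>})
  File.write(File.join(dir, "b.xml"), %q{<issue code="TRD">Trade</issue>})
  issues_by_file(Dir.glob(File.join(dir, "*.xml")))
end

results.each { |file, text| puts "#{file}: #{text}" }
```

With the file name printed next to each entry, an "Immigration" that comes first is immediately seen to belong to a different file than the one being inspected.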

bodikp · September 20, 2007, 4:51pm

On Sep 20, 7:01 am, Peter B. wrote:

William,
So, I found out what my problem was.

Good to hear. I hate to see bugs like this unsquashed.

bodikp · September 19, 2007, 10:10pm

William J. wrote:

On Sep 19, 1:06 pm, William J. wrote:

been disordered somehow? After your program reads the file
the first entry that should be seen is the “Trade” one.
“Health Issues” appears twice in your output. Returning to my original
hypothesis, are you certain “Trade” doesn’t occur twice in the file?
If it occurs twice, and the reg.ex. fails to match the first instance
but matches the second, it may appear to you that “Trade” is being
output out of order, when in fact the first occurrence is simply
missing.
If there are more copies of each entry in the file than you suspect,
a faulty reg.ex. will make it seem that the output is out of order.

A way to see if the reg.ex. is matching all entries.
grep -c '<issue ' thefile.xml
If this is more than the 208 lines output by the Ruby program,
then the reg.ex. is probably failing in some cases.

Yes, when I grep my original xml file, I see 229 entries. But, with my
Ruby script, I only see 208. But, yes, many of these entries do repeat.
I’ll look more closely at that first entry, the Trade one, to see what
might be different about it. Thanks.