Reg exp

hai_l · December 4, 2008, 3:40am

I have a problem with a regexp. I have a document like this:

A method for detecting a post-translationally modified protein with a
glycosyl group comprising contacting the protein with a glycosyl
transferase enzyme and a labeling agent, wherein the labeling agent
comprises a chemical handle and a transferable glycosyl group.

I want to divide it into strings, following the rule that each string
starts with “a”, “an”, or “the”, like this:

"A method for detecting "
"a post-translationally modified protein with "
"a glycosyl group comprising contacting "
"the protein with "
"a glycosyl transferase enzyme and "
"a labeling agent, wherein "
"the labeling agent comprises "
"a chemical handle and "
"a transferable glycosyl group."

I use the code

while element.size do
  if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
    temp_string = $1 + $2
    temp_array << temp_string
  else
    break
  end
  element.slice!(temp_string)
end

but it’s not right. The result is:

A method
a post-translationally
a glycosyl
the protein
a glycosyl
a labeling
the labeling
a chemical
a transferable

Can anyone help me with the code?

Hai Anh
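
An editorial sketch of why the loop above returns only two words per piece: `( [^ ]+)` matches one space followed by a single run of non-space characters, and the optional trailing group is allowed to match nothing, so every hit ends after the first word. One possible alternative, assuming `String#scan` with a lazy quantifier and a lookahead, is shown next to it. The helper name `article_pieces` and the variable names are illustrative and not part of the original post.

```ruby
# Hypothetical helper (not from the post): each piece starts at a standalone
# article and runs lazily up to the next standalone article or the end.
def article_pieces(text)
  text.scan(/\b(?:an?|the)\b.*?(?=\b(?:an?|the)\b|\z)/im)
end

sentence = "A method for detecting a post-translationally modified protein " \
           "with a glycosyl group comprising contacting the protein with a " \
           "glycosyl transferase enzyme and a labeling agent, wherein the " \
           "labeling agent comprises a chemical handle and a transferable " \
           "glycosyl group."

# The pattern from the post: group 2 is one space plus ONE run of non-spaces.
original = /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
m = sentence.match(original)
puts m[1] + m[2]   # prints "A method", the first line of the reported result

# The scan-based sketch keeps everything up to the next article.
article_pieces(sentence).each { |piece| puts piece.inspect }
```

The `\b` word boundaries are there so that words such as "and" or "agent" are not mistaken for articles.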

hai_l · December 4, 2008, 5:08am

Hai anh Le wrote:

while element.size do
  if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
    temp_string = $1 + $2
    temp_array << temp_string
  else
    break
  end
  element.slice!(temp_string)
end

if you like to walk through the string manually, then StringScanner works
best.

otoh, you can also try String#split.

but my initial reaction was just to find those articles and then mark them with
newlines (maybe because i’m fond of programming with paper and pencil).

eg,

puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}

A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.
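
The StringScanner route mentioned at the top of this reply was not spelled out, so here is a minimal sketch of what it might look like, assuming the standard `strscan` library. The helper name `pieces_with_scanner` is made up for illustration, and the pattern adds `\b` word boundaries that the gsub one-liner above does not use.

```ruby
require 'strscan'

# Hypothetical helper: walk the string by hand, cutting before each
# standalone article ("a", "an", "the", any case).
def pieces_with_scanner(text)
  article      = /\b(?:an?|the)\b/i
  next_article = /(?=#{article})/       # zero-width: stops just before it
  scanner = StringScanner.new(text)
  pieces  = []
  until scanner.eos?
    head = scanner.scan(article) || ""  # article at the cursor, if any
    body = scanner.scan_until(next_article)
    if body.nil?                        # no further article: take the rest
      body = scanner.rest
      scanner.terminate
    end
    pieces << head + body
  end
  pieces
end

puts pieces_with_scanner("A method for detecting a protein with the enzyme.").inspect
```

Compared with gsub this is more code, but it leaves room to do extra work at each cut, for example recording offsets.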

hai_l · December 4, 2008, 7:19pm

On Thu, Dec 4, 2008 at 3:34 AM, Hai anh Le wrote:

"A method for detecting "
"a post-translationally modified protein with "
"a glycosyl group comprising contacting "
"the protein with "
"a glycosyl transferase enzyme and "
"a labeling agent, wherein "
"the labeling agent comprises "
"a chemical handle and "
"a transferable glycosyl group."

I thought of split, but then you get an array entry for the “separator”
and the following part, so you would need to paste them again
yourself:

irb(main):008:0> a = "A method for detecting a post-translationally
modified protein with a glycosyl group comprising contacting the
protein with a glycosyl transferase enzyme and a labeling agent,
wherein the labeling agent comprises a chemical handle and a
transferable glycosyl group."
=> "A method for detecting a post-translationally modified protein
with a glycosyl group comprising contacting the protein with a
glycosyl transferase enzyme and a labeling agent, wherein the labeling
agent comprises a chemical handle and a transferable glycosyl group."
irb(main):011:0> require 'enumerator'
=> true
irb(main):012:0> result = []
=> []
irb(main):015:0> a.split(/\b(a|an|the)\b/i)[1..-1].each_slice(2) {|a,
b| result << (a+b)}
=> nil
irb(main):016:0> result
=> ["A method for detecting ", "a post-translationally modified
protein with ", "a glycosyl group comprising contacting ", "the
protein with ", "a glycosyl transferase enzyme and ", "a labeling
agent, wherein ", "the labeling agent comprises ", "a chemical handle
and ", "a transferable glycosyl group."]

Maybe someone can come up with a split solution that is easier to join?

Hope this helps,

Jesus.
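
For readers who want the irb session above as a script, here is a rough sketch of the same idea: split with a capturing group so the articles are kept, then glue each article back onto the text that follows it. The method name `split_keep_articles` is invented for this example, and it assumes a Ruby where `each_slice` without a block returns an enumerator (1.8.7 or later).

```ruby
# Hypothetical helper: String#split keeps captured separators in its result,
# so the array alternates between articles and the text after them.
def split_keep_articles(text)
  parts = text.split(/\b(an?|the)\b/i)
  lead  = parts.shift.to_s                 # text before the first article ("" if none)
  pieces = parts.each_slice(2).map(&:join) # [article, following text] -> one string
  lead.empty? ? pieces : [lead] + pieces
end

puts split_keep_articles("A method for detecting a protein with the enzyme.").inspect
```

Handling the leading element separately means the pairing still lines up when the input does not begin with an article.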

hai_l · December 4, 2008, 7:28am

Peña, Botp wrote:

puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}

A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.

Thanks for your help, this is another way to do it. I think I can use "\n" as
a marker and then split the text on it.
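
A minimal sketch of that mark-then-divide idea, assuming the input has no newlines of its own (otherwise they would also act as cut points). The helper name `split_via_newline_marks` is hypothetical, and the pattern adds `\b` word boundaries to the quoted one so that only standalone articles are marked.

```ruby
# Hypothetical helper: put "\n" in front of every standalone article,
# then split on that marker.
def split_via_newline_marks(text)
  marked = text.gsub(/\b(an?|the)\b\W+/i) { "\n#{$1} " }
  # A leading article produces an empty first entry, so drop empties.
  marked.split("\n").reject(&:empty?)
end

puts split_via_newline_marks("A method for detecting a protein with the enzyme.").inspect
```

Note that `\W+` swallows whatever non-word characters follow the article and the block puts back a single space.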

hai_l · December 4, 2008, 8:16pm

Hi –

On Fri, 5 Dec 2008, Jesús Gabriel y Galán wrote:

I thought of split, but then you get an array entry for the “separator”
and the following part, so you would need to paste them again
yourself:

Maybe someone can come up with a split solution that is easier to join?

Try this:

a.split(/(?=an?\b|the\b)/i)

David
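
A small sketch of the lookahead split in use. The pattern as posted has no word boundary in front of the article, so it can also cut inside a word that merely ends in "a", "an" or "the"; the variant below adds a leading `\b` as one possible guard. The method name `split_before_articles` is illustrative only.

```ruby
# Hypothetical variant of the split above with a leading \b, so the cut
# only happens before a standalone article.
def split_before_articles(text)
  text.split(/(?=\b(?:an?|the)\b)/i)
end

puts split_before_articles("A method for detecting a protein with the enzyme.").inspect

# Difference on a word that ends in "a":
puts "the data set".split(/(?=an?\b|the\b)/i).inspect   # cuts inside "data"
puts split_before_articles("the data set").inspect      # leaves it whole
```

On the sample sentence from this thread both patterns give the same pieces, since none of its words end in an article.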

hai_l · December 4, 2008, 11:13pm

On Thu, Dec 4, 2008 at 8:09 PM, David A. Black wrote:

Try this:

a.split(/(?=an?\b|the\b)/i)

Neat, thanks. I understand how lookaheads work when it comes to
matching a regex,
but it’s still not clear to me why split works with the lookahead.
Isn’t the match of the lookahead a 0-width string?

irb(main):001:0> a = "abcxxxabcxxxabcxxx"
=> "abcxxxabcxxxabcxxx"
irb(main):006:0> a.match(/(?=abc)/)[0]
=> ""
irb(main):007:0> a =~ /(?=abc)/
=> 0

Then, how does split know where to start the next search?

Thanks,

Jesus.
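
The thread ends on that question, so here is a small experiment that may help. As far as I understand it, `String#split` (like `String#scan`) resumes its search one character past an empty match, which is what keeps it from finding the same zero-width match forever; each piece then runs from the previous cut to the position of the next match, and an empty match at the very start produces no leading empty piece. The helper name `match_positions` is made up for this sketch.

```ruby
# Hypothetical helper: list where a pattern matches, using scan, which
# steps one character past each empty match before searching again.
def match_positions(str, pattern)
  positions = []
  str.scan(pattern) { positions << Regexp.last_match.begin(0) }
  positions
end

s = "abcxxxabcxxxabcxxx"

puts match_positions(s, /(?=abc)/).inspect   # offsets of the zero-width matches
puts s.split(/(?=abc)/).inspect              # cuts at those offsets, except offset 0
puts "abc".split(//).inspect                 # an always-empty match cuts between characters
```

So the lookahead never consumes text. It only tells split where to cut, and the articles stay attached to the piece that follows them.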