Forum: Ruby reg exp

Hai anh Le (dikdikdik)
on 2008-12-04 03:40
I have a problem with a regexp. I have a document like this:

A method for detecting a post-translationally modified protein with a
glycosyl group comprising contacting the protein with a glycosyl
transferase enzyme and a labeling agent, wherein the labeling agent
comprises a chemical handle and a transferable glycosyl group.

I want to divide it into strings, following the rule that each string
starts with an article ("a", "an" or "the"), like this:

"A method for detecting "
"a post-translationally modified protein with "
"a glycosyl group comprising contacting "
"the protein with "
"a glycosyl transferase enzyme and "
"a labeling agent, wherein "
"the labeling agent comprises "
"a chemical handle and "
"a transferable glycosyl group."


I use this code:

while element.size do
    if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
then
      temp_string = $1 +$2
      temp_array << temp_string
    else
      break
    end
    element.slice!(temp_string)
end


But it does not work. The result is:

    A method
    a post-translationally
    a glycosyl
    the protein
    a glycosyl
    a labeling
    the labeling
    a chemical
    a transferable

Can anyone help me with the code?

Hai Anh
Peña, Botp (Guest)
on 2008-12-04 05:08
(Received via mailing list)
From: Hai anh Le [mailto:lhanh@wichiptech.com]
# while element.size do
#     if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
# then
#       temp_string = $1 +$2
#       temp_array << temp_string
#     else
#       break
#     end
#     element.slice!(temp_string)
# end

If you'd like to walk through the string manually, then StringScanner
works best.

OTOH, you can also try String#split.

But my initial reaction was just to find those articles and then mark
them with newlines (maybe because I'm fond of programming with paper and
pencil :)

eg,

>puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}

A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.
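
The StringScanner route mentioned above could look roughly like the sketch below. It is only an illustration: the helper name chunks_by_article and the two constants are made up here and do not come from the thread, and the pattern adds \b word boundaries so that an "a" inside a longer word is not treated as an article.

```ruby
require 'strscan'

ARTICLE      = /\b(?:an?|the)\b/i       # a standalone article
NEXT_ARTICLE = /(?=\b(?:an?|the)\b)/i   # zero-width: the spot just before one

# Hypothetical helper (not from the thread): cut the text before each article.
def chunks_by_article(text)
  scanner = StringScanner.new(text)
  chunks = []
  until scanner.eos?
    start = scanner.pos
    scanner.scan(ARTICLE)               # step over the article we are on, if any
    # Run up to the next article; if there is none, run to the end.
    scanner.terminate unless scanner.skip_until(NEXT_ARTICLE)
    # pos is a byte offset, so slice by bytes.
    chunks << text.byteslice(start, scanner.pos - start)
  end
  chunks
end

s = "A method for detecting a post-translationally modified protein with " \
    "a glycosyl group comprising contacting the protein with a glycosyl " \
    "transferase enzyme and a labeling agent, wherein the labeling agent " \
    "comprises a chemical handle and a transferable glycosyl group."

chunks = chunks_by_article(s)
puts chunks.map(&:inspect)
```

Each chunk keeps its trailing space, so joining the chunks gives back the original string.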
Hai anh Le (dikdikdik)
on 2008-12-04 07:28
Peña, Botp wrote:

>
>>puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}
>
> A method for detecting
> a post-translationally modified protein with
> a glycosyl group comprising contacting
> the protein with
> a glycosyl transferase enzyme and
> a labeling agent, wherein
> the labeling agent comprises
> a chemical handle and
> a transferable glycosyl group.

Thanks for your help; this is another way to do it. I think I can use
"\n" as a marker and then split on it.
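
The mark-then-divide idea could be sketched as follows. It assumes the input contains no newlines of its own, uses a shortened sample sentence rather than the full document, and adds \b word boundaries to the pattern, which the gsub quoted above does not have.

```ruby
text = "A method for detecting a modified protein with the labeling agent."

# Put a newline in front of every standalone article ...
marked = text.gsub(/\b(an?|the)\b/i) { "\n#{$1}" }

# ... then cut on the newlines and drop the empty piece before the first one.
pieces = marked.split("\n").reject(&:empty?)

puts pieces.map(&:inspect)
# "A method for detecting "
# "a modified protein with "
# "the labeling agent."
```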
Jesús Gabriel y Galán (Guest)
on 2008-12-04 19:19
(Received via mailing list)
On Thu, Dec 4, 2008 at 3:34 AM, Hai anh Le <lhanh@wichiptech.com> wrote:
> "A method for detecting "
> "a post-translationally modified protein with "
> "a glycosyl group comprising contacting "
> "the protein with "
> "a glycosyl transferase enzyme and "
> "a labeling agent, wherein "
> "the labeling agent comprises "
> "a chemical handle and "
> "a transferable glycosyl group."

I thought of split, but then you get separate array entries for the
"separator" and for the part that follows it, so you would need to
paste them together again yourself:

irb(main):008:0> a = "A method for detecting a post-translationally
modified protein with a glycosyl group comprising contacting the
protein with a glycosyl transferase enzyme and a labeling agent,
wherein the labeling agent comprises a chemical handle and a
transferable glycosyl group."
=> "A method for detecting a post-translationally modified protein
with a glycosyl group comprising contacting the protein with a
glycosyl transferase enzyme and a labeling agent, wherein the labeling
agent comprises a chemical handle and a transferable glycosyl group."
irb(main):011:0> require 'enumerator'
=> true
irb(main):012:0> result = []
=> []
irb(main):015:0> a.split(/\b(a|an|the)\b/i)[1..-1].each_slice(2) {|a,
b| result << (a+b)}
=> nil
irb(main):016:0> result
=> ["A method for detecting ", "a post-translationally modified
protein with ", "a glycosyl group comprising contacting ", "the
protein with ", "a glycosyl transferase enzyme and ", "a labeling
agent, wherein ", "the labeling agent comprises ", "a chemical handle
and ", "a transferable glycosyl group."]
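
On a current Ruby the same pairing can be written more compactly. This is just a sketch on a shortened sample: each_slice has been part of core Enumerable since Ruby 1.8.7, so the require is not needed there, and drop(1) assumes the text begins with an article (otherwise the first element is real text and would be lost).

```ruby
text = "A method for detecting a modified protein with the labeling agent."

# Because the pattern has a capture group, split keeps each article as its
# own element: ["", "A", " method for detecting ", "a", ...]
parts = text.split(/\b(an?|the)\b/i)

# Drop the leading empty string, then glue each article to the text after it.
result = parts.drop(1).each_slice(2).map(&:join)

puts result.map(&:inspect)
# "A method for detecting "
# "a modified protein with "
# "the labeling agent."
```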

Maybe someone can come up with a split solution that is easier to join?

Hope this helps,

Jesus.
David A. Black (Guest)
on 2008-12-04 20:16
(Received via mailing list)
Hi --

On Fri, 5 Dec 2008, Jesús Gabriel y Galán wrote:

> I thought of split, but then you get an array entry for the "separator"
> [...]
> Maybe someone can come up with a split solution easier to join?

Try this:

   a.split(/(?=an?\b|the\b)/i)
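
One caveat, offered as a sketch rather than a correction: the lookahead above has no word boundary in front of the article, so on other input it can also cut inside words that merely end in "a", "an" or "the". The sample sentence and the names LOOSE and STRICT below are invented for illustration.

```ruby
LOOSE  = /(?=an?\b|the\b)/i        # the pattern from the post above
STRICT = /(?=\b(?:an?|the)\b)/i    # same idea, with a leading word boundary

sample = "Data from a scan the lab ran"

loose_pieces  = sample.split(LOOSE)    # also cuts inside "Data", "scan" and "ran"
strict_pieces = sample.split(STRICT)

puts strict_pieces.map(&:inspect)
# "Data from "
# "a scan "
# "the lab ran"
```

For the document in this thread both patterns give the same nine strings, because none of its words ends in "a", "an" or "the".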


David
Jesús Gabriel y Galán (Guest)
on 2008-12-04 23:13
(Received via mailing list)
On Thu, Dec 4, 2008 at 8:09 PM, David A. Black <dblack@rubypal.com>
wrote:

> Try this:
>
>  a.split(/(?=an?\b|the\b)/i)

Neat, thanks. I understand how lookaheads work when it comes to
matching a regex, but it's still not clear to me why split works with
the lookahead. Isn't the match of the lookahead a 0-width string?

irb(main):001:0> a = "abcxxxabcxxxabcxxx"
=> "abcxxxabcxxxabcxxx"
irb(main):006:0> a.match(/(?=abc)/)[0]
=> ""
irb(main):007:0> a =~ /(?=abc)/
=> 0

Then, how does split know where to start the next search?
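
A small experiment on the same string suggests an answer. As far as I understand it, after an empty match split (like scan) moves one character forward before searching again, so it never gets stuck on the same spot, and an empty match at the very start of the string produces no leading empty piece. The snippet below only shows the observable behaviour and says nothing about how the interpreter implements it.

```ruby
s = "abcxxxabcxxxabcxxx"

# Every position where the zero-width lookahead matches.
offsets = []
s.scan(/(?=abc)/) { offsets << Regexp.last_match.begin(0) }
p offsets                    # => [0, 6, 12]

# split cuts at those positions; the match at offset 0 adds nothing in front.
pieces = s.split(/(?=abc)/)
p pieces                     # => ["abcxxx", "abcxxx", "abcxxx"]

# The extreme case: an empty pattern matches between all characters.
letters = "abc".split(//)
p letters                    # => ["a", "b", "c"]
```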

Thanks,

Jesus.