I have a problem with a regexp. I have a document like this:
A method for detecting a post-translationally modified protein with a
glycosyl group comprising contacting the protein with a glycosyl
transferase enzyme and a labeling agent, wherein the labeling agent
comprises a chemical handle and a transferable glycosyl group.
I want to divide it into strings, following the rule that each string
starts with “a”, “an”, or “the”, like this:
"A method for detecting "
"a post-translationally modified protein with "
"a glycosyl group comprising contacting "
"the protein with "
"a glycosyl transferase enzyme and "
"a labeling agent, wherein "
"the labeling agent comprises "
"a chemical handle and "
"a transferable glycosyl group."
I use the code
while element.size do
  if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
    temp_string = $1 + $2
    temp_array << temp_string
  else
    break
  end
  element.slice!(temp_string)
end
But it does not work as intended. The result is:
A method
a post-translationally
a glycosyl
the protein
a glycosyl
a labeling
the labeling
a chemical
a transferable
A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.
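The two-word results in the question are what that pattern would be expected to produce: `( [^ ]+)` consumes one space and then a single run of non-space characters, so the second capture can never span more than one word, and the trailing optional group simply matches nothing. A minimal sketch on a shortened sample (the names `sample`, `pattern` and `m` are only for illustration):

```ruby
# Shortened sample; the pattern is the one from the question.
sample  = "A method for detecting a post-translationally modified protein"
pattern = /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/

m = sample.match(pattern)

article     = m[1]            # the article itself
rest        = m[2]            # one space plus exactly one word
first_piece = article + rest  # what the loop pushes into temp_array

puts first_piece.inspect
puts m[0].inspect             # the whole match is no longer than the two captures
```

Because `[^ ]+` stops at the next space, getting whole phrases would need something that keeps matching up to the next article.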
I thought of split, but then you get a separate array entry for the
“separator” and for the part that follows it, so you would need to join
them again yourself:
irb(main):008:0> a = "A method for detecting a post-translationally
modified protein with a glycosyl group comprising contacting the
protein with a glycosyl transferase enzyme and a labeling agent,
wherein the labeling agent comprises a chemical handle and a
transferable glycosyl group."
=> "A method for detecting a post-translationally modified protein
with a glycosyl group comprising contacting the protein with a
glycosyl transferase enzyme and a labeling agent, wherein the labeling
agent comprises a chemical handle and a transferable glycosyl group."
irb(main):011:0> require 'enumerator'
=> true
irb(main):012:0> result = []
=> []
irb(main):015:0> a.split(/\b(a|an|the)\b/i)[1..-1].each_slice(2) {|a, b| result << (a+b)}
=> nil
irb(main):016:0> result
=> ["A method for detecting ", "a post-translationally modified
protein with ", "a glycosyl group comprising contacting ", "the
protein with ", "a glycosyl transferase enzyme and ", "a labeling
agent, wherein ", "the labeling agent comprises ", "a chemical handle
and ", "a transferable glycosyl group."]
Maybe someone can come up with a split solution that is easier to join?
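One possibility that avoids the re-joining step altogether is `String#scan` with a lazy match that runs from one article up to the next article or the end of the string. This is only a sketch under the assumption that an "article" means a standalone a, an or the; the names `text`, `article` and `pieces` are illustrative.

```ruby
text = "A method for detecting a post-translationally modified protein with " \
       "a glycosyl group comprising contacting the protein with " \
       "a glycosyl transferase enzyme and a labeling agent, wherein " \
       "the labeling agent comprises a chemical handle and " \
       "a transferable glycosyl group."

# A standalone article: a, an or the, case-insensitive, with word boundaries.
article = /\b(?:an?|the)\b/i

# Match an article, then as little as possible (lazy .*?) until the next
# article or the end of the string. /m lets "." cross line breaks.
pieces = text.scan(/#{article}.*?(?=#{article}|\z)/m)

pieces.each { |piece| puts piece.inspect }
```

Since the pattern has no capture groups, `scan` returns the full matches directly, so nothing has to be paired up afterwards.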
A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.
Thanks for your help; this is another way. I think I can use “\n” as a
marker and then split the text on it.
On Thu, Dec 4, 2008 at 8:09 PM, David A. Black wrote:
Try this:
a.split(/(?=an?\b|the\b)/i)
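A quick check of this suggestion against the sample sentence, as a sketch. One caveat worth hedging: the pattern as written has no leading `\b`, so it should also split before a word-final "a" or "an" (for example inside "data"). The `bounded` variant below adds that boundary; it is an adjustment for illustration, not part of the suggestion, and the variable names are made up.

```ruby
text = "A method for detecting a post-translationally modified protein with " \
       "a glycosyl group comprising contacting the protein with " \
       "a glycosyl transferase enzyme and a labeling agent, wherein " \
       "the labeling agent comprises a chemical handle and " \
       "a transferable glycosyl group."

suggested = /(?=an?\b|the\b)/i        # pattern from the post above
bounded   = /(?=\b(?:an?|the)\b)/i    # same idea, with a leading word boundary

# Splitting on a zero-width lookahead keeps each article with its phrase.
pieces = text.split(bounded)
pieces.each { |piece| puts piece.inspect }

# The two patterns differ on words that end in "a" or "an".
loose  = "data is an example".split(suggested)
strict = "data is an example".split(bounded)
puts loose.inspect
puts strict.inspect
```

For the sentence in this thread both patterns give the same nine pieces, because none of its words ends in "a", "an" or "the".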
Neat, thanks. I understand how lookaheads work when it comes to
matching a regex,
but it’s still not clear to me why split works with the lookahead.
Isn’t the match of the lookahead a 0-width string?
irb(main):001:0> a = "abcxxxabcxxxabcxxx"
=> "abcxxxabcxxxabcxxx"
irb(main):006:0> a.match(/(?=abc)/)[0]
=> ""
irb(main):007:0> a =~ /(?=abc)/
=> 0
Then how does split know where to start the next search?
Thanks,
Jesus.
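One way to picture it, as a sketch of the general idea rather than a description of Ruby's actual implementation: the lookahead matches an empty string at every position where "abc" begins, and a scanner that has just found an empty match has to step forward by at least one character before searching again, otherwise it would find the same spot forever. Cutting the string at those positions reproduces what `split` returns for this input. The loop and the names `positions`, `bounds` and `pieces` are illustrative only.

```ruby
s = "abcxxxabcxxxabcxxx"

# Collect every position where the zero-width lookahead matches.
positions = []
pos = 0
while (i = s.index(/(?=abc)/, pos))
  positions << i
  pos = i + 1   # the match is empty, so advance manually past it
end

# Cut at each match position; a cut at 0 would only yield an empty piece.
bounds = [0] + positions.reject(&:zero?) + [s.length]
pieces = bounds.each_cons(2).map { |from, to| s[from...to] }

puts positions.inspect
puts pieces.inspect
puts (pieces == s.split(/(?=abc)/)).inspect
```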