Forum: Ruby how to extract something in between a pattern

Kn Ta (horizon)
on 2017-02-17 12:55
Was wondering how to extract certain data from a text file using Ruby.
For example, my text file has:

"Response by <a
href="https://www.helloworld.com/body/test_service_ltd">... Service
Limited</a> to <a href="https://www.helloworld.com/user/Joe_Bloggs_3">Jo
Bloggs</a> on <time datetime="2016-09-13T14:43:42+01:00"
title="2016-09-13 14:43:42 +0100">13 September 2016</time>.

Follow up sent to <a
href="https://www.helloworld.com/body/test_service_ltd">... Service
Limited</a> by <a href="https://www.helloworld.com/user/Jane_doe_4">Jane
Doe</a> on <time datetime="2017-02-03T16:48:38+00:00" title="2017-02-03
16:48:38 +0000"> 3 February 2017</time>."

How can I extract 'Joe_Bloggs_3' with the date '2016-09-13',
'Jane_doe_4' with '2017-02-03', and so on?

I need to write the extracted data to an output file. So the output is:

Joe_Bloggs_3, 2016-09-13
Jane_doe_4, 2017-02-03
Ronald Fischer (rovf)
on 2017-02-20 11:02
There are several ways to do it with regular expressions, but in any
case, the patterns you want to extract need to be enclosed in
parentheses (which makes them capturing groups).

One way would then be to use String#match on your input string (see
http://ruby-doc.org/core-1.9.3/String.html#method-i-match), which
returns an object of type MatchData. The example in the aforementioned
URL shows how you can extract the matched strings from the MatchData
object.
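As a minimal sketch of the String#match approach described above: the sample line is shortened from the text in the question, and the variable names and the exact pattern are illustrative assumptions, not something given in the thread. The two parenthesised parts of the pattern are the capturing groups.

```ruby
# Shortened sample line modelled on the text in the question.
line = 'to <a href="https://www.helloworld.com/user/Joe_Bloggs_3">Jo Bloggs</a> ' \
       'on <time datetime="2016-09-13T14:43:42+01:00">13 September 2016</time>.'

# Group 1: the path segment after /user/.
# Group 2: the date part of the datetime attribute.
# The m flag lets .*? run across line breaks as well.
pattern = %r{/user/([^"/]+)".*?datetime="(\d{4}-\d{2}-\d{2})}m

m = line.match(pattern)        # a MatchData object, or nil when nothing matches
user, date = m.captures if m   # captures returns the groups as an array
puts "#{user}, #{date}"        # prints: Joe_Bloggs_3, 2016-09-13
```

String#match only returns the first match; for a text with many entries, String#scan with the same pattern returns one [user, date] array per match.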
Regis d'Aubarede (raubarede)
on 2017-02-20 16:42
Kn Ta wrote in post #1185565:

txt=<<EEND
Response by <a
href="https://www.helloworld.com/body/test_service_ltd">... Service
Limited</a> to <a href="https://www.helloworld.com/user/Joe_Bloggs_3">Jo
Bloggs</a> on <time datetime="2016-09-13T14:43:42+01:00"
title="2016-09-13 14:43:42 +0100">13 September 2016</time>.

Follow up sent to <a
href="https://www.helloworld.com/body/test_service_ltd">... Service
Limited</a> by <a href="https://www.helloworld.com/user/Jane_doe_4">Jane
Doe</a> on <time datetime="2017-02-03T16:48:38+00:00" title="2017-02-03
16:48:38 +0000"> 3 February 2017</time>.
EEND

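# Join the lines, then collect every href/datetime attribute as [name, value].
# Note: joining also glues "<a" to a following "href", so \b skips those links.
txt.gsub(/\r?\n/,"").scan(/\b(href|datetime)="(.*?)"/).
 each_slice(2) do |(k,v),(k1,v1)|
  # Print the pair only when an href is directly followed by a datetime.
  p [v.split('/').last,v1.split('T').first] if k=="href" && k1=="datetime"
end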

==
["Joe_Bloggs_3", "2016-09-13"]
["Jane_doe_4", "2017-02-03"]
Kn Ta (horizon)
on 2017-03-04 07:21
Regis d'Aubarede wrote in post #1185575:

Can you explain what the code is doing, please?

It's just that when I run the code, it returns just one result:

["A_Person", "2017-01-30"]

From the 25 or so entries in my text file, there is one line with:

Request to <a
href="https://www.helloworld.com/body/test_service_ltd">... Service
Ltd</a> by <a href="https://www.helloworld.com/user/J_Doe">J Doe</a>.
Annotated by <a href="https://www.helloworld.com/user/A_Person">
A Person</a> on <time datetime="2017-01-30T13:13:03+00:00"
title="2017-01-30 13:13:03 +0000">30 January 2017</time>.

I think the code is getting thrown off by this line. I deleted the
"annotated by" part and made it similar to the other lines. Then the
code returns nothing.

Actually, is there a way to extract just "J_Doe"? Forget about
the date/time for now (I can filter that elsewhere).
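A minimal sketch of extracting only the user names, assuming every user link has the form .../user/NAME as in the sample above. The variable names are illustrative, and the pattern is an assumption about the file, not something confirmed in the thread.

```ruby
# Sample text copied from the post; in practice it might come from
# File.read with the path of the input file.
text = <<~HTML
  Request to <a
  href="https://www.helloworld.com/body/test_service_ltd">... Service
  Ltd</a> by <a href="https://www.helloworld.com/user/J_Doe">J Doe</a>.
  Annotated by <a href="https://www.helloworld.com/user/A_Person">
  A Person</a> on <time datetime="2017-01-30T13:13:03+00:00"
  title="2017-01-30 13:13:03 +0000">30 January 2017</time>.
HTML

# Capture the path segment after /user/ in every link. scan returns one
# single-element array per match, so flatten gives a plain list of names.
users = text.scan(%r{/user/([^"/]+)"}).flatten
p users           # => ["J_Doe", "A_Person"]

# In this sample the first name is the one after "by".
puts users.first  # prints: J_Doe
```

Because the pattern looks only for /user/ links, it does not depend on where the line breaks fall or on how many other attributes appear in between. The resulting list could then be written out with File.write, one name per line.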