How to extract something in between a pattern


#1

I was wondering how to extract certain data from a text file using Ruby.
For example, my text file has:

"Response by Test Service
Limited to Jo
Bloggs
on 13 September 2016.

Follow up sent to Test Service
Limited by Jane
Doe
on 3 February 2017."

How can I extract ‘Joe_Bloggs_3’ and the date ‘2016-09-13’, and
‘Jane_doe_4’ and ‘2017-02-03’ and so on…

I need to write the extracted data to an output file. So the output is:

Joe_Bloggs_3, 2016-09-13
Jane_doe_4, 2017-02-03


#2

There are several ways to do it with regular expressions, but in any
case, the patterns you want to extract need to be enclosed in
parentheses (which makes them capturing groups).

One way would then be to use String#match on your input string (see
http://ruby-doc.org/core-1.9.3/String.html#method-i-match), which
returns an object of type MatchData. The example in the aforementioned
URL shows how you can extract the matched strings from the MatchData
object.
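
As a minimal sketch of the String#match approach, assuming the input is
plain text shaped like the sample in post #1 (the sample line, the
pattern and the variable names are illustrative, not taken from the
original file):

```ruby
# Hypothetical sample line, flattened from the text shown in post #1.
line = "Limited to Jo Bloggs on 13 September 2016."

# Parentheses create capturing groups: group 1 is the name, group 2 the date.
# (?:to|by) is a non-capturing group, so it does not get its own index.
m = line.match(/Limited (?:to|by) (.+?) on (\d{1,2} \w+ \d{4})\./)

if m
  name = m[1] # text matched by the first capturing group
  date = m[2] # text matched by the second capturing group
  puts "#{name}, #{date}"
end
```

String#match returns nil when nothing matches, which is why the result
is checked before the groups are read.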


#3

Kn Ta wrote in post #1185565:

txt=<<EEND
Response by … Service
Limited to Jo
Bloggs
on 13 September 2016.

Follow up sent to … Service
Limited by Jane
Doe
on 3 February 2017.
EEND

txt.gsub(/\r?\n/,"").scan(/\b(href|datetime)="(.*?)"/).
each_slice(2) do |(k,v),(k1,v1)|
p [v.split('/').last,v1.split('T').first] if k=="href" &&
k1=="datetime"
end

==
["Joe_Bloggs_3", "2016-09-13"]
["Jane_doe_4", "2017-02-03"]
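
The snippet above looks for href and datetime attributes, which do not
appear in the text as displayed here, so it presumably targets the
underlying markup. The following is a step-by-step sketch of the same
pipeline; the markup, paths and timestamps in it are made up for
illustration and are only an assumption about what the real file
contains:

```ruby
# Hypothetical markup: these href and datetime values are assumptions,
# invented to show how the pipeline behaves.
html = <<~HTML
  <a href="/user/Joe_Bloggs_3">Jo
  Bloggs</a> on <time datetime="2016-09-13T10:00:00Z">13 September 2016</time>.
  <a href="/user/Jane_doe_4">Jane
  Doe</a> on <time datetime="2017-02-03T09:30:00Z">3 February 2017</time>.
HTML

# 1. Remove line breaks so the whole input is one line.
flat = html.gsub(/\r?\n/, "")

# 2. scan returns one [attribute_name, value] pair per match, in order.
pairs = flat.scan(/\b(href|datetime)="(.*?)"/)

# 3. each_slice(2) takes the pairs two at a time; a slice is kept only
#    when it is an href immediately followed by a datetime.
results = []
pairs.each_slice(2) do |(k, v), (k1, v1)|
  next unless k == "href" && k1 == "datetime"
  # Last path segment of the href, and the date part before the "T".
  results << [v.split("/").last, v1.split("T").first]
end

results.each { |r| p r }
```

Because the pairing is positional, one extra href or one missing
datetime anywhere in the input shifts every later slice, and the
href/datetime check then rejects those slices.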


#4

Regis d’Aubarede wrote in post #1185575:

Kn Ta wrote in post #1185565:

txt=<<EEND
Response by … Service
Limited to Jo
Bloggs
on 13 September 2016.

Follow up sent to … Service
Limited by Jane
Doe
on 3 February 2017.
EEND

txt.gsub(/\r?\n/,"").scan(/\b(href|datetime)="(.*?)"/).
each_slice(2) do |(k,v),(k1,v1)|
p [v.split('/').last,v1.split('T').first] if k=="href" &&
k1=="datetime"
end

==
["Joe_Bloggs_3", "2016-09-13"]
["Jane_doe_4", "2017-02-03"]

Can you explain what the code is doing, please?

It’s just that when I run the code, it returns just one result:

[“A_Person”, “2017-01-30”]

From the 25 or so entries in my text file, there is one line with:

Request to Test Service
Ltd by J Doe.
Annotated by
A Person
on 30 January 2017.

I think the code is getting thrown off by this line. I deleted the
“annotated by” part and made it similar to the other lines. Then the
code returns nothing.

Actually, is there a way to just extract “J_Doe”? Forget about
the date/time for now. (I can filter this elsewhere.)
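
One possible way to pull out just the name, sketched under the
assumption that the file is plain text shaped like the samples in this
thread: entries separated by blank lines, with the name following
"Limited to", "Limited by" or "Ltd by" and ending at " on <day>" or at a
full stop. The sample text and variable names are illustrative only:

```ruby
# Hypothetical input, copied from the plain-text samples in this thread.
text = <<~TEXT
  Response by Test Service
  Limited to Jo
  Bloggs
  on 13 September 2016.

  Request to Test Service
  Ltd by J Doe.
  Annotated by
  A Person
  on 30 January 2017.
TEXT

# Split into entries on blank lines, then collapse each entry's line
# breaks and repeated spaces into single spaces.
entries = text.split(/\n\s*\n/).map { |e| e.gsub(/\s+/, " ").strip }

# Capture the name lazily; the lookahead stops it at " on <digit>" or ".".
pattern = /(?:Limited|Ltd) (?:to|by) (.+?)(?= on \d|\.)/

names = entries.map { |entry|
  m = entry.match(pattern)
  m && m[1].tr(" ", "_") # nil when the entry has no match
}.compact

puts names
```

This yields names as they appear in the visible text (for example
"Jo_Bloggs"); a suffix such as the "_3" in "Joe_Bloggs_3" is not present
in the plain text, so it cannot be recovered this way. To write the
result to a file, `File.write("names.txt", names.join("\n"))` would do,
where names.txt is a placeholder filename.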