Reading XML to relational tables

Hi everyone,

I need to build 3 relational tables from an XML text. In these tables, I
need to keep track of the words that are wrapped in <emph> and <strong>
tags, along with the <p> element each word is mentioned in and its count
within that <p> tag. This is easier to illustrate with an example:

I need to take this text:

My name is Ted, and I like coffee. Ted does not like tea.

I have a brother who likes tea but does not like coffee

To 3 normalized tables like this:

…p_table…
p_id  desc
1     My name is…
2     I have a …

…p_to_emph_table…
p_id  e_id  count
1     2     1
2     1     1
2     2     1

…emph_table…
e_id  emph_word
1     Tea
2     Coffee

I am not sure what the best approach would be to parse this XML with
Ruby, or which tool could help me do this efficiently.

Any ideas appreciated,

Ted.

On Sat, Apr 2, 2011 at 12:47 AM, Ted F. wrote:

I am not sure what the best approach would be to parse this XML with
Ruby, or which tool could help me do this efficiently.

What I’d do is parse the XML (use Nokogiri, for example) and get all p
elements. For each p element, insert it into p_table if not present
and get its id. Look at all emph inside the p element, and for each of
them:

  • Check if the word is already in emph_table and get the id or
  • Insert it into emph_table and get the id

With that id, insert or update a row in the p_to_emph_table with the p
and the word id.
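
The steps above could be sketched roughly as follows. This is only a minimal in-memory sketch, not a definitive implementation: plain Ruby hashes stand in for the three database tables, and the names TableBuilder, add_paragraph and find_or_insert are hypothetical. It assumes the paragraph text and its highlighted words have already been extracted from the XML (for example with Nokogiri), and that words are capitalized so that "coffee" and "Coffee" share one row, as the example emph_table seems to suggest.

```ruby
# In-memory stand-ins for the three tables (hypothetical names, not a DB API).
class TableBuilder
  attr_reader :p_table, :emph_table, :p_to_emph_table

  def initialize
    @p_table = {}                    # paragraph text  => p_id
    @emph_table = {}                 # highlighted word => e_id
    @p_to_emph_table = Hash.new(0)   # [p_id, e_id]     => count
  end

  # Insert the paragraph if it is not present, then insert or update
  # one link row per highlighted word found inside it.
  def add_paragraph(text, highlighted_words)
    p_id = find_or_insert(@p_table, text)
    highlighted_words.each do |word|
      # Assumption: normalize case so "coffee" and "Coffee" map to one row.
      e_id = find_or_insert(@emph_table, word.strip.capitalize)
      @p_to_emph_table[[p_id, e_id]] += 1
    end
    p_id
  end

  private

  # Return the existing id for key, or assign the next sequential id.
  def find_or_insert(table, key)
    table[key] ||= table.size + 1
  end
end

builder = TableBuilder.new
builder.add_paragraph("My name is Ted, and I like coffee. Ted does not like tea.", ["coffee"])
builder.add_paragraph("I have a brother who likes tea but does not like coffee", ["tea", "coffee"])
```

Ids here are assigned in order of first appearance, so the numbering can differ from the example tables in the original post. With a real database, find_or_insert would become a SELECT followed by an INSERT when no row is found.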

This is a straightforward approach that should work. Give it a try (ask
about anything that blocks you) and let us know how it goes.

Jesus.

Hi Jesus,

Thank you for your help. Right now I am stuck trying to traverse the
elements in a single XML::Element. I know I can use the elements method
to list the elements, but I am not sure how I can traverse them and get
their contents individually.

require 'nokogiri'

xml = File.read('translateXML.xml')
doc = Nokogiri::XML(xml)

# split into sentences first
arr = doc.search('p')

puts arr[0].elements

On Sat, Apr 9, 2011 at 11:39 PM, Ted F. wrote:

# split into sentences first

arr = doc.search('p')

Try something like:

require 'nokogiri'

doc = Nokogiri::XML(File.read("p.xml"))
# Visit every <p> element in the document
doc.search("p").each do |p_element|
  puts "---------"
  puts p_element.text
  # List the highlighted words inside this paragraph
  p_element.css("emph,strong").each do |emph|
    puts "Highlighted: #{emph.text}"
  end
end

Jesus.