Taking out text between symbols and joining together

bob05 · June 30, 2010, 1:54am

Hello,

I have some text files that I would like to extract text from, then join
them on one single line and save them to a text file.

Here is an example of the text I want to take out:

Protein complexes in Saccharomyces cerevisiae (GPM06600002310) GPM06600002310 None

Here is how I would like the text to save as:

Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)GPM06600002310 None

So far I have this:

require 'rexml/document'
include REXML
file = File.new("1.xml")
doc = Document.new(file)
puts doc
aFile = File.new("1.txt", "w")
aFile.write(doc)
aFile.close

I was wondering, how can you split out text and join them on one line?

bob05 · June 30, 2010, 8:21am

On Wed, Jun 30, 2010 at 1:54 AM, Bob05 Dr wrote:

None
include REXML
file = File.new("1.xml")
doc = Document.new(file)
puts doc
aFile = File.new("1.txt", "w")
aFile.write(doc)
aFile.close

I was wondering, how can you split out text and join them on one line?

First of all, your document doesn’t parse, because it has two root
elements and XML allows only one.
After solving that, what you need is to get to each element and
extract its child text nodes.
Take a look at:

http://www.germane-software.com/software/rexml/docs/tutorial.html

And the methods:

elements
[]
text

of the API. Experiment a little in IRB:

irb(main):001:0> s = <<EOF
irb(main):002:0" Protein complexes in Saccharomyces cerevisiae
irb(main):003:0" (GPM06600002310)
irb(main):004:0" GPM06600002310
irb(main):005:0" None
irb(main):006:0" EOF
=> "Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)\nGPM06600002310\nNone\n"
irb(main):007:0>
irb(main):008:0*
irb(main):009:0* require 'rexml/document'
=> true
irb(main):010:0> include REXML
=> Object
irb(main):011:0> doc = Document.new s
REXML::ParseException: #<RuntimeError: attempted adding second root
element to document>

Oops, two root elements. I’ll add a fake one (ROOT) surrounding everything:

irb(main):012:0> s = <<EOF
irb(main):013:0"
irb(main):014:0" Protein complexes in Saccharomyces cerevisiae
irb(main):015:0" (GPM06600002310)
irb(main):016:0" GPM06600002310
irb(main):017:0" None
irb(main):018:0"
irb(main):019:0" EOF
=> "\nProtein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)\nGPM06600002310\nNone\n\n"
irb(main):020:0> doc = Document.new s
=> … </>
irb(main):025:0> doc.elements
=> #<REXML::Elements:0xb72907e0 @element= … </>>
irb(main):026:0> doc.elements.each {|el| p el}
… </>
=> [ … </>]
irb(main):027:0> doc.to_a
=> [ … </>, “\n”]
irb(main):028:0> doc.elements.to_a
=> [ … </>]
irb(main):032:0> doc.elements["/Title"]
=> nil
irb(main):033:0> doc.elements["Title"]
=> nil
irb(main):034:0> root = doc.root
=> … </>
irb(main):035:0> root.elements["Title"]
=> … </>
irb(main):036:0> root.elements["Title"].to_s
=> "Protein complexes in Saccharomyces
cerevisiae\n(GPM06600002310)"

Look, it seems that with that I can get the text of the Title element.
Let’s see if there’s a better way:

irb(main):039:0> root.elements["Title"].methods.sort
=> ["<<", "==", "===", "=~", "[]", "[]=", "__id__", "__send__", "add",
"add_attribute", "add_attributes", "add_element", "add_namespace",
"add_text", "all?", "any?", "attribute", "attributes", "bytes",
"cdatas", "children", "class", "clone", "collect", "comments",
"context", "context=", "count", "cycle", "dclone", "deep_clone",
"delete", "delete_at", "delete_attribute", "delete_element",
"delete_if", "delete_namespace", "detect", "display", "document",
"drop", "drop_while", "dup", "each", "each_child", "each_cons",
"each_element", "each_element_with_attribute",
"each_element_with_text", "each_index", "each_recursive",
"each_slice", "each_with_index", "elements", "entries", "enum_cons",
"enum_for", "enum_slice", "enum_with_index", "eql?", "equal?",
"expanded_name", "extend", "find", "find_all", "find_first_recursive",
"find_index", "first", "freeze", "frozen?", "fully_expanded_name",
"get_elements", "get_text", "grep", "group_by", "has_attributes?",
"has_elements?", "has_name?", "has_text?", "hash", "id",
"ignore_whitespace_nodes", "include?", "indent", "index",
"index_in_parent", "inject", "insert_after", "insert_before",
"inspect", "instance_eval", "instance_exec", "instance_of?",
"instance_variable_defined?", "instance_variable_get",
"instance_variable_set", "instance_variables", "instructions",
"is_a?", "kind_of?", "length", "local_name", "map", "max", "max_by",
"member?", "method", "methods", "min", "min_by", "minmax",
"minmax_by", "name", "name=", "namespace", "namespaces",
"next_element", "next_sibling", "next_sibling=", "next_sibling_node",
"nil?", "node_type", "none?", "object_id", "one?", "parent",
"parent=", "parent?", "partition", "prefix", "prefix=", "prefixes",
"previous_element", "previous_sibling", "previous_sibling=",
"previous_sibling_node", "private_methods", "protected_methods",
"public_methods", "push", "raw", "reduce", "reject", "remove",
"replace_child", "replace_with", "respond_to?", "reverse_each",
"root", "root_node", "select", "send", "singleton_methods", "size",
"sort", "sort_by", "taint", "tainted?", "take", "take_while", "tap",
"text", "text=", "texts", "to_a", "to_enum", "to_s", "to_set", "type",
"unshift", "untaint", "whitespace", "write", "xpath", "zip"]

There’s a text method in there, would that do what I expect?

irb(main):040:0> root.elements["Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"

Bingo!

Is there a way to access it directly from the doc, instead of having a
root variable?

irb(main):042:0> doc.elements["ROOT/Title"].text
=> "Protein complexes in Saccharomyces cerevisiae\n(GPM06600002310)"
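
Outside IRB, the same steps might look roughly like the sketch below. The Title and ROOT names come from the session above; the Accession element and the sample fragment are made up for illustration, since I don't know what your real tags are called.

```ruby
require 'rexml/document'

# Hypothetical fragment: sibling elements with no single root.
# "Accession" is an invented name, not taken from the real file.
fragment = <<XML
<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<Accession>GPM06600002310</Accession>
XML

# Wrap the fragment in a fake ROOT element so REXML accepts it.
doc = REXML::Document.new("<ROOT>#{fragment}</ROOT>")

# Reach the element straight from the document and read its text.
title = doc.elements["ROOT/Title"].text

# Collapse the embedded newline (and any repeated whitespace) into single spaces.
one_line = title.split.join(" ")
puts one_line
```

Calling split with no arguments breaks on any run of whitespace, so the newline inside the Title text disappears when the pieces are joined back with a single space.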

Now you can do the same for the other elements. I also recommend you
learn XPath and CSS selectors if you are going to be parsing markup,
and also look at other parsers like Nokogiri. This example was pretty
simple, but these things can get nasty.
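
To tie this back to the original question, here is a minimal sketch of how XPath could collect the text of every element and join it on one line. The helper name texts_on_one_line and the Accession and Comment element names are hypothetical; the sample assumes the fake ROOT wrapper is already in place.

```ruby
require 'rexml/document'

# Hypothetical helper: gather the text of each child element of ROOT
# and join everything on a single line.
def texts_on_one_line(xml)
  doc = REXML::Document.new(xml)
  # "/ROOT/*" selects the element children of ROOT, skipping whitespace nodes.
  elements = REXML::XPath.match(doc, "/ROOT/*")
  elements.map { |el| el.text.to_s.split.join(" ") } # text may be nil, hence to_s
          .reject(&:empty?)
          .join(" ")
end

# Sample input; "Accession" and "Comment" are invented element names.
sample = <<XML
<ROOT>
<Title>Protein complexes in Saccharomyces cerevisiae
(GPM06600002310)</Title>
<Accession>GPM06600002310</Accession>
<Comment>None</Comment>
</ROOT>
XML

line = texts_on_one_line(sample)
puts line

# Save the joined line; the block form closes the file automatically.
File.open("1.txt", "w") { |f| f.puts(line) }
```

Note that text only returns the first text node of an element, so elements with nested children would need texts or a deeper XPath expression.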

Jesus.