Identify and extract positions from a string - how to?

Hi,

I am not quite sure about how to approach the following problem:

I have a long (long long long) string of letters, a genomic sequence
(600k characters+).
Now, what I want to do is to extract certain parts of this string, based
on the position.
So for example, let's say I want all characters from position 2340 to
5436.

A quick pointer in the right direction would be much appreciated. I have
a vague idea that it could perhaps be done with count? Like "puts string
where string.count("actg")=2340 until string.count("actg")=5436"... ?
Not sure though, and there are probably better ways.

Cheers,

Marc

On 7/19/07, Marc H. wrote:

> A quick pointer in the right direction would be much appreciated. I have
> a vague idea that it could perhaps be done with count? Like "puts string
> where string.count("actg")=2340 until string.count("actg")=5436"... ?
> Not sure tho, and probably there are better ways.

string[2340..5436]

Cheers,

On Thu, 19 Jul 2007 19:31:12 +0900, Marc H. wrote:

> Hi,
>
> I am not quite sure about how to approach the following problem:
>
> I have a long (long long long) string of letters, a genomic sequence
> (600k characters+).
> Now, what I want to do is to extract certain parts of this string, based
> on the position.
> So for example lets say I want all characters from position 2340 to
> 5436.

What about

puts "My String"[5..7]

Thomas

On 19 July at 12:31, Marc H. wrote:

> Hi,
>
> I am not quite sure about how to approach the following problem:
>
> I have a long (long long long) string of letters, a genomic sequence
> (600k characters+).
> Now, what I want to do is to extract certain parts of this string, based
> on the position.
> So for example lets say I want all characters from position 2340 to
> 5436.

For example:

str = "abcdefghijklmnopqrstuvwxyz"
=> "abcdefghijklmnopqrstuvwxyz"

The simplest way to answer your question is:

str[5..11]
=> "fghijkl"

You may want to try the other variants:

str[5, 7]
=> "fghijkl"

str[/f.*l/]
=> "fghijkl"

str['fghijkl']
=> "fghijkl"

If you need to process it character by character, you can use a multitude of
methods:

str[5..10].each_byte { |b| puts b.chr }
f
g
h
i
j
k
=> "fghijk"

str[5..10].split(//)
=> ["f", "g", "h", "i", "j", "k"]

str[5..10].split(//).each { |c| puts c }
f
g
h
i
j
k
=> ["f", "g", "h", "i", "j", "k"]

Etc.

I haven't tried this with very long strings, but I don't see why
range-based access wouldn't be acceptable. (Of course, the regular
expression will be slower.)

Fred
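
For anyone who wants to check the range access on a string of roughly the size mentioned in the question, here is a minimal sketch. The random_sequence helper is made up for this example and only stands in for the real genomic data; the timing is whatever your machine reports.

```ruby
require "benchmark"

# Hypothetical helper: builds a random sequence of the given length.
def random_sequence(length)
  bases = %w[a c g t]
  Array.new(length) { bases[rand(4)] }.join
end

seq = random_sequence(600_000)

region = nil
elapsed = Benchmark.realtime { region = seq[2340..5436] }

puts region.length   # 3097, i.e. 5436 - 2340 + 1
puts "slice took #{elapsed} seconds"
```

Note that Ruby string indices start at 0, so if the coordinates count from 1 they would need to be shifted by one first.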

Thanks a lot, I don't know how I missed that in the string chapter.

Anyhow, another thing came up:

While string[1..10] is pretty much what I was looking for, is there any
way that I can substitute the numbers (or the whole content of the
square brackets, for that matter) with variables?

As it is now I have a file that contains coordinates and a second file
that contains the string that I want to extract from.

So ideally the script would read each line of the coordinate file

45..78
90..120
etc

and use it in the extraction method

file.readlines each do |l|
  puts string[l]
end

Doesn't work though. Any suggestions on how to pipe each line of the
coordinate file to the string method? I know, I know, it's probably simple,
but I am still learning ;)

Cheers,

Marc

Posted via http://www.ruby-forum.com/.

irb(main):001:0> (a,b) = "3..5".split("..").map {|x| x.to_i}
=> [3, 5]
irb(main):002:0> "test_string"[a..b]
=> "t_s"

On 19 July at 13:16, Marc H. wrote:

> and uses it in the extraction method
>
> file.readlines each do |l|
> puts string[l]
> end

The other solutions in the thread are the ones to use, but I feel the
need to suggest the very dirty / insecure / bad one:

File.readlines(filepath).each do |l|
  puts string[eval(l)]
end

Don't try this at home, etc. :)

(But, in a controlled environment, it may be useful since it allows for
all the variations that can be evaluated in one line of Ruby code...)

Fred

2007/7/19, F. Senault:

> File.readlines(filepath).each do |l|
>   puts string[eval(l)]
> end
>
> Don't try this at home, etc. :)
>
> (But, in a controlled environment, it may be useful since it allows for
> all the variations that can be evaluated in one line of Ruby code...)

A safer variant:

file.each do |line|
  if /^(\d+)\.\.(\d+)$/ =~ line
    puts string[ $1.to_i .. $2.to_i ]
  end
end

Note that file.each is more efficient than file.readlines.each
because it does not need to read the whole file into memory.

Kind regards

robert
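
Putting the pieces together, here is a minimal, self-contained sketch of the coordinate-file approach. The helper names parse_range and extract_regions are made up for this example, and a StringIO stands in for the coordinate file so it runs without any files on disk. It assumes each line holds two non-negative, 0-based positions joined by "..".

```ruby
require "stringio"

# Hypothetical helper: turns a line such as "45..78" into the Range 45..78.
# Returns nil for anything that is not two non-negative integers joined by "..".
def parse_range(line)
  match = /\A(\d+)\.\.(\d+)\s*\z/.match(line)
  return nil unless match
  (match[1].to_i)..(match[2].to_i)
end

# Hypothetical helper: collects one slice of `sequence` per valid coordinate line.
# `coords` can be any IO-like object that responds to each_line.
def extract_regions(sequence, coords)
  regions = []
  coords.each_line do |line|
    range = parse_range(line)
    next if range.nil?          # skip blank or malformed lines
    regions << sequence[range]
  end
  regions
end

# StringIO stands in for the coordinate file so the sketch runs on its own;
# with a real file this would be File.open(path) { |f| extract_regions(seq, f) }.
sequence = "abcdefghijklmnopqrstuvwxyz"
coords   = StringIO.new("5..11\nnot a range\n0..2\n")
puts extract_regions(sequence, coords)
```

This should print fghijkl and abc, silently skipping the malformed line. A range that starts beyond the end of the sequence yields nil, so real input may need an extra check.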

On Thu, 19 Jul 2007 20:16:12 +0900, Marc H. wrote:

> As it is now I have a file that contains coordinates and a second file
> that contains the string that I want to extract from.
>
> So ideally the script would read each line of the coordinate file
>
> 45..78
> 90..120
> etc

Those ..-things are called ranges, which, unsurprisingly, are a class in
Ruby. Have a look at http://corelib.rubyonrails.org/ for the class
Range.

Another way to express str[45..78] is str[45,78] or str.slice(45,78) or
str.slice(45..78), where the numbers can be replaced by variables:
str[fr..to], str[fr,to], str.slice(fr,to), str.slice(fr..to)

This information can be found on the same web page, just look for the
class String ;)

> and uses it in the extraction method
>
> file.readlines each do |l|
> puts string[l]
> end
>
> Doesnt work tho -any suggestions on how to pipe each line of the
> coordinate file to the string method? I know I know, probably simple,
> but I am still learning ;)

l is a String object, not a Range object.

file.readlines.each do |l|
  fr, to = l.split(/\.\./)
  puts string[fr,to]
end

should do the job.

The thingy with the slashes in the split method is a regular expression.

Regards
Thomas

On Thu, 19 Jul 2007 11:54:29 +0000, Thomas W. wrote:

> puts string[fr,to]

should be

puts string[fr.to_i,to.to_i]

Thomas

On Thu, 19 Jul 2007 14:04:13 +0200, F. Senault wrote:

> str[45..78].length
> => 34
>
> str[45,78].length
> => 78
>
> (IOW start_position..end_position versus start_position,length.)

I guess you are right. I misinterpreted the documentation, which says in
a number of examples:

a = "hello there"
a[1,3] #=> "ell"
a[1..3] #=> "ell"

I should have taken the time to read the text instead.

Thomas

On 19 July at 13:54, Thomas W. wrote:

> Those ..-things are called ranges, which, what wonder, are a class in
> ruby. Have a look at http://corelib.rubyonrails.org/ for the class Range.
>
> another way to express str[45..78] is str[45,78]

Nope:

str[45..78].length
=> 34
str[45,78].length
=> 78

(IOW start_position..end_position versus start_position,length.)

Fred
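
To make the start..end versus start,length distinction concrete, here is a small sketch. The two helper names are made up for this example; both take a start and an end position and should return the same substring.

```ruby
# Inclusive range: both end positions are part of the result.
def slice_by_range(str, from, to)
  str[from..to]
end

# Start/length form: the second argument is a count, so convert the end
# position into a length first.
def slice_by_length(str, from, to)
  str[from, to - from + 1]
end

str = "abcdefghijklmnopqrstuvwxyz"
puts slice_by_range(str, 5, 11)    # fghijkl
puts slice_by_length(str, 5, 11)   # fghijkl
puts str[5, 11]                    # fghijklmnop (11 characters from index 5)
```

The documentation example a[1,3] and a[1..3] gives the same result only because the end position 3 happens to equal the length 3 - 1 + 1.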