Identify and extract positions from a string - how to?


#1

Hi,

I am not quite sure about how to approach the following problem:

I have a long (long long long) string of letters, a genomic sequence
(600k characters+).
Now, what I want to do is to extract certain parts of this string, based
on the position.
So for example let's say I want all characters from position 2340 to
5436.

A quick pointer in the right direction would be much appreciated. I have
a vague idea that it could perhaps be done with count? Like "puts string
where string.count("actg")=2340 until string.count("actg")=5436"... ?
Not sure tho, and probably there are better ways.

Cheers,

Marc


#2

On 7/19/07, Marc H. wrote:

A quick pointer in the right direction would be much appreciated. I have
a vague idea that it could perhaps be done with count? Like “puts string
where string.count(“actg”)=2340 until string.count(“actg”)=5436”… ?
Not sure tho, and probably there are better ways.

string[2340..5436]
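
One thing to watch: Ruby string indices start at 0, so the range above returns the 2341st through 5437th characters. If the positions are meant to be 1-based (an assumption, the question does not say), they need shifting by one. A minimal sketch, where extract_1based is a made-up helper name and the short sequence is stand-in data:

```ruby
# Hypothetical helper: treats first and last as 1-based, inclusive positions.
def extract_1based(sequence, first, last)
  sequence[(first - 1)..(last - 1)]
end

seq = "acgtacgtac"                # stand-in data, not a real genome
puts seq[2..4]                    # 0-based indices 2, 3, 4 => gta
puts extract_1based(seq, 2, 4)    # 1-based positions 2, 3, 4 => cgt
```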

Cheers,


#3

On Thu, 19 Jul 2007 19:31:12 +0900, Marc H. wrote:

Hi,

I am not quite sure about how to approach the following problem:

I have a long (long long long) string of letters, a genomic sequence
(600k characters+).
Now, what I want to do is to extract certain parts of this string, based
on the position.
So for example lets say I want all characters from position 2340 to
5436.

What about

puts "My String"[5..7]

Thomas


#4

On 19 July at 12:31, Marc H. wrote:

Hi,

I am not quite sure about how to approach the following problem:

I have a long (long long long) string of letters, a genomic sequence
(600k characters+).
Now, what I want to do is to extract certain parts of this string, based
on the position.
So for example lets say I want all characters from position 2340 to
5436.

For example:

str = "abcdefghijklmnopqrstuvwxyz"
=> "abcdefghijklmnopqrstuvwxyz"

The simplest way to answer your question is:

str[5..11]
=> "fghijkl"

You may want to try the other variants:

str[5, 7]
=> "fghijkl"

str[/f.*l/]
=> "fghijkl"

str['fghijkl']
=> "fghijkl"

If you need to parse it char by char, you can use a multitude of
methods:

str[5..10].each_byte { |b| puts b.chr }
f
g
h
i
j
k
=> "fghijk"

str[5..10].split(//)
=> ["f", "g", "h", "i", "j", "k"]

str[5..10].split(//).each { |c| puts c }
f
g
h
i
j
k
=> ["f", "g", "h", "i", "j", "k"]

Etc.

I haven’t tried this with very long strings, but I don’t see why the
range-based access wouldn’t be acceptable. (Of course, the regular
expression will be slower.)
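
For what it is worth, here is a rough sketch of the range access on a string of about the size mentioned in the question. The 600,000-character sequence is made-up stand-in data, and the base counting is only there to show a per-character walk over the slice:

```ruby
# Stand-in sequence of 600,000 characters (made-up data).
sequence = "acgt" * 150_000

part = sequence[2340..5436]   # inclusive range: indices 2340 through 5436
puts part.length              # 5436 - 2340 + 1 = 3097

# Walk the slice one character at a time and tally each letter.
counts = Hash.new(0)
part.each_char { |c| counts[c] += 1 }
puts counts.inspect
```

Slicing by range copies only the requested part, so the cost depends on the length of the slice rather than on scanning the whole string for a pattern.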

Fred


#5

Thanks a lot, don't know how I missed that in the string chapter.

Anyhow, another thing came up:

while string[1..10] is pretty much what I was looking for - is there any
way that I can substitute the numbers (or the whole content of the
square brackets for that matter) with variables?

As it is now I have a file that contains coordinates and a second file
that contains the string that I want to extract from.

So ideally the script would read each line of the coordinate file

45..78
90..120
etc

and use it in the extraction method

file.readlines.each do |l|
  puts string[l]
end

Doesn't work though - any suggestions on how to pipe each line of the
coordinate file to the string method? I know I know, probably simple,
but I am still learning ;)

Cheers,

Marc


#6

irb(main):001:0> (a,b) = "3..5".split("..").map {|x| x.to_i}
=> [3, 5]
irb(main):002:0> "test_string"[a..b]
=> "t_s"
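
The same idea can be wrapped so that each line of text becomes a Range object, which can then go straight into the square brackets. This is only a sketch: parse_range is a made-up name, and returning nil for lines that do not look like two integers joined by ".." is my own choice, not something from the thread:

```ruby
# Hypothetical helper: turns text like "45..78" into the Range 45..78.
# Returns nil when the text is not two non-negative integers joined by "..".
def parse_range(text)
  first, last = text.strip.split("..", 2)
  return nil unless [first, last].all? { |s| s.to_s =~ /\A\d+\z/ }
  (first.to_i..last.to_i)
end

range = parse_range("3..5\n")   # the trailing newline mimics a line from a file
puts "test_string"[range]       # => t_s
```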


#7

On 19 July at 13:16, Marc H. wrote:

and uses it in the extraction method

file.readlines.each do |l|
  puts string[l]
end

The other solutions in the thread are the ones to use, but I feel the
need to suggest the very dirty / insecure / bad one:

File.readlines(filepath).each do |l|
  puts string[eval(l)]
end

Don’t try this at home, etc… :-)

(But, in a controlled environment, it may be useful since it allows for
all the variations that can be evaluated in one line of ruby code…)

Fred


#8

2007/7/19, F. Senault:

File.readlines(filepath).each do |l|
  puts string[eval(l)]
end

Don’t try this at home, etc… :-)

(But, in a controlled environment, it may be useful since it allows for
all the variations that can be evaluated in one line of ruby code…)

A safer variant:

file.each do |line|
  if /^(\d+)\.\.(\d+)$/ =~ line
    puts string[$1.to_i..$2.to_i]
  end
end

Note that file.each is more efficient than file.readlines.each
because it does not need to read the whole file into memory.
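
A self-contained sketch of the same line-by-line approach, in case it helps. It uses a StringIO as a stand-in for the coordinate file so it runs without touching disk; with a real file, File.foreach yields lines one at a time in the same way. The file name "coords.txt" and the sample data are made up:

```ruby
require "stringio"

# Stand-in for the coordinate file. With a real file this could be
# File.foreach("coords.txt") do |line| ... end ("coords.txt" is hypothetical).
coords = StringIO.new("2..4\nnot a range\n6..9\n")
sequence = "acgtacgtac"   # stand-in data

pieces = []
coords.each_line do |line|
  # \A and \z anchor the whole stripped line; the dots are escaped so they
  # match literal periods. Lines that do not match are skipped.
  if line.strip =~ /\A(\d+)\.\.(\d+)\z/
    pieces << sequence[$1.to_i..$2.to_i]
  end
end

puts pieces
```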

Kind regards

robert


#9

On Thu, 19 Jul 2007 20:16:12 +0900, Marc H. wrote:

As it is now I have a file that contains coordinates and a second file
that contains the string that I want to extract from.

So ideally the script would read each line of the coordinate file

45..78
90..120
etc

Those ..-things are called ranges, which, no surprise, are a class in
Ruby. Have a look at http://corelib.rubyonrails.org/ for the class
Range.

Another way to express str[45..78] is str[45,78] or str.slice(45,78) or
str.slice(45..78), where the numbers can be replaced by variables:
str[fr..to], str[fr,to], str.slice(fr,to), str.slice(fr..to)

This information can be found at the same webpage, just look for the
class String ;)

and use it in the extraction method

file.readlines.each do |l|
  puts string[l]
end

Doesn't work though - any suggestions on how to pipe each line of the
coordinate file to the string method? I know I know, probably simple,
but I am still learning ;)

l is a String-object, not a Range-object.

file.readlines.each do |l|
  fr, to = l.split(/\.\./)
  puts string[fr,to]
end

should do the job.

The thingy with the slashes in the split-method is a regular expression.

Regards
Thomas


#10

On Thu, 19 Jul 2007 11:54:29 +0000, Thomas W. wrote:

puts string[fr,to]

should be

puts string[fr.to_i,to.to_i]

Thomas


#11

On Thu, 19 Jul 2007 14:04:13 +0200, F. Senault wrote:

str[45..78].length
=> 34

str[45,78].length
=> 78

(IOW start_position..end_position versus start_position,length.)

I guess you are right. I misinterpreted the documentation, which says in
a number of examples:

a = "hello there"
a[1,3] #=> "ell"
a[1..3] #=> "ell"

I should have taken the time to read the text instead.

Thomas


#12

On 19 July at 13:54, Thomas W. wrote:

Those ..-things are called ranges, which, no surprise, are a class in
Ruby. Have a look at http://corelib.rubyonrails.org/ for the class Range.

Another way to express str[45..78] is str[45,78]

Nope:

str[45..78].length
=> 34

str[45,78].length
=> 78

(IOW start_position..end_position versus start_position,length.)
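
To make the difference concrete, here is a small sketch on a made-up 130-character string. It also shows the three-dot form, which leaves out the end index, and how to turn a first/last pair into a start/length pair:

```ruby
str = ("a".."z").to_a.join * 5   # 130 characters of stand-in data

puts str[45..78].length    # indices 45 through 78 inclusive => 34
puts str[45...78].length   # three dots exclude index 78     => 33
puts str[45, 78].length    # start at 45, take 78 characters  => 78

# A first/last pair becomes a start/length pair like this:
first, last = 45, 78
same = str[first, last - first + 1] == str[first..last]
puts same                  # => true

# The two forms agree only when the length happens to equal
# last - first + 1, as with start 1, length 3 versus indices 1..3:
a = "hello there"
puts a[1, 3] == a[1..3]    # => true
```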

Fred