FileString - request for comments

Hi there

I just put FileString on github: GitHub - apeiros/filestring: Treat files like plain normal strings
FileString is a class that wraps a path on the filesystem (a file) and
provides an exact copy of the String API. This means you can code as if
you had a String and your file on the disk gets manipulated just
“magically”.
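To make the idea concrete, here is a minimal sketch of a file-backed object that answers a few String-style calls. The class name TinyFileString and its three methods are hypothetical and only illustrate the concept; this is not FileString's actual implementation, which covers the full String API.

```ruby
# Hypothetical sketch, NOT the real FileString class.
class TinyFileString
  def initialize(path)
    @path = path
  end

  # Return the whole file content as a binary String.
  def to_s
    File.binread(@path)
  end

  # Size of the file in bytes.
  def length
    File.size(@path)
  end

  # Element assignment: read the file, let String#[]= do the work
  # in memory, then write the result back to disk.
  def []=(*args)
    data = to_s
    data.public_send(:[]=, *args)
    File.binwrite(@path, data)
    args.last
  end
end
```

With this, TinyFileString.new(path)[0, 5] = "HELLO" rewrites the first five bytes of the file. Reading and rewriting the whole file on every call is the simplest possible strategy and is assumed here only for brevity.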

The library is very young (just a bit more than 24h), so please use with
care.

I’d appreciate any kind of comment.

Regards
Stefan

On Nov 8, 2009, at 7:47 PM, [email protected] wrote:

I just put FileString on github: GitHub - apeiros/filestring: Treat files like plain normal strings
FileString is a class that wraps a path on the filesystem (a file)
and provides an exact copy of the String API. This means you can
code as if you had a String and your file on the disk gets
manipulated just “magically”.

Interesting choice to use a String. I used Tie::File a couple of
times in Perl code. It works as an Array instead:

James Edward G. II

James Edward G. II wrote:

Tie::File - Access the lines of a disk file via a Perl array - metacpan.org

James Edward G. II

What would the advantage over mmap[1] be? FileString is pure ruby
(right?) and hence more portable, but probably mmap is much more
efficient? Any other tradeoffs?

[1] http://moulon.inra.fr/ruby/mmap.html; looks like this project of Guy
Decoux’s has been recently adopted by knu:
GitHub - knu/ruby-mmap: Ruby bindings for Unix mmap(2) by Guy Decoux.

-------- Original Message --------

Date: Mon, 9 Nov 2009 12:37:17 +0900
From: James Edward G. II [email protected]
To: [email protected]
Subject: Re: FileString - request for comments

Tie::File - Access the lines of a disk file via a Perl array - metacpan.org

James Edward G. II

Somebody I know already implemented a TieFile in ruby, the repository is
at http://killerfox.protection-fault.ch/gitrepo/tie_file.git

Personally I don’t tend to think of a file as an array. I’d use
Tie::File if I needed a persistent array, so there the problem comes
from “the other way round”. With FileString I explicitly want to deal
with a File, but not with an IO-like API (of course you could approach
it as “I need a persistent String” too - but that wasn’t/isn’t the case
for me).

Regards
Stefan

Hi Joel

What would the advantage over mmap[1] be? FileString is pure ruby
(right?) and hence more portable, but probably mmap is much more
efficient? Any other tradeoffs?

Interesting, I was looking if a solution existed already and didn’t find
mmap. Yes, FileString is pure ruby and should therefore run on all ruby
implementations. And yes, I’d expect mmap to be more efficient on the
other hand. It’d be interesting to combine the two (if that’s at all
possible).
In a quick test it seems FileString is more complete too, e.g. Mmap
doesn’t have #replace (should be trivial to add). But Mmap has the
feature to only tie a part of the file.

GitHub - knu/ruby-mmap: Ruby bindings for Unix mmap(2) by Guy Decoux.
Thanks for the link

Regards
Stefan

2009/11/9 Eleanor McHugh [email protected]:

On 9 Nov 2009, at 13:54, [email protected] wrote:

It would probably be fairly trivial for you to directly support mmap at the
OS level using Ruby/DL, Ruby-FFI or even syscall (although that’s ugly and
fragile).

I am still trying to wrap my head around the question whether hiding
file IO behind a String API is a good idea. Basically the reason to
create something like this is to be able to use a file in places which
expect to be given a String instance. However, code that uses String
assumes fast access to arbitrary portions of the string. When those
accesses are translated into random accesses to a file performance
might suffer dramatically. Put differently: hiding the fact that we
are dealing with a file is convenient but may actually break your
neck. And although at a certain level of abstraction a file and a
String are pretty much the same (sequence of chars / bytes) it may
actually be a good thing to keep the API separate in order to treat
both appropriately. Stefan, what’s your experience?

Kind regards

robert

Robert,

RK> I am still trying to wrap my head around the question whether hiding
RK> file IO behind a String API is a good idea.

As the PickAxe book points out, by having file i/o represented by a
String … that is, making it irrelevant whether one is talking to a
String or a File … makes for some nice unit testing.
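A small sketch of that testing idea, using StringIO from the standard library as the stand-in for a file. The method count_todos is a hypothetical example of code under test; it only assumes its argument responds to each_line.

```ruby
require "stringio"

# Hypothetical method under test: works on anything that responds to
# each_line, so it does not care whether it gets a File or a StringIO.
def count_todos(io)
  io.each_line.count { |line| line.include?("TODO") }
end

# In a unit test, an in-memory StringIO replaces the real file.
fake_file = StringIO.new("TODO: one\ndone\nTODO: two\n")
todo_count = count_todos(fake_file)
```

In production code the same method would be called as File.open(path) { |fh| count_todos(fh) }.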

On 9 Nov 2009, at 13:54, [email protected] wrote:

(if that’s at all possible).
In a quick test it seems FileString is more complete too, e.g. Mmap
doesn’t have #replace (should be trivial to add). But Mmap has the
feature to only tie a part of the file.

It would probably be fairly trivial for you to directly support mmap
at the OS level using Ruby/DL, Ruby-FFI or even syscall (although
that’s ugly and fragile). Take a look at some of my Plumber’s Guide
presentations at the link in my signature and also at
http://kenai.com/projects/ruby-ffi
for details of how to wrap these kinds of system calls such that
they’ll run identically on JRuby, Rubinius and MRI.

Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net

raise ArgumentError unless @reality.responds_to? :reason

-------- Original Message --------

Date: Tue, 10 Nov 2009 00:28:56 +0900
From: Robert K. [email protected]
To: [email protected]
Subject: Re: FileString - request for comments

I am still trying to wrap my head around the question whether hiding
file IO behind a String API is a good idea. Basically the reason to
create something like this is to be able to use a file in places which
expect to be given a String instance.

No. At least that was not the idea (though, you could).
The reason is that e.g. replacing a part of a file is cumbersome.
Compare:

IO API:

File.open(path, "r+b") do |fh|
  fh.seek(offset + length)
  rest = fh.read
  fh.seek(offset)
  fh.write(replacement)
  fh.write(rest)
  fh.truncate(fh.pos) # drop stale bytes if the replacement is shorter
end

String API:

fs = FileString.new(path)
fs[offset, length] = replacement # done!

Imagine how much more inconvenient it becomes when it’s not offset &
length but a Range, or when you have to accommodate negative offsets etc.
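To illustrate how much bookkeeping the IO route needs once Ranges and negative offsets come into play, here is a rough sketch. The helpers normalize_span and splice_file are hypothetical names, not part of FileString, and the sketch assumes Integer offsets and Ranges with two Integer ends (no endless ranges).

```ruby
# Hypothetical helper: turn a String-style index (Integer offset plus
# length, or a Range) into an absolute [offset, length] pair for a file
# of +size+ bytes.
def normalize_span(size, index, length = 1)
  if index.is_a?(Range)
    start = index.begin
    stop  = index.end
    start += size if start < 0
    stop  += size if stop < 0
    stop  -= 1 if index.exclude_end?
    length = [stop - start + 1, 0].max
  else
    start = index
    start += size if start < 0
  end
  raise IndexError, "index #{index} out of file" if start < 0 || start > size
  [start, [length, size - start].min]
end

# Hypothetical helper: replace the addressed span of the file in place.
def splice_file(path, replacement, index, length = 1)
  File.open(path, "r+b") do |fh|
    offset, len = normalize_span(fh.size, index, length)
    fh.seek(offset + len)
    rest = fh.read
    fh.seek(offset)
    fh.write(replacement)
    fh.write(rest)
    fh.truncate(fh.pos) # the file shrinks if the replacement is shorter
  end
end
```

For example, splice_file(path, "all", -5..-1) replaces the last five bytes of the file, which is what fs[-5..-1] = "all" expresses in one line with the String API.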

And there are other examples; just dive into FileString’s source a bit :)

The String API is far more convenient.

However, code that uses String
assumes fast access to arbitrary portions of the string. When those
accesses are translated into random accesses to a file performance
might suffer dramatically.

Yes. If you get that kind of problem - you can always use File.read
instead of FileString#to_s (or to_str).

Put differently: hiding the fact that we
are dealing with a file is convenient but may actually break your
neck.

As with all high-level things. If you don’t know the things you’re
dealing with, you can easily kill performance. Consider e.g. ary.any? { |obj|
other.include?(obj) } - there, you just accidentally created an O(n^2)
algorithm. It can happen everywhere and it can look totally innocent.
That’s not a problem specific to FileString; it applies to everything
that’s abstract.
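A small sketch of that example and one common way around it. The sample arrays are made up for illustration; the point is only that Array#include? scans the whole array on every call, while a Set lookup is a hash lookup.

```ruby
require "set"

ary   = [3, 8, 15]
other = (1..10).to_a

# Original form: Array#include? walks +other+ once per element of +ary+,
# so the work grows with ary.size * other.size.
slow = ary.any? { |obj| other.include?(obj) }

# Build a Set once; each membership test is then a hash lookup.
lookup = other.to_set
fast = ary.any? { |obj| lookup.include?(obj) }
```

Both expressions give the same answer; only the amount of work differs as the arrays grow.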

And although at a certain level of abstraction a file and a
String are pretty much the same (sequence of chars / bytes) it may
actually be a good thing to keep the API separate in order to treat
both appropriately. Stefan, what’s your experience?

As you see, I disagree :)
However, what you say is of course correct. Using FileString means you
have to keep in mind that you’re dealing with a file.
But: if you know you’re dealing with a file, it can even help you make
things faster. For example, if you indeed want to compare two files for
equality, FileString#== will be faster and less memory intensive than
you doing File.read(a) == File.read(b) if the two files are big.
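One way such a comparison can avoid loading both files is sketched below. This is an assumption about the general technique (size check first, then chunk by chunk), not a copy of how FileString#== is written; same_content? and chunk_size are hypothetical names.

```ruby
# Hypothetical sketch: compare two files without reading either one
# completely into memory.
def same_content?(path_a, path_b, chunk_size = 64 * 1024)
  # Files of different size can never be equal, so no read is needed.
  return false unless File.size(path_a) == File.size(path_b)

  File.open(path_a, "rb") do |a|
    File.open(path_b, "rb") do |b|
      # IO#read(n) returns nil at end of file, which ends the loop.
      while (chunk = a.read(chunk_size))
        return false unless chunk == b.read(chunk_size)
      end
    end
  end
  true
end
```

Memory use stays at roughly two chunks regardless of file size, and the comparison stops at the first difference. Ruby's standard library also offers FileUtils.compare_file for the same job.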

Kind regards

robert

Thanks for your thoughts robert, much appreciated

regards
Stefan

Eleanor McHugh wrote:
[…]

Using a given representation just because it’s unit testing friendly
isn’t necessarily a good idea…

…or necessarily a bad idea. There’s something to be said for letting
architecture emerge from testability.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

On 09.11.2009 17:29, [email protected] wrote:

-------- Original Message --------

From: Robert K. [email protected]
As with all high-level things. If you don’t know the things you’re dealing with, you can easily kill performance. Consider e.g. ary.any? { |obj| other.include?(obj) } - there, you just accidentally created an O(n^2) algorithm. It can happen everywhere and it can look totally innocent.
That’s not a problem specific to FileString; it applies to everything that’s abstract.

True.

And although at a certain level of abstraction a file and a
String are pretty much the same (sequence of chars / bytes) it may
actually be a good thing to keep the API separate in order to treat
both appropriately. Stefan, what’s your experience?

As you see, I disagree :)

However, what you say is of course correct. Using FileString means
you have to keep in mind that you’re dealing with a file.
But: if you know you’re dealing with a file, it can even help you
make things faster. For example, if you indeed want to compare two
files for equality, FileString#== will be faster and less memory
intensive than you doing File.read(a) == File.read(b) if the two
files are big.

A good point! You’re probably right and I was too pessimistic. I’d
love to see

fs[/foo(\w+)/, 1] = "bar"
fs.gsub!(/foo/, "bar")

etc. because those would be the ones that would make FileString
convenient for me. :)

Thanks for your thoughts robert, much appreciated

Thanks for listening and sharing!

Kind regards

robert

On 9 Nov 2009, at 16:49, Ralph S. wrote:

Robert,

RK> I am still trying to wrap my head around the question whether hiding
RK> file IO behind a String API is a good idea.

As the PickAxe book points out, by having file i/o represented by a
String … that is, making it irrelevant whether one is talking to a
String or a File … makes for some nice unit testing.

Using a given representation just because it’s unit testing friendly
isn’t necessarily a good idea…

Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net

raise ArgumentError unless @reality.responds_to? :reason

-------- Original Message --------

Date: Tue, 10 Nov 2009 06:25:08 +0900
From: Robert K. [email protected]
Subject: Re: FileString - request for comments

A good point! You’re probably right and I was too pessimistic. I’d
love to see

fs[/foo(\w+)/, 1] = "bar"

I just noticed that I actually didn’t have that functionality in. I
added it now in the way described in the earlier reply.

Also a small correction of one of my earlier statements (typo):
You can use File.read or FileString#to_s (or to_str) instead of the
FileString instance. FileString#to_s returns the contents of the file.

Regards
Stefan

-------- Original Message --------

Date: Tue, 10 Nov 2009 06:25:08 +0900
From: Robert K. [email protected]
Subject: Re: FileString - request for comments

A good point! You’re probably right and I was too pessimistic. I’d
love to see

fs[/foo(\w+)/, 1] = "bar"
fs.gsub!(/foo/, "bar")

etc. because those would be the ones that would make FileString
convenient for me. :)

Those already exist. Unfortunately, optimizing regex matching is too
involved for me to have done in 24h :)
That means fs[/foo(\w+)/, 1] = "bar" is just more convenient than writing:
data = File.read(path)
data[/foo(\w+)/, 1] = "bar"
File.open(path, "w") { |fh| fh.write(data) }
But I think that’s already quite worth it :)
I mean - that’s just lots of boilerplate.

Thanks for listening and sharing!

Always :D
The listening part has made me change the docs btw.: I now hint at
thinking about performance, and suggest probably just using a string and
writing back when all is done.
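A sketch of that "use a string and write back when all is done" pattern, for the case where many edits hit the same file. The helper edit_file is a hypothetical name and not part of FileString; it assumes the block modifies the string in place.

```ruby
# Hypothetical helper: read the file once, let the block mutate the
# in-memory String as often as it likes, then write back once.
def edit_file(path)
  data = File.binread(path)
  yield data # the block must modify +data+ in place (gsub!, []=, ...)
  File.binwrite(path, data)
  data
end
```

A call such as edit_file(path) { |s| s.gsub!(/foo/, "bar"); s[/baz(\w+)/, 1] = "X" } then costs one read and one write, however many edits the block performs.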

Regards
Stefan

On Monday 09 November 2009 06:54:15 am [email protected] wrote:

Hi Joel

What would the advantage over mmap[1] be? FileString is pure ruby
(right?) and hence more portable, but probably mmap is much more
efficient? Any other tradeoffs?

Interesting, I was looking if a solution existed already and didn’t find
mmap. Yes, FileString is pure ruby and should therefore run on all ruby
implementations. And yes, I’d expect mmap to be more efficient on the other
hand.

I’d have looked for mmap first, knowing the concept from Linux. I’d also
expect that with mmap, you should be able to implement an efficient regex,
though I’m not sure how well gsub! would work, unless you can guarantee the
match is always exactly the length of the target string.

(And for gsub to be efficient, you’d need some fancy copy-on-write stuff,
which would make it that much more difficult to chain them.)
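The same-length case can be sketched with plain file IO, without mmap. The helper replace_same_length! is a hypothetical name; to stay short it reads the file into memory for matching, which a real mmap-based version would avoid, and it simply skips any match whose byte length differs from the replacement.

```ruby
# Hypothetical sketch: overwrite matches in place, but only when the
# replacement has exactly as many bytes as the match, so nothing after
# the match has to move. Returns the number of matches rewritten.
def replace_same_length!(path, pattern, replacement)
  data = File.binread(path) # binary string: match offsets are byte offsets
  count = 0
  File.open(path, "r+b") do |fh|
    data.scan(pattern) do
      match = Regexp.last_match
      next unless match[0].bytesize == replacement.bytesize
      fh.seek(match.begin(0))
      fh.write(replacement)
      count += 1
    end
  end
  count
end
```

As soon as the lengths differ, everything after the match has to be shifted, which is the hard part mentioned above.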

But if you were looking for comments, it looks awesome. Thanks!