Forum: Ruby file.seek and unused bytes

Posted by Greg Willits (-gw-)
on 2009-07-04 00:06
Ruby 1.8.6

(sorry this one takes some setup to explain the context of the question)

I'm using file seek to create equal fixed length rows in a disk file.
The row data itself is not fixed length, but I am forcing the seek
position to be at fixed intervals for the start of each row. So, on disk
it might look something like this:

aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000
bbbbbbbbbbbbbbX000000000000000000000
ccccccccccccccccccccccccccccccccccccccccX00

I'm really only writing the "aaaa..." and "bbbb...." portions of the
rows with an EOL (an X here so it's easy to see).

I have one operation step which uses tail to grab the last line of the
file. When I do that, I get something like this:

000000000000000000000cccccccccccccccccccccccccccccccccccccccc

which is the empty bytes past the EOL of the line "bbb..." plus the
valid data of the following line.

After some fiddling, it became apparent that I can test the value of the
byte against zero to know if there's data in it or not -- if byte_data
== 0 so that I can trim off that leading set of zeros, but I'm not
certain those empty bytes will always be zero.

And finally my question.....

So, my question is about those zeros. Does advancing the seek position
(do we still call that the "cursor"?) intentionally and proactively fill
the unsed bytes with what apparently equates to zero? OR, am I just
getting luck that my test files have used virgin disk space which yields
zeros, and the seek position just skips bytes which potentially would
contain garbage from previously used disk sectors?

Can I count on those unused bytes always being zero?

-- gw
Posted by Greg Willits (-gw-)
on 2009-07-04 00:08
Greg Willits wrote:

> aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000
> bbbbbbbbbbbbbbX000000000000000000000
> ccccccccccccccccccccccccccccccccccccccccX00

Argh. those should look equal length in non-proportional font.

aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000
bbbbbbbbbbbbbbX000000000000000000000
cccccccccccccccccccccccccccccccccX00

-- gw
Posted by Eleanor McHugh (Guest)
on 2009-07-04 01:12
(Received via mailing list)
On 3 Jul 2009, at 23:06, Greg Willits wrote:
> So, my question is about those zeros. Does advancing the seek position
> (do we still call that the "cursor"?) intentionally and proactively  
> fill
> the unsed bytes with what apparently equates to zero? OR, am I just
> getting luck that my test files have used virgin disk space which  
> yields
> zeros, and the seek position just skips bytes which potentially would
> contain garbage from previously used disk sectors?
>
> Can I count on those unused bytes always being zero?

Unfortunately you're getting lucky. A seek adjusts the file pointer
but doesn't write anything to disk so whilst your 'unused' bytes won't
be changing value as a result of writing data to the file unless you
write the full record, you can't rely on them not having a value other
than zero if you don't.

Also you have to consider that zero may itself be a valid data value
within a record :)


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
----
raise ArgumentError unless @reality.responds_to? :reason
Posted by Robert Klemme (Guest)
on 2009-07-04 14:51
(Received via mailing list)
On 04.07.2009 01:11, Eleanor McHugh wrote:
>> Can I count on those unused bytes always being zero?
> 
> Unfortunately you're getting lucky. A seek adjusts the file pointer  
> but doesn't write anything to disk so whilst your 'unused' bytes won't  
> be changing value as a result of writing data to the file unless you  
> write the full record, you can't rely on them not having a value other  
> than zero if you don't.

Actually it should not matter what those bytes are.  Your record format
should make sure that you exactly know how long an individual record is
- as Ellie pointed out:

> Also you have to consider that zero may itself be a valid data value  
> within a record :)

Oh, the details. :-)

Here's another one: if your filesystem supports sparse files and the
holes are big enough (at least spanning more than one complete cluster
or whatever the smallest allocation unit of the filesystem is called)
those bytes might not really come from disk, in which case - when read -
they are usually zeroed.  But I guess this is also implementation 
dependent.

One closing remark: using "tail" to look at a binary file is probably a
bad idea in itself.  "tail" makes certain assumptions about what a line
is (it needs to in order to give you N last lines).  Those assumptions
are usually incorrect when using binary files.

Kind regards

  robert
Posted by Greg Willits (-gw-)
on 2009-07-04 18:53
>>Does advancing the seek position intentionally and proactively fill
>>the unsed bytes with what apparently equates to zero? OR, am I just
>>getting lucky?

> you're getting lucky

Thanks, I figured as much (just thought I'd see if Ruby was any 
different). I've gone ahead and filled empty positions using 
string.ljust.

tail works fine. It's all text data with 10.chr EOLs, and yeah I would 
know whether a 0 is a valid piece of data or not based on the file 
format.

Thanks.
Posted by Brian Candler (candlerb)
on 2009-07-05 11:03
Eleanor McHugh wrote:
>> Can I count on those unused bytes always being zero?
> 
> Unfortunately you're getting lucky. A seek adjusts the file pointer
> but doesn't write anything to disk so whilst your 'unused' bytes won't
> be changing value as a result of writing data to the file unless you
> write the full record, you can't rely on them not having a value other
> than zero if you don't.

I don't believe that's the case today. If it were, then you would have a 
very easy way to examine the contents of unused sectors on the disk - 
which would allow you to see other people's deleted files, passwords 
etc.

It was possible on old mainframe systems in the 80's though :-)

But today, if you extend a file using seek, you should always read 
zeros.
Posted by Greg Willits (-gw-)
on 2009-07-05 19:08
Brian Candler wrote:
> Eleanor McHugh wrote:
>>> Can I count on those unused bytes always being zero?
>> 
>> Unfortunately you're getting lucky. A seek adjusts the file pointer
>> but doesn't write anything to disk so whilst your 'unused' bytes won't
>> be changing value as a result of writing data to the file unless you
>> write the full record, you can't rely on them not having a value other
>> than zero if you don't.
> 
> I don't believe that's the case today. If it were, then you would have a 
> very easy way to examine the contents of unused sectors on the disk - 
> which would allow you to see other people's deleted files, passwords 
> etc.
> 
> It was possible on old mainframe systems in the 80's though :-)

80's micros too with BASIC :-P

> But today, if you extend a file using seek, you should always read 
> zeros.

That makes a great deal of sense, and would be consistent with what I 
was seeing. I was wondering why the values being returned were zeros 
instead of nil or something else.

Either way, I know its a better practice to pack the rows, but I had a 
moment of laziness because I'm dealing will a couple million rows and 
figured if there was some processing time to be saved, I'd take 
advantage of it.

I would have experimented, but I don't know how to ensure that the 
various file contents are in fact being written to the exact same disk 
space.

-- gw
Posted by Gary Wright (Guest)
on 2009-07-05 20:04
(Received via mailing list)
On Jul 3, 2009, at 6:06 PM, Greg Willits wrote:
> So, my question is about those zeros. Does advancing the seek position
> (do we still call that the "cursor"?) intentionally and proactively  
> fill
> the unsed bytes with what apparently equates to zero?

I don't think anyone has answered this question directly but on POSIX-
like file systems a seek past the end of the file and a subsequent
write will cause the intervening bytes (which have never been written)
to read as zeros. Whether those 'holes' occupy disk space or not is
implementation dependent.

Gary Wright
Posted by Greg Willits (-gw-)
on 2009-07-05 20:13
Gary Wright wrote:
> On Jul 3, 2009, at 6:06 PM, Greg Willits wrote:
>> So, my question is about those zeros. Does advancing the seek position
>> (do we still call that the "cursor"?) intentionally and proactively  
>> fill
>> the unsed bytes with what apparently equates to zero?
> 
> I don't think anyone has answered this question directly but on POSIX-
> like file systems a seek past the end of the file and a subsequent
> write will cause the intervening bytes (which have never been written)
> to read as zeros. Whether those 'holes' occupy disk space or not is
> implementation dependent.


If in deed this is a fact (and it's consistent with my observation), 
then I'd say it's worth taking advantage of. I can't find a definitive 
reference to cite though (Pickaxe, The Ruby Way).

-- gw
Posted by Brian Candler (candlerb)
on 2009-07-05 21:59
Greg Willits wrote:
>> I don't think anyone has answered this question directly but on POSIX-
>> like file systems a seek past the end of the file and a subsequent
>> write will cause the intervening bytes (which have never been written)
>> to read as zeros. Whether those 'holes' occupy disk space or not is
>> implementation dependent.
> 
> 
> If in deed this is a fact (and it's consistent with my observation), 
> then I'd say it's worth taking advantage of. I can't find a definitive 
> reference to cite though (Pickaxe, The Ruby Way).

Well, those aren't POSIX references. But from "Advanced Programming in 
the UNIX Environment" by the late great Richard Stevens, pub. 
Addison-Wesley, p53:

"`lseek` only records the current file offset within the kernel - it 
does not cause any I/O to take place. This offset is then used by the 
next read or write operation.

The file's offset can get greater than the file's current size, in which 
case the next `write` to the file will extend the file. This is referred 
to as creating a hole in a file and is allowed. Any bytes in a file that 
have not been written are read back as 0."
Posted by Greg Willits (-gw-)
on 2009-07-05 22:58
Brian Candler wrote:
> Greg Willits wrote:
>>> I don't think anyone has answered this question directly but on POSIX-
>>> like file systems a seek past the end of the file and a subsequent
>>> write will cause the intervening bytes (which have never been written)
>>> to read as zeros. Whether those 'holes' occupy disk space or not is
>>> implementation dependent.
>> 
>> 
>> If in deed this is a fact (and it's consistent with my observation), 
>> then I'd say it's worth taking advantage of. I can't find a definitive 
>> reference to cite though (Pickaxe, The Ruby Way).
> 
> Well, those aren't POSIX references. But from "Advanced Programming in 
> the UNIX Environment" by the late great Richard Stevens, pub. 
> Addison-Wesley, p53:
> 
> "`lseek` only records the current file offset within the kernel - it 
> does not cause any I/O to take place. This offset is then used by the 
> next read or write operation.
> 
> The file's offset can get greater than the file's current size, in which 
> case the next `write` to the file will extend the file. This is referred 
> to as creating a hole in a file and is allowed. Any bytes in a file that 
> have not been written are read back as 0."


I see, you guys are saying it's an OS-level detail, not a Ruby-specfic 
detail.

It seems though that any hole in the file must be written to. Otherwise 
the file format itself must keep track of every byte that it has written 
to or not in order to have a write-nothing / read-as-zero capability. 
This would seem to be very inefficient overhead.

Hmm... duh, I can bust out the hex editor and have a look.

<pause>

OK, well, empty bytes created by extending the filesize of a new file 
are 0.chr not an ASCII zero character (well, at least according to the 
hex editor app). That could simply be the absence of data from virgin 
disk space. I suppose, that absence of data could be interpreted however 
the app wants, so the hex editor says it is 0.chr and the POSIX code 
says it is 48.chr.

Still though, since the file isn't being filled with the data that is 
provided by the read-back, that still confuses me. How does the read 
know to convert those particular NULL values into ASCII zeros vs a NULL 
byte I write on purpose? And it still doesn't really confirm what would 
happen when non-virgin disk space is being written to.

Hrrmmm. :-\

Thanks for the discussion so far.

-- gw
Posted by Brian Candler (candlerb)
on 2009-07-05 23:30
Greg Willits wrote:
> It seems though that any hole in the file must be written to. Otherwise 
> the file format itself must keep track of every byte that it has written 
> to or not in order to have a write-nothing / read-as-zero capability. 

Unless you seek over entire blocks, in which case the filesystem can 
create a "sparse" file with entirely missing blocks (i.e. the disk usage 
reported by du can be much less than the file size)

When you read any of these blocks, you will see all zero bytes.

> Hmm... duh, I can bust out the hex editor and have a look.
> 
> <pause>
> 
> OK, well, empty bytes created by extending the filesize of a new file 
> are 0.chr not an ASCII zero character (well, at least according to the 
> hex editor app). That could simply be the absence of data from virgin 
> disk space. I suppose, that absence of data could be interpreted however 
> the app wants, so the hex editor says it is 0.chr and the POSIX code 
> says it is 48.chr.

No, POSIX says it is a zero byte (character \0, \x00, byte value 0, 
binary 00000000, ASCII NUL, however you want to think of it)
Posted by Greg Willits (-gw-)
on 2009-07-06 00:12
Brian Candler wrote:

>> It seems though that any hole in the file must be written to. Otherwise 
>> the file format itself must keep track of every byte that it has written 
>> to or not in order to have a write-nothing / read-as-zero capability. 

>Unless you seek over entire blocks, in which case the filesystem can 
>create a "sparse" file with entirely missing blocks (i.e. the disk usage 
>reported by du can be much less than the file size)

>When you read any of these blocks, you will see all zero bytes.

OK. But the file system doesn't keep track of aything smaller than the 
block, right? So, it's not keeping track of the misc individual holes 
created by each extension of the seek (?).


> No, POSIX says it is a zero byte (character \0, \x00, byte value 0, 
> binary 00000000, ASCII NUL, however you want to think of it)

Doh! My zeros are coming from a step in my process which includes 
converting this particular data chunk to integers which I was 
forgetting. And nil.to_i will generate a zero. So, my bad; that detail 
is cleared up.

The only thing I'm still not real clear on is....

 - file X gets written to disk block 999 -- the data is a stream of 200 
contiguous "A" characters

 - file X gets deleted (which AFAIK only deletes the directory entry, 
and does not null-out the file data unless the OS has been told to do 
just that with a "secure delete" operation)

 - file Y gets written to disk block 999 -- the data has holes in it 
from extending the seek position

Generally, I wouldn't read in the holes, but I have this one little step 
that does end up with some holes, and I know it. What I don't know is 
what to expect in those holes. Null values or, garbage "A' characters 
left over from file X.

Logically I would expect garbage data, but the literal impact of 
paragraphs quoted earlier from the Unix book above indicates I should 
expect null values. I can't think of any tools I have that would enable 
me to test this.

Because I don't know, I've gone ahead and packed the holes with a known 
character. However, if I can avoid that I want to because it sucks up 
some time I'd like to avoid in large files, but it's not super critical.

At this point I'm more curious than anything. I appreciate the dialog.

-- gw

Posted by Gary Wright (Guest)
on 2009-07-06 00:44
(Received via mailing list)
On Jul 5, 2009, at 6:13 PM, Greg Willits wrote:
>
> Generally, I wouldn't read in the holes, but I have this one little  
> step
> that does end up with some holes, and I know it. What I don't know is
> what to expect in those holes. Null values or, garbage "A' characters
> left over from file X.

You should expect null bytes (at least on Posix-like file systems).
I'm not sure why you are doubting this.

 From the Open Group Base Specification description of lseek:
<http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html>:
> The lseek() function shall allow the file offset to be set beyond  
> the end of the existing data in the file. If data is later written  
> at this point, subsequent reads of data in the gap shall return  
> bytes with the value 0 until data is actually written into the gap.
>

Gary Wright
Posted by Greg Willits (-gw-)
on 2009-07-06 01:36
Gary Wright wrote:
> On Jul 5, 2009, at 6:13 PM, Greg Willits wrote:
>>
>> Generally, I wouldn't read in the holes, but I have this one little  
>> step
>> that does end up with some holes, and I know it. What I don't know is
>> what to expect in those holes. Null values or, garbage "A' characters
>> left over from file X.
> 
> You should expect null bytes (at least on Posix-like file systems).
> I'm not sure why you are doubting this.

I wasn't separating the read from the write. The spec talks about 
reading zeros but doesn't talk about writing them. I wasn't trusting 
that the nulls were getting written. I think I get it now that the read 
is what matters. Whether a null/zero got written, or whether the gaps 
are accounted for in some other way, is independent of the data the read 
returns.

I still don't see where the nulls come from (if they're not being 
written), but if the rules allow me to expect nulls/zeros, and those 
gaps are being accounted for somewhere/somehow then that's what matters.

-- gw
Posted by Brian Candler (candlerb)
on 2009-07-06 07:54
Greg Willits wrote:
> I still don't see where the nulls come from (if they're not being 
> written)

All disk I/O is done in terms of whole blocks (typically 1K)

Whenever the filesystem adds a new block to a file, insteading of 
reading the existing contents into the VFS cache it just zero-fills a 
block in the VFS cache. A write to an offset then updates that block and 
marks it 'dirty'. The entire block will then at some point get written 
back to disk, including of course any of the zeros which were not 
overwritten with user data.
Posted by Robert Klemme (Guest)
on 2009-07-06 08:36
(Received via mailing list)
On 06.07.2009 00:13, Greg Willits wrote:

> Generally, I wouldn't read in the holes, but I have this one little step 
> that does end up with some holes, and I know it. What I don't know is 
> what to expect in those holes. Null values or, garbage "A' characters 
> left over from file X.
> 
> Logically I would expect garbage data, but the literal impact of 
> paragraphs quoted earlier from the Unix book above indicates I should 
> expect null values. I can't think of any tools I have that would enable 
> me to test this.

I would not expect anything in those bytes for the simple reason that
this reduces portability of your program.  If anything the whole
discussion has shown that apparently there are (or were) different
approaches to handling this (including return of old data which should
not happen any more nowadays).

> Because I don't know, I've gone ahead and packed the holes with a known 
> character. However, if I can avoid that I want to because it sucks up 
> some time I'd like to avoid in large files, but it's not super critical.
> 
> At this point I'm more curious than anything. I appreciate the dialog.

I stick to the point I made earlier: if you need particular data to be
present in the slack of your records you need to make sure it's there.
Since your IO is done block wise and you probably aligned your offsets
with block boundaries anyway there should not be a noticeable difference
in IO.  You probably need a bit more CPU time to generate that data but
that's probably negligible in light of the disk IO overhead.

If you want to save yourself that effort you should probably make sure
that your record format allows for easy separation of the data and slack
area.  There are various well established practices, for example
preceding the data area with a length indicator or terminating data with
a special marker byte.

My 0.02 EUR.

Kind regards

  robert
Posted by Greg Willits (-gw-)
on 2009-07-06 09:44
Brian Candler wrote:
> Greg Willits wrote:
>> I still don't see where the nulls come from (if they're not being 
>> written)
> 
> All disk I/O is done in terms of whole blocks (typically 1K)
> 
> Whenever the filesystem adds a new block to a file, insteading of 
> reading the existing contents into the VFS cache it just zero-fills a 
> block in the VFS cache. A write to an offset then updates that block and 
> marks it 'dirty'. The entire block will then at some point get written 
> back to disk, including of course any of the zeros which were not 
> overwritten with user data.

Ah. That's what I was looking for. thanks.
Posted by Greg Willits (-gw-)
on 2009-07-06 10:17
Robert Klemme wrote:
> On 06.07.2009 00:13, Greg Willits wrote:
> 
>> Generally, I wouldn't read in the holes, but I have this one little step 
>> that does end up with some holes, and I know it. What I don't know is 
>> what to expect in those holes. Null values or, garbage "A' characters 
>> left over from file X.
>> 
>> Logically I would expect garbage data, but the literal impact of 
>> paragraphs quoted earlier from the Unix book above indicates I should 
>> expect null values. I can't think of any tools I have that would enable 
>> me to test this.
> 
> I would not expect anything in those bytes for the simple reason that
> this reduces portability of your program.

Understood. In this case, I'm making a concious decision to go with 
whatever is faster. I've already written the code so that it is easy to 
add back in the packing if it's ever needed.

We're working with large data sets for aggregation which takes a long 
time to run, and second only to the ease and clarity of the top level 
DSL, is the speed of the aggregation process itself so we can afford to 
do more analysis.


>> Because I don't know, I've gone ahead and packed the holes with a known 
>> character. However, if I can avoid that I want to because it sucks up 
>> some time I'd like to avoid in large files, but it's not super critical.
>> 
>> At this point I'm more curious than anything. I appreciate the dialog.
> 
> should probably make sure
> that your record format allows for easy separation of the data and slack
> area.  There are various well established practices, for example
> preceding the data area with a length indicator or terminating data with
> a special marker byte.

Yep, already done that. Where this 'holes' business comes in, is that to 
stay below the 4GB limit, the data has to be processed and the file 
written out in chunks. Each chunk may have a unique line length. So, we 
find the longest line of the chunk, and write records at that interval 
using seek. Each record terminates with a line feed.

Since we don't know the standard length of each chunk until processing 
is done (and the file has already een started), a set of the lengths is 
added to the end of the file instead of the beginning.

When reading data, the fastest way to get the last line which has my 
line lengths, is to use tail. This  returns a string starting from the 
last record's EOL marker to the EOF. This "line" has the potential 
(likelihood) to include the empty bytes of the last record in front of 
the actual I want because of how tail interprets "lines" between EOL 
markers. I need to strip those empty bytes from the start of the line 
before I get to the line lengths data.

Every other aspect of the file uses the common approach of lines with 
#00 between fields and #10 at the end of the data, followed by zero or 
more fill characters to make each row an equal length of bytes.

-- gw





Posted by Brian Candler (candlerb)
on 2009-07-06 10:48
Greg Willits wrote:
> Yep, already done that. Where this 'holes' business comes in, is that to 
> stay below the 4GB limit, the data has to be processed and the file 
> written out in chunks. Each chunk may have a unique line length. So, we 
> find the longest line of the chunk, and write records at that interval 
> using seek. Each record terminates with a line feed.

To me, this approach smells. For example, it could have *really* bad 
disk usage if one record in your file is much larger than all the 
others.

Is the reason for this fixed-space padding just so that you can jump 
directly to record number N in the file, by calculating its offset?

If so, it sounds to me like what you really want is cdb:
http://cr.yp.to/cdb.html

You emit key/value records of the form

+1,50:1->(50 byte record)
+1,70:2->(70 byte record)
+1,60:3->(60 byte record)
...
+2,500:10->(500 byte record)
... etc

then pipe it into cdbmake. The resulting file is built, in a single 
pass, with a hash index, allowing you to jump to record with key 'x' 
instantly.

There's a nice and simple ruby-cdb library available, which wraps djb's 
cdb library.

Of course, with cdb you're not limited to integers as the key to locate 
the records, nor do they have to be in sequence. Any unique key string 
will do - consider it like an on-disk frozen Hash. (The key doesn't have 
to be unique actually, but then when you search for key K you would ask 
for all records matching this key)
Posted by Robert Klemme (Guest)
on 2009-07-06 11:17
(Received via mailing list)
2009/7/6 Greg Willits <lists@gregwillits.ws>:
> Robert Klemme wrote:

> We're working with large data sets for aggregation which takes a long
> time to run, and second only to the ease and clarity of the top level
> DSL, is the speed of the aggregation process itself so we can afford to
> do more analysis.

Did you actually measure significant differences in time or are you
just assuming there is a significant impact because you write less and
have to do less processing?

>> a special marker byte.
>
> Yep, already done that. Where this 'holes' business comes in, is that to
> stay below the 4GB limit, the data has to be processed and the file
> written out in chunks. Each chunk may have a unique line length. So, we
> find the longest line of the chunk, and write records at that interval
> using seek. Each record terminates with a line feed.

Errr, I am not sure I fully understand your approach.  What you write
sounds like if you end up with a file containing multiple sections
which each have lines with identical length.  So a file with two
sections could look like this

aaN0000
aaaN000
aN00000
aaaaaaN
aaaaN00
bN00
bbbN
bbN0
bbN0

Basically you are combining two approaches in one file: fixed length
records and variable length records with termination marker.  That
sounds odd to me.  If file size matters then I do not understand why
you do not just write out the file like a regular text file, i.e. only
use the line termination approach.

> Since we don't know the standard length of each chunk until processing
> is done (and the file has already een started), a set of the lengths is
> added to the end of the file instead of the beginning.
>
> When reading data, the fastest way to get the last line which has my
> line lengths, is to use tail.

Why don't you open the file, seek to N bytes before the end and read
them?  You do not need tail for this and you also have all the file
handling in your Ruby program.

> Every other aspect of the file uses the common approach of lines with
> #00 between fields and #10 at the end of the data, followed by zero or
> more fill characters to make each row an equal length of bytes.

It seems either I am missing something or you are doing something
weird for which I do not understand the reason.  Can you shade some
more light on the nature of the processing and why you follow this
approach?  That would be a wonderful completion of the discussion.

Kind regards

robert
Posted by Brian Candler (candlerb)
on 2009-07-06 12:00
Robert Klemme wrote:
>> Yep, already done that. Where this 'holes' business comes in, is that to
>> stay below the 4GB limit, the data has to be processed and the file
>> written out in chunks. Each chunk may have a unique line length. So, we
>> find the longest line of the chunk, and write records at that interval
>> using seek. Each record terminates with a line feed.
> 
> Errr, I am not sure I fully understand your approach.  What you write
> sounds like if you end up with a file containing multiple sections
> which each have lines with identical length.

My reading of the description was that each "chunk" was a separate file, 
and so all the records within that file had the same line length, but I 
could be wrong.
Posted by Robert Klemme (Guest)
on 2009-07-06 12:55
(Received via mailing list)
2009/7/6 Brian Candler <b.candler@pobox.com>:
>
> My reading of the description was that each "chunk" was a separate file,
> and so all the records within that file had the same line length, but I
> could be wrong.

But he writes "the file" which makes me think that there is only one
output file.  Also, the 4GB limit seems to indicate that we are
talking about a single large file.

Greg, help!

;-)

Cheers

robert
Posted by Brian Candler (candlerb)
on 2009-07-06 13:03
Brian Candler wrote:
> If so, it sounds to me like what you really want is cdb:
> http://cr.yp.to/cdb.html
> 
> You emit key/value records of the form
> 
> +1,50:1->(50 byte record)
> +1,70:2->(70 byte record)
> +1,60:3->(60 byte record)
> ...
> +2,500:10->(500 byte record)
> ... etc
> 
> then pipe it into cdbmake. The resulting file is built, in a single 
> pass, with a hash index, allowing you to jump to record with key 'x' 
> instantly.

I dug out some old CDB-building code; it's actually very simple with 
ruby-cdb and no piping is required. Here's some code which takes 
table-separated key-value pairs file and builds a cdb:

def cdbmake(src, dest=src+".cdb", mode=0444, uid=0, gid=0)
  require 'cdb'
  File.open(src) do |f1|
    File.delete(file+".tmp") rescue nil
    CDBMake.open(src+".tmp", mode) do |f2|
      f1.each_line do |line|
        line.chomp!
        next if line =~ /^#/ or line =~ /^\s*$/
        if line =~ /^([^\t]*)(\t(.*))?$/
          f2.add($1, $3 || "")
        else
          puts "#{src}: Bad line #{line.inspect} ignored"
        end
      end
    end
  end
  File.rename(src+".tmp", dest)
  File.chown(uid, gid, dest)
rescue
  File.delete(src+".tmp") rescue nil
  raise
end

This used http://raa.ruby-lang.org/project/cdb/ which unfortunately is 
not available as a gem, but it does bundle the cdb source for you.

I see there's now a cdb gem. To install this you need the cdb 
development libraries already installed on your system. Under Ubuntu all 
I needed was:

  sudo apt-get install libcdb-dev
  sudo gem install cdb

Strangely I found it installed all files mode 600. The fix:

  find /usr/lib/ruby/gems/1.8/gems/cdb-0.0.2 -type f | xargs sudo chmod 
644

Anyway, the README shows quite clearly how to use it.
Posted by Greg Willits (-gw-)
on 2009-07-06 14:41
Eeek. Opened a can of worms!

It's 5:30 am where I am (I'm "still up" and not "up early"), so I'll hit 
the highlights, and detail more later if for some reason there's an 
interest.

-- application: data aggregation

-- I get data from county services and school districts. They have no 
way to correlate and aggregate their data. That's what I am doing. I 
match kids that are flagged in county services for whatever reason and 
have to match them up with school records, health records, judicial 
records where applicable, etc.

-- I get dozens of CSV files of various subjects. Every school 
district's CSV file for any subject (attendance, grades, etc) has 
similar content but not identical. Ditto for demographics from various 
agencies.

-- before I can even attempt aggregation, I need to normalize all this 
data so it can be indexed and  analyzed, and I ensure the data is 
cleaned and standardized for display.

-- there's a top layer DSL where I describe data sources and how to 
transform any one particular raw data file into a normalized data file. 
There another set of descriptions to allow any one field to come from a 
number of possible sources. So, something like birthcity might come from 
data source X, but if not available, check data source Y, etc.

-- this process has been abstracted into a data aggregation core to do 
the normalization, the indexing, and other tasks, with an 
application-specific layer to handle the definition of transformations, 
indexes that are needed, order of precedence, and the stitching of data 
from various sources into records.

-- So, this particular step I've been talking about is where a raw CSV 
file undergoes normalization by reorgnizing the fields of each record 
into a common structure for that given data topic, each field undergoes 
some scrubbing (character case, packing phones, normalizing date 
formats, translation of codes into strings, etc).

-- raw data files range from a handful of columns to a couple dozen 
columns. From a few hundred rows to a couple million rows.

-- data is an array of hashes

-- by the time we get done normalizing a particular raw source, it can 
hit the 4GB memory limit any one ruby fork has available to play with 
(many forks run in parallel on multiple cores)

-- while most CSV files work out just fine reading into RAM, 
transforming them 100% in RAM, and then writing in a simple single step, 
many files do not.

-- so we have updated the process to load X records from the raw file, 
tranform X records in memory, then write X records to disk, and loop 
untill all records are done.

-- and we have to deal with indexes, duplicates, and other issues in 
there as well

-- imagine 2,000,000 raw records from one file which get processed in 
200,000 record chunks, but output back to another single file.

-- as I step through each chunk of 200,000 records, I can get the 
longest length of that 200,000, and I can store that, but I can't know 
what the longest length is for the next 200,000 that I haven't loaded 
yet.

-- having processed and written the first 200,000 results to disk, and 
then determining the length of the second 200,000, I'm not going to go 
back and change the first 200,000 just to make them all the same. 
there's no value in that at all.

-- So, when I get done with each 200,000 chunk, I now have a series of 
integers which tells me the length of the records in each chunk of 
200,000 rows.

-- the file has already been written, so again, I'm not going to go back 
and move everything to insert this data at the top (which is where I 
would put if indeed every record was the same length)

-- so, I put this data at the end.

-- BTW, the rows lengths are similiar enough that the disk efficiency is 
not an issue.

-- I read this last line using tail, and I strip off the leading empty 
bytes (if any) as I described earlier

-- it's a couple of very simple calculation to convert any "index" 
position into the exact byte seek position to find a specific record.

-- from this point on, the records are read as random access from disk 
because otherwise I would need oodles of GB of data in RAM all at once 
during the aggregation process.

-- is doing 200,000 or even 500,000 at a time in chunks really any 
faster than doing them one at a time -- that I actually don't know yet, 
I am just now finishing all the code that this "chunking" touches and 
ensure I get the same results I used to. the size of the chunks isn't as 
important for speed as it is for memory management -- making sure I stay 
within 4GB.

-- as for the speed issues, we've done a lot of profiling, and even 
wrote a mini compiler to read our file tranformation DSL and output a 
stream of inline variable declarations and commands whichs gets included 
as a module on the fly for each data source. That trick saved us from 
parsing the DSL for each data row and literally shaved hours off the 
total processing time. We attacked many other levels of optimization 
while working to keep the code as readable as possible, because it's a 
complicated layering of abstractions and processes.

-- I will look into cdb

-- gw



Posted by Brian Candler (candlerb)
on 2009-07-06 15:14
Greg Willits wrote:
> Eeek. Opened a can of worms!
...
> -- by the time we get done normalizing a particular raw source, it can 
> hit the 4GB memory limit any one ruby fork has available to play with 
> (many forks run in parallel on multiple cores)

Ah, so this is a 4GB process limit not file limit.

> -- imagine 2,000,000 raw records from one file which get processed in 
> 200,000 record chunks, but output back to another single file.
> 
> -- as I step through each chunk of 200,000 records, I can get the 
> longest length of that 200,000, and I can store that, but I can't know 
> what the longest length is for the next 200,000 that I haven't loaded 
> yet.

Standard "external processing", this is fine.

> -- it's a couple of very simple calculation to convert any "index" 
> position into the exact byte seek position to find a specific record.

OK, so you want your output file to have a property which CSV files 
don't, namely that you can jump to line N with a direct seek.

> -- I will look into cdb

Here's another worm for your can: CouchDB :-)

Output each normalised CSV row as a JSON document, then it will build 
arbitrary indexes out of that, defined using Javascript (or Ruby or 
...).

Unfortunately, building the indexes isn't as fast as it should be yet, 
and I note you have a lot of hand-optimised code. But if you ever want 
to be able to search your big files by content of a field, rather than 
just jump to the N'th record, this might be your saviour.

Maybe mongodb would be even faster, but I've not played with it yet.

Regards,

Brian.
Posted by Robert Klemme (Guest)
on 2009-07-06 20:11
(Received via mailing list)
On 06.07.2009 14:41, Greg Willits wrote:

> -- is doing 200,000 or even 500,000 at a time in chunks really any 
> faster than doing them one at a time -- that I actually don't know yet, 
> I am just now finishing all the code that this "chunking" touches and 
> ensure I get the same results I used to. the size of the chunks isn't as 
> important for speed as it is for memory management -- making sure I stay 
> within 4GB.

 From my experience doing one at a time is efficient enough.  Of course
you then cannot do the line length math.  But, for that I'd probably
consider doing something different: What about storing all file offsets
in an Array and write it to a file "<orig>.idx" via Marshal.  That way
you do not need fill bytes, your files are smaller and you can process
one line at a time.  Reading a single entry is then easy: just slurp in
the index in one go and seek to index[record_index].

robert@fussel ~
$ time ruby19 -e 'h=(1..1_000_000).map {|i| i << 6};File.open("x","wb")
{|io| Marshal.dump(h,io)}'

real    0m1.848s
user    0m1.483s
sys     0m0.249s

robert@fussel ~
$ ls -l x
-rw-r--r-- 1 robert Kein 5736837 Jul  6 20:01 x

Apart from that this will reduce memory usage of individual processes
and you might be able to better utilize your cores.  Dumping via Marshal
is pretty fast and the memory overhead of that single index array is not
too big.

Alternatively you can write the index file while you are writing the
main data file.  You just need to fix the number of bits you reserve for
each file offset.  Then the read operation can be done via two seek
operations (first on the index, then on the data file) if you do not
cache the index.

> -- as for the speed issues, we've done a lot of profiling, and even 
> wrote a mini compiler to read our file tranformation DSL and output a 
> stream of inline variable declarations and commands whichs gets included 
> as a module on the fly for each data source. That trick saved us from 
> parsing the DSL for each data row and literally shaved hours off the 
> total processing time. We attacked many other levels of optimization 
> while working to keep the code as readable as possible, because it's a 
> complicated layering of abstractions and processes.

Did you consider to make your dsl generate the code or is that what you
are doing?

Kind regards

  robert
Posted by Thomas Chust (Guest)
on 2009-07-06 20:22
(Received via mailing list)
2009/7/6 Greg Willits <lists@gregwillits.ws>:
> [...]
> -- So, this particular step I've been talking about is where a raw CSV
> file undergoes normalization by reorgnizing the fields of each record
> into a common structure for that given data topic, each field undergoes
> some scrubbing (character case, packing phones, normalizing date
> formats, translation of codes into strings, etc).
> [...]

Hello,

reading this I wonder why you don't dump the records into a database
at this point. For example SQLite3 would be lightweight, fast and
flexible. You could probably save a lot of effort not having to build
index structures manually and you could query your dataset using SQL
or the ORM of your choice afterwards...

cu,
Thomas
Posted by Greg Willits (-gw-)
on 2009-07-06 21:47
> reading this I wonder why you don't dump the records into a database
> at this point. For example SQLite3 would be lightweight, fast and
> flexible. You could probably save a lot of effort not having to build
> index structures manually and you could query your dataset using SQL
> or the ORM of your choice afterwards...

That would actually require a much larger rewrite than what I've done. 
It wasn't too big a leap to change the APIs from "here's all the 
records" to "here's some of the records." A few order of operations had 
to be juggled, but overall it's not a major logical leap, nor a major 
implementation leap. That probably goes for all the other DB options as 
well.

At this point, the change from all-in-RAM to chunks-in-RAM has been a 
relatively straight forward retrofit, and based on previsou versions 
which did use a DB, I think this is still faster or as fast. Another 
option we will explore is 64-bit ruby (which I started another thread 
for a few days ago).

>What about storing all file offsets
>in an Array and write it to a file

That's a possibility that would be an easy retrofit and probably w/o 
breaking the current interfaces. It would use a little more RAM, but 
would eliminate calculating that offset each time during reads. That's 
not a real pinch point, but worth an experiment just because it's a 
simpler concept to recognize.

> Did you consider to make your dsl generate the code 
> or is that what you are doing?

If I interpret the question correctly, that's what we're doing. I may be 
using "DSL" a little loosely here though. It's more like an "active 
config" language. I write setup files that are a mixture of k-v pairs 
for certain things, and expressions for other things. All of that gets 
processed at the start of every aggregation run which results in a 
number of Ruby Module files being created which get included 
dynamically.

Many thanks for all the thoughts & feedback.

-- gw

Posted by Robert Klemme (Guest)
on 2009-07-06 22:00
(Received via mailing list)
On 06.07.2009 21:47, Greg Willits wrote:

>> What about storing all file offsets
>> in an Array and write it to a file
> 
> That's a possibility that would be an easy retrofit and probably w/o 
> breaking the current interfaces. It would use a little more RAM, but 
> would eliminate calculating that offset each time during reads. That's 
> not a real pinch point, but worth an experiment just because it's a 
> simpler concept to recognize.

I'd be interested to learn the outcome of that exercise.

>> Did you consider to make your dsl generate the code 
>> or is that what you are doing?
> 
> If I interpret the question correctly, that's what we're doing. I may be 
> using "DSL" a little loosely here though. It's more like an "active 
> config" language. I write setup files that are a mixture of k-v pairs 
> for certain things, and expressions for other things. All of that gets 
> processed at the start of every aggregation run which results in a 
> number of Ruby Module files being created which get included 
> dynamically.

Only that you do not need to write this into files but you can do that
in memory with eval.  Anyway, that's just a detail. :-)

> Many thanks for all the thoughts & feedback.

You're welcome!

Kind regards

  robert
Posted by Greg Willits (-gw-)
on 2009-07-07 00:08
Robert Klemme wrote:
> On 06.07.2009 21:47, Greg Willits wrote:
> 
>>> What about storing all file offsets
>>> in an Array and write it to a file
>> 
>> That's a possibility that would be an easy retrofit and probably w/o 
>> breaking the current interfaces. It would use a little more RAM, but 
>> would eliminate calculating that offset each time during reads. That's 
>> not a real pinch point, but worth an experiment just because it's a 
>> simpler concept to recognize.
> 
> I'd be interested to learn the outcome of that exercise.


In my original system, I used the row lengths and the number of rows per 
"set" to pre-calculate a lookup of offsets into the start of each set. 
Upon recieving a request for record 623,456 I first had to determine 
which set that would be in (4th set based on 200,000 per set). So I'd 
use a floor calc to determine that. Then subtract that factor of rows 
from the 623,456, then use the basic row_length multiplier. So, there 
were a few calcs involved.

It looked like this:

set_indx      = (row_number.to_f / @set_size.to_f).floor
set_sub_indx  = row_number - (@set_size * set_indx)
record_start  = @set_offsets[set_indx] + (set_sub_indx * 
@record_lengths[set_indx])

By saving the row-specific offsets as an array, that is simplified to :

record_start = @record_offsets[row_number]

This turned out to make the total process of fetching rows 15% faster. 
(roughly 32 seconds vs 38 seconds for reading 1.4 million rows on my dev 
system).

I already index the data itself using hashes (which work very, very 
fast) for aggregation lookups, so this concept is quite parallel to 
other ways the code works, and is a worthwhile change to make -- so 
thanks for that idea.

-- gw

Posted by James Gray (bbazzarrakk)
on 2009-07-07 00:44
(Received via mailing list)
On Jul 6, 2009, at 5:08 PM, Greg Willits wrote:

> In my original system, I used the row lengths and the number of rows  
> per
> "set" to pre-calculate a lookup of offsets into the start of each set.
> Upon recieving a request for record 623,456 I first had to determine
> which set that would be in (4th set based on 200,000 per set).

> By saving the row-specific offsets as an array, that is simplified  
> to :
>
> record_start = @record_offsets[row_number]
>
> This turned out to make the total process of fetching rows 15% faster.
> (roughly 32 seconds vs 38 seconds for reading 1.4 million rows on my  
> dev
> system).

> I already index the data itself using hashes (which work very, very
> fast) for aggregation lookups, so this concept is quite parallel to
> other ways the code works, and is a worthwhile change to make -- so
> thanks for that idea.

The more I read in this thread, the more I feel it's crying out for a
database.  I recognize that you disagree with this, I just disagree
with your disagreement.  In a totally friendly way, of course.  :)

Originally I too was thinking of the already suggested SQLite, but now
I'm thinking that not right.  My current feeling is that Redis would
be a terrific fit:

http://code.google.com/p/redis/

That's just my two bits.  I promise to let the database argument lie
now.

James Edward Gray II
Posted by Joel VanderWerf (Guest)
on 2009-07-07 01:02
(Received via mailing list)
James Gray wrote:
> http://code.google.com/p/redis/

Does this install as a gem for you? I get this...

$ gem install redis
Successfully installed redis-0.0.1
1 gem installed
Installing ri documentation for redis-0.0.1...

...

and then rdoc fails with:
invalid argument: --format=html

and the gem dir redis-0.0.1 is empty, and

Even disabling rdoc:

$ gem install redis --no-rdoc --no-ri
Successfully installed redis-0.0.1
1 gem installed

results in an empty gem dir.
Posted by Ezra Zygmuntowicz (Guest)
on 2009-07-07 01:08
(Received via mailing list)
On Jul 6, 2009, at 4:01 PM, Joel VanderWerf wrote:

> ...
> 1 gem installed
>
> results in an empty gem dir.
>
> --  
>      vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
>



Get redis from github as far as I know there is no redis gem on
rubyforge yet(i wrote the ruby client)

git clone git://github.com/ezmobius/redis-rb.git

cd redis-rb

rake gem
sudo gem install pkg/redis***.gem

rake redis:install
rake redis:start


Cheers-

Ezra Zygmuntowicz
ez@engineyard.com
Posted by Joel VanderWerf (Guest)
on 2009-07-07 01:18
(Received via mailing list)
Ezra Zygmuntowicz wrote:
> rake redis:install
> rake redis:start

Ok until this point:

$ rake redis:start
(in /home/vjoel/ruby/src/db/redis-rb)
Detach with Ctrl+\  Re-attach with rake redis:attach
rake aborted!
No such file or directory - dtach -A /tmp/redis.dtach redis-server
/etc/redis.conf

Is dtach something I need to install separately? (kubuntu 8.04 here)
Posted by Ezra Zygmuntowicz (Guest)
on 2009-07-07 01:25
(Received via mailing list)
On Jul 6, 2009, at 4:17 PM, Joel VanderWerf wrote:

> Ok until this point:
>
> $ rake redis:start
> (in /home/vjoel/ruby/src/db/redis-rb)
> Detach with Ctrl+\  Re-attach with rake redis:attach
> rake aborted!
> No such file or directory - dtach -A /tmp/redis.dtach redis-server / 
> etc/redis.conf
>
> Is dtach something I need to install separately? (kubuntu 8.04 here)

Yeah sorry dtach is another piece of software to install, you can get
it form darwin ports or your package manager. But that is just a
convenience wrapper for starting redis, you can start redis yourself
by just checking it out and running the ./redis-server  command

Cheers-

Ezra Zygmuntowicz
ez@engineyard.com
Posted by Greg Willits (-gw-)
on 2009-07-07 01:51
James Gray wrote:
> The more I read in this thread, the more I feel it's crying out for a
> database.  I recognize that you disagree with this, I just disagree
> with your disagreement.  In a totally friendly way, of course.  :)

I think if you saw the whole thing you'd see why the db is more of a 
hinderance than an assett. I've described a small piece of it, and not 
the piece that has the most compelling reasons to avoid a db -- at least 
a traditional db. Some of these other ones that have brought up may be a 
different story.

-- gw
Posted by Robert Klemme (Guest)
on 2009-07-07 09:05
(Received via mailing list)
2009/7/7 Greg Willits <lists@gregwillits.ws>:
> James Gray wrote:
>> The more I read in this thread, the more I feel it's crying out for a
>> database.  I recognize that you disagree with this, I just disagree
>> with your disagreement.  In a totally friendly way, of course.  :)
>
> I think if you saw the whole thing you'd see why the db is more of a
> hinderance than an assett. I've described a small piece of it, and not
> the piece that has the most compelling reasons to avoid a db -- at least
> a traditional db. Some of these other ones that have brought up may be a
> different story.

I have to admit that I thought of a database as well.  Keywords that
especially triggered this were: index, aggregation, huge, speed.  You
make it really sound as if we are talking about a huge beast of which
we are getting just a tiny glimpse. :-)

Kind regards

robert
Posted by Greg Willits (-gw-)
on 2009-07-07 19:10
Robert Klemme wrote:
> 2009/7/7 Greg Willits <lists@gregwillits.ws>:
>> James Gray wrote:
>>> The more I read in this thread, the more I feel it's crying out for a
>>> database
>>
>> I think if you saw the whole thing you'd see why the db is more of a
>> hinderance than an assett. I've described a small piece of it, and not
>> the piece that has the most compelling reasons to avoid a db -- at least
>> a traditional db. Some of these other ones that have brought up may be a
>> different story.
> 
> I have to admit that I thought of a database as well.  Keywords that
> especially triggered this were: index, aggregation, huge, speed.  You
> make it really sound as if we are talking about a huge beast of which
> we are getting just a tiny glimpse. :-)

Heh. No, it's not huge, certainly not by today's standards of huge.

The first version of this did everything by database, but we realized we 
were using little of the DB's strengths, and that in-memory indexes 
would be more efficient to both create and use. Most of our indexes 
aren't normal indexing of single fields, but rather combinations of 
pieces of various fields. Those ones are used to determine matches and 
probabilities of matches between two data sets.

During aggregation, we do a lot of work with just the indexes 
themselves, something that wasn't as effective when using a DB.

When we replaced the DB with random access text files for the data 
itself, and replaced all the other tasks with our own techniques, the 
system as whole was faster, and more flexible to code new ideas for 
matching and aggregation.

Now, whether something like redis or couchdb has value added compared to 
a traditional db because they behave more like what we've done, I don't 
know. Haven't looked into that yet, but it does look interesting.

Still, what we have works quite well, is entirely in Ruby, and requires 
no other resources except to connect to the application's db for the 
final data update. So there's some advantages to that simplicity as 
well.

-- gw

Posted by Joel VanderWerf (Guest)
on 2009-07-07 21:17
(Received via mailing list)
Ezra Zygmuntowicz wrote:
> Yeah sorry dtach is another piece of software to install, you can get it 
> form darwin ports or your package manager. But that is just a 
> convenience wrapper for starting redis, you can start redis yourself by 
> just checking it out and running the ./redis-server  command

Thanks, that works. Nice to see that little notes.rb sinatra app in
action, though it's a got a few bugs.
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.