Ruby 1.8.6 (sorry this one takes some setup to explain the context of the question) I'm using file seek to create equal fixed length rows in a disk file. The row data itself is not fixed length, but I am forcing the seek position to be at fixed intervals for the start of each row. So, on disk it might look something like this: aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000 bbbbbbbbbbbbbbX000000000000000000000 ccccccccccccccccccccccccccccccccccccccccX00 I'm really only writing the "aaaa..." and "bbbb...." portions of the rows with an EOL (an X here so it's easy to see). I have one operation step which uses tail to grab the last line of the file. When I do that, I get something like this: 000000000000000000000cccccccccccccccccccccccccccccccccccccccc which is the empty bytes past the EOL of the line "bbb..." plus the valid data of the following line. After some fiddling, it became apparent that I can test the value of the byte against zero to know if there's data in it or not -- if byte_data == 0 so that I can trim off that leading set of zeros, but I'm not certain those empty bytes will always be zero. And finally my question..... So, my question is about those zeros. Does advancing the seek position (do we still call that the "cursor"?) intentionally and proactively fill the unsed bytes with what apparently equates to zero? OR, am I just getting luck that my test files have used virgin disk space which yields zeros, and the seek position just skips bytes which potentially would contain garbage from previously used disk sectors? Can I count on those unused bytes always being zero? -- gw
on 2009-07-04 00:06
on 2009-07-04 00:08
Greg Willits wrote: > aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000000 > bbbbbbbbbbbbbbX000000000000000000000 > ccccccccccccccccccccccccccccccccccccccccX00 Argh. those should look equal length in non-proportional font. aaaaaaaaaaaaaaaaaaaaaaaaaX0000000000 bbbbbbbbbbbbbbX000000000000000000000 cccccccccccccccccccccccccccccccccX00 -- gw
on 2009-07-04 01:12
On 3 Jul 2009, at 23:06, Greg Willits wrote: > So, my question is about those zeros. Does advancing the seek position > (do we still call that the "cursor"?) intentionally and proactively > fill > the unsed bytes with what apparently equates to zero? OR, am I just > getting luck that my test files have used virgin disk space which > yields > zeros, and the seek position just skips bytes which potentially would > contain garbage from previously used disk sectors? > > Can I count on those unused bytes always being zero? Unfortunately you're getting lucky. A seek adjusts the file pointer but doesn't write anything to disk so whilst your 'unused' bytes won't be changing value as a result of writing data to the file unless you write the full record, you can't rely on them not having a value other than zero if you don't. Also you have to consider that zero may itself be a valid data value within a record :) Ellie Eleanor McHugh Games With Brains http://slides.games-with-brains.net ---- raise ArgumentError unless @reality.responds_to? :reason
on 2009-07-04 14:51
On 04.07.2009 01:11, Eleanor McHugh wrote: >> Can I count on those unused bytes always being zero? > > Unfortunately you're getting lucky. A seek adjusts the file pointer > but doesn't write anything to disk so whilst your 'unused' bytes won't > be changing value as a result of writing data to the file unless you > write the full record, you can't rely on them not having a value other > than zero if you don't. Actually it should not matter what those bytes are. Your record format should make sure that you exactly know how long an individual record is - as Ellie pointed out: > Also you have to consider that zero may itself be a valid data value > within a record :) Oh, the details. :-) Here's another one: if your filesystem supports sparse files and the holes are big enough (at least spanning more than one complete cluster or whatever the smallest allocation unit of the filesystem is called) those bytes might not really come from disk, in which case - when read - they are usually zeroed. But I guess this is also implementation dependent. One closing remark: using "tail" to look at a binary file is probably a bad idea in itself. "tail" makes certain assumptions about what a line is (it needs to in order to give you N last lines). Those assumptions are usually incorrect when using binary files. Kind regards robert
on 2009-07-04 18:53
>>Does advancing the seek position intentionally and proactively fill >>the unsed bytes with what apparently equates to zero? OR, am I just >>getting lucky? > you're getting lucky Thanks, I figured as much (just thought I'd see if Ruby was any different). I've gone ahead and filled empty positions using string.ljust. tail works fine. It's all text data with 10.chr EOLs, and yeah I would know whether a 0 is a valid piece of data or not based on the file format. Thanks.
on 2009-07-05 11:03
Eleanor McHugh wrote: >> Can I count on those unused bytes always being zero? > > Unfortunately you're getting lucky. A seek adjusts the file pointer > but doesn't write anything to disk so whilst your 'unused' bytes won't > be changing value as a result of writing data to the file unless you > write the full record, you can't rely on them not having a value other > than zero if you don't. I don't believe that's the case today. If it were, then you would have a very easy way to examine the contents of unused sectors on the disk - which would allow you to see other people's deleted files, passwords etc. It was possible on old mainframe systems in the 80's though :-) But today, if you extend a file using seek, you should always read zeros.
on 2009-07-05 19:08
Brian Candler wrote: > Eleanor McHugh wrote: >>> Can I count on those unused bytes always being zero? >> >> Unfortunately you're getting lucky. A seek adjusts the file pointer >> but doesn't write anything to disk so whilst your 'unused' bytes won't >> be changing value as a result of writing data to the file unless you >> write the full record, you can't rely on them not having a value other >> than zero if you don't. > > I don't believe that's the case today. If it were, then you would have a > very easy way to examine the contents of unused sectors on the disk - > which would allow you to see other people's deleted files, passwords > etc. > > It was possible on old mainframe systems in the 80's though :-) 80's micros too with BASIC :-P > But today, if you extend a file using seek, you should always read > zeros. That makes a great deal of sense, and would be consistent with what I was seeing. I was wondering why the values being returned were zeros instead of nil or something else. Either way, I know its a better practice to pack the rows, but I had a moment of laziness because I'm dealing will a couple million rows and figured if there was some processing time to be saved, I'd take advantage of it. I would have experimented, but I don't know how to ensure that the various file contents are in fact being written to the exact same disk space. -- gw
on 2009-07-05 20:04
On Jul 3, 2009, at 6:06 PM, Greg Willits wrote: > So, my question is about those zeros. Does advancing the seek position > (do we still call that the "cursor"?) intentionally and proactively > fill > the unsed bytes with what apparently equates to zero? I don't think anyone has answered this question directly but on POSIX- like file systems a seek past the end of the file and a subsequent write will cause the intervening bytes (which have never been written) to read as zeros. Whether those 'holes' occupy disk space or not is implementation dependent. Gary Wright
on 2009-07-05 20:13
Gary Wright wrote: > On Jul 3, 2009, at 6:06 PM, Greg Willits wrote: >> So, my question is about those zeros. Does advancing the seek position >> (do we still call that the "cursor"?) intentionally and proactively >> fill >> the unsed bytes with what apparently equates to zero? > > I don't think anyone has answered this question directly but on POSIX- > like file systems a seek past the end of the file and a subsequent > write will cause the intervening bytes (which have never been written) > to read as zeros. Whether those 'holes' occupy disk space or not is > implementation dependent. If in deed this is a fact (and it's consistent with my observation), then I'd say it's worth taking advantage of. I can't find a definitive reference to cite though (Pickaxe, The Ruby Way). -- gw
on 2009-07-05 21:59
Greg Willits wrote: >> I don't think anyone has answered this question directly but on POSIX- >> like file systems a seek past the end of the file and a subsequent >> write will cause the intervening bytes (which have never been written) >> to read as zeros. Whether those 'holes' occupy disk space or not is >> implementation dependent. > > > If in deed this is a fact (and it's consistent with my observation), > then I'd say it's worth taking advantage of. I can't find a definitive > reference to cite though (Pickaxe, The Ruby Way). Well, those aren't POSIX references. But from "Advanced Programming in the UNIX Environment" by the late great Richard Stevens, pub. Addison-Wesley, p53: "`lseek` only records the current file offset within the kernel - it does not cause any I/O to take place. This offset is then used by the next read or write operation. The file's offset can get greater than the file's current size, in which case the next `write` to the file will extend the file. This is referred to as creating a hole in a file and is allowed. Any bytes in a file that have not been written are read back as 0."
on 2009-07-05 22:58
Brian Candler wrote: > Greg Willits wrote: >>> I don't think anyone has answered this question directly but on POSIX- >>> like file systems a seek past the end of the file and a subsequent >>> write will cause the intervening bytes (which have never been written) >>> to read as zeros. Whether those 'holes' occupy disk space or not is >>> implementation dependent. >> >> >> If in deed this is a fact (and it's consistent with my observation), >> then I'd say it's worth taking advantage of. I can't find a definitive >> reference to cite though (Pickaxe, The Ruby Way). > > Well, those aren't POSIX references. But from "Advanced Programming in > the UNIX Environment" by the late great Richard Stevens, pub. > Addison-Wesley, p53: > > "`lseek` only records the current file offset within the kernel - it > does not cause any I/O to take place. This offset is then used by the > next read or write operation. > > The file's offset can get greater than the file's current size, in which > case the next `write` to the file will extend the file. This is referred > to as creating a hole in a file and is allowed. Any bytes in a file that > have not been written are read back as 0." I see, you guys are saying it's an OS-level detail, not a Ruby-specfic detail. It seems though that any hole in the file must be written to. Otherwise the file format itself must keep track of every byte that it has written to or not in order to have a write-nothing / read-as-zero capability. This would seem to be very inefficient overhead. Hmm... duh, I can bust out the hex editor and have a look. <pause> OK, well, empty bytes created by extending the filesize of a new file are 0.chr not an ASCII zero character (well, at least according to the hex editor app). That could simply be the absence of data from virgin disk space. I suppose, that absence of data could be interpreted however the app wants, so the hex editor says it is 0.chr and the POSIX code says it is 48.chr. Still though, since the file isn't being filled with the data that is provided by the read-back, that still confuses me. How does the read know to convert those particular NULL values into ASCII zeros vs a NULL byte I write on purpose? And it still doesn't really confirm what would happen when non-virgin disk space is being written to. Hrrmmm. :-\ Thanks for the discussion so far. -- gw
on 2009-07-05 23:30
Greg Willits wrote: > It seems though that any hole in the file must be written to. Otherwise > the file format itself must keep track of every byte that it has written > to or not in order to have a write-nothing / read-as-zero capability. Unless you seek over entire blocks, in which case the filesystem can create a "sparse" file with entirely missing blocks (i.e. the disk usage reported by du can be much less than the file size) When you read any of these blocks, you will see all zero bytes. > Hmm... duh, I can bust out the hex editor and have a look. > > <pause> > > OK, well, empty bytes created by extending the filesize of a new file > are 0.chr not an ASCII zero character (well, at least according to the > hex editor app). That could simply be the absence of data from virgin > disk space. I suppose, that absence of data could be interpreted however > the app wants, so the hex editor says it is 0.chr and the POSIX code > says it is 48.chr. No, POSIX says it is a zero byte (character \0, \x00, byte value 0, binary 00000000, ASCII NUL, however you want to think of it)
on 2009-07-06 00:12
Brian Candler wrote: >> It seems though that any hole in the file must be written to. Otherwise >> the file format itself must keep track of every byte that it has written >> to or not in order to have a write-nothing / read-as-zero capability. >Unless you seek over entire blocks, in which case the filesystem can >create a "sparse" file with entirely missing blocks (i.e. the disk usage >reported by du can be much less than the file size) >When you read any of these blocks, you will see all zero bytes. OK. But the file system doesn't keep track of aything smaller than the block, right? So, it's not keeping track of the misc individual holes created by each extension of the seek (?). > No, POSIX says it is a zero byte (character \0, \x00, byte value 0, > binary 00000000, ASCII NUL, however you want to think of it) Doh! My zeros are coming from a step in my process which includes converting this particular data chunk to integers which I was forgetting. And nil.to_i will generate a zero. So, my bad; that detail is cleared up. The only thing I'm still not real clear on is.... - file X gets written to disk block 999 -- the data is a stream of 200 contiguous "A" characters - file X gets deleted (which AFAIK only deletes the directory entry, and does not null-out the file data unless the OS has been told to do just that with a "secure delete" operation) - file Y gets written to disk block 999 -- the data has holes in it from extending the seek position Generally, I wouldn't read in the holes, but I have this one little step that does end up with some holes, and I know it. What I don't know is what to expect in those holes. Null values or, garbage "A' characters left over from file X. Logically I would expect garbage data, but the literal impact of paragraphs quoted earlier from the Unix book above indicates I should expect null values. I can't think of any tools I have that would enable me to test this. Because I don't know, I've gone ahead and packed the holes with a known character. However, if I can avoid that I want to because it sucks up some time I'd like to avoid in large files, but it's not super critical. At this point I'm more curious than anything. I appreciate the dialog. -- gw
on 2009-07-06 00:44
On Jul 5, 2009, at 6:13 PM, Greg Willits wrote: > > Generally, I wouldn't read in the holes, but I have this one little > step > that does end up with some holes, and I know it. What I don't know is > what to expect in those holes. Null values or, garbage "A' characters > left over from file X. You should expect null bytes (at least on Posix-like file systems). I'm not sure why you are doubting this. From the Open Group Base Specification description of lseek: <http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html>: > The lseek() function shall allow the file offset to be set beyond > the end of the existing data in the file. If data is later written > at this point, subsequent reads of data in the gap shall return > bytes with the value 0 until data is actually written into the gap. > Gary Wright
on 2009-07-06 01:36
Gary Wright wrote: > On Jul 5, 2009, at 6:13 PM, Greg Willits wrote: >> >> Generally, I wouldn't read in the holes, but I have this one little >> step >> that does end up with some holes, and I know it. What I don't know is >> what to expect in those holes. Null values or, garbage "A' characters >> left over from file X. > > You should expect null bytes (at least on Posix-like file systems). > I'm not sure why you are doubting this. I wasn't separating the read from the write. The spec talks about reading zeros but doesn't talk about writing them. I wasn't trusting that the nulls were getting written. I think I get it now that the read is what matters. Whether a null/zero got written, or whether the gaps are accounted for in some other way, is independent of the data the read returns. I still don't see where the nulls come from (if they're not being written), but if the rules allow me to expect nulls/zeros, and those gaps are being accounted for somewhere/somehow then that's what matters. -- gw
on 2009-07-06 07:54
Greg Willits wrote: > I still don't see where the nulls come from (if they're not being > written) All disk I/O is done in terms of whole blocks (typically 1K) Whenever the filesystem adds a new block to a file, insteading of reading the existing contents into the VFS cache it just zero-fills a block in the VFS cache. A write to an offset then updates that block and marks it 'dirty'. The entire block will then at some point get written back to disk, including of course any of the zeros which were not overwritten with user data.
on 2009-07-06 08:36
On 06.07.2009 00:13, Greg Willits wrote: > Generally, I wouldn't read in the holes, but I have this one little step > that does end up with some holes, and I know it. What I don't know is > what to expect in those holes. Null values or, garbage "A' characters > left over from file X. > > Logically I would expect garbage data, but the literal impact of > paragraphs quoted earlier from the Unix book above indicates I should > expect null values. I can't think of any tools I have that would enable > me to test this. I would not expect anything in those bytes for the simple reason that this reduces portability of your program. If anything the whole discussion has shown that apparently there are (or were) different approaches to handling this (including return of old data which should not happen any more nowadays). > Because I don't know, I've gone ahead and packed the holes with a known > character. However, if I can avoid that I want to because it sucks up > some time I'd like to avoid in large files, but it's not super critical. > > At this point I'm more curious than anything. I appreciate the dialog. I stick to the point I made earlier: if you need particular data to be present in the slack of your records you need to make sure it's there. Since your IO is done block wise and you probably aligned your offsets with block boundaries anyway there should not be a noticeable difference in IO. You probably need a bit more CPU time to generate that data but that's probably negligible in light of the disk IO overhead. If you want to save yourself that effort you should probably make sure that your record format allows for easy separation of the data and slack area. There are various well established practices, for example preceding the data area with a length indicator or terminating data with a special marker byte. My 0.02 EUR. Kind regards robert
on 2009-07-06 09:44
Brian Candler wrote: > Greg Willits wrote: >> I still don't see where the nulls come from (if they're not being >> written) > > All disk I/O is done in terms of whole blocks (typically 1K) > > Whenever the filesystem adds a new block to a file, insteading of > reading the existing contents into the VFS cache it just zero-fills a > block in the VFS cache. A write to an offset then updates that block and > marks it 'dirty'. The entire block will then at some point get written > back to disk, including of course any of the zeros which were not > overwritten with user data. Ah. That's what I was looking for. thanks.
on 2009-07-06 10:17
Robert Klemme wrote: > On 06.07.2009 00:13, Greg Willits wrote: > >> Generally, I wouldn't read in the holes, but I have this one little step >> that does end up with some holes, and I know it. What I don't know is >> what to expect in those holes. Null values or, garbage "A' characters >> left over from file X. >> >> Logically I would expect garbage data, but the literal impact of >> paragraphs quoted earlier from the Unix book above indicates I should >> expect null values. I can't think of any tools I have that would enable >> me to test this. > > I would not expect anything in those bytes for the simple reason that > this reduces portability of your program. Understood. In this case, I'm making a concious decision to go with whatever is faster. I've already written the code so that it is easy to add back in the packing if it's ever needed. We're working with large data sets for aggregation which takes a long time to run, and second only to the ease and clarity of the top level DSL, is the speed of the aggregation process itself so we can afford to do more analysis. >> Because I don't know, I've gone ahead and packed the holes with a known >> character. However, if I can avoid that I want to because it sucks up >> some time I'd like to avoid in large files, but it's not super critical. >> >> At this point I'm more curious than anything. I appreciate the dialog. > > should probably make sure > that your record format allows for easy separation of the data and slack > area. There are various well established practices, for example > preceding the data area with a length indicator or terminating data with > a special marker byte. Yep, already done that. Where this 'holes' business comes in, is that to stay below the 4GB limit, the data has to be processed and the file written out in chunks. Each chunk may have a unique line length. So, we find the longest line of the chunk, and write records at that interval using seek. Each record terminates with a line feed. Since we don't know the standard length of each chunk until processing is done (and the file has already een started), a set of the lengths is added to the end of the file instead of the beginning. When reading data, the fastest way to get the last line which has my line lengths, is to use tail. This returns a string starting from the last record's EOL marker to the EOF. This "line" has the potential (likelihood) to include the empty bytes of the last record in front of the actual I want because of how tail interprets "lines" between EOL markers. I need to strip those empty bytes from the start of the line before I get to the line lengths data. Every other aspect of the file uses the common approach of lines with #00 between fields and #10 at the end of the data, followed by zero or more fill characters to make each row an equal length of bytes. -- gw
on 2009-07-06 10:48
Greg Willits wrote: > Yep, already done that. Where this 'holes' business comes in, is that to > stay below the 4GB limit, the data has to be processed and the file > written out in chunks. Each chunk may have a unique line length. So, we > find the longest line of the chunk, and write records at that interval > using seek. Each record terminates with a line feed. To me, this approach smells. For example, it could have *really* bad disk usage if one record in your file is much larger than all the others. Is the reason for this fixed-space padding just so that you can jump directly to record number N in the file, by calculating its offset? If so, it sounds to me like what you really want is cdb: http://cr.yp.to/cdb.html You emit key/value records of the form +1,50:1->(50 byte record) +1,70:2->(70 byte record) +1,60:3->(60 byte record) ... +2,500:10->(500 byte record) ... etc then pipe it into cdbmake. The resulting file is built, in a single pass, with a hash index, allowing you to jump to record with key 'x' instantly. There's a nice and simple ruby-cdb library available, which wraps djb's cdb library. Of course, with cdb you're not limited to integers as the key to locate the records, nor do they have to be in sequence. Any unique key string will do - consider it like an on-disk frozen Hash. (The key doesn't have to be unique actually, but then when you search for key K you would ask for all records matching this key)
on 2009-07-06 11:17
2009/7/6 Greg Willits <lists@gregwillits.ws>: > Robert Klemme wrote: > We're working with large data sets for aggregation which takes a long > time to run, and second only to the ease and clarity of the top level > DSL, is the speed of the aggregation process itself so we can afford to > do more analysis. Did you actually measure significant differences in time or are you just assuming there is a significant impact because you write less and have to do less processing? >> a special marker byte. > > Yep, already done that. Where this 'holes' business comes in, is that to > stay below the 4GB limit, the data has to be processed and the file > written out in chunks. Each chunk may have a unique line length. So, we > find the longest line of the chunk, and write records at that interval > using seek. Each record terminates with a line feed. Errr, I am not sure I fully understand your approach. What you write sounds like if you end up with a file containing multiple sections which each have lines with identical length. So a file with two sections could look like this aaN0000 aaaN000 aN00000 aaaaaaN aaaaN00 bN00 bbbN bbN0 bbN0 Basically you are combining two approaches in one file: fixed length records and variable length records with termination marker. That sounds odd to me. If file size matters then I do not understand why you do not just write out the file like a regular text file, i.e. only use the line termination approach. > Since we don't know the standard length of each chunk until processing > is done (and the file has already een started), a set of the lengths is > added to the end of the file instead of the beginning. > > When reading data, the fastest way to get the last line which has my > line lengths, is to use tail. Why don't you open the file, seek to N bytes before the end and read them? You do not need tail for this and you also have all the file handling in your Ruby program. > Every other aspect of the file uses the common approach of lines with > #00 between fields and #10 at the end of the data, followed by zero or > more fill characters to make each row an equal length of bytes. It seems either I am missing something or you are doing something weird for which I do not understand the reason. Can you shade some more light on the nature of the processing and why you follow this approach? That would be a wonderful completion of the discussion. Kind regards robert
on 2009-07-06 12:00
Robert Klemme wrote: >> Yep, already done that. Where this 'holes' business comes in, is that to >> stay below the 4GB limit, the data has to be processed and the file >> written out in chunks. Each chunk may have a unique line length. So, we >> find the longest line of the chunk, and write records at that interval >> using seek. Each record terminates with a line feed. > > Errr, I am not sure I fully understand your approach. What you write > sounds like if you end up with a file containing multiple sections > which each have lines with identical length. My reading of the description was that each "chunk" was a separate file, and so all the records within that file had the same line length, but I could be wrong.
on 2009-07-06 12:55
2009/7/6 Brian Candler <b.candler@pobox.com>: > > My reading of the description was that each "chunk" was a separate file, > and so all the records within that file had the same line length, but I > could be wrong. But he writes "the file" which makes me think that there is only one output file. Also, the 4GB limit seems to indicate that we are talking about a single large file. Greg, help! ;-) Cheers robert
on 2009-07-06 13:03
Brian Candler wrote: > If so, it sounds to me like what you really want is cdb: > http://cr.yp.to/cdb.html > > You emit key/value records of the form > > +1,50:1->(50 byte record) > +1,70:2->(70 byte record) > +1,60:3->(60 byte record) > ... > +2,500:10->(500 byte record) > ... etc > > then pipe it into cdbmake. The resulting file is built, in a single > pass, with a hash index, allowing you to jump to record with key 'x' > instantly. I dug out some old CDB-building code; it's actually very simple with ruby-cdb and no piping is required. Here's some code which takes table-separated key-value pairs file and builds a cdb: def cdbmake(src, dest=src+".cdb", mode=0444, uid=0, gid=0) require 'cdb' File.open(src) do |f1| File.delete(file+".tmp") rescue nil CDBMake.open(src+".tmp", mode) do |f2| f1.each_line do |line| line.chomp! next if line =~ /^#/ or line =~ /^\s*$/ if line =~ /^([^\t]*)(\t(.*))?$/ f2.add($1, $3 || "") else puts "#{src}: Bad line #{line.inspect} ignored" end end end end File.rename(src+".tmp", dest) File.chown(uid, gid, dest) rescue File.delete(src+".tmp") rescue nil raise end This used http://raa.ruby-lang.org/project/cdb/ which unfortunately is not available as a gem, but it does bundle the cdb source for you. I see there's now a cdb gem. To install this you need the cdb development libraries already installed on your system. Under Ubuntu all I needed was: sudo apt-get install libcdb-dev sudo gem install cdb Strangely I found it installed all files mode 600. The fix: find /usr/lib/ruby/gems/1.8/gems/cdb-0.0.2 -type f | xargs sudo chmod 644 Anyway, the README shows quite clearly how to use it.
on 2009-07-06 14:41
Eeek. Opened a can of worms! It's 5:30 am where I am (I'm "still up" and not "up early"), so I'll hit the highlights, and detail more later if for some reason there's an interest. -- application: data aggregation -- I get data from county services and school districts. They have no way to correlate and aggregate their data. That's what I am doing. I match kids that are flagged in county services for whatever reason and have to match them up with school records, health records, judicial records where applicable, etc. -- I get dozens of CSV files of various subjects. Every school district's CSV file for any subject (attendance, grades, etc) has similar content but not identical. Ditto for demographics from various agencies. -- before I can even attempt aggregation, I need to normalize all this data so it can be indexed and analyzed, and I ensure the data is cleaned and standardized for display. -- there's a top layer DSL where I describe data sources and how to transform any one particular raw data file into a normalized data file. There another set of descriptions to allow any one field to come from a number of possible sources. So, something like birthcity might come from data source X, but if not available, check data source Y, etc. -- this process has been abstracted into a data aggregation core to do the normalization, the indexing, and other tasks, with an application-specific layer to handle the definition of transformations, indexes that are needed, order of precedence, and the stitching of data from various sources into records. -- So, this particular step I've been talking about is where a raw CSV file undergoes normalization by reorgnizing the fields of each record into a common structure for that given data topic, each field undergoes some scrubbing (character case, packing phones, normalizing date formats, translation of codes into strings, etc). -- raw data files range from a handful of columns to a couple dozen columns. From a few hundred rows to a couple million rows. -- data is an array of hashes -- by the time we get done normalizing a particular raw source, it can hit the 4GB memory limit any one ruby fork has available to play with (many forks run in parallel on multiple cores) -- while most CSV files work out just fine reading into RAM, transforming them 100% in RAM, and then writing in a simple single step, many files do not. -- so we have updated the process to load X records from the raw file, tranform X records in memory, then write X records to disk, and loop untill all records are done. -- and we have to deal with indexes, duplicates, and other issues in there as well -- imagine 2,000,000 raw records from one file which get processed in 200,000 record chunks, but output back to another single file. -- as I step through each chunk of 200,000 records, I can get the longest length of that 200,000, and I can store that, but I can't know what the longest length is for the next 200,000 that I haven't loaded yet. -- having processed and written the first 200,000 results to disk, and then determining the length of the second 200,000, I'm not going to go back and change the first 200,000 just to make them all the same. there's no value in that at all. -- So, when I get done with each 200,000 chunk, I now have a series of integers which tells me the length of the records in each chunk of 200,000 rows. -- the file has already been written, so again, I'm not going to go back and move everything to insert this data at the top (which is where I would put if indeed every record was the same length) -- so, I put this data at the end. -- BTW, the rows lengths are similiar enough that the disk efficiency is not an issue. -- I read this last line using tail, and I strip off the leading empty bytes (if any) as I described earlier -- it's a couple of very simple calculation to convert any "index" position into the exact byte seek position to find a specific record. -- from this point on, the records are read as random access from disk because otherwise I would need oodles of GB of data in RAM all at once during the aggregation process. -- is doing 200,000 or even 500,000 at a time in chunks really any faster than doing them one at a time -- that I actually don't know yet, I am just now finishing all the code that this "chunking" touches and ensure I get the same results I used to. the size of the chunks isn't as important for speed as it is for memory management -- making sure I stay within 4GB. -- as for the speed issues, we've done a lot of profiling, and even wrote a mini compiler to read our file tranformation DSL and output a stream of inline variable declarations and commands whichs gets included as a module on the fly for each data source. That trick saved us from parsing the DSL for each data row and literally shaved hours off the total processing time. We attacked many other levels of optimization while working to keep the code as readable as possible, because it's a complicated layering of abstractions and processes. -- I will look into cdb -- gw
on 2009-07-06 15:14
Greg Willits wrote: > Eeek. Opened a can of worms! ... > -- by the time we get done normalizing a particular raw source, it can > hit the 4GB memory limit any one ruby fork has available to play with > (many forks run in parallel on multiple cores) Ah, so this is a 4GB process limit not file limit. > -- imagine 2,000,000 raw records from one file which get processed in > 200,000 record chunks, but output back to another single file. > > -- as I step through each chunk of 200,000 records, I can get the > longest length of that 200,000, and I can store that, but I can't know > what the longest length is for the next 200,000 that I haven't loaded > yet. Standard "external processing", this is fine. > -- it's a couple of very simple calculation to convert any "index" > position into the exact byte seek position to find a specific record. OK, so you want your output file to have a property which CSV files don't, namely that you can jump to line N with a direct seek. > -- I will look into cdb Here's another worm for your can: CouchDB :-) Output each normalised CSV row as a JSON document, then it will build arbitrary indexes out of that, defined using Javascript (or Ruby or ...). Unfortunately, building the indexes isn't as fast as it should be yet, and I note you have a lot of hand-optimised code. But if you ever want to be able to search your big files by content of a field, rather than just jump to the N'th record, this might be your saviour. Maybe mongodb would be even faster, but I've not played with it yet. Regards, Brian.
on 2009-07-06 20:11
On 06.07.2009 14:41, Greg Willits wrote: > -- is doing 200,000 or even 500,000 at a time in chunks really any > faster than doing them one at a time -- that I actually don't know yet, > I am just now finishing all the code that this "chunking" touches and > ensure I get the same results I used to. the size of the chunks isn't as > important for speed as it is for memory management -- making sure I stay > within 4GB. From my experience doing one at a time is efficient enough. Of course you then cannot do the line length math. But, for that I'd probably consider doing something different: What about storing all file offsets in an Array and write it to a file "<orig>.idx" via Marshal. That way you do not need fill bytes, your files are smaller and you can process one line at a time. Reading a single entry is then easy: just slurp in the index in one go and seek to index[record_index]. robert@fussel ~ $ time ruby19 -e 'h=(1..1_000_000).map {|i| i << 6};File.open("x","wb") {|io| Marshal.dump(h,io)}' real 0m1.848s user 0m1.483s sys 0m0.249s robert@fussel ~ $ ls -l x -rw-r--r-- 1 robert Kein 5736837 Jul 6 20:01 x Apart from that this will reduce memory usage of individual processes and you might be able to better utilize your cores. Dumping via Marshal is pretty fast and the memory overhead of that single index array is not too big. Alternatively you can write the index file while you are writing the main data file. You just need to fix the number of bits you reserve for each file offset. Then the read operation can be done via two seek operations (first on the index, then on the data file) if you do not cache the index. > -- as for the speed issues, we've done a lot of profiling, and even > wrote a mini compiler to read our file tranformation DSL and output a > stream of inline variable declarations and commands whichs gets included > as a module on the fly for each data source. That trick saved us from > parsing the DSL for each data row and literally shaved hours off the > total processing time. We attacked many other levels of optimization > while working to keep the code as readable as possible, because it's a > complicated layering of abstractions and processes. Did you consider to make your dsl generate the code or is that what you are doing? Kind regards robert
on 2009-07-06 20:22
2009/7/6 Greg Willits <lists@gregwillits.ws>: > [...] > -- So, this particular step I've been talking about is where a raw CSV > file undergoes normalization by reorgnizing the fields of each record > into a common structure for that given data topic, each field undergoes > some scrubbing (character case, packing phones, normalizing date > formats, translation of codes into strings, etc). > [...] Hello, reading this I wonder why you don't dump the records into a database at this point. For example SQLite3 would be lightweight, fast and flexible. You could probably save a lot of effort not having to build index structures manually and you could query your dataset using SQL or the ORM of your choice afterwards... cu, Thomas
on 2009-07-06 21:47
> reading this I wonder why you don't dump the records into a database > at this point. For example SQLite3 would be lightweight, fast and > flexible. You could probably save a lot of effort not having to build > index structures manually and you could query your dataset using SQL > or the ORM of your choice afterwards... That would actually require a much larger rewrite than what I've done. It wasn't too big a leap to change the APIs from "here's all the records" to "here's some of the records." A few order of operations had to be juggled, but overall it's not a major logical leap, nor a major implementation leap. That probably goes for all the other DB options as well. At this point, the change from all-in-RAM to chunks-in-RAM has been a relatively straight forward retrofit, and based on previsou versions which did use a DB, I think this is still faster or as fast. Another option we will explore is 64-bit ruby (which I started another thread for a few days ago). >What about storing all file offsets >in an Array and write it to a file That's a possibility that would be an easy retrofit and probably w/o breaking the current interfaces. It would use a little more RAM, but would eliminate calculating that offset each time during reads. That's not a real pinch point, but worth an experiment just because it's a simpler concept to recognize. > Did you consider to make your dsl generate the code > or is that what you are doing? If I interpret the question correctly, that's what we're doing. I may be using "DSL" a little loosely here though. It's more like an "active config" language. I write setup files that are a mixture of k-v pairs for certain things, and expressions for other things. All of that gets processed at the start of every aggregation run which results in a number of Ruby Module files being created which get included dynamically. Many thanks for all the thoughts & feedback. -- gw
on 2009-07-06 22:00
On 06.07.2009 21:47, Greg Willits wrote: >> What about storing all file offsets >> in an Array and write it to a file > > That's a possibility that would be an easy retrofit and probably w/o > breaking the current interfaces. It would use a little more RAM, but > would eliminate calculating that offset each time during reads. That's > not a real pinch point, but worth an experiment just because it's a > simpler concept to recognize. I'd be interested to learn the outcome of that exercise. >> Did you consider to make your dsl generate the code >> or is that what you are doing? > > If I interpret the question correctly, that's what we're doing. I may be > using "DSL" a little loosely here though. It's more like an "active > config" language. I write setup files that are a mixture of k-v pairs > for certain things, and expressions for other things. All of that gets > processed at the start of every aggregation run which results in a > number of Ruby Module files being created which get included > dynamically. Only that you do not need to write this into files but you can do that in memory with eval. Anyway, that's just a detail. :-) > Many thanks for all the thoughts & feedback. You're welcome! Kind regards robert
on 2009-07-07 00:08
Robert Klemme wrote: > On 06.07.2009 21:47, Greg Willits wrote: > >>> What about storing all file offsets >>> in an Array and write it to a file >> >> That's a possibility that would be an easy retrofit and probably w/o >> breaking the current interfaces. It would use a little more RAM, but >> would eliminate calculating that offset each time during reads. That's >> not a real pinch point, but worth an experiment just because it's a >> simpler concept to recognize. > > I'd be interested to learn the outcome of that exercise. In my original system, I used the row lengths and the number of rows per "set" to pre-calculate a lookup of offsets into the start of each set. Upon recieving a request for record 623,456 I first had to determine which set that would be in (4th set based on 200,000 per set). So I'd use a floor calc to determine that. Then subtract that factor of rows from the 623,456, then use the basic row_length multiplier. So, there were a few calcs involved. It looked like this: set_indx = (row_number.to_f / @set_size.to_f).floor set_sub_indx = row_number - (@set_size * set_indx) record_start = @set_offsets[set_indx] + (set_sub_indx * @record_lengths[set_indx]) By saving the row-specific offsets as an array, that is simplified to : record_start = @record_offsets[row_number] This turned out to make the total process of fetching rows 15% faster. (roughly 32 seconds vs 38 seconds for reading 1.4 million rows on my dev system). I already index the data itself using hashes (which work very, very fast) for aggregation lookups, so this concept is quite parallel to other ways the code works, and is a worthwhile change to make -- so thanks for that idea. -- gw
on 2009-07-07 00:44
On Jul 6, 2009, at 5:08 PM, Greg Willits wrote: > In my original system, I used the row lengths and the number of rows > per > "set" to pre-calculate a lookup of offsets into the start of each set. > Upon recieving a request for record 623,456 I first had to determine > which set that would be in (4th set based on 200,000 per set). > By saving the row-specific offsets as an array, that is simplified > to : > > record_start = @record_offsets[row_number] > > This turned out to make the total process of fetching rows 15% faster. > (roughly 32 seconds vs 38 seconds for reading 1.4 million rows on my > dev > system). > I already index the data itself using hashes (which work very, very > fast) for aggregation lookups, so this concept is quite parallel to > other ways the code works, and is a worthwhile change to make -- so > thanks for that idea. The more I read in this thread, the more I feel it's crying out for a database. I recognize that you disagree with this, I just disagree with your disagreement. In a totally friendly way, of course. :) Originally I too was thinking of the already suggested SQLite, but now I'm thinking that not right. My current feeling is that Redis would be a terrific fit: http://code.google.com/p/redis/ That's just my two bits. I promise to let the database argument lie now. James Edward Gray II
on 2009-07-07 01:02
James Gray wrote:
> http://code.google.com/p/redis/
Does this install as a gem for you? I get this...
$ gem install redis
Successfully installed redis-0.0.1
1 gem installed
Installing ri documentation for redis-0.0.1...
...
and then rdoc fails with:
invalid argument: --format=html
and the gem dir redis-0.0.1 is empty, and
Even disabling rdoc:
$ gem install redis --no-rdoc --no-ri
Successfully installed redis-0.0.1
1 gem installed
results in an empty gem dir.
on 2009-07-07 01:08
On Jul 6, 2009, at 4:01 PM, Joel VanderWerf wrote: > ... > 1 gem installed > > results in an empty gem dir. > > -- > vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407 > Get redis from github as far as I know there is no redis gem on rubyforge yet(i wrote the ruby client) git clone git://github.com/ezmobius/redis-rb.git cd redis-rb rake gem sudo gem install pkg/redis***.gem rake redis:install rake redis:start Cheers- Ezra Zygmuntowicz ez@engineyard.com
on 2009-07-07 01:18
Ezra Zygmuntowicz wrote: > rake redis:install > rake redis:start Ok until this point: $ rake redis:start (in /home/vjoel/ruby/src/db/redis-rb) Detach with Ctrl+\ Re-attach with rake redis:attach rake aborted! No such file or directory - dtach -A /tmp/redis.dtach redis-server /etc/redis.conf Is dtach something I need to install separately? (kubuntu 8.04 here)
on 2009-07-07 01:25
On Jul 6, 2009, at 4:17 PM, Joel VanderWerf wrote: > Ok until this point: > > $ rake redis:start > (in /home/vjoel/ruby/src/db/redis-rb) > Detach with Ctrl+\ Re-attach with rake redis:attach > rake aborted! > No such file or directory - dtach -A /tmp/redis.dtach redis-server / > etc/redis.conf > > Is dtach something I need to install separately? (kubuntu 8.04 here) Yeah sorry dtach is another piece of software to install, you can get it form darwin ports or your package manager. But that is just a convenience wrapper for starting redis, you can start redis yourself by just checking it out and running the ./redis-server command Cheers- Ezra Zygmuntowicz ez@engineyard.com
on 2009-07-07 01:51
James Gray wrote: > The more I read in this thread, the more I feel it's crying out for a > database. I recognize that you disagree with this, I just disagree > with your disagreement. In a totally friendly way, of course. :) I think if you saw the whole thing you'd see why the db is more of a hinderance than an assett. I've described a small piece of it, and not the piece that has the most compelling reasons to avoid a db -- at least a traditional db. Some of these other ones that have brought up may be a different story. -- gw
on 2009-07-07 09:05
2009/7/7 Greg Willits <lists@gregwillits.ws>: > James Gray wrote: >> The more I read in this thread, the more I feel it's crying out for a >> database. I recognize that you disagree with this, I just disagree >> with your disagreement. In a totally friendly way, of course. :) > > I think if you saw the whole thing you'd see why the db is more of a > hinderance than an assett. I've described a small piece of it, and not > the piece that has the most compelling reasons to avoid a db -- at least > a traditional db. Some of these other ones that have brought up may be a > different story. I have to admit that I thought of a database as well. Keywords that especially triggered this were: index, aggregation, huge, speed. You make it really sound as if we are talking about a huge beast of which we are getting just a tiny glimpse. :-) Kind regards robert
on 2009-07-07 19:10
Robert Klemme wrote: > 2009/7/7 Greg Willits <lists@gregwillits.ws>: >> James Gray wrote: >>> The more I read in this thread, the more I feel it's crying out for a >>> database >> >> I think if you saw the whole thing you'd see why the db is more of a >> hinderance than an assett. I've described a small piece of it, and not >> the piece that has the most compelling reasons to avoid a db -- at least >> a traditional db. Some of these other ones that have brought up may be a >> different story. > > I have to admit that I thought of a database as well. Keywords that > especially triggered this were: index, aggregation, huge, speed. You > make it really sound as if we are talking about a huge beast of which > we are getting just a tiny glimpse. :-) Heh. No, it's not huge, certainly not by today's standards of huge. The first version of this did everything by database, but we realized we were using little of the DB's strengths, and that in-memory indexes would be more efficient to both create and use. Most of our indexes aren't normal indexing of single fields, but rather combinations of pieces of various fields. Those ones are used to determine matches and probabilities of matches between two data sets. During aggregation, we do a lot of work with just the indexes themselves, something that wasn't as effective when using a DB. When we replaced the DB with random access text files for the data itself, and replaced all the other tasks with our own techniques, the system as whole was faster, and more flexible to code new ideas for matching and aggregation. Now, whether something like redis or couchdb has value added compared to a traditional db because they behave more like what we've done, I don't know. Haven't looked into that yet, but it does look interesting. Still, what we have works quite well, is entirely in Ruby, and requires no other resources except to connect to the application's db for the final data update. So there's some advantages to that simplicity as well. -- gw
on 2009-07-07 21:17
Ezra Zygmuntowicz wrote: > Yeah sorry dtach is another piece of software to install, you can get it > form darwin ports or your package manager. But that is just a > convenience wrapper for starting redis, you can start redis yourself by > just checking it out and running the ./redis-server command Thanks, that works. Nice to see that little notes.rb sinatra app in action, though it's a got a few bugs.
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.