Transpose a large file (>2GB)

Hi -

What is the fast and ruby way to transpose a large file(>2GB)??

I can’t read the whole file into memory due to the file sizes…

On Apr 27, 2008, at 12:32 , Ams Lo wrote:

What is the fast and ruby way to transpose a large file(>2GB)??

I can’t read the whole file into memory due to the file sizes…

Interesting.

If the file is

12
3
456

You want

134
2 5
6

?

– fxn

Exactly… Sorry, I should have explained it better…

Xavier N. wrote:

On Apr 27, 2008, at 12:59 , Xavier N. wrote:

12
3
456

You want

134
2 5
6

Sorry, I don’t know why Mail.app deletes a single leading whitespace
when there’s one, last line should be

space space six

– fxn


Ams Lo [email protected] writes:

Exactly… Sorry, I should have explained it better…

Wrong. It should be below. Like this:

2 5
6

Exactly… Sorry, I should have explained it better…

Either have a 64-bit machine, and do it naively,

or have a vector of file positions pointing to the beginning of each
line O(file_size), and then write the character pointed to by these
file position (or a space if they reached the end of line) on a single
output line, and advance the file positions (unless they reach the end
of line). Repeat until all have reached end of line.
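As a sketch of that idea in Ruby (illustrative only; `transpose_by_offsets` is a made-up name, it assumes a plain text file with "\n" line endings, and the one-byte reads are exactly the disk-access cost the Note then raises):

```ruby
# Hypothetical sketch of the file-position-vector approach.
# Assumes a plain text file with "\n" line endings.
def transpose_by_offsets(in_path, out_path)
  # Pass 1 (O(file_size)): record the byte offset of each line start.
  offsets = []
  pos = 0
  File.foreach(in_path) { |line| offsets << pos; pos += line.bytesize }

  File.open(in_path, "rb") do |f|
    File.open(out_path, "wb") do |out|
      loop do
        row = ""
        offsets.each_with_index do |off, i|
          if off.nil?               # this input line already ended
            row << " "
            next
          end
          f.seek(off)
          ch = f.read(1)
          if ch.nil? || ch == "\n"  # end of this line (or of the file)
            offsets[i] = nil
            row << " "
          else
            row << ch
            offsets[i] = off + 1    # advance this line's file position
          end
        end
        break if offsets.compact.empty? && row.strip.empty?
        out.puts row.rstrip
      end
    end
  end
end
```

On the thread's example input (12 / 3 / 456) this writes 134, then 2 5, then a line with two leading spaces and a 6.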

Note: this may be bad on the disk, if the input file has a lot of
lines. In this case, you may want to implement a recursive
transposition algorithm:

to transpose a matrix of dimension (c, r) do
  if (c = 1) and (r = 1)
  then return the matrix itself
  else return the matrix made of:

    | transpose of submatrix [0…(c/2)[ , [0…(r/2)[    transpose of submatrix [0…(c/2)[ , [(r/2)…r[ |
    | transpose of submatrix [(c/2)…c[ , [0…(r/2)[    transpose of submatrix [(c/2)…c[ , [(r/2)…r[ |

that is, you cut the matrix in four, transpose each part
independently, and combine the four parts, exchanging the parts on the
anti-diagonal.

With this algorithm disk accesses should be more localized, so you may
get better speed on very big files with very big lines.
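An in-memory Ruby illustration of that recursion (hypothetical `transpose_rec`; a disk-based version would apply the same splitting to file regions rather than arrays):

```ruby
# Hypothetical in-memory illustration of the recursive quadrant scheme:
# cut the matrix in four, transpose each part independently, and combine
# with the parts on the anti-diagonal exchanged.
def transpose_rec(m)
  rows, cols = m.length, m.first.length
  return [m.map { |r| r[0] }] if cols == 1           # column -> single row
  return m[0].map { |x| [x] }  if rows == 1          # row -> single column
  r2, c2 = rows / 2, cols / 2
  a = transpose_rec(m[0...r2].map { |r| r[0...c2] }) # top-left
  b = transpose_rec(m[0...r2].map { |r| r[c2..-1] }) # top-right
  c = transpose_rec(m[r2..-1].map { |r| r[0...c2] }) # bottom-left
  d = transpose_rec(m[r2..-1].map { |r| r[c2..-1] }) # bottom-right
  # b and c (the anti-diagonal parts) trade places in the result
  a.zip(c).map { |l, r| l + r } + b.zip(d).map { |l, r| l + r }
end
```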


A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
----------> http://www.netmeister.org/news/learn2quote.html <-----------
----------> http://homepage.ntlworld.com/g.mccaughan/g/remarks/uquote.html <----------

Pascal B. http://www.informatimago.com/

Save pieces of it to disk.

On Sun, Apr 27, 2008 at 6:32 AM, Ams Lo [email protected] wrote:

Hi -

What is the fast and ruby way to transpose a large file(>2GB)??

fseek to the end of the file. fseek backwards by a nice round number
like 4k. Read in that exact number of bytes. Transpose it. Write
the transposed block to your output file. Then fseek back 2*4k.
Read, transpose, write. fseek back 3*4k. Etc. Rinse and repeat to
the beginning of the file.
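The seek-backwards loop can be written with Ruby's IO#seek; this sketch only shows the block-walking skeleton (a hypothetical `each_block_backwards` helper), leaving open the per-block transpose, which, as the follow-up notes, is the tricky part for line-oriented data:

```ruby
# Hypothetical helper: walk a file backwards in fixed-size blocks,
# yielding each block and its byte offset. What to do with each block
# (the actual transposing) is left open.
def each_block_backwards(path, block_size = 4 * 1024)
  File.open(path, "rb") do |f|
    pos = f.stat.size
    while pos > 0
      len = [block_size, pos].min   # the block at the file's start may be short
      pos -= len
      f.seek(pos, IO::SEEK_SET)
      yield f.read(len), pos
    end
  end
end
```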


Avdi

Home: http://avdi.org
Developer Blog: http://avdi.org/devblog/
Twitter: http://twitter.com/avdi
Journal: http://avdi.livejournal.com

On Apr 28, 2008, at 18:02 , Roger P. wrote:

Save pieces of it to disk.

Since seeking and inserting naively in text files is expensive I
wondered whether it could pay to use an intermediate specialized
software like DBM or whatever to build some structure from a line-
oriented reading of the input file, and then use it to output the
transposition in a line-oriented way.

Just off the top of my head without measuring, perhaps it is just much
worse.

On Mon, Apr 28, 2008 at 12:47 PM, Avdi G. [email protected] wrote:

fseek to the end of the file. fseek backwards by a nice round number
like 4k. Read in that exact number of bytes. Transpose it. Write
the transposed block to your output file. Then fseek back 2*4k.
Read, transpose, write. fseek back 3*4k. Etc. Rinse and repeat to
the beginning of the file.

Of course, if you’re transposing lines, that’s trickier.


Avdi

Home: http://avdi.org
Developer Blog: http://avdi.org/devblog/
Twitter: http://twitter.com/avdi
Journal: http://avdi.livejournal.com

On Apr 28, 2008, at 19:20 , Lionel B. wrote:

With this method you’ll always read/write sequential data on disk
when it’s possible with as large data chunks as possible without
hitting the swap.

In what sense is it sequential? These approaches require prepending/
appending rectangular chunks of text in the output file?

Pascal B. wrote:


With this algorithm disk accesses should be more localized, so you may
get better speed on very big files with very big lines.

Here’s an algorithm that should minimize the number of seeks.

Given that the file doesn’t fit in memory, you’ll have to fight against
the disk cache, as it won’t be effective (it will cache data that won’t
ever be reused), so you want to load as much useful data into memory at
once, and only useful data.

If your program can allocate an amount of memory, compute the number
of elements that fit in it divided by the matrix line count
(allocatable_memory / matrix_line_count). You probably want to
test that with actual Ruby code as it’s difficult to guess correctly.

Be creative in how you store data in memory to gain space: you should try
to store it pre-transposed in memory and already compacted, but it might
be difficult/slow to write (although disk seeks are slow enough that you
might be able to benefit from relatively heavy computations).

On each line, read the first elements (effectively reading the
first columns), then write those as the first lines of the transposed
matrix. Continue by columns.

With this method you’ll always read/write sequential data on disk when
it’s possible with as large data chunks as possible without hitting the
swap.

You can test the opposite method (reading lines, writing columns); it
may be faster on your hardware, but my gut feeling is that you should
avoid seeking on writes.
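A rough Ruby sketch of this column-band method (hypothetical `transpose_in_bands`; it assumes fixed-width lines and reads one band of columns per sequential pass over the input):

```ruby
# Hypothetical sketch of the column-band method. Each outer iteration
# makes one sequential pass over the input, grabs cols_per_band columns
# from every line, and appends that band (transposed) to the output.
# Assumes fixed-width lines of line_width characters.
def transpose_in_bands(in_path, out_path, line_width, cols_per_band)
  File.open(out_path, "wb") do |out|
    col = 0
    while col < line_width
      band = []
      File.foreach(in_path) { |line| band << line.chomp[col, cols_per_band].to_s }
      # band[i][j] holds row i, column col + j: emit the band transposed
      band.map(&:length).max.times do |j|
        out.puts band.map { |s| s[j] || " " }.join
      end
      col += cols_per_band
    end
  end
end
```

The band width is the tuning knob discussed above: wider bands mean fewer passes over the input but more memory.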

Lionel

Xavier N. wrote:

On Apr 28, 2008, at 19:20 , Lionel B. wrote:

With this method you’ll always read/write sequential data on disk
when it’s possible with as large data chunks as possible without
hitting the swap.

In what sense is it sequential?

It doesn’t read or write one byte at a time :-), which would be the
result of a naïve transpose algorithm using direct disk access.

These approaches require prepending/
appending rectangular chunks of text in the output file?

There’s no way around that if you can’t fit the whole matrix in memory;
the only thing you can do is optimize the rectangles to minimize the
number of chunks to read…

Lionel

On Sun, 27 Apr 2008 05:32:52 -0500, Ams Lo wrote:

Hi -

What is the fast and ruby way to transpose a large file(>2GB)??

I can’t read the whole file into memory due to the file sizes…

I assume all of the lines in the file are the same length, and the line
length is a multiple of the disk block size. Thus the file is laid out
like

000011112222
333344445555
666677778888
9999aaaabbbb
ccccddddeeee
ffffgggghhhh
iiiijjjjkkkk
llllmmmmnnnn
ooooppppqqqq
rrrrsssstttt
uuuuvvvvwwww
xxxxyyyyzzzz

(where the letters and numbers indicate the block number where the data
is stored). We’re going to transpose in chunks, indicated by separators,
so that each chunk has one full disk block per row:

0000|1111|2222
3333|4444|5555
6666|7777|8888
9999|aaaa|bbbb
----+----+----
cccc|dddd|eeee
ffff|gggg|hhhh
iiii|jjjj|kkkk
llll|mmmm|nnnn
----+----+----
oooo|pppp|qqqq
rrrr|ssss|tttt
uuuu|vvvv|wwww
xxxx|yyyy|zzzz

Basically, for each chunk x,y where (x > y):
*read in chunks x,y and y,x (here, that’s 8 disk blocks)
*transpose the data in chunk x,y in place in memory
*transpose the data in chunk y,x in place in memory
*write the transposed data from chunk x,y to chunk y,x
*write the transposed data from chunk y,x to chunk x,y

Now, for each chunk x,x (the diagonal):
*read in chunk x,x
*transpose the data in chunk x,x in memory
*write out the transposed data to chunk x,x

Thus, each disk block is read exactly once and written exactly once.
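For illustration, here is the chunk-swapping scheme on an in-memory square matrix (hypothetical `transpose_chunked!`; the real version would read and write the s-by-s chunks through seeks instead of indexing an array, and s would be chosen from the disk block size):

```ruby
# Hypothetical in-memory illustration of the chunk-swap scheme on a
# square n-by-n matrix split into s-by-s chunks (assumes s divides n).
# For chunks with x > y, every element swap moves data between chunk
# (x,y) and chunk (y,x); diagonal chunks are transposed in place.
def transpose_chunked!(m, s)
  n = m.length
  (0...n).step(s) do |x|
    (0..x).step(s) do |y|
      s.times do |i|
        s.times do |j|
          next if x == y && j >= i   # diagonal chunk: swap each pair once
          m[x + i][y + j], m[y + j][x + i] = m[y + j][x + i], m[x + i][y + j]
        end
      end
    end
  end
  m
end
```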

If your lines don’t align precisely to disk blocks then you get:

0000|0111|1122
2223|3333|4444
4555|5566|6667
7777|8888|8999
----+----+----
99aa|aaab|bbbb
cccc|cddd|ddee
eeef|ffff|gggg
ghhh|hhii|iiij
----+----+----
jjjj|kkkk|klll
llmm|mmmn|nnnn
oooo|oppp|ppqq
qqqr|rrrr|ssss

You could make a pass through the file to pad things first, but it looks
like you could choose appropriate chunk sizes so that you’d touch each
disk block at most twice anyway, and that would have the same
performance characteristics (ignoring disk seek time) as making a pass
through the file to pad it.

–Ken

On Tue, 29 Apr 2008 15:28:55 +0000, Ken B. wrote:

You could make a pass through the file to pad things first, but it looks
like you could choose appropriate chunk sizes so that you’d touch each
disk block at most twice anyway, and that would have the same
performance characteristics (ignoring disk seek time) as making a pass
through the file to pad it.

–Ken

Clarification: the appropriate chunk size would be
(#rows = #columns) > 0.5 * disk block size.

–Ken

Ken B. wrote:

On Sun, 27 Apr 2008 05:32:52 -0500, Ams Lo wrote:

Hi -

What is the fast and ruby way to transpose a large file(>2GB)??

I can’t read the whole file into memory due to the file sizes…

I assume all of the lines in the file are the same length, and the line
length is a multiple of the disk block size.

From the rest of the algorithm, you assume that the OP wants to
transpose in place, and indeed your algorithm is probably one of the
best in this case (and is more or less Pascal B.'s algorithm
tuned for disk access). I’m not sure this is what the OP wants (wasn’t
specified). He seemed memory constrained but didn’t report any disk
space constraint, so he may benefit from an algorithm writing another
file.

In short I wonder if it’s the best in the general case.

For a data point: a 20000 * 20000 64-bit integer matrix is >2GB but only
160kB wide, and splitting it in narrower chunks will take a huge toll on
the reading/writing speed compared to the huge sequential reads (or
writes, depending on the choices I left open) I proposed. The slow
writes (or reads) due to numerous seeks would even be faster than in
your algorithm, as you advise a block-size access which is only 512
bytes; with 2G of RAM you can write/read in
~ 2 * 10^9 / (20000 * 8) = 12500 bytes (12288 bytes is the nearest
multiple of 512), so it should be around 24 times faster (assuming
bandwidth inversely proportional to the number of seeks).

The problem with splitting the matrix into submatrices (though an
elegant solution) is that it doesn’t try to minimize disk seeks.
Rectangular portions as high or as wide as the matrix are better suited
to disk access, as they minimize the number of seeks in at least one of
the read/write parts.

Lionel

On Tue, 29 Apr 2008 18:13:23 +0200, Lionel B. wrote:

length is a multiple of the disk block size.

For a data point: a 20000 * 20000 64-bit integer matrix is >2GB but only
160kB wide, and splitting it in narrower chunks will take a huge toll on
the reading/writing speed compared to the huge sequential reads (or
writes, depending on the choices I left open) I proposed. The slow
writes (or reads) due to numerous seeks would even be faster than in
your algorithm, as you advise a block-size access which is only 512
bytes; with 2G of RAM you can write/read in
~ 2 * 10^9 / (20000 * 8) = 12500 bytes (12288 bytes is the nearest
multiple of 512), so it should be around 24 times faster (assuming
bandwidth inversely proportional to the number of seeks).

The block size I used was just for illustration. I chose the block size
and line size that I did so that my diagrams wouldn’t be overwhelmingly
tall. I also made no assumption about the data size. You should use your
disk’s block size – the size that is fetched anyway when you do a
read(). I don’t know of a portable way to get this number. On my ext3
filesystem, I can get it by running dumpe2fs, and it’s 4KB.
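For what it’s worth, Ruby can report this number via File::Stat#blksize (it may be nil on platforms that don’t provide it); like dumpe2fs, this gives the filesystem’s preferred I/O size rather than a guarantee about on-disk layout:

```ruby
# File::Stat#blksize reports the filesystem's preferred I/O block size
# (nil on platforms that don't provide it).
st = File.stat(__FILE__)
puts st.blksize   # e.g. 4096 on an ext3 filesystem
```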

Beyond the block size, you have no guarantee of contiguous blocks in the
file being contiguous on disk anyway, though ext2 and ext3 are pretty
good at avoiding fragmentation.

Your algorithm poses other interesting questions about seeks, reads and
writes. Let’s assume that whenever we allocate a new block it’s
contiguous with the last one we wrote in the file. Let’s read in the
original file contiguously, and write noncontiguously.

We need to read in enough lines to fill all of the blocks we are
outputting. With a 512k block size and 64-bit integers, that’s 4 lines.
Not too bad memory-wise, but keep this in mind, because if we read in
fewer lines we’ll be writing to the same disk block multiple times.

Now let’s start writing. We write a block at the beginning of line 0,
then we seek to the beginning of line 1 and write another block. Since
Linux allows sparse files, it doesn’t write anything in the blocks
between these two blocks. They’ll get filled in later when we write more
blocks there. The two blocks we just wrote are now next to each other on
disk.
Keep going, and you’ll find that both your reads and writes are
contiguous. On the other hand, if you try to transpose back, the file
that you just finished writing is the pathological worst case for seek
time.

Does this sound about right? In practice it may not be worth optimizing
seek time, since that’s very much the realm of the operating system.

–Ken

Ken B. wrote:

Now let’s start writing. We write a block at the beginning of line 0,
then we seek to the beginning of line 1 and write another block. Since
Linux allows sparse files it doesn’t write anything in the blocks
between these two blocks.

Yes

They’ll get filled in later when we write more blocks
there. The two blocks we just wrote are now next to each other on disk.

Not necessarily. Filesystems might protect themselves against these
access patterns to avoid fragmentation (ext2/3 already prevent several
processes accessing the disk at the same time from writing to the same
cylinder groups); in fact, if I understand the source code of ext2
correctly, this is one of the things indirect block allocation partially
prevents (i.e., non-contiguous parts are allocated in different parts of
the disk and the fs tries to put a part next to a contiguous one).
I guess you’d probably end up with a matrix split into large contiguous
blocks in various parts of one or perhaps more cylinder groups on disk.

Keep going, and you’ll find that both your reads and writes are
contiguous.

This probably isn’t true, and if you want to get down to the real meat,
you’ll have to evaluate the seek times and the number of seeks. On a
modern drive, a cylinder probably stores something on the order of
megabytes or tens of megabytes. This means that more than one line of
the matrix fits in a single cylinder: consecutive lines can be accessed
with no seek or a minimal (track-to-next-track) seek. The only penalty
is that you access a low percentage of what the disk could deliver in
one revolution.

On the other hand, if you access lines in non-contiguous parts of the
matrix with smaller segments (typical when you transpose by sub-matrix),
you diminish the amount of data used on each revolution and get more
seeks. If you are doing tens or hundreds of thousands of them, this
begins to take a huge toll. Indeed, reading 2G on a modern SATA disk
takes at most 40s (50+MB/s), so seeking can become the performance
bottleneck as soon as you get into the tens of thousands even with quick
seeks, and a 2G 64-bit integer matrix already has a number of lines in
this ballpark.

On the other hand, if you try to transpose back, the file
that you just finished writing is the pathological worst case for seek
time.

This is not obvious and at least not the case for ext2/3 (unless your
filesystem is already heavily fragmented). I believe the same allocation
heuristics are used in various modern fs to avoid file fragmentation.

Does this sound about right? In practice it may not be worth optimizing
seek time, since that’s very much the realm of the operating system.

Unfortunately, when you deal with data that can’t fit in cache, you
pretty much defeat any mechanism that the OS uses to optimize disk
access, so you can’t rely on it. From the number of lines in the matrix,
you can guess the order of the number of seeks needed, and you can guess
that algorithms not optimizing them will get far worse performance than
those that do.

Now what would be real fun is the OP telling us that he uses an SSD with
access times on the order of 0.1ms for storing these matrices :-) Then
I’d eat my hat and advise going with the easiest algorithm to code :-)

Lionel

On Wed, 30 Apr 2008 01:26:45 +0200, Lionel B. wrote:

blocks in various parts of one or perhaps more cylinder groups on disk.

In that case, none of the rest of what I said is relevant anymore.
Without reading into the guts of your filesystem, all you can do is
guess and benchmark.

Does this sound about right? In practice it may not be worth optimizing
seek time, since that’s very much the realm of the operating system.

Unfortunately when you deal with data that can’t fit in cache, you
pretty much defeat any mechanism that the OS uses to optimize disk
access so you can’t rely on it.

There’s still an I/O scheduler. My suggestion now might actually be to
use your algorithm, making sure to read in several lines in a single
read(2) call, so that the OS’s I/O scheduler can figure out the best way
to get the data. I’m not entirely sure about writes. Does the write
system call queue up the writes for the I/O scheduler and let the user
process keep going, or does it stop the process until the write
completes (for some definition of completes)?

My point was that you can’t rely on the data having any particular
organization with respect to seek times.

Now what would be real fun is the OP telling us that he uses an SSD with
access times on the order of 0.1ms for storing these matrices :-) Then
I’d eat my hat and advise going with the easiest algorithm to code :-)

Yep. More unstated assumptions about the best way to do things.
I hope he finds a solution that’s good enough.

–Ken
