Sorting question


#1

Hi,
I have a log in the following simple format:

-- START - DATE (YYYY-MM-DD HH:MM:SS.sss)



-- END - DATE (YYYY-MM-DD HH:MM:SS.sss)

-- START - DATE (YYYY-MM-DD HH:MM:SS.sss)



-- END - DATE (YYYY-MM-DD HH:MM:SS.sss)

-- START - DATE (YYYY-MM-DD HH:MM:SS.sss)



-- END - DATE (YYYY-MM-DD HH:MM:SS.sss)

-- START - DATE (YYYY-MM-DD HH:MM:SS.sss)



-- END - DATE (YYYY-MM-DD HH:MM:SS.sss)

What is the most efficient way to sort it in Ruby by START date & time?
Log size is generally smaller than 1 GB…
I was thinking of:

  1. loading the file into memory
  2. splitting into hash (date | xml)
  3. sorting key values
  4. output

Is there a better way to do this? I’m mostly concerned with memory
consumption…
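For reference, the four steps above might be sketched like this (a minimal sketch; the sample records and the `-- START - DATE (...)` regex are assumptions based on the format shown):

```ruby
# Minimal sketch of the four steps above; the sample records and the
# "-- START - DATE (...)" regex are assumptions based on the format shown.
log = <<~LOG
  -- START - DATE (2009-03-11 10:00:02.500)
  payload B
  -- END - DATE (2009-03-11 10:00:03.000)
  -- START - DATE (2009-03-11 10:00:01.250)
  payload A
  -- END - DATE (2009-03-11 10:00:02.000)
LOG

records = {}                      # step 2: date => section text
key = nil
log.each_line do |line|           # step 1 would use File.foreach instead
  if line =~ /^-- START - DATE \((.*)\)/
    key = $1                      # "YYYY-MM-DD HH:MM:SS.sss" sorts lexically
    records[key] = +""
  end
  records[key] << line if key
end

sorted = records.keys.sort.map { |k| records[k] }.join   # steps 3-4
puts sorted
```

Note that the ISO-style timestamp sorts correctly as a plain string, so there is no need to parse it first.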


#2

Miroslaw N. wrote:


Is there a better way to do this? I’m mostly concerned with memory
consumption…

Hi, Miroslaw,

That would require lots of RAM, but it should work (if it triggers
swapping, it will be slow).

I would read the file twice. The first time, use pos to note the byte
position of the start of each “-- START…” line, and record the date and
position in a tag record.

Then sort these tags into order. You might do this in RAM, or write the
data to a file, sort that with a utility and read it back again.

Finally, process the tags in order, using pos= to seek back to the
beginning of each section and copy it out to the results file.
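A sketch of those two passes (sample records assumed; IO#pos / IO#pos= used as described):

```ruby
require "tempfile"

# Sketch of the two-pass idea: pass 1 records (date, byte offset) tags,
# pass 2 sorts the tags and seeks back to copy each section out.
# The sample records are assumptions based on the format in the post.
file = Tempfile.new("log")
file.write(<<~LOG)
  -- START - DATE (2009-03-11 10:00:02.500)
  payload B
  -- END - DATE (2009-03-11 10:00:03.000)
  -- START - DATE (2009-03-11 10:00:01.250)
  payload A
  -- END - DATE (2009-03-11 10:00:02.000)
LOG
file.rewind

tags = []                          # [date, offset] per section
offset = file.pos
while (line = file.gets)
  tags << [$1, offset] if line =~ /^-- START - DATE \((.*)\)/
  offset = file.pos
end

output = +""
tags.sort.each do |_date, pos|
  file.pos = pos                   # seek back to the section start
  while (line = file.gets)
    output << line
    break if line.start_with?("-- END")
  end
end
puts output
```

Only the tag records live in memory; the section bodies stay on disk until they are copied out.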

A final thought: the entries are presumably written in ending-date
order by N processes. Therefore the oldest start must be one of the
last N. If you remove that, the next oldest will again be one of the
last N; continue by induction. So if you were to read the file in
reverse order, you could use a “heap” structure of size N to delay
writing the sections until you knew which was the oldest. This would
let you write an output file in reverse order of start time. Another
pass then reads this temp file, again in reverse, writing out in the
correct order.

You can’t work from the beginning of the file, because a long-running
process may be delayed more than N sections before it reports.

That would solve your problem with no sorting, holding only N sections
in memory. You would need to code your own readpreviousline routine.
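That readpreviousline routine might take a shape like this (a hypothetical helper, not anything from the standard library; it refills a buffer in fixed-size chunks from the end of the file):

```ruby
require "stringio"

# Hypothetical read-previous-line helper: keeps a buffer of the unreturned
# tail of the file and refills it in CHUNK-sized pieces from the end.
class BackwardLineReader
  CHUNK = 4096

  def initialize(io)
    @io = io
    @io.seek(0, IO::SEEK_END)
    @pos = @io.pos                 # bytes before @pos are still unread
    @buf = +""                     # tail not yet handed out as lines
  end

  # Returns the previous line (newline included), or nil at start of file.
  def read_previous_line
    loop do
      nl = @buf.rindex("\n", @buf.size - 2) if @buf.size > 1
      if nl
        line = @buf[(nl + 1)..-1]
        @buf = @buf[0..nl]
        return line
      end
      if @pos.zero?                # nothing left to refill from
        return nil if @buf.empty?
        line, @buf = @buf, +""
        return line
      end
      step = [CHUNK, @pos].min     # pull the next chunk from the end
      @pos -= step
      @io.seek(@pos)
      @buf = @io.read(step) + @buf
    end
  end
end

# Tiny demo on an in-memory "file":
reader = BackwardLineReader.new(StringIO.new("a\nb\nc\n"))
lines = []
while (l = reader.read_previous_line)
  lines << l
end
puts lines.inspect
```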

Regards

Ian


#3

Miroslaw N. wrote:

Is there a better way to do this? I’m mostly concerned with memory
consumption…

If this is the major concern, then you need to do an external sort: that
is, one which doesn’t load the whole file into memory.

If you’re on Unix, the ‘sort’ command will do this very nicely. All you
need to do is to reformat it suitably (one pass), send it to ‘sort’
(which will do all the necessary spooling and merging), and then
reformat it again (one pass).

I’d suggest using the ‘-z’ option to sort, so that each record is
null-terminated. The way your records are structured, they should
naturally sort in the correct order.

Also set ENV['LC_ALL'] = 'C' to be on the safe side; I’ve found strange
problems with sort when this isn’t done.
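Put together, that pipeline might be driven from Ruby along these lines (a sketch; GNU sort’s -z option and the sample records are assumptions):

```ruby
require "tempfile"

# Sketch of reformat -> sort -z -> reformat; GNU sort and the sample
# records are assumptions. Pass 1 inserts a NUL before each new section,
# sort -z orders the NUL-separated records, pass 2 strips the NULs.
ENV["LC_ALL"] = "C"               # avoid locale-dependent collation

input = Tempfile.new("log")
input.write(<<~LOG)
  -- START - DATE (2009-03-11 10:00:02.500)
  payload B
  -- END - DATE (2009-03-11 10:00:03.000)
  -- START - DATE (2009-03-11 10:00:01.250)
  payload A
  -- END - DATE (2009-03-11 10:00:02.000)
LOG
input.rewind

raw = IO.popen(["sort", "-z"], "r+") do |sort|
  writer = Thread.new do          # write from a thread to avoid deadlock
    first = true
    input.each_line do |line|
      sort.write("\0") if line.start_with?("-- START") && !first
      first = false
      sort.write(line)
    end
    sort.close_write
  end
  out = sort.read
  writer.join
  out
end

sorted = raw.delete("\0")         # pass 2: drop the record separators
puts sorted
```

sort does the spooling to temp files itself, so Ruby never holds more than one line at a time.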

If you want to implement an external sort in Ruby it’s an interesting
learning exercise, but is likely to be nowhere near as good as the Unix
one. You could consider implementing a merge sort: e.g. split your file
into 16 chunks, sort each one, then merge the 16 together one record at
a time.
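The merge phase of such an external sort might look like this in miniature (two tiny pre-sorted chunk files stand in for the 16; a linear scan for the minimum is used here, though a heap would scale better):

```ruby
require "tempfile"

# Miniature of the merge phase: N already-sorted chunk files merged one
# record at a time. The two tiny chunks here stand in for the 16.
chunks = [%w[a c e], %w[b d f]].map do |lines|
  t = Tempfile.new("chunk")
  t.puts(lines)
  t.rewind
  t
end

heads  = chunks.map(&:gets)       # current front record of each chunk
merged = +""
until heads.empty?
  i = heads.each_with_index.min_by { |rec, _| rec }.last
  merged << heads[i]
  heads[i] = chunks[i].gets       # advance that chunk
  if heads[i].nil?                # chunk exhausted: drop it
    chunks.delete_at(i)
    heads.delete_at(i)
  end
end
puts merged
```

Only one record per chunk is ever in memory, which is the point of the exercise.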


#4

2009/3/11 Miroslaw N. removed_email_address@domain.invalid:


Is there a better way to do this? I’m mostly concerned with memory
consumption…

You can at least optimize it like this:

  1. while reading the file, build up a structure (I’d prefer an Array)
    in memory

  2. sort it (using sort_by or sort!)

  3. output it

Sample attached. You could also add parsing of the timestamps, which
might save a bit of memory.
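The attachment doesn’t come through here, but the Array + sort_by approach (with the timestamp parsing mentioned) might look roughly like this (sample records assumed):

```ruby
require "time"

# Rough sketch of the Array + sort_by approach, with timestamps parsed;
# the sample records are assumptions based on the format in the post.
log = <<~LOG
  -- START - DATE (2009-03-11 10:00:02.500)
  payload B
  -- END - DATE (2009-03-11 10:00:03.000)
  -- START - DATE (2009-03-11 10:00:01.250)
  payload A
  -- END - DATE (2009-03-11 10:00:02.000)
LOG

records = []                      # [parsed Time, section text] pairs
log.each_line do |line|
  if line =~ /^-- START - DATE \((.*)\)/
    records << [Time.parse($1), +""]
  end
  records.last[1] << line unless records.empty?
end

sorted = records.sort_by(&:first).map(&:last).join
puts sorted
```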

Kind regards

robert