General Approach to Data Validation

dfg59 · February 8, 2007, 3:16pm

Hey all-

I posted some questions yesterday about getting ActiveRecord working
with Oracle 10g, and I’m happy to say I was successful. ActiveRecord
makes pulling and manipulating data so easy. Now, I’m writing some data
validation scripts. Here’s the goal: given two data sources, confirm
that specific fields in the first table are the same as the fields in
the 2nd for the same record, check to see if there are records in the
1st table that aren’t in the 2nd, and vise-versa. I was wondering if my
general approach is the fastest way to do this, as I’m not new to ruby
but generally new to writing high performance sql (and thus the
corresponding ActiveRecord statements). I’m working with very large
datasets here and want to streamline the code as much as possible.
Here’s my general strategy (assume the objects are ActiveRecord
objects), advice is much appreciated:

Table1.find(:all).each do |rec1|
rec2 =
Table2.find(:first,:conditions=>{:key1=>rec1.key1,:key2=>rec1.key2})
if rec2
if rec1.field1 != rec2.field2
#throw value mismatch here
end
else
#throw missing record error here
end
end

Table2.find(:all).each do |rec2|
rec1 =
Table1.find(:first,:conditions=>{:key1=>rec2.key1,:key2=>rec2.key2})
if !rec1
#throw missing record error here
end
end

dfg59 · February 8, 2007, 3:42pm

Drew O. schrieb:

datasets here and want to streamline the code as much as possible.
Drew, I would try to perform the comparison inside the database,
especially if your datasets are “very large”. (How large is this
actually?) Given that you are working with Oracle, I would use Oracle’s
database links to get access to both tables.

Regards,
Pit

dfg59 · February 8, 2007, 4:11pm

Pit C. wrote:

Drew O. schrieb:

datasets here and want to streamline the code as much as possible.
Drew, I would try to perform the comparison inside the database,
especially if your datasets are “very large”. (How large is this
actually?) Given that you are working with Oracle, I would use Oracle’s
database links to get access to both tables.

Regards,
Pit

Pit -

This does make sense, but I’m using ruby in this situation for a reason.
I had previously written some validation scripts that ran on .csv dumps
from the database and I leveraged this previous work to get these
scripts up and running quickly by introducing ActiveRecord. Also, I’m
writing an error report to .csv using FasterCSV and doing quite a bit of
data manipulation during these compares. In short, I’d really like to
continue using ruby/ActiveRecord here. However, I want to make sure that
the way I’m going about it is as “efficiently as possible”. It’s not a
huge deal, however if I’m make some massive error that would save
50% when running my scripts, it would be nice to change them.

As far as record size, we’re talking close to 1 million records, more in
some cases.

-Drew

dfg59 · February 8, 2007, 4:51pm

On Fri, Feb 09, 2007 at 12:11:10AM +0900, Drew O. wrote:

huge deal, however if I’m make some massive error that would save
50% when running my scripts, it would be nice to change them.

As far as record size, we’re talking close to 1 million records, more in
some cases.

I’d suggest:

sort both the datasets by the primary key (or SELECT … ORDER BY)
run through both lists with a pointer to each. This lets you
efficiently
find records which are in A but not in B, and vice versa. This is a
linear-time operation.
then, for records which are in both A and B, you can compare them
attribute-by-attribute.

You can do the whole lot externally: write out both datasets to CSV
files,
sort them (using the Unix ‘sort’ command), and then open both files and
read
in a line at a time. Your ‘pointer’ is then just your current position
in
the file, and this avoids having to read the whole datasets into memory.
A
million records should be quite doable.

If you just want to find rows which exist in one but not the other, the
Unix
‘join’ command may do what you need. This also needs the files to be
pre-sorted.

Doing the search in SQL will probably be more efficient, but the query
is a
pain to write. Something like

SELECT table1.key
FROM table1 LEFT JOIN table2
ON table1.key = table2.key
WHERE table2.key IS NULL;

will find rows which exist in table A but not in table B. Turn it around
to
find rows in B but not A. Do a third query which is a normal join to
compare
all the fields of corresponding rows. There may be a better way that
this,
but my SQL foo doesn’t extend that far.

HTH,

Brian.

dfg59 · February 8, 2007, 5:27pm

Drew O. schrieb:

As far as record size, we’re talking close to 1 million records, more in
some cases.

Drew, I don’t think doing it on the client is the right tool for this
job. But if you really want to, I’d try one of the following approaches.
Note that I’ve never used ActiveRecord before, so I don’t know whether
you actually can do this.

One way would be to read all the records at once and then use Ruby to
compare the two datasets. This requires lots of RAM, and it doesn’t
scale well if you’ll get more records. You could try to split the
datasets into smaller disjoint parts and then compare only those parts
in order to reduce the memory needed.

The other way would be to read the records one after the other (using
cursors in db terminology) in an appropriate sort order. If ActiveRecord
allows you to do this in parallel with both tables, you could compare
the tables like this:

table1 = open_cursor_for “table1”
table2 = open_cursor_for “table2”

record1 = table1.next_record
record2 = table2.next_record

until record1.nil? and record2.nil?
case relevant_fields(record1) <=> relevant_fields(record2)
when -1
puts “#{record1} is missing in table2”
record1 = table1.next_record
when +1
puts “#{record2} is missing in table1”
record2 = table2.next_record
else
puts “#{record1} exists in both tables”
record1 = table1.next_record
record2 = table2.next_record
end
end

table1.close
table2.close

(I see that Brian suggested the same algorithm.)

The code you have shown in your first post performs roughly 2 million
database queries. Try this with 100, 1000, 10000 queries, and then
estimate how long it would take for the real job. If this is no problem
for you, your code should be fine.

Regards,
Pit

dfg59 · February 8, 2007, 5:51pm

Pit C. wrote:

The other way would be to read the records one after the other (using
cursors in db terminology) in an appropriate sort order. If ActiveRecord
allows you to do this in parallel with both tables, you could compare
the tables like this:

This is the exact approach I originally wanted to take. I would be very
interested to know if ActiveRecord supports this type of operation. I
have yet to see a way to do it from what I’ve read of the documentation.

-Drew

dfg59 · February 8, 2007, 8:47pm

On Feb 8, 2007, at 11:51 AM, Drew O. wrote:

have yet to see a way to do it from what I’ve read of the
documentation.

A while back (last year?) I asked why ActiveRecord didn’t support
‘lazy’ iteration over query results. AR slurps up the entire
result set into memory and then wraps objects (i.e. instances of
ActiveRecord::Base or subclasses) around each row.

This approach obviously has problems if you have large data sets.

Perhaps this has changed since I last posed the question?

Gary W.

dfg59 · February 8, 2007, 5:32pm

Brian C. wrote:

I’d suggest:

sort both the datasets by the primary key (or SELECT … ORDER BY)

run through both lists with a pointer to each. This lets you
efficiently
find records which are in A but not in B, and vice versa. This is a
linear-time operation.

then, for records which are in both A and B, you can compare them
attribute-by-attribute.

I think you’ve hit the nail on the head with this. It will drastically
reduce the number of queries I make. I’m coding it up now and taking a
look at the performace difference.

Thanks for your help,
Drew

dfg59 · February 9, 2007, 7:31am

On Fri, 9 Feb 2007, Ezra Z. wrote:

But I do think there is an ActiveRecordExtensions[1] project that
fixes some of these issues as well.

Cheers-
– Ezra Z.-- Lead Rails Evangelist
– [email protected]
– Engine Y., Serious Rails Hosting
– (866) 518-YARD (9273)

[1] http://rubyforge.org/projects/arext/

hey ezra-

fyi, nearly all the database apis support

db.execute(sql) do |tuple|
p tuple
end

which does the obvious - only one tuple is in memory at a time. i sent
in a
patch for rails to use any given block as meaning: construct the AR
object and
yield it, one at the time which largely consited of a bunch of '&b’s but
it
there was no interest at the time. i’m shocked that this is still an
issue in
the rails core. haven’t people been hitting this as rails is used on
bigger
projects with bigger databases?

have used any of the patches/extensions?

ps. wore my engine yard shirt today!

regards.

-a

dfg59 · February 9, 2007, 8:19am

Hey Ara-

On Feb 8, 2007, at 10:31 PM, [email protected] wrote:

a cursor but doesn’t use a cursor, it does limits and offsets.
[1] http://rubyforge.org/projects/arext/
which does the obvious - only one tuple is in memory at a time. i

have used any of the patches/extensions?

I’ve used the AR extensions project, it adds a lot of the stuff you
need to work with large data sets. Hosting lots of rails apps the
error i see most often is either the ActiveRecord n+1 query problem
where each page results in hundreds of sql queries. Or its the too
many :include joins that end up pulling hundreds or thousands of
objects into memory at once, killing the mongrels the app runs in.

ps. wore my engine yard shirt today!

Awesome!

regards.

-a

we can deny everything, except that we have the possibility of
being better.
simply reflect on that.

the dalai lama

Cheers-
– Ezra Z.
– Lead Rails Evangelist
– [email protected]
– Engine Y., Serious Rails Hosting
– (866) 518-YARD (9273)

dfg59 · February 9, 2007, 6:32am

Hi~

On Feb 8, 2007, at 11:46 AM, Gary W. wrote:

the tables like this:
ActiveRecord::Base or subclasses) around each row.

This approach obviously has problems if you have large data sets.

Perhaps this has changed since I last posed the question?

Gary W.

ActiveRecord will start to consume a huge amount of memory and cpu
if you fetch more then a thousand or so records at a time. This is
because it does load all results into memory at once and then wraps
each row in a AR object which is expensive with the amount of records
being talked about in this thread. There is an AR plugin somewhere
called paginating find that works sort of like a cursor but doesn’t
use a cursor, it does limits and offsets.

But I do think there is an ActiveRecordExtensions[1] project that
fixes some of these issues as well.

Cheers-
– Ezra Z.
– Lead Rails Evangelist
– [email protected]
– Engine Y., Serious Rails Hosting
– (866) 518-YARD (9273)

[1] http://rubyforge.org/projects/arext/

dfg59 · February 9, 2007, 4:59pm

Ezra Z. wrote:

I’ve used the AR extensions project, it adds a lot of the stuff you
need to work with large data sets. Hosting lots of rails apps the
error i see most often is either the ActiveRecord n+1 query problem
where each page results in hundreds of sql queries. Or its the too
many :include joins that end up pulling hundreds or thousands of
objects into memory at once, killing the mongrels the app runs in.

Ezra -

Is it necessary to use the AR extensions project in conjunction with
rails or can I use it as described above, when I’m using ActiveRecord
without using the entire Rails framework? If so, do I install the
extension as you would a gem?

I must say, I was also amazed that there was no way to iterate across
large datasets without pulling every record into memory.

Thanks to everyone for the quick, informative responses to this issue.

-Drew