Q: Architecting large Web service download app in Ruby

cwd · April 24, 2009, 7:45pm

Architecting large Web service downloads in Ruby.

I’ve run up against a performance bottleneck (that I could have
predicted), but fear my design is causing more grief. Here’s an outline:

Background

My app has to download from 1,000 to 20,000 rows from a third-party
Web service (over HTTP, using xml-rpc, which uses net/http). No row
has any indicator when it was last updated, so local caches are
difficult if not impossible to reliably maintain. Everything has to be
considered “dirty”. These rows can be downloaded in batches of up to
100, so it’s on the order of 90 - 120 seconds over a quick net
connection to grab them and insert them into the primary table
synchronously.

The rub is that each row has a single detail row that is quite a bit
bulkier. Each of the master rows has an Active flag, and at any given
time between 50 and 80% of them are active. Iterating all the active
rows and populating the detail rows with individual Web service calls
takes on the order of 45-85 minutes, which is the real performance
problem. The data is usable without the detail information, but
minimally so.

The Question

Assuming we can’t improve the request/response rate of the Web service
calls or the granularity of the return data, is there a way to
implement some parallelism? I took a hack at it by creating a thread
pool, keeping up to 50 alive at once. Obviously tunable. However, the
problem is that every so often, a response is pretty darn garbled. Is
there a thread-safety issue in net/http that is causing results to be
stepped on? If that’s the case, does a different approach suggest
itself? Or is it a “you’re screwed, be patient” situation?

Thanks for reading, and HAHAHAHAHAHAHA is a perfectly acceptable
answer

–steve

cwd · April 24, 2009, 8:17pm

On Apr 24, 2009, at 12:44 PM, s.ross wrote:

Is there a thread-safety issue in net/http that is causing results
to be stepped on?

I can’t say I know for sure, but I’m doubting it.

If that’s the case, does a different approach suggest itself?

Well, you can fork() processes instead of Threads, if you’re not on
Windows. This would eliminate Thread safety concerns. You may have
to work out some IPC issues though if each process can’t work totally
on its own.
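
A rough sketch of what the fork() route might look like, with each child handing its results back over a pipe. Everything named here (fetch_detail, fetch_in_processes, the shape of the result hash) is made up for illustration; the real Web service call would go where fetch_detail is.

```ruby
# Sketch only: fetch_detail is a hypothetical stand-in for the real
# Web service call; it is not part of any library.
def fetch_detail(id)
  { id: id, detail: "detail for row #{id}" }
end

# Split the ids across up to `workers` child processes. Each child
# sends its results back through its own pipe as a Marshal dump, so no
# memory is shared between workers.
def fetch_in_processes(ids, workers: 4)
  return [] if ids.empty?

  slice_size = (ids.size / workers.to_f).ceil
  children = ids.each_slice(slice_size).map do |slice|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.binmode
      writer.write(Marshal.dump(slice.map { |id| fetch_detail(id) }))
      writer.close
      exit!(0) # leave the child without running the parent's at_exit hooks
    end
    writer.close
    [pid, reader]
  end

  children.flat_map do |pid, reader|
    reader.binmode
    data = reader.read # read to EOF before waiting on the child
    reader.close
    Process.wait(pid)
    Marshal.load(data)
  end
end

results = fetch_in_processes((1..10).to_a, workers: 3)
```

The pipe is about the simplest IPC available; if the children need to talk to a database or to each other, something heavier is probably called for.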

James Edward G. II

cwd · April 24, 2009, 10:41pm

On Apr 24, 2009, at 11:14 AM, James G. wrote:

Windows. This would eliminate Thread safety concerns. You may have
to work out some IPC issues though if each process can’t work
totally on its own.

James Edward G. II

Ok, I looked into it further, and there is a thread safety issue and
it’s in the xml-rpc library. If you use the call method in a thread,
you can have the response buffer overwritten by a response in another
thread. However, if you use call_async, then a new server connection
and response buffer are used, thereby rendering it thread safe. And the
performance win is astonishing!
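
For anyone following along, here is a minimal sketch of the kind of worker pool I mean. It assumes the fetcher is safe to call from several threads at once (which is what call_async gives you, per the above). The names fetch_details and pool_size are mine, and the block stands in for the real detail call.

```ruby
# Sketch only: a fixed-size worker pool. `fetcher` is any callable that
# takes a row id and returns the detail record. In the real app it might
# wrap client.call_async("some.method", id); that method name is made up.
def fetch_details(ids, pool_size: 10, &fetcher)
  jobs = Queue.new
  ids.each { |id| jobs << id }
  results = Queue.new

  workers = Array.new(pool_size) do
    Thread.new do
      loop do
        id = begin
          jobs.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [id, fetcher.call(id)]
      end
    end
  end
  workers.each(&:join)

  # Drain the results queue into a hash keyed by row id.
  Array.new(results.size) { results.pop }.to_h
end

details = fetch_details((1..20).to_a, pool_size: 5) { |id| "detail #{id}" }
```

Each worker pulls ids off a shared queue until it is empty, so the pool size is the only tuning knob.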

Using a separate process with database connections either means dRb
and some interesting IPC, BackgroundRb, or something else like that.
I’m just pleased that I could bring the average turnaround per request
from 65ms to 4ms (leveled at 1000 combination calls to the master and
detail services). W00t!

cwd · April 25, 2009, 2:15am

s.ross wrote:

Assuming we can’t improve the request/response rate of the Web service
calls or the granularity of the return data, is there a way to
implement some parallelism?

Run the entire downloader from a cron task.

Your question ass-umes that you must run out of one controller action.
Wrong mindset!

And BTW 10,000 XML records should be trivial, so you might look for a
bottleneck there. I would not read them all as a huge Ruby string and
then convert them into a huge DOM model in memory. That would thrash. I
would use what I think is called the “SAX” model of reading, where you
register a callback for each node type, then let your reader stream
them in…
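
Something like this sketch, using REXML's stream listener. The rows/row/name layout is invented for the example and is surely not what the third-party service returns, and RowListener is just a name I picked.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Sketch only: the <rows>/<row> layout below is invented for
# illustration and is not the format of the third-party service.
class RowListener
  include REXML::StreamListener

  attr_reader :rows

  def initialize
    @rows = []
    @current = nil # hash for the row being read, or nil outside a row
    @field = nil   # name of the child element being read, if any
  end

  def tag_start(name, _attrs)
    if name == "row"
      @current = {}
    elsif @current
      @field = name
    end
  end

  def text(data)
    return unless @current && @field
    @current[@field] = (@current[@field] || "") + data
  end

  def tag_end(name)
    if name == "row"
      @rows << @current
      @current = nil
    end
    @field = nil
  end
end

xml = <<~XML
  <rows>
    <row><id>1</id><name>alpha</name><active>true</active></row>
    <row><id>2</id><name>beta</name><active>false</active></row>
  </rows>
XML

listener = RowListener.new
REXML::Document.parse_stream(xml, listener)
```

The listener only ever holds one row's worth of state at a time, which is the whole point compared with building a DOM.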

cwd · April 25, 2009, 8:55am

On Sat, Apr 25, 2009 at 7:35 AM, s.ross [email protected] wrote:

Run the entire downloader from a cron task.
And BTW 10,000 XML records should be trivial, so you might look for a
I had considered the cron task but that’s one step ahead of where I am

or port XML-RPC so that it works with an evented architecture such as
EventMachine or Packet (in which case you can use traditional workers
for concurrent download)

cwd · April 25, 2009, 4:06am

On Apr 24, 2009, at 5:15 PM, Phlip wrote:

s.ross wrote:

Assuming we can’t improve the request/response rate of the Web
service calls or the granularity of the return data, is there a
way to implement some parallelism?

Run the entire downloader from a cron task.

Your question ass-umes that you must run out of one controller
action. Wrong mindset!

I hope that’s not what my question ass-umes. I am able to get the
master records in chunks of 20-100. And they parse just fine. The hope
was to make the detail retrieval of these records happen in parallel
with fetching the next batch – which I have successfully done.

And BTW 10,000 XML records should be trivial, so you might look for
a bottleneck there. I would not read them all as a huge Ruby string
and then convert them into a huge DOM model in memory. That would
thrash. I would use what I think is called the “SAX” model of
reading, where you register a callback for each node type, then let
your reader stream them in…

Using DOM callbacks is just fine in the event you have a poorly
bounded rowset count. I have a pretty well-bounded count and parsing
the chunked data makes it quite manageable without callbacks.

I had considered the cron task but that’s one step ahead of where I am
right now. I’m running them from the console to determine the
acceptability of how the thing is architected. As I noted in a
followup post to the list, I discovered that using XMLRPC::Client#call
can expose some potential data corruption in a multi-threaded
implementation. However, XMLRPC::Client#call_async does not have that
same problem, and by shifting the detail record fetch process into
threads that begin after each chunk of master records is read, I
increased the effective processing efficiency by around 2.5x because
while the next master Web service fetch was blocking on the response,
all the little detail fetches were purring right along in their own
threads.
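
In outline, the overlap looks roughly like the sketch below. Both fetchers are hypothetical stand-ins for the real Web service calls (the even-ids-are-active rule is only there so the example has something to filter), and download is just a name for the example.

```ruby
# Sketch only: both fetchers are hypothetical stand-ins for the real
# Web service calls.
def fetch_master_batch(offset, size, total)
  (offset...[offset + size, total].min).map { |id| { id: id, active: id.even? } }
end

def fetch_detail(id)
  "detail #{id}"
end

def download(total:, batch_size: 100)
  details = {}
  lock = Mutex.new
  threads = []

  (0...total).step(batch_size) do |offset|
    batch = fetch_master_batch(offset, batch_size, total) # blocks on the master call
    # Hand the active rows of this batch to a background thread, then
    # loop straight back to the next master fetch while details download.
    threads << Thread.new(batch) do |rows|
      rows.select { |r| r[:active] }.each do |row|
        detail = fetch_detail(row[:id])
        lock.synchronize { details[row[:id]] = detail }
      end
    end
  end

  threads.each(&:join)
  details
end

details = download(total: 250, batch_size: 100)
```

One thread per batch keeps the thread count proportional to the number of master fetches rather than the number of rows.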

Thx,

Steve