Forum: Ruby Q: Architecting large Web service download app in Ruby

Steve Ross (cwd)
on 2009-04-24 19:45
(Received via mailing list)
Architecting large Web service downloads in Ruby.

I've run up against a performance bottleneck (that I could have
predicted), but fear my design is causing more grief. Here's an outline:

Background
==========

My app has to download from 1,000 to 20,000 rows from a third-party
Web service (over HTTP, using xml-rpc, which uses net/http). No row
has any indicator when it was last updated, so local caches are
difficult if not impossible to reliably maintain. Everything has to be
considered "dirty". These rows can be downloaded in batches of up to
100, so it's on the order of 90 - 120 seconds over a quick net
connection to grab them and insert them into the primary table
synchronously.

The rub is that each row has a single detail row that is quite a bit
bulkier. Each of the master rows has an Active flag, and at any given
time between 50 and 80% of them are active. Iterating all the active
rows and populating the detail rows with individual Web service calls
takes on the order of 45-85 minutes, which is the real performance
problem. The data is usable without the detail information, but
minimally so.

The Question
============

Assuming we can't improve the request/response rate of the Web service
calls or the granularity of the return data, is there a way to
implement some parallelism? I took a hack at it by creating a thread
pool, keeping up to 50 alive at once. Obviously tunable. However, the
problem is that every so often, a response is pretty darn garbled. Is
there a thread-safety issue in net/http that is causing results to be
stepped on? If that's the case, does a different approach suggest
itself? Or is it a "you're screwed, be patient" situation?
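The pool described here can be sketched with the stdlib thread-safe Queue; fetch_detail and the id range are placeholders standing in for the real per-row XML-RPC calls:

```ruby
require "thread"

POOL_SIZE = 50  # "up to 50 alive at once" -- tunable

# Hypothetical stand-in for the real web-service detail fetch;
# in the real app this is an HTTP round trip per row.
def fetch_detail(id)
  "detail-#{id}"
end

ids     = (1..200).to_a
jobs    = Queue.new   # Queue is thread-safe: no manual locking needed
results = Queue.new
ids.each { |id| jobs << id }

workers = Array.new(POOL_SIZE) do
  Thread.new do
    loop do
      id = begin
        jobs.pop(true)   # non-blocking pop; raises ThreadError when drained
      rescue ThreadError
        break
      end
      results << [id, fetch_detail(id)]
    end
  end
end
workers.each(&:join)

details = {}
until results.empty?
  id, payload = results.pop
  details[id] = payload
end
```

Pushing results through a second Queue (rather than mutating a shared Hash from each thread) keeps all the cross-thread traffic on thread-safe structures.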

Thanks for reading, and HAHAHAHAHAHAHA is a perfectly acceptable
answer :)

--steve
James Gray (bbazzarrakk)
on 2009-04-24 20:17
(Received via mailing list)
On Apr 24, 2009, at 12:44 PM, s.ross wrote:

> Is there a thread-safety issue in net/http that is causing results
> to be stepped on?

I can't say I know for sure, but I'm doubting it.

> If that's the case, does a different approach suggest itself?

Well, you can fork() processes instead of Threads, if you're not on
Windows.  This would eliminate Thread safety concerns.  You may have
to work out some IPC issues though if each process can't work totally
on its own.
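The fork()-per-batch idea might look like this; fetch_batch and the batch split are made up for illustration, and Marshal over an IO.pipe handles the IPC back to the parent:

```ruby
# Placeholder for the real web-service work done in each child.
def fetch_batch(range)
  range.map { |id| [id, "detail-#{id}"] }.to_h
end

batches = [(1..50), (51..100), (101..150)]

# One child process per batch; each writes its marshaled result
# to a pipe the parent reads from. Unix-only, as noted above.
pipes = batches.map do |batch|
  reader, writer = IO.pipe
  fork do
    reader.close
    writer.write(Marshal.dump(fetch_batch(batch)))
    writer.close
    exit!  # skip at_exit handlers in the child
  end
  writer.close
  reader
end

details = {}
pipes.each do |reader|
  details.merge!(Marshal.load(reader.read))  # read blocks until child closes
  reader.close
end
Process.waitall
```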

James Edward Gray II
Steve Ross (cwd)
on 2009-04-24 22:41
(Received via mailing list)
On Apr 24, 2009, at 11:14 AM, James Gray wrote:

> Well, you can fork() processes instead of Threads, if you're not on
> Windows.  This would eliminate Thread safety concerns.  You may have
> to work out some IPC issues though if each process can't work
> totally on its own.
>
> James Edward Gray II

Ok, I looked into it further, and there is a thread safety issue and
it's in the xml-rpc library. If you use the call method in a thread,
you can have the response buffer overwritten by a response in another
thread. However, if you use call_async, then a new server connection
and response buffer is used, thereby rendering it thread safe. And the
performance win is astonishing!

Using a separate process with database connections either means DRb
and some interesting IPC, BackgroundRb, or something else like that.
I'm just pleased that I could bring the average turnaround per request
from 65ms to 4ms (leveled at 1000 combination calls to the master and
detail services). W00t!
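The failure mode described above can be reduced to a toy model. The two client classes here are hypothetical; they model only the relevant difference: one shared response buffer reused across calls (as with #call on a single connection) versus a fresh buffer per call (as with #call_async opening a new connection):

```ruby
# Thread-UNSAFE: every call writes into the same instance variable,
# so a context switch lets another thread overwrite it before we
# read it back -- a "pretty darn garbled" response.
class SharedBufferClient
  def call(id)
    @buffer = "response-#{id}"
    sleep 0.001               # widen the race window for the demo
    @buffer
  end
end

# Thread-safe: the buffer is local to the call, nothing to stomp on.
class PerCallClient
  def call(id)
    buffer = "response-#{id}"
    sleep 0.001
    buffer
  end
end

# Count responses that came back belonging to some other request.
def garbled(client)
  threads = (1..50).map { |i| Thread.new { [i, client.call(i)] } }
  threads.map(&:value).count { |id, resp| resp != "response-#{id}" }
end

# garbled(SharedBufferClient.new) is almost always > 0;
# garbled(PerCallClient.new) is always 0.
```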
Phlip (Guest)
on 2009-04-25 02:15
(Received via mailing list)
s.ross wrote:

> Assuming we can't improve the request/response rate of the Web service
> calls or the granularity of the return data, is there a way to
> implement some parallelism?

Run the entire downloader from a cron task.

Your question ass-umes that you must run out of one controller action.
Wrong mindset!

And BTW 10,000 XML records should be trivial, so you might look for a
bottleneck there. I would not read them all as a huge Ruby string and
then convert them into a huge DOM model in memory. That would thrash.
I would use what I think is called the "SAX" model of reading, where
you register a callback for each node type, then let your reader
stream them in...
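The streaming style Phlip describes is available in the stdlib via REXML's stream parser: callbacks fire per node, so the whole document is never held as one DOM in memory. The <row>/<id> element names below are made up for illustration:

```ruby
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

class RowListener
  include REXML::StreamListener
  attr_reader :ids

  def initialize
    @ids   = []
    @in_id = false
  end

  def tag_start(name, _attrs)
    @in_id = (name == "id")   # remember when we enter an <id> element
  end

  def text(data)
    @ids << data.strip if @in_id && !data.strip.empty?
  end

  def tag_end(_name)
    @in_id = false
  end
end

xml = "<rows>" + (1..3).map { |i| "<row><id>#{i}</id></row>" }.join + "</rows>"
listener = RowListener.new
REXML::Parsers::StreamParser.new(xml, listener).parse
# listener.ids => ["1", "2", "3"]
```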
Steve Ross (cwd)
on 2009-04-25 04:06
(Received via mailing list)
On Apr 24, 2009, at 5:15 PM, Phlip wrote:

> s.ross wrote:
>
>> Assuming we can't improve the request/response rate of the Web
>> service  calls or the granularity of the return data, is there a
>> way to  implement some parallelism?
>
> Run the entire downloader from a cron task.
>
> Your question ass-umes that you must run out of one controller
> action. Wrong mindset!

I hope that's not what my question ass-umes. I am able to get the
master records in chunks of 20-100. And they parse just fine. The hope
was to make the detail retrieval of these records happen in parallel
with fetching the next batch -- which I have successfully done.

> And BTW 10,000 XML records should be trivial, so you might look for
> a bottleneck there. I would not read them all as a huge Ruby string
> and then convert them into a huge DOM model in memory. That would
> thrash. I would use what I think is called the "SAX" model of
> reading, where you register a callback for each node type, then let
> your reader stream them in...

Using DOM callbacks is just fine in the event you have a poorly
bounded rowset count. I have a pretty well-bounded count and parsing
the chunked data makes it quite manageable without callbacks.

I had considered the cron task but that's one step ahead of where I am
right now. I'm running them from the console to determine the
acceptability of how the thing is architected. As I noted in a
followup post to the list, I discovered that using XMLRPC::Client#call
can expose data corruption in a multi-threaded implementation.
However, XMLRPC::Client#call_async does not have that same problem,
and by shifting the detail record fetch into threads that begin after
each chunk of master records is read, I increased the effective
processing efficiency by around 2.5x: while the next master Web
service fetch was blocking on the response, all the little detail
fetches were purring right along in their own threads.
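The pipeline described here can be sketched as follows; fetch_master_batch and fetch_detail are hypothetical stand-ins for the real XML-RPC calls, with a sleep simulating the blocking HTTP round trip:

```ruby
def fetch_master_batch(n)
  sleep 0.01                        # stands in for the blocking master fetch
  Array.new(100) { |i| n * 100 + i }
end

def fetch_detail(id)
  "detail-#{id}"                    # stands in for one bulky detail call
end

details        = Queue.new
detail_threads = []

5.times do |n|
  batch = fetch_master_batch(n)     # main thread blocks on the web service...
  detail_threads << Thread.new(batch) do |rows|
    # ...while detail fetches for earlier batches keep running here
    rows.each { |id| details << [id, fetch_detail(id)] }
  end
end
detail_threads.each(&:join)
```

The detail threads for batch N overlap the blocking fetch of batch N+1, which is where the ~2.5x win comes from.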

Thx,

Steve
Hemant Kumar (gnufied)
on 2009-04-25 08:55
(Received via mailing list)
On Sat, Apr 25, 2009 at 7:35 AM, s.ross <cwdinfo@gmail.com> wrote:

>> Run the entire downloader from a cron task.
>
> I had considered the cron task but that's one step ahead of where I am
> right now.
Or port XML-RPC so it works with an evented architecture such as
EventMachine or Packet (in which case you can use traditional workers
for concurrent download).