Net/http and rexml

ljscoras · November 7, 2006, 10:55am

Hey all;

Net/http has a method for processing data in segments–by passing the
get method a block–but is there an easy way to get to the socket
directly?

The script I’m working on is grabbing a giant xml file from a remote
site, and I want to process it as it comes in. I’ve already changed
the code to use rexml’s pull parser, so I’m guessing that now all I
need to do is give it the correct IO handle and let it go.

I don’t see anything in the docs about exposing the socket, and I
don’t want to rip open the class if there’s something obvious I’m
missing. Any ideas on the best way to go about this?

ljscoras · November 7, 2006, 10:56am

Louis J Scoras wrote:

Hey all;

Net/http has a method for processing data in segments–by passing the
get method a block–but is there an easy way to get to the socket
directly?

The script I’m working on is grabbing a giant xml file from a remote
site, and I want to process it as it comes in.

You mean, line by line? The socket class you are describing doesn’t know
about lines, it knows about blocks. So try this: read a block, split it
into lines, do your processing. If you do this, you will discover some
blocks end in the middle of a line. Then you will say, “Gee, maybe I
should read the whole thing, then do the processing.”

At that point, you will understand why the class is written as it is.

I’ve already changed
the code to use rexml’s pull parser, so I’m guessing that now all I
need to do is give it the correct IO handle and let it go.

I don’t see anything in the docs about exposing the socket, and I
don’t want to rip open the class if there’s something obvious I’m
missing. Any ideas on the best way to go about this?

Yep. Read the entire thing. Then process the result.

ljscoras · November 7, 2006, 10:56am

On 11/2/06, Paul L. wrote:

You mean, line by line? The socket class you are describing doesn’t know
about lines, it knows about blocks.

Nope. Not line by line. All the parser should need is a token, and I
should only have to read as much data as I need to complete one, so
blocks would be fine.

So try this: read a block, split it into lines, do your processing. If you
do this, you will discover some blocks end in the middle of a line. Then you
will say, “Gee, maybe I should read the whole thing, then do the
processing.”

No, I wouldn’t say that. I’d just read enough segments into a
buffer until I could complete the next token.

Yep. Read the entire thing. Then process the result.

Why? I should be able to start processing simultaneously. That’s
what the stream paradigm was developed for. What if I got three
tokens into the xml and found that it was malformed? That would be an
awful waste of bandwidth, no?
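
For what it’s worth, here is a rough sketch of the "process it as it comes in" idea. It assumes the segments arrive through some block-style API (with net/http that would be `response.read_body { |segment| ... }`); here a plain array stands in for the network so the example runs on its own. The helper name `segments_to_io` is made up for illustration. It writes each segment into a pipe from a background thread, and the read end of the pipe is an ordinary IO that REXML’s pull parser can consume.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: copy segments into a pipe from a background thread
# and hand back the read end, which behaves like any other IO.
def segments_to_io(segments)
  reader, writer = IO.pipe
  Thread.new do
    begin
      segments.each { |segment| writer.write(segment) }
    ensure
      writer.close # closing the write end is what signals end-of-file
    end
  end
  reader
end

# Stand-in for the segments the HTTP library would yield.
# Note that the boundaries fall in the middle of tags.
segments = ["<items><it", "em id='1'/><item i", "d='2'/></items>"]

parser = REXML::Parsers::PullParser.new(segments_to_io(segments))
ids = []
while parser.has_next?
  event = parser.pull
  # For a start_element event, event[0] is the tag name and
  # event[1] is the attribute hash.
  ids << event[1]['id'] if event.start_element? && event[0] == 'item'
end
```

The parser only sees a stream, so it does not matter where the segment boundaries fall, and the loop can bail out early if the document turns out to be malformed.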

ljscoras · November 7, 2006, 10:56am

Louis J Scoras wrote:

On 11/2/06, Paul L. wrote:

You mean, line by line? The socket class you are describing doesn’t know
about lines, it knows about blocks.

Nope. Not line by line. All the parser should need is a token, and I
should only have to read as much data as I need to complete one, so
blocks would be fine.

And you could set things up to read more data when your block-oriented
input stream is depleted, easy to arrange. This will provide the
appearance of a local stream, a common arrangement in socket reading
algorithms.

So try this: read a block, split it into lines, do your processing. If
you do this, you will discover some blocks end in the middle of a line.
Then you will say, “Gee, maybe I should read the whole thing, then do the
processing.”

No, I wouldn’t say that. I’d just read enough segments into a
buffer until I could complete the next token.

s/segments/blocks/

Yep. Read the entire thing. Then process the result.

Why? I should be able to start processing simultaneously.

Block by block, yes. The block reading back end can be made to appear to
be a stream locally, but there are excellent reasons to read blocks at
the network-protocol level, and sometimes the bigger the better.

That’s
what the stream paradigm was developed for.

Yes. You can always turn a block into a stream locally. And no, you
don’t have to read the entire thing, I just prefer it that way. A
personal preference, nothing more, doubtless springing from my
unreliable Internet access.
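
A minimal sketch of turning blocks into a local stream, assuming the blocks come from anything that responds to `each` (an array stands in for the socket here). `BlockStream` is a made-up name, not a library class. It refills its buffer whenever the buffer runs dry, so a line that straddles two blocks is still returned whole.

```ruby
# Hypothetical wrapper: presents a sequence of blocks as a line-oriented stream.
class BlockStream
  def initialize(blocks)
    @blocks = blocks.each # external enumerator over the incoming blocks
    @buffer = +""
    @done = false
  end

  # Return the next line (including its newline), or nil at end of input.
  def gets
    fill until @buffer.include?("\n") || @done
    return nil if @buffer.empty?
    # Take up to and including the newline, or everything if none is left.
    cut = @buffer.index("\n") || (@buffer.size - 1)
    @buffer.slice!(0, cut + 1)
  end

  private

  # Pull one more block into the buffer, noting when the source is exhausted.
  def fill
    @buffer << @blocks.next
  rescue StopIteration
    @done = true
  end
end

# Blocks that end in the middle of a line.
blocks = ["first li", "ne\nsecond line\nthi", "rd line\n"]
stream = BlockStream.new(blocks)
lines = []
while (line = stream.gets)
  lines << line
end
```

The caller never sees the block boundaries, and nothing beyond the current line has to be read before processing starts.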

ljscoras · November 7, 2006, 10:56am

Louis J Scoras wrote:

The script I’m working on is grabbing a giant xml file from a remote
site, and I want to process it as it comes in. I’ve already changed
the code to use rexml’s pull parser, so I’m guessing that now all I
need to do is give it the correct IO handle and let it go.

I don’t see anything in the docs about exposing the socket, and I
don’t want to rip open the class if there’s something obvious I’m
missing. Any ideas on the best way to go about this?

I fully sympathize… I went through the same mess a while back.
IO is one of the spots where the Ruby standard library is fairly
messy.

net/http is overly complicated for this kind of stuff. Look at openuri:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

And you can’t actually “expose the socket” for the HTTP stream and
expect things to work - the server might very well be using HTTP/1.1
chunked encoding, which means you’d get the data interspersed with bytes
indicating the length of the following chunk etc., or the connection
might be marked Keep-Alive and use the content-length to indicate
how far you should read, so to pass it to REXML you’d need a wrapper -
which is what openuri provides you with.
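
To illustrate the chunked-encoding point, here is a small sketch of what a wrapper has to undo before an XML parser can see the body. The `dechunk` helper is made up for this example and is deliberately minimal: it ignores trailer headers and does no error handling. A `StringIO` stands in for the raw socket.

```ruby
require 'stringio'

# Hypothetical helper: decode an HTTP/1.1 chunked body from an IO-like object.
def dechunk(io)
  body = +""
  loop do
    # Each chunk starts with its size as a hex number on its own line.
    size = io.readline("\r\n").to_i(16)
    break if size.zero? # a zero-length chunk marks the end of the body
    body << io.read(size)
    io.read(2) # discard the CRLF that terminates the chunk data
  end
  body
end

# What the bytes look like on the wire: sizes interleaved with the XML.
raw = "5\r\n<a><b\r\n7\r\n/></a>\n\r\n0\r\n\r\n"
decoded = dechunk(StringIO.new(raw))
```

Handing `raw` straight to REXML would fail on the leading `5`, which is why the bare socket is not usable as the parser’s input.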

Vidar

ljscoras · November 7, 2006, 10:56am

Paul L. wrote:

You mean, line by line? The socket class you are describing doesn’t know
about lines, it knows about blocks. So try this: read a block, split it
into lines, do your processing. If you do this, you will discover some
blocks end in the middle of a line. Then you will say, “Gee, maybe I should
read the whole thing, then do the processing.”

At that point, you will understand why the class is written as it is.

Class TCPSocket has both the methods each_line and readline. That isn’t
the problem.

The issue with net/http is that it’s an overly complicated API for
something that in most instances is very easy.

Yep. Read the entire thing. Then process the result.

Not all network streams (or HTTP initiated transfers) ever finish. And
often the files will be too large to process that way - especially with
REXML, which is extremely memory hungry.

A better solution is to use openuri:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

Or use a decent HTTP API instead of net/http.

Vidar