How to detect used protocol (SOAP, JSON, XML etc.)

Hi guys!

I am building an ESB (Enterprise Service Bus) for a Content Management
System.

The ESB (as you probably know) should be able to handle multiple
protocols (JSON, XML, SOAP).
So it has a TCP socket and it gets a “request”.

The problem is that I can’t figure out how to check what protocol is
used. The only thing I can think of is to try to transform it
with every known “protocol” until it no longer returns an error.
That is, however, not the most elegant solution.

Does anyone have any idea? (I am really stuck on this problem.)
Thanxs!

greets Heldopslippers

Well, after a discussion with a few students I realized I could of course
use regular expressions!
Every protocol has its own distinctive way of packaging its information or
request.
For example:
Json {“get”:“hello, world”}
xml hello, world
http
GET /hello%20world HTTP/1.1

If you have an even more elegant solution I would of course love to hear
it.
Or if you don’t understand why somebody would want to build an
ESB for content management systems, feel free to ask :slight_smile:

On Fri, Feb 12, 2010 at 2:22 AM, jeljer te Wies [email protected] wrote:

I am building an ESB (Enterprise Service Bus) for a Content Management
System.

The esb (as you probably know) should be able to handle multiple
protocols (JSON, XML, SOAP).
So it has a tcp socket and it gets a “request”.

The problem is that I can’t figure out how to check what protocol is
used.

If all these “protocols” are being delivered over HTTP, you can look
at the “Accept” header. Read RFC2616 for details…

HTH,

jeljer te Wies wrote:

The esb (as you probably know) should be able to handle multiple
protocols (JSON, XML, SOAP).
So it has a tcp socket and it gets a “request”.

The problem is that I can’t figure out how to check what protocol is
used. (the only thing i can figure out is to just try to transform it
with every known “protocol” until it doesn’t return an error anymore.
This is however not the most elegant solution.

This is an encapsulation issue.

If your messages are transported using HTTP then you have a
“Content-Type:” header which says whether it’s application/json or
application/xml. And/or you can have different HTTP endpoint URLs for
these different services (/foo.xml, /foo.json, /foo.soap or whatever)
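A minimal Ruby sketch of the Content-Type idea, for illustration only. The lookup table and the name `format_for_content_type` are made up for this example. Note that SOAP 1.1 is normally sent as text/xml, so a real dispatcher would probably also need the SOAPAction header or the endpoint URL to tell SOAP from plain XML.

```ruby
# Hypothetical lookup table from media type to an adapter name.
FORMATS_BY_MEDIA_TYPE = {
  "application/json"     => :json,
  "application/xml"      => :xml,
  "text/xml"             => :xml,   # also used by SOAP 1.1
  "application/soap+xml" => :soap   # SOAP 1.2
}.freeze

# Map the value of a Content-Type header to a format symbol.
# Parameters such as "; charset=utf-8" are ignored.
def format_for_content_type(header_value)
  media_type = header_value.to_s.split(";").first.to_s.strip.downcase
  FORMATS_BY_MEDIA_TYPE.fetch(media_type, :unknown)
end

puts format_for_content_type("application/json; charset=utf-8")
```

Anything not in the table comes back as `:unknown`, so the caller can reject the request instead of guessing.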

I find it highly unlikely in a real ESB that the messages are squirted
over a raw TCP socket as you describe. For example, you wouldn’t be able
to send multiple messages over the same connection, because the only way
the endpoint would be able to determine that the message had finished
would be waiting for the socket to close (*).

But even in that case you can have multiple endpoints, by listening on
different TCP ports. So make your SOAP listener listen on one port, and
your JSON one on a different port.
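One possible shape for the port-per-format idea, sketched in Ruby. The `ADAPTERS` table, the `start_listener` name, and the one-request-per-connection behaviour are assumptions made for the example, not the actual ESB code discussed in this thread.

```ruby
require "socket"
require "json"

# Hypothetical adapters: each turns the raw request body into a Ruby object.
ADAPTERS = {
  json: ->(raw) { JSON.parse(raw) },
  xml:  ->(raw) { raw }  # placeholder: hand the raw document to an XML parser of your choice
}.freeze

# Start one listener thread for one adapter. The port alone decides the
# format, so nothing has to be guessed from the payload.
# Passing port 0 lets the OS pick a free port.
def start_listener(format, port, &handler)
  server = TCPServer.new("127.0.0.1", port)
  thread = Thread.new do
    loop do
      client = server.accept
      begin
        # One request per connection: read until the peer closes
        # (or half-closes) its sending side.
        raw = client.read
        handler.call(format, ADAPTERS.fetch(format).call(raw))
      rescue StandardError => e
        warn "#{format} adapter failed: #{e.message}"
      ensure
        client.close
      end
    end
  end
  [server, thread]
end
```

A caller would start one listener per format, for example `start_listener(:json, 9001) { |format, message| p [format, message] }`, where 9001 is an arbitrary port chosen for illustration.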

HTH,

Brian.

(*) Yes, I know stream parsers can tell when the closing tag is
received. The closing tag could be followed by whitespace and/or
comments. It would be a very strange system which allowed multiple XML
roots to be sent one after the other. The nearest equivalent would be
something like XMPP, but that has a single opening root tag followed by
nested messages.

Hi Brian,

Thank you for your insights in this matter. I do think it is not
unlikely to have raw tcp sockets for transportation. It is really
possible:
“For example, you wouldn’t be able
to send multiple messages over the same connection, because the only way
the endpoint would be able to determine that the message had finished
would be waiting for the socket to close (*).”

This is not true. You could define a small protocol of your own. For example:
the sender first sends the number of bytes it is about to send (just an
integer) so the receiver knows what to expect. So for example:

  • 51
  • {“server”:“menus”, “method”:“deleteItem”, “id”:“2”}

This way you could have persistent connections (believe me it works, I
have tried it). I am not using this though. My current setup is just one
request per TCP connection (the server will just close it after one
request).
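The length-prefix scheme described above can be sketched in Ruby roughly as follows. This assumes the length is sent as a decimal number on its own line and counts bytes, not characters; the thread does not say how the integer is actually encoded, and the names `write_frame` and `read_frame` are invented for the example.

```ruby
require "stringio"

# Write one frame: the payload's byte length on its own line, then the payload.
def write_frame(io, payload)
  io.write("#{payload.bytesize}\n")
  io.write(payload)
end

# Read one frame and return its payload, or nil at end of stream.
def read_frame(io)
  header = io.gets
  return nil if header.nil?

  length = Integer(header.strip, 10)
  payload = io.read(length)
  raise EOFError, "truncated frame" if payload.nil? || payload.bytesize < length

  # read(length) returns a binary string; this sketch assumes UTF-8 payloads.
  payload.force_encoding(Encoding::UTF_8)
end

# A StringIO stands in for the TCP socket here.
io = StringIO.new
write_frame(io, '{"server":"menus", "method":"deleteItem", "id":"2"}')
write_frame(io, "<ping/>")
io.rewind
puts read_frame(io)
puts read_frame(io)
```

Because the receiver knows exactly how many bytes belong to each message, several messages can follow each other on the same connection.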

I know that if, for example, a web browser needs more requests, it has to
initiate a new connection for every request. I think this is just a
very small burden (or is it?).

“But even in that case you can have multiple endpoints, by listening on
different TCP ports. So make your SOAP listener listen on one port, and
your JSON one on a different port.”

This is a very good solution !! (i, of course, thought about it but
only for maybe a split second :stuck_out_tongue: ). I would still have to check the
message (ports could of course be set wrong).

So if I want persistent and non-persistent connections (and maybe even
UDP) I have just two options:

  • move everything to HTTP/1.1 compliance, or
  • have different TCP connections for the different “protocols”/“adapters”
    (or whatever you would like to call it) :stuck_out_tongue:

Thanxs Brian! I am doing this research/building for my final thesis (on
my own) and I am building the prototype in Ruby (and will then build the real
thing in C/C++ (because in Ruby it will be SLLOOOOWW)) :P.

It is nice to have a forum where people like you actually take the time
to give different insights!
Of course everybody is still welcome to spam his (or her) opinion!!! (I
think these discussions are really interesting :slight_smile: )

greets Jeljer

Hassan S. wrote:

On Fri, Feb 12, 2010 at 2:22 AM, jeljer te Wies [email protected] wrote:

I am building an ESB (Enterprise Service Bus) for a Content Management
System.

The esb (as you probably know) should be able to handle multiple
protocols (JSON, XML, SOAP).
So it has a tcp socket and it gets a “request”.

The problem is that I can’t figure out how to check what protocol is
used.

If all these “protocols” are being delivered over HTTP, you can look
at the “Accept” header. Read RFC2616 for details…

HTH,

Hmm, no, it isn’t :frowning: Mostly they are sent over just a TCP connection.
Only the browsers that request a webpage use HTTP (of course). The
other servers that want to talk to the ESB all have different
protocols (on top of TCP, I mean).

But I will look more closely at the RFCs, they can be useful :stuck_out_tongue:

thanxs !! :slight_smile: appreciated

Then you’re not just squirting the object over a raw TCP socket; you’re
adding your own proprietary encapsulation/framing protocol around it.
And if you’re going to do that, you might as well add another field
which says whether it’s JSON or XML or whatever, and hence the problem
is solved.

Yes, you are right. I could use an extra field. But I need the ability to
add more “adapters” when the ESB gets more connections! So it’s not so
much what I am getting in but how I am able to detect what is coming
in.

I know that if (for example a web browser needs more requests it has to
initiate a new connection for every request).

No, it doesn’t. With HTTP/1.1 the client can hold open a single TCP
connection and send a series of requests one after the other. See RFC
2616 for more details.

Yes i know… but read what i wrote above that sentence:

this way you could have persistent connections (believe me it works, I
have tried it). I am not using this though. My current setup is just one
request per TCP connection (the server will just close it after one
request).

Sorry if this was confusing :stuck_out_tongue: (As a graduate student I do know my
protocols.)

It won’t necessarily be SLLOOOWW; it depends what you’re doing. If
you’re using HTTP, you can use one of the many excellent HTTP servers
(e.g. mongrel, unicorn, rainbow to name a few), which contain C for the
most important code paths; or something like Phusion Passenger under
Apache or Nginx, which is basically an Apache module for running a Ruby
Rack app. (Rack is the pluggable way to write a fast low-level HTTP
service in Ruby)

Yes, you are totally right! But it is not really Ruby that is slow,
but the many SQL database connections that are needed by every request (see
the attachment for a quick overview).

This weekend I built the ESB, and now there is a port for every adapter.
This way I know what is coming in and I have the flexibility. (Also,
because all the adapters run in threads, if one crashes the others
don’t!)

As soon as I have finished it in Ruby I will be making it open source (of
course!) … though I don’t think anybody would use such a complex
system :stuck_out_tongue:

Thanxs Brian !

jeljer te Wies wrote:

“For example, you wouldn’t be able
to send multiple messages over the same connection, because the only way
the endpoint would be able to determine that the message had finished
would be waiting for the socket to close (*).”

This is not true. you could initiate a special protocol. for example:
the sender first sends the amount of traffic it wants to send (just in
integer) so the receiver knows what he can expect. so for example:

  • 51
  • {“server”:“menus”, “method”:“deleteItem”, “id”:“2”}

Then you’re not just squirting the object over a raw TCP socket; you’re
adding your own proprietary encapsulation/framing protocol around it.
And if you’re going to do that, you might as well add another field
which says whether it’s JSON or XML or whatever, and hence the problem
is solved.
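A rough Ruby sketch of what such a framing header with a format field could look like. The header layout ("format, space, byte length, newline") and the names `write_typed_frame`, `read_typed_frame`, `DECODERS` and `decode_next` are all assumptions made for illustration; nothing in the thread fixes a wire format.

```ruby
require "json"
require "stringio"

# Assumed frame layout: "<format> <byte length>\n" followed by the payload.
def write_typed_frame(io, format, payload)
  io.write("#{format} #{payload.bytesize}\n#{payload}")
end

# Returns [format, payload], or nil at end of stream.
def read_typed_frame(io)
  header = io.gets
  return nil if header.nil?

  format, length = header.split
  payload = io.read(Integer(length, 10)).to_s
  # read(length) returns a binary string; this sketch assumes UTF-8 payloads.
  [format.to_sym, payload.force_encoding(Encoding::UTF_8)]
end

# Hypothetical adapter registry. Supporting a new format means adding one
# entry here; the framing code does not change.
DECODERS = { json: ->(raw) { JSON.parse(raw) } }

def decode_next(io)
  frame = read_typed_frame(io)
  return nil if frame.nil?

  format, payload = frame
  decoder = DECODERS.fetch(format) { raise ArgumentError, "no adapter for #{format}" }
  decoder.call(payload)
end
```

With a registry like this, the sender states the format explicitly and the receiver never has to inspect the payload to find out which adapter to use.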

In this case, ultimately you’re re-inventing HTTP. HTTP is admittedly an
extremely complex protocol to implement correctly. For an example of
what you need to do it right, see
http://webmachine.basho.com/diagram.html

So you may be justified in your own app by doing something much simpler.
But first do a bit of research to see if some existing protocol might
work for you (e.g. XML-RPC, JSON-RPC, XMPP) as it would let you use
existing client code.

I know that if (for example a web browser needs more requests it has to
initiate a new connection for every request).

No, it doesn’t. With HTTP/1.1 the client can hold open a single TCP
connection and send a series of requests one after the other. See RFC
2616 for more details.

Thanxs Brian! I am doing this research/building for my final thesis (on
my own) and I am building the prototype in Ruby (and will then build the real
thing in C/C++ (because in Ruby it will be SLLOOOOWW)) :P.

It won’t necessarily be SLLOOOWW; it depends what you’re doing. If
you’re using HTTP, you can use one of the many excellent HTTP servers
(e.g. mongrel, unicorn, rainbow to name a few), which contain C for the
most important code paths; or something like Phusion Passenger under
Apache or Nginx, which is basically an Apache module for running a Ruby
Rack app. (Rack is the pluggable way to write a fast low-level HTTP
service in Ruby)

Then it depends how much actual request processing is required. In the
best case, if you’re just picking one of a few canned responses and
sending it back, your ruby app could be almost as fast as a C one.

If you do want to write a performant TCP server in a real high level
language, you might also want to look at erlang. For example, ejabberd
scales to tens of thousands of concurrent connections.

Regards,

Brian.

So either force the clients to use the same protocol you have defined,
or listen on different ports for the different protocols you want to
use. It sounds like you have now gone with the second option.

Yes, I have gone with the second option. Because now I can, for example,
write an adapter for Google (in my ESB), and the server that handles those
requests doesn’t have to know about that adapter! (It just sends it,
for example, as JSON to the ESB. The ESB will see that it should go
to Google, load the appropriate adapter, and send a request to
the Google server!) … I am very pleased with this solution. :slight_smile: After
all it is (I think) the most elegant way to solve it.

That’s what I mean. If the overhead of Ruby is small compared to the
overhead of whatever is going on in the SQL database, then there’s
little point rewriting the Ruby part in C++. Measure first, optimise
later.

Again you are right! I will first build the prototype in Ruby and make a
few accounts for a few clients I have. By storing the requests (in some
kind of logging server) I can determine what the bottlenecks are and
which language I should use to solve them :).

Thanxs Brian for the input ! REALLY appreciated :slight_smile:

jeljer te Wies wrote:

The problem is that I can’t figure out how to check what protocol is
used. (the only thing i can figure out is to just try to transform it
with every known “protocol” until it doesn’t return an error anymore.

Silly wabbit! Just take a look at the JSON, XML and SOAP files you are
(or would be) getting. Pretty easy for a human to tell at a distance
which. And not hard to do a decent job in your server code.

XML, strictly formatted, always starts with a specific line that says
it’s XML. Here are three examples I got by randomly searching the web:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0"?>

Must be the first line. Therefore the first six characters in the whole
file should be '<?xml '. So like this:
if (datastring.substr(0, 6) == '<?xml ') …

sorry I’m writing in JS today.

XML, sloppily formatted, might blow this off. At any rate, it consists
of tags just like HTML. So the first character (after maybe some
whitespace) should be <. You’re not supposed to do sloppy XML, but my
boss does. If you have to accept sloppy xml, like this:
if (datastring.search(/^\s*</) >= 0) …

SOAP is a dialect of XML, I don’t know much about it but probably
there’s some clues close to the start; look at some files and/or read
some SOAP docs.

JSON usually starts with an object at the root, which almost always
starts with a quoted fieldname like this:
{“qleft”:5.5,“qright”:-22.01}

If it doesn’t, chances are it has an array at the root like this:
[{“qleft”:5.5,“qright”:-22.01},{“qleft”:6.5, “qright”:-202.01}]

If your JSON is tight, like it often is, you can just look for one of
these two like this:
if (datastring.substr(0, 2) == '{"' || datastring.substr(0, 1) == '[')

In fact, if you look at your protocol, it’s very possible it’ll ALWAYS
start with an object at top, even simpler. so you can blow off starting
[s.

But of course you can legally stick in spaces so you could use this
regex:
if (datastring.search(/^\s*[[|{]/) >= 0) …

easy!
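For the Ruby prototype, the same first-characters heuristic might look like the sketch below. The function name `sniff_format` and the returned symbols are invented for the example, and the HTTP pattern is based only on the request line shown earlier in the thread. A match only says what the data looks like at the start, not that it is well-formed.

```ruby
# Guess the format of a request from its first few characters.
def sniff_format(data)
  case data
  when %r{\A[A-Z]+ \S+ HTTP/\d\.\d} then :http  # e.g. "GET /hello%20world HTTP/1.1"
  when /\A\s*</                     then :xml   # declaration or any opening tag
  when /\A\s*[\[{]/                 then :json  # object or array at the root
  else :unknown
  end
end

puts sniff_format('<?xml version="1.0"?><get>hello</get>')
puts sniff_format('  {"qleft":5.5,"qright":-22.01}')
puts sniff_format("GET /hello%20world HTTP/1.1\r\n\r\n")
puts sniff_format("[hello]")  # classified as JSON although it would not parse
```

The last line shows the limit of this approach: the payload still has to be run through a real parser before it can be trusted.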

jeljer te Wies wrote:

Then you’re not just squirting the object over a raw TCP socket; you’re
adding your own proprietary encapsulation/framing protocol around it.
And if you’re going to do that, you might as well add another field
which says whether it’s JSON or XML or whatever, and hence the problem
is solved.

Yes you are right. I could use an extra field. But I need the ability to
add more “adapters” when the esb gets more connections!..

So either force the clients to use the same protocol you have defined,
or listen on different ports for the different protocols you want to
use. It sounds like you have now gone with the second option.

Yes, you are totally right! But it is not really Ruby that is slow,
but the many SQL database connections that are needed by every request (see
the attachment for a quick overview).

That’s what I mean. If the overhead of Ruby is small compared to the
overhead of whatever is going on in the SQL database, then there’s
little point rewriting the Ruby part in C++. Measure first, optimise
later.

Remember of course there is a difference between latency and
throughput. You can have a high latency for an individual request, but
still serve large numbers of requests if there are lots of concurrent
clients and they are not contending on the same resource.
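As a back-of-the-envelope illustration of that point, Little's law relates the two: average number of requests in flight = throughput × latency. The numbers below are invented, and the calculation assumes the concurrent requests really do not contend for the same resource.

```ruby
# Little's law rearranged: throughput = requests in flight / latency.
def throughput_per_second(concurrent_requests:, latency_ms:)
  concurrent_requests * 1000.0 / latency_ms
end

# One client at a time, 200 ms per request:
puts throughput_per_second(concurrent_requests: 1, latency_ms: 200)    # prints 5.0

# 100 independent clients, same 200 ms latency per request:
puts throughput_per_second(concurrent_requests: 100, latency_ms: 200)  # prints 500.0
```

Each individual request is no faster in the second case, yet the server handles a hundred times as many per second.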

Regards,

Brian.

Allan Bonadio wrote:

But of course you can legally stick in spaces so you could use this
regex:
if (datastring.search(/^\s*[[|{]/) >= 0) …

oops sorry, take the vertical bar out of that regex.

Ehm guys, why are you all commenting on a six-month-old post? :wink:
I solved the problem already!!

I chose to use the tuple space architecture. Then I can just send Ruby
objects over TCP!
(very nice)

thanxs for the replies though :stuck_out_tongue:

Brian C. wrote:

Allan Bonadio wrote:

But of course you can legally stick in spaces so you could use this
regex:
if (datastring.search(/^\s*[[|{]/) >= 0) …

Note that even when fixed, that regexp would match

[hello]

In Ruby, ^ matches at the start of every line, not only at the start of
the string, so you have to use \A to match start of string.
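A small Ruby demonstration of the anchor difference. The two constant names and the sample input are made up for the example.

```ruby
LINE_ANCHORED   = /^\s*[\[{]/    # ^ matches at the start of any line
STRING_ANCHORED = /\A\s*[\[{]/   # \A matches only at the start of the string

SAMPLE = "not json at all\n[hello]"

puts SAMPLE.match?(LINE_ANCHORED)       # true: ^ matches right after the newline
puts SAMPLE.match?(STRING_ANCHORED)     # false
puts "[hello]".match?(STRING_ANCHORED)  # true, although [hello] is not valid JSON
```

So \A fixes the anchoring problem, but a leading bracket still says nothing about whether the rest of the payload parses.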

Anyway, I’d say this sort of attempted format detection is fairly
pointless at best, and dangerous at worst. Better to get the client to
tell you explicitly what it has sent. In the case of HTTP, this could be
via Content-Type: header or via the URL endpoint, e.g. /foo.xml. In the
case of a raw TCP protocol then you can use different port numbers.

On 2/15/2010 12:31 AM, jeljer te Wies wrote:

this weekend I build the ESB and know for every adapter there is a port.
this way I do know what is coming and i have the flexibility. (also
because all the adapters are in threads. if one crashes the others
don’t! )

Actually, in threads if one crashes, everything crashes. If you have
them each in separate processes, then if one crashes, the others remain
unaffected.
