The inefficiency of horizontally scaling websockets with the websocket-rails gem

I’m trying to understand how to horizontally scale a real-time tic-tac-toe
web app. I’m using Ruby on Rails to serve the content, websockets to
update the state of a game, and Heroku to host it all. Since Heroku’s
websocket architecture is fairly typical, this question is not specific
to Heroku; it applies to websocket-based apps in general…

“The WebSocket protocol introduces state into a generally stateless
application architecture. It provides a mechanism for creating
persistent connections to a node in a stateless system (e.g. a web
browser connecting to a single web process). Because of this, each web
process is required to maintain the state of its own WebSocket
connections. If application data is shared across processes, global
state must also be maintained.”

To solve this, Heroku and others recommend using a global message
queue to maintain global state among the processes…

“Imagine a chat application that pushes messages from a Redis Pub/Sub
channel to all of its connected users. Every web process would have a
collection of persistent WebSocket connections open from active users.
Each user would not, however, have its own subscription to the Redis
channel. The web process would maintain a single connection to Redis,
and the state of each connected user would then be updated as incoming
messages arrive.”

Using my tic-tac-toe app as an example, what if I had to horizontally
scale my web app to 10 dynos/processes to handle a large number of users
playing my game? User1 connects to my web app and is assigned by the
load balancer to dyno1/process1. User2 connects to my web app and is
assigned to dyno2/process2. If user1 and user2 subscribe to a private
channel using websockets, they will NOT be able to communicate with each
other since they’re on separate dynos. To solve this, I could use a
global message queue that both dynos subscribe to. That way, when user1
makes a move, I would send that move data, and the name of the private
channel it’s associated with, to the global message queue. Then all
connected dynos (including dyno2) could broadcast the move data to any
of their clients subscribed to the named channel, which for most dynos
would be none.
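
To make the scenario concrete, here is a minimal in-memory sketch of
what I have in mind. It does not use Redis or websocket-rails at all:
GlobalQueue is a plain Ruby stand-in for the global message queue, and
every name in it (GlobalQueue, Dyno, "game:42", client_move) is
hypothetical and only there for illustration.

```ruby
# In-memory stand-in for a global message queue such as Redis Pub/Sub.
# Assumption: one shared channel that every dyno listens to.
class GlobalQueue
  def initialize
    @dynos = []
  end

  def attach(dyno)
    @dynos << dyno
  end

  # Every attached dyno receives every message, whether or not it has
  # a local client subscribed to the named channel.
  def publish(channel, data)
    @dynos.each { |dyno| dyno.receive(channel, data) }
  end
end

# Hypothetical model of one web process and its websocket clients.
class Dyno
  attr_reader :name, :received_count, :delivered

  def initialize(name, queue)
    @name = name
    @queue = queue
    @local_subscribers = Hash.new { |hash, channel| hash[channel] = [] }
    @received_count = 0 # messages this dyno had to look at
    @delivered = []     # messages actually pushed to a local client
    queue.attach(self)
  end

  def subscribe(user, channel)
    @local_subscribers[channel] << user
  end

  # A local client made a move: forward it to the global queue.
  def client_move(channel, data)
    @queue.publish(channel, data)
  end

  # Called for every message on the queue; most dynos find no match.
  def receive(channel, data)
    @received_count += 1
    @local_subscribers.fetch(channel, []).each do |user|
      @delivered << [user, channel, data]
    end
  end
end

queue = GlobalQueue.new
dynos = (1..10).map { |i| Dyno.new("dyno#{i}", queue) }
dynos[0].subscribe("user1", "game:42")
dynos[1].subscribe("user2", "game:42")

# user1 (on dyno1) makes a move in the private channel.
dynos[0].client_move("game:42", { cell: 4, mark: "X" })

dynos.each do |d|
  puts "#{d.name}: received=#{d.received_count} delivered=#{d.delivered.size}"
end
```

In this sketch all ten dynos receive the move, but only dyno1 and dyno2
deliver it to a client; the other eight look at it and do nothing.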

If so, am I correct to understand that this global message queue acts as
a huge bottleneck that defeats the benefits of horizontal scaling in
the first place, since all dynos/processes have to process moves by users
that aren’t even connected to them? That’s essentially the same as
having a single dyno/process handle all the users.
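
Here is the back-of-envelope arithmetic behind that worry, as a rough
sketch. The numbers (10,000 users, 500 moves per second) are made up,
both helper names are hypothetical, and the model assumes a single
shared queue channel that every dyno listens to.

```ruby
# Queue messages each dyno must inspect per second. With one shared
# channel every dyno sees every move, so the dyno count is irrelevant.
def queue_messages_per_dyno(moves_per_second, _dyno_count)
  moves_per_second
end

# Websocket connections each dyno holds if the load balancer spreads
# users evenly (an assumption).
def connections_per_dyno(user_count, dyno_count)
  (user_count.to_f / dyno_count).ceil
end

users = 10_000
moves_per_second = 500

[1, 10].each do |dyno_count|
  puts "dynos=#{dyno_count} " \
       "connections/dyno=#{connections_per_dyno(users, dyno_count)} " \
       "queue msgs/dyno/sec=#{queue_messages_per_dyno(moves_per_second, dyno_count)}"
end
```

Under these assumptions, going from 1 to 10 dynos cuts the connections
per dyno from 10,000 to 1,000, but each dyno still inspects all 500
moves per second.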

My questions are…

  1. Am I understanding this correctly?
  2. If not, what am I missing?
  3. If so, is there a better solution to horizontally scaling a
    real-time tic-tac-toe web app?
  4. One way I could optimize this solution is to bypass the message
    queue when two users are connected to the same dyno/process. How can I
    tell if two users in a private channel are connected to the same
    dyno/process?

I apologize for the long post, but it’s a complicated question and
that’s the shortest I could make it and still feel comfortable getting
my point across. Thanks in advance for your wisdom!