The article is based on my own experience of building a long polling based server which pushes fast-changing data to connected clients. The data is dynamically categorized into channels to which clients can subscribe. At any moment, there are no more than 50 channels and 4000 connected clients. All clients receive data using the so-called long polling technique: a client issues an AJAX request asking the server for new data, and receives a response only when something new is available, after which a new request must be issued. This results in a fairly high number of HTTP requests, in peak times about 2000 per second, all of which are handled in Erlang, without any front-level caches. Each request comes from a separate client which is subscribed to its own subset of the available channels, and must therefore be treated individually.
Compared to big success stories such as WhatsApp, Facebook, Heroku and others, this server might seem almost like a hello world app. Nevertheless, it took some attention and tuning to handle that kind of load, and here I will present some approaches which helped me accomplish the task.
Removing single process reader bottleneck
Permanent client processes
An alternative approach is to hold client data in an ETS table. The request process can retrieve it directly from the table, which eliminates the need for an extra process per client. I initially used this approach and later transformed it into explicit client processes. The problem in my case was that there are multiple parallel modifiers of the client state: long polling requests, channel (un)subscribe requests, and the message dispatcher process which modifies the client's inbox. Multiple parallel modifiers of the same ETS data require some manual synchronization. An additional downside of ETS is that data is copied on every read and write, which can cause a performance penalty depending on the data size. These issues disappear when using pure processes, and taking this approach increased performance in my case. In a different situation, e.g. with small state data and a single data writer, an ETS based approach might work better.
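To make the trade-off concrete, here is a minimal sketch of the process-based variant: a per-client GenServer owning its own state. The module and field names are illustrative, not taken from the original code; the point is that all mutations flow through one mailbox, so the parallel modifiers mentioned above are serialized without any manual locking.

```elixir
defmodule Client do
  use GenServer

  def start_link(client_id),
    do: GenServer.start_link(__MODULE__, client_id)

  def init(client_id),
    do: {:ok, %{id: client_id, channels: MapSet.new(), inbox: []}}

  # Subscribe requests, dispatcher pushes, and poll requests all arrive
  # as messages, so concurrent modifications cannot race each other.
  def handle_cast({:subscribe, channel}, state),
    do: {:noreply, update_in(state.channels, &MapSet.put(&1, channel))}

  def handle_cast({:push, message}, state),
    do: {:noreply, update_in(state.inbox, &[message | &1])}

  # A long polling request drains the inbox in arrival order.
  def handle_call(:take_inbox, _from, state),
    do: {:reply, Enum.reverse(state.inbox), %{state | inbox: []}}
end
```

With ETS, each of these operations would instead need a read-modify-write cycle guarded by some synchronization, and the state would be copied on every access.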
Knowing that many clients listen to the same channels, what I wanted to achieve is to have most clients connected and in the waiting state before the message dispatch takes place. This opens up the possibility of identifying clients which receive the same set of messages, and performing the join/compress computation only once per distinct subscription group (all clients listening to the same channels).
To achieve this, I resorted to throttling. All messages are generated in channel-specific processes, and initially these processes dispatched messages to clients immediately. In the throttled version, I introduced a single dispatcher process which receives messages from channels, aggregates them, and dispatches them to clients every two seconds. The interval was chosen as big enough to accommodate reasonably fast clients, and small enough not to introduce a significant delay, because data must be served as fast as possible. As a result, most of the clients return to the server before new messages are dispatched, so I could now identify which clients listen to the same messages and minimize join/compress computations (see next section). As an added bonus, since this approach reduces the incoming request rate, both CPU load and generated traffic have dropped. Prior to introducing throttling, I noticed peak loads of about 3000 reqs/sec. After it was introduced, the request rate has never gone above 2000 reqs/sec, even though the number of parallel clients has since increased by 33% (from 3000 to 4000).
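The throttling dispatcher can be sketched roughly as follows: channel processes publish into it, and a timer flushes the accumulated batch on a fixed interval. All names are illustrative, and the notify function standing in for the actual client dispatch is an assumption; the interval is configurable here only to keep the sketch testable.

```elixir
defmodule Dispatcher do
  use GenServer

  # Two seconds in the setup described above.
  @default_interval 2_000

  def start_link(notify_fun, interval \\ @default_interval),
    do: GenServer.start_link(__MODULE__, {notify_fun, interval}, name: __MODULE__)

  # Channel processes call this instead of dispatching directly to clients.
  def publish(channel, message),
    do: GenServer.cast(__MODULE__, {:publish, channel, message})

  def init({notify_fun, interval}) do
    Process.send_after(self(), :flush, interval)
    {:ok, %{pending: %{}, notify: notify_fun, interval: interval}}
  end

  def handle_cast({:publish, channel, message}, state) do
    pending = Map.update(state.pending, channel, [message], &[message | &1])
    {:noreply, %{state | pending: pending}}
  end

  def handle_info(:flush, state) do
    # Hand the whole batch accumulated in this interval to clients, then rearm.
    if map_size(state.pending) > 0, do: state.notify.(state.pending)
    Process.send_after(self(), :flush, state.interval)
    {:noreply, %{state | pending: %{}}}
  end
end
```

Because every channel funnels through one flush point, clients waiting at flush time all observe the same batch, which is what makes the grouping described in the next section possible.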
Admittedly, fixed-time throttling seems like a "lame" approach, but it was an easy solution to the problem at hand, which helped me take control over the CPU load and paved the way for further optimizations. It might be better to monitor returning clients and dispatch new messages once some desired percentage of them has returned, which would eliminate the arbitrary dispatch delay. Additionally, switching to a true permanent connection, such as WebSockets or server-sent events, might eliminate the problem completely.
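For completeness, the percentage-based idea could look something like this. This is purely a sketch of the alternative I did not implement: the 90% threshold, the state shape, and the module name are all assumptions.

```elixir
defmodule AdaptiveDispatch do
  # Track how many clients have re-polled since the last flush.
  def client_returned(state),
    do: Map.update!(state, :returned, &(&1 + 1))

  # Flush as soon as the desired share of clients is back waiting,
  # instead of on a fixed timer.
  def maybe_flush(%{returned: r, total: t} = state, threshold \\ 0.9) do
    if t > 0 and r / t >= threshold do
      {:flush, %{state | returned: 0}}
    else
      {:wait, state}
    end
  end
end
```

A real version would also need a timeout fallback, so that a few slow or disconnected clients cannot stall dispatch indefinitely.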
To take advantage of this, I introduced logic in the client processes which can reuse the join/compress operation from another process. The algorithm is a bit involved, but the general idea goes like this. When a client process receives the list of messages it is interested in, it verifies whether a combined message has already been computed by another client process. It does so via a lookup in an ETS table, using the messages' unique ids as the key. If a cached version exists, it is served. Otherwise, the client process joins the messages, compresses the result, and stores it in ETS.
The key thing here is that, while one process is joining messages, all others interested in the same set of messages are waiting for it to finish, so they can reuse its result. However, two processes which try to join different sets of messages will not block each other. Consequently, the join/compress computation is performed only once per distinct set of messages, but it executes in parallel with the join/compress for a different set of messages. There are a couple of ways of accomplishing this in Erlang, and after some trial and error, I settled on a custom locking implementation. The generic Elixir library which I used for this task can be found here.
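A hedged sketch of this compute-once cache, using the standard :global.trans/2 for per-key mutual exclusion in place of the custom locking library the article describes. The table name and the join/compress steps (plain string join plus gzip) are illustrative assumptions.

```elixir
defmodule MessageCache do
  # A public named table so any client process can read and write it.
  def ensure_table do
    :ets.new(:joined_messages, [:named_table, :public, read_concurrency: true])
  end

  def fetch(message_ids, messages) do
    key = Enum.sort(message_ids)

    # Only one process at a time runs the body for a given key; processes
    # interested in other keys acquire different locks and proceed in parallel.
    :global.trans({key, self()}, fn ->
      case :ets.lookup(:joined_messages, key) do
        [{^key, payload}] ->
          # Another process already joined/compressed this set of messages.
          payload

        [] ->
          payload = messages |> Enum.join("\n") |> :zlib.gzip()
          :ets.insert(:joined_messages, {key, payload})
          payload
      end
    end)
  end
end
```

Note that waiters blocked on the lock re-check the table once they acquire it, so the expensive join/compress happens exactly once per distinct key.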
This is generally a more complicated approach. Not only do we have to worry about custom inter-process synchronization, but there is also an expiry issue: the cached data is stored in an ETS table, so we must manually delete it at some point. Therefore, I generally find it better to prepare data in the provider process, if possible, because this should result in simpler and better performing code.
A careful reader might also wonder whether the message dispatcher process could determine which clients receive the same messages, since all messages are sent from that process. This is essentially possible, but it has its drawbacks. First, the dispatcher doesn't know which clients have returned to the server. A client might return only after the next dispatch, so whatever the dispatcher had joined and compressed would then be useless for such clients (since additional new messages have arrived), making the computation redundant. Additionally, shifting the join/compress responsibility to the dispatcher effectively removes the parallelization of independent computations, turning the dispatcher into a sequential bottleneck.