Trade-offs are the bread and butter of software design. No matter what anyone says, every system is optimized for something, and one measure of a satisfactory system is whether the some-thing is the right thing.
We have a client-server application in which the server answers the client’s requests in JSON. Most of the substantive responses are in the neighborhood of 20K bytes, and consist largely of values retrieved from SQL tables or joins, the column names of those tables, aggregated with 10 to 20 simple keys and values most of which are just context for the other data. The server process is multi-processor aware and multi-threaded, and it is unlikely to experience performance problems of its own due to being asked to format the JSON for the responses.
There are two client applications: One is a web client whose pages are provided by a machine separate from the server part of the client-server application. The other one is an iPad app that is used for richer data visualization.
In formatting the JSON response, one can have redundancy in the data and absorb the consequent increase in the bandwidth used to transmit the information … or one can re-factor the data, reduce the bandwidth, and use CPU cycles to reassemble and “normalize” the data on the receiving end. Explain your thinking behind the tradeoffs in either/both cases of client applications.
One could also use a compression algorithm like Lempel-Ziv-Welch on each end. Given the nature of the data (specifically, the row sets obtained from SQL queries), what are some simpler ways to achieve the same or better results than LZW, where “better results” are measured in overall application throughput?
This problem is about …
Most real world programming is about boundaries and tradeoffs. Just like the realization that the piggy bank was worth at least $10.00 and not more than $250.00, any reasonable approach to a solution to most problems must incorporate an awareness of the edges of the solution space.
This problem also has tradeoffs. We can do things to the data before we send them, after we receive them, both, neither, and every addition means more lines of code.
Most useful problems have their origins in the real world. This one came to light in a discussion that Andy Bourassa and I had about our own product, and it was Andy who first framed it for use here.
Elements of a solution
A rough-order-of-magnitude impressive answer would include some back of the napkin calculations, but the key to a good answer is understanding that CPUs operate in the range of billions of events per second, and network transfer speeds over WANs involve millions of bits per second. Thus, it is almost always best to minimize the volume of data that is transferred, regardless of what the application does.
It is difficult to imagine a situation in which it is worse to consume CPU cycles than bandwidth. This is the reason that many CSS files are de-spaced once development is complete. Of course, crunching the CSS file doesn’t really save much bandwidth on the receiving end because the CSS file is only transmitted once and used on many pages, but economy of transmission is a true savings.
Sorting the rows, combined with columnar compression is a staple of database technology. Using it, we can put a placeholder like a star (“*”) in all cases where the value of a column has not changed from the previous row. This technique is essentially the same kind of approach as any other data compression system, but it exploits the additional fact that we know quite a bit about the data before we start.
In the cases where most of the columns change value with every row, we still have the advantage that we are factoring out the column headings and are avoiding needless duplication.