Data Ex Machina, Part 1
A Brand-New Reporting Web App
We needed to start over. Our old app had reached its limits, and we were cutting back on functionality because of it. It was tightly coupled to a Microsoft SQL Server Reporting Services (SSRS) server, which had been hacked up to use the web app as a data source and generate reasonably nice-looking reports, and that hack had determined most of the architecture of the web app code. In addition, despite lots of searching, we could not find a way in our existing framework to get results back from the database without loading them all into memory. We wanted to do large data exports, so that was unacceptable.
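To make the memory problem concrete, here is a minimal sketch in present-day Python of what we wanted instead: pulling rows from a database cursor in fixed-size batches rather than all at once. It is not code from either app. The `iter_rows` helper, the `events` table, and the use of an in-memory SQLite database as a stand-in for the real reporting database are all assumptions made for illustration. Whether memory actually stays flat also depends on the driver, since some drivers buffer the whole result set regardless of how you fetch.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time, fetching from the cursor in batches.

    Hypothetical helper: only `batch_size` rows are held by this
    function at any moment, instead of the full result set.
    """
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        for row in batch:
            yield row


# Stand-in data: an in-memory SQLite table, NOT the real reporting database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i, "row-%d" % i) for i in range(2500)],
)

cursor = conn.cursor()
cursor.execute("SELECT id, label FROM events ORDER BY id")

# Consume the rows lazily; nothing here builds a list of all 2500 rows.
row_count = sum(1 for _ in iter_rows(cursor, batch_size=1000))
print(row_count)
```

The contrast is with `cursor.fetchall()`, which returns every row as one list and is roughly what our old framework was forcing on us.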
Besides, the web app was written in Java. And not good Java—the "Enterprise Java Bean"-filled XML-instantiated messy Java. Yuck.
Building a new web app would enable us to slowly move off of the old app, using the magic of a well-configured load balancer that we already had in place anyway. Our plan was to first build the data export feature, since that would actually have some customer impact. Of course, the wonderful side effect would be that we could ditch SSRS, which had forever plagued our developers with impossible-to-update code and confusing logic. An app free from the chains of SSRS could be written well and let us build better, more interactive reports. But first, we needed to think about those data exports.
We held a naming contest, and from then on the new app was called Data Ex Machina, or DXM for short.
The Search for a Good Python Framework
We were sure that we wanted to build our new app in Python, because each of us on the team had experienced firsthand the difference in productivity between Python and Java, and we knew that the app does little enough work itself that Python wouldn't become a performance problem. Since we wanted to get the technology choice right this time, we spent a little while investigating which framework to use. The frontrunners were Django, Pylons, and Tornado.
I was pulling for Tornado, because it seemed nice to work with, and because its default model is to stream data to the client rather than build up an entire response and then send it. Since we'd have to do that exact thing with the data exports, it seemed like a natural choice. But after a little bit of time with it, it became clear that we'd actually have to build our own asynchronous database driver. Tornado serves requests from a single-threaded event loop, so a blocking database call stalls every other request until it returns, and our queries are extremely long-running: several seconds for simple reports, and under load, multiple minutes for complex reports. Building an asynchronous driver is surely doable, but a normal threaded framework would be fine for us without that extra work. That meant that Tornado would have to have something monstrously better than Django or Pylons for us to commit to it.
Between Pylons and Django, we had a hard time deciding which was better. Each was mostly good, with a few small negatives. But when we started load testing our proof-of-concept apps to see which performed better under load, we immediately noticed that each server had terrible performance. Somehow, our database queries were executing serially rather than in parallel. After some work, we narrowed the blame down to the database connections, rather than Python or the frameworks. We were using PyODBC to connect to our database, a Vertica cluster, since Vertica doesn't provide a dedicated driver for any language, only ODBC and JDBC drivers. We even made a small test to show that the problem occurred with PyODBC connecting to two different databases, but not with psycopg2, a native PostgreSQL library. We tried posting on the PyODBC newsgroup, but after a lack of response, we had just about given up on Python. We'd have to go to Java after all.
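The small test mentioned above isn't reproduced here, but the idea behind that kind of check is easy to sketch. The following is an illustration in present-day Python, not our original script: `timed_parallel_run` and `fake_query` are hypothetical names, and `time.sleep` stands in for a slow database call so the example runs without a database. With a real driver, you would replace `fake_query` with a function that opens its own connection and executes a slow query.

```python
import threading
import time


def timed_parallel_run(query, n_threads):
    """Run `query` once in each of `n_threads` threads.

    Returns the wall-clock seconds until all threads have finished.
    """
    threads = [threading.Thread(target=query) for _ in range(n_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


def fake_query():
    # ASSUMPTION: a stand-in for a slow database call. time.sleep lets
    # other threads run while it waits, like a well-behaved driver would.
    time.sleep(0.2)


single = timed_parallel_run(fake_query, 1)
parallel = timed_parallel_run(fake_query, 4)

# Near 1.0 means the calls overlapped; near 4.0 means they ran one by one.
ratio = parallel / single
print("1 thread: %.2fs, 4 threads: %.2fs, ratio: %.1f" % (single, parallel, ratio))
```

If the calls truly run in parallel, four threads take about as long as one. A ratio close to the number of threads is the symptom we were seeing under load.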
One Last Hope
But wait! What about Jython? Sure, it only has compatibility up to Python 2.5, but that's better than nothing. Both Django and Pylons were supposed to work under Jython, so we gave that a try. Luckily for us, switching our proofs of concept to run on Jython went fairly smoothly. They performed admirably under load, and dutifully streamed CSVs of data out to our load testing script.
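For readers wondering what "streaming a CSV" looks like in practice, here is a minimal, framework-agnostic sketch in present-day Python (not the Python 2.5 dialect Jython gave us, and not the code from our proofs of concept). The `csv_chunks` generator and its `rows_per_chunk` parameter are hypothetical. The assumption is that the web framework can send an iterable of strings to the client piece by piece, which is how the response body would be handed over.

```python
import csv
import io


def csv_chunks(header, rows, rows_per_chunk=500):
    """Yield CSV text in chunks so a response can be sent incrementally.

    `rows` can be any iterable, including a lazy one backed by a
    database cursor, so the full export never sits in memory at once.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    pending = 0
    for row in rows:
        writer.writerow(row)
        pending += 1
        if pending >= rows_per_chunk:
            yield buffer.getvalue()
            # Reset the buffer so it only ever holds one chunk.
            buffer.seek(0)
            buffer.truncate(0)
            pending = 0
    # Flush whatever is left (at minimum, the header for an empty export).
    if buffer.tell():
        yield buffer.getvalue()


# Hypothetical usage with generated rows instead of a real query result.
sample_rows = ((i, "row-%d" % i) for i in range(1200))
chunks = list(csv_chunks(["id", "label"], sample_rows, rows_per_chunk=500))
print(len(chunks))
```

The design choice here is to buffer a few hundred rows per chunk rather than yielding every row separately, which keeps per-write overhead down while still bounding memory.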
So, we were left with two decisions. First: Django or Pylons? Pylons was more widely used within the company, so that would have been the default choice. However, no one else was using Jython, so that threw out the shared-knowledge argument. I liked Django a bit more, even though we wouldn't be using a lot of its core features like database model management. In the end, Django won out simply because there seemed to be a larger community of people using Django on Jython than Pylons on Jython. And now, though it wasn't a factor then, Pylons has been end-of-lifed in favor of some new thing called Pyramid, so other people who previously chose Pylons are having to rethink their decision.
The last decision was whether it was worth it to go with Jython instead of Java. This was entirely subjective, but we knew that Jython was sufficiently mature for us to depend on it. We would have preferred Python, but without a way to use PyODBC, we were out of luck.
⋙
Of course, over 6 months later, the PyODBC maintainer finally did follow up on the problem, and someone else emailed me privately with a solution that probably would have allowed us to use Python and PyODBC. By the time that happened, we had already built a few nice things that depended on JVM libraries. If we want to switch back, we'll probably have to port the functionality ourselves. Besides, life is pretty nice on the JVM.