I've been building a small experimental app on Google App Engine. At some point in the summer the app outgrew the sensible but onerous restrictions of the GAE sandbox. Specifically, I wanted to generate images larger than 1MB, and there is simply no way to do this from inside a GAE app. This was quite a major speed bump.
Many months later, and I have the app managing a pool of AWS EC2 instances to do its image generation work. The GAE app uses the AWS SDK for Java to start and stop EC2 instances, and to publish to an SQS queue. It's working well.
However, getting to this point wasn't easy, because the official AWS SDK doesn't work in GAE. That's because another of the GAE restrictions prevents applications from opening their own sockets. All network activity should be channelled through Google's UrlFetchService. If you use Java's lame URL API then this channelling is transparent. But if, as in the AWS case, you are using a full-featured HTTP client, some work is required. There are various approaches for supporting Apache HttpClient, the most popular one being to provide it with fake socket-level connections, which parse the protocol emitted by the HttpClient and reconstitute enough information to be able to call UrlFetchService. Notably, Adrian Petrescu used this approach and forked the official AWS SDK.
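To give a feel for what a fake socket-level connection has to do, here is a minimal sketch of the parsing step: take the raw bytes an HTTP client would have written to a socket and recover the method, absolute URL, headers and body that a fetch-style API needs. The class and method names (RawRequestParser, ParsedRequest, parse) are invented for illustration and are not taken from any of the forks mentioned here. It assumes a simple, well-formed HTTP/1.1 request and deliberately ignores chunked encoding, folded headers and absolute-form request targets.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class RawRequestParser {

    /** The pieces a fetch-style API needs: method, absolute URL, headers, body. */
    public static final class ParsedRequest {
        public final String method;
        public final String url;
        public final Map<String, String> headers;
        public final String body;

        ParsedRequest(String method, String url, Map<String, String> headers, String body) {
            this.method = method;
            this.url = url;
            this.headers = headers;
            this.body = body;
        }
    }

    /**
     * Parses a raw HTTP/1.1 request. The scheme ("http" or "https") must be
     * supplied by the caller because it never appears on the wire.
     */
    public static ParsedRequest parse(String raw, String scheme) {
        // Head and body are separated by a blank line.
        int split = raw.indexOf("\r\n\r\n");
        String head = split >= 0 ? raw.substring(0, split) : raw;
        String body = split >= 0 ? raw.substring(split + 4) : "";

        String[] lines = head.split("\r\n");
        String[] requestLine = lines[0].split(" ");
        if (requestLine.length != 3) {
            throw new IllegalArgumentException("Malformed request line: " + lines[0]);
        }

        // Header names are case-insensitive, so store them lower-cased.
        Map<String, String> headers = new LinkedHashMap<>();
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Malformed header: " + lines[i]);
            }
            String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
            headers.put(name, lines[i].substring(colon + 1).trim());
        }

        String host = headers.get("host");
        if (host == null) {
            throw new IllegalArgumentException("Missing Host header");
        }

        // The request target (path plus query) must be carried over in full.
        String url = scheme + "://" + host + requestLine[1];
        return new ParsedRequest(requestLine[0], url, headers, body);
    }
}
```

Even this toy version has to make decisions about the request line, header case and the body boundary, which is a hint at how much of HTTP a real fake connection would need to understand.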
I'm very grateful to Adrian for making this fork, because it gave me faith that it was possible to do what I wanted to do. And early on it worked well. But I couldn't get it to call any of the SQS methods; I kept getting invalid signature errors, but using a client outside of GAE I was generating an identical signature and the requests were succeeding - it was baffling. After a mammoth debugging session, including an outing for WireShark, I tracked down the problem: the fake HTTP Connection was throwing away the whole of the path for my HTTP requests so they didn't stand a chance of succeeding; the signature is just the first thing that AWS happens to check. Looking at the source code I found an ominous comment "Other information are just ignored safely". I hacked in a line of code to stop ignoring other information, and miraculously everything started working.
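A dropped path shows up as a signature error because the path is part of what gets signed. Here is a minimal sketch, assuming a string to sign in the style of AWS Signature Version 2 (verb, host, path and canonical query string joined by newlines, then HMAC-SHA256 and Base64). Whether the SDK of the time used exactly this scheme for SQS is my assumption, and the host, queue path and secret key below are made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.Locale;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class PathSignatureDemo {

    /** Joins verb, lower-cased host, path and query with newlines. */
    public static String stringToSign(String verb, String host, String path, String query) {
        return verb + "\n" + host.toLowerCase(Locale.ROOT) + "\n" + path + "\n" + query;
    }

    /** Returns the Base64-encoded HMAC-SHA256 of the data under the secret key. */
    public static String sign(String secretKey, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        String secret = "example-secret-key";            // hypothetical key
        String host = "sqs.us-east-1.amazonaws.com";
        String query = "Action=SendMessage";             // simplified query string

        // What the client signs: the full queue path.
        String sent = sign(secret, stringToSign("POST", host, "/123456789012/example-queue", query));

        // What the server would recompute if the request arrived with the path stripped.
        String seen = sign(secret, stringToSign("POST", host, "/", query));

        System.out.println("client signature:   " + sent);
        System.out.println("server recomputes:  " + seen);
        System.out.println("signatures match:   " + sent.equals(seen));
    }
}
```

The client-side signature is perfectly valid, which is why it matched the one generated outside GAE; it only fails because the request that actually arrives is no longer the one that was signed.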
I thought it would be helpful for me to fix this bug properly (with an accompanying test) and submit a patch, but it gradually dawned on me how many similar bugs could lurk in the code. After all, this class should really understand the whole of the HTTP protocol. Put off by the scale of the challenge, I decided to try an alternative approach. I looked at the original AWS code, and it seemed pretty well isolated already from HttpClient.
A couple of hours later and I had my own fork, aws-sdk-for-java-on-gae. This fork completely removes Apache HttpClient and hard-wires UrlFetchService in its place. It turns out that UrlFetchService has a much better API than Apache HttpClient (at least a lot better than the 3.x version that Amazon used) so conversion was pretty straightforward. I removed a fair bit of code that was irrelevant because actual connections are opaque when you use UrlFetchService, so it's simpler as well!
It seems to be working fine for me so far, but I'm sure somebody will find a similar bug somewhere, give up fixing it and make their own fork. But in the meantime, give it a try!
Since the previous post, I've been reading up more on the devops movement and found a few nice pieces, particularly Graham Bleach on internal borders.
Thinking about borders within operations departments, I'm reminded of a recent client who operated with a mind-blowing number of siloed operations teams. Coming at it from the development side, working through a problem with the operations staff was like exploring a never-ending labyrinth of teams. Every time I probed a little deeper, my contact would defer responsibility to another mysterious team, usually located somewhere distant, that owned a slightly lower level of the stack. Once I tracked down a real person in the mysterious team, the same thing would happen again, and I would be left searching for another distant team. It seemed like I would never actually get to the real hardware. An unforeseen consequence of virtualization is that it allows someone to declare themselves in charge of a Virtual Machine, but not in charge of the host or guest operating systems nor the hardware itself - staggeringly unhelpful.
The genius of these extremely-finely-sliced organisations is that they provide innumerable cracks down which responsibility, ownership and useful work can fall. If a team has a sufficiently narrow focus, it is almost certain that no problem will ever occur that falls cleanly within its boundaries, so there is no responsibility at all - a manager's dream.
One thing that encourages me a little bit is a recent trend for counting the number of "engineers" in an organisation. I think all technical staff are counted in this metric, though I'm not sure whether people are including managers and other IT staff who might not be hands on. In his QCon talk, Aditya Agarwal was very proud of Facebook's metric of 1.1 million users per engineer, while Andres Kütt was similarly proud of Skype's 600,000 users per employee. Note that Facebook is very open about its number of engineers but secretive about its number of employees - I have no idea why. I think counting the total number of technical staff is useful because it helps people look at the broader cost of running software - not just development or just operations. Also, I hope in the long term these kinds of metrics will highlight the painful inefficiency of silos and barriers.
I'm watching Michael Nygard's talk in the devops track at QCon. This is my first taste of the devops movement and it's certainly something I like the look of. I'm really struck by his assertion that we don't need a division between development and operations, but crucially he isn't saying that everyone needs the same skills - we will still have specialists.
I'm struck that this idea of bringing people together into a single team, while still maintaining specialisation, comes up a lot in the software world. Some examples:
Having a single team has all sorts of implications. I'm particularly keen on having a single backlog of work, even if not everybody has the skills to pull from that backlog. Most importantly, a single team stands a better chance of having a single goal than do multiple teams.
Why are specialist teams so common in software? My intuition is that it's a social phenomenon that people with similar skills and interests tend to congregate, and like working together. Finding an effective mechanism to counteract this force, while still maintaining high-skill and specialisation, seems like a Social Technology (in Malcolm Gladwell's terminology) that organisations will need to acquire to be successful.
For the first session I went to Developing Agile Leaders and Teams: A Developmental & Transformational Path with Gilles Brouillette, because I've recently taken on a management-heavy role and my ideas on leadership are, shall we say, unformed. Gilles has a PhD in Transpersonal Psychology, and the session was heavy on what I can only assume psychologists think about all day: questions about our perception of reality, and the models that we use to glean meaning from the world round us. It was all rather deep.
He spent most of the 90 minutes building up a model of levels of psychological development. The top three levels are apparently very rarely attained, but are where you need to be if you're going to be a successful leader. So how do you get to these high levels of psychological development? Apparently it's about changing the underlying way that you make sense of the world, from an "OR" model to an "AND" model. It sounded to me like being able to hold more than one possibility in your mind at the same time; a bit like understanding a smooth spectrum of probability rather than jumping to a single binary outcome.