Sunday, June 22, 2008

Agile infrastructure - missing pieces

My last 5 or 6 Agile projects have involved non-trivial architectures.  By this I mean that they've been more than a browser, application server, and a database.  While I would urge all people with architectural responsibility to avoid complexity, sometimes it's not feasible to simplify the architecture prior to first release. 


For the record, there are several reasons why complex architectures perform poorly, and the agile approach tends to expose these deficiencies.  I'm assuming that an agile team will be aiming to allow a single developer pair to implement or fix any story that the business needs.


I'm going to talk about how complex architectures can adversely affect the velocity of the development team, and then throw around some patterns for offsetting that hit.

o) Changing environments - even with simple architectures, if there is a shared dependency (such as a shared DB schema or a network service), you can assume that someone will change that dependency, and it won't be when the developer pair wants it changed.  Shared dependency changes typically affect the entire development team, not just individual developers, causing either an immediate loss of development velocity or a deferred loss due to reduced quality.

o) Waiting for knowledge - complex environments often use a mix of technologies that take time for developers to become competent in.  Such lead times reduce velocity.  In addition, having "experts" means that either the expert is put under huge pressure to deal with issues that exceed their capacity, or the expert is under-utilized.

o) Investigation - when something does break in a complex architecture, it is often not immediately apparent why.  Typically there are multiple log files, multiple system accounts, multiple configurations (sometimes data driven), and multiple network servers all collaborating.  The time spent tracking down the cause of a failure reduces velocity.

Suggested Patterns:

o) Sandbox environment - This means giving each developer pair a shared-nothing environment in which to develop software.  It is then the responsibility of the pair to manage their own environment, and to promote standards for it.  Self-management means that the developer pair may make breaking changes without affecting others, and can rule out outside interference if their own environment does break.  Providing a predictable self-managed environment forces experts to share knowledge, and to develop tooling that empowers the developer pair.  Conversely, developers will create tooling that facilitates common tasks, and share it with the rest of the team.  Note this shared-nothing environment is not necessarily restricted to a single machine, since it is desirable to be able to develop on a production-similar stack of technologies.
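
As a purely hypothetical illustration of the kind of tooling a pair might build and share, here's a small Python sketch that derives collision-free resource names for a pair's sandbox.  The naming scheme and port offsets are assumptions, not a prescription.

    # Hypothetical sketch: derive per-pair resource names so that sandbox
    # environments never collide with each other.
    import getpass
    import hashlib


    def sandbox_names(pair_id=None):
        """Return collision-free resource names for a developer pair's sandbox."""
        pair_id = pair_id or getpass.getuser()
        # Small, stable offset derived from the pair id, used to pick port numbers.
        offset = int(hashlib.sha1(pair_id.encode()).hexdigest(), 16) % 100
        return {
            "db_schema": "app_dev_%s" % pair_id,
            "app_port": 8080 + offset,
            "jms_queue_prefix": "dev.%s." % pair_id,
        }


    if __name__ == "__main__":
        print(sandbox_names("alice_bob"))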

o) Domain model for environment - This means building software and tooling that represents the development environment.  Using a domain model encourages a consistent language when referring to architectural pieces, and allows automated reasoning about a given environment.  By allowing all architectural tooling to understand a common domain model, it becomes possible to automate the setup of monitoring tools, diagrams, and profiling.  Avoid IDE and product-specific tools to manage the domain model (although teams may use them as appropriate), and focus on a standard for deployment and configuration derived from the environment domain model.  For example, use programmatic configuration of Spring contexts driven by the domain model, rather than property-file based configuration.
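
Here's a minimal sketch of what such an environment domain model might look like in Python; the hosts, services and attributes are invented for illustration, and a real model would carry far more detail.

    # A minimal sketch of an environment domain model; the hosts, services and
    # attributes are invented for illustration.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple


    @dataclass
    class Service:
        name: str
        port: int


    @dataclass
    class Host:
        hostname: str
        services: List[Service] = field(default_factory=list)


    @dataclass
    class Environment:
        name: str
        hosts: List[Host] = field(default_factory=list)

        def find_service(self, name: str) -> Optional[Tuple[Host, Service]]:
            """A taste of automated reasoning: where does a given service run?"""
            for host in self.hosts:
                for service in host.services:
                    if service.name == name:
                        return host, service
            return None


    # The same model can drive Spring context generation, deployment scripts,
    # monitoring configuration and diagrams.
    sandbox = Environment("sandbox-pair1", [
        Host("db01.local", [Service("postgres", 5432)]),
        Host("app01.local", [Service("app-server", 8080), Service("jms", 61616)]),
    ])
    print(sandbox.find_service("jms"))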
  
o) Branching by abstraction - Agile development teams often wish to change software and hardware architecture in response to issues that have been found.  They recognize that while hacks are appropriate in production support branches, such hacks have little place in the main development branch.  Architectural transformations may range from changing a persistence mechanism to switching database vendors.  When a team wishes to make a significant architectural change, it should avoid a "big bang" introduction.  Once time-boxed spikes have been performed (to assess feasibility), the vision for the change should be shared with the team.  Once the team has committed to the change, work starts by incrementally transforming the architecture.  These changes are distributed in small slices (through main-branch source control), potentially with two implementations co-existing within the same application and switched over using configuration.  This allows functional software to be delivered to production at any point in the transformation.
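
As a sketch of those co-existing implementations, here's a hypothetical persistence switch: both the old and new stores live in the main branch behind one abstraction, and a configuration flag decides which is active.  The class names and the PERSISTENCE_IMPL setting are made up for illustration.

    # Sketch of branching by abstraction: two persistence implementations
    # co-exist behind one abstraction, and configuration chooses which is live.
    import os


    class OrderStore:
        """The abstraction that both old and new implementations satisfy."""
        def save(self, order):
            raise NotImplementedError


    class LegacyJdbcOrderStore(OrderStore):
        def save(self, order):
            print("saving via the legacy JDBC path:", order)


    class NewDocumentOrderStore(OrderStore):
        def save(self, order):
            print("saving via the new document store:", order)


    def order_store_from_config():
        # The switch lives in configuration, not in a long-lived branch.
        if os.environ.get("PERSISTENCE_IMPL") == "document":
            return NewDocumentOrderStore()
        return LegacyJdbcOrderStore()


    order_store_from_config().save({"id": 42})

Once the new implementation has proven itself in production, the legacy implementation and the switch can simply be deleted.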

o) Deployment Automation - setting up a sandbox environment for a given developer pair is a complex task.  As such it should be an automated task, provided from the main automated build script.  This may mean automating the use of ssh to clean and create database schemas, or to deploy EJBs or services.  We have found that dynamic programming languages (such as Ruby and Python) make a great alternative to shell scripts for these tasks.
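
A minimal sketch of such a build task in Python, assuming ssh key access to the sandbox hosts; the host names, paths, schema name and restart script are invented for illustration.

    # Minimal sketch of a deployment task driven from the build script; host
    # names, paths and the schema name are invented, and ssh key access to the
    # sandbox hosts is assumed.
    import subprocess


    def run_remote(host, command):
        """Run a command over ssh and fail the build if it fails."""
        subprocess.run(["ssh", host, command], check=True)


    def rebuild_sandbox(db_host, app_host, schema):
        # Drop and recreate the pair's private database.
        run_remote(db_host, "dropdb --if-exists %s && createdb %s" % (schema, schema))
        # Push the freshly built artefact and bounce the service.
        subprocess.run(["scp", "build/app.ear", "%s:/opt/app/" % app_host], check=True)
        run_remote(app_host, "/opt/app/bin/restart.sh")


    if __name__ == "__main__":
        rebuild_sandbox("db01.local", "app01.local", "app_dev_pair1")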

o) Automated monitoring as acceptance criteria - Identifying failures is made much easier if there is a single place to find information about system availability.  Those responsible for architecture should mandate monitoring of a new service as part of the acceptance criteria for that service.  It is possible to automate the creation of hosts and services (and groups) for open-source monitoring tools such as Nagios, and Ruby has excellent libraries for basic network service connectivity checking.  The level of monitoring required in the acceptance criteria will depend on the value of the service.  For instance, if a duplicate server is added for load balancing, the monitoring criteria may ping the load balancer to ensure that it can see the new server.  On the other hand, if the new piece is an ESB, the criteria may eschew basic IP connectivity checks in favor of firing sample messages and verifying that downstream services receive the forwarded message(s).
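
As a sketch, here's one way to generate Nagios host and service definitions from a simple description of the environment (in practice you would drive this from the environment domain model above).  The generic-host and generic-service templates and the check_tcp command assume a fairly stock Nagios install, so treat them as assumptions.

    # Sketch: generate Nagios host and service definitions from a simple
    # description of the environment.
    HOST_TEMPLATE = """define host {{
        use        generic-host
        host_name  {hostname}
        address    {hostname}
    }}
    """

    SERVICE_TEMPLATE = """define service {{
        use                  generic-service
        host_name            {hostname}
        service_description  {service} connectivity
        check_command        check_tcp!{port}
    }}
    """


    def nagios_config(hosts):
        """hosts: list of (hostname, [(service_name, port), ...]) tuples."""
        parts = []
        for hostname, services in hosts:
            parts.append(HOST_TEMPLATE.format(hostname=hostname))
            for service_name, port in services:
                parts.append(SERVICE_TEMPLATE.format(
                    hostname=hostname, service=service_name, port=port))
        return "\n".join(parts)


    print(nagios_config([
        ("db01.local", [("postgres", 5432)]),
        ("app01.local", [("app-server", 8080)]),
    ]))

The generated definitions can then be included from the main Nagios configuration as part of the same automated build that deploys the new service.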