Moving teams to Trunk Based Development (an example)

In this post I am going to cover an example case study of introducing Trunk Based Development to an existing enterprise development team building a monolithic web application. I’m not going to cover the details of trunk based development, as that’s covered in detail elsewhere on the internet (I recommend trunkbaseddevelopment.com). For the purpose of this article, trunk based development means all development teams working on a single code branch (master) and constantly checking their changes in to that branch. In our model a new branch was only ever created for one of the following reasons:

  1. A developer creates a personal branch off master for the sole purpose of raising a Pull Request into the master branch (for code review purposes).
  2. The release team takes a branch off master to create a Release Candidate build. Whilst this is not in the true spirit of trunk based development, it is often required for release management purposes. Microsoft use this model for their Azure DevOps tooling development and call it Release Flow.
  3. A team may still create a branch for a technical spike or proof of concept that will be thrown away and won’t ever be merged into master.

Problem Statement

The development team in this case was an enterprise application development team building a large-scale multi-channel application. There were five sprint teams working on the same monolithic application codebase, and each team had its own branch of the code that it worked on independently of the other sprint teams. Teams were responsible for maintaining their own branches and merging in updates from previous releases, but this was often not done correctly, causing defects to emerge and features to be accidentally regressed. What’s more, the teams’ branches would drift further apart over time, making the problem worse with each subsequent release.

We followed a monthly release cycle: at the end of two fortnightly sprints, all teams merged their code into a release branch, which was then built and sanity tested. The merging was usually manual and often done by cherry picking individual changes from memory. Needless to say, this process was error prone and complicated for larger releases. The completed merge was the first time that all the code had run together and the first point at which cross-team impacting changes could be identified, so many issues were encountered at this point, either in failed CI builds or in the release testing cycle.

This merge and test cycle took between 3 and 8 days, taking a week out of our delivery cycle and severely impacting velocity. The more issues that were found, the more release testing was increased in an attempt to address quality issues, which increased lead times further. It was clear that something needed to be done, and so we decided to take the leap to trunk based development, eliminating waste by removing merges and getting each team’s changes aligned sooner.

Solution

We moved to a single master/trunk stream (in Git) for ALL development, and all sprint teams worked on this single stream. Each sprint team has its own development environment for in-sprint testing of its changes. Every month we cut a release candidate branch from master for the route to live, which acts as a stable, scope-locked base for testing the upcoming release to production. This follows Microsoft’s Release Flow model. Defects found in the release candidate were fixed in either master or the release branch, at the sprint team’s discretion, but each week the changes made in the release branch (i.e. defect fixes) are back merged into master/trunk. This ‘back merge’ activity is owned by one sprint team per release, and ownership is rotated each release so that everyone shares the effort and benefits from improvements to the process.

Feature Toggles

The move towards trunk based development would not have been possible without Feature Toggles (or Release Toggles, to be more specific). We needed to support multiple teams working on the same codebase while building and testing their own changes (often targeting different releases). We needed to protect each team from every other team’s untested changes, and so we used toggles to isolate new changes until they were tested and approved. All code changes were wrapped in an IF block with the old code in the ELSE block, and a toggle check determines which route the code takes.

 if (ToggleManager.IsToggleEnabled("feature12345"))
 {
     // new code here 
 }
 else 
 {
     //wrap old code here unchanged
 }

As the system was a Java EE application we chose FF4J as the toggle framework. FF4J is a simple but flexible feature flipping framework, and as the toggles are managed in a simple XML file it made it easier to implement our Jenkins solution described later. To be clear, there are many frameworks that could be used, and it’s simple to create your own. To support replacing the toggle framework later, and to make it as simple as possible for developers to create toggles/flips, the FF4J functionality was wrapped in a helper class, which made adding a toggle a one-line change.
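To give a feel for what such a helper class can look like, here is a minimal sketch rather than our actual code. The names ToggleBackend, InMemoryToggleBackend and ToggleHelper are hypothetical, and the in-memory backend stands in for an implementation that would delegate to the real toggle framework (FF4J in our case). The point it illustrates is that application code only ever talks to the helper, so the framework behind it can be replaced.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical abstraction over the toggle framework, so the framework can be swapped. */
interface ToggleBackend {
    boolean isEnabled(String toggleName);
}

/** In-memory stand-in (an assumption for this sketch) for a framework-backed implementation. */
final class InMemoryToggleBackend implements ToggleBackend {
    private final Map<String, Boolean> states;

    InMemoryToggleBackend(Map<String, Boolean> states) {
        this.states = Map.copyOf(states);
    }

    @Override
    public boolean isEnabled(String toggleName) {
        // Unknown toggles default to OFF so that unfinished code stays dark.
        return states.getOrDefault(toggleName, false);
    }
}

/** Static facade that gives developers a one-line toggle check. */
final class ToggleHelper {
    // Until a backend is registered, every toggle reads as OFF.
    private static volatile ToggleBackend backend = toggleName -> false;

    private ToggleHelper() {}

    static void useBackend(ToggleBackend newBackend) {
        backend = Objects.requireNonNull(newBackend);
    }

    static boolean isToggleEnabled(String toggleName) {
        return backend.isEnabled(toggleName);
    }
}
```

With something like this in place, the only thing a developer writes at the call site is the one-line check, for example `if (ToggleHelper.isToggleEnabled("feature12345")) { ... } else { ... }`, and swapping frameworks means writing one new ToggleBackend.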

We also implemented a Feature Toggle Tag Library so that tags could be wrapped around content in JSF (JavaServer Faces) pages to enable or disable it depending on whether the FF4J toggle is on or off. A JavaScript wrapper for toggles was also developed, allowing toggles to be checked from within our client-side JS code.

A purposeful decision was made to bake the toggle config into the built code packages to prevent the toggle definition files and the code drifting apart. This means that the toggle file is baked into the built JAR/EAR/TAR file for deployment. Once the package is created, the toggles are fixed for that package, which prevents a code-configuration disconnect from forming (the cause of many environmental stability issues). This was sometimes controversial, as teams would want to change a toggle after deployment, either because they had forgotten to set it correctly or because they wanted to hastily flip toggles on code in test. That was not the design goal of our Feature Toggles, although it is a valid scenario for toggles, and a separate design was introduced for turning features on/off in Production.

All code changes are wrapped in a toggle, and the toggle is set to OFF in the repository until the change has passed the definition of done within the sprint team. Once the change has been completed (‘done’ includes testing in sprint), the toggle is turned ON in the master branch, making it ON for all teams immediately (as the codebase is shared). As the toggles for new untested changes are OFF in the code and all package builds come from the master branch, a new feature cannot be tested on a test environment without first flipping a toggle. So how can it be tested and turned on without impacting other teams? For this we introduced Auto Toggling in our Jenkins jobs. The Jenkins CI jobs that build the code for test include a parameter for the sprint team indicator (to identify the team the built package is for), and the job then automatically turns on the WIP toggles belonging to that team. This means that when Sprint Team A triggers a build, Jenkins will update the work-in-progress toggles for Team A from OFF to ON in the package. This package is marked as WIP for Team A and so cannot be used by other teams. Team A can now deploy and test their new functionality, but other teams will still not see their changes. Once Team A are happy with their changes, they turn the toggle ON in master for all teams and it becomes widely available.
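The core of the Auto Toggling step can be sketched in a few lines. This is an illustration of the logic only, not our Jenkins job: the real step edits the toggle file that goes into the package, whereas the sketch works on an in-memory list. The ToggleDefinition and AutoToggler names are hypothetical, and the sketch assumes that each toggle records its owning sprint team and that any toggle still OFF in master counts as work in progress for that team.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toggle definition: its name, the owning sprint team, and its state in master. */
final class ToggleDefinition {
    final String name;
    final String ownerTeam;
    final boolean enabled;

    ToggleDefinition(String name, String ownerTeam, boolean enabled) {
        this.name = name;
        this.ownerTeam = ownerTeam;
        this.enabled = enabled;
    }
}

/** Sketch of the build-time step that switches on one team's WIP toggles. */
final class AutoToggler {
    private AutoToggler() {}

    /** Returns the toggle states to bake into a package built for the given team. */
    static List<ToggleDefinition> forTeamBuild(List<ToggleDefinition> masterToggles, String buildTeam) {
        List<ToggleDefinition> packaged = new ArrayList<>();
        for (ToggleDefinition toggle : masterToggles) {
            // Assumption: a toggle that is OFF in master is work in progress for its owner.
            boolean isTeamWip = !toggle.enabled && toggle.ownerTeam.equals(buildTeam);
            packaged.add(isTeamWip
                    ? new ToggleDefinition(toggle.name, toggle.ownerTeam, true)
                    : toggle);
        }
        return packaged;
    }
}
```

Note that the master definitions are never modified: the flipped states exist only in the package built for that team, which is what keeps other teams isolated from the change.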

Unfortunately not everything can be toggled easily, and so a decision tree was built for developers to follow to understand the options available. Where the toggle framework is not an option, other approaches can be used. Sometimes a change can be “toggled/flipped” in other ways (e.g. a new function with a new name, a new column in the DB for new data, or an environment variable). The point is that the ability and desire to use Feature Toggles is not just about the toggle framework you choose; it’s a philosophy and approach that must be adopted by the teams. If, after exhausting all other options, there is no way to toggle a change, then teams have to communicate the potential breaking change amongst themselves and take appropriate action.

What worked well

So what were the benefits seen after introducing trunk based development in the team? Firstly, the benefit of maintaining fewer code branches was immediately obvious. From day one every team was on the latest codebase, and no merging or cherry picking was required at the end of the sprint. This saved time, reduced merge-related defects and increased the agility of the team. Each team has saved time and effort by not having the housekeeping of maintaining its own branch. Use of environments became more flexible, as in theory every new feature was on every environment behind a toggle, meaning that Team A’s new feature can now be tested on Team B’s environment should the need arise. Cross-team design issues have been spotted earlier, as a clash of changes between teams is seen during development and not later when the code is merged together prior to a release. Teams are now able to share new code more easily, because as soon as a new helper function is coded it can be used by another team immediately. Any improvements to the CI/CD process, code organisation or tech debt can now be embraced by all teams at once, without the delay of filtering through each team’s branch.

What didn’t go so well

Of course it’s not all perfect, and we have faced some challenges. All teams are now impacted immediately by the quality of the master/trunk branch. Broken builds have been reduced by the introduction of Pull Requests and by Jenkins running quality gate builds on check-ins to master, but any failure in the CI/CD build pipeline immediately impacts all the teams, and so these issues need to be resolved ASAP (which they should be in any team anyway). Teams must work together to resolve them, which does bring positives, and which brings me to the next point: communication.

When teams are using trunk based development and sharing one codebase, communication between teams becomes more important. It is critical that there is open dialogue between the teams to ensure that issues are resolved quickly, and that any changes that impact the trunk are communicated quickly and efficiently. Whilst a lack of cross-team comms is exacerbated in trunk based development, communication issues are a death knell for any Dev team and should be resolved anyway. If this is an issue for your teams then be aware that trunk based development may not be for you until your teams are collaborating better. That said, introducing trunk based development is a good way to encourage teams to collaborate better for their own benefit.

Feature toggles/flags are a key enabler for trunk based development, enabling you to isolate changes until they are ready, but there is no denying that they add complexity and technical debt. We add a toggle around all changed code and use the Jira ID to name the toggle. Whilst we have a neat toggle solution, there is no getting away from the fact that extra conditions mean extra complexity in your code. Unit tests must consider feature toggles and test with them both on and off. Feature toggles can make code harder to read and increase the cyclomatic complexity of functions, which may result in code quality gates failing. We had to slightly adjust our Sonar quality gate to allow for this. Whilst toggles do add technical debt to your code, this is temporary debt and an investment to improve overall productivity, and so in my opinion it is a valid form of technical debt. It’s referred to as debt because it’s OK to borrow on a manageable scale if you pay it back over time. Removing toggles is critical to this process, and yet this is one area we have yet to address sufficiently. A process to remove toggles has been introduced, but it’s proving harder than expected to motivate teams to remove toggles from previous releases at the same rate as they are being added. To this end we have added custom metrics in SonarQube to track toggle numbers, and we will use this key metric to ensure that the number of toggles stabilises or reduces.
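Testing with a toggle both on and off is easiest when the toggle check can be injected into the code under test. The sketch below is an illustration under assumptions, not code from our application: ShippingCalculator, the toggle name PROJ-1234 (standing in for a Jira ID) and the prices are all hypothetical, and the checks are written as plain Java so the example does not depend on a test framework.

```java
import java.util.function.Predicate;

/** Hypothetical class whose behaviour is switched by a toggle named after a Jira ID. */
final class ShippingCalculator {
    static final String TOGGLE = "PROJ-1234"; // hypothetical Jira ID used as the toggle name

    private final Predicate<String> toggleCheck;

    ShippingCalculator(Predicate<String> toggleCheck) {
        this.toggleCheck = toggleCheck;
    }

    int shippingPence(int orderTotalPence) {
        if (toggleCheck.test(TOGGLE)) {
            // New code: free shipping for orders of 5000 pence or more.
            return orderTotalPence >= 5000 ? 0 : 399;
        } else {
            // Old code wrapped unchanged: flat rate.
            return 399;
        }
    }
}

/** Exercises the same inputs with the toggle forced ON and forced OFF. */
final class ShippingCalculatorToggleChecks {
    private ShippingCalculatorToggleChecks() {}

    static void runBothStates() {
        ShippingCalculator toggledOn = new ShippingCalculator(name -> true);
        ShippingCalculator toggledOff = new ShippingCalculator(name -> false);

        expectEquals(0, toggledOn.shippingPence(6000));
        expectEquals(399, toggledOn.shippingPence(1000));
        // With the toggle OFF the old behaviour must be completely unchanged.
        expectEquals(399, toggledOff.shippingPence(6000));
        expectEquals(399, toggledOff.shippingPence(1000));
    }

    private static void expectEquals(int expected, int actual) {
        if (expected != actual) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }
}
```

In a real codebase the same pair of checks would normally live in the team’s unit test framework; the important part is that the OFF path is asserted as carefully as the ON path, because the OFF path is what every other team is running.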

The state of toggles is an additional thing the team needs to manage; however, this is offset by the power to control the release of new code changes to other teams. We have found that care should be taken to ensure that feature toggles are not misused for unintended purposes. Be clear on what they are for and when they should and shouldn’t be flipped (for us it’s in the Definition of Done for a task in sprint). There can be demands to use them as a substitute for proper testing. Be clear on the types of Release/Feature toggles and provide guidance on what each can be used for. There is no doubt that they can be re-used at release time to back out unwanted features, but this should be a controlled process that is designed in from the start. We already had Release Switches for turning features on and off, but Feature Toggles (in our case) are used purely for isolating changes from other teams until ready. We strive to ensure that each toggle is set ON or OFF during the sprint before the code moves through to release testing.

Conclusion

The benefits you derive from moving towards trunk based development will vary depending on your current processes. For a full guide to the general benefits of trunk based development, check out this excellent resource: trunkbaseddevelopment.com.

In our case the rollout was a success and achieved the desired improvements in cycle time and developer productivity. The process of delivering technical change was simplified in terms of configuration management, and developers became more productive. This was despite the problems we faced, as listed above.

There is no doubt that designing feature toggles in from the start would make some of the technical challenges easier, but we have proved it can be done with an existing brownfield monolith.

Next Steps

Next steps are to keep improving the toggle framework to reduce the instances where a change can’t be toggled on/off, and to make it easier to communicate about changes that will impact all teams. A renewed emphasis on removing old toggles from the code is also required, so that teams and change approvers accept the “tax” each release of removing redundant toggles.

This post is licensed under CC BY 4.0 by the author.