Friday, February 14, 2014

The Amazing DevOps Transformation Of The HP LaserJet Firmware Team (Gary Gruver)

// IT Revolution

In this blog post, I want to share with you one of the most startling stories of DevOps-style transformation I've ever seen. It's a truly remarkable story for a variety of reasons: not only did they improve developer productivity by 2-3x, but they did it for the firmware that supported the enterprise HP LaserJet family of products.

This project was led by Gary Gruver (LinkedIn, @GruverGary) in 2006, who at the time was Director of LaserJet Firmware at Hewlett Packard. (He is currently VP of Quality Engineering, Release and IT Operations at Macy's.)

Yes, firmware. Gary Gruver still describes this transformation as fundamentally DevOps, even though firmware isn't delivered as a service by IT Operations. And I agree with him. You can see all of the Three Ways in each step of the transformation.

As Gary says, "We definitely did DevOps. Just as you'd bring the production environment to Development if you're running an online services property, I brought the printer to Development."

Gary's story shows us that DevOps is not just for unicorns (e.g., Google, Amazon, Netflix, Etsy, Twitter, etc.) — DevOps is for any value stream where Development, Test and IT Operations must work together to achieve the organizational goals.

Furthermore, I believe if Gary did DevOps for LaserJet firmware, you can do it for anything. (See how Darren Hague applied DevOps to SAP here, and how John Kordyback did it for a mainframe COBOL application here).

(The full story of Gary's HP LaserJet transformation is described in his book A Practical Approach To Large-Scale Agile Development. Furthermore, you can see a video of his keynote presentation here, and his slides are here).

The Business Problem (2006-2007)

Gary's group was responsible for the firmware code that ran on all the enterprise LaserJet products, which included multi-function printers, copiers, scanners, etc. Like the consumer printer market, it was incredibly competitive, with new and innovative offerings showing up nearly every month.

Their problem was that the firmware group was having tremendous difficulty keeping up with the demand for new, innovative features, despite having somewhere between 400 and 800 developers working across the globe, supporting 10MM+ lines of code.

They were completing two software releases per year, with the majority of their time being spent porting their code to new products. Gary estimated that only 5% of their time was spent creating or supporting new features.

The net result was that Gary's group became the bottleneck for the entire business line. It was so difficult to get anything new into the HP LaserJet firmware that Marketing basically gave up asking for new things. As Gary says, "Marketing would come to us with a million ideas that would dazzle the customer, and we'd just tell them, 'Out of your list, pick the two things you'd like to get in the next 6-12 months.'"

"After years of that, Marketing basically gave up coming up with new ideas. Why bother? They knew we couldn't build it for them," he explained. "It's horrible when people give up and stop asking you for new things. When that happens, you know you're in a bad situation, and you certainly can't win in a competitive marketplace like we were in."

Worse, in 2008, Development costs grew by 2.5x (uh oh), with 80-90% of resources merely porting existing firmware to new products and qualifying it.

(A horrible, funny, horrible anti-pattern: The response to getting so little done was that Marketing would demand a re-plan, where Development had to prove they couldn't get anything more done. Gary says, "So, what little time we had was spent doing more 'planning.'")

The Technical Problem: Lead Time From "Code Committed" To "Ready To Ship"

Here are some relevant statistics of Gary's team that may seem all too familiar to other software projects we've been associated with:

  • 5% of Development time spent writing new features
  • 15-20% of Development time was spent integrating their code into mainline (i.e., trunk)
  • when a developer checked in code, it took 1 week to determine whether it integrated successfully into trunk
    • there were 8 teams, each with a "build boss," who would submit changes into a centralized build daily
    • it would take another day for it to reach the "integration build," where another day was needed to run the acceptance tests
    • a full (manual) integration testing required an additional 6 weeks

The point here is, a developer would often have to wait 8 full weeks before they would learn whether their code change actually worked or not!

(In a presentation by Paul Rogers, CTO of GE Energy, he described a situation where their builds took 11 hours to complete. When they were evaluating Electric Cloud to help parallelize their build and test process, they found 73 build failures. Paul observed, 'Finding and fixing each of these errors could have required 73 * 11 hours = 803 hours!' Slow feedback loops indeed do kill.)

Slow Feedback Loops Kill

"We all knew we could do so much more if we could just get faster feedback," Gary explains. "We were constantly being interrupted by new defects that were introduced into the code base up to 6 weeks before — when you're trying to fix something that broke that long ago, it's a huge problem to figure out who actually made the change that caused the defect."

I laughed when Gary observed the following. "How can you expect a developer to learn anything under these conditions, so that we could prevent it from happening in the future? You can't! The only thing that happens is management yells at someone for something they did six weeks ago."

I think this is a profound observation. In other words, no one can learn anything when feedback is sufficiently slow (e.g., 6 weeks). Learning can only occur if we can see the cause-effect linkage between the work we do and the outcomes.

Implementing Continuous Integration And A Dramatic Architecture Change

Architecture Changes

In the previous regime, each of the different LaserJet models would require a new code branch, with #ifdef's used to enable/disable code execution for different models and capabilities (e.g., copier, paper size, etc.).

This created many problems, including an ever-increasing number of branches that had to be actively maintained (as well as branches of branches), each creating a unique build, requiring separate testing.

The first goal was to move all developers onto a common code base, because they identified maintaining multiple branches as their largest inefficiency.

They eliminated separate branches for all the products, putting all LaserJet models into trunk. Instead of using branches and compile-time #ifdefs, printer capabilities are established at run-time, using an XML configuration file.

At the time of this writing, trunk supports 24 different HP LaserJet products. "Getting rid of code branching will often be your biggest efficiency gain. It's real. It works," Gary says. "The next thing you'll need is good automated testing. Without automated testing, continuous integration is the fastest way to get a big pile of junk that never compiles or runs correctly."

Automated Testing

To support self-testing builds, they built a set of automated unit, acceptance and integration tests, which would continually be run against trunk. Furthermore, they created a culture that "stopped the line" anytime a developer checked in code that broke the build, broke a unit test, etc.

Testing printer firmware can be challenging. You get the highest level of assurance when you're actually testing on a real printer, printing on real paper, etc. "The problem is, we required over 15,000 hours of testing per day. There aren't enough trees to support this volume of testing," Gary says.

"Because the cost of testing goes up as you get closer to actual physical testing, we wanted to drive as much of the testing upstream, where it was cheaper, using simulators, emulators, etc," Gary continues.

To enable this level of testing, and still have the tests run quickly, in six weeks they built the infrastructure to support it: 4 racks of servers, with 2 more in India. These held 500 physical servers, each running 4 VMs. Those 2,000 virtual servers ran printer simulators, which would load the firmware builds and then report the test results.

By doing this, they created fast feedback: a developer would know within hours whether the code they committed worked (running the full regression test took 24 hours).

  • Before vs. After
    • Build cycle time: 1 week -> 3 hours (10-15 builds per day)
    • Commits: 1 commit/day -> 100 commits/day
    • Regression test cycle time: 6 weeks -> 24 hours

(Again, the improvements are astonishing. Gary and team improved almost all cycle-time/lead-time measures by 1 or 2 orders of magnitude.)

Cycle Time Reductions (Slide 7)

Stopping The Line When Builds And Tests Failed

Gary also describes how, once they had an automated test suite, they created a culture where all work stopped when someone broke the build or the test suite (i.e., when someone breaks the deployment pipeline). To help developers get the deployment pipeline running again, they created a chat room to enable quick and easy communication. Gary said, "It became a lab felony to commit code and then leave the office without confirming that the build is still 'green'. And no code commits were allowed on a 'red' build — that was a misdemeanor."

One time, they had a broken build for 5 days, which created enormous problems, because no developers could commit code for new features. Why? It turned out that some new tests that raised the quality bar were causing the test suite to fail.

"I needed a process to let good code in, and keep the bad code out. Instead, we were letting the train wreck happen on the main tracks," Gary explained. "We then started using gated commits to prevent bad code or bad tests from entering trunk, which enabled the rest of the developers to stay productive. This was an enormous productivity improvement. It meant that only, say, 25 developers would be affected, as opposed to 200 developers."

Performance Breakthroughs

Here are some astonishing stats on where Gary's team ended up:

  • 75K-100K lines of code changes every day (!!)
  • 100-150 code commits per day

"There's no way we could have been modifying 100K lines of code in the previous regime; you need continuous integration or continuous delivery infrastructure to allow teams to go fast. Only then can teams could work with enough autonomy."

I asked Gary whether there were any other big surprises. He said, "People say you can't do remote Agile. We didn't find that to be the case. We had teams working remotely, being productive. CI became the integration point for the remote teams."

That concludes this case study. What follows is a Q&A with Gary Gruver.


Gene Kim: Some people say DevOps in the enterprise doesn't make sense. What do you think about DevOps in the enterprise and the successes you have seen?

Gary Gruver: To me, DevOps in the Enterprise is all about addressing the biggest weakness of most Agile implementations in the Enterprise. All too often I see Enterprise implementations of Agile that focus on scrum at the team level and completely ignore the Enterprise level definition of done. They talk about how many scrum teams they have and how well they are doing scrum. What they completely miss is one of the key values of Agile, which is always keeping the code base stable and close to releasable. To me, DevOps in the Enterprise is all about driving that discipline into large organizations with different groups and applications that have to work together. At its core it is about defining and driving a more robust Enterprise level definition of done.

To truly be Agile in the Enterprise, organizations need to learn that the definition of done really needs to expand beyond the current "it works on my box" mentality. Ideally this needs to extend all the way to working and stable in production, with good monitoring capabilities in place. That said, the journey to that ideal state takes a while, so it is helpful to have a strategy to drive the transformation. In my mind, there are two potential approaches to consider.

First is what I would call the Google model of taking developers into production. In this situation, the architecture is fairly clean and the management chain has a very clear understanding of how hard it is to deliver and run supportable Enterprise level Software. In this model the management team has declared that the development team will own operational support of the new application for at least the first 9 months in production. After that time they can set up a meeting with operations to see if the team is willing to take over on-going support. If not, the development team remains responsible for operations. This creates the type of closed-loop learning that will really get you rethinking development shortcuts and your definition of done. It makes sure that the development team is considering and prioritizing all the operational aspects of the product.

The second approach, which is probably more practical in most organizations, is working to drive an Enterprise operational-like environment back upstream, as close to the developers as possible. This approach forces the resolution of Operational and Enterprise level issues early and often. This can be done through Enterprise level Continuous Integration, Continuous Delivery and/or Continuous Deployment. In my mind, Enterprise level Continuous Integration is more related to what we did at HP, with over 400 developers committing code on an ongoing basis that had to work across multiple products at the same time. What I have learned since is that when you try to implement a similar large-scale operational feedback process for Enterprise level software, you really start crossing the line into Continuous Delivery, where the deployment aspects become much more important. Creating an Enterprise level DevOps environment upstream, as close to the developers as possible, really requires investing in creating all the Continuous Delivery capabilities that Jez Humble and David Farley documented in their excellent book. Once you have all those things in place, developers realize in real time what it takes to have a clear definition of done in an Operational-like environment. The final step in my mind, and it is subtle, is going all the way to Continuous Deployment, where you get the additional advantages they describe of small batch sizes moving into production. The reason I make this subtle distinction is that some customers and businesses feel it is more important to batch up deployments into less frequent releases to provide more consistency for their customers. The key, though, is to use Continuous Delivery techniques to reach the point where it is easy to release more frequently than the business wants, and to remove the delivery of Software as a constraint for the business.

At HP the DevOps challenge was integrating code from over 400 developers around the world and making sure it still worked on all the different printers, copiers, and scanners under development and in the field. The operational piece was fairly straightforward compared to Enterprise Software: you just FTP'd the new code to the printer and made sure it worked. The challenge was making sure it worked, and would stay working, across that many different devices. Ideally from a DevOps perspective we would have wanted every check-in automatically tested against a real printer for every different device supported. Given that we had over 15,000 hours of tests that we were running daily, this would have cost way too much and required turning every tree in Idaho into paper for testing. Therefore, we created an extensive deployment pipeline that heavily depended on simulator testing on VMs, and emulator testing that included our custom HW, to create real-time operational environment feedback to our developers. Since the simulator and emulator testing were much cheaper and more frequent, we heavily relied on them for our DevOps type environment. That said, anytime we found a failure on real hardware that did not show up earlier, we worked to improve our emulator and simulator testing capabilities and coverage.

My biggest learning in moving over to Enterprise Software was that the deployment process was much more complex and less reliable than the FTP to a printer I had grown to know and love. It is so difficult to get an entire website up and working with everything operationally correct that a lot of teams start depending on dedicated environments, or defining done based on their subcomponent of the website. To me this is directly opposite of what the definition of done should be in an Agile Enterprise. The DevOps perspective provides the discipline for integrating the entire Enterprise system that must work together in production as early and often as possible. The Enterprise system should be built up in a structured and logical manner. Without it you are just kidding yourself that your Enterprise is Agile, no matter how many scrum teams you have going.

Gene Kim: Tell me about your top lessons learned and biggest value created.

Gary Gruver: My biggest lesson was around learning how valuable a Large Scale DevOps type approach is for the Enterprise. In my mind, having a production-like deployment pipeline with extensive automated testing is the single most important step for improving Software/Firmware development productivity in the Enterprise. Every engineer wants to write good code, and believes they have until they get feedback that it is not working correctly or has broken something else. If this feedback takes a long time and does not include a production-like environment, then you can't expect the developers to improve or the handoffs to production to work well.

That said, getting an organization to make the transition is a huge change management challenge. Almost to an engineer, when you describe the vision and direction of Large-Scale Continuous Delivery, they will tell you why it won't work. They will go into long stories of how it will break when bringing in large changes and why you need feature branches. In the beginning at HP, when I set out the vision of one main branch for all current and future products using Continuous Integration, most of the engineers thought I had lost my mind. They would avoid making eye contact when I walked down the hall and were just hoping at some point I would come back to my senses. It was especially challenging for our technical lead over the build, test, and release process. He really wanted to engage me in the branching discussion, to which I would say we aren't going to branch. It got to the point where I think he wore out a thesaurus trying to find ways to talk about branching without using the B word.

Once we had the deployment pipeline up and really working, people got used to working that way and appreciated its advantages. Engineers would see bringing in big changes on trunk as the best way of getting real-time feedback on how their code would work with the rest of the system. When we were close to releasing our first product, as the re-architecture came to completion, I approached our release lead about being ready to branch. His response was that we shouldn't branch and lose all the efficiencies of having an Enterprise level DevOps type environment giving real-time feedback to the developers on one main trunk. The next time I saw him he had come back from a river trip down the Middle Fork of the Salmon with a Flying B Ranch hat from the local airstrip and ranch. He walked up with the hat and said, "We don't need no stinking B-Ranch."

The reason I like this story is twofold. First and foremost, my personal belief is that this is the single biggest opportunity for improving development productivity in a large Enterprise. Second, it shows the big shift in mindset that needs to occur. Our release technical lead went from the biggest non-believer to the biggest supporter of the approach. When I talk to people, almost to an engineer, until they have worked in this type of environment they will never believe it can work. Then, once they have developed in this type of environment, they can't imagine how they could ever go back. I have had several different engineers who have left HP call me to talk about how backward development is at their new companies. They complain they are not nearly as effective because the feedback is non-existent and it is very hard to release code into production. To them it feels like such a large step backwards that they can't imagine why any company would develop without an Enterprise level DevOps type definition of done. It is just too slow and inefficient.

If you are focused on improving productivity in a large Enterprise, it is important to remember that dedicated environments and branches are evil. They just create way too much overhead and delay the feedback that comes from integrating early and often. Additionally, this feedback needs to come from as operational-like an environment as possible, as soon as possible. DevOps in the Enterprise needs to be real, but as the story shows, seeing is believing.

The post The Amazing DevOps Transformation Of The HP LaserJet Firmware Team (Gary Gruver) appeared first on IT Revolution.
