Hi, Jon DeVaan here.
Steven wrote about how we organize the engineering team on Windows,
which is a very important element of how work is done. Another
important part is how we organize the engineering project itself.
I’d like to start with a couple of quick notes. First is that Steven
reads and writes about ten times faster than I do, so don’t be too
surprised if you see about that distribution of words between the two
of us here. (Be assured that between us I am the deep thinker :-). Or
maybe I am just jealous.) Second is that we do want to keep
sharing the “how we build Windows 7” topics since that gives us a
shared context for when we dive into feature discussion as we get
closer to the PDC and WinHEC. We want to discuss how we are engineering
Windows 7, including the lessons learned from Longhorn/Vista. All of
these realities go into our decision making on Windows 7.
OK, on to the tawdry bits.
Steven linked last time to the book Microsoft Secrets,
which is an excellent analysis of what I like to call version two of
the Microsoft Engineering System. (Version one involved index cards and
“floppy net” and you really don’t want to hear about it.) Version two
served Microsoft very well for far longer than anyone anticipated. But
what we learned from Windows XP, from the truly different security
environment that emerged at that time, and from Longhorn/Vista made it
clear that it was time for another generational transformation in how
we approach engineering our products.
The lessons from XP revolve around the changed security landscape in
our industry. You can learn about how we put our learning into action
by looking at the Security Development Lifecycle,
which is the set of engineering practices recommended by Microsoft to
develop more secure software. We use these practices internally to
engineer Windows.
The comments on this blog show that the quality of a complete system
contains many different attributes, each of varying importance to
different people, and that people have a wide range of opinions about
Vista’s overall quality. I spend a lot of time on core reliability of
the OS, and from studying the telemetry we collect from real users (only
if they opt in to the Customer Experience Improvement Program),
I know that Vista SP1 is just as reliable as XP overall and more
reliable in some important ways. The telemetry guided us on what to
address in SP1. I was glad to see commenters point out one of those
ways: sleep and resume work better in Vista. I am also
excited by the prospect of continuing to use the telemetry (and we
are) to drive Vista to be the most reliable version of Windows
ever. To the list of Vista’s qualities I would add that it cut
security vulnerabilities by just under half compared to XP.
This blog is about Windows 7, but you should know that we are working
on Windows 7 with a deep understanding of the performance of Windows
Vista in the real world.
In the most important ways, people who have emailed and commented
have highlighted opportunities for us to improve the Windows
engineering system. Performance, reliability, compatibility, and
failing to deliver on new technology promises are popular themes in the
comments. One of the best ways we can address these is by better
day-to-day management of the engineering of the Windows 7 code base—or
the daily build quality. We have taken many concrete steps to improve
how we manage the project so that we do much better on this dimension.
I hope you are reading this and going, “Well, duh!” but my
experience with software projects of all sizes and in many
organizations tells me this is not as obvious or easily attainable as
we wish.
Daily Build Quality
Daily quality matters a great deal in a software project because
every day you make decisions based on your best understanding of how
much work is left. When the average daily build has low quality, it is
impossible to know how much work is left, and you make a lot of bad
engineering decisions. As the number of contributing engineers
increases (because we want to do more), the importance of daily quality
rises rapidly because the integration burden increases according to the
probability of any single programmer’s error. This problem is more than
just not knowing what the number of bugs in the product is. If that
were all the trouble caused then at least each developer would have
their fate in their own hands. The much more insidious side-effect is
when developers lack the confidence to integrate all of the daily
changes into their personal work. When this happens there are many
bugs, incompatibilities, and other issues that we can’t know about because
the code changes have never been brought together on any machine.
I’ve prepared a graph to illustrate the phenomenon using a simple
formula predicting the build breaks caused by a 1 in 100 error rate on
the part of individual programmers over a spectrum of group sizes (blue
line). A one percent error rate is good; a typical rate would be a
little worse than that. I’ve included two other lines
showing the build break probability if we cut the average individual
error rate by half (red line) and by a tenth (green line). You can see
that mechanisms that improve the daily quality of each engineer impact
the overall daily build quality by quite a large amount.
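For readers who want to play with the numbers, here is a minimal sketch of one formula that produces curves of this kind. It is an assumption, not necessarily the formula behind the graph: it treats each programmer's daily change as an independent event with the same error rate p, so the chance that at least one of n changes breaks the build is 1 - (1 - p)^n. The function name, the group sizes, and the reading of "by a tenth" as one tenth of the original rate are all illustrative choices.

```python
def build_break_probability(team_size: int, error_rate: float) -> float:
    """Chance that at least one of `team_size` independent changes breaks the build.

    Assumes every programmer has the same per-day error rate and that
    errors are independent; both are simplifications for illustration.
    """
    return 1.0 - (1.0 - error_rate) ** team_size


if __name__ == "__main__":
    # Hypothetical rates: 1 in 100, half of that, and one tenth of that.
    rates = [("1 in 100", 0.01), ("half", 0.005), ("a tenth", 0.001)]
    # Hypothetical group sizes, chosen only to show the shape of the curve.
    for size in (10, 40, 100, 500, 1000):
        cells = ", ".join(
            f"{label}: {build_break_probability(size, rate):.1%}"
            for label, rate in rates
        )
        print(f"{size:>5} programmers -> {cells}")
```

Under these assumptions a group of 40 at a 1 in 100 error rate breaks roughly one build in three, while a group of several hundred at the same rate breaks nearly every build. Cutting the individual error rate moves the whole curve down, which is the effect the red and green lines illustrate.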
For a team the size of Windows, it is quite a feat for the daily builds to be reliable.
Our improvement in Windows 7 leveraged a big improvement in the
Vista engineering system: an investment in a common test automation
infrastructure across all the feature teams of Windows. (You will see
here that there is an inevitable link between the engineering processes
themselves and the organization of the team, a link many people don’t
recognize.) Using this infrastructure, we can verify the code changes
supplied by every feature team before they are merged into the daily
build. Inside of the feature team this infrastructure can be used to
verify the code changes of all of the programmers every day. You can
see in the chart how the average of 40 programmers per feature team
balances the build break probability so that inside of a feature team
the build breaks relatively infrequently.
For Windows 7 we have largely succeeded at keeping the build at a
high level of quality every day. While we have occasional breaks as we
integrate the work of all the developers, the automation allows us to
find and repair any issues and produce a high-quality build virtually
every day. I have been using Windows 7 for my daily life since the
start of the project with relatively few difficulties. (I know many
folks are anxious to join me in using Windows 7 builds every day—hang
in there!)
For fun I’ve included a couple of pictures from our build lab where
builds and verification tests for servers and clients are running 24x7:
Conclusion
Whew! That seems like a wind sprint through a deep topic that I
spend a lot of time on, but I hope you found it interesting. I hope
this example gives you the idea that we have been very holistic in
thinking through new ways of working and improvements to how we
engineer Windows. The ultimate test of our thinking will be the
quality of the product itself. What is your point of view on this important
software engineering issue?