You are here

ROOT had bugs!

Hi!

I'm happy to see that the title got you here :-) As a matter of fact, this is about several bugs in ROOT. We had code like

 

if (!CheckServerIsUp(server)) {
  MarkServerBad(server);
}

That code is just fine, right? Well, read on...

We received a nice treatment by Coverity, a company that is famous for its open source report [link via Google, or register with Coverity to download the report]: They check the source code of many, many open source projects on a yearly basis. Linux kernel, MySQL, OpenSSL, libjpeg, Firefox - you name it. They use a static code analyzer: it is like a compiler, reading at the sources and warning about theoretically possible problems (i.e. they don't see which code gets executed in reality). The major problem of tehse analyzers is that they cannot distinguish relevant reports from noise. I find it very human to not trust the reports of a tool that has 50% false positives. I simply wouldn't waste my time understanding all those reports, assuming most of them are bogus anyway. So being relevant is much more important than finding all possible problems, which is impossible anyway for any algorithm according to some theory (seriously! I assume that's related to the Goedel incompleteness theorem).

And they offered to report the problems their scanner finds in ROOT. "Sure!" That was a few weeks before the v5.26/00 production release. Then the results arrived. Tons of them. Many were actually kosher (e.g. the code in the dictionary is under our control, we know what's happening in there, and to save CPU time we skip some tests for != 0 etc.) Many were not. Bad news spread, so all of a sudden some experiments and other software projects were interested, too! I gave a little presentation on it, people were as impressed as I was - especially by the show cases. Let's look at some.

Of course the do the usual checks: is there a memory leak? Variable with missing initialization? Using user input without checks? Buffer overflows? Stack exhausted? Unreachable code? The list of these obvious ones and their report quality is amazing: they show the comments embedded in the source code on a web page - which makes it trivial to understand, assign and fix. But what's really impressive is that they reduce the noise by checking at the context. An example:

switch (var) {
  case kApple: printf("Apple");
  case kOrange: printf("Orange"); break;
  default: printf("Incomparable");
}

This will trigger a report - the kApple case has a missing break! But this one will go unreported:

switch (var) {
  case kApple: printf("Apple, ");
     // intentional fall-through; an apple is a fruit
  case kFruit: printf("a kind of fruit"); break;
  default: printf("Better don't eat.");
}

They detect the comment. As I said before: bringing the noise down is a major challenge - and so they decide that if you want a fall-through you'ed better put a comment about it, and thus fi they see one they assume an intentional fall-through. That's neat but not magic, yet.

 

Magic starts here:

void f(const char* s) {
  size_t l = strlen(s);
  if (s != 0 && l > 12) {
    printf("s ends on: %s\n", s + 12);
  }
}

The fact that the code tests for

s != 0

after it's passed to strlen() will cause a report. If there were no tests for

s != 0

then they would assume that the developer has a guarantee that it's != 0. So either that test for

s != 0

is bogus, then it should be removed, or it's not, then it needs to be there before strlen(). It gets better yet!

 

The nicest feature I found so far is that they worry about you checking the return values of functions. If you have a function that can fail, and in 90% of all cases you check its return value but in 10% you don't: then they will complain about these 10%, arguing the same way: either you need the check or you don't. I heard ATLAS developers will be afraid of that one...

Coming back to the first example:

if (!CheckServerIsUp(server)) {
  MarkServerBad(server);
}

The problem was that

CheckServerIsUp()

contained

delete server

. And

MarkServerBad()

was accessing a member of

server

. You would never be able to tell just by reading the code - these things are so incredibly hard to find. And many of them (like this one) never happen in any test, because you rarely test the failure conditions, especially when it comes to network communication. So it's rare and impossible to reproduce.

 

Thank you, Coverity! Many big bugs in ROOT got killed for v5-26/00. We still have long way to go to get the number of reports down to 0 - but we'll continue to try!

In the next entry I'll talk about the major improvements in I/O that are part of v5.26/00. Every ROOT user can benefit from the speed improvement; I suppose that qualifies for a relevant report :-)

Talk to you soon!

Cheers,
 

Comments

Great post, Axel - interesting! So, they've run this on Atlas code?? I'd like to see the results of that.. ;)

Hi Andy!

Thanks for your comment! We're currently working on an agreement with Coverity that would allow CERN to use it, e.g. for the experiments. While some experiments have announced interest (CMS, Alice) I haven't heard from Atlas. They might just be afraid :-)

Cheers, Axel.

Hi,

I owe ATLAS an apology: they are now one of the most active experiments when it comes to static code checks. Currently, static analysis (using Coverity) is done on a regular basis by all the LHC experiments, by non-LHC experiments (NA61 etc), by common software projects (ROOT, Geant4, COOL, Castor, XRootd etc) and "even" by a beams division project. It was impressive how quickly and with how much enthusiasm a new tool like Coverity got picked up over here!

Cheers, Axel

Hi Collin,

Yes, it definitely helps :-) Our license agreement with Coverity covers the experiments' software. Please contact your experiment's software people who will be able to judge what do to.

Cheers, Axel