The current implementation of TMemStat provides two algorithms to unwind the stack after a malloc/free hook is called:
backtrace - a function of the GNU C Library,
__builtin_frame_address - a built-in function of gcc.
The user can switch between the algorithms by passing an option to the TMemStat constructor. If the option is “gnubuiltin”, the gcc builtin is used; for any other value, including an empty string, the C Library backtrace is used.
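The selection rule can be sketched as a small helper. This is only an illustration: `UnwindAlgorithm` and `SelectUnwindAlgorithm` are hypothetical names, not part of the TMemStat API, and the exact, case-sensitive string comparison is an assumption about how the option is matched.

```cpp
#include <string>

// Hypothetical names for illustration; not part of the TMemStat API.
enum class UnwindAlgorithm { kBacktrace, kGccBuiltin };

// Map the constructor option string to an unwinding algorithm.
// Assumption: the option is compared as an exact, case-sensitive string.
UnwindAlgorithm SelectUnwindAlgorithm(const std::string &option)
{
    return option == "gnubuiltin" ? UnwindAlgorithm::kGccBuiltin
                                  : UnwindAlgorithm::kBacktrace;
}
```

Under these assumptions, an empty string or any unrecognized value falls back to the C Library backtrace.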
Each algorithm has its own advantages and disadvantages. So far we know that the C Library backtrace is very slow but works in most cases. The gcc builtin doesn't always work when the application is compiled in optimized mode and frame pointers are omitted. In fact, both algorithms can have problems if frame pointers are omitted (a compiler optimization option).
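To make the difference concrete, here is a minimal sketch of the two primitives, not the TMemStat code itself. `UnwindWithBacktrace`, `UnwindWithBuiltin` and `BuiltinFrame` are hypothetical names. The builtin part deliberately stops at level 0, because walking to higher levels is exactly what becomes unreliable once frame pointers are omitted.

```cpp
#include <execinfo.h> // backtrace(): GNU C Library, also present on Mac OS X
#include <vector>

// Algorithm 1: let the C library walk the stack. Returns at most maxDepth
// return addresses, innermost frame first.
std::vector<void *> UnwindWithBacktrace(int maxDepth)
{
    if (maxDepth <= 0) return {};
    std::vector<void *> frames(static_cast<std::size_t>(maxDepth));
    const int n = backtrace(frames.data(), maxDepth);
    frames.resize(static_cast<std::size_t>(n > 0 ? n : 0));
    return frames;
}

// Algorithm 2: ask the compiler for the frame data directly.
struct BuiltinFrame {
    void *fFrameAddress;  // frame of the function that ran the builtin
    void *fReturnAddress; // address the function returns to in its caller
};

// noinline keeps a real call frame, so the return address is meaningful.
__attribute__((noinline)) BuiltinFrame UnwindWithBuiltin()
{
    // The builtins need a compile-time constant level. Only level 0 is used
    // here; higher levels follow the saved frame pointers and may return
    // garbage or crash when the callers were built without them.
    return {__builtin_frame_address(0), __builtin_return_address(0)};
}
```

A deeper builtin-based unwinder would have to repeat the calls with constant levels 1, 2, and so on, which is where the dependence on frame pointers comes from.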
The following systems were used for tests.
Linux 64bit:
Linux 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 12:36:24 CET 2010 x86_64 x86_64 x86_64 GNU/Linux
gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
Linux 32bit:
MacOSX 10.5:
MacOSX 10.6:
Darwin 10.2.0 Darwin Kernel Version 10.2.0: Tue Nov 3 10:37:10 PST 2009; root:xnu-1486.2.11~1/RELEASE_I386 i386
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)
Table 1. General comparison
algorithm / OS | Linux 64b (opt.[a]) | Linux 64b (debug[b]) | Linux 32b (opt.[a]) | Linux 32b (debug[b]) | MacOSX 10.5 (opt.[a]) | MacOSX 10.5 (debug[b]) | MacOSX 10.6 (opt.[a]) | MacOSX 10.6 (debug[b]) |
---|---|---|---|---|---|---|---|---|
backtrace | Ok. | Ok. | - | - | - | - | Ok. [c] | Ok. |
builtin | X | Ok. | - | - | - | - | Ok. [c] | Ok. |

[a] “opt.” means that TMemStat is compiled with the default ROOT optimization flags, and the same holds for the test script.
[b] “debug” means that the TMemStat library is compiled with “make ROOTBUILD=debug” and the test script is compiled by ACLiC with the C++g option.
[c] Works, but if malloc/free is called in a loop, we get two unique backtraces for each loop. Probably there is a partial optimization, and the compiler moves the first iteration out of the loop. As a result there are actually two malloc/free call sites, one before the loop and one inside it. I checked that the number of iterations doesn't matter, 10 or 100000: we always get two different backtraces. Sometimes the loop is even fully unrolled, in which case we get as many distinct return addresses as there are iterations in the loop. Both algorithms behave the same in this case, which is what one would expect if the code was really optimized and the compiler unrolled the loop.
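The effect described in footnote [c] can be observed with a small experiment. The sketch below is an illustration only: `TrackedMalloc` is a hypothetical stand-in for a malloc hook, and `CountCallSitesInLoop` is a hypothetical helper that counts the distinct return addresses the hook sees. How many it reports depends on the compiler and the optimization flags.

```cpp
#include <cstdlib>
#include <set>

// Return addresses observed by the stand-in hook below.
static std::set<void *> gCallSites;

// Hypothetical stand-in for a malloc hook: it records who called it.
// noinline keeps a real call, so each call site has its own return address.
__attribute__((noinline)) void *TrackedMalloc(std::size_t size)
{
    gCallSites.insert(__builtin_return_address(0));
    return std::malloc(size);
}

// Allocate and free in a loop, then report how many distinct call sites the
// hook saw. If the compiler copies the loop body (first iteration moved out,
// or a full unroll), the single source-level call becomes several call sites.
std::size_t CountCallSitesInLoop(int iterations)
{
    gCallSites.clear();
    for (int i = 0; i < iterations; ++i) {
        void *p = TrackedMalloc(16);
        std::free(p);
    }
    return gCallSites.size();
}
```

Comparing the count from a debug build with the count from an optimized build of the same file shows whether the compiler duplicated the call site.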
Starting from gcc 4.1, some optimization levels (e.g., -O, -Os, -O2) imply the -fomit-frame-pointer flag by default. This flag prevents our “gcc builtin” algorithm from working properly. If you want memstat to use this algorithm and your application is compiled with optimization flags, we recommend also building your application with the -fno-omit-frame-pointer option.