Important: You are required to change your passwords used for Gentoo services and set an email address for your Wiki account if you haven't done so. See the full announcement and Wiki email policy change for more information.

Project:Quality Assurance/Backtraces

From Gentoo Wiki
< Project:Quality Assurance
Revision as of 16:25, 9 December 2013 by Polynomial-C (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

This guide is meant to provide users with a simple explanation of why a default Gentoo installation does not provide meaningful backtraces and how to set it up to get them.

Backtraces with Gentoo

What are backtraces?

A backtrace (sometimes also called bt, trace, or stack trace) is a human readable report of the calling stack of a program. It tells you at which point of a program you are and how you reached that point through all the functions up to main() (at least in theory). Backtraces are usually analyzed when error conditions such as segmentation faults or aborts are reached using debuggers like gdb (GNU debugger), to find the cause of the error.

A meaningful backtrace contains not only the shared objects where the call was generated, but also the name of the function, the filename and the line where it stopped. Unfortunately on a system optimised for performance and conserved disk space, the backtraces are useless and show only the pointers on the stack and a series of ?? instead of the functions' names and position.

This guide will show how it's possible to get useful, meaningful backtraces in Gentoo, by using Portage features.


Compiler flags

By default gcc does not build debug information inside the objects (libraries and programs) it builds, as that creates larger objects. Also, many optimisations interfere with how the debug information is saved. For these reasons, the first thing to pay attention to is that the CFLAGS are set to generate useful debug information.

The basic flag to add in this case is -g . That tells the compiler to include extra information in objects, such as filenames and line numbers. This is usually enough to have basic backtraces, but the flag -ggdb adds more information. There is actually another flag ( -g3 ), but its use is not recommended. It seems to break binary interfaces and might lead to extra crashes. For instance, glibc breaks when built with that flag. If you want to provide as much information as possible, you should use the -ggdb flag.


CodeExample of CFLAGS with debug information

CFLAGS="-march=k8 -O2 -ggdb"
CXXFLAGS="${CFLAGS}"

High optimisation levels, such as -O3 might cause the backtrace to be less faithful, or incorrect. Generally speaking, -O2 and -Os can be used safely to get an approximate backtrace, down to the function called and the area of the source file where the crash happened. For more precise backtraces, you should instead use -O1 .


Note
The use of -O0 is often suggested when trying to produce a complete backtrace. Unfortunately this does not always play fair with the software itself, as disabling all optimisations changes the implementation of functions in the GNU C library (sys-libs/glibc), to the point that it can be considered being two different libraries, one for optimised and one for non-optimised builds. Also, some software will fail to build entirely when using -O0 because of the changes in headers' includes, and lack of features such as constant propagation in dead code elimination.

Note for x86 architecture users: x86 users frequently have -fomit-frame-pointer in their CFLAGS. The x86 architecture has a limited set of general registers, and this flag can make an extra register available, which improves performance. However there is a cost: it makes it impossible for gdb to "walk the stack" — in other words, to generate a backtrace reliably. Remove this flag from CFLAGS to build something easier for gdb to understand. Most other platforms do not have to worry; either they generally don't set -fomit-frame-pointer anyway, or the code generated by gcc does not confuse gdb (in which case the flag is already enabled by -O2 optimisation level).

Hardened users have other things to worry about. The hardened FAQ provides the extra hints and tips you need to know.


Stripping

Just changing your CFLAGS and re-emerging world won't give you meaningful backtraces anyway, as you have to solve the stripping problem. By default Portage strips binaries. In other words, it removes the sections unneeded to run them to reduce the size of the installed files. This is a good thing for an average user not needing useful backtraces, but removes all the debug information generated by -g* flags, and also the symbol tables that are used to find the base information to show a backtrace in human readable form.

There are two ways to stop stripping from interfering with debugging and useful backtraces. The first is to tell Portage to not strip binaries at all, by adding nostrip to FEATURES. This will leave the installed files exactly as gcc created them, with all the debug information and symbol tables, which increases the disk space occupied by executables and libraries. To avoid this problem, in Portage version 2.0.54-r1 and the 2.1 series, it's possible to use the splitdebug FEATURE instead.

With splitdebug enabled, Portage will still strip the binaries installed in the system. But before doing that, all the useful debug information is copied to a ".debug" file, which is then installed inside /usr/lib/debug (the complete name of the file would be given by appending to that the path where the file is actually installed). The path to that file is then saved in the original file inside an ELF section called ".gnu_debuglink", so that gdb knows which file to load the symbols from.


Important
If you enable both nostrip and splitdebug features, Portage won't strip binaries at all, so you have to pay attention to what you want.

Another advantage of splitdebug is that it doesn't require you to rebuild the package to get rid of the debug information. This is helpful when you build some packages with debugging to get a backtrace of a single error. Once it's fixed, you just need to remove the /usr/lib/debug directory.

To be sure to not strip binaries, you must also be sure you don't have the -s flag set in your LDFLAGS. That tells the linker to strip the resulting binaries in the link phase. Also note that using that flag might lead to further problems. It won't respect the strip restrictions imposed by some packages that stop working when entirely stripped.


Note
Some packages unfortunately handle stripping by themselves, inside the upstream provided makefiles. This is an error and should be reported. All packages should leave Portage the task of the stripping or simply restrict stripping entirely. The main exception to this are binary packages. They are usually stripped by upstream, outside of Portage control.


debug USE flag

Some ebuilds provide a debug USE flag. While some mistakenly use it to provide debug information and play with compiler flags when it is enabled, that is not its purpose.

If you're trying to debug a reproduceable crash, you want to leave this USE flag alone, as it'll be building a different source than what you had before. It is more efficient to get first a backtrace without changing the code, by simply emitting symbol information, and just afterward enable debug features to track the issue further down.

Debug features that are enabled by the USE flag include assertions, debug logs on screen, debug files, leak detection and extra-safe operations (such as scrubbing memory before use). Some of them might be taxing, especially for complex software or software where performance is an important issue.

For these reasons, please exercise caution when enabling the debug USE flag, and only consider it a last-chance card.


Introducing gdb

Once your packages are built with debug information and are not stripped, you just need to get the backtrace. To do so you need the sys-devel/gdb package. It contains the GNU debugger ( gdb ). After installing that, you can proceed with getting the backtrace. The simplest way to get one is to run the program from inside gdb . To do so, you need to point gdb to the path of the program to run, give it the arguments it will need, and then run it:


CodeRunning ls through gdb

$ gdb /bin/ls
GNU gdb 6.4
[...]

(gdb) set args /usr/share/fonts
(gdb) run
Starting program: /bin/ls /usr/share/fonts
[Thread debugging using libthread_db enabled]
[New Thread 47467411020832 (LWP 11100)]
100dpi  aquafont     baekmuk-fonts  cyrillic  dejavu     fonts.cache-1  kochi-substitute  misc  xdtv
75dpi   arphicfonts  CID            default   encodings  fonts.dir      mikachan-font     util

Program exited normally.
(gdb)

The message "Program exited normally" means that the program exited with the code 0. That means that no errors were reached. You shouldn't trust that too much, as there are programs that exit with status 0 when they reach error conditions. Another common message is "Program exited with code nn ". That simply tells you which non-zero status code they returned. That might imply a handled or expected error condition. For segmentation faults and aborts, you get instead a "Program received signal SIGsomething" message.

When a program receives a signal, it might be for many different reasons. In case of SIGSEGV and SIGABRT (respectively segmentation fault and abort), it usually means the code is doing something wrong, like doing a wrong syscall or trying to access memory through a broken pointer. Other common signals are SIGTERM, SIGQUIT and SIGINT (the latter is received when CTRL-C is sent to the program, and usually gets caught by gdb and ignored by the program).

Finally there is the series of "Real-Time events". They are named SIG nn with nn being a number greater than 31. The pthread implementation usually uses them to syncronise the different threads of the program, and thus they don't represent error conditions of any sort. It's easy to provide meaningless backtraces when confusing the Real-Time signals with error conditions. To prevent this, you can tell gdb to not stop the program when they are received, and instead pass them directly to the program, like in the following example.


CodeRunning xine-ui through gdb, ignoring real-time signals.

$ gdb /usr/bin/xine
GNU gdb 6.4
[...]

(gdb) run
Starting program: /usr/bin/xine
[...]

Program received signal SIG33, Real-time event 33.
[Switching to Thread 1182845264 (LWP 11543)]
0x00002b661d87d536 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
(gdb) handle SIG33 nostop noprint noignore pass
Signal        Stop      Print   Pass to program Description
SIG33         No        No      Yes             Real-time event 33
(gdb) kill
Kill the program being debugged? (y or n) y
(gdb) run

The handle command tells gdb what it should do when the given signal is sent to the command; in this case the flags are nostop (don't stop the program returning the command to the debugger), noprint (don't bother printing the reception of such a signal), noignore (don't ignore the signal — ignoring signals is dangerous, as it means discarding them without passing them to the program), pass (pass the signal to the debugged program).

After the eventual Real-Time events are being ignored by gdb , you should try to reproduce the crash you want to report. If you can reproduce it systematically, it's quite easy. When gdb tells you that the program received the SIGSEGV or SIGABRT signal (or whatever else signal might represent the error condition for the program), you'll have to actually ask for the backtrace, possibly saving it somewhere. The basic command to do that is bt , which is short for backtrace , which will show you the backtrace of the current thread (if the program is single-threaded, there's only one thread).

An alternative command to get a more detailed backtrace is bt full . That also gives you the information about parameters and local variables to the function where calls are being made (when they are available and not removed by optimisations). This makes the trace longer but also more useful when trying to find, for example, why a pointer is uninitialised.

Lately it's not rare that even simple programs are written with multiple threads, making the use of a simple bt output, albeit meaningful, quite useless, as it might represent the status of a thread different from the one in which the signal is thrown, or from the one where the error condition manifested (in case there's another thread responsible for throwing signals). For this reason, you should instead get the trace with the longer command thread apply all bt full , that tells the debug to print the full tracing of all the threads currently running.

If the backtrace is short, it's easy to copy and paste it out of the terminal (unless the failure happens on a terminal without X), but sometimes it's just too long to be copied easily, because it spans over multiple pages. To be able to get the backtraces on a file to attach to a bug, you can use the logging feature:


CodeUsing logging feature to save the backtrace to file

$ gdb /usr/bin/xine
GNU gdb 6.5
[...]

(gdb) run
[...]

(gdb) set logging file backtrace.log
(gdb) set logging on
Copying output to backtrace.log.
(gdb) bt
#0  0x0000003000eb7472 in __select_nocancel () from /lib/libc.so.6
...
(gdb) set logging off
Done logging to backtrace.log.
(gdb) quit

Now you can get the backtrace in the backtrace.log file, and just send it via email or attach that file to the related bug.


Core dumps

Sometimes the crashes are difficult to reproduce, the program is vastly threaded, it's too slow to run in gdb or it's messed up when run through it (shouldn't surprise anybody that running inside the debugger there are more bugs than are reproducible without the debugger itself). In these cases, there is one tool that comes in useful: the core dump.

A core dump is a file that contains the whole memory area of a program when it crashed. Using that file, it's possible to extract the stack backtrace even if the program has crashed outside gdb , assuming core dumps are enabled. By default core dumps are not enabled on Gentoo Linux (they are, however, enabled by default on Gentoo/FreeBSD ), so you have to enable them.

The core dump files are generated directly by the kernel; for this reason, the kernel need to have the feature enabled at build time to work properly. While all the default configurations enable core dump files, if you're running an embedded kernel, or you have configured otherwise standard kernel features, you should verify the following options:


Note
You can skip this step if you haven't enabled the “Configure standard kernel features” option at all, which you shouldn't have if you don't know whether you did.


CodeKernel options to enable core dumps

General Setup --->
   Configure standard kernel features --->
      Enable ELF core dumps

Core dumps can be enabled on the system level or the shell session level. In the first case, everything in the system that crashes and does not have already a crash handler (see later for more notes about KDE's crash handler) will dump. When enabled at shell session level, only the programs started from that session will leave behind a dump.

To enable core dumps on a system level, you have to edit either /etc/security/limits.conf (if you're using PAM, as is the default) or /etc/limits.conf . In the first case, you must define a limit (whether hard or, most commonly, soft; for core files, that might be anywhere from 0 to no limit). In the latter case, you just need to set the variable C to the size limit of a core file (here there's no "unlimited").


CodeExample of rule to get unlimited core files when using PAM

# /etc/security/limits.conf
*             soft      core       unlimited


CodeExample of rule to get core files up to 20MB when not using PAM

# /etc/limits.conf
*       C20480

To enable core files on a single shell session you can use the ulimit command with the -c option. 0 means disabled; any other positive number is the size in KB of the generated core file, while unlimited simply removes the limit on core file dimension. From that point on, all the programs that exit because of a signal like SIGABRT or SIGSEGV will leave behind a core file that might be called either "core" or "core. pid " (where pid is replaced with the actual pid of the program that died).


CodeExample of ulimit use

$ ulimit -c unlimited
$ crashing-program
[...]
Abort (Core Dumped)
$ 


Note
The ulimit command is an internal command in bash and zsh. On other shells it might be called in other ways or might even not be available at all.

After you get a core dump, you can run gdb on it, specifying both the path to the file that generated the core dump (it has to be the same exact binary, so if you recompile, the core dump is useless) and the path to the core file. Once you have gdb open on it, you can follow the same instructions given above as it had just received the signal killing it.


CodeStarting gdb on a core file

$ gdb $(which crashing-program) --core core

As an alternative, you can use gdb 's command-line capabilities to get the backtrace without entering the interactive mode. This also makes it easier to save the backtrace in a file or to send it to a pipe of any kind. The trick lies in the --batch and -ex options that are accepted by gdb . You can use the following bash function to get the full backtrace of a core dump (including all threads) on the standard output stream.


CodeFunction to get the whole backtrace out of a core dump

gdb_get_backtrace() {
    local exe=$1
    local core=$2

    gdb ${exe} \
        --core ${core} \
        --batch \
        --quiet \
        -ex "thread apply all bt full" \
        -ex "quit"
}


KDE crash handler's notes

KDE-based applications runs by default with their own crash handler, which is presented by the user by the means of "Dr. Konqi" if it's installed (the package is either kde-base/kdebase or kde-base/drkonqi (included in kdebase-meta ). This crash handler shows the user an informative dialog telling him that the program has crashed. On this dialog there is a "Backtrace" tab that, when loaded, calls gdb and makes it load the data and generate the full backtrace on the behalf of the user, showing it in the main text box and allowing it to be saved directly to a file. That backtrace is usually good enough for reporting a problem.

When drkonqi is not installed, the crashes won't generate a core dump anyway, and the user will receive no information by default. To avoid this, it's possible to use the --nocrashhandler argument on all the KDE-based applications. That disables the crash handler entirely and leaves the signals to be handled by the operating system as usual. This is useful to generate core files when drkonqi is not available or when wanting to inspect stack frames by hand.


Acknowledgements

We would like to thank the following authors and editors for their contributions to this guide:

  • Diego E. Pettenò
  • Ned Ludd
  • Kevin Quinn
  • Donnie Berkholz