Diagnosing segmentation faults in reconstruction software

From GlueXWiki
Jump to: navigation, search

As Matt and Dave have reported, 32-bit builds of the GlueX reconstruction code currently generate segfaults if built with default optimization -O2. I have verified this, with subversion releases as recent as 7585. These segfaults sometimes show up as hangs, due to a memory access deadlock between threads that can take place when the signal handler tries to generate the backtrace. But the hangs are just segfaults in disguise. The segfaults occur in the method DTrackCandidate_factory_CDC::FindThetaZRegression() inside the STL sort algorithm that is used to sort the cdc hits in order of increasing radius, before the regression is carried out.

Executive Summary

Basically the current SortInteractions(A,B) method that compares two DVector3 objects violates the specification for the STL sort compare method in a subtle way that affects ONLY code that is generated with the x87 floating point unit math. The x87 unit does floating point math in internal registers with 80-bit precision, but when it writes between memory and these registers it down-converts to 64-bit precision. This means that if you compute a value and store it in memory and then pull it back into a register, the value in the register is different from what you had originally. It is this irreversibility that caused the problem.

This recalls the suggestion heard at a recent offline software meeting to go with "-mathfp=sse" across all platforms, and avoid issues with rounding in the x87 math instructions.

If you are interested in the details, read on.

Requirements for the STL sort algorithm

According to the http://www.cplusplus.com reference on STL sort, the non-default comp function must respect the following restrictions:

comp - Comparison function object that, taking two values of the same type than those contained in the range, returns true if the first argument goes before the second argument in the specific strict weak ordering it defines, and false otherwise.

What is this "strict weak ordering"? Getting a rigorous definition requires wading through some jargon from set theory, but one of the consequences is that if the value S is a member of the set then running comp(A,S) for all A in the set must return false at least once, as A iterates over the set. This is a requirement that the author of comp(A,S) must guarantee in actual practice, because assuming that this is the case helps the algorithm be more efficient. Writing a comp(A,S) function that behaves correctly for x87 math requires that the inequality be written to cover the irreversibility window of the 80/64 bit doubles. Consider the following function:

   //  a special kind of weed killer designed for x87 math
   #define ROUND_UP 1.0000000000000001
   SortComp(DVector3 &A, DVector3 &B) {
       Double_t rA = A.Perp();
       Double_t rB = B.Perp();
       if (rA < rB) return true;        //  buggy, violates strict weak ordering in x87 math context
       if (rA*ROUND_UP < rB) return true;   // obeys strict weak ordering in x87 math context
       return false;

What was actually happening

The STL sort algorithm uses a divide-and-conquer approach to sort a vector. The particular case I diagnosed had 93 members, the first of which was the beam axis and the rest of which were CDC stereo hits, apparently. In the first bifurcation, the sort algorithm split the vector at 45, and started sorting the top 47 members. By [bad] luck, the maximum radius in all of the hits above element 45 was in member (you guessed it) 45, which had the following value at 80-bit precision


Now the sort algorithm wants to compare this value with all of the members from 45 to 93. It turns out that this value is the largest one in the list, so the SortIntersections(A,B) function will return true for all values except the first one, except.... it returns true for that one too! At first it seems like (A<A) should never be true, but it can be in the weird world of x87 math. Inside the SortIntersections(A,B) code with A=B, the Perp() of the A argument gets computed first, then saved on the stack while B.Perp() is computed, then popped back. Problem is that when the above value is stored in memory and then pulled back into a floating point register, its value has changed to:


So now (A<A) is true, resulting in a violation of the "strict weak ordering" requirement. The memory past the end of the vector happened to have a lot of zeros in it, resulting in a continuation of (A<B) being true, until the iterator walked right off the end of the data segment in memory.

Is this a compiler optimization error?

I would say no, because the compiler code is as correct as x87 math allows it to be. Maybe in this particular case compiling with -O0 happens to force both values to be stored/restored from memory before they are compared, but I don't see why that MUST be the case, even for code with -O0. In writing this compiler, even if I want to store all intermediate values in memory, there is no reason why I have to dump the copy in the register and restore it from memory before doing the compare. Given a little time, I am fairly certain I could find a test case where (A<A) could evaluate to true even with -O0. Anyone writing code for x87 math has to allow for the fact that digits beyond the 64-bit precision boundary cannot be relied upon to be stable. This is an issue the user must deal with, no way around it. The compiler simply cannot be required to write every value out to memory and read it back again between every floating point operation. Either we must allow for random fluctuations in the bottom 16 bits of doubles in all of our code, or we need to specify that x87 math breaks our reconstruction code.

How to diagnose this kind of error

  1. find the event in which it occurs, and create an input file containing just this event.
  2. verify that the error occurs in processing the one-event file. If not, start going backwards in increasing intervals, adding events prior to the faulting event, to find a minimal data context for reproducing the error.
  3. build the code with symbols (option -g) but with the optimization unchanged (-O2 by default in the GlueX BMS system)
  4. run the symbol-loaded application in the gdb debugger and let it segfault. Use "info stack" to print out the backtrace of the faulting thread and record for future use.
  5. run again in gdb, this time setting a breakpoint in the function/method in which the segfault occurred. This may be several stack frames above level 0 in the backtrace, but should be the lowest stack frame that contains user code.
  6. the first time you hit the breakpoint, issue the command "c 999999999" to continue. This will segfault again, but this time you will be able to find out how many times the given function is entered before the fatal one. When the fault occurs, type "info break" to see how many times the breakpoint was hit after you did "c 999999999". Record the difference between 999999999 and this number as B.
  7. run again in gdb, set the above breakpoint again, but this time do "c <B>" after the first breakpoint, where <B> is the number found in the previous step. This time it will stop again at the breakpoint right before the point where the fault occurs.
  8. set more breakpoints further down in the code, or single-step forward, until you reach the STL algorithm inside which the segfault occurs.
  9. switch to mixed c / assembly language listings using the "layout split" command. It helps at this point to have a large terminal window open, so you can view everything at once.
  10. step forward one instruction at a time using the "stepi" command (abbreviated "si"). You can mostly ignore the vain attempts gdb is making to keep track of where you are in the c++ code window. The optimizer is smarter than gdb, and gdb flails around guessing (usually within a dozen statements, but rarely better than 2-3) where the execution point in the c++ code actually is. However, the assembly language window is precise and easy to read, and the c++ code listing is there to remind you what the instructions are trying to do.
  11. just "si" forward (or "ni" to skip over calls to trivial functions that you think are working ok) just like you would use "s" (or "n") in stepping through unoptimized c code.
  12. most of the interesting stuff is in the arguments passed to functions, and the values returned from functions. These are passed on the stack, and you can just read the values right off the stack right before the "call <SUB>" instruction. For example, the gdb command "x/4x $esp" reads the first 4 elements on the stack, which are the first 4 arguments to your function at the "call <SUB>" instruction, if simple values or pointers. See "x86 calling conventions" on wikipedia for details about the "cdecl" calling convention for more complex types, and for how return values are stored upon the "ret" instruction.
  13. with the c code listed in the upper window (don't pay any attention to where the program pointer is pointing, just read the code and correlate it to what you see in the assembly listing) it is very easy to see what the code is doing, and follow the values of local variables as they are passed around between processor registers. Floating point numbers live in the registers $st0, $st1, ... and can be printed out using a gdb command like "p $st0". Both the c code and the assembly windows scroll forward, tracking with your execution point as you advance through the code. At any point you can set a breakpoint at a particular instruction, or a watchpoint at a particular memory location.

It's a lot like watching a bad movie! Have fun with it.