#19 closed defect (fixed)

Segmentation fault in HashMultipleBlocks

Reported by: francois Owned by: nejucomo
Priority: major Milestone:
Version: 0.5.1 Keywords:
Cc: Launchpad Bug:

Description

During some Tahoe development, I happen to run "make quicktest" frequently. Sometimes, in a non-reproducible way, a segmentation fault kills the Python interpreter before the end of the test run.

Fortunately, I was able to catch it in gdb, here is the beginning of the stackstrace.

(gdb) bt
#0  CryptoPP::IteratedHashBase<unsigned int, CryptoPP::HashTransformation>::HashMultipleBlocks (this=0x7fe5a00f7370, input=0x79cffd4, 
    length=<value optimized out>) at cryptopp/misc.h:674
#1  0x00007fe5aa755a3f in CryptoPP::IteratedHashBase<unsigned int, CryptoPP::HashTransformation>::Update (this=0x7fe5a00f7370, input=0x58e0054 "", 
    len=140621524238411) at cryptopp/iterhash.cpp:56
#2  0x00007fe5aa72b899 in CryptoPP::PK_MessageAccumulatorBase::Update (
    this=0x7fe5a00f72b0, input=0x58e0054 "", length=140621524238411)
    at ./cryptopp/pubkey.h:292
#3  0x00007fe5aa7a2f5a in CryptoPP::PK_Verifier::VerifyMessage (
    this=0x7fe5a0066590, message=0x58e0054 "", messageLen=140621524238411, 
    signature=0x7fe5a01ba214 "dJ\232��\r�a�\"\204\216uY�\213�P�6_�UL\211\004�\035\037?}Xh$\ro�\022s�r�\0344خS�HE�\236m}�\216\027,�\224\213O���!�3�'�]�LX���I�\031\223\v\224�\232f\037��)����\204]\021$\021�LB\226�Ʊ�.�3��J\177\2100�\0211��\022\220\032\026YQS|�\004\207\230�\017�qG�\t\006\212x��a\226�vx����\223\002t�lK&ܾįX\006\\+I\225��\021�F\220��;\\�\aT3\036��l�\216\017�,�\217\vI�"..., 
    signatureLength=256) at cryptopp/cryptlib.cpp:681
#4  0x00007fe5aa7b015d in ?? ()
   from /home/francois/WORK/dev/tahoe-534/pycryptopp-0.5.10-py2.6-linux-x86_64.egg/pycryptopp/publickey/rsa.so
#5  0x00000000004a2b03 in PyEval_EvalFrameEx ()
#6  0x00000000004a3dae in PyEval_EvalFrameEx ()
#7  0x00000000004a4649 in PyEval_EvalCodeEx ()

Tahoe's Makefile is automatically installing this version of pycryptopp:

pycryptopp-0.5.10-py2.6-linux-x86_64.egg

Catch me on IRC if more gdb black magic is required.

Change History (14)

comment:1 Changed at 2009-04-27T18:45:19Z by zooko

Thank you for the bug report! Could you please get the pycryptopp source code itself and try something like the following to see if pycryptopp reliably passes its own unit tests on your system:

/bin/true
while [ $? = 0 ] ; do
  python ./setup.py test
done

comment:2 Changed at 2009-05-01T12:14:39Z by francois

I've run the unit tests on both 0.5.12 and latest trunk for a few hours without any errors.

comment:3 Changed at 2009-08-27T02:47:47Z by zooko

  • Owner set to francois

François: do you think this might have been a hardware problem on your x86-64 machine? Have there been any known bugs in pycryptopp since this that would explain this bug? Would you be so kind as to re-run this test again using the current version of pycryptopp?

comment:4 Changed at 2009-09-01T03:07:56Z by zooko

Aaron Cordova reported a similar-looking problem with Snow Leopard:

http://allmydata.org/pipermail/tahoe-dev/2009-August/002751.html

comment:5 Changed at 2009-09-08T04:58:40Z by nejucomo

When looking at the stack trace I see that the message size is ~140 Terrabytes, so either Francois has quite the heavy-duty application, or this is related to the corruption.

Here's an untested hypothesis:

PyArg_ParseTupleAndKeywords is called with "t#" and passed msgsize, a Py_ssize_t, to receive the length.

The documentation on python argument parsing says "t#" is like "s#" and "s#" says:
"""
Starting with Python 2.5 the type of the length argument can be controlled by defining the macro PY_SSIZE_T_CLEAN before including Python.h. If the macro is defined, length is a Py_ssize_t rather than an int.
"""
See: http://docs.python.org/c-api/arg.html

So maybe on x86_64 and with PY_SSIZE_T_CLEAN toggled the wrong way, PyArg_ParseTupleAndKeywords (which is a varargs function) writes the output arguments using the wrong boundaries.

Another curious potential for a bug is the implicit cast of msgsize to a size_t when calling VerifyMessage?, but I haven't thought that one out yet.

comment:6 Changed at 2009-09-09T06:39:17Z by warner

Nice catch! I just pushed [669] with your fix: set PY_SSIZE_T_CLEAN everywhere, and update the two remaining uses of "int"-sized t# length variables to be Py_ssize_t instead. The segfaults go away, and tests pass on linux/32bit, OS-X/10.6/64bit, OS-X/10.5/probably-64-bit.

I didn't touch that implicit cast. If your analysis points out a problem, give a yell.

francois: could you play with this patch and see if it makes your segfault go away? I know that it happens infrequently enough that it'll be hard to develop confidence by merely running the test suite a few times, but maybe if you observed no crash for three times as long as the usual time it takes to crash, we could treat this one as done..

comment:7 Changed at 2009-09-11T18:55:40Z by francois

Soory, I'm not able to reproduce this bug anymore on my current setup. This is probably due to the upgrade to Ubuntu karmic koala I made a few weeks ago.

comment:8 Changed at 2009-09-11T21:12:34Z by zooko

Oh, something else that we should do: the fix (http://allmydata.org/trac/pycryptopp/changeset/669 ) should fix it only if the problem was that sizeo(size_t) != sizeof(Py_ssize_t) or sizeof(int) != sizeof(Py_ssize_t). So everyone involved (including you, François) please run the following C code and post your results:

#include "Python.h"
#include <stdio.h>
int main(int argc, char**argv) {
  printf('%d, %d, %d", sizeof(int), sizeof(size_t), sizeof(Py_ssize_t));
  return 0;
}

If all three of the values are equal on your system, then Nathan's patch cannot be the fix to the problem on your system. Therefore, I believe that these values will differ on all affected systems. (However, François's upgrade of the OS may have changed these values on his system.)

comment:9 Changed at 2009-09-11T23:13:24Z by zooko

First!

 Wonwin-McBrootles-Computer:~$ cat > test.c
#include "Python.h"
#include <stdio.h>
int main(int argc, char**argv) {
  printf('%d, %d, %d", sizeof(int), sizeof(size_t), sizeof(Py_ssize_t));
  return 0;
}
 Wonwin-McBrootles-Computer:~$ gcc -I/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5/ test.c
test.c: In function ‘main’:
test.c:4: error: missing terminating ' character
test.c:5: error: parse error before ‘return’

Okay, how about this instead:

#include "Python.h"
#include <stdio.h>
int main(int argc, char**argv) {
  printf("%d, %d, %d\n", sizeof(int), sizeof(size_t), sizeof(Py_ssize_t));
  return 0;
}
 Wonwin-McBrootles-Computer:~$ cat > test.c
#include "Python.h"
#include <stdio.h>
int main(int argc, char**argv) {
  printf("%d, %d, %d\n", sizeof(int), sizeof(size_t), sizeof(Py_ssize_t));
  return 0;
}
 Wonwin-McBrootles-Computer:~$ gcc -I/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5/ test.c -o sizes
 Wonwin-McBrootles-Computer:~$ ./sizes
4, 4, 4

The fact that all three are the same size on this machine (Mac OS 10.5 Intel) is consistent with the fact that the pycryptopp and Tahoe-LAFS unit tests have never segfaulted on this machine.

comment:10 Changed at 2009-09-13T18:22:39Z by zooko

Aaron Cordova (Mac OS 10.6 a.k.a. "Snow Leopard") and Ruben Kerkhof (Linux Fedora amd64) both wrote to me and said that their sizes were:

sizeof(int): 4, sizeof(size_t): 8, sizeof(Py_ssize_t): 8

Both of them experienced the crash and for both of them the crash was fixed by this patch. (Right, Aaron?)

Interestingly, my Linux Ubuntu amd64 also has these sizes: 4,4,8, and I've never gotten a crash in pycryptopp. This can be explained by the uninitialized four bytes of the msgsize local variable declared on line 62 being accidentally equal to zero: pycryptopp/publickey/rsamodule.cpp@617#L58.

There are two things that I don't understand yet, though. One is, why didn't valgrind warn me about the use of these uninitialized bits? I ran the code without Nathan's patch with valgrind, and it reported no errors. I then initialized the msgsize to be equal to MAX_INT+2 and this caused the assertion on line 67 to fail, showing that the the call to PyArg_ParseTupleAndKeywords() on line 65 is not writing all the bits of msgsize, but also confusing me further, because how come msgsize ended up < 0 ? For that matter, why isn't size_t and unsigned type?

Next step:

Understand why valgrind doesn't notice the use of uninitialized bits, and why initializing msgsize to an invalid value results in the effect that it does. My guess is that it has to do with the fact that PyArg_ParseTupleAndKeywords() uses the va_arg feature of C to process the arguments one at a time... And could there be a relevant difference in padding/alignment of stack values on my Linux versus on Ruben's Linux and Aaron's Mac OS X ?

comment:11 Changed at 2009-09-13T20:11:58Z by francois

On my machine named 'korn' and running Ubuntu karmic koala.

sizeof(int) = 4 bytes
sizeof(size_t) = 8 bytes
sizeof(Py_ssize_t) = 8 bytes

So, it seems consistent with your analysis.

comment:12 Changed at 2009-09-13T23:08:00Z by zooko

Okay, after fiddling with it all day I generated a minimal test case of a function that declares a local ("automatic") variable and then uses it, which valgrind *does* warn is use of uninitialized bits -- a minimal program in which the variable in question is in the main() function.

Then I generated another test case of a function that declares a local variable and then uses it which valgrind does *not* warn about -- a C function that gets invoked by the Python interpreter.

Then it dawned on me -- duh -- of course! Valgrind memcheck is tracking whether bits have been initialized or not initialized by this process -- and in my minimal test case they have not and in the second test case they have (by some function which finished executing earlier before this function got invoked). It would be nice if valgrind would set its 'V' bits as well as its 'A' bits when the stack pointer changes (http://valgrind.org/docs/manual/mc-manual.html#mc-manual.vaddress ) but as far as I understand, it doesn't, so bits that were written by the last function to use this part of the stack stay "Valid" (initialized) as far as valgrind perceives the world.

Okay, so there are no more mysteries about this bug. The next step is to get Nathan to sign off that he agrees, and then release a new pycryptopp version and push it out to all distributors who have redistributed pycryptopp (e.g. Debian and Ubuntu).

comment:13 Changed at 2009-10-10T16:01:31Z by zooko

  • Owner changed from francois to nejucomo

Nathan: please sign off on this by marking it as "resolved".

comment:14 Changed at 2009-10-12T01:39:57Z by nejucomo

  • Resolution set to fixed
  • Status changed from new to closed
Note: See TracTickets for help on using tickets.