#738 closed defect (fixed)

failure in block hash tree

Reported by: midnightmagic
Owned by: bdew
Priority: critical
Milestone: 1.5.0
Component: code-encoding
Version: 1.4.1
Keywords: integrity
Cc: bdew, midnightmagic
Launchpad Bug:

Description (last modified by warner)

Running tahoe on the machine on which "python2.5 setup.py test" fails (as reported in trac ticket #737) generates the attached incident report.

Brief summary from flog debugger viewer:

<ValidatedReadBucketProxy #1>
(2-<ReadBucketProxy 150097484 to peer [mgnbpxki] SI jow42sylefxjxsns3alv5ptghe>-m2st7xk76cvd):
 hash failure in block=0, shnum=2
 on <ReadBucketProxy 150097484 to peer [mgnbpxki] SI jow42sylefxjxsns3alv5ptghe>

[...]

<BlockDownloader #2>(<ValidatedReadBucketProxy #3>-2): failure to get block

[... etc ...]

failure in block hash tree

.. and so on.

Actual cli error message is:

tahoe get URI:CHK:lapry55oui4psmeiyxhvitfmpi:75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq:3:5:99758080 meh

Error, got 410 Gone
NotEnoughSharesError: no shares could be found. Zero shares usually indicates
a corrupt URI, or that no servers were connected, but it might also indicate
severe corruption. You should perform a filecheck on this object to learn
more.

Finally, dump-share output for the 33MB share file:

share filename: /v/tahoe/.tahoe/storage/shares/jo/jow42sylefxjxsns3alv5ptghe/2
             version: 1
           file_size: 99758080
        num_segments: 762
        segment_size: 131073
       needed_shares: 3
        total_shares: 5

          codec_name: crs
        codec_params: 131073-3-5
   tail_codec_params: 11529-3-5

      crypttext_hash: zodzh33f7mnowsxine5mzejiahlxsilgggpxmop5bfrh4zzzdpha
 crypttext_root_hash: nuqsysg5zqkz5nsvpi32n5h6h5ilbepsbvmssji2xv773kqw53tq
     share_root_hash: m2st7xk76cvdutgf5lzmkdzbf72h75cxpkytwjegi5jgntir3u5q
            UEB_hash: 75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq
          verify-cap: URI:CHK-Verifier:jow42sylefxjxsns3alv5ptghe:75mb37to6iauypych6bkqfkxxxfk6nhekhomzipkqzwt46v64hdq:3:5:99758080

 Size of data within the share:
                data: 33252694
       uri-extension: 325
          validation: 196648

 Lease #0: owner=0, expire in 2607324s (30 days)
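
(For orientation: these parameters are mutually consistent, assuming Tahoe's usual CHK layout in which each segment is erasure-coded into blocks of segment_size/k bytes and the tail segment is padded up to a multiple of k. A quick illustrative check, derived only from the dump-share output above:)

    # illustrative arithmetic only, not Tahoe code
    file_size, segment_size, k = 99758080, 131073, 3

    num_segments = -(-file_size // segment_size)                  # 762
    block_size   = segment_size // k                              # 43691 bytes, ~42.7 KiB
    tail_raw     = file_size - (num_segments - 1) * segment_size  # 11527
    tail_padded  = -(-tail_raw // k) * k                          # 11529, matching tail_codec_params

The ~42.7 KiB block size is also why the badhash-* attachments below are each 42.7 KB.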

Machine details:

NetBSD quickie 4.99.7 NetBSD 4.99.7 (quickie) #0: Tue Jan 2 14:47:23 PST 2007 root@quickie:/v/src/sys/arch/i386/compile/quickie i386 

AMD Athlon(tm) XP 2500+ (single-core, 32-bit) 2.5GB RAM 

Python 2.5.2 

This machine is partway through the transition from the M:N to the 1:1 threading model. The M:N threads *should* be functional, and for all system and most application purposes they are (KDE, etc.). However, on occasion some software makes assumptions, or is built without threading support, because configure detected anomalous behaviour.

NOTE: The share file IS AVAILABLE UPON REQUEST. I will save it for posterity.

Attachments (6)

incident-2009-06-16-211442-qvfj7eq.flog.bz2 (279.3 KB) - added by midnightmagic at 2009-06-17T18:38:54Z.
Incident report. View with: flogtool web-viewer -p 8081 incident-2009-06-16-211442-qvfj7eq.flog.bz2
738-dumpblock.diff (3.6 KB) - added by warner at 2009-06-24T01:33:33Z.
patch to dump the offending block into logs/
badhash-from-2ccpv6ww-SI-jow42sylefxjxsns3alv5ptghe-shnum-0-blocknum-0 (42.7 KB) - added by midnightmagic at 2009-06-29T23:22:39Z.
first badhash
badhash-from-r4tndnav-SI-jow42sylefxjxsns3alv5ptghe-shnum-1-blocknum-0 (42.7 KB) - added by midnightmagic at 2009-06-29T23:22:46Z.
badhash-from-rzozr3qr-SI-jow42sylefxjxsns3alv5ptghe-shnum-2-blocknum-0 (42.7 KB) - added by midnightmagic at 2009-06-29T23:22:59Z.
should be last attachment for this..
incident-2009-06-29-161825-brn5ypi.flog.bz2 (328.9 KB) - added by midnightmagic at 2009-06-29T23:37:51Z.
incident file to go with badhashes (i believe.)


Change History (24)

Changed at 2009-06-17T18:38:54Z by midnightmagic

Incident report. View with: flogtool web-viewer -p 8081 incident-2009-06-16-211442-qvfj7eq.flog.bz2

comment:1 Changed at 2009-06-17T18:39:42Z by midnightmagic

  • Component changed from unknown to code-storage
  • Owner nobody deleted

comment:2 Changed at 2009-06-17T19:22:32Z by zooko

  • Component changed from code-storage to code-encoding
  • Keywords integrity added
  • Milestone changed from undecided to 1.5.0
  • Priority changed from major to critical

I want to know if Tahoe-LAFS unit tests pass (excluding the one that locks up as described in #737) and if pycryptopp unit tests pass.

comment:3 Changed at 2009-06-18T03:20:50Z by warner

  • Description modified (diff)

(wrapped some of the description text to improve formatting)

I've looked a bit at the share file you sent me, and it seems ok (no corruption that I've seen so far). My next step is to examine the Incident report and see if I can figure out exactly which hash check failed, and compare them against hashes that I'll generate locally from that share.

Another approach will be to get a copy of two more shares, put them in a private grid, and attempt to download the file. If successful, the shares must be ok, and we'll focus on how the download process might be acting differently on your host.

comment:4 follow-up: Changed at 2009-06-18T18:07:51Z by warner

I've looked at that report and compared it against the scrubber that I wrote (a server-side share verification tool). It looks like your client is generating a different hash for the first data block than it's supposed to. The incident report contains a copy of the first 50 bytes and the last 50 bytes of the block, and they match what I'm getting out of the share.

So, either your client is incorrectly computing the SHA256d hash of that 43kB-ish data, or it is using a block of data that is corrupted somewhere in the middle. Your client seems to compute the rest of the hash tree correctly (and I think you might have said that pycryptopp tests pass on this platform), so it seems like SHA256d is working in general. So that points to either the wrong hash tag (in allmydata.util.hashutil), or some sort of transport-level error that is corrupting or inserting/deleting data in the middle of the block.
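
(For reference, a minimal sketch of the tagged SHA256d construction being discussed, assuming the standard double-SHA-256 over a netstring-prefixed tag plus the block contents; the tag string and block data below are placeholders, not the exact values used by allmydata.util.hashutil:)

    import hashlib

    def netstring(s):
        # netstring framing: b"<decimal length>:<bytes>,"
        return b"%d:%s," % (len(s), s)

    def sha256d(data):
        # double SHA-256, the primitive referred to above
        return hashlib.sha256(hashlib.sha256(data).digest()).digest()

    def tagged_hash(tag, data):
        # hash of netstring(tag) + data; per comment:9 below, the real
        # block-data tag plus its netstring framing is 33 bytes long
        return sha256d(netstring(tag) + data)

    block_data = b"\x00" * 43691                    # placeholder ~42.7 KiB block
    block_hash = tagged_hash(b"placeholder-tag", block_data)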

I've just pushed some known-answer tests to confirm that allmydata.util.hashutil is working correctly: could you pull a trunk tree, build, and then specifically run "make test TEST=allmydata.test.test_util"? I know that an earlier test is hanging on your system; by running the later test_util directly, we can rule out this one hypothesis.
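
(The real vectors live in the HashUtilTests of allmydata.test.test_util; the fragment below only sketches the known-answer-test style, using the standard FIPS-180 "abc" vector rather than Tahoe's own tagged hashes:)

    import hashlib, unittest

    class KnownAnswerSketch(unittest.TestCase):
        # illustrative only; not the vectors used by test_known_answers
        def test_sha256_abc(self):
            expected = ("ba7816bf8f01cfea414140de5dae2223"
                        "b00361a396177a9cb410ff61f20015ad")
            self.assertEqual(hashlib.sha256(b"abc").hexdigest(), expected)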

If that passes, the next step will be to patch the download code to save the full block to disk, so we can examine it and see if it matches what it's supposed to be.

comment:5 in reply to: ↑ 4 Changed at 2009-06-19T19:27:28Z by midnightmagic

Replying to warner:

I've just pushed some known-answer tests to confirm that allmydata.util.hashutil is working correctly: could you pull a trunk tree, build, and then specifically run "make test TEST=allmydata.test.test_util"? I know that an earlier test is hanging on your system; by running the later test_util directly, we can rule out this one hypothesis.

If that passes, the next step will be to patch the download code to save the full block to disk, so we can examine it and see if it matches what it's supposed to be.

My system make is a form of bmake. I had to run gmake to execute your tests. However, the result is apparently success:

Ran 63 tests in 5.852s

PASSED (successes=63)

In doing so, I found that the default, lone python interpreter was actually different from the one I was manually using to build and install tahoe. Additionally, my tahoe installation is system-wide. I will (hrm.. manually?) de-install it and try again with a proper symlink from /usr/pkg/bin/python to python2.5. (Long shot with no evidence, I know.. but still.)

Nope. Same problem.

I'm sorry I don't know more about Python or I would be a *lot* more useful to you.

Changed at 2009-06-24T01:33:33Z by warner

patch to dump the offending block into logs/

comment:6 follow-up: Changed at 2009-06-24T01:36:05Z by warner

OK, if you would, please apply the 738-dumpblock.diff patch to a trunk tree, then perform the failing download again. That patch will write out the block-with-bad-hash into logs/badhash-STUFF ... then please attach that file to this ticket, and we'll compare it against the block we were supposed to get, to see if your machine is receiving bad data or computing the hashes incorrectly.
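
(Roughly, the idea behind the patch looks like the sketch below; this is not the actual 738-dumpblock.diff, and the helper function is hypothetical, though the filename pattern mirrors the badhash-* attachments on this ticket:)

    import os

    def dump_bad_block(logdir, peerid, storage_index, shnum, blocknum, data):
        # hypothetical helper: save a block whose hash check failed, so it
        # can later be compared against the block the server actually holds
        fname = "badhash-from-%s-SI-%s-shnum-%d-blocknum-%d" % (
            peerid, storage_index, shnum, blocknum)
        with open(os.path.join(logdir, fname), "wb") as f:
            f.write(data)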

Also, double-check that "test_known_answers" was in the output of the "test_util" run that you just did, to make sure that your tree was new enough to have the tests I added.

thanks!

comment:7 Changed at 2009-06-29T23:22:10Z by midnightmagic

Excellent! I got some badhashes! I *believe* they're from that same file; I recognise the beginning of the hash reference, anyway (jo blah blah).

I'll try to attach them to this note.

Changed at 2009-06-29T23:22:59Z by midnightmagic

should be last attachment for this..

comment:8 in reply to: ↑ 6 Changed at 2009-06-29T23:36:11Z by midnightmagic

Also, double-check that "test_known_answers" was in the output of the "test_util" run that you just did, to make sure that your tree was new enough to have the tests I added.

Indeed it was, and here is a new test run with the latest trunk (via darcs pull), which includes the patch that generated the badhash-* files:

  HashUtilTests
    test_chk ...                                      [OK]
    test_constant_time_compare ...                    [OK]
    test_hashers ...                                  [OK]
    test_known_answers ...                            [OK]
    test_random_key ...                               [OK]
    test_sha256d ...                                  [OK]
    test_sha256d_truncated ...                        [OK]
[ ... and etc ...]
  TimeFormat
    test_epoch ...                                    [OK]
    test_epoch_in_London ...                          [OK]

---------------------------------------------------------
Ran 63 tests in 5.815s

PASSED (successes=63)

Changed at 2009-06-29T23:37:51Z by midnightmagic

incident file to go with badhashes (i believe.)

comment:9 Changed at 2009-06-30T00:58:30Z by warner

pycryptopp#24 opened.

midnightmagic and I were able to narrow this down to a failure in pycryptopp, in which hashing a 128-byte string in two chunks of sizes (33, 95) gets the wrong value on NetBSD. The block-data hasher uses a tag (including netstring padding) of length 33. I suspect that the actual problem occurs with any block size BS such that (33+BS)%128 == 0.
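
(A sketch of the invariant being tested, using Python's hashlib as the reference implementation; the actual bug was in the assembly code path of Crypto++'s SHA-256 as wrapped by pycryptopp, where the analogous chunked computation came out wrong:)

    import hashlib, os

    data = os.urandom(128)

    one_shot = hashlib.sha256(data).hexdigest()

    h = hashlib.sha256()
    h.update(data[:33])   # 33-byte, tag-sized prefix
    h.update(data[33:])   # remaining 95 bytes
    chunked = h.hexdigest()

    # incremental hashing must match one-shot hashing; on the affected
    # NetBSD/i386 build of pycryptopp the equivalent check failed
    assert one_shot == chunked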

This smells a lot like pycryptopp#17, which was an ARM-specific alignment issue that corrupted AES output on certain chunk sizes. I haven't looked deeply at the SHA256 code yet, but I suspect the same sort of bug, this time affecting i386.

comment:10 Changed at 2009-06-30T02:23:04Z by warner

How interesting... Black Dew's debian/i386 buildslave (which has experienced hangs in the test suite that look fairly similar to the hangs midnightmagic has seen) fails the new pycryptopp#24 test in exactly the same way.

comment:11 Changed at 2009-06-30T03:27:21Z by zooko

  • Cc bdew midnightmagic added
  • Owner set to bdew

Adding Cc: bdew, midnightmagic so they will know that there is something they can do to help. Setting 'assigned to' to bdew at random.

comment:12 Changed at 2009-06-30T19:44:52Z by zooko

A-ha! Now Black Dew's buildslave got an internal compiler error in g++ while building Crypto++:

http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/25/steps/build/logs/stdio

This suggests to me that the machine has hardware problems.

This raises an interesting question: what, if anything, can Tahoe-LAFS do to be robust, and to fail clearly and gracefully, in the presence of hardware problems such as flaky RAM?

comment:13 Changed at 2009-07-02T20:46:37Z by zooko

See also Black Dew's discoveries over in http://allmydata.org/trac/pycryptopp/ticket/24 . Crypto++ is being built to use the MOVDQA instruction, which may be buggy on his AthlonXP.

comment:14 Changed at 2009-07-03T22:29:53Z by midnightmagic

So! Just as an update: this particular issue may be explained by the fact that Crypto++ on my machine seems to fail (when ASM optimizations are turned on) on a test program I wrote. See the aforementioned pycryptopp ticket for more details and the test program.

comment:15 follow-up: Changed at 2009-07-03T23:01:19Z by midnightmagic

SUCCESS! Rebuilding pycryptopp without ASM optimizations makes it pass the chunked SHA256 test. Setting PYTHONPATH to that top-level directory makes "tahoe ls" Just Work on an existing grid, and the previously failing command NOW WORKS PERFECTLY.

So there is a patch for setup.py in pycryptopp/ticket/24 which detects the affected platform and turns off assembly optimizations on just that specific platform and bit-width (32-bit).
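
(The workaround is roughly of the following shape; this is a sketch, not the exact patch attached to pycryptopp#24. CRYPTOPP_DISABLE_ASM is Crypto++'s own compile-time switch for turning off its assembly code paths:)

    import platform, sys

    extra_compile_args = []

    # sketch: disable Crypto++ assembly on 32-bit i386 NetBSD, where the
    # SSE2 path (MOVDQA) misbehaves on this Athlon XP
    if sys.platform.startswith("netbsd") and \
       platform.machine() in ("i386", "i486", "i586", "i686"):
        extra_compile_args.append("-DCRYPTOPP_DISABLE_ASM")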

I would say that if bdew can do the same and it works, we can add platform detection for his machine as well, and likely close all of these tickets out until Crypto++ fixes its CPU-feature detection.

comment:16 in reply to: ↑ 15 Changed at 2009-07-06T05:07:00Z by midnightmagic

Wow, Wei Dai is fast. Check it out, he's fixed the problem already in SVN:

Crypto++ SVN Rev 470

Impressive.

comment:17 Changed at 2009-07-06T23:18:28Z by midnightmagic

pycryptopp trunk now works perfectly (well, the tests don't fail, anyway) on all three machines listed in pycryptopp ticket 24. Using pycryptopp trunk, I now have apparently perfectly-working tahoe nodes where before they were only remotely usable.

Therefore, I believe this ticket can be closed, from my perspective. If 1.5.0 is going to include these fixes, then all's well!

comment:18 Changed at 2009-07-07T19:09:34Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

Fixed by 9578e70161009035, which raises the requirement on pycryptopp to >= 0.5.15. Note, however, that if you are building pycryptopp against an external libcryptopp, you may still have this bug if your libcryptopp has it.
