#812 new defect

server-side crawlers: tolerate corrupted shares, verify shares

Reported by: zooko
Owned by: warner
Priority: major
Milestone: undecided
Component: code-storage
Version: 1.4.1
Keywords: reliability
Cc:
Launchpad Bug:

Description (last modified by zooko)

From twistd.log on prodtahoe17 data6:

2009/09/25 13:00 -0700 [-] Log opened.
2009/09/25 13:00 -0700 [-] twistd 2.5.0 (/usr/bin/python 2.5.2) starting up
2009/09/25 13:00 -0700 [-] reactor class: <class 'twisted.internet.selectreactor.SelectReactor'>
2009/09/25 13:00 -0700 [-] Loading tahoe-client.tac...
2009-09-25 20:01:14.954Z [-] Loaded.
2009-09-25 20:01:14.956Z [-] foolscap.pb.Listener starting on 39324
2009-09-25 20:01:14.956Z [-] twisted.conch.manhole_ssh.ConchFactory starting on 8226
2009-09-25 20:01:14.956Z [-] Starting factory <twisted.conch.manhole_ssh.ConchFactory instance at 0x8bfe2cc>
2009-09-25 20:01:14.957Z [-] nevow.appserver.NevowSite starting on 9006
2009-09-25 20:01:14.957Z [-] Starting factory <nevow.appserver.NevowSite instance at 0x8db516c>
2009-09-25 20:01:14.957Z [-] Manhole listening via SSH on port 8226
2009-09-25 20:01:14.958Z [-] twisted.internet.protocol.DatagramProtocol starting on 35896
2009-09-25 20:01:14.958Z [-] Starting protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8db576c>
2009-09-25 20:01:14.960Z [-] (Port 35896 Closed)
2009-09-25 20:01:14.961Z [-] Stopping protocol <twisted.internet.protocol.DatagramProtocol instance at 0x8db576c>
2009-09-27 12:57:40.124Z [-] lease-checker error processing /data6/storage/storage/shares/g6/g6rvkc5jdkgoqhljuxgkquzhvq/5
2009-09-27 12:57:40.130Z [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.5/site-packages/twisted/internet/base.py", line 561, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/usr/lib/python2.5/site-packages/allmydata/storage/crawler.py", line 262, in start_slice
            self.start_current_prefix(start_slice)
          File "/usr/lib/python2.5/site-packages/allmydata/storage/crawler.py", line 321, in start_current_prefix
            buckets, start_slice)
          File "/usr/lib/python2.5/site-packages/allmydata/storage/crawler.py", line 361, in process_prefixdir
            self.process_bucket(cycle, prefix, prefixdir, bucket)
        --- <exception caught here> ---
          File "/usr/lib/python2.5/site-packages/allmydata/storage/expirer.py", line 128, in process_bucket
            wks = self.process_share(sharefile)
          File "/usr/lib/python2.5/site-packages/allmydata/storage/expirer.py", line 171, in process_share
            for li in sf.get_leases():
          File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 242, in get_leases
            for i, lease in self._enumerate_leases(f):
          File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 247, in _enumerate_leases
            for i in range(self._get_num_lease_slots(f)):
          File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 227, in _get_num_lease_slots
            num_extra_leases = self._read_num_extra_leases(f)
          File "/usr/lib/python2.5/site-packages/allmydata/storage/mutable.py", line 129, in _read_num_extra_leases
            (num_extra_leases,) = struct.unpack(">L", f.read(4))
          File "/usr/lib/python2.5/struct.py", line 87, in unpack
            return o.unpack(s)
        struct.error: unpack requires a string argument of length 4
$ tahoe --version
tahoe-server: 1.4.1, foolscap: 0.4.2, pycryptopp: 0.5.16-r669, zfec: 1.4.0-4, Twisted: 2.5.0, Nevow: 0.9.26, zope.interface: 3.3.1, python: 2.5.2, platform: Linux-Ubuntu_8.04-i686-32bit, pyutil: 1.3.20, simplejson: 1.7.3, argparse: 0.8.0, pyOpenSSL: 0.6, z-base-32: 1.0.1, setuptools: 0.6c8

No incident logs.

Attachments (1)

5 (2.9 KB) - added by zooko at 2009-10-05T18:00:40Z.
/Users/wonwinmcbrootles/5


Change History (11)

comment:1 Changed at 2009-10-05T17:59:49Z by zooko

Here is the mutable share in question, attached.

Changed at 2009-10-05T18:00:40Z by zooko

  • Attachment 5 added

/Users/wonwinmcbrootles/5

comment:2 Changed at 2009-10-05T23:49:10Z by warner

hm, the share appears to be truncated. I looked at sh7 (in /data5 on that same box), which appears to be intact. Both are for seqnum 6, and both appear to match up until the truncation point (which occurs in the middle of the encrypted private key).
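
For the record, a throwaway script like the following is enough to find the divergence point (the sh7 path below is only a guess from "in /data5 on that same box"; adjust to the real location):

    # compare the suspect share against an intact copy of the same share
    # and report where they first differ
    good = open("/data5/storage/storage/shares/g6/g6rvkc5jdkgoqhljuxgkquzhvq/7", "rb").read()
    bad = open("/data6/storage/storage/shares/g6/g6rvkc5jdkgoqhljuxgkquzhvq/5", "rb").read()
    print("lengths: good=%#x bad=%#x" % (len(good), len(bad)))
    for offset in range(min(len(good), len(bad))):
        if good[offset] != bad[offset]:
            print("first difference at offset %#x" % offset)
            break
    else:
        # no differing byte before the shorter file ran out
        print("identical up to the shorter length, i.e. a clean truncation")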

These shares are supposed to be written atomically, with a write vector of length one (consisting of the entire share) that is written in a single Python f.write() call. Perhaps the write was interrupted (node reboot or system crash) in a way that left only part of the data on disk? The truncation point isn't on a particularly round boundary (the file is 0xb7e bytes long), so it doesn't look like a disk block size or anything like that.

Weird.

I suppose the important part is to recover gracefully from it. I believe the share-crawler should keep going after the error; that'd be the first thing to verify.

I guess the second step would be to build a validating share crawler, and/or have some code in the lease-checking share crawler which would respond to problems like this by moving the corrupt share into a junkpile and logging the issue somewhere.
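
A rough sketch of the junkpile half of that, just to make the idea concrete (the junk directory location and the exact hook in the crawler are placeholders, not existing API; the handler names in the trailing comment come from the traceback above):

    import os, time

    def quarantine_share(sharefile, junkdir):
        # Move a share that failed to parse out of the shares/ tree so the
        # crawlers stop tripping over it, but keep the bytes around for
        # post-mortem analysis. Assumes junkdir is on the same filesystem
        # as the share, so os.rename() is atomic.
        if not os.path.isdir(junkdir):
            os.makedirs(junkdir)
        dest = os.path.join(junkdir, "%d.%s" % (int(time.time()),
                                                os.path.basename(sharefile)))
        os.rename(sharefile, dest)
        return dest

    # and in the lease-checking crawler's per-share handling, roughly:
    #
    #     try:
    #         wks = self.process_share(sharefile)
    #     except Exception:
    #         # log the problem somewhere, then move the share aside
    #         quarantine_share(sharefile, self.junkdir)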

comment:3 Changed at 2009-10-06T16:50:46Z by zooko

  • Summary changed from "exception from attempt to parse leases" to "handle corrupted lease files"

Let's make this ticket be about recovering from this sort of corruption:

  • keep share-crawling after encountering corrupted leases
  • move the corrupted lease file into a junkpile and log the issue

In the meantime, I assume the work-around is to rm that file, right?

comment:4 Changed at 2009-10-08T16:53:12Z by warner

  • Summary changed from "handle corrupted lease files" to "server-side crawlers: tolerate corrupted shares, verify shares"

Sounds good. To be specific, this is unrelated to leases; it's just that the lease-expiring crawler is what first noticed the corruption. So this ticket is about:

  • verify that the crawlers keep crawling after an exception in their per-share handler functions (I believe they already do, but we should confirm it)
  • implement a share-verifying crawler (the server-side verifier), and have it quarantine any corrupted share in some offline junkpile

And yeah, just rm the file; it's useless to anyone. The next time that directory is modified, a new copy of that share will be created.
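
To sketch the second bullet: a verifying crawler might look something like this, assuming the ShareCrawler base class dispatches per bucket the way the tracebacks above suggest (the class name, the verify_share() helper, and the counter are all illustrative, not existing code):

    import os
    from allmydata.storage.crawler import ShareCrawler

    class VerifyingShareCrawler(ShareCrawler):
        # Walks every bucket on disk, tries to parse each share, and moves
        # anything unparseable into an offline junkpile. Everything here is
        # illustrative: the constructor arguments and state handling would
        # need to match the real base class.

        def process_bucket(self, cycle, prefix, prefixdir, storage_index_b32):
            bucketdir = os.path.join(prefixdir, storage_index_b32)
            for shnum in os.listdir(bucketdir):
                sharefile = os.path.join(bucketdir, shnum)
                try:
                    self.verify_share(sharefile)  # hypothetical: parse header + leases
                except Exception:
                    self.corrupt_shares_found += 1  # illustrative counter
                    quarantine_share(sharefile, self.junkdir)  # from the sketch in comment:2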

comment:5 Changed at 2009-11-03T03:38:09Z by davidsarah

  • Keywords reliability added

comment:6 Changed at 2013-11-14T19:06:57Z by zooko

  • Description modified (diff)

#1834 would remove the lease-checking and bucket-counting crawlers, making this ticket irrelevant. However, we might then want to invent a share-verifying crawler, just for the purpose of looking for corrupted shares, which would make this ticket relevant again.

comment:7 Changed at 2014-11-28T15:59:12Z by zooko

My consulting client (codename "WAG") (see comment:6:ticket:1278) has corrupt shares. This message is in the storage server's twistd.log:

2014-10-16 20:28:32+0530 [-] lease-checker error processing /GFS2_tahoe/.tahoe-filestore1/storage/shares/jz/jze7ynr5mtrehxj7vchc3dqtpa/5
2014-10-16 20:28:32+0530 [-] Unhandled Error
        Traceback (most recent call last):
          File "/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/crawler.py", line 261, in start_slice
            self.start_current_prefix(start_slice)
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/crawler.py", line 320, in start_current_prefix
            buckets, start_slice)
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/crawler.py", line 360, in process_prefixdir
            self.process_bucket(cycle, prefix, prefixdir, bucket)
        --- <exception caught here> ---
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/expirer.py", line 128, in process_bucket
            wks = self.process_share(sharefile)
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/expirer.py", line 171, in process_share
            for li in sf.get_leases():
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 265, in get_leases
            for i, lease in self._enumerate_leases(f):
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 270, in _enumerate_leases
            for i in range(self._get_num_lease_slots(f)):
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 250, in _get_num_lease_slots
            num_extra_leases = self._read_num_extra_leases(f)
          File "/data/allmydata-tahoe-1.10.0/src/allmydata/storage/mutable.py", line 133, in _read_num_extra_leases
            (num_extra_leases,) = struct.unpack(">L", f.read(4))
        struct.error: unpack requires a string argument of length 4
        

And this message is in an incident report file generated by the node (which is both the storage server and the gateway):

CorruptShareError server=5e4d7u5c shnum[2]: block hash tree failure: new hash soakg4py7uswum3y6vzvyk4qt3736ia37l7ti22mce3ntesuhe7a does not match existing hash wuoq2uiuyzope6gwyvucwx3b5q47hn46kvynkhsvzh7jndrcowla at [2 of 3] (leaf [1] of 2)
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:575:_runCallbacks
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:827:_cbDeferred
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:380:callback
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:488:_startRunCallbacks
--- <exception caught here> ---
/data/allmydata-tahoe-1.10.0/support/lib/python2.6/site-packages/Twisted-13.0.0-py2.6-linux-x86_64.egg/twisted/internet/defer.py:575:_runCallbacks
/data/allmydata-tahoe-1.10.0/src/allmydata/mutable/retrieve.py:797:_validate_block

This is with this version of Tahoe-LAFS:

Nevow: 0.11.1
Twisted: 13.0.0
allmydata-tahoe: 1.10.0
foolscap: 0.6.4
mock: 1.0.1
platform: Linux-redhat_6.5-x86_64-64bit_ELF
pyOpenSSL: 0.10
pyasn1: 0.1.7
pycrypto: 2.6.1
pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
python: 2.6.6
setuptools: 0.6c16dev4
simplejson: 2.0.9
twisted: 13.0.0
zfec: 1.4.7
zope.interface: unknown

comment:8 Changed at 2014-11-28T22:01:32Z by zooko

Okay, I've been looking into this, and I see that this kind of corruption is in fact handled: the corrupted share is logged, counted in a count called corrupt-shares, and skipped. See expirer.py. The only reason we've thought it was unhandled all these years is that the handler also calls twisted.python.log.err(), which writes an "Unhandled Error" line plus a stack trace to twistd.log. So I propose that we just remove that call to twisted.python.log.err() and add a unit test requiring the code under test to detect and skip corrupted shares without emitting this error log.
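
In other words, roughly this shape (paraphrasing the handling described above, not an exact diff of expirer.py; the state-dictionary keys and the placeholder value are guesses):

    # in expirer.py's per-bucket handling, paraphrased:
    try:
        wks = self.process_share(sharefile)
    except Exception:
        # corrupted share: it is already being counted in the corrupt-shares
        # stat and skipped; the only change proposed here is to stop also
        # calling twisted.python.log.err(), which is what writes the
        # misleading "Unhandled Error" + traceback to twistd.log
        self.state["cycle-to-date"]["corrupt-shares"].append(
            (storage_index_b32, shnum))
        wks = None  # placeholder so the rest of the lease accounting skips this share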

Last edited at 2014-11-28T22:17:02Z by zooko

comment:9 Changed at 2014-11-28T22:45:29Z by daira

Wait, why do we ever log directly to the Twisted log? Let's remove *all* the places that do that.

comment:10 Changed at 2014-11-28T22:56:18Z by daira

Filed #2343 (remove all direct logging to twisted.python.log/err).
