#783 closed defect (wontfix)

does it sometimes use 750 MB

Reported by: terrell
Owned by: terrell
Priority: major
Milestone: undecided
Component: code
Version: 1.5.0
Keywords: leak memory
Cc:
Launchpad Bug:

Description

My main machine, an Intel iMac running OS X 10.5.7, has been running a tahoe node on the volunteergrid for a while (a few weeks).

I noticed today, while looking at Activity Monitor, that the node is using over 750 MB of memory.

Please advise on what I can do to catch it in the act.

Attachments (8)

longrunning.png (145.0 KB) - added by terrell at 2009-08-08T02:33:21Z.
screenshot of Activity Monitor
tahoesample.txt (27.1 KB) - added by terrell at 2009-08-08T02:33:57Z.
sample file for the longrunning process
prodtahoe3.allmydata.com-tahoe_nodememory_storage-day.png (29.1 KB) - added by zooko at 2009-08-08T15:46:01Z.
some random storage server at allmydata.com. Look: no major memory leak! (graph shows a day)
prodtahoe3.allmydata.com-tahoe_nodememory_storage-week.png (34.7 KB) - added by zooko at 2009-08-08T15:46:14Z.
some random storage server at allmydata.com. Look: no major memory leak! (graph shows a week)
prodtahoe3.allmydata.com-tahoe_nodememory_storage-month.png (35.5 KB) - added by zooko at 2009-08-08T15:46:41Z.
some random storage server at allmydata.com. Look: no major memory leak! (graph shows a month)
prodtahoe3.allmydata.com-tahoe_nodememory_storage-year.png (41.4 KB) - added by zooko at 2009-08-08T15:46:53Z.
some random storage server at allmydata.com. Look: no major memory leak! (graph shows a year)
longrunning2.png (150.9 KB) - added by terrell at 2009-08-08T21:11:47Z.
second snapshot - this time back down to 57M
longrunning3.png (153.0 KB) - added by terrell at 2009-08-10T12:46:07Z.
and now down to 22MB - so... not a leak?


Change History (21)

Changed at 2009-08-08T02:33:21Z by terrell

screenshot of Activity Monitor

Changed at 2009-08-08T02:33:57Z by terrell

sample file for the longrunning process

comment:1 Changed at 2009-08-08T02:38:40Z by terrell

[10:36:24:trel:~] ps -fA | grep tahoe
  501 71020     1   0   9:04.72 ??        33:17.14 /System/Library/Frameworks/Python.framework/Versions/2.5/Resources/Python.app/Contents/MacOS/Python /usr/bin/twistd -y tahoe-client.tac --logfile logs/twistd.log
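
To catch the node in the act next time, something like the following could be left running alongside it. This is a hypothetical helper, not part of Tahoe-LAFS: it polls ps for the twistd process's resident set size every few minutes and prints a timestamped sample, so a later spike can be tied to a time window. The PID and the sampling interval are assumptions taken from the ps output above.

  # mem_sampler.py -- hypothetical helper, not part of Tahoe-LAFS.
  # Poll `ps` for the twistd process's resident set size every few minutes
  # and print a timestamped sample, so a memory spike can be caught in the act.
  import datetime
  import subprocess
  import time

  PID = 71020      # assumed: the twistd PID shown by `ps -fA | grep tahoe` above
  INTERVAL = 300   # seconds between samples

  def sample_rss_kb(pid):
      # `ps -o rss= -p PID` prints the resident set size in kilobytes (OS X and Linux)
      proc = subprocess.Popen(["ps", "-o", "rss=", "-p", str(pid)],
                              stdout=subprocess.PIPE)
      out = proc.communicate()[0].decode("ascii").strip()
      return int(out) if out else 0

  while True:
      print("%s rss=%d KB" % (datetime.datetime.now().isoformat(), sample_rss_kb(PID)))
      time.sleep(INTERVAL)

Comparing the resulting samples against the node's own activity (uploads, checks, webapi traffic) should help narrow down what triggers the growth.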

comment:2 Changed at 2009-08-08T02:40:44Z by terrell

[10:36:26:trel:~] tahoe --version

allmydata-tahoe: 1.4.1-r3995, foolscap: 0.4.1, pycryptopp: 0.5.15, zfec: 1.4.2, Twisted: 2.5.0, Nevow: 0.9.32, zope.interface: 3.3.0, python: 2.5.1, platform: Darwin-9.7.0-i386-32bit, sqlite: 3.4.0, simplejson: 2.0.1, argparse: 0.8.0, pyOpenSSL: 0.6, pyutil: 1.3.28, zbase32: 1.1.1, setuptools: 0.6c12dev, pysqlite: 2.3.2

comment:3 Changed at 2009-08-08T02:41:32Z by terrell

  • Summary changed from "long running tahoe process - appears to be a memory leak" to "long running tahoe process - appears to be a slow memory leak"

comment:4 Changed at 2009-08-08T15:28:45Z by zooko

What tool generated that "tahoesample.txt" sample file, and what is the meaning of the contents of that file?

Hm, let's see, how else can we figure out what's going on in there. The presence of 11 threads is a bit surprising to me. Oh! Look for incident report files. It would really be good if we made this better documented and even more automated. Anyway, look in $TAHOEBASEDIR/logs/incidents and attach the most recent ones to this ticket. Also try pressing the "Report an incident" button on the welcome page.
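
A minimal sketch of that step, assuming the default node base directory of ~/.tahoe (substitute the actual $TAHOEBASEDIR if the node lives elsewhere): list the newest files under logs/incidents so the most recent incident reports can be found and attached.

  # find_incidents.py -- hypothetical helper; lists the newest incident report
  # files under the node's logs/incidents directory.
  import glob
  import os

  basedir = os.path.expanduser("~/.tahoe")   # assumption: default node base directory
  incident_dir = os.path.join(basedir, "logs", "incidents")

  paths = glob.glob(os.path.join(incident_dir, "*"))
  paths.sort(key=os.path.getmtime, reverse=True)   # newest first

  for path in paths[:5]:
      print("%s  (%d bytes)" % (path, os.path.getsize(path)))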

comment:5 Changed at 2009-08-08T15:41:53Z by zooko

  • Priority changed from major to critical

I'm elevating the priority to "critical" because a big memory leak like this could prevent Tahoe-LAFS from being used in some cases. By the way, we've always been careful about memory usage: we have graphs of the virtual memory usage of short-running programs, generated automatically on each darcs commit:

http://allmydata.org/trac/tahoe/wiki/Performance

And allmydata.com has graphs of the virtual and resident memory of long-running storage server processes. Unfortunately those graphs aren't public. I'll ask Peter Secor if we can make the live view of those graphs public. I'll attach the graphs from one of those servers to this ticket.

So it is interesting that Trel's is the first report of something like this. I've been running a long-lived Tahoe-LAFS node on my Intel Mac (OS X 10.4), and I've never seen anything like this.

Note: we *have* seen major memory problems, but never a long-running slow memory leak like this, only a catastrophic "Out Of Memory -- now I am totally confused and broken" -- #651. Trel: please look in your twistd.log for MemoryError.
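
A minimal sketch of that last check, assuming the log lives at logs/twistd.log under the node's base directory (matching the --logfile argument shown in comment:1): scan twistd.log and print any lines mentioning MemoryError.

  # grep_memoryerror.py -- hypothetical helper; scans the node's twistd.log
  # for MemoryError and prints matching lines with their line numbers.
  import os

  logfile = os.path.expanduser("~/.tahoe/logs/twistd.log")   # assumed log location

  f = open(logfile)
  try:
      for i, line in enumerate(f):
          if "MemoryError" in line:
              print("%d: %s" % (i + 1, line.rstrip()))
  finally:
      f.close()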

Changed at 2009-08-08T15:46:01Z by zooko

some random storage server at allmydata.com. Look: no major memory leak! (graph shows a day)

Changed at 2009-08-08T15:46:14Z by zooko

some random storage server at allmydata.com. Look: no major memory leak! (graph shows a week)

Changed at 2009-08-08T15:46:41Z by zooko

some random storage server at allmydata.com. Look: no major memory leak! (graph shows a month)

Changed at 2009-08-08T15:46:53Z by zooko

some random storage server at allmydata.com. Look: no major memory leak! (graph shows a year)

comment:6 Changed at 2009-08-08T15:48:15Z by zooko

I attached graphs of the memory usage of one of the allmydata.com storage servers. Note that these are (I think) running Tahoe-LAFS v1.3.0.

comment:7 Changed at 2009-08-08T21:11:17Z by terrell

That tahoesample.txt was generated from the Activity Monitor in OS X - with the 'Sample Process' button at the top after selecting the running 'Python' app.

I'm attaching another screenshot - this time showing that the memory usage had dropped back to 57M - and then I waited another 10 hours, and it seems to still be at 57M. So... now even more confused. I'll look for incident files when I return. Need to head out the door now.

Changed at 2009-08-08T21:11:47Z by terrell

second snapshot - this time back down to 57M

comment:8 Changed at 2009-08-08T22:44:05Z by warner

FYI, I've seen unexpected memory usage in storage servers that are receiving shares, but not huge consumption (it felt like the 100KB-ish strings weren't being freed as quickly as I expected). I think we've also seen unexpected behavior in busy webapi servers; we should check the allmydata.com webapi2/webapi3 nodes to see what their memory-usage munin graphs look like.

Occasionally we've seen a node use up so much memory that it hits MemoryError, and then everything falls apart (because the reactor's unhandled-error handling code runs out of memory too; it would be great if MemoryError weren't catchable, or at least if the reactor didn't try to catch it). We haven't been able to figure out how a node gets into that state, though: there were no obvious smoking guns, just the fatal exit wound :).
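
As a way to gather more data the next time this happens, here is a sketch only (not existing Tahoe-LAFS instrumentation) of having a long-running twistd process log its own peak resident memory from inside the reactor, so a sudden jump can be correlated with whatever the node was doing at the time.

  # A sketch, not existing Tahoe-LAFS code: periodically log this process's own
  # peak resident memory from inside the Twisted reactor.
  import resource

  from twisted.internet import reactor, task

  def log_memory():
      # ru_maxrss is the peak RSS so far: bytes on OS X, kilobytes on Linux
      peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
      print("peak RSS so far: %d" % peak)

  task.LoopingCall(log_memory).start(300)   # sample every five minutes
  reactor.run()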

Changed at 2009-08-10T12:46:07Z by terrell

and now down to 22MB - so... not a leak?

comment:9 Changed at 2009-10-27T06:05:25Z by zooko

  • Priority changed from critical to major
  • Summary changed from "long running tahoe process - appears to be a slow memory leak" to "does it sometimes use 750 MB"

comment:10 Changed at 2009-12-13T02:25:04Z by davidsarah

  • Keywords memory added

comment:11 Changed at 2009-12-13T04:25:53Z by zooko

  • Owner changed from somebody to terrell

I appreciate the bug report, Terrell, and I don't consider it acceptable for Tahoe-LAFS to occasionally use 750 MB, but I don't see how to make progress on this ticket unless you experience the problem again and this time have verbose logging turned on or an incident report file is generated. Let's close this as 'wontfix' for now so that the ticket doesn't sit here open waiting for the event to recur on your system. Maybe it has been fixed! But please do re-open this ticket if it recurs.

comment:12 Changed at 2009-12-13T04:26:01Z by zooko

  • Resolution set to wontfix
  • Status changed from new to closed

comment:13 Changed at 2009-12-13T04:47:06Z by terrell

haven't seen this since it happened four months ago. closed is fine.
