[tahoe-dev] tree size increased?

zooko zooko at zooko.com
Fri Dec 28 21:13:58 PST 2007


Brian:

There are lots of issues here; among them is how we weigh
downloadable size, disk space, and Desert Island installation against
each other.  Also,
there is the issue of aesthetics.  I think that it bothers you  
aesthetically to have redundant copies of a file in the tree  
(setuptools_darcs, or the setuptools 0.6c7-py2.5 egg, for example).   
I value aesthetics, too, and I also value having the tahoe source  
package be appealing to other people (especially you) even when this  
issue isn't particularly important to my aesthetics.

Anyway, here are some numbers about the effect that compressing  
internal tarballs has on untarred disk space and on the size of the  
compressed downloadable allmydata-tahoe tarball.

A. "current": current configuration (uncompressed tarballs within  
uncompressed tarballs)
B. "compress": gzip -9 compress tahoe/misc/dependencies/*.tar
C. "deep compress": recursively untar all tarballs within tarballs  
and gzip -9 them all
D. "deep 7z compress": recursively untar all tarballs and 7z -mx=9  
them all (note that this doesn't actually work for the Desert Island  
build since setuptools doesn't automatically un-7z source tarballs  
that it finds -- this is solely for comparison purposes)
E. "no deps": rm misc/dependencies/*.tar*
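
As a rough illustration of configuration B only, the following is a minimal Python sketch of what "gzip -9" over a directory of tarballs amounts to. The function name gzip_tarballs and its dep_dir argument are hypothetical, not part of the tahoe build; the numbers in this message were produced with the command-line tools, not with this code.

```python
import gzip
import shutil
from pathlib import Path


def gzip_tarballs(dep_dir, level=9):
    """Compress every *.tar directly under dep_dir to *.tar.gz.

    Like `gzip -9`, the uncompressed original is removed afterwards.
    Returns the list of paths that were written.
    """
    written = []
    for tar_path in sorted(Path(dep_dir).glob("*.tar")):
        gz_path = tar_path.with_name(tar_path.name + ".gz")
        with open(tar_path, "rb") as src, \
                gzip.open(gz_path, "wb", compresslevel=level) as dst:
            shutil.copyfileobj(src, dst)
        tar_path.unlink()
        written.append(gz_path)
    return written
```

Configuration C would additionally need to unpack each tarball, compress any tarballs found inside, and repack, which this sketch does not attempt.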

    compression used on the overall allmydata-tahoe.tar tarball -->
          none        gzip       bzip2        rzip       lrzip        7zip
          ----        ----       -----        ----       -----        ----
A.  11,243,520   5,146,208   3,933,778   1,702,507   1,617,109   1,585,380
B.   6,430,720   5,147,929   4,970,992   4,321,795   4,038,092   4,023,087
C.   6,440,960   5,161,208   4,955,399   3,257,477   3,174,995   3,165,861
D.   6,205,440   4,930,134   4,899,728   4,757,996   4,778,605   4,768,450
E.   2,252,800     978,474     906,879     795,064     774,494     769,483

Table 1. Size of the allmydata-tahoe.tar.$COMPRESSION tarball in bytes

It's interesting how much worse the older compression algorithms are  
at taking advantage of the huge redundant pieces spread far apart.   
It's also interesting to see that 7zip is even better than lrzip on  
this input and has the added advantages of being streamable, faster  
and ported to more platforms (lrzip works only on Linux).
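
The effect of widely separated redundancy can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: it uses Python's standard zlib, bz2 and lzma modules as stand-ins for gzip, bzip2 and the LZMA family that 7-Zip uses by default (rzip and lrzip have no standard-library binding), and the input is 1 MiB of random bytes followed by an exact copy of itself, not the tahoe tree.

```python
import bz2
import lzma
import os
import zlib

# 1 MiB of incompressible data followed by an identical copy, so the
# only redundancy available sits 1 MiB away from where it is needed.
block = os.urandom(1 << 20)
data = block + block

sizes = {
    "gzip": len(zlib.compress(data, 9)),   # DEFLATE, 32 KiB window
    "bzip2": len(bz2.compress(data, 9)),   # blocks of at most 900 kB
    "lzma": len(lzma.compress(data)),      # multi-megabyte dictionary
}

for name, size in sizes.items():
    print(f"{name:>5}: {size:>9,} bytes")
```

The DEFLATE and bzip2 results should stay near the full 2 MiB because the second copy lies outside their window or block, while the lzma result should come out near 1 MiB.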


> I think you're optimizing the wrong thing here.. I have a dozen tahoe trees
> on my laptop, and now they consume over half a gigabyte (588MB versus the
> previous 312MB), but I only ever download the .tar.gz maybe once a month, and
> even for users downloading it once a day, 2.37MB is not worth reducing.

As the table above shows, compressing misc/dependencies/*.tar saves
about 5 MB of disk space per tree, has no effect on the
allmydata-tahoe.tar.gz, and increases the allmydata-tahoe.tar.7z from
1.5 MB to 4 MB.  Whether this is a win or a loss depends on whether you
value disk space or fast downloads more.  Actually I think that the concern
you mentioned wasn't due to this added 5 MB (which would have  
increased your 12 trees from 312 MB to 373 MB, not to 588 MB), but  
rather the addition of bundled easy_installable dependencies in order  
to enable Desert Island installation.  So perhaps the trade-off that  
you are weighing is more disk space usage vs. Desert Island  
installation, than disk space usage vs. compressed tarball size.  I  
just went back and added the E. row to inform us about that issue.   
rm'ing the deps saves 9 MB of disk space per tree.  It also makes the  
downloadable much smaller, but of course if the user also has to  
download some of those dependencies then it quickly becomes a net  
loss of human time.
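
For anyone who wants to check the per-tree figures, here is a small Python sketch that recomputes them from table 1. The byte counts are copied from rows A, B and E; the helper name delta_mb and the choice of 1 MB = 1,000,000 bytes are assumptions made for this illustration.

```python
# Sizes in bytes copied from table 1 (rows A, B and E; three columns).
TABLE = {
    "A": {"none": 11_243_520, "gzip": 5_146_208, "7zip": 1_585_380},
    "B": {"none": 6_430_720, "gzip": 5_147_929, "7zip": 4_023_087},
    "E": {"none": 2_252_800, "gzip": 978_474, "7zip": 769_483},
}

MB = 1_000_000  # assumption: decimal megabytes


def delta_mb(row_from, row_to, column):
    """Size change in MB when moving from one configuration to another."""
    return (TABLE[row_to][column] - TABLE[row_from][column]) / MB


# Untarred disk space saved per tree.
disk_saved_by_compressing = -delta_mb("A", "B", "none")
disk_saved_by_no_deps = -delta_mb("A", "E", "none")

# Change in the downloadable tarball when the inner tarballs are gzipped.
gz_change = delta_mb("A", "B", "gzip")
sevenz_change = delta_mb("A", "B", "7zip")

print(f"compress: saves {disk_saved_by_compressing:.2f} MB of disk per tree")
print(f"no deps:  saves {disk_saved_by_no_deps:.2f} MB of disk per tree")
print(f".tar.gz changes by {gz_change:+.3f} MB, .tar.7z by {sevenz_change:+.2f} MB")
```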


> (maybe we
> should consider auto-creating a .tar.gz which contains the support tarballs
> but not put them in the SCM tree).

That's a good idea.

It would certainly solve the disk-space usage problem, and I think it  
appeals more aesthetically.  I'm not going to work on this before
0.7.0 (instead I'm going to work on The Roadmap [1], e.g. fixing
the automatic .deb builds (#246) and lots and lots of documentation).
Here, I just created #249 -- "move bundled dependencies out of  
revision control history and make them optional".

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/roadmap

tickets mentioned in this message:
http://allmydata.org/trac/tahoe/ticket/246
http://allmydata.org/trac/tahoe/ticket/249

