#964 assigned defect

show sizes in unambiguous way that doesn't get mistaken for different units

Reported by: USSJoin Owned by: zooko
Priority: minor Milestone: undecided
Component: code-storage Version: 1.6.0
Keywords: usability Cc:
Launchpad Bug:

Description

When setting up a storage node, it took me a long time to figure out why the storage wasn't respecting my set 15GB of reserved space on the drive. I finally realized that *like* hard drive manufacturers but *unlike* the rest of the planet, Tahoe is counting size in base-10, not base-2 -- so kilobytes are 1000 bytes, not 1024, and so on. This leads to reports from Tahoe being *dramatically* different from, say, df -h, and thus creates confusion.
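To put a number on the gap described above, here is a minimal sketch comparing the two readings of a 15 GB reservation. The variable names are illustrative assumptions and are not taken from Tahoe's configuration code.

```python
# Illustrative only: the two possible readings of "15 GB" of reserved space.
GB = 10**9    # SI gigabyte (base-10)
GiB = 2**30   # gibibyte (base-2)

reserved_base10 = 15 * GB
reserved_base2 = 15 * GiB

# How many bytes apart the two interpretations are.
difference = reserved_base2 - reserved_base10

print(f"15 GB  = {reserved_base10:,} bytes")
print(f"15 GiB = {reserved_base2:,} bytes")
print(f"gap    = {difference:,} bytes ({difference / GB:.2f} GB)")
```

Under these definitions the two readings differ by a little over 1.1 billion bytes, which is roughly the size of discrepancy a user comparing against a base-2 tool would see.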

Change History (15)

comment:1 Changed at 2010-02-21T01:09:53Z by zooko

A good way out of this mess is to spell it out explicitly -- write "2^30" or "10^9" or "one billion".

See also: http://en.wikipedia.org/wiki/Binary_prefix

comment:2 Changed at 2010-02-21T01:22:30Z by zooko

Hitherto I believe we've been using "GiB" to mean 2^30 (per http://en.wikipedia.org/wiki/Binary_prefix ) and we may have sometimes been using "GB" to mean 10^9. That latter usage, while technically correct, and accurate and meaningful to the vast majority of users (who do base-10 arithmetic in their heads but not base-2 arithmetic), is confusing to hackers like USSJoin. So we should avoid it, possibly by replacing uses of "GB" with "10^9".

comment:3 Changed at 2010-02-21T01:35:37Z by USSJoin

Alternately, you could simply use GiB consistently everywhere (for instance, on the storage information page) and do the units *as* GiB, not 10^9. That way people don't have to go "what's a 10^9?" when looking at information pages.

comment:4 follow-up: Changed at 2010-02-21T05:30:35Z by zooko

We generally prefer base-10 arithmetic, because it is easier for users to use. For example, if you ask my mom how many 20-byte things she can store in a 2-terabyte bucket, she'll probably ask "What's a terabyte?", and if you tell her it is a trillion bytes, she'll say "Then I can store 100 billion of them." If instead you tell her that it is 2^40 bytes, then she'll either have to get out a calculator or she'll just give up. In fact, I strongly suspect that a similar problem applies to computer hackers as well as to moms. Quick, how many 20-byte things can you store in a bucket of size 2^41 bytes (two TiB)? I think it will take you longer to answer that question than it would take my mom to answer the base-10 variant of the question. If you answered "about a hundred billion of them" then your answer was 10% off! Back when our buckets were on the order of thousands of elements in size, the approximation of 2^10≅10^3 was only 2.5% off. The approximation of 2^20≅10^6 is 5% off, 2^30≅10^9 is 7.5% off, and 2^40≅10^12 is 10% off. By the way, Apple products now report file sizes and filesystem spaces in base-10: http://support.apple.com/kb/TS2419
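The percentages quoted above are rounded. A short sketch that computes the exact gap between each binary prefix and its decimal counterpart, and answers the two-TiB bucket question, might look like this (the function name prefix_gap_percent is made up for illustration):

```python
def prefix_gap_percent(n):
    """Return how much larger 2^(10n) is than 10^(3n), in percent."""
    return (2 ** (10 * n) / 10 ** (3 * n) - 1) * 100

# n = 1..4 corresponds to kilo/kibi, mega/mebi, giga/gibi, tera/tebi.
for n in range(1, 5):
    gap = prefix_gap_percent(n)
    print(f"2^{10 * n} vs 10^{3 * n}: {gap:.2f}% apart")

# The bucket question: how many whole 20-byte items fit in 2 TiB (2^41 bytes)?
items = 2**41 // 20
print(f"{items:,} items")
```

The exact gaps come out slightly below the rounded figures in the comment: about 2.4%, 4.9%, 7.4%, and 10.0%.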

comment:5 in reply to: ↑ 4 Changed at 2010-02-22T00:26:13Z by davidsarah

  • Keywords usability added

We're with USSJoin on this. Resist the hard disk manufacturers' conspiracy.

Replying to zooko:

We generally prefer base-10 arithmetic, because it is easier for users to use. For example, if you ask my mom how many 20-byte things she can store in a 2-terabyte bucket, she'll probably ask "What's a terabyte?", and if you tell her it is a trillion bytes, she'll say "Then I can store 100 billion of them."

This is not a convincing argument, since you're more likely to need to know how many 1 MiB files, say, can be stored in a terabyte. (BTW, your mom's estimate would be wildly wrong for 20-byte files due to overhead.)

What mattered was that there was a consistent convention. Since most uses of "GB" (for example) still mean 2^30 bytes, Tahoe is going in the wrong direction to reduce confusion.

comment:6 Changed at 2010-02-22T06:44:13Z by kmarkley86

People have strong reasons for strong preferences on both sides. How about making this configurable, so then we can fight about the default instead of forcing one style on everyone? (Also, base-2 sizes should be the default.)

comment:7 follow-up: Changed at 2010-02-22T23:41:01Z by warner

/me runs into the room waving his hands madly like a muppet. nooo!

Please don't contribute to the confusion by printing "x GB" and silently using it to mean 2^30. The entire non-computer world, the SI, and every dictionary on the planet knows that the metric G suffix means giga means 10^9. And while I find "GiB" pretty funny-looking, it is an unambiguous, learnable, and eventually-straightforward term that clearly means 2^30. Let's not conflate the two. Sure, this helps the hard-drive manufacturers, but it's a terminology bugfix, not a conspiracy :).

I'm -1 on having a config option for pretending GB=2^30: someone looking at the web page (and not at the config file) would be unable to learn the truth.

In places where we have evidence that people want both sorts of values, we should give them both sorts of values. For example, on the "Storage Server Status" page, we currently show abbreviated GB (10^9) and unabbreviated number-of-bytes:

Total disk space:       319.73 GB       (319728959488)
Disk space used:        - 311.44 GB     (311444250624)

My hope was that the "319728959488" would look enough like the "319.73" to cue the reader into remembering that GB means 10^9, but the original poster's experience suggests this failed. Some other options for that display:

Total disk space:       319.73 GB       (319728959488) (297.77 GiB)

Total disk space:       319.73 GB(10^9) (319728959488)

Total disk space:       319.73 GB(10^9) (319728959488) 297.77 GiB(2^30)

Since this is a web page, we could also have a popup over the "319.73 GB" line that displays a number of other formats, not unlike we recently added a popup to Foolscap's log-web-viewer display to show timestamps in alternate formats (UTC/local/short/long):

319728959488
319.73 GB (10^9)
297.77 GiB (2^30)
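A rough sketch of how the three renderings listed above could be derived from a single byte count. The helper name describe_size is hypothetical and is not part of Tahoe's web code:

```python
def describe_size(num_bytes):
    """Return exact, base-10, and base-2 renderings of a byte count."""
    return [
        f"{num_bytes}",                          # full number of bytes
        f"{num_bytes / 10**9:.2f} GB (10^9)",    # SI gigabytes
        f"{num_bytes / 2**30:.2f} GiB (2^30)",   # binary gibibytes
    ]

for line in describe_size(319728959488):
    print(line)
```

Showing the exponent next to each abbreviation keeps the unit self-describing even for a reader who has never seen the "GiB" spelling.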

I'm -0 on having a config option that makes these pages display GiB instead of GB, as long as it never ever tries to pretend that GB is 2^30, and that there continues to be a full-number-of-bytes display so that someone looking at the abbreviation has a chance to figure out what it means and become confident in our consistent use of terms.

This is not a convincing argument, since you're more likely to need to know how many 1 MiB files, say, can be stored in a terabyte. (BTW, your mom's estimate would be wildly wrong for 20-byte files due to overhead.)

Huh? Except for programmer-driven test cases, I don't think there's any particular quantization on filesizes. I'm not counting 1 MB or 1 MiB files, I'm counting how many digital pictures I can stuff onto a disk, and they're all sorts of random sizes. The only real quantization I can think of would be the chapters on a ripped DVD image (according to wikipedia these are usually 1 GiB in size), but I really don't think "how many non-terminal DVD VOB files can I fit on this disk" is a common question.

Hitherto I believe we've been using "GiB" to mean 2^30 (per http://en.wikipedia.org/wiki/Binary_prefix ) and we may have sometimes been using "GB" to mean 10^9.

We're always using GB to mean 10^9 and GiB to mean 2^30. I fix the code if I discover it doing otherwise.

comment:8 follow-up: Changed at 2010-02-23T06:34:14Z by zooko

This is what people call a bike shed. The theory goes that few people are willing to contribute their opinions about designing nuclear power plants, because that is very complex and requires high expertise, but many people are willing to contribute their opinions about designing a bike shed, because it is simple enough that they can see how they would like it to be.

(Aside: I don't really like that metaphor of a "bike shed" because it belittles the concerns of the contributors. I actually agree with USSJoin, davidsarah, and kmarkley86 that user interface issues are important, including this one. Don't forget that the original post by USSJoin explained how he actually lost some of his time due to confusion. Wasting user time is not okay! Also, a design being simple and easy to understand doesn't mean that it doesn't matter how it is done!)

However, this issue has now distracted both David-Sarah and Brian from building nuclear power plants. Let's put a stop to the discussion. Our policy will be to express numbers in units that are as unambiguous as possible so that a user who assumes that "GB" means 2^20 and a user who assumes that "GB" means 10^9 will both have a minimal chance of wasting their time with confusion. Specifically, the suggestions that Brian made in comment:7 about redundantly listing the same value in different units would probably help.

That's the main idea -- to make the user interface sufficiently clear (even at the cost of redundancy) that nobody wastes their time mistaking the units. I believe this policy will satisfice.

We will continue to use KiB to mean 10^3, MiB to mean 10^6, GiB to mean 10^9, TiB to mean 10^12 etc. as per http://en.wikipedia.org/wiki/Binary_prefix , and never use KB to mean 2^10 etc. However, as per the main idea, above, we will probably try to reduce the use of KB at all in favor of less ambiguous designations.

comment:9 in reply to: ↑ 7 Changed at 2010-03-25T22:32:56Z by zooko

  • Summary changed from List sizes for storage using base-2 sizes, not base-10 to show sizes in unambiguous way that doesn't get mistaken for different units

Replying to warner:

In places where we have evidence that people want both sorts of values, we should give them both sorts of values. For example, on the "Storage Server Status" page, we currently show abbreviated GB (10^9) and unabbreviated number-of-bytes:

Total disk space:       319.73 GB       (319728959488)
Disk space used:        - 311.44 GB     (311444250624)

My hope was that the "319728959488" would look enough like the "319.73" to cue the reader into remembering that GB means 10^9, but the original poster's experience suggests this failed.

You know what? This might have worked if the bytes display had included commas, like this:

Total disk space:       319.73 GB       (319,728,959,488)
Disk space used:        - 311.44 GB     (311,444,250,624)

I don't know about USSJoin, but for me, my eyeballs just slide right off of "319728959488" after the first couple of digits. Proposed action items to make this ticket closable:

  • change the Summary to "show sizes in unambiguous way that doesn't get mistaken for different units"
  • add commas to outputs which are in terms of bytes
  • ask people for feedback on whether this is now sufficiently clear to them
  • close the ticket as fixed
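The "add commas" action item is a one-liner with Python's format mini-language. A minimal sketch follows; the function name and the row layout are assumptions for illustration and do not come from the actual status-page code:

```python
def format_bytes_with_commas(num_bytes):
    """Render an integer byte count with comma thousands separators."""
    return f"{num_bytes:,}"

def status_row(label, num_bytes):
    """Build one status line: label, abbreviated GB (10^9), exact bytes."""
    abbreviated = f"{num_bytes / 10**9:.2f} GB"
    return f"{label:<24}{abbreviated:<16}({format_bytes_with_commas(num_bytes)})"

print(status_row("Total disk space:", 319728959488))
print(status_row("Disk space used:", 311444250624))
```

With the digits grouped in threes, the leading "319" of the byte count lines up visibly with the "319.73" of the abbreviated value.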

comment:10 Changed at 2010-12-22T18:51:12Z by zooko

  • Owner set to zooko
  • Status changed from new to assigned

comment:11 Changed at 2010-12-24T23:43:00Z by warner

I like the commas idea: my eyeballs slide off long numbers too. My hesitation is that every once in a rare while, I cut-and-paste a number like that into a calculator or python repl, and the commas would mess that up. But I think readability trumps cut-and-pasteability. So +1 on the commas.

You might also add units: (319,728,959,488 bytes). But maybe not.

Oh, you know, there might possibly be a CSS styling thing that lets you tell the system that this is a number, and that it ought to add comma-like things according to the current locale (since they'd be periods in europe). I have a hazy memory that suggests doing this would also retain cut-and-pasteability, because the commas/periods would be purely visual: cut/paste would still get the original non-comma-ified number. Does this ring any bells for anyone, or am I completely imagining it?

comment:12 follow-up: Changed at 2010-12-28T16:56:53Z by ScottD

The only way I know to pull off locale-formatted numbers is to use a span with a CSS-class and use javascript to read those elements using parseInt()/parseFloat() and replacing them with toLocaleString(). The locale will get picked up from whatever browser is being used and degrades nicely to plain numbers if javascript is unavailable.

comment:13 in reply to: ↑ 12 Changed at 2010-12-28T20:09:47Z by davidsarah

Replying to ScottD:

The only way I know to pull off locale-formatted numbers is to use a span with a CSS-class and use javascript to read those elements using parseInt()/parseFloat() and replacing them with toLocaleString(). The locale will get picked up from whatever browser is being used and degrades nicely to plain numbers if javascript is unavailable.

That's not worth the complexity IMHO. Separating the digit groups with spaces, e.g.

Total disk space:       319.73 GB       (319 728 959 488 bytes)
Disk space used:        - 311.44 GB     (311 444 250 624 bytes)

is understood internationally, and personally I think it's more readable. In HTML, &#x202F; (NARROW NO-BREAK SPACE) might be better.
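For the space-separated variant, one possible sketch reuses the comma formatter and swaps the separator. The names NNBSP and group_digits are invented here for illustration:

```python
# U+202F NARROW NO-BREAK SPACE: keeps the digit groups from wrapping apart.
NNBSP = "\u202f"

def group_digits(num_bytes, separator=NNBSP):
    """Group an integer's digits in threes using the given separator."""
    return f"{num_bytes:,}".replace(",", separator)

# Plain ASCII space, as in the example lines above.
print(f"({group_digits(319728959488, ' ')} bytes)")
# Narrow no-break space, as suggested for HTML output.
print(f"({group_digits(311444250624)} bytes)")
```

Note that, like commas, these separators end up in the copied text, so this approach does not address the cut-and-paste concern raised in comment:11.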

comment:14 in reply to: ↑ 8 ; follow-up: Changed at 2013-05-20T21:52:34Z by Zancas

Replying to zooko:

This is what people call a bike shed. The theory goes that few people are willing to contribute their opinions about designing nuclear power plants, because that is very complex and requires high expertise, but many people are willing to contribute their opinions about designing a bike shed, because it is simple enough that they can see how they would like it to be.

(Aside: I don't really like that metaphor of a "bike shed" because it belittles the concerns of the contributors. I actually agree with USSJoin, davidsarah, and kmarkley86 that user interface issues are important, including this one. Don't forget that the original post by USSJoin explained how he actually lost some of his time due to confusion. Wasting user time is not okay! Also, a design being simple and easy to understand doesn't mean that it doesn't matter how it is done!)

However, this issue has now distracted both David-Sarah and Brian from building nuclear power plants. Let's put a stop to the discussion. Our policy will be to express numbers in units that are as unambiguous as possible so that a user who assumes that "GB" means 2^20 and a user who assumes that "GB" means 10^9 will both have a minimal chance of wasting their time with confusion. Specifically, the suggestions that Brian made in comment:7 about redundantly listing the same value in different units would probably help.

That's the main idea -- to make the user interface sufficiently clear (even at the cost of redundancy) that nobody wastes their time mistaking the units. I believe this policy will satisfice.

We will continue to use KiB to mean 10^3, MiB to mean 10^6, GiB to mean 10^9, TiB to mean 10^12 etc. as per http://en.wikipedia.org/wiki/Binary_prefix , and never use KB to mean 2^10 etc. However, as per the main idea, above, we will probably try to reduce the use of KB at all in favor of less ambiguous designations.

Oops.... isn't it KiB means 2^10... et cetera? If I Understand Correctly the infixed "i" means base 2, and exponent increments of 10.

comment:15 in reply to: ↑ 14 Changed at 2013-05-20T22:12:50Z by zooko

Replying to Zancas:

We will continue to use KiB to mean 10^3, MiB to mean 10^6, GiB to mean 10^9, TiB to mean 10^12 etc. as per http://en.wikipedia.org/wiki/Binary_prefix , and never use KB to mean 2^10 etc. However, as per the main idea, above, we will probably try to reduce the use of KB at all in favor of less ambiguous designations.

Oops.... isn't it KiB means 2^10... et cetera? If I Understand Correctly the infixed "i" means base 2, and exponent increments of 10.

Argh! How did I screw that up‽ Thanks, Zancas, for noticing. Everyone reading this: disregard what I wrote and just believe that we're going to do what http://en.wikipedia.org/wiki/Binary_prefix says about how to spell the base-2 things. Also, as previously mentioned on this ticket, spelling out numbers with commas in place is unambiguous and is the standard format for integers in Internet English writing.
