Re: Adding compression/checksum support for bluestore.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Chris Dunlop <chris@onthe.net.au>
To: Sage Weil <sage@newdream.net>
Cc: Allen Samuels <Allen.Samuels@sandisk.com>,
	Igor Fedotov <ifedotov@mirantis.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression/checksum support for bluestore.
Date: Wed, 6 Apr 2016 16:38:49 +1000	[thread overview]
Message-ID: <20160406063849.GA5139@onthe.net.au> (raw)
In-Reply-To: <20160405151030.GA20891@onthe.net.au>

On Wed, Apr 06, 2016 at 01:10:30AM +1000, Chris Dunlop wrote:
> On Tue, Apr 05, 2016 at 08:35:43AM -0400, Sage Weil wrote:
>> D = block size, U = hw UBER, C = checksum.  Let's add N = number of bits 
>> you actually want to read.  In that case, we have to read (N / D) blocks 
>> of D bits, and we get
>> 
>> P(reading N bits and getting some bad data and not knowing it)
>> 	= (D * U) / (2^C) * (N / D)
>> 	= U * N / 2^C
>> 
>> and the D term (block size) disappears.  IIUC this is what Chris was 
>> originally getting at.  The block size affects the probability I get an 
>> error on one block, but if I am a user reading something, you don't care 
>> about block size--you care about how much data you want to read.  I think 
>> in that case it doesn't really matter (modulo rounding error, minimum read 
>> size, how precisely we can locate the error, etc.).
>>
>> Is that right?
> 
> Yep, that's pretty much what I was thinking. My probability algebra is more
> than a little rusty but that looks like the formula I was starting to put
> together.

OK, here's my (1st!, more below) working:

P(bad bit)
  = probability bit is bad (BER)
  = U

P(bad check)
  = probability checksum fails (i.e. a false match)
  = 2^-C

P(bit fail)
  = probability bit is bad, and subsequent checksum fail
  = P(bad bit) * P(bad check)

P(bit ok)
  = probability bit is good, or flagged by checksum
  = 1 - P(bit fail)
  = 1 - (P(bad bit) * P(bad check))

P(block ok)
  = probability D-bit block is all-bits-good, or flagged by checksum
  = P(bit ok) ^ D
  = (1 - (P(bad bit) * P(bad check))) ^ D
  = (1 - (U * 2^-C)) ^ D

N  = number of bits in data in which we're interested

P(data ok)
  = probability data is ok (good, or flagged)
  = P(block ok) ^ (number_of_blocks)
  = P(block ok) ^ (N / D)
  = ((1 - (P(bad bit) * P(bad check))) ^ D) ^ (N / D)
  = (1 - (P(bad bit) * P(bad check))) ^ N
  = (1 - (U * 2^-C)) ^ N

P(bad data)
  = probability of getting unflagged bad data
  = 1 - P(data ok)
  = 1 - ((1 - (U * 2^-C)) ^ N)

Like Sage's working, the size of the individual data blocks disappears. But
unfortunately it doesn't match Sage's formula. :-(

OK, here's another attempt, looking at it a little differently: my 1st
working applies the P(bad check) at the bit level, the same as for P(bad
bit). But the checksum is actually done at the block level. So:

P(bad bit)
  = probability bit is bad (BER)
  = U

P(good bit)
  = probability bit is good
  = 1 - U

P(good block)
  = probability D-bit block is all-bits-good
  = (1 - U) ^ D

P(bad block)
  = probability block has at least one bad bit
  = 1 - P(good block)
  = 1 - ((1 - U) ^ D)

P(bad check)
  = probability checksum fails (i.e. a false match)
  = 2^-C

P(good check)
  = probability checksum succeeds (flags bad data)
  = 1 - P(bad check)
  = 1 - (2^-C)

P(block ok)
  = probability block is all-good, or bad and flagged by checksum
  = P(good block) + (P(bad block) * P(good check))
  = ((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C))) 
  = 2^-C * ((1-U)^D-1)+1

N  = number of bits in data in which we're interested

P(data ok)
  = probability data is ok (good, or flagged)
  = P(block ok) ^ (number_of_blocks)
  = P(block ok) ^ (N / D)
  = (((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))) ^ (N / D)
  = (2^-C * (((1-U)^D) - 1) + 1) ^ (N / D)

P(bad data)
  = probability of getting unflagged bad data
  = 1 - P(data ok)
  = 1 - ((2^-C * ((1-U)^D-1)+1) ^ (N / D))

...and this time the D term doesn't disappear. I.e. it supports Allen's
contention that the block size matters.

Let's plug some numbers into Wolfram Alpha to see if we get plausible results:

U = 10^-15
C = 32
D = 4 * 1024
N = 5PB = 5 * 8 * 10^15

---------
Chris #1
---------
P(bad data)
  = 1 - ((1 - (U * 2^-C)) ^ N)
  = 1 - ((1 - ((10^-15) * (2^-32))) ^ (5 * 8 * (10^15)))
  = 9.3132257027866983914620845681582388414865913202250969 × 10^-9

---------
Chris #2
---------
P(bad data)
  = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))
  = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (4 * 1024)))
  = 9.3132257027676295619288907910277855094237197529999931 × 10^-9
  (differs from Chris #1 at 10^-20)

If this formula is correct, it's still relatively immune to the data block
size, e.g. at D = (1024 * 1024):
  = 1 - (((2^-32) * ((1-(10^-15))^(1024 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (1024 * 1024)))
  = 9.3132256979038905963931781911807383342120258917664411 × 10^-9
  (differs from D=(4 * 1024) at 10^-17)

---------
Sage
---------
> P(reading N bits and getting some bad data and not knowing it)
> 	= U * N / 2^C

P(bad data)
  = (10^-15 * 5 * 8 * 10^15) / (2^32)
  = 9.31322574615478515625 × 10^-9
  (differs from Chris #2 at 10^-17)

---------
Allen
---------
> So the overall probability of a failure for this read is:
> (1 - (1-U)^D) / (2^C).

P(bad data)
  = (1 - (1-U)^D) / (2^C)
  = (1 - (1-(10^-15))^(4 * 1024)) / (2^32)
  = 9.536743164042973518371608678388595553788002932094000 × 10^-22
  (seems implausibly low?)


Now what?

Anyone want to check the assumptions and calcs, and/or come up with your own?

On the bright side, if the 10^-9 formula (Chris #1, Chris #2, Sage) are
anywhere near correct, they indicate with a block size of 4K and 32-bit
checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB, to get to a 1%
chance of seeing unflagged bad data, e.g.:

Chris #2 @ U=10^-15, C=32, D=(4 * 1024), N=(5 * 8 * 10^21)
  = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * 10^21) / (4 * 1024)))
  = 0.009269991978615439689381062400448881380202158231615685460
  = 0.92%


Chris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2016-04-06  6:38 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-30 19:46 Adding compression/checksum support for bluestore Allen Samuels
2016-03-30 20:41 ` Vikas Sinha-SSI
2016-03-30 22:24   ` Sage Weil
2016-03-30 22:35     ` Allen Samuels
2016-03-31 16:31   ` Igor Fedotov
2016-03-30 22:15 ` Sage Weil
2016-03-30 22:22   ` Gregory Farnum
2016-03-30 22:30     ` Sage Weil
2016-03-30 22:43       ` Allen Samuels
2016-03-30 22:32   ` Allen Samuels
2016-03-30 22:52   ` Allen Samuels
2016-03-30 22:57     ` Sage Weil
2016-03-30 23:03       ` Gregory Farnum
2016-03-30 23:08         ` Allen Samuels
2016-03-31 23:02       ` Milosz Tanski
2016-04-01  3:56     ` Chris Dunlop
2016-04-01  4:56       ` Sage Weil
2016-04-01  5:28         ` Chris Dunlop
2016-04-01 14:58           ` Sage Weil
2016-04-01 19:49             ` Chris Dunlop
2016-04-01 23:08               ` Allen Samuels
2016-04-02  2:23                 ` Allen Samuels
2016-04-02  2:51                   ` Gregory Farnum
2016-04-02  5:05                     ` Chris Dunlop
2016-04-02  5:48                       ` Allen Samuels
2016-04-02  6:18                       ` Gregory Farnum
2016-04-03 13:27                         ` Sage Weil
2016-04-04 15:33                           ` Chris Dunlop
2016-04-04 15:51                             ` Chris Dunlop
2016-04-04 17:58                               ` Allen Samuels
2016-04-04 15:26                         ` Chris Dunlop
2016-04-04 17:56                           ` Allen Samuels
2016-04-02  5:08                     ` Allen Samuels
2016-04-02  4:07                 ` Chris Dunlop
2016-04-02  5:38                   ` Allen Samuels
2016-04-04 15:00                     ` Chris Dunlop
2016-04-04 23:58                       ` Allen Samuels
2016-04-05 12:35                         ` Sage Weil
2016-04-05 15:10                           ` Chris Dunlop
2016-04-06  6:38                             ` Chris Dunlop [this message]
2016-04-06 15:47                               ` Allen Samuels
2016-04-06 17:17                                 ` Chris Dunlop
2016-04-06 18:06                                   ` Allen Samuels
2016-04-07  0:43                                     ` Chris Dunlop
2016-04-07  0:52                                       ` Allen Samuels
2016-04-07  2:59                                         ` Chris Dunlop
2016-04-07  9:51                                           ` Willem Jan Withagen
2016-04-07 12:21                                             ` Atchley, Scott
2016-04-07 15:01                                               ` Willem Jan Withagen
2016-04-07  9:51                                           ` Chris Dunlop
2016-04-08 23:16                                             ` Allen Samuels
2016-04-05 20:41                           ` Allen Samuels
2016-04-05 21:14                             ` Sage Weil
2016-04-05 12:57                         ` Dan van der Ster
2016-04-05 20:50                           ` Allen Samuels
2016-04-06  7:15                             ` Dan van der Ster
2016-03-31 16:27   ` Igor Fedotov
2016-03-31 16:32     ` Allen Samuels
2016-03-31 17:18       ` Igor Fedotov
2016-03-31 17:39         ` Piotr.Dalek
2016-03-31 18:44         ` Allen Samuels
2016-03-31 16:58 ` Igor Fedotov
2016-03-31 18:38   ` Allen Samuels
2016-04-04 12:14     ` Igor Fedotov
2016-04-04 14:44       ` Allen Samuels

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160406063849.GA5139@onthe.net.au \
    --to=chris@onthe.net.au \
    --cc=Allen.Samuels@sandisk.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=ifedotov@mirantis.com \
    --cc=sage@newdream.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.