bloom filter thoughts

All of lore.kernel.org
 help / color / mirror / Atom feed

* bloom filter thoughts
@ 2013-09-26  0:34 Sage Weil
  2013-09-26 13:20 ` Mark Nelson
  0 siblings, 1 reply; 4+ messages in thread
From: Sage Weil @ 2013-09-26  0:34 UTC (permalink / raw)
  To: ceph-devel

I spent some time on the plane playing with bloom filters.  We're looking 
at using these on the OSD to (more) efficiently keep track of which 
objects have (probably) been touched in recent history.  This can then be 
used by the caching and tiering code to identify objects that have 
(definitely) not been touched that should be demoted or purged from a 
given tier.

The false positive probility (in our case, probability we'll think an 
object has been touched when it actually hasn't) is dependent on size of 
the bloom filter and the number of insertions.  

When you create the bloom_filter, you tell it how many insertions you 
expect to do and what fpp you want and it sizes itself.  For ~1% fpp, it 
costs just over 1 byte per insertion.

wip-bloom has some cleanups and unit tests that test the false positive 
behavior.

A few issues/notes:

- If we build a sequence of filters covering intervals of time, this let's 
us estimate "atime".  However, if we have N filters of 1/Nth the size, our 
total_ffp = N * fpp, so each bin's bloom needs to be 1/Nth as big with 
1/Nth the desired fpp.  In my simple test, going from 1 to 16 bins moves 
the bloom_filter size from 20 KB to 32 KB (that's 16k insertions and 
target_ffp of .01).

- In the OSD, we won't necessarily know how many insertions a bin will 
have.  The user inputs will/should be something like:
 - target fpp
 - time interval we want to record/estimate over (say, 1 day)
and we'll need to translate that into a particular sized bloom.  We 
probably need to estimate that based on recent pg workload/activity.  If 
we have a filter that "fills up" quickly, the next one will be bigger.  If 
we have an unfull one after some period of time, we can compress/fold it 
down into a smaller filter (a convenient trick) and start the next smaller 
one.

I suggest the pg_pool_t get:

 uint8_t bloom_fpp_pow;  //< desired false positive rate == 2^(-bloom_fpp_pow)
 uint16_t bloom_period;  //< period for that false positive rate (seconds)
 uint16_t bloom_bins;    //< the approx "resolution" we want
 uint16_t bloom_max;     //< how many periods we retain history for

which would make the OSD try to create a new filter roughly every 
(period/bins) seconds with some simple estimation.

Other thoughts on the OSD side stuff:

- The past history of filters may be large.  I suggest we stick them in an 
omap object that lives with the PG and is recovered/backfilled but that 
needs to happen in a way that is out of band wrt to other objects.  It 
should get scrubbed.

- Before anything is trimmed from the pg log it needs to be added to the 
currently-accumulating bloom filter.  We can delay this until (well) after 
last_complete_on_disk so that we never have to worry about divergent 
entries.

sage

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: bloom filter thoughts
  2013-09-26  0:34 bloom filter thoughts Sage Weil
@ 2013-09-26 13:20 ` Mark Nelson
  2013-09-26 15:52   ` Sage Weil
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Nelson @ 2013-09-26 13:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 09/25/2013 07:34 PM, Sage Weil wrote:
> I spent some time on the plane playing with bloom filters.  We're looking
> at using these on the OSD to (more) efficiently keep track of which
> objects have (probably) been touched in recent history.  This can then be
> used by the caching and tiering code to identify objects that have
> (definitely) not been touched that should be demoted or purged from a
> given tier.

You know this is making me want to work on my photon mapping engine 
again instead of running benchmarks for Bryan. :P  Anyway, this seems 
like a great use case.  Are there any other places we want to fast 
intersection tests where we could use these?  k-nearest-neighbor 
searches would be tons of fun to do too if we only had a use case!

>
> The false positive probility (in our case, probability we'll think an
> object has been touched when it actually hasn't) is dependent on size of
> the bloom filter and the number of insertions.
>
> When you create the bloom_filter, you tell it how many insertions you
> expect to do and what fpp you want and it sizes itself.  For ~1% fpp, it
> costs just over 1 byte per insertion.

It's been a long time since I looked at bloom filters, but isn't this 
also affected by the number of hash functions used?  Are we going to try 
to have it auto-select the number?

>
> wip-bloom has some cleanups and unit tests that test the false positive
> behavior.
>
> A few issues/notes:
>
> - If we build a sequence of filters covering intervals of time, this let's
> us estimate "atime".  However, if we have N filters of 1/Nth the size, our
> total_ffp = N * fpp, so each bin's bloom needs to be 1/Nth as big with
> 1/Nth the desired fpp.  In my simple test, going from 1 to 16 bins moves
> the bloom_filter size from 20 KB to 32 KB (that's 16k insertions and
> target_ffp of .01).
>
> - In the OSD, we won't necessarily know how many insertions a bin will
> have.  The user inputs will/should be something like:
>   - target fpp
>   - time interval we want to record/estimate over (say, 1 day)
> and we'll need to translate that into a particular sized bloom.  We
> probably need to estimate that based on recent pg workload/activity.  If
> we have a filter that "fills up" quickly, the next one will be bigger.  If
> we have an unfull one after some period of time, we can compress/fold it
> down into a smaller filter (a convenient trick) and start the next smaller
> one.

So basically we try to keep a running log of the PG activity so we can 
guess as to how big future filters should be, and if our guess is short 
we move on and make the next one bigger, but if we overshot we compress 
it into a smaller filter and make the next one smaller?  I think this is 
exactly what you said but I want to make sure I got it. :)

>
> I suggest the pg_pool_t get:
>
>   uint8_t bloom_fpp_pow;  //< desired false positive rate == 2^(-bloom_fpp_pow)
>   uint16_t bloom_period;  //< period for that false positive rate (seconds)
>   uint16_t bloom_bins;    //< the approx "resolution" we want
>   uint16_t bloom_max;     //< how many periods we retain history for
>
> which would make the OSD try to create a new filter roughly every
> (period/bins) seconds with some simple estimation.
>
> Other thoughts on the OSD side stuff:
>
> - The past history of filters may be large.  I suggest we stick them in an
> omap object that lives with the PG and is recovered/backfilled but that
> needs to happen in a way that is out of band wrt to other objects.  It
> should get scrubbed.

Any guesses as to how large this will get yet?  It's too early in the 
morning for me to do math...

>
> - Before anything is trimmed from the pg log it needs to be added to the
> currently-accumulating bloom filter.  We can delay this until (well) after
> last_complete_on_disk so that we never have to worry about divergent
> entries.
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: bloom filter thoughts
  2013-09-26 13:20 ` Mark Nelson
@ 2013-09-26 15:52   ` Sage Weil
  2013-09-26 16:06     ` Gregory Farnum
  0 siblings, 1 reply; 4+ messages in thread
From: Sage Weil @ 2013-09-26 15:52 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

On Thu, 26 Sep 2013, Mark Nelson wrote:
> On 09/25/2013 07:34 PM, Sage Weil wrote:
> > I spent some time on the plane playing with bloom filters.  We're looking
> > at using these on the OSD to (more) efficiently keep track of which
> > objects have (probably) been touched in recent history.  This can then be
> > used by the caching and tiering code to identify objects that have
> > (definitely) not been touched that should be demoted or purged from a
> > given tier.
> 
> You know this is making me want to work on my photon mapping engine again
> instead of running benchmarks for Bryan. :P  Anyway, this seems like a great
> use case.  Are there any other places we want to fast intersection tests where
> we could use these?  k-nearest-neighbor searches would be tons of fun to do
> too if we only had a use case!
> 
> > 
> > The false positive probility (in our case, probability we'll think an
> > object has been touched when it actually hasn't) is dependent on size of
> > the bloom filter and the number of insertions.
> > 
> > When you create the bloom_filter, you tell it how many insertions you
> > expect to do and what fpp you want and it sizes itself.  For ~1% fpp, it
> > costs just over 1 byte per insertion.
> 
> It's been a long time since I looked at bloom filters, but isn't this also
> affected by the number of hash functions used?  Are we going to try to have it
> auto-select the number?

The bloom_filter class hides this.  You feed it the desired fpp and 
insertion count and it chooses the number of hashes (actually, 1 hash but 
N salts) and table size.  Unit tests verify the actual fpp very closely 
matches the target.

> 
> > 
> > wip-bloom has some cleanups and unit tests that test the false positive
> > behavior.
> > 
> > A few issues/notes:
> > 
> > - If we build a sequence of filters covering intervals of time, this let's
> > us estimate "atime".  However, if we have N filters of 1/Nth the size, our
> > total_ffp = N * fpp, so each bin's bloom needs to be 1/Nth as big with
> > 1/Nth the desired fpp.  In my simple test, going from 1 to 16 bins moves
> > the bloom_filter size from 20 KB to 32 KB (that's 16k insertions and
> > target_ffp of .01).
> > 
> > - In the OSD, we won't necessarily know how many insertions a bin will
> > have.  The user inputs will/should be something like:
> >   - target fpp
> >   - time interval we want to record/estimate over (say, 1 day)
> > and we'll need to translate that into a particular sized bloom.  We
> > probably need to estimate that based on recent pg workload/activity.  If
> > we have a filter that "fills up" quickly, the next one will be bigger.  If
> > we have an unfull one after some period of time, we can compress/fold it
> > down into a smaller filter (a convenient trick) and start the next smaller
> > one.
> 
> So basically we try to keep a running log of the PG activity so we can guess
> as to how big future filters should be, and if our guess is short we move on
> and make the next one bigger, but if we overshot we compress it into a smaller
> filter and make the next one smaller?  I think this is exactly what you said
> but I want to make sure I got it. :)

Exactly.

> > 
> > I suggest the pg_pool_t get:
> > 
> >   uint8_t bloom_fpp_pow;  //< desired false positive rate ==
> > 2^(-bloom_fpp_pow)
> >   uint16_t bloom_period;  //< period for that false positive rate (seconds)
> >   uint16_t bloom_bins;    //< the approx "resolution" we want
> >   uint16_t bloom_max;     //< how many periods we retain history for
> > 
> > which would make the OSD try to create a new filter roughly every
> > (period/bins) seconds with some simple estimation.
> > 
> > Other thoughts on the OSD side stuff:
> > 
> > - The past history of filters may be large.  I suggest we stick them in an
> > omap object that lives with the PG and is recovered/backfilled but that
> > needs to happen in a way that is out of band wrt to other objects.  It
> > should get scrubbed.
> 
> Any guesses as to how large this will get yet?  It's too early in the morning
> for me to do math...

Greg did a quick calc yesterday, I forget.  It was enough that it might 
fit in RAM for a spinning disk for 1-2 days of history, but only 
marginally.

sage

 
> > 
> > - Before anything is trimmed from the pg log it needs to be added to the
> > currently-accumulating bloom filter.  We can delay this until (well) after
> > last_complete_on_disk so that we never have to worry about divergent
> > entries.
> > 
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: bloom filter thoughts
  2013-09-26 15:52   ` Sage Weil
@ 2013-09-26 16:06     ` Gregory Farnum
  0 siblings, 0 replies; 4+ messages in thread
From: Gregory Farnum @ 2013-09-26 16:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Mark Nelson, ceph-devel@vger.kernel.org

On Thu, Sep 26, 2013 at 8:52 AM, Sage Weil <sage@inktank.com> wrote:
> On Thu, 26 Sep 2013, Mark Nelson wrote:
>> On 09/25/2013 07:34 PM, Sage Weil wrote:
>> > I spent some time on the plane playing with bloom filters.  We're looking
>> > at using these on the OSD to (more) efficiently keep track of which
>> > objects have (probably) been touched in recent history.  This can then be
>> > used by the caching and tiering code to identify objects that have
>> > (definitely) not been touched that should be demoted or purged from a
>> > given tier.
>>
>> You know this is making me want to work on my photon mapping engine again
>> instead of running benchmarks for Bryan. :P  Anyway, this seems like a great
>> use case.  Are there any other places we want to fast intersection tests where
>> we could use these?  k-nearest-neighbor searches would be tons of fun to do
>> too if we only had a use case!
>>
>> >
>> > The false positive probility (in our case, probability we'll think an
>> > object has been touched when it actually hasn't) is dependent on size of
>> > the bloom filter and the number of insertions.
>> >
>> > When you create the bloom_filter, you tell it how many insertions you
>> > expect to do and what fpp you want and it sizes itself.  For ~1% fpp, it
>> > costs just over 1 byte per insertion.
>>
>> It's been a long time since I looked at bloom filters, but isn't this also
>> affected by the number of hash functions used?  Are we going to try to have it
>> auto-select the number?
>
> The bloom_filter class hides this.  You feed it the desired fpp and
> insertion count and it chooses the number of hashes (actually, 1 hash but
> N salts) and table size.  Unit tests verify the actual fpp very closely
> matches the target.
>
>>
>> >
>> > wip-bloom has some cleanups and unit tests that test the false positive
>> > behavior.
>> >
>> > A few issues/notes:
>> >
>> > - If we build a sequence of filters covering intervals of time, this let's
>> > us estimate "atime".  However, if we have N filters of 1/Nth the size, our
>> > total_ffp = N * fpp, so each bin's bloom needs to be 1/Nth as big with
>> > 1/Nth the desired fpp.  In my simple test, going from 1 to 16 bins moves
>> > the bloom_filter size from 20 KB to 32 KB (that's 16k insertions and
>> > target_ffp of .01).
>> >
>> > - In the OSD, we won't necessarily know how many insertions a bin will
>> > have.  The user inputs will/should be something like:
>> >   - target fpp
>> >   - time interval we want to record/estimate over (say, 1 day)
>> > and we'll need to translate that into a particular sized bloom.  We
>> > probably need to estimate that based on recent pg workload/activity.  If
>> > we have a filter that "fills up" quickly, the next one will be bigger.  If
>> > we have an unfull one after some period of time, we can compress/fold it
>> > down into a smaller filter (a convenient trick) and start the next smaller
>> > one.
>>
>> So basically we try to keep a running log of the PG activity so we can guess
>> as to how big future filters should be, and if our guess is short we move on
>> and make the next one bigger, but if we overshot we compress it into a smaller
>> filter and make the next one smaller?  I think this is exactly what you said
>> but I want to make sure I got it. :)
>
> Exactly.
>
>> >
>> > I suggest the pg_pool_t get:
>> >
>> >   uint8_t bloom_fpp_pow;  //< desired false positive rate ==
>> > 2^(-bloom_fpp_pow)
>> >   uint16_t bloom_period;  //< period for that false positive rate (seconds)
>> >   uint16_t bloom_bins;    //< the approx "resolution" we want
>> >   uint16_t bloom_max;     //< how many periods we retain history for
>> >
>> > which would make the OSD try to create a new filter roughly every
>> > (period/bins) seconds with some simple estimation.
>> >
>> > Other thoughts on the OSD side stuff:
>> >
>> > - The past history of filters may be large.  I suggest we stick them in an
>> > omap object that lives with the PG and is recovered/backfilled but that
>> > needs to happen in a way that is out of band wrt to other objects.  It
>> > should get scrubbed.
>>
>> Any guesses as to how large this will get yet?  It's too early in the morning
>> for me to do math...
>
> Greg did a quick calc yesterday, I forget.  It was enough that it might
> fit in RAM for a spinning disk for 1-2 days of history, but only
> marginally.

Well, depends on the sort of windows and overlaps we want, but:
120 ops/s * 3600 s/hour * 1 byte/op = ~422KB/hour, or <10MB/day.
(where we here assume each op touches a different object, which is unlikely)

So it should actually fit in RAM just fine as long as we make it
durable — but we need to handle all-SSD setups as well, and those can
get much larger. What's more interesting here is that we can feasibly
store a list of actual, touched objects for only 4 times as much data
(4 byte hashes versus 1 byte/insert in the bloom filter) for those who
require it.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2013-09-26 16:06 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-26  0:34 bloom filter thoughts Sage Weil
2013-09-26 13:20 ` Mark Nelson
2013-09-26 15:52   ` Sage Weil
2013-09-26 16:06     ` Gregory Farnum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.