efficient removal of old objects

* efficient removal of old objects
@ 2012-02-01  0:33 Sage Weil
  2012-02-01  0:52 ` Josh Durgin
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Sage Weil @ 2012-02-01  0:33 UTC (permalink / raw)
  To: ceph-devel

Currently rgw logs objects it wants to delete after some period of time,
and a radosgw-admin command comes back later to process the log.  It
works, but is currently slow (one sync op at a time).

A better approach would be to mark objects for later removal, and have the 
OSD do it in some more efficient way.  wip-objs-expire has a client side 
(librados) interface for this.

I think there are a couple of questions:

Should this be generalized to "do these OSD ops at time X" instead of
"delete at time X"?  Then it could setxattr, remove, call into a class,
whatever.

How would the OSD implement this?  A kludgey way would be to do it during
scrub.  The current scrub implementation may make that problematic because
it does a whole PG at a time, and we probably don't want to issue a whole
PG's worth of deletes at once.  Is there a way to make that less
painful?

Not using scrub means we need some sort of index to keep track of objects 
with delayed events.  Using a collection for this might work, but loading 
all this state into memory would be slow if there were too many events 
registered.

Given all that, and that we need a solution to the expiration soon 
(weeks), do we
 - do a complete solution now,
 - parallelize radosgw-admin log processing,
 - or hack it into scrub?

sage

* Re: efficient removal of old objects
  2012-02-01  0:33 efficient removal of old objects Sage Weil
@ 2012-02-01  0:52 ` Josh Durgin
  2012-02-01  1:02 ` Tommi Virtanen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Josh Durgin @ 2012-02-01  0:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On 01/31/2012 04:33 PM, Sage Weil wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).
>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.
>
> I think there are a couple questions:
>
> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.
>
> How would the OSD implement this?  A kludgey way would be to do it during
> scrub.  The current scrub implementation may make that problematic because
> it does a whole PG at time, and we probably don't want to issue a whole
> PG's worth of deletes at a time.  Is there a way to make that less
> painful?
>
> Not using scrub means we need some sort of index to keep track of objects
> with delayed events.  Using a collection for this might work, but loading
> all this state into memory would be slow if there were too many events
> registered.
>
> Given all that, and that we need a solution to the expiration soon
> (weeks), do we
>   - do a complete solution now,
>   - parallelize radosgw-admin log processing,
>   - or hack it into scrub?
>
> sage

* Re: efficient removal of old objects
  2012-02-01  0:33 efficient removal of old objects Sage Weil
  2012-02-01  0:52 ` Josh Durgin
@ 2012-02-01  1:02 ` Tommi Virtanen
       [not found]   ` <CAC-hyiExnN6CxMh=+5tLoZy3T0=Mx6Y3P796rG3L01mZ-=+vOg@mail.gmail.com>
  2012-02-02  0:11   ` Mark Kampe
  2012-02-01  1:19 ` Josh Durgin
  2012-02-01  8:26 ` Yehuda Sadeh Weinraub
  3 siblings, 2 replies; 12+ messages in thread
From: Tommi Virtanen @ 2012-02-01  1:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Tue, Jan 31, 2012 at 16:33, Sage Weil <sage@newdream.net> wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).
>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.

Is there some reason why this would be significantly more performant
when done by the OSD itself? It seems like the deletion times can be
bucketed by time nicely, then each bucket just contains a set of ids
-- a good fit for the map data type -- and the client for running this
deletion just streams the bucket contents over and issues delete
messages for everything. What makes that inherently slow?
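
As a rough sketch of the time-bucketing idea above (not code from any
Ceph branch): expiry times are rounded down to a fixed-width bucket,
each bucket holds a set of object ids, and a cleanup pass drains only
the buckets that have fully elapsed. The in-memory dict stands in for
whatever map the ids would really be stored in; ExpiryIndex, bucket_for
and BUCKET_SECONDS are hypothetical names chosen for the example.

```python
from collections import defaultdict

BUCKET_SECONDS = 3600  # hypothetical bucket width: one hour


def bucket_for(expire_at):
    """Round an expiry timestamp down to the start of its bucket."""
    return expire_at - (expire_at % BUCKET_SECONDS)


class ExpiryIndex:
    """Maps bucket start time -> set of object ids due in that bucket."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def schedule(self, obj_id, expire_at):
        self.buckets[bucket_for(expire_at)].add(obj_id)

    def drain_due(self, now):
        """Yield (and forget) ids from every bucket that has fully elapsed."""
        # sorted() copies the keys, so popping inside the loop is safe.
        for start in sorted(self.buckets):
            if start + BUCKET_SECONDS > now:
                break  # this bucket and all later ones are still open
            for obj_id in sorted(self.buckets.pop(start)):
                yield obj_id
```

The cleanup client would then stream drain_due(now) and issue one delete
per id; how those deletes are issued (serially or in parallel) is a
separate choice.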

> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.

That sounds like a really complex API, for quite marginal gain.

To make my point even clearer: point me to another data store that has
that idiom.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: efficient removal of old objects
  2012-02-01  0:33 efficient removal of old objects Sage Weil
  2012-02-01  0:52 ` Josh Durgin
  2012-02-01  1:02 ` Tommi Virtanen
@ 2012-02-01  1:19 ` Josh Durgin
  2012-02-01  8:26 ` Yehuda Sadeh Weinraub
  3 siblings, 0 replies; 12+ messages in thread
From: Josh Durgin @ 2012-02-01  1:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

(sorry for the extra email)

On 01/31/2012 04:33 PM, Sage Weil wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).
>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.
>
> I think there are a couple questions:
>
> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.

What are some other use cases for this? It may be useful in the future,
but if the only immediate use is speeding up radosgw-admin, I don't think
it's worth further complicating the OSD and all the layers above it.

> How would the OSD implement this?  A kludgey way would be to do it during
> scrub.  The current scrub implementation may make that problematic because
> it does a whole PG at time, and we probably don't want to issue a whole
> PG's worth of deletes at a time.  Is there a way to make that less
> painful?

This would also tie it to scrub actually happening. OSDs under high
load would never process the operations unless you disable the load
check, in which case you slow down already loaded OSDs with scrubbing.

> Not using scrub means we need some sort of index to keep track of objects
> with delayed events.  Using a collection for this might work, but loading
> all this state into memory would be slow if there were too many events
> registered.
>
> Given all that, and that we need a solution to the expiration soon
> (weeks), do we
>   - do a complete solution now,
>   - parallelize radosgw-admin log processing,

I'm in favor of this, since it's much simpler and easier to maintain
than a full-blown time-based op, and the scrub kludge will be even
worse to maintain (plus it turns a read-only operation into a
read-write one).
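
For illustration only, parallelizing the log processing could be as
small as keeping several deletes in flight instead of one. In this
sketch delete_fn is a placeholder for whatever call removes a single
object (an assumption, not the radosgw-admin code), and process_log and
max_in_flight are made-up names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_log(entries, delete_fn, max_in_flight=16):
    """Call delete_fn on each log entry, up to max_in_flight at a time.

    Returns a list of (entry, exception) pairs for the deletes that
    failed, so the caller can retry them or keep them in the log.
    """
    failed = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Submit everything up front; the pool limits the concurrency.
        futures = {pool.submit(delete_fn, entry): entry for entry in entries}
        for future, entry in futures.items():
            exc = future.exception()  # blocks until this delete finishes
            if exc is not None:
                failed.append((entry, exc))
    return failed
```

Submitting the whole log at once keeps the sketch short; a very large
log would want to be fed to the pool in chunks.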

>   - or hack it into scrub?
>
> sage

* Re: efficient removal of old objects
       [not found]   ` <CAC-hyiExnN6CxMh=+5tLoZy3T0=Mx6Y3P796rG3L01mZ-=+vOg@mail.gmail.com>
@ 2012-02-01  8:04     ` Yehuda Sadeh Weinraub
  0 siblings, 0 replies; 12+ messages in thread
From: Yehuda Sadeh Weinraub @ 2012-02-01  8:04 UTC (permalink / raw)
  To: ceph-devel

(resending to list, sorry tv)

On Tue, Jan 31, 2012 at 5:02 PM, Tommi Virtanen
<tommi.virtanen@dreamhost.com> wrote:
>
> On Tue, Jan 31, 2012 at 16:33, Sage Weil <sage@newdream.net> wrote:
> > Currently rgw logs objects it wants to delete after some period of time,
> > and an radosgw-admin command comes back later to process the log.  It
> > works, but is currently slow (one sync op at a time).
> >
> > A better approach would be to mark objects for later removal, and have the
> > OSD do it in some more efficient way.  wip-objs-expire has a client side
> > (librados) interface for this.
>
> Is there some reason why this would be significantly more performant
> when done by the OSD itself? It seems like the deletion times can be
> bucketed by time nicely, then each bucket just contains a set of ids
> -- a good fit for the map data type -- and the client for running this
> deletion just streams the bucket contents over and issues delete
> messages for everything. What makes that inherently slow?

Random access to random cold objects is generally slower than doing
the operations on a single pg. E.g., if doing it as part of the scrub,
then objects are accessed anyway and are hopefully cached.

>
> > Should this be generalized to saying "do these osd ops at time X" instead
> > of "delete at time X".  Then it could setxattr, remove, call into a class,
> > whatever.
>
> That sounds like a really complex API, for quite marginal gain.

I do agree that for garbage collection alone it's overkill; however,

>
> To make my point even clearer: point me to another data store that has
> that idiom.

I can see its use, and even if not, I'm sure that there would be users
who would need it. I don't think it's relevant whether any other data
store implements it.

Yehuda

* Re: efficient removal of old objects
  2012-02-01  0:33 efficient removal of old objects Sage Weil
                   ` (2 preceding siblings ...)
  2012-02-01  1:19 ` Josh Durgin
@ 2012-02-01  8:26 ` Yehuda Sadeh Weinraub
  2012-02-01 17:39   ` Gregory Farnum
  3 siblings, 1 reply; 12+ messages in thread
From: Yehuda Sadeh Weinraub @ 2012-02-01  8:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Tue, Jan 31, 2012 at 4:33 PM, Sage Weil <sage@newdream.net> wrote:
> Currently rgw logs objects it wants to delete after some period of time,
> and an radosgw-admin command comes back later to process the log.  It
> works, but is currently slow (one sync op at a time).

Intent log generation doesn't come free of charge; it adds some load
on the system.

>
> A better approach would be to mark objects for later removal, and have the
> OSD do it in some more efficient way.  wip-objs-expire has a client side
> (librados) interface for this.

Note that setting expiration on an object is a more lightweight
operation than appending to the intent log, as it would be done as a sub
op in the compound operation that created the object.

>
> I think there are a couple questions:
>
> Should this be generalized to saying "do these osd ops at time X" instead
> of "delete at time X".  Then it could setxattr, remove, call into a class,
> whatever.

While I think it'd make a nice feature, I also think that the problem
space of garbage collection is a bit different, and given the time
constraints it wouldn't make sense to implement this right now anyway.
>
> How would the OSD implement this?  A kludgey way would be to do it during
> scrub.  The current scrub implementation may make that problematic because
> it does a whole PG at time, and we probably don't want to issue a whole
> PG's worth of deletes at a time.  Is there a way to make that less
> painful?

If we need to lock the entire pg while removing the objects it wouldn't work.
I'm not too familiar with the scrub code, and I don't want to dive
here into possible implementation details, but getting the scrub to
generate a list of objects for removal may be possible.

>
> Not using scrub means we need some sort of index to keep track of objects
> with delayed events.  Using a collection for this might work, but loading
> all this state into memory would be slow if there were too many events
> registered.
>
> Given all that, and that we need a solution to the expiration soon
> (weeks), do we
>  - do a complete solution now,
>  - parallelize radosgw-admin log processing,
>  - or hack it into scrub?
>
I don't expect to see many hands going up for "hacking" anything. I
would argue that having a garbage collection related job going on
inside a maintenance activity is not that far fetched. Not at any cost
though.

Yehuda

* Re: efficient removal of old objects
  2012-02-01  8:26 ` Yehuda Sadeh Weinraub
@ 2012-02-01 17:39   ` Gregory Farnum
  2012-02-01 18:53     ` Yehuda Sadeh Weinraub
  0 siblings, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2012-02-01 17:39 UTC (permalink / raw)
  To: ceph-devel; +Cc: Sage Weil, Yehuda Sadeh Weinraub

On Wed, Feb 1, 2012 at 12:04 AM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> (resending to list, sorry tv)
>
> On Tue, Jan 31, 2012 at 5:02 PM, Tommi Virtanen
> <tommi.virtanen@dreamhost.com> wrote:
>>
>> On Tue, Jan 31, 2012 at 16:33, Sage Weil <sage@newdream.net> wrote:
>> > Currently rgw logs objects it wants to delete after some period of time,
>> > and an radosgw-admin command comes back later to process the log.  It
>> > works, but is currently slow (one sync op at a time).
>> >
>> > A better approach would be to mark objects for later removal, and have the
>> > OSD do it in some more efficient way.  wip-objs-expire has a client side
>> > (librados) interface for this.
>>
>> Is there some reason why this would be significantly more performant
>> when done by the OSD itself? It seems like the deletion times can be
>> bucketed by time nicely, then each bucket just contains a set of ids
>> -- a good fit for the map data type -- and the client for running this
>> deletion just streams the bucket contents over and issues delete
>> messages for everything. What makes that inherently slow?
>
> Random access to random cold objects is generally slower than doing
> the operations on a single pg. E.g., if doing it as part of the scrub,
> then objects are accessed anyway and are hopefully cached.

You are dramatically overstating the impact of latency on an
inherently parallelizable and non-interactive operation. A couple disk
seeks *do not matter.*

On Wed, Feb 1, 2012 at 12:26 AM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Tue, Jan 31, 2012 at 4:33 PM, Sage Weil <sage@newdream.net> wrote:
>> A better approach would be to mark objects for later removal, and have the
>> OSD do it in some more efficient way.  wip-objs-expire has a client side
>> (librados) interface for this.
>
> Note that setting expiration on an object is a more lightweight
> operation than appending the intent log, as it would be done as a sub
> op in the compound operation that created the object.

...you're going to set expirations on the objects when you write them?
What if the user's upload takes longer than you expect?

>> I think there are a couple questions:
>>
>> Should this be generalized to saying "do these osd ops at time X" instead
>> of "delete at time X".  Then it could setxattr, remove, call into a class,
>> whatever.
>
> While I think it'd make a nice feature, I also think that the problem
> space of a garbage collection is a bit different, and given the time
> constraints it wouldn't make sense implementing this right now anyway.

This is client-side garbage collection, not RADOS garbage collection.
Don't confuse those issues, either — the second is appropriate to put
into the OSDs as special logic; the first is not. That's why we think
that any OSD implementation of this should be generalized as a class
interface, rather than a specific hack.

>> How would the OSD implement this?  A kludgey way would be to do it during
>> scrub.  The current scrub implementation may make that problematic because
>> it does a whole PG at time, and we probably don't want to issue a whole
>> PG's worth of deletes at a time.  Is there a way to make that less
>> painful?
>
> If we need to lock the entire pg while removing the objects it wouldn't work.

That's how scrub works right now...

> I'm not too familiar with the scrub code, and I don't want to dive
> here into possible implementation details, but getting the scrub to
> generate a list of objects for removal may be possible.

Sam and I tossed around a few ideas for how to do this, and it's not
impossible, but it was significantly more complicated than everybody
thinks it is at first glance. (You need to make sure that it doesn't
interact with recovery at all, which means it needs to go through the
normal request mechanism, which means you need to build up a queue of
deletes while scrubbing and then dispatch it properly without
disrupting client requests or running out of memory; you need to make
sure that scrubbing runs more reliably than it does right now...etc
etc)
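
To make the memory concern in the parenthetical concrete, here is a
minimal sketch of a scan that feeds deletes through a bounded queue, so
the scan stalls rather than accumulating an unbounded list. It is not
the OSD's request path; run_scan, is_expired, delete_fn and max_pending
are hypothetical names, and recovery and client-request priority are
ignored entirely.

```python
import queue
import threading


def run_scan(scan_objects, is_expired, delete_fn, max_pending=128):
    """Scan objects and hand expired ones to a single dispatcher thread.

    Returns (number queued, list of (object, exception) failures).
    """
    pending = queue.Queue(maxsize=max_pending)  # put() blocks when full
    done = object()                             # sentinel: scan finished
    errors = []

    def dispatcher():
        while True:
            item = pending.get()
            if item is done:
                return
            try:
                delete_fn(item)
            except Exception as exc:  # keep draining so the scan never hangs
                errors.append((item, exc))

    worker = threading.Thread(target=dispatcher)
    worker.start()
    queued = 0
    for obj in scan_objects:
        if is_expired(obj):
            pending.put(obj)  # backpressure: waits here if the queue is full
            queued += 1
    pending.put(done)
    worker.join()
    return queued, errors
```

The bounded queue caps memory at max_pending entries, at the cost of
making the scan wait on the deletes, which is one of the trade-offs
described above.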

>> Not using scrub means we need some sort of index to keep track of objects
>> with delayed events.  Using a collection for this might work, but loading
>> all this state into memory would be slow if there were too many events
>> registered.
>>
>> Given all that, and that we need a solution to the expiration soon
>> (weeks), do we
>>  - do a complete solution now,
>>  - parallelize radosgw-admin log processing,
>>  - or hack it into scrub?
>>
> I don't expect to see many hands going up for "hacking" anything. I
> would argue that having a garbage collection related job going on
> inside a maintenance activity is not that far fetched. Not at any cost
> though.

The problem is that it changes the nature of scrub. Right now, scrub
doesn't change anything at all; scrub repair sets the replicas to have
the same state as the primary. You want to add a client-controlled
state mutation that is triggered as part of scrub, which *really*
makes it different (and complicated)...or else it's a hack to have
scrub trigger some weird sequences of requests (as I outlined above).
Either way, it's a big change to scrub that smells hacky.

The basic issue here is that the RGW stuff can all be done as
client-side operations, and all you've demonstrated is that doing it
serially with a single client is slow (but not slower than the
generation of the objects, which means that it does actually work).
The correct response to that is not to add half-baked features to the
OSD; the correct response is to make your client behave well.
If we *do* want to add time-based triggers that clients can set up,
that ought to be a well-thought-out interface that isn't limited to a
single use-case. I'm totally fine with the idea, as long as it comes
at some point in the future when we aren't all working hard to
stabilize the core system.

-Greg

* Re: efficient removal of old objects
  2012-02-01 17:39   ` Gregory Farnum
@ 2012-02-01 18:53     ` Yehuda Sadeh Weinraub
  2012-02-01 19:35       ` Gregory Farnum
  2012-02-01 19:43       ` Sage Weil
  0 siblings, 2 replies; 12+ messages in thread
From: Yehuda Sadeh Weinraub @ 2012-02-01 18:53 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Sage Weil

On Wed, Feb 1, 2012 at 9:39 AM, Gregory Farnum
<gregory.farnum@dreamhost.com> wrote:
> On Wed, Feb 1, 2012 at 12:04 AM, Yehuda Sadeh Weinraub
> <yehudasa@gmail.com> wrote:
>> (resending to list, sorry tv)
>>
>> On Tue, Jan 31, 2012 at 5:02 PM, Tommi Virtanen
>> <tommi.virtanen@dreamhost.com> wrote:
>>>
>>> On Tue, Jan 31, 2012 at 16:33, Sage Weil <sage@newdream.net> wrote:
>>> > Currently rgw logs objects it wants to delete after some period of time,
>>> > and an radosgw-admin command comes back later to process the log.  It
>>> > works, but is currently slow (one sync op at a time).
>>> >
>>> > A better approach would be to mark objects for later removal, and have the
>>> > OSD do it in some more efficient way.  wip-objs-expire has a client side
>>> > (librados) interface for this.
>>>
>>> Is there some reason why this would be significantly more performant
>>> when done by the OSD itself? It seems like the deletion times can be
>>> bucketed by time nicely, then each bucket just contains a set of ids
>>> -- a good fit for the map data type -- and the client for running this
>>> deletion just streams the bucket contents over and issues delete
>>> messages for everything. What makes that inherently slow?
>>
>> Random access to random cold objects is generally slower than doing
>> the operations on a single pg. E.g., if doing it as part of the scrub,
>> then objects are accessed anyway and are hopefully cached.
>
> You are dramatically overstating the impact of latency on an
> inherently parallelizable and non-interactive operation. A couple disk
> seeks *do not matter.*

Do not matter to whom? It affects the overall osd performance, and
given enough threads going on in parallel doing the cleanup, it
*really* matters, and this is the basic issue.

>
> On Wed, Feb 1, 2012 at 12:26 AM, Yehuda Sadeh Weinraub
> <yehudasa@gmail.com> wrote:
>> On Tue, Jan 31, 2012 at 4:33 PM, Sage Weil <sage@newdream.net> wrote:
>>> A better approach would be to mark objects for later removal, and have the
>>> OSD do it in some more efficient way.  wip-objs-expire has a client side
>>> (librados) interface for this.
>>
>> Note that setting expiration on an object is a more lightweight
>> operation than appending the intent log, as it would be done as a sub
>> op in the compound operation that created the object.
>
> ...you're going to set expirations on the objects when you write them?
> What if the user's upload takes longer than you expect?

You're a few months too late. Go back to the atomic get/put discussion.

>
>>> I think there are a couple questions:
>>>
>>> Should this be generalized to saying "do these osd ops at time X" instead
>>> of "delete at time X".  Then it could setxattr, remove, call into a class,
>>> whatever.
>>
>> While I think it'd make a nice feature, I also think that the problem
>> space of a garbage collection is a bit different, and given the time
>> constraints it wouldn't make sense implementing this right now anyway.
>
> This is client-side garbage collection, not RADOS garbage collection.
> Don't confuse those issues, either — the second is appropriate to put

No. This is a garbage collection utility that RADOS can provide.

> into the OSDs as special logic; the first is not. That's why we think
> that any OSD implementation of this should be generalized as a class
> interface, rather than a specific hack.

It'd be nice to have a generalized class interface, but garbage
collection is garbage collection. I'm all for extending classes to do
all sorts of crazy things and to give users a flexible enough framework
to work with. However, you'd agree that it's not something we'd do in
the near future. Cleaning up temp objects is a real issue now.

>
>>> How would the OSD implement this?  A kludgey way would be to do it during
>>> scrub.  The current scrub implementation may make that problematic because
>>> it does a whole PG at time, and we probably don't want to issue a whole
>>> PG's worth of deletes at a time.  Is there a way to make that less
>>> painful?
>>
>> If we need to lock the entire pg while removing the objects it wouldn't work.
>
> That's how scrub works right now...
>
>> I'm not too familiar with the scrub code, and I don't want to dive
>> here into possible implementation details, but getting the scrub to
>> generate a list of objects for removal may be possible.
>
> Sam and I tossed around a few ideas for how to do this, and it's not
> impossible, but it was significantly more complicated than everybody
> thinks it is at first glance. (You need to make sure that it doesn't
> interact with recovery at all, which means it needs to go through the
> normal request mechanism, which means you need to build up a queue of
> deletes while scrubbing and then dispatch it properly without
> disrupting client requests or running out of memory; you need to make
> sure that scrubbing runs more reliably than it does right now...etc
> etc)

I think you're overstating the complexity. We're already disrupting
client requests by running the cleanup externally. Leveraging scrub
throttling due to system load is a strength, not a weakness. If
anything, running the intent log cleanup blindly is the real issue. Add
to that the fact that scrub runs over the objects anyway and heats up
the caches, and the performance gain we'd get is much bigger.

>
>>> Not using scrub means we need some sort of index to keep track of objects
>>> with delayed events.  Using a collection for this might work, but loading
>>> all this state into memory would be slow if there were too many events
>>> registered.
>>>
>>> Given all that, and that we need a solution to the expiration soon
>>> (weeks), do we
>>>  - do a complete solution now,
>>>  - parallelize radosgw-admin log processing,
>>>  - or hack it into scrub?
>>>
>> I don't expect to see many hands going up for "hacking" anything. I
>> would argue that having a garbage collection related job going on
>> inside a maintenance activity is not that far fetched. Not at any cost
>> though.
>
> The problem is that it changes the nature of scrub. Right now, scrub
> doesn't change anything at all; scrub repair sets the replicas to have

It doesn't change anything with scrub's nature if it's only used to
generate the list of objects to remove (per pg).

> the same state as the primary. You want to add a client-controlled
> state mutation that is triggered as part of scrub, which *really*
> makes it different (and complicated)...or else it's a hack to have
> scrub trigger some weird sequences of requests (as I outlined above).
> Either way, it's a big change to scrub that smells hacky.
>
> The basic issue here is that the RGW stuff can all be done as
> client-side operations, and all you've demonstrated is that doing it
> serially with a single client is slow (but not slower than the
> generation of the objects, which means that it does actually work).
> The correct response to that is not to add half-baked features to the
> OSD; the correct response is to make your client behave well.

The problem is not the client behaving well, but the impact that it
has on the overall system performance due to random seeks.


> If we *do* want to add time-based triggers that clients can set up,
> that ought to be a well-thought-out interface that isn't limited to a
> single use-case. I'm totally fine with the idea, as long as it comes
> at some point in the future when we aren't all working hard to
> stabilize the core system.

We may create that sometime in the future, and implement garbage
collection using that. But you're failing to understand the point that
using the scrub is just an implementation detail. I do think that we
need object expiration in rados. This is not a single use case.
I also think that using an external client for that is a mistake
(performance on one hand, but also added administrative pain).

Yehuda

* Re: efficient removal of old objects
  2012-02-01 18:53     ` Yehuda Sadeh Weinraub
@ 2012-02-01 19:35       ` Gregory Farnum
  2012-02-01 20:01         ` Yehuda Sadeh Weinraub
  2012-02-01 19:43       ` Sage Weil
  1 sibling, 1 reply; 12+ messages in thread
From: Gregory Farnum @ 2012-02-01 19:35 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: ceph-devel, Sage Weil

On Wed, Feb 1, 2012 at 10:53 AM, Yehuda Sadeh Weinraub
<yehudasa@gmail.com> wrote:
> On Wed, Feb 1, 2012 at 9:39 AM, Gregory Farnum
> <gregory.farnum@dreamhost.com> wrote:
>> You are dramatically overstating the impact of latency on an
>> inherently parallelizable and non-interactive operation. A couple disk
>> seeks *do not matter.*
>
> Do not matter to whom? It affects the overall osd performance, and
> given enough threads going on in parallel doing the cleanup, it
> *really* matters, and this is the basic issue.
You can impact the random lookups based on how the intent log is
actually designed, to the point that those lookups should not impact
the OSDs noticeably (a few hundred requests per day to do once-a-day
cleanups). Once you have the information on what to delete, you are
going to run the same sequence of operations; the only question is
whether they originate on the client or on the OSD. Assuming a large
load (as you are) and a non-trivial PG, the deletes are all going to
have to go to disk to find the inodes anyway. So the vast majority of
the load required for a client-side solution is identical to the load
required for a scrub-based solution.

>>> Note that setting expiration on an object is a more lightweight
>>> operation than appending the intent log, as it would be done as a sub
>>> op in the compound operation that created the object.
>>
>> ...you're going to set expirations on the objects when you write them?
>> What if the user's upload takes longer than you expect?
>
> You're a few months too late. Go back to the atomic get/put discussion.
I remember this discussion, but I thought we'd ended up setting
intent-to-delete when we did the final clone into place?

>>>> I think there are a couple questions:
>>>>
>>>> Should this be generalized to saying "do these osd ops at time X" instead
>>>> of "delete at time X".  Then it could setxattr, remove, call into a class,
>>>> whatever.
>>>
>>> While I think it'd make a nice feature, I also think that the problem
>>> space of a garbage collection is a bit different, and given the time
>>> constraints it wouldn't make sense implementing this right now anyway.
>>
>> This is client-side garbage collection, not RADOS garbage collection.
>> Don't confuse those issues, either — the second is appropriate to put
>
> No. This is a garbage collection utility that RADOS can provide.

Yes, it *can*, but that doesn't mean it *should*. Core OSD
functionality should be stuff that's widely-used by many clients, and
I can't think of any other client that's going to want time-based
garbage collection of this sort. Every other scenario I can think of
will just delete when they are done with the object, or else will want
more sophisticated checks than the amount of elapsed time. Which is
why I support the eventual addition of time-based class triggers, but
not an interface tailored exclusively for radosgw.

>>>> How would the OSD implement this?  A kludgey way would be to do it during
>>>> scrub.  The current scrub implementation may make that problematic because
>>>> it does a whole PG at time, and we probably don't want to issue a whole
>>>> PG's worth of deletes at a time.  Is there a way to make that less
>>>> painful?
>>>
>>> If we need to lock the entire pg while removing the objects it wouldn't work.
>>
>> That's how scrub works right now...
>>
>>> I'm not too familiar with the scrub code, and I don't want to dive
>>> here into possible implementation details, but getting the scrub to
>>> generate a list of objects for removal may be possible.
>>
>> Sam and I tossed around a few ideas for how to do this, and it's not
>> impossible, but it was significantly more complicated than everybody
>> thinks it is at first glance. (You need to make sure that it doesn't
>> interact with recovery at all, which means it needs to go through the
>> normal request mechanism, which means you need to build up a queue of
>> deletes while scrubbing and then dispatch it properly without
>> disrupting client requests or running out of memory; you need to make
>> sure that scrubbing runs more reliably than it does right now...etc
>> etc)
>
> I think you're overstating complexity. We're already disrupting client
> requests by running the cleanup externally. Leveraging scrub
> throttling due to system load is a strength, not a weakness. If anything,
> using the intent log cleanup blindly is a real issue. Also, add to
> that the fact that we leverage the fact that scrub runs over the
> objects anyway and heats up the caches, the performance gain we'd get
> is much bigger.
I thought the whole reason this had suddenly become such an issue is
because not cleaning up the intent log stuff has a performance impact
on the cluster. Scrub *doesn't run* when the load is too high...which
means that by leveraging scrub you will get into a circle of death
where cleanup never occurs because the load is too high, which causes
the load to continue increasing...

>> The problem is that it changes the nature of scrub. Right now, scrub
>> doesn't change anything at all; scrub repair sets the replicas to have
>
> It doesn't change anything with scrub's nature if it's only used to
> generate the list of objects to remove (per pg).

If your contention is that doing delayed work immediately following a
scrub is a huge performance win, then we ought to be able to hang more
than deletes off of scrub. Adding deletes now with the intention of
expanding it later creates an interface and code maintenance nightmare
— we either maintain two parallel code tracks or else we have to
convert old-style delete requests to new-style interface requests.
Either way, eww! This is directly contrary to the work we're doing
with message encoding et al to work towards more stable interfaces.

>> the same state as the primary. You want to add a client-controlled
>> state mutation that is triggered as part of scrub, which *really*
>> makes it different (and complicated)...or else it's a hack to have
>> scrub trigger some weird sequences of requests (as I outlined above).
>> Either way, it's a big change to scrub that smells hacky.
>>
>> The basic issue here is that the RGW stuff can all be done as
>> client-side operations, and all you've demonstrated is that doing it
>> serially with a single client is slow (but not slower than the
>> generation of the objects, which means that it does actually work).
>> The correct response to that is not to add half-baked features to the
>> OSD; the correct response is to make your client behave well.
>
> The problem is not the client behaving well, but the impact that it
> has on the overall system performance due to random seeks.

Maybe you've presented data on this to somebody, but the group hasn't
seen it. Please do show and tell! And demonstrate that the performance
impact is inherent in a client-based solution, rather than in the way
it's currently implemented. :)

>> If we *do* want to add time-based triggers that clients can set up,
>> that ought to be a well-thought-out interface that isn't limited to a
>> single use-case. I'm totally fine with the idea, as long as it comes
>> at some point in the future when we aren't all working hard to
>> stabilize the core system.
>
> We may create that sometime in the future, and implement garbage
> collection using that. But you're failing to understand the point that
> using the scrub is just an implementation detail.
Hanging it off of scrub is not just an implementation detail — if you
do it without scrub, then the work becomes dramatically more complex
and the fact that it does deletes instead of arbitrary code execution
is just an implementation detail. My opposition is to both the
implementation and the interface (that we have to carry forever).

> I do think that we
> need object expiration in rados. This is not a single use case.
> I also think that using external client for that is a mistake
> (performance on one hand, but also adding administrative pain).

I can't find object expiration anywhere except in S3, and that
interface is very clearly about end-user ease of use rather than
pushing the expiration into the object store. The only use case they
can come up with is logs stored in objects, and expiration generates
explicit bucket access logs, which makes it look to me like it's run as
a separate process using bucket scanning. *shrug*
-Greg
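
For reference, a minimal sketch of the "build up a queue of deletes while
scrubbing and then dispatch it" idea quoted above. This is not OSD code:
the class name, capacity, batch_size and submit_fn are all hypothetical,
and submit_fn merely stands in for whatever the normal request path would
be. It only shows how a hard cap keeps memory bounded and how batching
limits the number of deletes issued at once.

```python
from collections import deque


class BoundedDeleteQueue:
    """Holds pending deletes found during a scan, with a hard memory cap."""

    def __init__(self, capacity, batch_size):
        self.capacity = capacity      # most names ever held at once
        self.batch_size = batch_size  # most deletes submitted per dispatch
        self._pending = deque()

    def offer(self, name):
        """Queue one object name; return False when the queue is full."""
        if len(self._pending) >= self.capacity:
            return False  # caller should pause the scan and dispatch first
        self._pending.append(name)
        return True

    def dispatch(self, submit_fn):
        """Submit up to batch_size deletes via submit_fn (hypothetical)."""
        sent = 0
        while self._pending and sent < self.batch_size:
            submit_fn(self._pending.popleft())
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

A scan would call offer() per expired object and stop scanning whenever it
returns False, resuming only after dispatch() has drained some entries.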
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: efficient removal of old objects
  2012-02-01 18:53     ` Yehuda Sadeh Weinraub
  2012-02-01 19:35       ` Gregory Farnum
@ 2012-02-01 19:43       ` Sage Weil
  1 sibling, 0 replies; 12+ messages in thread
From: Sage Weil @ 2012-02-01 19:43 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: Gregory Farnum, ceph-devel

An expiration API seems sufficiently general and useful to ignore the 
other bike shed options.

I don't buy the readonly vs readwrite issue with scrub triggering 
deletions.  On the other hand, having scrub trigger it is a problem in 
general because the time periods may not align.  It would make garbage 
collection only useful when it is on the same order of magnitude as scrub.

Viewing the deletion workload in isolation, doing it from scrub is 
significantly more efficient.  (Assuming it is implemented well. I think 
the current scrub needs to be reworked before that could happen.)

However, in general, deletion is only a fraction of the overall system 
load.  In fact, we'd have one deletion to follow up every overwrite PUT, 
so we effectively double the number of write ops if the client does it for 
that case.  I don't think that will be a large part of the workload.

Doing it from the client also means we can control the period independently
of scrub... e.g. 1 hour instead of a day or days.

In any case, I think we should leave it on the client, at least for this 
sprint.  We need to make the scrubbing incremental first.

sage
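
A rough sketch of what parallelized client-side log processing could look
like, under stated assumptions: the log is taken to be a list of
(object_name, expire_at) pairs, and delete_fn is a stand-in for a RADOS
remove call. Neither reflects the actual radosgw-admin log format or
librados API. The worker pool bounds how many deletes are in flight, and
the caller chooses how often to run it (e.g. hourly).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_intent_log(entries, delete_fn, now=None, max_workers=8):
    """Delete every logged object whose expiry time has passed.

    entries: iterable of (object_name, expire_at) pairs (hypothetical format).
    delete_fn: callable that removes one object (stand-in for a real remove).
    Returns (deleted, pending): names removed and entries not yet due.
    """
    entries = list(entries)  # allow generators; we iterate twice
    now = time.time() if now is None else now
    due = [name for name, expire_at in entries if expire_at <= now]
    pending = [(name, t) for name, t in entries if t > now]
    # Bounded pool: at most max_workers deletes are in flight at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces completion and re-raises any delete failure.
        list(pool.map(delete_fn, due))
    return due, pending
```

The returned pending list would be written back as the new log, so entries
that are not yet due are picked up on a later run.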

* Re: efficient removal of old objects
  2012-02-01 19:35       ` Gregory Farnum
@ 2012-02-01 20:01         ` Yehuda Sadeh Weinraub
  0 siblings, 0 replies; 12+ messages in thread
From: Yehuda Sadeh Weinraub @ 2012-02-01 20:01 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Sage Weil

On Wed, Feb 1, 2012 at 11:35 AM, Gregory Farnum
<gregory.farnum@dreamhost.com> wrote:
> On Wed, Feb 1, 2012 at 10:53 AM, Yehuda Sadeh Weinraub
> <yehudasa@gmail.com> wrote:
>> On Wed, Feb 1, 2012 at 9:39 AM, Gregory Farnum
>> <gregory.farnum@dreamhost.com> wrote:
>>> You are dramatically overstating the impact of latency on an
>>> inherently parallelizable and non-interactive operation. A couple disk
>>> seeks *do not matter.*
>>
>> Do not matter to whom? It affects the overall osd performance, and
>> given enough threads going on in parallel doing the cleanup, it
>> *really* matters, and this is the basic issue.
> You can impact the random lookups based on how the intent log is
> actually designed, to the point that those lookups should not impact
> the OSDs noticeably (a few hundred requests per day to do once-a-day

I lost you here with that assessment. In any case you need to scale
the number of clients that do it with the number of your osds.


> cleanups). Once you have the information on who to delete, you are
> going to run the same sequence of operations; the only question is
> whether they originate on the client or on the OSD. Assuming a large
> load (as you are) and a not-trivial PG, the deletes are all going to
> have to go to disk to find the inodes anyway. So the vast majority of
> the load required for a client-side solution is identical to the load
> required for a scrub-based solution.
>
>>>> Note that setting expiration on an object is a more lightweight
>>>> operation than appending the intent log, as it would be done as a sub
>>>> op in the compound operation that created the object.
>>>
>>> ...you're going to set expirations on the objects when you write them?
>>> What if the user's upload takes longer than you expect?
>>
>> You're a few months too late. Go back to the atomic get/put discussion.
> I remember this discussion, but I thought we'd ended up setting
> intent-to-delete when we did the final clone into place?

In any case you expire the old cloned object, not the new one that is
being uploaded, so that only happens once the PUT has completed.

>
>>>>> I think there are a couple questions:
>>>>>
>>>>> Should this be generalized to saying "do these osd ops at time X" instead
>>>>> of "delete at time X".  Then it could setxattr, remove, call into a class,
>>>>> whatever.
>>>>
>>>> While I think it'd make a nice feature, I also think that the problem
>>>> space of a garbage collection is a bit different, and given the time
>>>> constraints it wouldn't make sense implementing this right now anyway.
>>>
>>> This is client-side garbage collection, not RADOS garbage collection.
>>> Don't confuse those issues, either — the second is appropriate to put
>>
>> No. This is a garbage collection utility that RADOS can provide.
>
> Yes, it *can*, but that doesn't mean it *should*. Core OSD
> functionality should be stuff that's widely used by many clients, and
> I can't think of any other client that's going to want time-based
> garbage collection of this sort. Every other scenario I can think of

Tell that to the Swift guys, or to Amazon.

> will just delete when they are done with the object, or else will want
> more sophisticated checks than the amount of elapsed time. Which is
> why I support the eventual addition of time-based class triggers, but
> not an interface tailored exclusively for radosgw.

See my previous comment.

>
>>>>> How would the OSD implement this?  A kludgey way would be to do it during
>>>>> scrub.  The current scrub implementation may make that problematic because
>>>>> it does a whole PG at a time, and we probably don't want to issue a whole
>>>>> PG's worth of deletes at a time.  Is there a way to make that less
>>>>> painful?
>>>>
>>>> If we need to lock the entire pg while removing the objects it wouldn't work.
>>>
>>> That's how scrub works right now...
>>>
>>>> I'm not too familiar with the scrub code, and I don't want to dive
>>>> here into possible implementation details, but getting the scrub to
>>>> generate a list of objects for removal may be possible.
>>>
>>> Sam and I tossed around a few ideas for how to do this, and it's not
>>> impossible, but it was significantly more complicated than everybody
>>> thinks it is at first glance. (You need to make sure that it doesn't
>>> interact with recovery at all, which means it needs to go through the
>>> normal request mechanism, which means you need to build up a queue of
>>> deletes while scrubbing and then dispatch it properly without
>>> disrupting client requests or running out of memory; you need to make
>>> sure that scrubbing runs more reliably than it does right now...etc
>>> etc)
>>
>> I think you're overstating complexity. We're already disrupting client
>> requests by running the cleanup externally. Leveraging scrub
>> throttling due to system load is a strength, not a weakness. If anything,
>> using the intent log cleanup blindly is a real issue. Also, add to
>> that the fact that we leverage the fact that scrub runs over the
>> objects anyway and heats up the caches, the performance gain we'd get
>> is much bigger.
> I thought the whole reason this had suddenly become such an issue is
> because not cleaning up the intent log stuff has a performance impact
> on the cluster. Scrub *doesn't run* when the load is too high...which

That's an underlying filesystem (probably temporary) issue. I wouldn't
build my architecture on it.

> means that by leveraging scrub you will get into a circle of death
> where cleanup never occurs because the load is too high, which causes
> the load to continue increasing...
>
>>> The problem is that it changes the nature of scrub. Right now, scrub
>>> doesn't change anything at all; scrub repair sets the replicas to have
>>
>> It doesn't change anything with scrub's nature if it's only used to
>> generate the list of objects to remove (per pg).
>
> If your contention is that doing delayed work immediately following a
> scrub is a huge performance win, then we ought to be able to hang more

No. We ought not to do that.

> than deletes off of scrub. Adding deletes now with the intention of
> expanding it later creates an interface and code maintenance nightmare
> — we either maintain two parallel code tracks or else we have to
> convert old-style delete requests to new-style interface requests.
> Either way, eww! This is directly contrary to the work we're doing
> with message encoding et al to work towards more stable interfaces.

Sorry, lost you above.

>
>>> the same state as the primary. You want to add a client-controlled
>>> state mutation that is triggered as part of scrub, which *really*
>>> makes it different (and complicated)...or else it's a hack to have
>>> scrub trigger some weird sequences of requests (as I outlined above).
>>> Either way, it's a big change to scrub that smells hacky.
>>>
>>> The basic issue here is that the RGW stuff can all be done as
>>> client-side operations, and all you've demonstrated is that doing it
>>> serially with a single client is slow (but not slower than the
>>> generation of the objects, which means that it does actually work).
>>> The correct response to that is not to add half-baked features to the
>>> OSD; the correct response is to make your client behave well.
>>
>> The problem is not the client behaving well, but the impact that it
>> has on the overall system performance due to random seeks.
>
> Maybe you've presented data on this to somebody, but the group hasn't
> seen it. Please do show and tell! And demonstrate that the performance
> impact is inherent in a client-based solution, rather than in the way
> it's currently implemented. :)

Ok, I'm off this thread. Greg, you lost me here with this kind of attitude.

>
>>> If we *do* want to add time-based triggers that clients can set up,
>>> that ought to be a well-thought-out interface that isn't limited to a
>>> single use-case. I'm totally fine with the idea, as long as it comes
>>> at some point in the future when we aren't all working hard to
>>> stabilize the core system.
>>
>> We may create that sometime in the future, and implement garbage
>> collection using that. But you're failing to understand the point that
>> using the scrub is just an implementation detail.
> Hanging it off of scrub is not just an implementation detail — if you
> do it without scrub, then the work becomes dramatically more complex
> and the fact that it does deletes instead of arbitrary code execution
> is just an implementation detail. My opposition is to both the
> implementation and the interface (that we have to carry forever).
>
>> I do think that we
>> need object expiration in rados. This is not a single use case.
>> I also think that using external client for that is a mistake
>> (performance on one hand, but also adding administrative pain).
>
> I can't find object expiration anywhere except in S3, and that
> interface is very clearly about end-user ease of use rather than
> pushing the expiration into the object store. The only use case they
> can come up with is logs stored in objects, and expiration generates
> explicit bucket access logs which makes it look to me like it's run as
> a separate process using bucket scanning. *shrug*
> -Greg

* Re: efficient removal of old objects
  2012-02-01  1:02 ` Tommi Virtanen
       [not found]   ` <CAC-hyiExnN6CxMh=+5tLoZy3T0=Mx6Y3P796rG3L01mZ-=+vOg@mail.gmail.com>
@ 2012-02-02  0:11   ` Mark Kampe
  1 sibling, 0 replies; 12+ messages in thread
From: Mark Kampe @ 2012-02-02  0:11 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On 01/31/12 17:02, Tommi Virtanen wrote:

> To make my point even clearer: point me to another data store that has
> that idiom.

(a) Automatic expiration and deletion is, and has long been, a
     standard feature of archival systems ... and our RADOS
     clouds are much larger than most archival systems.

(b) I have no competent opinions on the short term solution to this
     particular problem, but in the longer term I do not believe
     that garbage collection can or should be entrusted to clients.
     Clients are ephemeral and cannot be depended on to remember,
     a few years (or even hours) from now, that there were some
     files they were supposed to delete.

     IMHO, object store intelligence is not merely about background
     replication and migration, but about "being able to take
     responsibility for the life cycle of the data they hold".
     The amount of data we store will quickly grow beyond the
     ability of external agents to manage it, and lifecycle
     automation will become increasingly critical.

end of thread, other threads:[~2012-02-02  0:11 UTC | newest]

Thread overview: 12+ messages
2012-02-01  0:33 efficient removal of old objects Sage Weil
2012-02-01  0:52 ` Josh Durgin
2012-02-01  1:02 ` Tommi Virtanen
     [not found]   ` <CAC-hyiExnN6CxMh=+5tLoZy3T0=Mx6Y3P796rG3L01mZ-=+vOg@mail.gmail.com>
2012-02-01  8:04     ` Yehuda Sadeh Weinraub
2012-02-02  0:11   ` Mark Kampe
2012-02-01  1:19 ` Josh Durgin
2012-02-01  8:26 ` Yehuda Sadeh Weinraub
2012-02-01 17:39   ` Gregory Farnum
2012-02-01 18:53     ` Yehuda Sadeh Weinraub
2012-02-01 19:35       ` Gregory Farnum
2012-02-01 20:01         ` Yehuda Sadeh Weinraub
2012-02-01 19:43       ` Sage Weil
