[PATCH] PG: Do not discard op data too early

* [PATCH] PG: Do not discard op data too early
@ 2012-09-27 21:56 Jim Schutt
  2012-09-27 22:07 ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-09-27 21:56 UTC (permalink / raw)
  To: ceph-devel; +Cc: Jim Schutt

Under a sustained cephfs write load where the offered load is higher
than the storage cluster write throughput, a backlog of replication ops
that arrive via the cluster messenger builds up.  The client message
policy throttler, which should be limiting the total write workload
accepted by the storage cluster, is unable to prevent it, for any
value of osd_client_message_size_cap, under such an overload condition.

The root cause is that op data is released too early, in op_applied().

If instead the op data is released at op deletion, then the limit
imposed by the client policy throttler applies over the entire
lifetime of the op, including commits of replication ops.  That
makes the policy throttler an effective means for an OSD to
protect itself from a sustained high offered load, because it can
effectively limit the total, cluster-wide resources needed to process
in-progress write ops.

Signed-off-by: Jim Schutt <jaschut@sandia.gov>
---
 src/osd/ReplicatedPG.cc |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index a64abda..80bec2a 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -3490,10 +3490,6 @@ void ReplicatedPG::op_applied(RepGather *repop)
   dout(10) << "op_applied " << *repop << dendl;
   if (repop->ctx->op)
     repop->ctx->op->mark_event("op_applied");
-
-  // discard my reference to the buffer
-  if (repop->ctx->op)
-    repop->ctx->op->request->clear_data();

   repop->applying = false;
   repop->applied = true;
-- 
1.7.8.2
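
A toy model of the accounting argument above, as a sketch only.  It
assumes that dropping an op's data also returns its bytes to the budget
that admitted it.  ByteBudget, ToyOp and admit_until_refused are
hypothetical names for illustration, not Ceph code.  With early release
the cap never refuses a new op while earlier ones are still in flight;
holding the reservation until the op is destroyed makes the cap bound
the in-flight total.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy byte budget: a stand-in for a message-size throttler, not Ceph's Throttle.
class ByteBudget {
public:
  explicit ByteBudget(std::size_t max) : max_(max) {}

  // Reserve bytes if they fit under the cap; report whether that worked.
  bool try_get(std::size_t bytes) {
    if (used_ + bytes > max_) return false;
    used_ += bytes;
    return true;
  }

  // Return previously reserved bytes to the budget.
  void put(std::size_t bytes) {
    assert(bytes <= used_);
    used_ -= bytes;
  }

  std::size_t used() const { return used_; }

private:
  std::size_t max_;
  std::size_t used_ = 0;
};

// Hypothetical op that owns a reservation for its payload.
class ToyOp {
public:
  ToyOp(ByteBudget& budget, std::size_t bytes) : budget_(budget), held_(bytes) {}
  ToyOp(const ToyOp&) = delete;
  ToyOp& operator=(const ToyOp&) = delete;
  ~ToyOp() { release(); }  // whatever is still held goes back at deletion

  // Old behaviour in this model: give the bytes back as soon as the op is applied.
  void applied_early_release() { release(); }

  // Patched behaviour in this model: applying the op leaves the reservation alone.
  void applied_keep_reservation() {}

private:
  void release() {
    if (held_ != 0) {
      budget_.put(held_);
      held_ = 0;
    }
  }

  ByteBudget& budget_;
  std::size_t held_;
};

// Admit ops of op_bytes each until the budget refuses one.  Every admitted op
// is applied but stays alive, as if its commit were still pending.  Returns how
// many ops got in; attempts is only a safety bound on the loop.
std::size_t admit_until_refused(std::size_t cap, std::size_t op_bytes,
                                bool early_release, std::size_t attempts) {
  ByteBudget budget(cap);
  std::vector<std::unique_ptr<ToyOp>> in_flight;
  for (std::size_t i = 0; i < attempts; ++i) {
    if (!budget.try_get(op_bytes)) break;
    in_flight.push_back(std::unique_ptr<ToyOp>(new ToyOp(budget, op_bytes)));
    if (early_release)
      in_flight.back()->applied_early_release();
    else
      in_flight.back()->applied_keep_reservation();
  }
  return in_flight.size();
}
```

With a cap of 10 bytes and 4-byte ops, admit_until_refused(10, 4, true, 100)
admits all 100 attempts, while admit_until_refused(10, 4, false, 100) stops
after 2.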

* Re: [PATCH] PG: Do not discard op data too early
  2012-09-27 21:56 [PATCH] PG: Do not discard op data too early Jim Schutt
@ 2012-09-27 22:07 ` Gregory Farnum
  2012-09-27 22:23   ` Jim Schutt
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2012-09-27 22:07 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

Have you tested that this does what you want? If it does, I think
we'll want to implement this so that we actually release the memory,
but continue accounting it.

* Re: [PATCH] PG: Do not discard op data too early
  2012-09-27 22:07 ` Gregory Farnum
@ 2012-09-27 22:23   ` Jim Schutt
  2012-09-27 22:27     ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-09-27 22:23 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 09/27/2012 04:07 PM, Gregory Farnum wrote:
> Have you tested that this does what you want? If it does, I think
> we'll want to implement this so that we actually release the memory,
> but continue accounting it.

Yes.  I have diagnostic patches where I add an "advisory" option
to Throttle, and apply it in advisory mode to the cluster throttler.
In advisory mode Throttle counts bytes but never throttles.
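
One way such an advisory mode could look, as a minimal sketch.
AdvisoryThrottle and its members are hypothetical names, not Ceph's
Throttle interface, and the diagnostic patches themselves are not shown
in this thread.  The counter always records the bytes; only the
enforcing variant ever refuses a request.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical byte counter with an advisory switch; not Ceph's Throttle class.
class AdvisoryThrottle {
public:
  AdvisoryThrottle(std::int64_t max, bool advisory)
    : max_(max), advisory_(advisory) {}

  // Non-blocking request.  In advisory mode the bytes are always granted,
  // even past max_, so the counter shows what an enforcing throttle
  // would have had to hold back.
  bool get_or_fail(std::int64_t bytes) {
    assert(bytes >= 0);
    if (!advisory_ && current_ + bytes > max_) return false;
    current_ += bytes;
    if (current_ > peak_) peak_ = current_;
    return true;
  }

  // Return bytes once the op that took them is finished.
  void put(std::int64_t bytes) {
    assert(bytes >= 0 && bytes <= current_);
    current_ -= bytes;
  }

  std::int64_t current() const { return current_; }
  std::int64_t peak() const { return peak_; }  // high-water mark seen so far

private:
  std::int64_t max_;
  bool advisory_;
  std::int64_t current_ = 0;
  std::int64_t peak_ = 0;
};
```

Reading peak() after a run gives the kind of "bytes tied up" figure
quoted below, without the counter ever stalling an op.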

When I run all the clients I can muster (222) against a relatively
small number of OSDs (48-96), with osd_client_message_size_cap set
to 10,000,000 bytes I see spikes of > 100,000,000 bytes tied up
in ops that came through the cluster messenger, and I see long
wait times (> 60 secs) on ops coming through the client throttler.

With this patch applied, I can raise osd_client_message_size_cap
to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes
tied up in ops that came through the cluster messenger.  Wait times
for ops coming through the client policy throttler are lower,
overall daemon memory usage is lower, but throughput is the same.

Overall, with this patch applied, my storage cluster "feels" much
less brittle when overloaded.

-- Jim

* Re: [PATCH] PG: Do not discard op data too early
  2012-09-27 22:23   ` Jim Schutt
@ 2012-09-27 22:27     ` Gregory Farnum
  2012-09-27 22:36       ` Jim Schutt
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2012-09-27 22:27 UTC (permalink / raw)
  To: Jim Schutt; +Cc: ceph-devel

On Thu, Sep 27, 2012 at 3:23 PM, Jim Schutt <jaschut@sandia.gov> wrote:
> On 09/27/2012 04:07 PM, Gregory Farnum wrote:
>>
>> Have you tested that this does what you want? If it does, I think
>> we'll want to implement this so that we actually release the memory,
>> but continue accounting it.
>
>
> Yes.  I have diagnostic patches where I add an "advisory" option
> to Throttle, and apply it in advisory mode to the cluster throttler.
> In advisory mode Throttle counts bytes but never throttles.

Can't you also do this if you just set up a throttler with a limit of 0? :)

>
> When I run all the clients I can muster (222) against a relatively
> small number of OSDs (48-96), with osd_client_message_size_cap set
> to 10,000,000 bytes I see spikes of > 100,000,000 bytes tied up
> in ops that came through the cluster messenger, and I see long
> wait times (> 60 secs) on ops coming through the client throttler.
>
> With this patch applied, I can raise osd_client_message_size_cap
> to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes
> tied up in ops that came through the cluster messenger.  Wait times
> for ops coming through the client policy throttler are lower,
> overall daemon memory usage is lower, but throughput is the same.
>
> Overall, with this patch applied, my storage cluster "feels" much
> less brittle when overloaded.

Okay, cool. Are you interested in reducing the memory usage a little
more by deallocating the memory separately from accounting it?
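
The idea of freeing the buffer while still accounting for it could be
sketched roughly as follows.  This is an illustration under assumed
names (Accounting, SplitOp), not a proposed Ceph change: the payload is
dropped once applied, but the bytes stay counted until the op is
destroyed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical running total of accounted bytes; not Ceph code.
struct Accounting {
  std::size_t accounted = 0;
};

// Hypothetical op whose payload lifetime is decoupled from its accounting.
class SplitOp {
public:
  SplitOp(Accounting& acct, std::size_t bytes)
    : acct_(acct), accounted_(bytes), payload_(bytes, '\0') {
    acct_.accounted += accounted_;
  }
  SplitOp(const SplitOp&) = delete;
  SplitOp& operator=(const SplitOp&) = delete;

  // The bytes stay accounted until the op itself goes away.
  ~SplitOp() { acct_.accounted -= accounted_; }

  // Free the payload once it has been applied; the accounting is untouched.
  void drop_payload() {
    std::vector<char>().swap(payload_);  // swap with an empty vector to give the storage back
  }

  std::size_t payload_size() const { return payload_.size(); }

private:
  Accounting& acct_;
  std::size_t accounted_;
  std::vector<char> payload_;
};
```

The cost is a second piece of state per op (the accounted size) that
has to outlive the buffer it describes.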

* Re: [PATCH] PG: Do not discard op data too early
  2012-09-27 22:27     ` Gregory Farnum
@ 2012-09-27 22:36       ` Jim Schutt
  2012-10-26 20:52         ` Gregory Farnum
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-09-27 22:36 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

On 09/27/2012 04:27 PM, Gregory Farnum wrote:
> On Thu, Sep 27, 2012 at 3:23 PM, Jim Schutt <jaschut@sandia.gov> wrote:
>> On 09/27/2012 04:07 PM, Gregory Farnum wrote:
>>>
>>> Have you tested that this does what you want? If it does, I think
>>> we'll want to implement this so that we actually release the memory,
>>> but continue accounting it.
>>
>>
>> Yes.  I have diagnostic patches where I add an "advisory" option
>> to Throttle, and apply it in advisory mode to the cluster throttler.
>> In advisory mode Throttle counts bytes but never throttles.
>
> Can't you also do this if you just set up a throttler with a limit of 0? :)

Hmmm, I expect so.  I guess I just didn't think of doing it that way....

>
>>
>> When I run all the clients I can muster (222) against a relatively
>> small number of OSDs (48-96), with osd_client_message_size_cap set
>> to 10,000,000 bytes I see spikes of > 100,000,000 bytes tied up
>> in ops that came through the cluster messenger, and I see long
>> wait times (> 60 secs) on ops coming through the client throttler.
>>
>> With this patch applied, I can raise osd_client_message_size_cap
>> to 40,000,000 bytes, but I rarely see more than 80,000,000 bytes
>> tied up in ops that came through the cluster messenger.  Wait times
>> for ops coming through the client policy throttler are lower,
>> overall daemon memory usage is lower, but throughput is the same.
>>
>> Overall, with this patch applied, my storage cluster "feels" much
>> less brittle when overloaded.
>
> Okay, cool. Are you interested in reducing the memory usage a little
> more by deallocating the memory separately from accounting it?
>
>

My testing doesn't indicate a need -- even keeping the memory
around until the op is done, my daemons use less memory overall
to get the same throughput.  So, unless some other load condition
indicates a need, I'd counsel simplicity.

-- Jim




* Re: [PATCH] PG: Do not discard op data too early
  2012-09-27 22:36       ` Jim Schutt
@ 2012-10-26 20:52         ` Gregory Farnum
  2012-10-26 21:07           ` Jim Schutt
  0 siblings, 1 reply; 8+ messages in thread
From: Gregory Farnum @ 2012-10-26 20:52 UTC (permalink / raw)
  To: Jim Schutt, Sage Weil, Samuel Just; +Cc: ceph-devel

Wanted to touch base on this patch again. If Sage and Sam agree that
we don't want to play any tricks with memory accounting, we should
pull this patch in. I'm pretty sure we want it for Bobtail!
-Greg

* Re: [PATCH] PG: Do not discard op data too early
  2012-10-26 20:52         ` Gregory Farnum
@ 2012-10-26 21:07           ` Jim Schutt
  2012-10-26 21:30             ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Jim Schutt @ 2012-10-26 21:07 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, Samuel Just, ceph-devel

On 10/26/2012 02:52 PM, Gregory Farnum wrote:
> Wanted to touch base on this patch again. If Sage and Sam agree that
> we don't want to play any tricks with memory accounting, we should
> pull this patch in. I'm pretty sure we want it for Bobtail!

I've been running with it since I posted it.
I think it would be great if you could pick it up!

-- Jim

* Re: [PATCH] PG: Do not discard op data too early
  2012-10-26 21:07           ` Jim Schutt
@ 2012-10-26 21:30             ` Sage Weil
  0 siblings, 0 replies; 8+ messages in thread
From: Sage Weil @ 2012-10-26 21:30 UTC (permalink / raw)
  To: Jim Schutt; +Cc: Gregory Farnum, Samuel Just, ceph-devel

On Fri, 26 Oct 2012, Jim Schutt wrote:
> On 10/26/2012 02:52 PM, Gregory Farnum wrote:
> > Wanted to touch base on this patch again. If Sage and Sam agree that
> > we don't want to play any tricks with memory accounting, we should
> > pull this patch in. I'm pretty sure we want it for Bobtail!
> 
> I've been running with it since I posted it.
> I think it would be great if you could pick it up!

Applied, 65ed99be85f285ac501a14224b185364c79073a9.  Sorry, I could have 
sworn I applied this... whoops!

sage

