* Possible bug in op path?
@ 2020-05-20 7:40 Robert LeBlanc
2020-05-20 7:50 ` Fwd: " Robert LeBlanc
2020-05-20 8:37 ` [ceph-users] " Dan van der Ster
0 siblings, 2 replies; 4+ messages in thread
From: Robert LeBlanc @ 2020-05-20 7:40 UTC (permalink / raw)
To: ceph-users, ceph-devel
We upgraded our Jewel cluster to Nautilus a few months ago and I've noticed
that op behavior has changed. This is an HDD cluster (NVMe journals and an
NVMe CephFS metadata pool) with about 800 OSDs. On Jewel, running WPQ with
the high cut-off, it was rock solid. When recoveries were going on they
barely dented the client ops, and when client ops on the cluster went down
the backfills would run as fast as the cluster could go. I could have
max_backfills set to 10 and the cluster performed admirably.
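For reference, the relevant Jewel-era settings were roughly the following
(a sketch from memory; these are the standard option names, and
osd_max_backfills is the value we tuned):

    [osd]
    osd_op_queue = wpq
    osd_op_queue_cut_off = high
    osd_max_backfills = 10
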
After upgrading to Nautilus the cluster struggles with any kind of
recovery, and if there is any significant client write load the cluster can
get into a death spiral. Heavy client write bandwidth alone (3-4 GB/s) can
trigger heartbeat check warnings, blocked I/O, and even unresponsive OSDs.
As the person who originally wrote the WPQ code, I know that it was fair
and proportional to the op priority, and in Jewel it worked. It is not
working in Nautilus. I've tweaked a lot of things trying to troubleshoot the
issue, and setting the recovery priority to 1 or even zero barely makes any
difference. My best estimate is that the op priority is getting lost before
it reaches the WPQ scheduler, so ops are not being prioritized and
dispatched correctly. It's almost as if all ops are being treated the same
and there is no priority at all.
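To be concrete, the runtime tweaks I mean look something like this (a
sketch; by "recovery priority" I mean osd_recovery_op_priority, and the
exact values I tried varied):

    # reduce recovery/backfill pressure on the running OSDs
    ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
    ceph tell osd.* injectargs '--osd_max_backfills 1'
    ceph tell osd.* injectargs '--osd_recovery_max_active 1'
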
Unfortunately, I do not have the time to set up the dev/testing environment
to track this down, and we will be moving away from Ceph. But I really like
Ceph and want to see it succeed. I strongly suggest that someone look into
this, because I think it will resolve a lot of problems people have had on
the mailing list. I'm not sure whether a bug was introduced with the other
queues that touches more of the op path, or whether something in the op path
restructuring changed how things work (I know that was being discussed
around the time that Jewel was released). But my guess is that it is
somewhere between the op being created and being received into the queue.
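If someone with a test cluster wants to check this, the OSD admin socket
should show the priorities ops actually carry when they are queued; I
believe dump_op_pq_state is still present in Nautilus, but treat the
command name and its output format as from memory:

    # dump the op priority queue state of one OSD (command name from memory)
    ceph daemon osd.0 dump_op_pq_state
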
I really hope that this helps in the search for this regression. I spent a
lot of time studying the issue to come up with WPQ and saw it work great
when I switched this cluster from PRIO to WPQ. I've also spent countless
hours studying how it's changed in Nautilus.
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io
* Fwd: Possible bug in op path?
2020-05-20 7:40 Possible bug in op path? Robert LeBlanc
@ 2020-05-20 7:50 ` Robert LeBlanc
2020-05-20 8:37 ` [ceph-users] " Dan van der Ster
1 sibling, 0 replies; 4+ messages in thread
From: Robert LeBlanc @ 2020-05-20 7:50 UTC (permalink / raw)
To: ceph-devel
De-HTMLified
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
---------- Forwarded message ---------
From: Robert LeBlanc <robert@leblancnet.us>
Date: Wed, May 20, 2020 at 12:40 AM
Subject: Possible bug in op path?
To: ceph-users <ceph-users@ceph.io>, ceph-devel <ceph-devel@vger.kernel.org>
* Re: [ceph-users] Possible bug in op path?
2020-05-20 7:40 Possible bug in op path? Robert LeBlanc
2020-05-20 7:50 ` Fwd: " Robert LeBlanc
@ 2020-05-20 8:37 ` Dan van der Ster
2020-05-20 16:56 ` Robert LeBlanc
1 sibling, 1 reply; 4+ messages in thread
From: Dan van der Ster @ 2020-05-20 8:37 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: ceph-users, ceph-devel
Hi Robert,
Since you didn't mention -- are you using osd_op_queue_cut_off low or
high? I know you are usually advocating high, but the default is still
low and most users don't change this setting.
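In case it helps, checking and changing it looks something like this
(note that, as far as I know, the osd_op_queue* options are only read at
OSD startup, so a restart is needed for the change to take effect):

    ceph config get osd osd_op_queue_cut_off
    ceph config set osd osd_op_queue_cut_off high
    # then restart the OSDs so the new cut-off is picked up
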
Cheers, Dan
* Re: [ceph-users] Possible bug in op path?
2020-05-20 8:37 ` [ceph-users] " Dan van der Ster
@ 2020-05-20 16:56 ` Robert LeBlanc
0 siblings, 0 replies; 4+ messages in thread
From: Robert LeBlanc @ 2020-05-20 16:56 UTC (permalink / raw)
To: Dan van der Ster; +Cc: ceph-users, ceph-devel
We are using high, and the people on the list who have also changed it have
not seen the improvements that I would expect.
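For anyone checking their own cluster, the value the daemons are actually
running with can be confirmed on the admin socket, e.g.:

    ceph daemon osd.0 config get osd_op_queue
    ceph daemon osd.0 config get osd_op_queue_cut_off
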
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1