* [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-28 21:23 UTC
To: lsf-pc; +Cc: linux-block, dm-devel
I'd like to attend LSF/MM 2016 to participate in any discussions about
redesigning how device-mapper multipath operates. I spend a significant
chunk of time dealing with issues around multipath and I'd like to
be part of any discussion about redesigning it.
In addition, I'd be interested in discussions that deal with how
device-mapper targets are dealing with blk-mq in general. For instance,
it looks like the current dm-multipath blk-mq implementation is running
into performance bottlenecks, and changing how path selection works to
allow for more parallelism is a worthy discussion. But it would also be
worth looking into changing how the dm blk-mq implementation deals with
the mapping between its swqueues and hwqueue(s). Right now all the dm
mapping is done in .queue_rq, instead of in .map_queue, but I'm not
convinced it belongs there. There's also the issue that the bio targets
may scale better on blk-mq devices than the blk-mq targets.
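For reference, the two blk-mq hooks I'm talking about look roughly like
this (going from memory, so the exact signatures may be a bit off):

typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *,
			  const struct blk_mq_queue_data *);
typedef struct blk_mq_hw_ctx *(map_queue_fn)(struct request_queue *,
					      const int);

struct blk_mq_ops {
	/* where dm currently does all of its mapping (dm_mq_queue_rq) */
	queue_rq_fn	*queue_rq;

	/* sw ctx -> hw ctx mapping; dm just uses the stock helper here */
	map_queue_fn	*map_queue;

	/* ... timeout, init_request, etc. ... */
};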
If there happen to be any GFS2 related discussions, I'd be interested in
those as well.
Thanks
-Ben
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Mike Snitzer @ 2016-01-28 22:37 UTC
To: Benjamin Marzinski; +Cc: linux-block, dm-devel, lsf-pc
On Thu, Jan 28 2016 at 4:23pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:
> I'd like to attend LSF/MM 2016 to participate in any discussions about
> redesigning how device-mapper multipath operates. I spend a significant
> chunk of time dealing with issues around multipath and I'd like to
> be part of any discussion about redesigning it.
>
> In addition, I'd be interested in discussions that deal with how
> device-mapper targets are dealing with blk-mq in general. For instance,
> it looks like the current dm-multipath blk-mq implementation is running
> into performance bottlenecks, and changing how path selection works to
> allow for more parallelism is a worthy discussion.
At this point this isn't the sexy topic we'd like it to be -- not too
sure how a 30 minute session on this will go. The devil is really in
the details. Hopefully we can have more details once LSF rolls around
to make an in-person discussion productive.
I've spent the past few days working on this and while there are
certainly various questions it is pretty clear that DM multipath's
m->lock (spinlock) is really _not_ a big bottleneck. It is an obvious
one for sure, but I removed the spinlock entirely (debug only) and then
the 'perf report -g' was completely benign -- no obvious bottlenecks.
Yet DM mpath performance on a really fast null_blk device, ~1850K read
IOPs, was still only ~950K -- as Jens rightly pointed out to me today:
"sure, it's slower but taking a step back, it's about making sure we
have a pretty low overhead, so actual application workloads don't spend
a lot of time in the kernel
~1M IOPS is a _lot_".
But even still, DM mpath is dropping 50% of potential IOPs on the floor.
There must be something inherently limiting in all the extra work done
to: 1) stack blk-mq devices (2 completely different sw -> hw mappings)
2) clone top-level blk-mq requests for submission on the underlying
blk-mq paths.
Anyway, my goal is to have my contribution to this LSF session be all
about what was wrong and how it has been fixed ;)
But given how much harder analyzing this problem has become I'm less
encouraged I'll be able to do so.
> But it would also be worth looking into changing how the dm blk-mq
> implementation deals with the mapping between its swqueues and
> hwqueue(s). Right now all the dm mapping is done in .queue_rq, instead
> of in .map_queue, but I'm not convinced it belongs there.
blk-mq's .queue_rq hook is the logical place to do the mpath mapping, as
it deals with getting a request from the underlying paths.
blk-mq's .map_queue is all about mapping sw to hw queues. It is very
blk-mq specific and isn't something DM has a role in -- cannot yet see
why it'd need to.
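For reference, the stock mapping dm-mq points .map_queue at today is
about as simple as it gets -- roughly this, quoting from memory:

/* block/blk-mq.c: the default sw -> hw queue mapping */
struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, const int cpu)
{
	return q->queue_hw_ctx[q->mq_map[cpu]];
}

With nr_hw_queues == 1 every entry in mq_map is 0, so all the per-cpu sw
queues funnel into the single hctx.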
> There's also the issue that the bio targets may scale better on blk-mq
> devices than the blk-mq targets.
Why is that surprising? request-based DM (and block core) does quite a
bit more work.
bio-based DM targets take a ~20% IOPs hit, whereas blk-mq request-based
DM takes a ~50% hit. I'd _love_ for request-based DM to get to only a
~20% hit. (And for the bio-based 20% hit to be reduced further).
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-29 1:33 UTC
To: Mike Snitzer; +Cc: linux-block, dm-devel, lsf-pc
On Thu, Jan 28, 2016 at 05:37:33PM -0500, Mike Snitzer wrote:
> On Thu, Jan 28 2016 at 4:23pm -0500,
> Benjamin Marzinski <bmarzins@redhat.com> wrote:
>
> > I'd like to attend LSF/MM 2016 to participate in any discussions about
> > redesigning how device-mapper multipath operates. I spend a significant
> > chunk of time dealing with issues around multipath and I'd like to
> > be part of any discussion about redesigning it.
> >
> > In addition, I'd be interested in discussions that deal with how
> > device-mapper targets are dealing with blk-mq in general. For instance,
> > it looks like the current dm-multipath blk-mq implementation is running
> > into performance bottlenecks, and changing how path selection works to
> > allow for more parallelism is a worthy discussion.
>
> At this point this isn't the sexy topic we'd like it to be -- not too
> sure how a 30 minute session on this will go. The devil is really in
> the details. Hopefully we can have more details once LSF rolls around
> to make an in-person discussion productive.
>
> I've spent the past few days working on this and while there are
> certainly various questions it is pretty clear that DM multipath's
> m->lock (spinlock) is really _not_ a big bottleneck. It is an obvious
> one for sure, but I removed the spinlock entirely (debug only) and then
> the 'perf report -g' was completely benign -- no obvious bottlenecks.
> Yet DM mpath performance on a really fast null_blk device, ~1850K read
> IOPs, was still only ~950K -- as Jens rightly pointed out to me today:
>
> "sure, it's slower but taking a step back, it's about making sure we
> have a pretty low overhead, so actual application workloads don't spend
> a lot of time in the kernel
>
> ~1M IOPS is a _lot_".
>
> But even still, DM mpath is dropping 50% of potential IOPs on the floor.
> There must be something inherently limiting in all the extra work done
> to: 1) stack blk-mq devices (2 completely different sw -> hw mappings)
> 2) clone top-level blk-mq requests for submission on the underlying
> blk-mq paths.
>
> Anyway, my goal is to have my contribution to this LSF session be all
> about what was wrong and how it has been fixed ;)
>
> But given how much harder analyzing this problem has become I'm less
> encouraged I'll be able to do so.
>
> > But it would also be worth looking into changing how the dm blk-mq
> > implementation deals with the mapping between its swqueues and
> > hwqueue(s). Right now all the dm mapping is done in .queue_rq, instead
> > of in .map_queue, but I'm not convinced it belongs there.
>
> blk-mq's .queue_rq hook is the logical place to do the mpath mapping, as
> it deals with getting a request from the underlying paths.
>
> blk-mq's .map_queue is all about mapping sw to hw queues. It is very
> blk-mq specific and isn't something DM has a role in -- cannot yet see
> why it'd need to.
At the moment, we only have one hwqueue. But we could have one hwqueue
per path. Then queue_rq would just be in charge of handing the request
down to the underlying device. In that setup, instead of using a default
mapping of all swqueues to one hwqueue in .map_queue, we would be
mapping to the hardware queue for the path. I'd have to look through
the blk-mq code more to know if one of these methods has an obvious
advantage, but it seems like this way, if different cpus were using
different paths (with the per-cpu load-balancing), you wouldn't
constantly be accessing the same hwqueue from different cpus. Although I
suppose you may do better just by leaving multipath_map where it is now,
and just adjusting the number of hardware queues. Speaking of which,
have you tried fiddling around with that in your tests?
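Very roughly, the sort of thing I'm picturing (illustration only, none
of this code exists -- pick_path_for_cpu() is a made-up stand-in for
whatever the path selector would export):

/* one hwqueue per path: steer each cpu's sw queue to the hctx of the
 * path that cpu is currently assigned to */
static struct blk_mq_hw_ctx *dm_mq_map_queue(struct request_queue *q,
					     const int cpu)
{
	struct mapped_device *md = q->queuedata;

	return q->queue_hw_ctx[pick_path_for_cpu(md, cpu)];
}

Then .queue_rq would only have to clone the request and send it down the
path its hctx already represents, instead of doing path selection under
m->lock for every request.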
> > There's also the issue that the bio targets may scale better on blk-mq
> > devices than the blk-mq targets.
>
> Why is that surprising? request-based DM (and block core) does quite a
> bit more work.
>
> bio-based DM targets take a ~20% IOPs hit, whereas blk-mq request-based
> DM takes a ~50% hit. I'd _love_ for request-based DM to get to only a
> ~20% hit. (And for the bio-based 20% hit to be reduced further).
Right. But like I said in an earlier email, if bio-based mpath would
give us better performance on this class of devices, then all the blk-mq
performance work helps both multipath and the other targets. I realize
that bio-based multipath had issues other than simply IO performance
that caused us to switch, like a lack of good error information. But if
the performance gap between request-based and bio-based dm persists for
blk-mq devices (even assuming both improve), then we should at least
revisit the issues with bio-based multipath to see which set of problems
looks easiest to tackle.
-Ben
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-29 2:11 UTC
To: Mike Snitzer; +Cc: linux-block, dm-devel, lsf-pc
On Thu, Jan 28, 2016 at 07:33:16PM -0600, Benjamin Marzinski wrote:
> On Thu, Jan 28, 2016 at 05:37:33PM -0500, Mike Snitzer wrote:
> > On Thu, Jan 28 2016 at 4:23pm -0500,
> > Benjamin Marzinski <bmarzins@redhat.com> wrote:
> > blk-mq's .queue_rq hook is the logical place to do the mpath mapping, as
> > it deals with getting a request from the underlying paths.
> >
> > blk-mq's .map_queue is all about mapping sw to hw queues. It is very
> > blk-mq specific and isn't something DM has a role in -- cannot yet see
> > why it'd need to.
>
> At the moment, we only have one hwqueue. But we could have one hwqueue
> per path. Then queue_rq would just be in charge of handing the request
> down to the underlying device. In that setup, instead of using a default
> mapping of all swqueues to one hwqueue in .map_queue, we would be
> mapping to the hardware queue for the path. I'd have to look through
> the blk-mq code more to know if one of these methods has an obvious
> advantage, but it seems like this way, if different cpus were using
> different paths (with the per-cpu load-balancing), you wouldn't
> constantly be accessing the same hwqueue from different cpus. Although I
> suppose you may do better just by leaving multipath_map where it is now,
> and just adjusting the number of hardware queues. Speaking of which,
> have you tried fiddling around with that in your tests?
>
O.k., a quick look shows that map_queue gets called so often that any sort
of dynamic mapping there would be a pain. But constantly having all the
cpus accessing one hwqueue seems like it could be part of the
performance issue. So, it would definitely be worth playing around with
that.
-Ben
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Mike Snitzer @ 2016-01-29 2:48 UTC
To: Benjamin Marzinski; +Cc: linux-block, dm-devel, lsf-pc
On Thu, Jan 28 2016 at 9:11pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:
> On Thu, Jan 28, 2016 at 07:33:16PM -0600, Benjamin Marzinski wrote:
> > On Thu, Jan 28, 2016 at 05:37:33PM -0500, Mike Snitzer wrote:
> > > On Thu, Jan 28 2016 at 4:23pm -0500,
> > > Benjamin Marzinski <bmarzins@redhat.com> wrote:
>
> > > blk-mq's .queue_rq hook is the logical place to do the mpath mapping, as
> > > it deals with getting a request from the underlying paths.
> > >
> > > blk-mq's .map_queue is all about mapping sw to hw queues. It is very
> > > blk-mq specific and isn't something DM has a role in -- cannot yet see
> > > why it'd need to.
> >
> > At the moment, we only have one hwqueue. But we could have one hwqueue
> > per path. Then queue_rq would just be in charge of handing the request
> > down to the underlying device. In that setup, instead of using a default
> > mapping of all swqueues to one hwqueue in .map_queue, we would be
> > mapping to the hardware queue for the path. I'd have to look through
> > the blk-mq code more to know if one of these methods has an obvious
> > advantage, but it seems like this way, if different cpus were using
> > different paths (with the per-cpu load-balancing), you wouldn't
> > constantly be accessing the same hwqueue from different cpus. Although I
> > suppose you may do better just by leaving multipath_map where it is now,
> > and just adjusting the number of hardware queues. Speaking of which,
> > have you tried fiddling around with that in your tests?
> >
>
> O.k., a quick look shows that map_queue gets called so often that any sort
> of dynamic mapping there would be a pain. But constantly having all the
> cpus accessing one hwqueue seems like it could be part of the
> performance issue. So, it would definitely be worth playing around with
> that.
Yeah, I have a patch that makes both hw_queues and queue_depth tunable:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=99ebcaf36d9d1fa3acec98492c36664d57ba8fbd
Increasing nr_hw_queues doesn't help (in fact it hurts: going from 1 to
2 results in a drop from ~970K to ~945K IOPs, and with 4 I get ~930K).
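What that patch boils down to is just feeding module params into the
dm-mq tag_set setup in dm.c, roughly the following (parameter names from
memory, so they may not match the commit exactly):

md->tag_set.ops = &dm_mq_ops;
md->tag_set.queue_depth = dm_mq_queue_depth;	/* was BLKDEV_MAX_RQ */
md->tag_set.nr_hw_queues = dm_mq_nr_hw_queues;	/* was hardcoded to 1 */
md->tag_set.numa_node = NUMA_NO_NODE;
md->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
md->tag_set.cmd_size = sizeof(struct dm_rq_target_io);
md->tag_set.driver_data = md;

err = blk_mq_alloc_tag_set(&md->tag_set);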
Will need to revisit the blk-mq code in general to appreciate how the
sw -> hw mapping will scale, etc.
And verify assumptions like: the top-level dm-mpath rq->mq_ctx->cpu
matches the underlying path's clone->mq_ctx->cpu.
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Hannes Reinecke @ 2016-01-29 6:59 UTC
To: Benjamin Marzinski, lsf-pc; +Cc: linux-block, dm-devel
On 01/28/2016 10:23 PM, Benjamin Marzinski wrote:
> I'd like to attend LSF/MM 2016 to participate in any discussions about
> redesigning how device-mapper multipath operates. I spend a significant
> chunk of time dealing with issues around multipath and I'd like to
> be part of any discussion about redesigning it.
>
And while you're there, we should be discussing systemd/udev/dracut
integration. I have sunk far too many man-hours into this,
and it's still nowhere near mainline.
And I guess the same goes for any other distro :-)
That doesn't warrant a full LSF session, though; the important bit
is the discussion itself :-)
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@suse.de +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-29 15:34 UTC
To: Hannes Reinecke; +Cc: linux-block, dm-devel, lsf-pc
On Fri, Jan 29, 2016 at 07:59:09AM +0100, Hannes Reinecke wrote:
> On 01/28/2016 10:23 PM, Benjamin Marzinski wrote:
> > I'd like to attend LSF/MM 2016 to participate in any discussions about
> > redesigning how device-mapper multipath operates. I spend a significant
> > chunk of time dealing with issues around multipath and I'd like to
> > be part of any discussion about redesigning it.
> >
> And while you're there, we should be discussing systemd/udev/dracut
> integration. I have sunk far too many man-hours into this,
> and it's still nowhere near mainline.
> And I guess the same goes for any other distro :-)
Sure. It would be great to avoid so much duplicated work and get some
more consistency in this area.
-Ben
>
> That doesn't warrant a full LSF session, though; the important bit
> is the discussion itself :-)
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Teamlead Storage & Networking
> hare@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)