* [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-28 21:23 UTC
To: lsf-pc; +Cc: linux-block, dm-devel

I'd like to attend LSF/MM 2016 to participate in any discussions about
redesigning how device-mapper multipath operates. I spend a significant
chunk of my time dealing with multipath issues, and I'd like to be part
of any discussion about redesigning it.

In addition, I'd be interested in discussions about how device-mapper
targets deal with blk-mq in general. For instance, the current
dm-multipath blk-mq implementation appears to be running into
performance bottlenecks, and changing path selection into something
that allows for more parallelism is a discussion worth having.

It would also be worth looking at how the dm blk-mq implementation
handles the mapping between its software queues and hardware queue(s).
Right now all the dm mapping is done in .queue_rq rather than in
.map_queue, and I'm not convinced it belongs there.

There's also the issue that the bio-based targets may scale better on
blk-mq devices than the request-based blk-mq targets.

If there happen to be any GFS2-related discussions, I'd be interested in
those as well.

Thanks
-Ben
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Mike Snitzer @ 2016-01-28 22:37 UTC
To: Benjamin Marzinski; +Cc: linux-block, dm-devel, lsf-pc

On Thu, Jan 28 2016 at 4:23pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> I'd like to attend LSF/MM 2016 to participate in any discussions about
> redesigning how device-mapper multipath operates. I spend a significant
> chunk of my time dealing with multipath issues, and I'd like to be part
> of any discussion about redesigning it.
>
> In addition, I'd be interested in discussions about how device-mapper
> targets deal with blk-mq in general. For instance, the current
> dm-multipath blk-mq implementation appears to be running into
> performance bottlenecks, and changing path selection into something
> that allows for more parallelism is a discussion worth having.

At this point this isn't the sexy topic we'd like it to be -- not too
sure how a 30-minute session on it will go. The devil is really in the
details. Hopefully we'll have more of those details by the time LSF
rolls around, so an in-person discussion can be productive.

I've spent the past few days working on this, and while there are
certainly open questions, it is pretty clear that DM multipath's m->lock
(spinlock) is really _not_ a big bottleneck. It is an obvious suspect,
for sure, but I removed the spinlock entirely (debug only) and the
resulting 'perf report -g' was completely benign -- no obvious
bottlenecks. Yet DM mpath performance on a really fast null_blk device
(~1850K read IOPs raw) was still only ~950K. As Jens rightly pointed out
to me today:

"sure, it's slower but taking a step back, it's about making sure we
have a pretty low overhead, so actual application workloads don't spend
a lot of time in the kernel

~1M IOPS is a _lot_".

But even so, DM mpath is dropping 50% of the potential IOPs on the
floor. There must be something inherently limiting in all the extra work
done to:
1) stack blk-mq devices (two completely different sw -> hw mappings), and
2) clone top-level blk-mq requests for submission on the underlying
   blk-mq paths.

Anyway, my goal is for my contribution to this LSF session to be all
about what was wrong and how it has been fixed ;)

But given how much harder analyzing this problem has become, I'm less
encouraged that I'll be able to do so.

> It would also be worth looking at how the dm blk-mq implementation
> handles the mapping between its software queues and hardware queue(s).
> Right now all the dm mapping is done in .queue_rq rather than in
> .map_queue, and I'm not convinced it belongs there.

blk-mq's .queue_rq hook is the logical place to do the mpath mapping, as
that is where we get a request from the underlying paths.

blk-mq's .map_queue is all about mapping sw queues to hw queues. It is
very blk-mq specific and isn't something DM has a role in -- I cannot
yet see why it would need to.

> There's also the issue that the bio-based targets may scale better on
> blk-mq devices than the request-based blk-mq targets.

Why is that surprising? Request-based DM (and the block core) does quite
a bit more work. bio-based DM targets take a ~20% IOPs hit, whereas
blk-mq request-based DM takes a ~50% hit. I'd _love_ for request-based
DM to get down to only a ~20% hit.
(And for the bio-based 20% hit to be reduced further.)
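For reference, a minimal sketch of the two blk-mq hooks under discussion,
modeled on the v4.4-era API. The example_* names and function bodies are
placeholders for illustration only; this is not the actual
drivers/md/dm.c code.

#include <linux/blk-mq.h>

/* .queue_rq: invoked with an already-allocated top-level request; this is
 * where dm-mpath does its path selection and sends a clone of the request
 * down to the chosen underlying device (placeholder body). */
static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
			    const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	blk_mq_start_request(rq);
	/* ... pick a path and submit a clone of rq to it ... */
	return BLK_MQ_RQ_QUEUE_OK;
}

/* .map_queue: purely the per-cpu sw ctx -> hw ctx mapping; in the v4.4 API
 * nearly every driver simply pointed this at blk_mq_map_queue(). */
static struct blk_mq_hw_ctx *example_map_queue(struct request_queue *q,
					       const int cpu)
{
	return blk_mq_map_queue(q, cpu);
}

static struct blk_mq_ops example_mq_ops = {
	.queue_rq	= example_queue_rq,
	.map_queue	= example_map_queue,
};

The split is the point of contention above: path selection happens per
request in .queue_rq, while .map_queue only decides which hw context a
given cpu's I/O funnels through.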
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-29 1:33 UTC
To: Mike Snitzer; +Cc: linux-block, dm-devel, lsf-pc

On Thu, Jan 28, 2016 at 05:37:33PM -0500, Mike Snitzer wrote:
> On Thu, Jan 28 2016 at 4:23pm -0500,
> Benjamin Marzinski <bmarzins@redhat.com> wrote:
>
> [...]
>
> > It would also be worth looking at how the dm blk-mq implementation
> > handles the mapping between its software queues and hardware
> > queue(s). Right now all the dm mapping is done in .queue_rq rather
> > than in .map_queue, and I'm not convinced it belongs there.
>
> blk-mq's .queue_rq hook is the logical place to do the mpath mapping,
> as that is where we get a request from the underlying paths.
>
> blk-mq's .map_queue is all about mapping sw queues to hw queues. It is
> very blk-mq specific and isn't something DM has a role in -- I cannot
> yet see why it would need to.

At the moment we only have one hw queue, but we could have one hw queue
per path. Then .queue_rq would just be in charge of handing the request
down to the underlying device. In that setup, instead of using a default
mapping of all sw queues to one hw queue in .map_queue, we would map to
the hardware queue for the path.
I'd have to look through the blk-mq code more to know whether one of
these approaches has an obvious advantage, but it seems like this way, if
different cpus were using different paths (with the per-cpu load
balancing), you wouldn't constantly be accessing the same hw queue from
different cpus. Although I suppose you may do better just by leaving
multipath_map where it is now and adjusting the number of hardware
queues. Speaking of which, have you tried fiddling around with that in
your tests?

> > There's also the issue that the bio-based targets may scale better on
> > blk-mq devices than the request-based blk-mq targets.
>
> Why is that surprising? Request-based DM (and the block core) does
> quite a bit more work. bio-based DM targets take a ~20% IOPs hit,
> whereas blk-mq request-based DM takes a ~50% hit. I'd _love_ for
> request-based DM to get down to only a ~20% hit. (And for the bio-based
> 20% hit to be reduced further.)

Right. But as I said in an earlier email, if bio-based mpath would give
us better performance on this class of devices, then all the blk-mq
performance work helps both multipath and the other targets. I realize
that bio-based multipath had issues beyond raw I/O performance that
caused us to switch away from it, such as the lack of good error
information. But if the performance gap between request-based and
bio-based dm persists for blk-mq devices (even assuming both improve),
then we should at least revisit the issues with bio-based multipath to
see which set of problems looks easiest to tackle.

-Ben
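To make the "one hw queue per path" idea above concrete, here is a purely
hypothetical sketch against the v4.4-era hooks. None of the dm_mpath_*
symbols or the per-cpu path index below exist in the kernel; they are
invented only to illustrate the shape of the scheme.

#include <linux/blk-mq.h>
#include <linux/percpu.h>

/* Hypothetical: index of the path the current cpu is load-balanced onto. */
static DEFINE_PER_CPU(unsigned int, dm_mpath_cpu_path);

/* With one hw queue allocated per path, .map_queue could steer each cpu's
 * I/O at the hw queue of "its" path instead of funneling every cpu through
 * a single shared hctx. */
static struct blk_mq_hw_ctx *dm_mpath_map_queue(struct request_queue *q,
						const int cpu)
{
	unsigned int path = per_cpu(dm_mpath_cpu_path, cpu);

	return q->queue_hw_ctx[path % q->nr_hw_queues];
}

/* .queue_rq would then only hand the request to the underlying device
 * already bound to this hctx, rather than doing path selection itself
 * (placeholder body). */
static int dm_mpath_queue_rq(struct blk_mq_hw_ctx *hctx,
			     const struct blk_mq_queue_data *bd)
{
	blk_mq_start_request(bd->rq);
	/* ... submit a clone of bd->rq on the path bound to hctx ... */
	return BLK_MQ_RQ_QUEUE_OK;
}

Whether a per-cpu lookup like this is cheap enough for .map_queue, which
sits in the submission hot path, is exactly the question raised in the
follow-up below.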
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-29 2:11 UTC
To: Mike Snitzer; +Cc: linux-block, dm-devel, lsf-pc

On Thu, Jan 28, 2016 at 07:33:16PM -0600, Benjamin Marzinski wrote:
> On Thu, Jan 28, 2016 at 05:37:33PM -0500, Mike Snitzer wrote:
> > blk-mq's .queue_rq hook is the logical place to do the mpath mapping,
> > as that is where we get a request from the underlying paths.
> >
> > blk-mq's .map_queue is all about mapping sw queues to hw queues. It
> > is very blk-mq specific and isn't something DM has a role in -- I
> > cannot yet see why it would need to.
>
> At the moment we only have one hw queue, but we could have one hw queue
> per path. Then .queue_rq would just be in charge of handing the request
> down to the underlying device. In that setup, instead of using a
> default mapping of all sw queues to one hw queue in .map_queue, we
> would map to the hardware queue for the path. I'd have to look through
> the blk-mq code more to know whether one of these approaches has an
> obvious advantage, but it seems like this way, if different cpus were
> using different paths (with the per-cpu load balancing), you wouldn't
> constantly be accessing the same hw queue from different cpus. Although
> I suppose you may do better just by leaving multipath_map where it is
> now and adjusting the number of hardware queues. Speaking of which,
> have you tried fiddling around with that in your tests?

O.k., a quick look shows that .map_queue gets called so often that any
sort of dynamic mapping there would be a pain. But constantly having all
the cpus accessing one hw queue seems like it could be part of the
performance issue, so it would definitely be worth playing around with
that.

-Ben
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Mike Snitzer @ 2016-01-29 2:48 UTC
To: Benjamin Marzinski; +Cc: linux-block, dm-devel, lsf-pc

On Thu, Jan 28 2016 at 9:11pm -0500,
Benjamin Marzinski <bmarzins@redhat.com> wrote:

> O.k., a quick look shows that .map_queue gets called so often that any
> sort of dynamic mapping there would be a pain. But constantly having
> all the cpus accessing one hw queue seems like it could be part of the
> performance issue, so it would definitely be worth playing around with
> that.

Yeah, I have a patch that makes both hw_queues and queue_depth tunable:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=devel2&id=99ebcaf36d9d1fa3acec98492c36664d57ba8fbd

Increasing nr_hw_queues doesn't help (in fact it hurts): going from 1 to
2 drops ~970K IOPs to ~945K, and with 4 I get ~930K.

I'll need to revisit the blk-mq code in general to appreciate how the
sw -> hw mapping will scale, etc., and to verify assumptions like: the
top-level dm-mpath rq->mq_ctx->cpu matches the underlying path's
clone->mq_ctx->cpu.
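For anyone wanting to reproduce the experiment, tunables like these are
typically exposed as module parameters along the following lines. This is
only a sketch: the parameter names, defaults, and wiring shown here are
illustrative assumptions, not taken verbatim from the patch above.

#include <linux/module.h>

/* Illustrative declaration of the two tunables; names and defaults are
 * assumptions, not necessarily what the linked commit uses. */
static unsigned dm_mq_nr_hw_queues = 1;
module_param(dm_mq_nr_hw_queues, uint, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(dm_mq_nr_hw_queues,
		 "Number of hardware queues for request-based dm-mq devices");

static unsigned dm_mq_queue_depth = 2048;
module_param(dm_mq_queue_depth, uint, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(dm_mq_queue_depth,
		 "Queue depth for request-based dm-mq devices");

/* The values would then feed the blk-mq tag set when the dm-mq request
 * queue is set up, e.g.:
 *
 *	md->tag_set.nr_hw_queues = dm_mq_nr_hw_queues;
 *	md->tag_set.queue_depth  = dm_mq_queue_depth;
 */

On a test box the knobs would then be set at module load time, e.g.
something like 'modprobe dm_mod dm_mq_nr_hw_queues=4', assuming the
parameters end up living in dm_mod.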
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Hannes Reinecke @ 2016-01-29 6:59 UTC
To: Benjamin Marzinski, lsf-pc; +Cc: linux-block, dm-devel

On 01/28/2016 10:23 PM, Benjamin Marzinski wrote:
> I'd like to attend LSF/MM 2016 to participate in any discussions about
> redesigning how device-mapper multipath operates. I spend a significant
> chunk of my time dealing with multipath issues, and I'd like to be part
> of any discussion about redesigning it.

And while you're there, we should be discussing systemd / udev / dracut
integration. I have sunk far too many man-hours into this, and it's
still nowhere near mainline. And I guess the same goes for any other
distro :-)

That doesn't warrant a full LSF session, though; the important bit is
the discussion itself :-)

Cheers,

Hannes
--
Dr. Hannes Reinecke                   Teamlead Storage & Networking
hare@suse.de                          +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
* Re: [LSF/MM ATTEND] multipath redesign and dm blk-mq issues
From: Benjamin Marzinski @ 2016-01-29 15:34 UTC
To: Hannes Reinecke; +Cc: linux-block, dm-devel, lsf-pc

On Fri, Jan 29, 2016 at 07:59:09AM +0100, Hannes Reinecke wrote:
> On 01/28/2016 10:23 PM, Benjamin Marzinski wrote:
> > I'd like to attend LSF/MM 2016 to participate in any discussions
> > about redesigning how device-mapper multipath operates. I spend a
> > significant chunk of my time dealing with multipath issues, and I'd
> > like to be part of any discussion about redesigning it.
>
> And while you're there, we should be discussing systemd / udev / dracut
> integration. I have sunk far too many man-hours into this, and it's
> still nowhere near mainline. And I guess the same goes for any other
> distro :-)

Sure. It would be great to avoid so much duplicated work and get some
more consistency in this area.

-Ben