* Blk-mq/scsi-mq Tuning
From: Chad Dupuis @ 2015-10-28 20:11 UTC
To: bvanassche@acm.org, hch@lst.de, linux-scsi
Cc: Giridhar Malavali, Saurav Kashyap, Nilesh Javali

Hi Folks,

We've begun to explore blk-mq and scsi-mq and wanted to know whether there
are any best practices in terms of block layer settings. We're looking
specifically at the FCoE and iSCSI protocols.

A little background on the queues in our hardware first: we have a
per-connection transmit queue and multiple, global receive queues. The
transmit queues are not pegged to a particular CPU. The receive queues are
pegged to the first N CPUs, where N is the number of receive queues. We
set nr_hw_queues in the scsi_host_template to N as well.

In our initial testing we're not seeing the performance scale as we would
expect, so we wanted to see if there are some 'knobs', if you will, that
we could try tuning to increase performance. Also, one question we did
have: is there an official API for setting the CPU affinity of the
hw_ctx queues?

Thanks,
Chad
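As a minimal sketch of the setup described above: drivers of this era
typically set the hardware-queue count on the allocated Scsi_Host before
scsi_add_host(), and scsi-mq takes the blk-mq tag set's queue count from
that field. All names here (my_probe, struct my_hba, my_template, the
can_queue depth) are hypothetical placeholders, not from Chad's driver.

    #include <linux/pci.h>
    #include <scsi/scsi_cmnd.h>
    #include <scsi/scsi_host.h>

    static int my_queuecommand(struct Scsi_Host *shost,
                               struct scsi_cmnd *cmd);

    /* Hypothetical driver state; not from any real driver. */
    struct my_hba {
            struct pci_dev *pdev;
            unsigned int num_rx_queues;     /* the N receive queues above */
    };

    static struct scsi_host_template my_template = {
            .module         = THIS_MODULE,
            .name           = "my_fcoe_hba",
            .queuecommand   = my_queuecommand,
            .can_queue      = 1024,
            .this_id        = -1,
    };

    static int my_probe(struct my_hba *hba)
    {
            struct Scsi_Host *shost;

            shost = scsi_host_alloc(&my_template, sizeof(struct my_hba *));
            if (!shost)
                    return -ENOMEM;

            /* One blk-mq hardware context per hardware receive queue;
             * scsi-mq copies this into the tag set's nr_hw_queues. */
            shost->nr_hw_queues = hba->num_rx_queues;

            return scsi_add_host(shost, &hba->pdev->dev);
    }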
* Re: Blk-mq/scsi-mq Tuning
From: Bart Van Assche @ 2015-10-29 18:04 UTC
To: Chad Dupuis, hch@lst.de, linux-scsi
Cc: Giridhar Malavali, Saurav Kashyap, Nilesh Javali, Jens Axboe

On 10/28/2015 01:11 PM, Chad Dupuis wrote:
> We've begun to explore blk-mq and scsi-mq and wanted to know whether
> there are any best practices in terms of block layer settings. We're
> looking specifically at the FCoE and iSCSI protocols.
>
> A little background on the queues in our hardware first: we have a
> per-connection transmit queue and multiple, global receive queues. The
> transmit queues are not pegged to a particular CPU. The receive queues
> are pegged to the first N CPUs, where N is the number of receive queues.
> We set nr_hw_queues in the scsi_host_template to N as well.
>
> In our initial testing we're not seeing the performance scale as we
> would expect, so we wanted to see if there are some 'knobs', if you
> will, that we could try tuning to increase performance. Also, one
> question we did have: is there an official API for setting the CPU
> affinity of the hw_ctx queues?

(added Jens to CC-list)

Hello Chad,

It's great news that you are looking into adding scsi-mq support for FCoE
and iSCSI initiator HBAs. If you do not see performance scale as expected,
that probably means lock contention occurs in the code that submits
requests to the SCSI request queues. Have you already tried to measure L3
cache misses with perf (e.g. perf record -ag -e LLC-store-misses sleep 10
&& perf report)? If a single function is responsible for more than 10% of
the L3 cache misses, that usually means that function is causing a
bottleneck.

As far as I know, an official API for setting the CPU affinity of the
hw_ctx queues is not yet available. The approach of the SRP initiator
driver (ib_srp) is to assume that the HCA supports MSI-X and that the
MSI-X interrupts have been spread evenly over the processors. The ib_srp
driver selects an MSI-X interrupt for each hw_ctx queue via the
comp_vector member of struct ib_cq_init_attr. The script I am using myself
is available at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/21312/focus=98409.
I hope that one day that script will be superfluous :-)

Bart.
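A rough sketch of the spreading scheme Bart describes, assuming the RDMA
verbs API of that era. Only the comp_vector selection reflects what ib_srp
actually does; the CQ depth, my_comp_handler and the channel argument are
made-up placeholders.

    #include <rdma/ib_verbs.h>

    static void my_comp_handler(struct ib_cq *cq, void *ctx)
    {
            /* poll the CQ and complete the associated SCSI commands */
    }

    /* Pick one completion vector, and therefore one MSI-X interrupt,
     * per blk-mq hw_ctx queue. */
    static struct ib_cq *my_create_cq_for_hctx(struct ib_device *dev,
                                               void *channel, int hctx_idx)
    {
            struct ib_cq_init_attr attr = {
                    .cqe            = 4096,
                    /* round-robin hw_ctx queues over the device's
                     * completion vectors */
                    .comp_vector    = hctx_idx % dev->num_comp_vectors,
            };

            return ib_create_cq(dev, my_comp_handler, NULL, channel, &attr);
    }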
* Re: Blk-mq/scsi-mq Tuning
From: Hannes Reinecke @ 2015-10-30 7:44 UTC
To: Chad Dupuis, bvanassche@acm.org, hch@lst.de, linux-scsi
Cc: Giridhar Malavali, Saurav Kashyap, Nilesh Javali

On 10/28/2015 09:11 PM, Chad Dupuis wrote:
> Hi Folks,
>
> We've begun to explore blk-mq and scsi-mq and wanted to know whether
> there are any best practices in terms of block layer settings. We're
> looking specifically at the FCoE and iSCSI protocols.
>
> A little background on the queues in our hardware first: we have a
> per-connection transmit queue and multiple, global receive queues. The
> transmit queues are not pegged to a particular CPU. The receive queues
> are pegged to the first N CPUs, where N is the number of receive queues.
> We set nr_hw_queues in the scsi_host_template to N as well.
>
Weelll ... I think you'll run into issues here.
The whole point of the multiqueue implementation is that you can tie the
submission _and_ completion queue to a single CPU, thereby eliminating
locking. If you only peg the completion queue to a CPU you'll still have
contention on the submission queue, needing to take locks etc.

Plus you will _inevitably_ incur cache misses, as the completion will
basically never occur on the same CPU that did the submission. Hence the
context needs to be bounced to the CPU holding the completion queue, or
you'll need to do an IPI to inform the submitting CPU. But if you do that
you're essentially doing single-queue submission, so I doubt you'd see
that great an improvement.

> In our initial testing we're not seeing the performance scale as we
> would expect, so we wanted to see if there are some 'knobs', if you
> will, that we could try tuning to increase performance. Also, one
> question we did have: is there an official API for setting the CPU
> affinity of the hw_ctx queues?
>
As above, given the underlying design I'm not surprised.

But above you mentioned 'per-connection submission queues', from which one
could infer that there are several _hardware_ submission queues? If so,
_maybe_ we should look into doing MC/S (in the iSCSI case), which would
allow us to keep the 1:1 submission/completion ratio preferred by blk-mq
and still use several queues ... Hmm?

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
hare@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
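To make the bounce Hannes describes concrete, here is a hedged sketch of
the receive side under Chad's setup. struct my_rx_queue and
my_dequeue_completion() are hypothetical; the point is that the done
callback feeds blk-mq's completion path, which in kernels of this era may
bounce the completion back to the submitting CPU when it differs from the
interrupted one.

    #include <linux/interrupt.h>
    #include <scsi/scsi_cmnd.h>

    /* Hypothetical per-RX-queue state and dequeue helper. */
    struct my_rx_queue;
    struct scsi_cmnd *my_dequeue_completion(struct my_rx_queue *rxq);

    static irqreturn_t my_rx_isr(int irq, void *data)
    {
            /* this queue is pinned to one of the first N CPUs */
            struct my_rx_queue *rxq = data;
            struct scsi_cmnd *cmd;

            while ((cmd = my_dequeue_completion(rxq)) != NULL) {
                    /* With scsi-mq this feeds blk-mq's completion path;
                     * if the request was submitted on a different CPU,
                     * the completion is bounced there (IPI or remote
                     * softirq), the cross-CPU cost described above. */
                    cmd->scsi_done(cmd);
            }

            return IRQ_HANDLED;
    }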
* Re: Blk-mq/scsi-mq Tuning
From: Chad Dupuis @ 2015-10-30 13:25 UTC
To: Hannes Reinecke
Cc: bvanassche@acm.org, hch@lst.de, linux-scsi, Giridhar Malavali,
    Saurav Kashyap, Nilesh Javali

On Fri, 30 Oct 2015, Hannes Reinecke wrote:

> On 10/28/2015 09:11 PM, Chad Dupuis wrote:
>> Hi Folks,
>>
>> We've begun to explore blk-mq and scsi-mq and wanted to know whether
>> there are any best practices in terms of block layer settings. We're
>> looking specifically at the FCoE and iSCSI protocols.
>>
>> A little background on the queues in our hardware first: we have a
>> per-connection transmit queue and multiple, global receive queues. The
>> transmit queues are not pegged to a particular CPU. The receive queues
>> are pegged to the first N CPUs, where N is the number of receive
>> queues. We set nr_hw_queues in the scsi_host_template to N as well.
>>
> Weelll ... I think you'll run into issues here.
> The whole point of the multiqueue implementation is that you can tie the
> submission _and_ completion queue to a single CPU, thereby eliminating
> locking. If you only peg the completion queue to a CPU you'll still have
> contention on the submission queue, needing to take locks etc.
>
> Plus you will _inevitably_ incur cache misses, as the completion will
> basically never occur on the same CPU that did the submission. Hence the
> context needs to be bounced to the CPU holding the completion queue, or
> you'll need to do an IPI to inform the submitting CPU. But if you do
> that you're essentially doing single-queue submission, so I doubt you'd
> see that great an improvement.

This was why I was asking whether there is a blk-mq API to set the CPU
affinity of the hardware context queues, so that I could steer submissions
to the CPUs that my receive queues are on (even if they are allowed to
float).

>> In our initial testing we're not seeing the performance scale as we
>> would expect, so we wanted to see if there are some 'knobs', if you
>> will, that we could try tuning to increase performance. Also, one
>> question we did have: is there an official API for setting the CPU
>> affinity of the hw_ctx queues?
>>
> As above, given the underlying design I'm not surprised.
>
> But above you mentioned 'per-connection submission queues', from which
> one could infer that there are several _hardware_ submission queues?
> If so, _maybe_ we should look into doing MC/S (in the iSCSI case),
> which would allow us to keep the 1:1 submission/completion ratio
> preferred by blk-mq and still use several queues ... Hmm?

Yes, each connection has a transmit queue.

> Cheers,
>
> Hannes
* Re: Blk-mq/scsi-mq Tuning
From: Hannes Reinecke @ 2015-10-30 13:38 UTC
To: Chad Dupuis
Cc: bvanassche@acm.org, hch@lst.de, linux-scsi, Giridhar Malavali,
    Saurav Kashyap, Nilesh Javali

On 10/30/2015 02:25 PM, Chad Dupuis wrote:
>
> On Fri, 30 Oct 2015, Hannes Reinecke wrote:
>
>> On 10/28/2015 09:11 PM, Chad Dupuis wrote:
>>> Hi Folks,
>>>
>>> We've begun to explore blk-mq and scsi-mq and wanted to know whether
>>> there are any best practices in terms of block layer settings. We're
>>> looking specifically at the FCoE and iSCSI protocols.
>>>
>>> A little background on the queues in our hardware first: we have a
>>> per-connection transmit queue and multiple, global receive queues.
>>> The transmit queues are not pegged to a particular CPU. The receive
>>> queues are pegged to the first N CPUs, where N is the number of
>>> receive queues. We set nr_hw_queues in the scsi_host_template to N
>>> as well.
>>>
>> Weelll ... I think you'll run into issues here.
>> The whole point of the multiqueue implementation is that you can tie
>> the submission _and_ completion queue to a single CPU, thereby
>> eliminating locking. If you only peg the completion queue to a CPU
>> you'll still have contention on the submission queue, needing to take
>> locks etc.
>>
>> Plus you will _inevitably_ incur cache misses, as the completion will
>> basically never occur on the same CPU that did the submission. Hence
>> the context needs to be bounced to the CPU holding the completion
>> queue, or you'll need to do an IPI to inform the submitting CPU. But
>> if you do that you're essentially doing single-queue submission, so I
>> doubt you'd see that great an improvement.
>
> This was why I was asking whether there is a blk-mq API to set the CPU
> affinity of the hardware context queues, so that I could steer
> submissions to the CPUs that my receive queues are on (even if they are
> allowed to float).
>
But what would that achieve?
Each of the hardware context queues would still have to use the same
submission queue, so you'd need some serialisation with spinlocks et al.
during submission. Which is what blk-mq tries to avoid.
Am I wrong?

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
hare@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
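As an illustration of the serialisation Hannes means, a hedged sketch of a
submission path in which all N hardware contexts funnel into one
per-connection transmit ring. struct my_conn, my_conn_from_cmd() and
my_post_wqe() are hypothetical names, not from any real driver.

    #include <linux/spinlock.h>
    #include <scsi/scsi_cmnd.h>
    #include <scsi/scsi_host.h>

    struct my_conn {
            spinlock_t tx_lock;  /* guards the single per-connection ring */
    };
    struct my_conn *my_conn_from_cmd(struct scsi_cmnd *cmd);
    void my_post_wqe(struct my_conn *conn, struct scsi_cmnd *cmd);

    static int my_queuecommand(struct Scsi_Host *shost,
                               struct scsi_cmnd *cmd)
    {
            struct my_conn *conn = my_conn_from_cmd(cmd);
            unsigned long flags;

            /* Every hw_ctx funnels into this one lock: the
             * per-connection TX ring is shared state, so submissions
             * from different CPUs serialise here, exactly the contention
             * blk-mq's per-CPU submission design is meant to avoid. */
            spin_lock_irqsave(&conn->tx_lock, flags);
            my_post_wqe(conn, cmd);
            spin_unlock_irqrestore(&conn->tx_lock, flags);

            return 0;
    }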
* Re: Blk-mq/scsi-mq Tuning
From: Chad Dupuis @ 2015-10-30 14:12 UTC
To: Hannes Reinecke
Cc: bvanassche@acm.org, hch@lst.de, linux-scsi, Giridhar Malavali,
    Saurav Kashyap, Nilesh Javali

On Fri, 30 Oct 2015, Hannes Reinecke wrote:

> On 10/30/2015 02:25 PM, Chad Dupuis wrote:
>>
>> On Fri, 30 Oct 2015, Hannes Reinecke wrote:
>>
>>> On 10/28/2015 09:11 PM, Chad Dupuis wrote:
>>>> Hi Folks,
>>>>
>>>> We've begun to explore blk-mq and scsi-mq and wanted to know whether
>>>> there are any best practices in terms of block layer settings. We're
>>>> looking specifically at the FCoE and iSCSI protocols.
>>>>
>>>> A little background on the queues in our hardware first: we have a
>>>> per-connection transmit queue and multiple, global receive queues.
>>>> The transmit queues are not pegged to a particular CPU. The receive
>>>> queues are pegged to the first N CPUs, where N is the number of
>>>> receive queues. We set nr_hw_queues in the scsi_host_template to N
>>>> as well.
>>>>
>>> Weelll ... I think you'll run into issues here.
>>> The whole point of the multiqueue implementation is that you can tie
>>> the submission _and_ completion queue to a single CPU, thereby
>>> eliminating locking. If you only peg the completion queue to a CPU
>>> you'll still have contention on the submission queue, needing to take
>>> locks etc.
>>>
>>> Plus you will _inevitably_ incur cache misses, as the completion will
>>> basically never occur on the same CPU that did the submission. Hence
>>> the context needs to be bounced to the CPU holding the completion
>>> queue, or you'll need to do an IPI to inform the submitting CPU. But
>>> if you do that you're essentially doing single-queue submission, so I
>>> doubt you'd see that great an improvement.
>>
>> This was why I was asking whether there is a blk-mq API to set the CPU
>> affinity of the hardware context queues, so that I could steer
>> submissions to the CPUs that my receive queues are on (even if they
>> are allowed to float).
>>
> But what would that achieve?
> Each of the hardware context queues would still have to use the same
> submission queue, so you'd need some serialisation with spinlocks et al.
> during submission. Which is what blk-mq tries to avoid.
> Am I wrong?

Sadly, no, I believe you're correct. So essentially the upshot seems to
be: unless you can have a 1:1 request:response queue pairing, sticking
with the older queuecommand method is better?

> Cheers,
>
> Hannes
* Re: Blk-mq/scsi-mq Tuning
From: Hannes Reinecke @ 2015-10-30 15:00 UTC
To: Chad Dupuis
Cc: bvanassche@acm.org, hch@lst.de, linux-scsi, Giridhar Malavali,
    Saurav Kashyap, Nilesh Javali, Lee Duncan, Mike Christie

On 10/30/2015 03:12 PM, Chad Dupuis wrote:
>
> On Fri, 30 Oct 2015, Hannes Reinecke wrote:
>
>> On 10/30/2015 02:25 PM, Chad Dupuis wrote:
>>>
>>> On Fri, 30 Oct 2015, Hannes Reinecke wrote:
>>>
>>>> On 10/28/2015 09:11 PM, Chad Dupuis wrote:
>>>>> Hi Folks,
>>>>>
>>>>> We've begun to explore blk-mq and scsi-mq and wanted to know
>>>>> whether there are any best practices in terms of block layer
>>>>> settings. We're looking specifically at the FCoE and iSCSI
>>>>> protocols.
>>>>>
>>>>> A little background on the queues in our hardware first: we have a
>>>>> per-connection transmit queue and multiple, global receive queues.
>>>>> The transmit queues are not pegged to a particular CPU. The receive
>>>>> queues are pegged to the first N CPUs, where N is the number of
>>>>> receive queues. We set nr_hw_queues in the scsi_host_template to N
>>>>> as well.
>>>>>
>>>> Weelll ... I think you'll run into issues here.
>>>> The whole point of the multiqueue implementation is that you can tie
>>>> the submission _and_ completion queue to a single CPU, thereby
>>>> eliminating locking. If you only peg the completion queue to a CPU
>>>> you'll still have contention on the submission queue, needing to
>>>> take locks etc.
>>>>
>>>> Plus you will _inevitably_ incur cache misses, as the completion
>>>> will basically never occur on the same CPU that did the submission.
>>>> Hence the context needs to be bounced to the CPU holding the
>>>> completion queue, or you'll need to do an IPI to inform the
>>>> submitting CPU. But if you do that you're essentially doing
>>>> single-queue submission, so I doubt you'd see that great an
>>>> improvement.
>>>
>>> This was why I was asking whether there is a blk-mq API to set the
>>> CPU affinity of the hardware context queues, so that I could steer
>>> submissions to the CPUs that my receive queues are on (even if they
>>> are allowed to float).
>>>
>> But what would that achieve?
>> Each of the hardware context queues would still have to use the same
>> submission queue, so you'd need some serialisation with spinlocks
>> et al. during submission. Which is what blk-mq tries to avoid.
>> Am I wrong?
>
> Sadly, no, I believe you're correct. So essentially the upshot seems to
> be: unless you can have a 1:1 request:response queue pairing, sticking
> with the older queuecommand method is better?
>
Hmm; you might be getting some performance improvement, as the submission
path from the block layer down is more efficient, but in your case the
positive effects might be eliminated by reducing the number of receive
queues. But then, you never know until you try :-)

The alternative would indeed be to move to MC/S with blk-mq; that should
give you some benefit, as you'd be able to utilize several queues.
I have actually discussed that with Emulex; moving to MC/S in the iSCSI
stack might indeed be viable when using blk-mq. It would be a rather good
match with the existing blk-mq implementation, and most of the
implementation would be in the iSCSI stack, reducing the burden on the
driver vendors :-)

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
hare@suse.de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
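A hedged sketch of the shape this MC/S idea could take: one iSCSI
connection per blk-mq hardware context, so each hw_ctx gets its own
transmit path and the 1:1 submission/completion pairing is restored. All
my_* names and the queue limit are hypothetical; blk_mq_unique_tag() and
blk_mq_unique_tag_to_hwq() are the existing helpers for recovering the
hardware-queue index from a request's tag.

    #include <linux/blk-mq.h>
    #include <scsi/scsi_cmnd.h>

    #define MY_MAX_HW_QUEUES 16             /* hypothetical limit */

    struct my_conn;                         /* per-connection state */

    /* One iSCSI connection per blk-mq hardware context. */
    struct my_session {
            struct my_conn *conns[MY_MAX_HW_QUEUES];
    };

    static struct my_conn *my_conn_for_cmd(struct my_session *sess,
                                           struct scsi_cmnd *cmd)
    {
            u32 tag = blk_mq_unique_tag(cmd->request);
            u32 hwq = blk_mq_unique_tag_to_hwq(tag);

            /* Each hw_ctx owns its own connection and thus its own
             * transmit path: no lock shared between queues, and
             * completions for this connection can come back on the
             * matching receive queue's CPU. */
            return sess->conns[hwq];
    }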
* Re: Blk-mq/scsi-mq Tuning
From: Mike Christie @ 2015-10-30 20:15 UTC
To: Hannes Reinecke, Chad Dupuis
Cc: bvanassche@acm.org, hch@lst.de, linux-scsi, Giridhar Malavali,
    Saurav Kashyap, Nilesh Javali, Lee Duncan

On 10/30/2015 10:00 AM, Hannes Reinecke wrote:
> On 10/30/2015 03:12 PM, Chad Dupuis wrote:
>>
>> On Fri, 30 Oct 2015, Hannes Reinecke wrote:
>>
>>> On 10/30/2015 02:25 PM, Chad Dupuis wrote:
>>>>
>>>> On Fri, 30 Oct 2015, Hannes Reinecke wrote:
>>>>
>>>>> On 10/28/2015 09:11 PM, Chad Dupuis wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> We've begun to explore blk-mq and scsi-mq and wanted to know
>>>>>> whether there are any best practices in terms of block layer
>>>>>> settings. We're looking specifically at the FCoE and iSCSI
>>>>>> protocols.
>>>>>>
>>>>>> A little background on the queues in our hardware first: we have
>>>>>> a per-connection transmit queue and multiple, global receive
>>>>>> queues. The transmit queues are not pegged to a particular CPU.
>>>>>> The receive queues are pegged to the first N CPUs, where N is the
>>>>>> number of receive queues. We set nr_hw_queues in the
>>>>>> scsi_host_template to N as well.
>>>>>>
>>>>> Weelll ... I think you'll run into issues here.
>>>>> The whole point of the multiqueue implementation is that you can
>>>>> tie the submission _and_ completion queue to a single CPU, thereby
>>>>> eliminating locking. If you only peg the completion queue to a CPU
>>>>> you'll still have contention on the submission queue, needing to
>>>>> take locks etc.
>>>>>
>>>>> Plus you will _inevitably_ incur cache misses, as the completion
>>>>> will basically never occur on the same CPU that did the submission.
>>>>> Hence the context needs to be bounced to the CPU holding the
>>>>> completion queue, or you'll need to do an IPI to inform the
>>>>> submitting CPU. But if you do that you're essentially doing
>>>>> single-queue submission, so I doubt you'd see that great an
>>>>> improvement.
>>>>
>>>> This was why I was asking whether there is a blk-mq API to set the
>>>> CPU affinity of the hardware context queues, so that I could steer
>>>> submissions to the CPUs that my receive queues are on (even if they
>>>> are allowed to float).
>>>>
>>> But what would that achieve?
>>> Each of the hardware context queues would still have to use the same
>>> submission queue, so you'd need some serialisation with spinlocks
>>> et al. during submission. Which is what blk-mq tries to avoid.
>>> Am I wrong?
>>
>> Sadly, no, I believe you're correct. So essentially the upshot seems
>> to be: unless you can have a 1:1 request:response queue pairing,
>> sticking with the older queuecommand method is better?
>>
> Hmm; you might be getting some performance improvement, as the
> submission path from the block layer down is more efficient, but in
> your case the positive effects might be eliminated by reducing the
> number of receive queues. But then, you never know until you try :-)
>
> The alternative would indeed be to move to MC/S with blk-mq; that
> should give you some benefit, as you'd be able to utilize several
> queues. I have actually discussed that with Emulex; moving to MC/S in
> the iSCSI stack might indeed be viable when using blk-mq. It would be
> a rather good match with the existing blk-mq implementation, and most
> of the implementation would be in the iSCSI stack, reducing the burden
> on the driver vendors :-)
>

I think the multi-session mq stuff would actually just work too. It was
done with hardware iSCSI in mind. MC/S might be nicer in their case,
though.

For qla4xxx-type cards, would all the MC/S stuff be done in firmware, so
that all you need is a common interface to expose the connection details
and then some common code to map them to hw queues?
Thread overview: 8+ messages (newest: 2015-10-30 20:15 UTC)
2015-10-28 20:11 Blk-mq/scsi-mq Tuning Chad Dupuis
2015-10-29 18:04 ` Bart Van Assche
2015-10-30  7:44 ` Hannes Reinecke
2015-10-30 13:25   ` Chad Dupuis
2015-10-30 13:38     ` Hannes Reinecke
2015-10-30 14:12       ` Chad Dupuis
2015-10-30 15:00         ` Hannes Reinecke
2015-10-30 20:15           ` Mike Christie