From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Christie
Subject: Re: [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion
Date: Thu, 08 Jan 2015 17:26:36 -0600
Message-ID: <54AF122C.9070703@cs.wisc.edu>
References: <54AD5DDD.2090808@dev.mellanox.co.il> <54AD6563.4040603@suse.de> <54ADA777.6090801@cs.wisc.edu> <54AE36CE.8020509@acm.org> <1420755361.2842.16.camel@haakon3.risingtidesystems.com> <1420756142.11310.9.camel@HansenPartnership.com> <1420757822.2842.39.camel@haakon3.risingtidesystems.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To: <1420757822.2842.39.camel@haakon3.risingtidesystems.com>
Sender: target-devel-owner@vger.kernel.org
To: "Nicholas A. Bellinger"
Cc: James Bottomley, Bart Van Assche, open-iscsi@googlegroups.com, Hannes Reinecke, Sagi Grimberg, lsf-pc@lists.linux-foundation.org, linux-scsi, target-devel
List-Id: linux-scsi@vger.kernel.org

On 1/8/15, 4:57 PM, Nicholas A. Bellinger wrote:
> On Thu, 2015-01-08 at 14:29 -0800, James Bottomley wrote:
>> On Thu, 2015-01-08 at 14:16 -0800, Nicholas A. Bellinger wrote:
>>> On Thu, 2015-01-08 at 08:50 +0100, Bart Van Assche wrote:
>>>> On 01/07/15 22:39, Mike Christie wrote:
>>>>> On 01/07/2015 10:57 AM, Hannes Reinecke wrote:
>>>>>> On 01/07/2015 05:25 PM, Sagi Grimberg wrote:
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Now that scsi-mq is fully included, we need an iSCSI initiator that
>>>>>>> would use it to achieve scalable performance. The need is even greater
>>>>>>> for iSCSI offload devices and transports that support multiple HW
>>>>>>> queues. As iSER maintainer I'd like to discuss the way we would choose
>>>>>>> to implement that in iSCSI.
>>>>>>>
>>>>>>> My measurements show that iSER initiator can scale up to ~2.1M IOPs
>>>>>>> with multiple sessions but only ~630K IOPs with a single session where
>>>>>>> the most significant bottleneck the (single) core processing
>>>>>>> completions.
>>>>>>>
>>>>>>> In the existing single connection per session model, given that command
>>>>>>> ordering must be preserved session-wide, we end up in a serial command
>>>>>>> execution over a single connection which is basically a single queue
>>>>>>> model. The best fit seems to be plugging iSCSI MCS as a multi-queued
>>>>>>> scsi LLDD. In this model, a hardware context will have a 1x1 mapping
>>>>>>> with an iSCSI connection (TCP socket or a HW queue).
>>>>>>>
>>>>>>> iSCSI MCS and it's role in the presence of dm-multipath layer was
>>>>>>> discussed several times in the past decade(s). The basic need for MCS is
>>>>>>> implementing a multi-queue data path, so perhaps we may want to avoid
>>>>>>> doing any type link aggregation or load balancing to not overlap
>>>>>>> dm-multipath. For example we can implement ERL=0 (which is basically the
>>>>>>> scsi-mq ERL) and/or restrict a session to a single portal.
>>>>>>>
>>>>>>> As I see it, the todo's are:
>>>>>>> 1. Getting MCS to work (kernel + user-space) with ERL=0 and a
>>>>>>>    round-robin connection selection (per scsi command execution).
>>>>>>> 2. Plug into scsi-mq - exposing num_connections as nr_hw_queues and
>>>>>>>    using blk-mq based queue (conn) selection.
>>>>>>> 3. Rework iSCSI core locking scheme to avoid session-wide locking
>>>>>>>    as much as possible.
>>>>>>> 4. Use blk-mq pre-allocation and tagging facilities.
>>>>>>>
>>>>>>> I've recently started looking into this. I would like the community to
>>>>>>> agree (or debate) on this scheme and also talk about implementation
>>>>>>> with anyone who is also interested in this.
>>>>>>>
>>>>>> Yes, that's a really good topic.
>>>>>>
>>>>>> I've pondered implementing MC/S for iscsi/TCP but then I've figured my
>>>>>> network implementation knowledge doesn't spread that far.
>>>>>> So yeah, a discussion here would be good.
>>>>>>
>>>>>> Mike? Any comments?
>>>>>
>>>>> I have been working under the assumption that people would be ok with
>>>>> MCS upstream if we are only using it to handle the issue where we want
>>>>> to do something like have a tcp/iscsi connection per CPU then map the
>>>>> connection to a blk_mq_hw_ctx. In this more limited MCS implementation
>>>>> there would be no iscsi layer code to do something like load balance
>>>>> across ports or transport paths like how dm-multipath does, so there
>>>>> would be no feature/code duplication. For balancing across hctxs, then
>>>>> the iscsi layer would also leave that up to whatever we end up with in
>>>>> upper layers, so again no feature/code duplication with upper layers.
>>>>>
>>>>> So pretty non controversial I hope :)
>>>>>
>>>>> If people want to add something like round robin connection selection in
>>>>> the iscsi layer, then I think we want to leave that for after the
>>>>> initial merge, so people can argue about that separately.
>>>>
>>>> Hello Sagi and Mike,
>>>>
>>>> I agree with Sagi that adding scsi-mq support in the iSER initiator
>>>> would help iSER users because that would allow these users to configure
>>>> a single iSER target and use the multiqueue feature instead of having to
>>>> configure multiple iSER targets to spread the workload over multiple
>>>> cpus at the target side.
>>>>
>>>> And I agree with Mike that implementing scsi-mq support in the iSER
>>>> initiator as multiple independent connections probably is a better
>>>> choice than MC/S. RFC 3720 namely requires that iSCSI numbering is
>>>> session-wide. This means maintaining a single counter for all MC/S
>>>> sessions. Such a counter would be a contention point. I'm afraid that
>>>> because of that counter performance on a multi-socket initiator system
>>>> with a scsi-mq implementation based on MC/S could be worse than with the
>>>> approach with multiple iSER targets.
>>>> Hence my preference for an approach
>>>> based on multiple independent iSER connections instead of MC/S.
>>>>
>>>
>>> The idea that a simple session wide counter for command sequence number
>>> assignment adds such a degree of contention that it renders MC/S at a
>>> performance disadvantage vs. multi-session configurations with all of
>>> the extra multipath logic overhead on top is at best, a naive
>>> proposition.
>>>
>>> On the initiator side for MC/S, literally the only thing that needs to
>>> be serialized is the assignment of the command sequence number to
>>> individual non-immediate PDUs. The sending of the outgoing PDUs +
>>> immediate data by the initiator can happen out-of-order, and it's up to
>>> the target to ensure that the submission of the commands to the device
>>> server is in command sequence number order.
>>>
>>> All of the actual immediate data + R2T -> data-out processing by the
>>> target can also be done out-of-order as well.
>>
>> Right, but what he's saying is that we've taken great pains in the MQ
>> situation to free our issue queues of all entanglements and cross queue
>> locking so they can fly as fast as possible. If we have to assign an
>> in-order sequence number across all the queues, this becomes both a
>> cross CPU bus lock point to ensure atomicity and a sync point to ensure
>> sequencing. Naïvely that does look to be a bottleneck which wouldn't
>> necessarily be mitigated simply by allowing everything to proceed out of
>> order after this point.
>>
>
> The point is that a simple session wide counter for command sequence
> number assignment is significantly less overhead than all of the
> overhead associated with running a full multipath stack atop multiple
> sessions.

I think we are still going to want to use dm multipath on top of iscsi
for devices that do failover across some sort of group of paths like with
ALUA, so we have to solve dm multipath problems either way.
There is greater memory overhead, but how bad is it? With lots of CPUs
and lots of transport paths, I can see where it could get crazy. I have
no idea how bad it will be though.

Are you also seeing a perf issue that is caused by dm?

Hannes had an idea where we could merge the lower and upper levels
somehow and that might solve some of the issues you are thinking about.

>
> Not to mention that our iSCSI/iSER initiator is already taking a session
> wide lock when sending outgoing PDUs, so adding a session wide counter
> isn't adding any additional synchronization overhead vs. what's already
> in place.

I am not sure if we want this to be a deciding factor. I think the
session wide lock is something that can be removed in the main IO paths.

A lot of what it is used for now is cmd/task related handling like list
accesses. When we have the scsi layer alloc/free/manage that, then we
can simplify that a lot for iser/bnx2i/cxgb*i since their send path is
less complicated than software iscsi.

It is also used for the state check but I think that is overkill.
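
For anyone following the CmdSN argument quoted above, here is a toy
user-space model of it in Python. It is only a sketch under loud
assumptions: it is not kernel or open-iscsi code, the names Session,
Target, assign_cmdsn and run are made up for illustration, and it ignores
32-bit CmdSN wraparound and the command window. It shows the one
session-wide sync point on the initiator (number assignment) and a target
that accepts PDUs in any arrival order but hands them on in CmdSN order.

```python
import heapq
import threading


class Session:
    """Hypothetical initiator session shared by several connections."""

    def __init__(self, first_cmdsn=1):
        self._lock = threading.Lock()  # the single session-wide sync point
        self._next_cmdsn = first_cmdsn

    def assign_cmdsn(self):
        # Only the number assignment is serialized across connections.
        with self._lock:
            cmdsn = self._next_cmdsn
            self._next_cmdsn += 1
            return cmdsn


class Target:
    """Hypothetical target: buffers early arrivals, delivers in CmdSN order."""

    def __init__(self, first_cmdsn=1):
        self._lock = threading.Lock()
        self._expected = first_cmdsn
        self._pending = []   # min-heap of (cmdsn, payload)
        self.delivered = []  # what the device server saw, in order

    def receive(self, cmdsn, payload):
        with self._lock:
            heapq.heappush(self._pending, (cmdsn, payload))
            # Drain every command that is now contiguous with what was delivered.
            while self._pending and self._pending[0][0] == self._expected:
                self.delivered.append(heapq.heappop(self._pending))
                self._expected += 1


def run(num_conns=4, cmds_per_conn=1000):
    """One thread per connection (think: one per hw queue) sharing a session."""
    session, target = Session(), Target()

    def conn_worker(conn_id):
        for i in range(cmds_per_conn):
            cmdsn = session.assign_cmdsn()
            # Sending is not ordered across connections; arrival order is
            # whatever the thread scheduler produces.
            target.receive(cmdsn, (conn_id, i))

    threads = [threading.Thread(target=conn_worker, args=(c,))
               for c in range(num_conns)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return target.delivered
```

Calling run() returns the list the device server saw, with CmdSN values
strictly increasing regardless of arrival order. The toy says nothing about
whether the lock in assign_cmdsn() costs more than a multipath stack over
several sessions, which is the open question in this thread.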