From: Jens Axboe <axboe@kernel.dk>
To: Sagi Grimberg <sagig@dev.mellanox.co.il>, Christoph Hellwig <hch@lst.de>
Cc: linux-scsi <linux-scsi@vger.kernel.org>,
Or Gerlitz <ogerlitz@mellanox.com>, Oren Duer <oren@mellanox.com>,
Max Gurtovoy <maxg@mellanox.com>
Subject: Re: blk-mq queue selection and queue_rq preemption
Date: Wed, 09 Apr 2014 07:37:24 -0600 [thread overview]
Message-ID: <53454D14.50804@kernel.dk> (raw)
In-Reply-To: <534538A4.6050103@dev.mellanox.co.il>
On 04/09/2014 06:10 AM, Sagi Grimberg wrote:
> On 4/8/2014 6:40 PM, Jens Axboe wrote:
>> On 2014-04-08 05:10, Sagi Grimberg wrote:
>>> On 4/7/2014 10:45 PM, Jens Axboe wrote:
>>>> On 04/07/2014 06:44 AM, Sagi Grimberg wrote:
>>>>> Hey Jens, Christoph & Co,
>>>>>
>>>>> I raised this question at LSF but didn't get a clear answer on this
>>>>> matter.
>>>>> So it seems to me that the hctx selection and the actual request
>>>>> dispatch (queue_rq) are preemptive:
>>>>> (1) blk_mq_get_ctx(q);
>>>>> (2) map_queue(q, ctx->cpu);
>>>>> ...
>>>>> (3) blk_mq_put_ctx(ctx);
>>>>> (4) blk_mq_run_hw_queue(hctx, async);
>>>>>
>>>>> It is possible that an MQ device driver may want to implement a
>>>>> lockless
>>>>> scheme counting on (running) CPU <-> hctx attachment.
>>>>> Generally speaking, I think that LLDs will be more comfortable knowing
>>>>> that they are not preemptive in the dispatch flow.
>>>>>
>>>>> My question is, is this a must? if so can you please explain why?
>>>>>
>>>>> Is it possible to put the hctx (restoring preemption) after
>>>>> run_hw_queue
>>>>> allowing to LLDs to be sure that the selected queue
>>>>> match the running CPU?
>>>>
>>>> It's a good question, and one I have thought about before. As you
>>>> note, in the existing code, the mappings are what I would refer to as
>>>> "soft". Generally speaking, CPU X will always map to hardware queue Y,
>>>> but there are no specific guarantees made to effect. It would be
>>>> trivial to make this mapping hard, and I'd be very open to doing that.
>>>> But so far I haven't seen cases where it would improve things. If you
>>>> end up being preempted and moved to a different CPU, it doesn't really
>>>> matter if this happens before or after you queued the IO - the
>>>> completion will end up in the "wrong" location regardless.
>>>>
>>>> But if drivers can be simplified and improved through relying on hard
>>>> mappings (and preempt hence disabled), then I would definitely provide
>>>> that possibility as well. If it doesn't hurt by default, we can just
>>>> switch to that model.
>>>>
>>>
>>> Hey Jens, thanks for the quick response!
>>>
>>> So in my driver I would definitely want to rely on hard mappings.
>>> The reason is that I maintain a context for IO completions (lets call it
>>> "completer") and a submission queue percpu.
>>> The completer handles IO completions and also peeks at the submission
>>> queue to handle pending IOs.
>>> I wish to keep mutual exclusion between the completer and the submission
>>> context.
>>> The driver is capable of setting the IO submission so that the
>>> completion will end up on the same core.
>>>
>>> Hard mappings providing non-preemptive submission flows will guarantee
>>> that the above scheme will work.
>>> At the moment I do:
>>> (1) queue = hctx->driver_data
>>> process request...
>>> (2) cpu = get_cpu()
>>> (3) if (cpu != queue.id)
>>> queue = queues[cpu]
>>> (4) submit_io(queue)
>>> (5) put_cpu()
>>>
>>> So, adding hard_map indicator to blk_mq_reg will be great.
>>>
>>> As I mentioned in the general case, I think that LLDs will be more
>>> comfortable with hard mappings in order to avoid/reduce
>>> lock contention in the submission path and also possibly exploiting
>>> cache/NUMA locality. Moreover I think that setting the device to
>>> generate
>>> IO completion on the submission CPU is common practice for blk-mq
>>> implementation. isn't it?
>>>
>>> I would definitely like to get more input from the driver folks on this.
>>
>> I've rolled up a set of changes to test this out, I have attached the
>> small series. As I originally stated, it was due to performance
>> concerns that I didn't do this originally. If this works better (or
>> the same) on cases where we don't necessarily care about the hard
>> mappings, then we should just do it. It's a cleaner model, and the
>> hardware that has as many queues as CPUs, it's definitely the right
>> (and obvious) thing to do.
>>
>> Note that the attached are TOTALLY untested.
>>
>
> Thanks Jens!
>
> I'll give it a go...
The patches were since tested, and miraculously, they actually do seem
to work as-is. At least from the perspective of "doesn't crash" - I'll
need to double check things and add some debug logic to check that we
don't run on the wrong CPUs from some paths.
--
Jens Axboe
prev parent reply other threads:[~2014-04-09 13:37 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-04-07 12:44 blk-mq queue selection and queue_rq preemption Sagi Grimberg
2014-04-07 19:45 ` Jens Axboe
2014-04-08 11:10 ` Sagi Grimberg
2014-04-08 15:40 ` Jens Axboe
2014-04-09 12:10 ` Sagi Grimberg
2014-04-09 13:37 ` Jens Axboe [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53454D14.50804@kernel.dk \
--to=axboe@kernel.dk \
--cc=hch@lst.de \
--cc=linux-scsi@vger.kernel.org \
--cc=maxg@mellanox.com \
--cc=ogerlitz@mellanox.com \
--cc=oren@mellanox.com \
--cc=sagig@dev.mellanox.co.il \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).