On 2014-04-08 05:10, Sagi Grimberg wrote:
> On 4/7/2014 10:45 PM, Jens Axboe wrote:
>> On 04/07/2014 06:44 AM, Sagi Grimberg wrote:
>>> Hey Jens, Christoph & Co,
>>>
>>> I raised this question at LSF but didn't get a clear answer on this
>>> matter.
>>> So it seems to me that the hctx selection and the actual request
>>> dispatch (queue_rq) are preemptive:
>>> (1) blk_mq_get_ctx(q);
>>> (2) map_queue(q, ctx->cpu);
>>> ...
>>> (3) blk_mq_put_ctx(ctx);
>>> (4) blk_mq_run_hw_queue(hctx, async);
>>>
>>> It is possible that an MQ device driver may want to implement a lockless
>>> scheme counting on (running) CPU <-> hctx attachment.
>>> Generally speaking, I think that LLDs will be more comfortable knowing
>>> that they are not preemptive in the dispatch flow.
>>>
>>> My question is, is this a must? if so can you please explain why?
>>>
>>> Is it possible to put the hctx (restoring preemption) after run_hw_queue
>>> allowing to LLDs to be sure that the selected queue
>>> match the running CPU?
>>
>> It's a good question, and one I have thought about before. As you
>> note, in the existing code, the mappings are what I would refer to as
>> "soft". Generally speaking, CPU X will always map to hardware queue Y,
>> but there are no specific guarantees made to effect. It would be
>> trivial to make this mapping hard, and I'd be very open to doing that.
>> But so far I haven't seen cases where it would improve things. If you
>> end up being preempted and moved to a different CPU, it doesn't really
>> matter if this happens before or after you queued the IO - the
>> completion will end up in the "wrong" location regardless.
>>
>> But if drivers can be simplified and improved through relying on hard
>> mappings (and preempt hence disabled), then I would definitely provide
>> that possibility as well. If it doesn't hurt by default, we can just
>> switch to that model.
>>
>
> Hey Jens, thanks for the quick response!
>
> So in my driver I would definitely want to rely on hard mappings.
> The reason is that I maintain a context for IO completions (lets call it
> "completer") and a submission queue percpu.
> The completer handles IO completions and also peeks at the submission
> queue to handle pending IOs.
> I wish to keep mutual exclusion between the completer and the submission
> context.
> The driver is capable of setting the IO submission so that the
> completion will end up on the same core.
>
> Hard mappings providing non-preemptive submission flows will guarantee
> that the above scheme will work.
> At the moment I do:
> (1) queue = hctx->driver_data
> process request...
> (2) cpu = get_cpu()
> (3) if (cpu != queue.id)
>              queue = queues[cpu]
> (4) submit_io(queue)
> (5) put_cpu()
>
> So, adding hard_map indicator to blk_mq_reg will be great.
>
> As I mentioned in the general case, I think that LLDs will be more
> comfortable with hard mappings in order to avoid/reduce
> lock contention in the submission path and also possibly exploiting
> cache/NUMA locality. Moreover I think that setting the device to generate
> IO completion on the submission CPU is common practice for blk-mq
> implementation. isn't it?
>
> I would definitely like to get more input from the driver folks on this.

I've rolled up a set of changes to test this out, I have attached the 
small series. As I originally stated, it was due to performance concerns 
that I didn't do this originally. If this works better (or the same) on 
cases where we don't necessarily care about the hard mappings, then we 
should just do it. It's a cleaner model, and the hardware that has as 
many queues as CPUs, it's definitely the right (and obvious) thing to do.

Note that the attached are TOTALLY untested.

-- 
Jens Axboe