On 2014-04-08 05:10, Sagi Grimberg wrote: > On 4/7/2014 10:45 PM, Jens Axboe wrote: >> On 04/07/2014 06:44 AM, Sagi Grimberg wrote: >>> Hey Jens, Christoph & Co, >>> >>> I raised this question at LSF but didn't get a clear answer on this >>> matter. >>> So it seems to me that the hctx selection and the actual request >>> dispatch (queue_rq) are preemptive: >>> (1) blk_mq_get_ctx(q); >>> (2) map_queue(q, ctx->cpu); >>> ... >>> (3) blk_mq_put_ctx(ctx); >>> (4) blk_mq_run_hw_queue(hctx, async); >>> >>> It is possible that an MQ device driver may want to implement a lockless >>> scheme counting on (running) CPU <-> hctx attachment. >>> Generally speaking, I think that LLDs will be more comfortable knowing >>> that they are not preemptive in the dispatch flow. >>> >>> My question is, is this a must? if so can you please explain why? >>> >>> Is it possible to put the hctx (restoring preemption) after run_hw_queue >>> allowing to LLDs to be sure that the selected queue >>> match the running CPU? >> >> It's a good question, and one I have thought about before. As you >> note, in the existing code, the mappings are what I would refer to as >> "soft". Generally speaking, CPU X will always map to hardware queue Y, >> but there are no specific guarantees made to effect. It would be >> trivial to make this mapping hard, and I'd be very open to doing that. >> But so far I haven't seen cases where it would improve things. If you >> end up being preempted and moved to a different CPU, it doesn't really >> matter if this happens before or after you queued the IO - the >> completion will end up in the "wrong" location regardless. >> >> But if drivers can be simplified and improved through relying on hard >> mappings (and preempt hence disabled), then I would definitely provide >> that possibility as well. If it doesn't hurt by default, we can just >> switch to that model. >> > > Hey Jens, thanks for the quick response! > > So in my driver I would definitely want to rely on hard mappings. > The reason is that I maintain a context for IO completions (lets call it > "completer") and a submission queue percpu. > The completer handles IO completions and also peeks at the submission > queue to handle pending IOs. > I wish to keep mutual exclusion between the completer and the submission > context. > The driver is capable of setting the IO submission so that the > completion will end up on the same core. > > Hard mappings providing non-preemptive submission flows will guarantee > that the above scheme will work. > At the moment I do: > (1) queue = hctx->driver_data > process request... > (2) cpu = get_cpu() > (3) if (cpu != queue.id) > queue = queues[cpu] > (4) submit_io(queue) > (5) put_cpu() > > So, adding hard_map indicator to blk_mq_reg will be great. > > As I mentioned in the general case, I think that LLDs will be more > comfortable with hard mappings in order to avoid/reduce > lock contention in the submission path and also possibly exploiting > cache/NUMA locality. Moreover I think that setting the device to generate > IO completion on the submission CPU is common practice for blk-mq > implementation. isn't it? > > I would definitely like to get more input from the driver folks on this. I've rolled up a set of changes to test this out, I have attached the small series. As I originally stated, it was due to performance concerns that I didn't do this originally. If this works better (or the same) on cases where we don't necessarily care about the hard mappings, then we should just do it. It's a cleaner model, and the hardware that has as many queues as CPUs, it's definitely the right (and obvious) thing to do. Note that the attached are TOTALLY untested. -- Jens Axboe