[PATCH 3/3] NVMe: Convert to blk-mq

All of lore.kernel.org
 help / color / mirror / Atom feed

From: m@bjorling.me (Matias Bjørling)
Subject: [PATCH 3/3] NVMe: Convert to blk-mq
Date: Fri, 18 Oct 2013 21:06:22 +0200	[thread overview]
Message-ID: <526186AE.3080006@bjorling.me> (raw)
In-Reply-To: <alpine.LRH.2.03.1310180837140.4763@AMR>

On 10/18/2013 05:13 PM, Keith Busch wrote:
> On Fri, 18 Oct 2013, Matias Bjorling wrote:
>> The nvme driver implements itself as a bio-based driver. This primarily
>> because of high lock congestion for high-performance nvm devices. To
>> remove the congestion within the traditional block layer, a multi-queue
>> block layer is being implemented.
>>
>> This patch converts the current bio-based approach to work with the
>> request-based approach found in the multi-queue block layer. This means
>> that bio responsibility is moved from the driver, into the block layer.
>> In return the block layer packs request structures and submit them to
>> the nvme  according to the features/limits of nvme hardware.
>>
>> The patch consists of:
>> * Initialization of multi-queue data structures
>> * Conversion of bio function call into request function calls.
>> * Separate cmdid patchs for admin and normal queues.
>> * Bio splits according to NOT_VIRT_MERGEABLE are assumed to be handled
>>   by blk-mq.
>> * Uses the timeout framework blk-mq where possible.
>>
>> Signed-off-by: Matias Bjorling <m at bjorling.me>
>> ---
>> drivers/block/nvme-core.c | 765
>> +++++++++++++++++++++++-----------------------
>> drivers/block/nvme-scsi.c |  39 +--
>> include/linux/nvme.h      |   7 +-
>> 3 files changed, 385 insertions(+), 426 deletions(-)
>>
>> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
>> index e99a30a..36bf45c 100644
>> --- a/drivers/block/nvme-core.c
>> +++ b/drivers/block/nvme-core.c
>
> [snip]
>
>> -static void nvme_start_io_acct(struct bio *bio)
>> +static void nvme_start_io_acct(struct request *rq)
>> {
>> -    struct gendisk *disk = bio->bi_bdev->bd_disk;
>> -    const int rw = bio_data_dir(bio);
>> +    struct gendisk *disk = rq->rq_disk;
>> +    const int rw = rq_data_dir(rq);
>>     int cpu = part_stat_lock();
>>     part_round_stats(cpu, &disk->part0);
>>     part_stat_inc(cpu, &disk->part0, ios[rw]);
>> -    part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
>> +    part_stat_add(cpu, &disk->part0, sectors[rw], blk_rq_sectors(rq));
>>     part_inc_in_flight(&disk->part0, rw);
>>     part_stat_unlock();
>> }
>>
>> -static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
>> +static void nvme_end_io_acct(struct request *rq, unsigned long
>> start_time)
>> {
>> -    struct gendisk *disk = bio->bi_bdev->bd_disk;
>> -    const int rw = bio_data_dir(bio);
>> +    struct gendisk *disk = rq->rq_disk;
>> +    const int rw = rq_data_dir(rq);
>>     unsigned long duration = jiffies - start_time;
>>     int cpu = part_stat_lock();
>>     part_stat_add(cpu, &disk->part0, ticks[rw], duration);
>> @@ -342,23 +370,26 @@ static void nvme_end_io_acct(struct bio *bio,
>> unsigned long start_time)
>>     part_stat_unlock();
>> }
>
> I think you can remove the io accounting, right? These were added here
> because the diskstats are not updated in the block layer for bio-based
> block drivers.

Yeap, I'll make a patch for the next version that removes them.

>
>> @@ -715,32 +606,47 @@ static int nvme_submit_bio_queue(struct
>> nvme_queue *nvmeq, struct nvme_ns *ns,
>>         dma_dir = DMA_FROM_DEVICE;
>>     }
>>
>> -    result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
>> -    if (result <= 0)
>> +    if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
>>         goto free_cmdid;
>> -    length = result;
>>
>> -    cmnd->rw.command_id = cmdid;
>> +    length = blk_rq_bytes(rq);
>> +
>> +    cmnd->rw.command_id = rq->tag;
>
> The command ids have to be unique on a submission queue. Since each
> namespace's blk-mq has its own 'tags' used as command ids here but share
> submission queues, what's stopping the tags for commands sent to namespace
> 1 from clashing with tags for namespace 2?
>
> I think this would work better if one blk-mq was created per device
> rather than namespace. It would fix the tag problem above and save a
> lot of memory potentially wasted on millions of requests allocated that
> can't be used.

You're right. I didn't see the connection. In v3 I'll push struct 
request_queue to nvme_dev and map the queues appropriately. It will also 
fix the command id issues.

>
> Do you know how/if this is planned to work with scsi? Will there be one
> blk-mq per LUN or per host controller?

Christoph Hellwig and Nicholas Bellinger are working on scsi-mq.

https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/log/?h=scsi-mq

I think they will map it out per controller. That'll be the most natural.

WARNING: multiple messages have this Message-ID (diff)

From: "Matias Bjørling" <m@bjorling.me>
To: Keith Busch <keith.busch@intel.com>
Cc: axboe@kernel.dk, willy@linux.intel.com,
	linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 3/3] NVMe: Convert to blk-mq
Date: Fri, 18 Oct 2013 21:06:22 +0200	[thread overview]
Message-ID: <526186AE.3080006@bjorling.me> (raw)
In-Reply-To: <alpine.LRH.2.03.1310180837140.4763@AMR>

On 10/18/2013 05:13 PM, Keith Busch wrote:
> On Fri, 18 Oct 2013, Matias Bjorling wrote:
>> The nvme driver implements itself as a bio-based driver. This primarily
>> because of high lock congestion for high-performance nvm devices. To
>> remove the congestion within the traditional block layer, a multi-queue
>> block layer is being implemented.
>>
>> This patch converts the current bio-based approach to work with the
>> request-based approach found in the multi-queue block layer. This means
>> that bio responsibility is moved from the driver, into the block layer.
>> In return the block layer packs request structures and submit them to
>> the nvme  according to the features/limits of nvme hardware.
>>
>> The patch consists of:
>> * Initialization of multi-queue data structures
>> * Conversion of bio function call into request function calls.
>> * Separate cmdid patchs for admin and normal queues.
>> * Bio splits according to NOT_VIRT_MERGEABLE are assumed to be handled
>>   by blk-mq.
>> * Uses the timeout framework blk-mq where possible.
>>
>> Signed-off-by: Matias Bjorling <m@bjorling.me>
>> ---
>> drivers/block/nvme-core.c | 765
>> +++++++++++++++++++++++-----------------------
>> drivers/block/nvme-scsi.c |  39 +--
>> include/linux/nvme.h      |   7 +-
>> 3 files changed, 385 insertions(+), 426 deletions(-)
>>
>> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
>> index e99a30a..36bf45c 100644
>> --- a/drivers/block/nvme-core.c
>> +++ b/drivers/block/nvme-core.c
>
> [snip]
>
>> -static void nvme_start_io_acct(struct bio *bio)
>> +static void nvme_start_io_acct(struct request *rq)
>> {
>> -    struct gendisk *disk = bio->bi_bdev->bd_disk;
>> -    const int rw = bio_data_dir(bio);
>> +    struct gendisk *disk = rq->rq_disk;
>> +    const int rw = rq_data_dir(rq);
>>     int cpu = part_stat_lock();
>>     part_round_stats(cpu, &disk->part0);
>>     part_stat_inc(cpu, &disk->part0, ios[rw]);
>> -    part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
>> +    part_stat_add(cpu, &disk->part0, sectors[rw], blk_rq_sectors(rq));
>>     part_inc_in_flight(&disk->part0, rw);
>>     part_stat_unlock();
>> }
>>
>> -static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
>> +static void nvme_end_io_acct(struct request *rq, unsigned long
>> start_time)
>> {
>> -    struct gendisk *disk = bio->bi_bdev->bd_disk;
>> -    const int rw = bio_data_dir(bio);
>> +    struct gendisk *disk = rq->rq_disk;
>> +    const int rw = rq_data_dir(rq);
>>     unsigned long duration = jiffies - start_time;
>>     int cpu = part_stat_lock();
>>     part_stat_add(cpu, &disk->part0, ticks[rw], duration);
>> @@ -342,23 +370,26 @@ static void nvme_end_io_acct(struct bio *bio,
>> unsigned long start_time)
>>     part_stat_unlock();
>> }
>
> I think you can remove the io accounting, right? These were added here
> because the diskstats are not updated in the block layer for bio-based
> block drivers.

Yeap, I'll make a patch for the next version that removes them.

>
>> @@ -715,32 +606,47 @@ static int nvme_submit_bio_queue(struct
>> nvme_queue *nvmeq, struct nvme_ns *ns,
>>         dma_dir = DMA_FROM_DEVICE;
>>     }
>>
>> -    result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
>> -    if (result <= 0)
>> +    if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
>>         goto free_cmdid;
>> -    length = result;
>>
>> -    cmnd->rw.command_id = cmdid;
>> +    length = blk_rq_bytes(rq);
>> +
>> +    cmnd->rw.command_id = rq->tag;
>
> The command ids have to be unique on a submission queue. Since each
> namespace's blk-mq has its own 'tags' used as command ids here but share
> submission queues, what's stopping the tags for commands sent to namespace
> 1 from clashing with tags for namespace 2?
>
> I think this would work better if one blk-mq was created per device
> rather than namespace. It would fix the tag problem above and save a
> lot of memory potentially wasted on millions of requests allocated that
> can't be used.

You're right. I didn't see the connection. In v3 I'll push struct 
request_queue to nvme_dev and map the queues appropriately. It will also 
fix the command id issues.

>
> Do you know how/if this is planned to work with scsi? Will there be one
> blk-mq per LUN or per host controller?

Christoph Hellwig and Nicholas Bellinger are working on scsi-mq.

https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/log/?h=scsi-mq

I think they will map it out per controller. That'll be the most natural.

next prev parent reply	other threads:[~2013-10-18 19:06 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-18 13:14 [PATCH 0/3] Convert from bio-based to blk-mq v2 Matias Bjorling
2013-10-18 13:14 ` Matias Bjorling
2013-10-18 13:14 ` [PATCH 1/3] blk-mq: call exit_hctx on hw queue teardown Matias Bjorling
2013-10-18 13:14   ` Matias Bjorling
2013-10-18 13:14 ` [PATCH 2/3] NVMe: Extract admin queue size Matias Bjorling
2013-10-18 13:14   ` Matias Bjorling
2013-10-18 13:14 ` [PATCH 3/3] NVMe: Convert to blk-mq Matias Bjorling
2013-10-18 13:14   ` Matias Bjorling
2013-10-18 15:13   ` Keith Busch
2013-10-18 15:13     ` Keith Busch
2013-10-18 19:06     ` Matias Bjørling [this message]
2013-10-18 19:06       ` Matias Bjørling
2013-10-22 16:55       ` Keith Busch
2013-10-22 16:55         ` Keith Busch
2013-10-22 18:55         ` Matias Bjorling
2013-10-22 18:55           ` Matias Bjorling
2013-10-22 19:52           ` Keith Busch
2013-10-22 19:52             ` Keith Busch
2013-10-18 15:48 ` [PATCH 0/3] Convert from bio-based to blk-mq v2 Matthew Wilcox
2013-10-18 15:48   ` Matthew Wilcox
2013-10-18 19:10   ` Matias Bjørling
2013-10-18 19:10     ` Matias Bjørling
2013-10-18 19:21   ` Matias Bjorling
2013-10-18 19:21     ` Matias Bjorling

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=526186AE.3080006@bjorling.me \
    --to=m@bjorling.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.