From: keith.busch@intel.com (Keith Busch)
Subject: [PATCH 3/3] NVMe: Convert to blk-mq
Date: Fri, 18 Oct 2013 09:13:49 -0600 (MDT) [thread overview]
Message-ID: <alpine.LRH.2.03.1310180837140.4763@AMR> (raw)
In-Reply-To: <1382102062-22270-4-git-send-email-m@bjorling.me>
On Fri, 18 Oct 2013, Matias Bjorling wrote:
> The nvme driver implements itself as a bio-based driver. This primarily
> because of high lock congestion for high-performance nvm devices. To
> remove the congestion within the traditional block layer, a multi-queue
> block layer is being implemented.
>
> This patch converts the current bio-based approach to work with the
> request-based approach found in the multi-queue block layer. This means
> that bio responsibility is moved from the driver, into the block layer.
> In return the block layer packs request structures and submit them to
> the nvme according to the features/limits of nvme hardware.
>
> The patch consists of:
> * Initialization of multi-queue data structures
> * Conversion of bio function call into request function calls.
> * Separate cmdid patchs for admin and normal queues.
> * Bio splits according to NOT_VIRT_MERGEABLE are assumed to be handled
> by blk-mq.
> * Uses the timeout framework blk-mq where possible.
>
> Signed-off-by: Matias Bjorling <m at bjorling.me>
> ---
> drivers/block/nvme-core.c | 765 +++++++++++++++++++++++-----------------------
> drivers/block/nvme-scsi.c | 39 +--
> include/linux/nvme.h | 7 +-
> 3 files changed, 385 insertions(+), 426 deletions(-)
>
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index e99a30a..36bf45c 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
[snip]
> -static void nvme_start_io_acct(struct bio *bio)
> +static void nvme_start_io_acct(struct request *rq)
> {
> - struct gendisk *disk = bio->bi_bdev->bd_disk;
> - const int rw = bio_data_dir(bio);
> + struct gendisk *disk = rq->rq_disk;
> + const int rw = rq_data_dir(rq);
> int cpu = part_stat_lock();
> part_round_stats(cpu, &disk->part0);
> part_stat_inc(cpu, &disk->part0, ios[rw]);
> - part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
> + part_stat_add(cpu, &disk->part0, sectors[rw], blk_rq_sectors(rq));
> part_inc_in_flight(&disk->part0, rw);
> part_stat_unlock();
> }
>
> -static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
> +static void nvme_end_io_acct(struct request *rq, unsigned long start_time)
> {
> - struct gendisk *disk = bio->bi_bdev->bd_disk;
> - const int rw = bio_data_dir(bio);
> + struct gendisk *disk = rq->rq_disk;
> + const int rw = rq_data_dir(rq);
> unsigned long duration = jiffies - start_time;
> int cpu = part_stat_lock();
> part_stat_add(cpu, &disk->part0, ticks[rw], duration);
> @@ -342,23 +370,26 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
> part_stat_unlock();
> }
I think you can remove the io accounting, right? These were added here
because the diskstats are not updated in the block layer for bio-based
block drivers.
> @@ -715,32 +606,47 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
> dma_dir = DMA_FROM_DEVICE;
> }
>
> - result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
> - if (result <= 0)
> + if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
> goto free_cmdid;
> - length = result;
>
> - cmnd->rw.command_id = cmdid;
> + length = blk_rq_bytes(rq);
> +
> + cmnd->rw.command_id = rq->tag;
The command ids have to be unique on a submission queue. Since each
namespace's blk-mq has its own 'tags' used as command ids here but share
submission queues, what's stopping the tags for commands sent to namespace
1 from clashing with tags for namespace 2?
I think this would work better if one blk-mq was created per device
rather than namespace. It would fix the tag problem above and save a
lot of memory potentially wasted on millions of requests allocated that
can't be used.
Do you know how/if this is planned to work with scsi? Will there be one
blk-mq per LUN or per host controller?
WARNING: multiple messages have this Message-ID (diff)
From: Keith Busch <keith.busch@intel.com>
To: Matias Bjorling <m@bjorling.me>
Cc: axboe@kernel.dk, willy@linux.intel.com, keith.busch@intel.com,
linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org
Subject: Re: [PATCH 3/3] NVMe: Convert to blk-mq
Date: Fri, 18 Oct 2013 09:13:49 -0600 (MDT) [thread overview]
Message-ID: <alpine.LRH.2.03.1310180837140.4763@AMR> (raw)
In-Reply-To: <1382102062-22270-4-git-send-email-m@bjorling.me>
On Fri, 18 Oct 2013, Matias Bjorling wrote:
> The nvme driver implements itself as a bio-based driver. This primarily
> because of high lock congestion for high-performance nvm devices. To
> remove the congestion within the traditional block layer, a multi-queue
> block layer is being implemented.
>
> This patch converts the current bio-based approach to work with the
> request-based approach found in the multi-queue block layer. This means
> that bio responsibility is moved from the driver, into the block layer.
> In return the block layer packs request structures and submit them to
> the nvme according to the features/limits of nvme hardware.
>
> The patch consists of:
> * Initialization of multi-queue data structures
> * Conversion of bio function call into request function calls.
> * Separate cmdid patchs for admin and normal queues.
> * Bio splits according to NOT_VIRT_MERGEABLE are assumed to be handled
> by blk-mq.
> * Uses the timeout framework blk-mq where possible.
>
> Signed-off-by: Matias Bjorling <m@bjorling.me>
> ---
> drivers/block/nvme-core.c | 765 +++++++++++++++++++++++-----------------------
> drivers/block/nvme-scsi.c | 39 +--
> include/linux/nvme.h | 7 +-
> 3 files changed, 385 insertions(+), 426 deletions(-)
>
> diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> index e99a30a..36bf45c 100644
> --- a/drivers/block/nvme-core.c
> +++ b/drivers/block/nvme-core.c
[snip]
> -static void nvme_start_io_acct(struct bio *bio)
> +static void nvme_start_io_acct(struct request *rq)
> {
> - struct gendisk *disk = bio->bi_bdev->bd_disk;
> - const int rw = bio_data_dir(bio);
> + struct gendisk *disk = rq->rq_disk;
> + const int rw = rq_data_dir(rq);
> int cpu = part_stat_lock();
> part_round_stats(cpu, &disk->part0);
> part_stat_inc(cpu, &disk->part0, ios[rw]);
> - part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
> + part_stat_add(cpu, &disk->part0, sectors[rw], blk_rq_sectors(rq));
> part_inc_in_flight(&disk->part0, rw);
> part_stat_unlock();
> }
>
> -static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
> +static void nvme_end_io_acct(struct request *rq, unsigned long start_time)
> {
> - struct gendisk *disk = bio->bi_bdev->bd_disk;
> - const int rw = bio_data_dir(bio);
> + struct gendisk *disk = rq->rq_disk;
> + const int rw = rq_data_dir(rq);
> unsigned long duration = jiffies - start_time;
> int cpu = part_stat_lock();
> part_stat_add(cpu, &disk->part0, ticks[rw], duration);
> @@ -342,23 +370,26 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
> part_stat_unlock();
> }
I think you can remove the io accounting, right? These were added here
because the diskstats are not updated in the block layer for bio-based
block drivers.
> @@ -715,32 +606,47 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
> dma_dir = DMA_FROM_DEVICE;
> }
>
> - result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
> - if (result <= 0)
> + if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
> goto free_cmdid;
> - length = result;
>
> - cmnd->rw.command_id = cmdid;
> + length = blk_rq_bytes(rq);
> +
> + cmnd->rw.command_id = rq->tag;
The command ids have to be unique on a submission queue. Since each
namespace's blk-mq has its own 'tags' used as command ids here but share
submission queues, what's stopping the tags for commands sent to namespace
1 from clashing with tags for namespace 2?
I think this would work better if one blk-mq was created per device
rather than namespace. It would fix the tag problem above and save a
lot of memory potentially wasted on millions of requests allocated that
can't be used.
Do you know how/if this is planned to work with scsi? Will there be one
blk-mq per LUN or per host controller?
next prev parent reply other threads:[~2013-10-18 15:13 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-18 13:14 [PATCH 0/3] Convert from bio-based to blk-mq v2 Matias Bjorling
2013-10-18 13:14 ` Matias Bjorling
2013-10-18 13:14 ` [PATCH 1/3] blk-mq: call exit_hctx on hw queue teardown Matias Bjorling
2013-10-18 13:14 ` Matias Bjorling
2013-10-18 13:14 ` [PATCH 2/3] NVMe: Extract admin queue size Matias Bjorling
2013-10-18 13:14 ` Matias Bjorling
2013-10-18 13:14 ` [PATCH 3/3] NVMe: Convert to blk-mq Matias Bjorling
2013-10-18 13:14 ` Matias Bjorling
2013-10-18 15:13 ` Keith Busch [this message]
2013-10-18 15:13 ` Keith Busch
2013-10-18 19:06 ` Matias Bjørling
2013-10-18 19:06 ` Matias Bjørling
2013-10-22 16:55 ` Keith Busch
2013-10-22 16:55 ` Keith Busch
2013-10-22 18:55 ` Matias Bjorling
2013-10-22 18:55 ` Matias Bjorling
2013-10-22 19:52 ` Keith Busch
2013-10-22 19:52 ` Keith Busch
2013-10-18 15:48 ` [PATCH 0/3] Convert from bio-based to blk-mq v2 Matthew Wilcox
2013-10-18 15:48 ` Matthew Wilcox
2013-10-18 19:10 ` Matias Bjørling
2013-10-18 19:10 ` Matias Bjørling
2013-10-18 19:21 ` Matias Bjorling
2013-10-18 19:21 ` Matias Bjorling
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LRH.2.03.1310180837140.4763@AMR \
--to=keith.busch@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.