public inbox for linux-nvme@lists.infradead.org
 help / color / mirror / Atom feed
From: Max Gurtovoy <mgurtovoy@nvidia.com>
To: linux-nvme@lists.infradead.org, Christoph Hellwig <hch@lst.de>,
	marcan@marcan.st, sven@svenpeter.dev,
	Keith Busch <kbusch@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	James Smart <james.smart@broadcom.com>
Cc: alyssa@rosenzweig.io, asahi@lists.linux.dev,
	Chaitanya Kulkarni <kch@nvidia.com>
Subject: Re: [PATCH] nvme: don't set a virt_boundary unless needed
Date: Fri, 22 Dec 2023 03:16:28 +0200	[thread overview]
Message-ID: <8cfe55f2-4f2e-46f9-bbc8-5ab80d06f3d5@nvidia.com> (raw)
In-Reply-To: <155ec506-ede8-42c7-95f7-e8be32800a8d@grimberg.me>



On 21/12/2023 11:30, Sagi Grimberg wrote:
> 
>> NVMe PRPs are a pain and force the expensive virt_boundary checking on
>> block layer, prevent secure passthrough and require scatter/gather I/O
>> to be split into multiple commands which is problematic for the upcoming
>> atomic write support.
> 
> But is the threshold still correct? meaning for I/Os small enough the
> device will have lower performance? I'm not advocating that we keep it,
> but we should at least mention the tradeoff in the change log.
> 
>> Fix the NVMe core to require an opt-in from the drivers for it.
>>
>> For nvme-apple it is always required as the driver only supports PRPs.
>>
>> For nvme-pci when SGLs are supported we'll always use them for data I/O
>> that would require a virt_boundary.
>>
>> For nvme-rdma the virt boundary is always required, as RMDA MRs are just
>> as dumb as NVMe PRPs.
> 
> That is actually device dependent. The driver can ask for a pool of
> mrs with type IB_MR_TYPE_SG_GAPS if the device supports IBK_SG_GAPS_REG.
> 
> See from ib_srp.c:
> -- 
>         if (device->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
>                  mr_type = IB_MR_TYPE_SG_GAPS;
>          else
>                  mr_type = IB_MR_TYPE_MEM_REG;

For now, I prefer not using the IB_MR_TYPE_SG_GAPS MR in NVMe/RDMA since 
in the case of virtual contiguous data buffers it is better to use 
IB_MR_TYPE_MEM_REG. It gives much better performance. This is the reason 
I didn't add IB_MR_TYPE_SG_GAPS MR support for NVMe/RDMA.

I actually had a plan to re-write the IB_MR_TYPE_SG_GAPS MR logic (or 
create a new MR type) that will internally open 2 MRs so if the IO is 
contiguous it will use the MTT/MEM_REG and if it isn't it will use the 
KLM/SG_GAPS.
This is how we implemented the SIG_MR but still didn't make it for the 
IB_MR_TYPE_SG_GAPS MR.

Actually, I think we should have the same logic in the NVMe PCI driver:
if the IOs can be delivered as PRPs then the driver will prepare SQE 
with PRP. Otherwise, driver will prepare SGL.
I think that doing the check in the driver for each IO is not so bad and 
devices will get benefit from it. Usually HW devices like to work with 
contiguous buffers. If the buffers can't be mapped with PRPs, then the 
HW will work a bit harder and use SGLs (it is better than doing a bounce 
buffer in the block layer).

I actually did a POC internally for NVMe/RDMA and created sg_gaps ib_mr 
and mem_reg ib_mr and checked the buffers mapping for each IO and got a 
big benefit if the buffers were discontig (used the sg_gaps mr). Also 
the contig buffers performance didn't degraded because of the check of 
the buffers mapping.

I created a fio flags that in purpose sends discontig IOs for my testing.

WDYT ?


> -- 
> 
>>
>> For nvme-tcp and nvme-fc I set the flags for now because I don't
>> understand the drivers fully, but I suspect the flags could be lifted.
> 
> tcp can absolutely omit virt boundaries.
> 
>> For nvme-loop the flag is never set as it doesn't have any requirements
>> on the I/O format.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>>   drivers/nvme/host/apple.c |  6 +++++
>>   drivers/nvme/host/core.c  | 11 ++++++++-
>>   drivers/nvme/host/fc.c    |  3 +++
>>   drivers/nvme/host/nvme.h  |  4 +++
>>   drivers/nvme/host/pci.c   | 52 ++++++++++++++++++++++-----------------
>>   drivers/nvme/host/rdma.c  |  6 +++++
>>   drivers/nvme/host/tcp.c   |  3 +++
>>   7 files changed, 61 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
>> index 596bb11eeba5a9..a1afb54e3b4da8 100644
>> --- a/drivers/nvme/host/apple.c
>> +++ b/drivers/nvme/host/apple.c
>> @@ -1116,6 +1116,12 @@ static void apple_nvme_reset_work(struct 
>> work_struct *work)
>>           goto out;
>>       }
>> +    /*
>> +     * nvme-apple always uses PRPs and thus needs to set a virt 
>> boundary.
>> +     */
>> +    set_bit(NVME_CTRL_VIRT_BOUNDARY_IO, &anv->ctrl.flags);
>> +    set_bit(NVME_CTRL_VIRT_BOUNDARY_ADMIN, &anv->ctrl.flags);
>> +
> 
> Why two flags? Why can't the core just always set the blk virt boundary
> on the admin request queue?

I also think that a single flag is good enough.
> 
>>       ret = nvme_init_ctrl_finish(&anv->ctrl, false);
>>       if (ret)
>>           goto out
> 


  parent reply	other threads:[~2023-12-22  1:16 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-21  8:48 [PATCH] nvme: don't set a virt_boundary unless needed Christoph Hellwig
2023-12-21  9:30 ` Sagi Grimberg
2023-12-21 12:17   ` Christoph Hellwig
2023-12-21 12:32     ` Sagi Grimberg
2023-12-21 12:40       ` Christoph Hellwig
2023-12-25  9:13         ` Sagi Grimberg
2023-12-21 17:03     ` Keith Busch
2023-12-25  9:20       ` Sagi Grimberg
2023-12-22  1:16   ` Max Gurtovoy [this message]
2023-12-25 10:08     ` Sagi Grimberg
2023-12-25 10:36       ` Max Gurtovoy
2023-12-25 10:44         ` Sagi Grimberg
2023-12-25 12:31           ` Max Gurtovoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8cfe55f2-4f2e-46f9-bbc8-5ab80d06f3d5@nvidia.com \
    --to=mgurtovoy@nvidia.com \
    --cc=alyssa@rosenzweig.io \
    --cc=asahi@lists.linux.dev \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=james.smart@broadcom.com \
    --cc=kbusch@kernel.org \
    --cc=kch@nvidia.com \
    --cc=linux-nvme@lists.infradead.org \
    --cc=marcan@marcan.st \
    --cc=sven@svenpeter.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox