public inbox for linux-nvme@lists.infradead.org
From: Max Gurtovoy <mgurtovoy@nvidia.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	Guixin Liu <kanie@linux.alibaba.com>,
	hch@lst.de, kbusch@kernel.org, kch@nvidia.com, axboe@kernel.dk
Cc: linux-nvme@lists.infradead.org
Subject: Re: [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize
Date: Mon, 25 Dec 2023 14:36:15 +0200
Message-ID: <f4f4d62b-8f2e-460c-a393-ff2966c9ebd4@nvidia.com>
In-Reply-To: <51a09c6d-d4b5-4edf-814c-08bc95640a2b@grimberg.me>



On 25/12/2023 10:59, Sagi Grimberg wrote:
> 
>>>>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
>>>>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>>>>       }
>>>>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>>>>>
>>>>>>> rdma_dev_max_qsize is a better name.
>>>>>>>
>>>>>>> Also, you can drop the RFC for the next submission.
>>>>>>>
>>>>>>
>>>>>> Sagi,
>>>>>> I don't feel comfortable with these patches.
>>>>>
>>>>> Well, good that you're speaking up then ;)
>>>>>
>>>>>> First I would like to understand the need for it.
>>>>>
>>>>> I assumed that he stumbled on a device that did not support the
>>>>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>>>>
>>>> The situation is that I need a queue depth greater than 128.
>>>>>> Second, the QP WR can be constructed from one or more WQEs and the 
>>>>>> WQEs can be constructed from one or more WQEBBs. The max_qp_wr 
>>>>>> doesn't take it into account.
>>>>>
>>>>> Well, it is not taken into account now either with the existing magic
>>>>> limit in nvmet. The rdma limits reporting mechanism was and still is
>>>>> unusable.
>>>>>
>>>>> I would expect a device that has different size for different work
>>>>> items to report max_qp_wr accounting for the largest work element that
>>>>> the device supports, so it is universally correct.
>>>>>
>>>>> The fact that max_qp_wr means the maximum number of slots in a qp and
>>>>> at the same time different work requests can arbitrarily use any number
>>>>> of slots without anyone ever knowing, makes it pretty much impossible to
>>>>> use reliably.
>>>>>
>>>>> Maybe rdma device attributes need a new attribute called
>>>>> universal_max_qp_wr that is going to actually be reliable and not
>>>>> guess-work?
>>>>
>>>> I see, the max_qp_wr is not as reliable as I imagined. Is there any
>>>> other way to get a queue depth greater than 128 instead of changing
>>>> NVME_RDMA_MAX_QUEUE_SIZE?
>>>>
>>>
>>> When I added this limit to the RDMA transports, it was to avoid a
>>> situation where QP creation fails because a large queue was
>>> requested.
>>>
>>> I chose 128 since it was supported by all the RDMA adapters I've
>>> tested in my lab (mostly Mellanox adapters).
>>> For this queue depth we found that the performance is good enough and
>>> that it does not improve if we increase the depth.
>>>
>>> Are you saying that you have a device that can provide better 
>>> performance with qdepth > 128 ?
>>> What is the tested qdepth and what are the numbers you see with this 
>>> qdepth ?
>>
>> Yeah, you are right, the improvement is small (about 1~2%). I do this
>> only for better benchmark results.
> 
> Well, it doesn't come for free, you are essentially doubling the queue
> depth. I'm also assuming that you tested a single initiator and a
> single queue?
> 
>> I still insist that using the capabilities of the RDMA device to
>> determine the queue size is a better choice, but for now I have changed
>> NVME_RDMA_MAX_QUEUE_SIZE to 256 for bidding.
> 
> Still doesn't change the fact that it's pure guess-work whether it is
> supported by the device or not.
> 
> Are you even able to create that queue depth with DIF workloads?
> 
> Max, what is the maximum effective depth with DIF enabled?

I'll need to check it.

I'll prepare some patches to allow an RDMA queue_size of 256 for non-PI
controllers anyway.
I would also like to add another configfs entry to set the max
queue size of a target port.
I hope to get them merged in the upcoming merge window.
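
For anyone following the arithmetic in the quoted hunk, here is a minimal user-space sketch of the proposed clamp. It is an illustration under assumptions, not the patch itself: SEND_WR_FACTOR is assumed to be 3 (which matches the "128 nvme commands = 384 rdma wrs" figure quoted above), and the helper names rdma_dev_max_qsize() and clamp_queue_size() are hypothetical. As discussed in the thread, max_qp_wr may not account for WRs that consume more than one slot, so the result is an upper bound at best.

```c
#include <assert.h>

/* Assumption: 3 send-side WRs per NVMe command (128 commands -> 384 WRs). */
#define SEND_WR_FACTOR 3

/*
 * Hypothetical helper mirroring the quoted hunk: divide the device's
 * reported max_qp_wr by the send WRs per command plus one receive WR.
 */
static unsigned int rdma_dev_max_qsize(unsigned int max_qp_wr)
{
	return max_qp_wr / (SEND_WR_FACTOR + 1);
}

/* Hypothetical helper: clamp a requested queue size to the device limit. */
static unsigned int clamp_queue_size(unsigned int requested,
				     unsigned int max_qp_wr)
{
	unsigned int limit = rdma_dev_max_qsize(max_qp_wr);

	assert(requested > 0);
	return requested > limit ? limit : requested;
}
```

With these assumptions, a device reporting max_qp_wr = 512 would cap a requested depth of 256 at 128, while a device reporting 32768 would leave 256 untouched.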


Thread overview: 11+ messages
2023-12-19  7:32 [RFC PATCH V2 0/2] *** use rdma device capability to limit queue size *** Guixin Liu
2023-12-19  7:32 ` [RFC PATCH V2 1/2] nvmet: rdma: utilize ib_device capability for setting max_queue_size Guixin Liu
2023-12-19  7:32 ` [RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize Guixin Liu
2023-12-20  9:17   ` Sagi Grimberg
2023-12-20 10:52     ` Max Gurtovoy
2023-12-20 19:27       ` Sagi Grimberg
2023-12-22  6:58         ` Guixin Liu
2023-12-24  1:37           ` Max Gurtovoy
2023-12-25  8:40             ` Guixin Liu
2023-12-25  8:59               ` Sagi Grimberg
2023-12-25 12:36                 ` Max Gurtovoy [this message]
