Re: [PATCH v12 07/26] nvme-tcp: Add DDP offload control path

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Aurelien Aptel <aaptel@nvidia.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, netdev@vger.kernel.org,
	hch@lst.de, kbusch@kernel.org, axboe@fb.com,
	chaitanyak@nvidia.com, davem@davemloft.net, kuba@kernel.org
Cc: Boris Pismenny <borisp@nvidia.com>,
	aurelien.aptel@gmail.com, smalin@nvidia.com, malin1024@gmail.com,
	ogerlitz@nvidia.com, yorayz@nvidia.com, galshalom@nvidia.com,
	mgurtovoy@nvidia.com
Subject: Re: [PATCH v12 07/26] nvme-tcp: Add DDP offload control path
Date: Wed, 16 Aug 2023 15:30:44 +0300	[thread overview]
Message-ID: <253wmxvuq2z.fsf@nvidia.com> (raw)
In-Reply-To: <bc5cd2a7-efc4-e4df-cae5-5c527dd704a6@grimberg.me>

Sagi Grimberg <sagi@grimberg.me> writes:
>>>> +     if (!netdev || !queue)
>>>> +             return false;
>>>
>>> Is it reasonable to be called here with !netdev or !queue ?
>>
>> The check is needed only for the IO queue case but we can move it
>> earlier in nvme_tcp_start_queue().
>
> I still don't understand even on io queues how this can happen.

In case where the netdev does not support DDP, when the admin queue is
started, netdev->offloading_netdev will not be set and therefore will be
NULL.

Later, when the IO queue is started:

    netdev = queue->ctrl->offloading_netdev; <== NULL
    if (is_netdev_ulp_offload_active(netdev, queue)) { <== we pass NULL

We can move the NULL check higher-up if you prefer, like so:

    if (queue->ctrl->offloading_netdev &&
        is_netdev_ulp_offload_active(queue->ctrl->offloading_netdev, queue)) {

>>>> +
>>>> +     /* If we cannot query the netdev limitations, do not offload */
>>>> +     if (!nvme_tcp_ddp_query_limits(netdev, queue))
>>>> +             return false;
>>>> +
>>>> +     /* If netdev supports nvme-tcp ddp offload, we can offload */
>>>> +     if (test_bit(ULP_DDP_C_NVME_TCP_BIT, netdev->ulp_ddp_caps.active))
>>>> +             return true;
>>>
>>> This should be coming from the API itself, have the limits query
>>> api fail if this is off.
>>
>> We can move the function to the ULP DDP layer.
>>
>>> btw, what is the active thing? is this driven from ethtool enable?
>>> what happens if the user disables it while there is a ulp using it?
>>
>> The active bits are indeed driven by ethtool according to the design
>> Jakub suggested.
>> The nvme-tcp connection will have to be reconnected to see the effect of
>> changing the bit.
>
> It should move inside the api as well. Don't want to care about it in
> nvme.

Ok, we will move it there.

>>>> +
>>>> +     return false;
>>>
>>> This can be folded to the above function.
>>
>> We won't be able to check for TLS in a common wrapper. We think this
>> should be kept.
>
> Why? any tcp ddp need to be able to support tls. Nothing specific to
> nvme here.

True, we will move it to the ULP wrapper.

>>>> +     /*
>>>> +      * The atomic operation guarantees that we don't miss any NIC driver
>>>> +      * resync requests submitted after the above checks.
>>>> +      */
>>>> +     if (atomic64_cmpxchg(&queue->resync_req, pdu_val,
>>>> +                          pdu_val & ~ULP_DDP_RESYNC_PENDING) !=
>>>> +                          atomic64_read(&queue->resync_req))
>>>> +             netdev->netdev_ops->ulp_ddp_ops->resync(netdev,
>>>> +                                                     queue->sock->sk,
>>>> +                                                     pdu_seq);
>>>
>>> Who else is doing an atomic on this value? and what happens
>>> if the cmpxchg fails?
>>
>> The driver thread can set queue->resync_req concurrently in patch
>> "net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload" in function
>> nvmeotcp_update_resync().
>>
>> If the cmpxchg fails it means a new resync request was triggered by the
>> HW, the old request will be dropped and the new one will be processed by
>> a later PDU.
>
> So resync_req is actually the current tcp sequence number or something?
> The name resync_req is very confusing.

queue->resync_req is the TCP sequence for which the HW requested a
resync operation. We can rename it with queue->resync_tcp_seq.

>>>> +                     ret = nvme_tcp_offload_socket(queue);
>>>> +                     if (ret) {
>>>> +                             dev_info(nctrl->device,
>>>> +                                      "failed to setup offload on queue %d ret=%d\n",
>>>> +                                      idx, ret);
>>>> +                     }
>>>> +             }
>>>> +     } else {
>>>>                ret = nvmf_connect_admin_queue(nctrl);
>>>> +             if (ret)
>>>> +                     goto err;
>>>>
>>>> -     if (!ret) {
>>>> -             set_bit(NVME_TCP_Q_LIVE, &queue->flags);
>>>> -     } else {
>>>> -             if (test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>>>> -                     __nvme_tcp_stop_queue(queue);
>>>> -             dev_err(nctrl->device,
>>>> -                     "failed to connect queue: %d ret=%d\n", idx, ret);
>>>> +             netdev = get_netdev_for_sock(queue->sock->sk);
>>>
>>> Is there any chance that this is a different netdev than what is
>>> already recorded? doesn't make sense to me.
>>
>> The idea is that we are first starting the admin queue, which looks up
>> the netdev associated with the socket and stored in the queue. Later,
>> when the IO queues are started, we use the recorded netdev.
>>
>> In cases of bonding or vlan, a netdev can have lower device links, which
>> get_netdev_for_sock() will look up.
>
> I think the code should in high level do:
>         if (idx) {
>                 ret = nvmf_connect_io_queue(nctrl, idx);
>                 if (ret)
>                         goto err;
>                 if (nvme_tcp_ddp_query_limits(queue))
>                         nvme_tcp_offload_socket(queue);
>
>         } else {
>                 ret = nvmf_connect_admin_queue(nctrl);
>                 if (ret)
>                         goto err;
>                 ctrl->ddp_netdev = get_netdev_for_sock(queue->sock->sk);
>                 if (nvme_tcp_ddp_query_limits(queue))
>                         nvme_tcp_offload_apply_limits(queue);
>         }

Ok, we will follow this design.

> ctrl->ddp_netdev should be cleared and put when the admin queue
> is stopped/freed, similar to how async_req is handled.

Thanks, we will clear ddp_netdev on queue stop/free.
This will also prevent reusing a potentially wrong netdev after a reconnection.

>>>> +             /*
>>>> +              * release the device as no offload context is
>>>> +              * established yet.
>>>> +              */
>>>> +             dev_put(netdev);
>>>
>>> the put is unclear, what does it pair with? the get_netdev_for_sock?
>>
>> Yes, get_netdev_for_sock() takes a reference, which we don't need at
>> that point so we put it.
>
> Well, you store a pointer to it, what happens if it goes away while
> the controller is being set up?

It's a problem. We will remove the dev_put() to keep the first
reference, and only release it when it is cleared from the admin queue.

next prev parent reply	other threads:[~2023-08-16 12:31 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-12 16:14 [PATCH v12 00/26] nvme-tcp receive offloads Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 01/26] net: Introduce direct data placement tcp offload Aurelien Aptel
2023-08-09  7:15   ` Sagi Grimberg
2023-08-10 14:46     ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 02/26] net/ethtool: add new stringset ETH_SS_ULP_DDP_{CAPS,STATS} Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 03/26] net/ethtool: add ULP_DDP_{GET,SET} operations for caps and stats Aurelien Aptel
2023-07-15 10:14   ` Simon Horman
2023-07-17  9:45     ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 04/26] Documentation: document netlink ULP_DDP_GET/SET messages Aurelien Aptel
2023-07-15 10:17   ` Simon Horman
2023-07-17  9:47     ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 05/26] iov_iter: skip copy if src == dst for direct data placement Aurelien Aptel
2023-08-16  0:24   ` Max Gurtovoy
2023-07-12 16:14 ` [PATCH v12 06/26] net/tls,core: export get_netdev_for_sock Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 07/26] nvme-tcp: Add DDP offload control path Aurelien Aptel
2023-08-01  2:25   ` Chaitanya Kulkarni
2023-08-09  7:39     ` Sagi Grimberg
2023-08-11  5:28       ` Chaitanya Kulkarni
2023-08-16  0:50         ` Max Gurtovoy
2023-08-09  7:13   ` Sagi Grimberg
2023-08-14 16:11     ` Aurelien Aptel
2023-08-14 18:54       ` Sagi Grimberg
2023-08-16 12:30         ` Aurelien Aptel [this message]
2023-07-12 16:14 ` [PATCH v12 08/26] nvme-tcp: Add DDP data-path Aurelien Aptel
2023-08-09  7:35   ` Sagi Grimberg
2023-08-14 16:12     ` Aurelien Aptel
2023-08-14 19:01       ` Sagi Grimberg
2023-08-17 13:28         ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 09/26] nvme-tcp: RX DDGST offload Aurelien Aptel
2023-08-09  7:59   ` Sagi Grimberg
2023-08-10 14:48     ` Aurelien Aptel
2023-08-13 13:49       ` Sagi Grimberg
2023-07-12 16:14 ` [PATCH v12 10/26] nvme-tcp: Deal with netdevice DOWN events Aurelien Aptel
2023-08-09  8:00   ` Sagi Grimberg
2023-08-16 13:03     ` Aurelien Aptel
2023-08-16 14:10       ` Sagi Grimberg
2023-08-17 14:09         ` Aurelien Aptel
2023-08-20 10:50           ` Sagi Grimberg
2023-08-21 12:33             ` Aurelien Aptel
2023-07-12 16:14 ` [PATCH v12 11/26] nvme-tcp: Add modparam to control the ULP offload enablement Aurelien Aptel
2023-08-09  8:03   ` Sagi Grimberg
2023-08-10 14:50     ` Aurelien Aptel
2023-08-16  1:05     ` Max Gurtovoy
2023-07-12 16:14 ` [PATCH v12 12/26] nvme-tcp: Only enable offload with TLS if the driver supports it Aurelien Aptel
2023-08-09  8:05   ` Sagi Grimberg
2023-08-10 14:52     ` Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 13/26] Documentation: add ULP DDP offload documentation Aurelien Aptel
2023-07-15 10:32   ` Simon Horman
2023-07-17  9:48     ` Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 14/26] net/mlx5e: Rename from tls to transport static params Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 15/26] net/mlx5e: Refactor ico sq polling to get budget Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 16/26] net/mlx5e: Have mdev pointer directly on the icosq structure Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 17/26] net/mlx5e: Refactor doorbell function to allow avoiding a completion Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 18/26] net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 19/26] net/mlx5e: NVMEoTCP, offload initialization Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 20/26] net/mlx5e: TCP flow steering for nvme-tcp acceleration Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 21/26] net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 22/26] net/mlx5e: NVMEoTCP, queue init/teardown Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 23/26] net/mlx5e: NVMEoTCP, ddp setup and resync Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 24/26] net/mlx5e: NVMEoTCP, async ddp invalidation Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 25/26] net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload Aurelien Aptel
2023-07-12 16:15 ` [PATCH v12 26/26] net/mlx5e: NVMEoTCP, statistics Aurelien Aptel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=253wmxvuq2z.fsf@nvidia.com \
    --to=aaptel@nvidia.com \
    --cc=aurelien.aptel@gmail.com \
    --cc=axboe@fb.com \
    --cc=borisp@nvidia.com \
    --cc=chaitanyak@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=galshalom@nvidia.com \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=malin1024@gmail.com \
    --cc=mgurtovoy@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=ogerlitz@nvidia.com \
    --cc=sagi@grimberg.me \
    --cc=smalin@nvidia.com \
    --cc=yorayz@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.