From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id B734039D
	for <netdev@vger.kernel.org>; Mon, 14 Aug 2023 01:10:37 +0000 (UTC)
Received: by smtp.kernel.org (Postfix) with ESMTPSA id 37007C433C7;
	Mon, 14 Aug 2023 01:10:36 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1691975437;
	bh=sSoPoNHGalzvIc70oDi0FxnGludM4NLeHbZZnOdBj90=;
	h=Date:Subject:To:Cc:References:From:In-Reply-To:From;
	b=eJZRaOCVMLsbH1+BfzU6qUmRBhziWNQDcW0phtDKDVv9Q/O3xDA63VWTr5k5qxJuH
	 Mxj5bIAqH3kFZZFDvb5CtFXywQeGtbTn7kgqpUJPWf+kAqjRUZ1ui5kslyh3IENqEI
	 DsFhdLcLUMYrAb/XosQpGbYGHN9XbRA6mT/i3cF5oc4KawmYhgMIggm/xYUU5RMO/Q
	 ABjxA4NwmU08k10E/NfmJycF8iwET1HkMcgSYmg7gVpfuNK/phWQI6y9xYIUBWztno
	 XYKAwZQJ2ioHAr3n77eOROL2fpoi7ZpOEA6tgUkkwSd5M8XfP2wC/Q9tkKzP7JAdx7
	 y0OvFvXpsvgSg==
Message-ID: <7dd4f5b0-0edf-391b-c8b4-3fa82046ab7c@kernel.org>
Date: Sun, 13 Aug 2023 19:10:35 -0600
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org
List-Id: <netdev.vger.kernel.org>
List-Subscribe: <mailto:netdev+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:netdev+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0)
 Gecko/20100101 Thunderbird/102.14.0
Subject: Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind
 dma-buf to netdevice
Content-Language: en-US
To: Mina Almasry <almasrymina@google.com>, netdev@vger.kernel.org,
 linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org
Cc: "David S. Miller" <davem@davemloft.net>,
 Eric Dumazet <edumazet@google.com>, Jakub Kicinski <kuba@kernel.org>,
 Paolo Abeni <pabeni@redhat.com>, Jesper Dangaard Brouer <hawk@kernel.org>,
 Ilias Apalodimas <ilias.apalodimas@linaro.org>, Arnd Bergmann
 <arnd@arndb.de>, Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
 Sumit Semwal <sumit.semwal@linaro.org>,
 =?UTF-8?Q?Christian_K=c3=b6nig?= <christian.koenig@amd.com>,
 Jason Gunthorpe <jgg@ziepe.ca>, Hari Ramakrishnan <rharix@google.com>,
 Dan Williams <dan.j.williams@intel.com>, Andy Lutomirski <luto@kernel.org>,
 stephen@networkplumber.org, sdf@google.com,
 Willem de Bruijn <willemb@google.com>, Kaiyuan Zhang <kaiyuanz@google.com>
References: <20230810015751.3297321-1-almasrymina@google.com>
 <20230810015751.3297321-3-almasrymina@google.com>
From: David Ahern <dsahern@kernel.org>
In-Reply-To: <20230810015751.3297321-3-almasrymina@google.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 8/9/23 7:57 PM, Mina Almasry wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8e7d0cb540cd..02a25ccf771a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -151,6 +151,8 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/prandom.h>
>  #include <linux/once_lite.h>
> +#include <linux/genalloc.h>
> +#include <linux/dma-buf.h>
> 
>  #include "dev.h"
>  #include "net-sysfs.h"
> @@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
>  	return call_netdevice_notifiers_info(val, &info.info);
>  }
> 
> +/* Device memory support */
> +
> +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> +					   struct gen_pool_chunk *chunk,
> +					   void *not_used)
> +{
> +	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> +
> +	kvfree(owner->ppiovs);
> +	kfree(owner);
> +}
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> +	size_t size, avail;
> +
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +
> +	size = gen_pool_size(binding->chunk_pool);
> +	avail = gen_pool_avail(binding->chunk_pool);
> +
> +	if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> +		  size, avail))
> +		gen_pool_destroy(binding->chunk_pool);
> +
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +	dma_buf_detach(binding->dmabuf, binding->attachment);
> +	dma_buf_put(binding->dmabuf);
> +	kfree(binding);
> +}
> +
> +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
> +{
> +	struct netdev_rx_queue *rxq;
> +	unsigned long xa_idx;
> +
> +	list_del_rcu(&binding->list);
> +
> +	xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
> +		if (rxq->binding == binding)
> +			rxq->binding = NULL;
> +
> +	netdev_devmem_binding_put(binding);

This does a put on the binding but it does not notify the driver that
that the dmabuf references need to be flushed from the rx queue.

Also, what about the device getting deleted - e.g., the driver is removed?

> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> +				u32 rxq_idx, struct netdev_dmabuf_binding **out)
> +{
> +	struct netdev_dmabuf_binding *binding;
> +	struct netdev_rx_queue *rxq;
> +	struct scatterlist *sg;
> +	struct dma_buf *dmabuf;
> +	unsigned int sg_idx, i;
> +	unsigned long virtual;
> +	u32 xa_idx;
> +	int err;
> +
> +	rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> +	if (rxq->binding)
> +		return -EEXIST;

So this proposal is limiting a binding to a single dmabuf at a time? Is
this just for the RFC?

Also, this suggests that the Rx queue is unique to the flow. I do not
recall a netdev API to create H/W queues on the fly (only a passing
comment from Kuba), so how is the H/W queue (or queue set since a
completion queue is needed as well) created for the flow? And in turn if
it is unique to the flow, what deletes the queue if an app does not do a
proper cleanup? If the queue sticks around, the dmabuf references stick
around.

Also, if this is an app specific h/w queue, flow steering is not
mentioned in this RFC.

> +
> +	dmabuf = dma_buf_get(dmabuf_fd);
> +	if (IS_ERR_OR_NULL(dmabuf))
> +		return -EBADFD;
> +
> +	binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +			       dev_to_node(&dev->dev));
> +	if (!binding) {
> +		err = -ENOMEM;
> +		goto err_put_dmabuf;
> +	}
> +
> +	xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> +
> +	refcount_set(&binding->ref, 1);
> +
> +	binding->dmabuf = dmabuf;
> +
> +	binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> +	if (IS_ERR(binding->attachment)) {
> +		err = PTR_ERR(binding->attachment);
> +		goto err_free_binding;
> +	}
> +
> +	binding->sgt = dma_buf_map_attachment(binding->attachment,
> +					      DMA_BIDIRECTIONAL);
> +	if (IS_ERR(binding->sgt)) {
> +		err = PTR_ERR(binding->sgt);
> +		goto err_detach;
> +	}
> +
> +	/* For simplicity we expect to make PAGE_SIZE allocations, but the
> +	 * binding can be much more flexible than that. We may be able to
> +	 * allocate MTU sized chunks here. Leave that for future work...
> +	 */
> +	binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +					      dev_to_node(&dev->dev));
> +	if (!binding->chunk_pool) {
> +		err = -ENOMEM;
> +		goto err_unmap;
> +	}
> +
> +	virtual = 0;
> +	for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> +		dma_addr_t dma_addr = sg_dma_address(sg);
> +		struct dmabuf_genpool_chunk_owner *owner;
> +		size_t len = sg_dma_len(sg);
> +		struct page_pool_iov *ppiov;
> +
> +		owner = kzalloc_node(sizeof(*owner), GFP_KERNEL,
> +				     dev_to_node(&dev->dev));
> +		owner->base_virtual = virtual;
> +		owner->base_dma_addr = dma_addr;
> +		owner->num_ppiovs = len / PAGE_SIZE;
> +		owner->binding = binding;
> +
> +		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
> +					 dma_addr, len, dev_to_node(&dev->dev),
> +					 owner);
> +		if (err) {
> +			err = -EINVAL;
> +			goto err_free_chunks;
> +		}
> +
> +		owner->ppiovs = kvmalloc_array(owner->num_ppiovs,
> +					       sizeof(*owner->ppiovs),
> +					       GFP_KERNEL);
> +		if (!owner->ppiovs) {
> +			err = -ENOMEM;
> +			goto err_free_chunks;
> +		}
> +
> +		for (i = 0; i < owner->num_ppiovs; i++) {
> +			ppiov = &owner->ppiovs[i];
> +			ppiov->owner = owner;
> +			refcount_set(&ppiov->refcount, 1);
> +		}
> +
> +		dma_addr += len;
> +		virtual += len;
> +	}
> +
> +	/* TODO: need to be able to bind to multiple rx queues on the same
> +	 * netdevice. The code should already be able to handle that with
> +	 * minimal changes, but the netlink API currently allows for 1 rx
> +	 * queue.
> +	 */
> +	err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +		       GFP_KERNEL);
> +	if (err)
> +		goto err_free_chunks;
> +
> +	rxq->binding = binding;
> +	*out = binding;
> +
> +	return 0;
> +
> +err_free_chunks:
> +	gen_pool_for_each_chunk(binding->chunk_pool,
> +				netdev_devmem_free_chunk_owner, NULL);
> +	gen_pool_destroy(binding->chunk_pool);
> +err_unmap:
> +	dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +				 DMA_BIDIRECTIONAL);
> +err_detach:
> +	dma_buf_detach(dmabuf, binding->attachment);
> +err_free_binding:
> +	kfree(binding);
> +err_put_dmabuf:
> +	dma_buf_put(dmabuf);
> +	return err;
> +}
> +
>  #ifdef CONFIG_NET_INGRESS
>  static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
> 
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index bf7324dd6c36..288ed0112995 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c> @@ -167,10 +231,37 @@ static int netdev_genl_netdevice_event(struct
notifier_block *nb,
>  	return NOTIFY_OK;
>  }
> 
> +static int netdev_netlink_notify(struct notifier_block *nb, unsigned long state,
> +				 void *_notify)
> +{
> +	struct netlink_notify *notify = _notify;
> +	struct netdev_dmabuf_binding *rbinding;
> +
> +	if (state != NETLINK_URELEASE || notify->protocol != NETLINK_GENERIC)
> +		return NOTIFY_DONE;
> +
> +	rcu_read_lock();
> +
> +	list_for_each_entry_rcu(rbinding, &netdev_rbinding_list, list) {
> +		if (rbinding->owner_nlportid == notify->portid) {
> +			netdev_unbind_dmabuf_to_queue(rbinding);

This ties the removal of a dmabuf to the close of the netlink socket as
suggested in the previous round of comments. What happens if the process
closes the dmabuf fd? Is the outstanding dev binding sufficient to keep
the allocation / use in place?