Netdev List

* [RFC PATCH 20/21] core/skbuff: use skb_zdata for testing whether skb is zerocopy
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>

skb_zcopy() indicates whether the skb has a zerocopy ubuf attached.
netgpu does not use ubufs, so add skb_zdata(), which indicates
whether the skb is carrying zero-copy data and should be left alone,
while skb_zcopy() still indicates whether there is an attached ubuf.
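
To make the distinction concrete, here is a minimal userspace model of
the two tests. It is only a sketch: struct model_shinfo and the
model_*() helpers are hypothetical stand-ins for skb_shinfo() and the
real inline helpers, and the bit values for SKBTX_DEV_ZEROCOPY and
SKBTX_SHARED_FRAG are assumed to match include/linux/skbuff.h at the
time of this series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the kernel's tx_flags bits; NETDMA is new here. */
enum {
	SKBTX_DEV_ZEROCOPY = 1 << 3,
	SKBTX_SHARED_FRAG  = 1 << 5,
	SKBTX_DEV_NETDMA   = 1 << 7,
};

#define SKBTX_ZERODATA_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_DEV_NETDMA)

/* Hypothetical stand-in for the fields of skb_shared_info used below. */
struct model_shinfo {
	uint8_t tx_flags;
	void *destructor_arg;	/* the ubuf pointer, when one is attached */
};

/* True if the frags are DMA-only (netgpu) data. */
static bool model_netdma(const struct model_shinfo *s)
{
	return s && (s->tx_flags & SKBTX_DEV_NETDMA);
}

/* True if the frags carry zero-copy data of either kind. */
static bool model_zdata(const struct model_shinfo *s)
{
	return s && (s->tx_flags & SKBTX_ZERODATA_FRAG);
}

/* Returns the attached ubuf, or NULL if the zerocopy flag is not set. */
static void *model_zcopy(const struct model_shinfo *s)
{
	bool is_zcopy = s && (s->tx_flags & SKBTX_DEV_ZEROCOPY);

	return is_zcopy ? s->destructor_arg : NULL;
}
```

In this model a netgpu skb (NETDMA set, no ubuf) passes model_zdata()
but yields NULL from model_zcopy(), which is why the coalesce and shift
paths need the zdata test rather than the zcopy test.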

Also, when a write() on a zero-copy socket returns EWOULDBLOCK,
this is not synchronized with select(), which will only look at
the send buffer, and return writability if there is tcp space.

This appears to be caused by some ubuf logic, leading to iperf
spending 70% of its time in select() for ZC transmits.  With this
change, the time spent drops to 20%.
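
For illustration, the userspace pattern that can hit this looks roughly
like the sketch below. It is an assumption about how a sender such as
iperf might be structured, not its actual code; send_all() and
wait_writable() are hypothetical names, and a real zero-copy sender
would also set SO_ZEROCOPY and pass MSG_ZEROCOPY, which is left out
here to keep the example small.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Block in select() until fd is reported writable (fd < FD_SETSIZE). */
static int wait_writable(int fd)
{
	fd_set wfds;
	int rc;

	do {
		FD_ZERO(&wfds);
		FD_SET(fd, &wfds);
		rc = select(fd + 1, NULL, &wfds, NULL, NULL);
	} while (rc < 0 && errno == EINTR);

	return rc < 0 ? -1 : 0;
}

/* Send len bytes without blocking in send(); wait in select() instead. */
static ssize_t send_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = send(fd, buf + done, len - done, MSG_DONTWAIT);

		if (n >= 0) {
			done += (size_t)n;
			continue;
		}
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		/*
		 * If select() reports writability while send() keeps
		 * failing, this loop spins between the two calls.
		 */
		if (wait_writable(fd) < 0)
			return -1;
	}

	return (ssize_t)done;
}
```

When the two calls disagree about writability, each failed send() is
followed by a select() that returns immediately, which would be
consistent with the time in select() reported above.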

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/skbuff.h | 24 +++++++++++++++++++++++-
 net/core/skbuff.c      | 16 ++++++++++++----
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ba41d1a383f8..3c2efd45655b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -443,8 +443,12 @@ enum {
 
 	/* generate software time stamp when entering packet scheduling */
 	SKBTX_SCHED_TSTAMP = 1 << 6,
+
+	/* fragments are accessed only via DMA */
+	SKBTX_DEV_NETDMA = 1 << 7,
 };
 
+#define SKBTX_ZERODATA_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_DEV_NETDMA)
 #define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP)
@@ -1416,6 +1420,24 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline bool skb_netdma(struct sk_buff *skb)
+{
+	return skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_NETDMA;
+}
+
+static inline bool skb_zdata(struct sk_buff *skb)
+{
+	return skb && skb_shinfo(skb)->tx_flags & SKBTX_ZERODATA_FRAG;
+}
+
+static inline void skb_netdma_set(struct sk_buff *skb, bool netdma)
+{
+	if (skb && netdma) {
+		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_NETDMA;
+		skb_shinfo(skb)->destructor_arg = NULL;
+	}
+}
+
 static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
 {
 	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
@@ -3260,7 +3282,7 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
-	if (skb_zcopy(skb))
+	if (skb_zdata(skb))
 		return false;
 	if (i) {
 		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2b4176cab578..67a421257a27 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1323,6 +1323,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 	}
 
 	skb_zcopy_set(skb, uarg, NULL);
+	skb_netdma_set(skb, sk->sk_user_data);
+
 	return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1330,8 +1332,8 @@ EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
 static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
-	if (skb_zcopy(orig)) {
-		if (skb_zcopy(nskb)) {
+	if (skb_zdata(orig)) {
+		if (skb_zdata(nskb)) {
 			/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
 			if (!gfp_mask) {
 				WARN_ON_ONCE(1);
@@ -1343,6 +1345,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 				return -EIO;
 		}
 		skb_zcopy_set(nskb, skb_uarg(orig), NULL);
+		skb_netdma_set(nskb, skb_netdma(orig));
 	}
 	return 0;
 }
@@ -1372,6 +1375,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
 		return -EINVAL;
 
+	if (!skb_has_shared_frag(skb))
+		return -EINVAL;
+
 	if (!num_frags)
 		goto release;
 
@@ -2078,6 +2084,8 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 	 */
 	int i, k, eat = (skb->tail + delta) - skb->end;
 
+	BUG_ON(skb_netdma(skb));
+
 	if (eat > 0 || skb_cloned(skb)) {
 		if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0,
 				     GFP_ATOMIC))
@@ -3328,7 +3336,7 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 	if (skb_headlen(skb))
 		return 0;
-	if (skb_zcopy(tgt) || skb_zcopy(skb))
+	if (skb_zdata(tgt) || skb_zdata(skb))
 		return 0;
 
 	todo = shiftlen;
@@ -5171,7 +5179,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	from_shinfo = skb_shinfo(from);
 	if (to_shinfo->frag_list || from_shinfo->frag_list)
 		return false;
-	if (skb_zcopy(to) || skb_zcopy(from))
+	if (skb_zdata(to) || skb_zdata(from))
 		return false;
 
 	if (skb_headlen(from) != 0) {
-- 
2.24.1



* [RFC PATCH 15/21] netgpu: add network/gpu dma module
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>

Netgpu provides a data path for zero-copy TCP sends and receives
directly to GPU memory.  TCP processing is done on the host CPU,
while data is DMA'd to and from device memory.

The use case for this module is GPUs used for machine learning,
which are located near the NICs and have a high-bandwidth PCI
connection between the GPU and the NIC.

This initial working code is a proof of concept, for discussion.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/misc/Kconfig         |    1 +
 drivers/misc/Makefile        |    1 +
 drivers/misc/netgpu/Kconfig  |   10 +
 drivers/misc/netgpu/Makefile |   11 +
 drivers/misc/netgpu/nvidia.c | 1516 ++++++++++++++++++++++++++++++++++
 include/uapi/misc/netgpu.h   |   43 +
 6 files changed, 1582 insertions(+)
 create mode 100644 drivers/misc/netgpu/Kconfig
 create mode 100644 drivers/misc/netgpu/Makefile
 create mode 100644 drivers/misc/netgpu/nvidia.c
 create mode 100644 include/uapi/misc/netgpu.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index e1b1ba5e2b92..13ae8e55d2a2 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -472,4 +472,5 @@ source "drivers/misc/ocxl/Kconfig"
 source "drivers/misc/cardreader/Kconfig"
 source "drivers/misc/habanalabs/Kconfig"
 source "drivers/misc/uacce/Kconfig"
+source "drivers/misc/netgpu/Kconfig"
 endmenu
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index c7bd01ac6291..e026fe95a629 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -57,3 +57,4 @@ obj-$(CONFIG_PVPANIC)   	+= pvpanic.o
 obj-$(CONFIG_HABANA_AI)		+= habanalabs/
 obj-$(CONFIG_UACCE)		+= uacce/
 obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
+obj-$(CONFIG_NETGPU)		+= netgpu/
diff --git a/drivers/misc/netgpu/Kconfig b/drivers/misc/netgpu/Kconfig
new file mode 100644
index 000000000000..f67adf825c1b
--- /dev/null
+++ b/drivers/misc/netgpu/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# NetGPU framework
+#
+
+config NETGPU
+	tristate "Network/GPU driver"
+	depends on PCI
+	---help---
+	  Experimental Network / GPU driver
diff --git a/drivers/misc/netgpu/Makefile b/drivers/misc/netgpu/Makefile
new file mode 100644
index 000000000000..fe58963efdf7
--- /dev/null
+++ b/drivers/misc/netgpu/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+pkg = /home/bsd/local/pull/nvidia/NVIDIA-Linux-x86_64-440.59/kernel
+
+obj-$(CONFIG_NETGPU) := netgpu.o
+
+netgpu-y := nvidia.o
+
+# netgpu-$(CONFIG_DEBUG_FS) += debugfs.o
+
+ccflags-y += -I$(pkg)
diff --git a/drivers/misc/netgpu/nvidia.c b/drivers/misc/netgpu/nvidia.c
new file mode 100644
index 000000000000..a0ea82effb2f
--- /dev/null
+++ b/drivers/misc/netgpu/nvidia.c
@@ -0,0 +1,1516 @@
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/seq_file.h>
+#include <linux/uio.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/memory.h>
+#include <linux/device.h>
+#include <linux/interval_tree.h>
+
+#include <net/tcp.h>
+
+#include <net/netgpu.h>
+#include <uapi/misc/netgpu.h>
+
+/* XXX enable if using nvidia - will be split out to its own file */
+//#define USE_CUDA 1
+
+#ifdef USE_CUDA
+#include "nvidia/nv-p2p.h"
+#endif
+
+/* nvidia GPU uses 64K pages */
+#define GPU_PAGE_SHIFT		16
+#define GPU_PAGE_SIZE		(1UL << GPU_PAGE_SHIFT)
+#define GPU_PAGE_MASK		(GPU_PAGE_SIZE - 1)
+
+/* self is 3 so skb_netgpu_unref does not catch the dummy page */
+#define NETGPU_REFC_MAX		0xff00
+#define NETGPU_REFC_SELF	3
+#define NETGPU_REFC_EXTRA	(NETGPU_REFC_MAX - NETGPU_REFC_SELF)
+
+static struct mutex netgpu_lock;
+static unsigned int netgpu_index;
+static DEFINE_XARRAY(xa_netgpu);
+static const struct file_operations netgpu_fops;
+
+/* XXX hack */
+static void (*sk_data_ready)(struct sock *sk);
+static struct proto netgpu_prot;
+
+#ifdef USE_CUDA
+/* page_range represents one contiguous GPU PA region */
+struct netgpu_page_range {
+	unsigned long pfn;
+	struct resource *res;
+	struct netgpu_region *region;
+	struct interval_tree_node va_node;
+};
+#endif
+
+struct netgpu_pginfo {
+	unsigned long addr;
+	dma_addr_t dma;
+};
+
+#define NETGPU_CACHE_COUNT	63
+
+/* region represents GPU VA region backed by gpu_pgtbl
+ * as the region is VA, the PA ranges may be discontiguous
+ */
+struct netgpu_region {
+	struct nvidia_p2p_page_table *gpu_pgtbl;
+	struct nvidia_p2p_dma_mapping *dmamap;
+	struct netgpu_pginfo *pginfo;
+	struct page **page;
+	struct netgpu_ctx *ctx;
+	unsigned long start;
+	unsigned long len;
+	struct rb_root_cached root;
+	unsigned host_memory : 1;
+};
+
+static inline struct device *
+netdev2device(struct net_device *dev)
+{
+	return dev->dev.parent;			/* from SET_NETDEV_DEV() */
+}
+
+static inline struct pci_dev *
+netdev2pci_dev(struct net_device *dev)
+{
+	return to_pci_dev(netdev2device(dev));
+}
+
+#ifdef USE_CUDA
+static int nvidia_pg_size[] = {
+	[NVIDIA_P2P_PAGE_SIZE_4KB]   =	 4 * 1024,
+	[NVIDIA_P2P_PAGE_SIZE_64KB]  =	64 * 1024,
+	[NVIDIA_P2P_PAGE_SIZE_128KB] = 128 * 1024,
+};
+
+static void netgpu_cuda_free_region(struct netgpu_region *r);
+#endif
+static void netgpu_free_ctx(struct netgpu_ctx *ctx);
+static int netgpu_add_region(struct netgpu_ctx *ctx, void __user *arg);
+
+#ifdef USE_CUDA
+#define node2page_range(itn) \
+	container_of(itn, struct netgpu_page_range, va_node)
+
+#define region_for_each(r, idx, itn, pr)				\
+	for (idx = r->start,						\
+		itn = interval_tree_iter_first(r->root, idx, r->last);	\
+	     pr = container_of(itn, struct netgpu_page_range, va_node),	\
+		itn;							\
+	     idx = itn->last + 1,					\
+		itn = interval_tree_iter_next(itn, idx, r->last))
+
+#define region_remove_each(r, itn) \
+	while ((itn = interval_tree_iter_first(&r->root, r->start, \
+					       r->start + r->len - 1)) && \
+	       (interval_tree_remove(itn, &r->root), 1))
+
+static inline struct netgpu_page_range *
+region_find(struct netgpu_region *r, unsigned long start, int count)
+{
+	struct interval_tree_node *itn;
+	unsigned long last;
+
+	last = start + count * PAGE_SIZE - 1;
+
+	itn = interval_tree_iter_first(&r->root, start, last);
+	return itn ? node2page_range(itn) : 0;
+}
+
+static void
+netgpu_cuda_pgtbl_cb(void *data)
+{
+	struct netgpu_region *r = data;
+
+	netgpu_cuda_free_region(r);
+}
+
+static void
+netgpu_init_pages(u64 va, unsigned long pfn_start, unsigned long pfn_end)
+{
+	unsigned long pfn;
+	struct page *page;
+
+	for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		mm_zero_struct_page(page);
+
+		set_page_count(page, 2);	/* matches host logic */
+		page->page_type = 7;		/* XXX differential flag */
+		__SetPageReserved(page);
+
+		set_page_private(page, va);
+		va += PAGE_SIZE;
+	}
+}
+
+static struct resource *
+netgpu_add_pages(int nid, u64 start, u64 end)
+{
+	struct mhp_restrictions restrict = {
+		.flags = MHP_MEMBLOCK_API,
+	};
+
+	return add_memory_pages(nid, start, end - start, &restrict);
+}
+
+static void
+netgpu_free_pages(struct resource *res)
+{
+	release_memory_pages(res);
+}
+
+static int
+netgpu_remap_pages(struct netgpu_region *r, u64 va, u64 start, u64 end)
+{
+	struct netgpu_page_range *pr;
+	struct resource *res;
+
+	pr = kmalloc(sizeof(*pr), GFP_KERNEL);
+	if (!pr)
+		return -ENOMEM;
+
+	res = netgpu_add_pages(numa_mem_id(), start, end);
+	if (IS_ERR(res)) {
+		kfree(pr);
+		return PTR_ERR(res);
+	}
+
+	pr->pfn = PHYS_PFN(start);
+	pr->region = r;
+	pr->va_node.start = va;
+	pr->va_node.last = va + (end - start) - 1;
+	pr->res = res;
+
+	netgpu_init_pages(va, PHYS_PFN(start), PHYS_PFN(end));
+
+//	spin_lock(&r->lock);
+	interval_tree_insert(&pr->va_node, &r->root);
+//	spin_unlock(&r->lock);
+
+	return 0;
+}
+
+static int
+netgpu_cuda_map_region(struct netgpu_region *r)
+{
+	struct pci_dev *pdev;
+	int ret;
+
+	pdev = netdev2pci_dev(r->ctx->dev);
+
+	/*
+	 * takes PA from pgtbl, performs mapping, saves mapping
+	 * dma_mapping holds dma mapped addresses, and pdev.
+	 * mem_info contains pgtbl and mapping list.  mapping is added to list.
+	 * rm_p2p_dma_map_pages() does the work.
+	 */
+	ret = nvidia_p2p_dma_map_pages(pdev, r->gpu_pgtbl, &r->dmamap);
+	if (ret) {
+		pr_err("dma map failed: %d\n", ret);
+		goto out;
+	}
+
+out:
+	return ret;
+}
+
+/*
+ * makes GPU pages at va available to other devices.
+ * expensive operation.
+ */
+static int
+netgpu_cuda_add_region(struct netgpu_ctx *ctx, const struct iovec *iov)
+{
+	struct nvidia_p2p_page_table *gpu_pgtbl;
+	struct netgpu_region *r;
+	u64 va, size, start, end, pa;
+	int i, count, gpu_pgsize;
+	int ret;
+
+	start = (u64)iov->iov_base;
+	va = round_down(start, GPU_PAGE_SIZE);
+	size = round_up(start - va + iov->iov_len, GPU_PAGE_SIZE);
+	count = size / PAGE_SIZE;
+
+	ret = -ENOMEM;
+	r = kzalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		goto out;
+
+	/*
+	 * allocates page table, sets gpu_uuid to owning gpu.
+	 * allocates page array, set PA for each page.
+	 * sets page_size (64K here)
+	 * rm_p2p_get_pages() does the actual work.
+	 */
+	ret = nvidia_p2p_get_pages(0, 0, va, size, &gpu_pgtbl,
+				   netgpu_cuda_pgtbl_cb, r);
+	if (ret) {
+		kfree(r);
+		goto out;
+	}
+
+	/* gpu pgtbl owns r, will free via netgpu_cuda_pgtbl_cb */
+	r->gpu_pgtbl = gpu_pgtbl;
+
+	r->start = va;
+	r->len = size;
+	r->root = RB_ROOT_CACHED;
+//	spin_lock_init(&r->lock);
+
+	if (!NVIDIA_P2P_PAGE_TABLE_VERSION_COMPATIBLE(gpu_pgtbl)) {
+		pr_err("incompatible page table\n");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	gpu_pgsize = nvidia_pg_size[gpu_pgtbl->page_size];
+	if (count != gpu_pgtbl->entries * gpu_pgsize / PAGE_SIZE) {
+		pr_err("GPU page count %d != host page count %d\n",
+		       gpu_pgtbl->entries, count);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = xa_err(xa_store_range(&ctx->xa, va >> PAGE_SHIFT,
+				    (va + size) >> PAGE_SHIFT,
+				    r, GFP_KERNEL));
+	if (ret)
+		goto out;
+
+	r->ctx = ctx;
+	refcount_inc(&ctx->ref);
+
+	ret = netgpu_cuda_map_region(r);
+	if (ret)
+		goto out;
+
+	start = U64_MAX;
+	end = 0;
+
+	for (i = 0; i < gpu_pgtbl->entries; i++) {
+		pa = gpu_pgtbl->pages[i]->physical_address;
+		if (pa != end) {
+			if (end) {
+				ret = netgpu_remap_pages(r, va, start, end);
+				if (ret)
+					goto out;
+			}
+			start = pa;
+			va = r->start + i * gpu_pgsize;
+		}
+		end = pa + gpu_pgsize;
+	}
+	ret = netgpu_remap_pages(r, va, start, end);
+	if (ret)
+		goto out;
+
+	return 0;
+
+out:
+	return ret;
+}
+#endif
+
+static void
+netgpu_host_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static int
+netgpu_host_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+				     new_pages) != cur_pages);
+
+	return 0;
+}
+
+static unsigned
+netgpu_init_region(struct netgpu_region *r, const struct iovec *iov,
+		   unsigned align)
+{
+	u64 addr = (u64)iov->iov_base;
+	u32 len = iov->iov_len;
+	unsigned nr_pages;
+
+	r->root = RB_ROOT_CACHED;
+//	spin_lock_init(&r->lock);
+
+	r->start = round_down(addr, align);
+	r->len = round_up(addr - r->start + len, align);
+	nr_pages = r->len / PAGE_SIZE;
+
+	r->page = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
+	r->pginfo = kvmalloc_array(nr_pages, sizeof(struct netgpu_pginfo),
+				   GFP_KERNEL);
+	if (!r->page || !r->pginfo)
+		return 0;
+
+	return nr_pages;
+}
+
+/* NOTE: nr_pages may be negative on error. */
+static void
+netgpu_host_put_pages(struct netgpu_region *r, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		put_page(r->page[i]);
+}
+
+static void
+netgpu_host_release_pages(struct netgpu_region *r, int nr_pages)
+{
+	struct device *device;
+	int i;
+
+	device = netdev2device(r->ctx->dev);
+
+	for (i = 0; i < nr_pages; i++) {
+		dma_unmap_page(device, r->pginfo[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+		put_page(r->page[i]);
+	}
+}
+
+static bool
+netgpu_host_setup_pages(struct netgpu_region *r, unsigned nr_pages)
+{
+	struct device *device;
+	struct page *page;
+	dma_addr_t dma;
+	u64 addr;
+	int i;
+
+	device = netdev2device(r->ctx->dev);
+
+	addr = r->start;
+	for (i = 0; i < nr_pages; i++, addr += PAGE_SIZE) {
+		page = r->page[i];
+		dma = dma_map_page(device, page, 0, PAGE_SIZE,
+				   DMA_BIDIRECTIONAL);
+		if (unlikely(dma_mapping_error(device, dma)))
+			goto out;
+
+		r->pginfo[i].dma = dma;
+		r->pginfo[i].addr = addr;
+	}
+	return true;
+
+out:
+	while (i--)
+		dma_unmap_page(device, r->pginfo[i].dma, PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+
+	return false;
+}
+
+static int
+netgpu_host_add_region(struct netgpu_ctx *ctx, const struct iovec *iov)
+{
+	struct netgpu_region *r;
+	int err, nr_pages;
+	int count = 0;
+
+	err = -ENOMEM;
+	r = kzalloc(sizeof(*r), GFP_KERNEL);
+	if (!r)
+		return err;
+
+	r->ctx = ctx;			/* no refcount for host regions */
+	r->host_memory = true;
+
+	nr_pages = netgpu_init_region(r, iov, PAGE_SIZE);
+	if (!nr_pages)
+		goto out;
+
+	if (ctx->account_mem) {
+		err = netgpu_host_account_mem(ctx->user, nr_pages);
+		if (err)
+			goto out;
+	}
+
+	/* XXX should this be pin_user_pages? */
+	mmap_read_lock(current->mm);
+	count = get_user_pages(r->start, nr_pages,
+			       FOLL_WRITE | FOLL_LONGTERM,
+			       r->page, NULL);
+	mmap_read_unlock(current->mm);
+
+	if (count != nr_pages) {
+		err = count < 0 ? count : -EFAULT;
+		goto out;
+	}
+
+	if (!netgpu_host_setup_pages(r, count))
+		goto out;
+
+	err = xa_err(xa_store_range(&ctx->xa, r->start >> PAGE_SHIFT,
+				    (r->start + r->len) >> PAGE_SHIFT,
+				    r, GFP_KERNEL));
+	if (err)
+		goto out;
+
+	return 0;
+
+out:
+	if (ctx->account_mem)
+		netgpu_host_unaccount_mem(ctx->user, nr_pages);
+	netgpu_host_put_pages(r, count);
+	kvfree(r->page);
+	kvfree(r->pginfo);
+	kfree(r);
+
+	return err;
+}
+
+static int
+netgpu_add_region(struct netgpu_ctx *ctx, void __user *arg)
+{
+	struct dma_region d;
+	int err = -EIO;
+
+	if (!ctx->dev)
+		return -ENODEV;
+
+	if (copy_from_user(&d, arg, sizeof(d)))
+		return -EFAULT;
+
+	if (d.host_memory)
+		err = netgpu_host_add_region(ctx, &d.iov);
+#ifdef USE_CUDA
+	else
+		err = netgpu_cuda_add_region(ctx, &d.iov);
+#endif
+
+	return err;
+}
+
+#ifdef USE_CUDA
+static void
+region_get_pages(struct page **pages, unsigned long pfn, int n)
+{
+	struct page *p;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		p = pfn_to_page(pfn + i);
+		get_page(p);
+		pages[i] = p;
+	}
+}
+
+static int
+netgpu_cuda_get_page(struct netgpu_region *r, unsigned long addr,
+		     struct page **page, dma_addr_t *dma)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx;
+	struct page *p;
+
+	pr = region_find(r, addr, 1);
+	if (!pr)
+		return -EFAULT;
+
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+
+	p = pfn_to_page(pr->pfn + idx);
+	get_page(p);
+	*page = p;
+	*dma = page_to_phys(p);		/* XXX can get away with this for now */
+
+	return 0;
+}
+
+static int
+netgpu_cuda_get_pages(struct netgpu_region *r, struct page **pages,
+		     unsigned long addr, int count)
+{
+	struct netgpu_page_range *pr;
+	unsigned long idx, end;
+	int n;
+
+	pr = region_find(r, addr, count);
+	if (!pr)
+		return -EFAULT;
+
+	idx = (addr - pr->va_node.start) >> PAGE_SHIFT;
+	end = (pr->va_node.last - pr->va_node.start) >> PAGE_SHIFT;
+	n = end - idx + 1;
+	n = min(count, n);
+
+	region_get_pages(pages, pr->pfn + idx, n);
+
+	return n;
+}
+#endif
+
+/* Used by the lib/iov_iter to obtain a set of pages for TX */
+static int
+netgpu_host_get_pages(struct netgpu_region *r, struct page **pages,
+		      unsigned long addr, int count)
+{
+	unsigned long idx;
+	struct page *p;
+	int i, n;
+
+	idx = (addr - r->start) >> PAGE_SHIFT;
+	n = (r->len >> PAGE_SHIFT) - idx + 1;
+	n = min(count, n);
+
+	for (i = 0; i < n; i++) {
+		p = r->page[idx + i];
+		get_page(p);
+		pages[i] = p;
+	}
+
+	return n;
+}
+
+/* Used by the driver to obtain the backing store page for a fill address */
+static int
+netgpu_host_get_page(struct netgpu_region *r, unsigned long addr,
+		     struct page **page, dma_addr_t *dma)
+{
+	unsigned long idx;
+	struct page *p;
+
+	idx = (addr - r->start) >> PAGE_SHIFT;
+
+	p = r->page[idx];
+	get_page(p);
+	set_page_private(p, addr);
+	*page = p;
+	*dma = r->pginfo[idx].dma;
+
+	return 0;
+}
+
+static void
+__netgpu_put_page_any(struct netgpu_ctx *ctx, struct page *page)
+{
+	struct netgpu_pgcache *cache = ctx->any_cache;
+	unsigned count;
+	size_t sz;
+
+	/* unsigned: count == -1 if !cache, so the check will fail. */
+	count = ctx->any_cache_count;
+	if (count < NETGPU_CACHE_COUNT) {
+		cache->page[count] = page;
+		ctx->any_cache_count = count + 1;
+		return;
+	}
+
+	sz = struct_size(cache, page, NETGPU_CACHE_COUNT);
+	cache = kmalloc(sz, GFP_ATOMIC);
+	if (!cache) {
+		/* XXX fixme */
+		pr_err("netgpu: addr 0x%lx lost to overflow\n",
+		       page_private(page));
+		return;
+	}
+	cache->next = ctx->any_cache;
+
+	cache->page[0] = page;
+	ctx->any_cache = cache;
+	ctx->any_cache_count = 1;
+}
+
+static void
+netgpu_put_page_any(struct netgpu_ctx *ctx, struct page *page)
+{
+	spin_lock(&ctx->pgcache_lock);
+
+	__netgpu_put_page_any(ctx, page);
+
+	spin_unlock(&ctx->pgcache_lock);
+}
+
+static void
+netgpu_put_page_napi(struct netgpu_ctx *ctx, struct page *page)
+{
+	struct netgpu_pgcache *spare;
+	unsigned count;
+	size_t sz;
+
+	count = ctx->napi_cache_count;
+	if (count < NETGPU_CACHE_COUNT) {
+		ctx->napi_cache->page[count] = page;
+		ctx->napi_cache_count = count + 1;
+		return;
+	}
+
+	spare = ctx->spare_cache;
+	if (spare) {
+		ctx->spare_cache = NULL;
+		goto out;
+	}
+
+	sz = struct_size(spare, page, NETGPU_CACHE_COUNT);
+	spare = kmalloc(sz, GFP_ATOMIC);
+	if (!spare) {
+		pr_err("netgpu: addr 0x%lx lost to overflow\n",
+		       page_private(page));
+		return;
+	}
+	spare->next = ctx->napi_cache;
+
+out:
+	spare->page[0] = page;
+	ctx->napi_cache = spare;
+	ctx->napi_cache_count = 1;
+}
+
+void
+netgpu_put_page(struct netgpu_ctx *ctx, struct page *page, bool napi)
+{
+	if (napi)
+		netgpu_put_page_napi(ctx, page);
+	else
+		netgpu_put_page_any(ctx, page);
+}
+EXPORT_SYMBOL(netgpu_put_page);
+
+static int
+netgpu_swap_caches(struct netgpu_ctx *ctx, struct netgpu_pgcache **cachep)
+{
+	int count;
+
+	spin_lock(&ctx->pgcache_lock);
+
+	count = ctx->any_cache_count;
+	*cachep = ctx->any_cache;
+	ctx->any_cache = ctx->napi_cache;
+	ctx->any_cache_count = 0;
+
+	spin_unlock(&ctx->pgcache_lock);
+
+	return count;
+}
+
+static struct page *
+netgpu_get_cached_page(struct netgpu_ctx *ctx)
+{
+	struct netgpu_pgcache *cache = ctx->napi_cache;
+	struct page *page;
+	int count;
+
+	count = ctx->napi_cache_count;
+
+	if (!count) {
+		if (cache->next) {
+			if (ctx->spare_cache)
+				kfree(ctx->spare_cache);
+			ctx->spare_cache = cache;
+			cache = cache->next;
+			count = NETGPU_CACHE_COUNT;
+			goto out;
+		}
+
+		/* lockless read of any count - if >0, skip */
+		count = READ_ONCE(ctx->any_cache_count);
+		if (count > 0) {
+			count = netgpu_swap_caches(ctx, &cache);
+			goto out;
+		}
+
+		return NULL;
+out:
+		ctx->napi_cache = cache;
+	}
+
+	page = cache->page[--count];
+	ctx->napi_cache_count = count;
+
+	return page;
+}
+
+/*
+ * Free cache structures.  Pages have already been released.
+ */
+static void
+netgpu_free_cache(struct netgpu_ctx *ctx)
+{
+	struct netgpu_pgcache *cache, *next;
+
+	if (ctx->spare_cache)
+		kfree(ctx->spare_cache);
+	for (cache = ctx->napi_cache; cache; cache = next) {
+		next = cache->next;
+		kfree(cache);
+	}
+	for (cache = ctx->any_cache; cache; cache = next) {
+		next = cache->next;
+		kfree(cache);
+	}
+}
+
+/*
+ * Called from iov_iter when addr is provided for TX.
+ */
+int
+netgpu_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		 int count)
+{
+	struct netgpu_region *r;
+	struct netgpu_ctx *ctx;
+	int n = 0;
+
+	ctx = xa_load(&xa_netgpu, (uintptr_t)sk->sk_user_data);
+	if (!ctx)
+		return -EEXIST;
+
+	r = xa_load(&ctx->xa, addr >> PAGE_SHIFT);
+	if (!r)
+		return -EINVAL;
+
+	if (r->host_memory)
+		n = netgpu_host_get_pages(r, pages, addr, count);
+#ifdef USE_CUDA
+	else
+		n = netgpu_cuda_get_pages(r, pages, addr, count);
+#endif
+
+	return n;
+}
+EXPORT_SYMBOL(netgpu_get_pages);
+
+static int
+netgpu_get_fill_page(struct netgpu_ctx *ctx, dma_addr_t *dma,
+		     struct page **page)
+{
+	struct netgpu_region *r;
+	u64 *addrp, addr;
+	int ret = 0;
+
+	addrp = sq_cons_peek(&ctx->fill);
+	if (!addrp)
+		return -ENOMEM;
+
+	addr = READ_ONCE(*addrp);
+
+	r = xa_load(&ctx->xa, addr >> PAGE_SHIFT);
+	if (!r)
+		return -EINVAL;
+
+	if (r->host_memory)
+		ret = netgpu_host_get_page(r, addr, page, dma);
+#ifdef USE_CUDA
+	else
+		ret = netgpu_cuda_get_page(r, addr, page, dma);
+#endif
+
+	if (!ret)
+		sq_cons_advance(&ctx->fill);
+
+	return ret;
+}
+
+static dma_addr_t
+netgpu_page_get_dma(struct netgpu_ctx *ctx, struct page *page)
+{
+	return page_to_phys(page);		/* XXX cheat for now... */
+}
+
+int
+netgpu_get_page(struct netgpu_ctx *ctx, struct page **page, dma_addr_t *dma)
+{
+	struct page *p;
+
+	p = netgpu_get_cached_page(ctx);
+	if (p) {
+		page_ref_inc(p);
+		*dma = netgpu_page_get_dma(ctx, p);
+		*page = p;
+		return 0;
+	}
+
+	return netgpu_get_fill_page(ctx, dma, page);
+}
+EXPORT_SYMBOL(netgpu_get_page);
+
+static struct page *
+netgpu_get_dummy_page(struct netgpu_ctx *ctx)
+{
+	ctx->page_extra_refc--;
+	if (unlikely(!ctx->page_extra_refc)) {
+		page_ref_add(ctx->dummy_page, NETGPU_REFC_EXTRA);
+		ctx->page_extra_refc = NETGPU_REFC_EXTRA;
+	}
+	return ctx->dummy_page;
+}
+
+/* Our version of __skb_datagram_iter */
+static int
+netgpu_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+		unsigned int offset, size_t len)
+{
+	struct netgpu_ctx *ctx = desc->arg.data;
+	struct sk_buff *frag_iter;
+	struct iovec *iov;
+	struct page *page;
+	unsigned start;
+	int i, used;
+	u64 addr;
+
+	if (skb_headlen(skb)) {
+		pr_err("zc socket receiving non-zc data");
+		return -EFAULT;
+	}
+
+	used = 0;
+	start = 0;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		skb_frag_t *frag;
+		int end, off, frag_len;
+
+		frag = &skb_shinfo(skb)->frags[i];
+		frag_len = skb_frag_size(frag);
+
+		end = start + frag_len;
+		if (offset < end) {
+			off = offset - start;
+
+			iov = sq_prod_reserve(&ctx->rx);
+			if (!iov)
+				break;
+
+			page = skb_frag_page(frag);
+			addr = (u64)page_private(page) + off;
+
+			iov->iov_base = (void *)(addr + skb_frag_off(frag));
+			iov->iov_len = frag_len - off;
+
+			used += (frag_len - off);
+			offset += (frag_len - off);
+
+			put_page(page);
+			page = netgpu_get_dummy_page(ctx);
+			__skb_frag_set_page(frag, page);
+		}
+		start = end;
+	}
+
+	if (used)
+		sq_prod_submit(&ctx->rx);
+
+	skb_walk_frags(skb, frag_iter) {
+		int end, off, ret;
+
+		end = start + frag_iter->len;
+		if (offset < end) {
+			off = offset - start;
+			len = frag_iter->len - off;
+
+			ret = netgpu_recv_skb(desc, frag_iter, off, len);
+			if (ret < 0) {
+				if (!used)
+					used = ret;
+				goto out;
+			}
+			used += ret;
+			if (ret < len)
+				goto out;
+			offset += ret;
+		}
+		start = end;
+	}
+
+out:
+	return used;
+}
+
+static void
+netgpu_read_sock(struct sock *sk, struct netgpu_ctx *ctx)
+{
+	read_descriptor_t desc;
+	int used;
+
+	desc.arg.data = ctx;
+	desc.count = 1;
+	used = tcp_read_sock(sk, &desc, netgpu_recv_skb);
+}
+
+static void
+netgpu_data_ready(struct sock *sk)
+{
+	struct netgpu_ctx *ctx;
+
+	ctx = xa_load(&xa_netgpu, (uintptr_t)sk->sk_user_data);
+	if (ctx && ctx->rx.entries)
+		netgpu_read_sock(sk, ctx);
+
+	sk_data_ready(sk);
+}
+
+static bool netgpu_stream_memory_read(const struct sock *sk)
+{
+	struct netgpu_ctx *ctx;
+	bool empty = false;
+
+	/* sk is not locked.  called from poll, so not sp. */
+	ctx = xa_load(&xa_netgpu, (uintptr_t)sk->sk_user_data);
+	if (ctx)
+		empty = sq_empty(&ctx->rx);
+
+	return !empty;
+}
+
+static struct netgpu_ctx *
+netgpu_file_to_ctx(struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+	struct netgpu_ctx *ctx = seq->private;
+
+	return ctx;
+}
+
+int
+netgpu_register_dma(struct sock *sk, void __user *optval, unsigned int optlen)
+{
+	struct fd f;
+	int netgpu_fd;
+	struct netgpu_ctx *ctx;
+
+	if (sk->sk_user_data)
+		return -EALREADY;
+	if (optlen < sizeof(netgpu_fd))
+		return -EINVAL;
+	if (copy_from_user(&netgpu_fd, optval, sizeof(netgpu_fd)))
+		return -EFAULT;
+
+	f = fdget(netgpu_fd);
+	if (!f.file)
+		return -EBADF;
+
+	if (f.file->f_op != &netgpu_fops) {
+		fdput(f);
+		return -EOPNOTSUPP;
+	}
+
+	/* XXX should really have some way to identify sk_user_data type */
+	ctx = netgpu_file_to_ctx(f.file);
+	sk->sk_user_data = (void *)(uintptr_t)ctx->index;
+
+	fdput(f);
+
+	if (!sk_data_ready)
+		sk_data_ready = sk->sk_data_ready;
+	sk->sk_data_ready = netgpu_data_ready;
+
+	/* XXX does not do any checking here */
+	if (!netgpu_prot.stream_memory_read) {
+		netgpu_prot = *sk->sk_prot;
+		netgpu_prot.stream_memory_read = netgpu_stream_memory_read;
+	}
+	sk->sk_prot = &netgpu_prot;
+
+	return 0;
+}
+EXPORT_SYMBOL(netgpu_register_dma);
+
+static int
+netgpu_validate_queue(struct netgpu_user_queue *q, unsigned elt_size,
+		      unsigned map_off)
+{
+	struct shared_queue_map *map;
+	unsigned count;
+	size_t size;
+
+	if (q->elt_sz != elt_size)
+		return -EINVAL;
+
+	count = roundup_pow_of_two(q->entries);
+	if (!count)
+		return -EINVAL;
+	q->entries = count;
+	q->mask = count - 1;
+
+	size = struct_size(map, data, count * elt_size);
+	if (size == SIZE_MAX || size > U32_MAX)
+		return -EOVERFLOW;
+	q->map_sz = size;
+
+	q->map_off = map_off;
+
+	return 0;
+}
+
+static int
+netgpu_validate_param(struct netgpu_ctx *ctx, struct netgpu_params *p)
+{
+	int rc;
+
+	if (ctx->queue_id != -1)
+		return -EALREADY;
+
+	rc = netgpu_validate_queue(&p->fill, sizeof(u64), NETGPU_OFF_FILL_ID);
+	if (rc)
+		return rc;
+
+	rc = netgpu_validate_queue(&p->rx, sizeof(struct iovec),
+				   NETGPU_OFF_RX_ID);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
+static int
+netgpu_queue_create(struct shared_queue *q, struct netgpu_user_queue *u)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
+			  __GFP_COMP | __GFP_NORETRY;
+	struct shared_queue_map *map;
+
+	map = (void *)__get_free_pages(gfp_flags, get_order(u->map_sz));
+	if (!map)
+		return -ENOMEM;
+
+	q->map_ptr = map;
+	q->prod = &map->prod;
+	q->cons = &map->cons;
+	q->data = &map->data[0];
+	q->elt_sz = u->elt_sz;
+	q->mask = u->mask;
+	q->entries = u->entries;
+
+	memset(&u->off, 0, sizeof(u->off));
+	u->off.prod = offsetof(struct shared_queue_map, prod);
+	u->off.cons = offsetof(struct shared_queue_map, cons);
+	u->off.desc = offsetof(struct shared_queue_map, data);
+
+	return 0;
+}
+
+static int
+netgpu_bind_device(struct netgpu_ctx *ctx, int ifindex)
+{
+	struct net_device *dev;
+	int rc;
+
+	dev = dev_get_by_index(&init_net, ifindex);
+	if (!dev)
+		return -ENODEV;
+
+	if (ctx->dev) {
+		rc = dev == ctx->dev ? 0 : -EALREADY;
+		dev_put(dev);
+		return rc;
+	}
+
+	ctx->dev = dev;
+
+	return 0;
+}
+
+static int
+__netgpu_queue_mgmt(struct net_device *dev, struct netgpu_ctx *ctx,
+		    u32 queue_id)
+{
+	struct netdev_bpf cmd;
+	bpf_op_t ndo_bpf;
+
+	cmd.command = XDP_SETUP_NETGPU;
+	cmd.netgpu.ctx = ctx;
+	cmd.netgpu.queue_id = queue_id;
+
+	ndo_bpf = dev->netdev_ops->ndo_bpf;
+	if (!ndo_bpf)
+		return -EINVAL;
+
+	return ndo_bpf(dev, &cmd);
+}
+
+static int
+netgpu_open_queue(struct netgpu_ctx *ctx, u32 queue_id)
+{
+	return __netgpu_queue_mgmt(ctx->dev, ctx, queue_id);
+}
+
+static int
+netgpu_close_queue(struct netgpu_ctx *ctx, u32 queue_id)
+{
+	return __netgpu_queue_mgmt(ctx->dev, NULL, queue_id);
+}
+
+static int
+netgpu_bind_queue(struct netgpu_ctx *ctx, void __user *arg)
+{
+	struct netgpu_params p;
+	int rc;
+
+	if (!ctx->dev)
+		return -ENODEV;
+
+	if (copy_from_user(&p, arg, sizeof(p)))
+		return -EFAULT;
+
+	rc = netgpu_validate_param(ctx, &p);
+	if (rc)
+		return rc;
+
+	rc = netgpu_queue_create(&ctx->fill, &p.fill);
+	if (rc)
+		return rc;
+
+	rc = netgpu_queue_create(&ctx->rx, &p.rx);
+	if (rc)
+		return rc;
+
+	rc = netgpu_open_queue(ctx, p.queue_id);
+	if (rc)
+		return rc;
+	ctx->queue_id = p.queue_id;
+
+	if (copy_to_user(arg, &p, sizeof(p)))
+		return -EFAULT;
+		/* XXX leaks ring here ... */
+
+	return rc;
+}
+
+static int
+netgpu_attach_dev(struct netgpu_ctx *ctx, void __user *arg)
+{
+	int ifindex;
+
+	if (copy_from_user(&ifindex, arg, sizeof(ifindex)))
+		return -EFAULT;
+
+	return netgpu_bind_device(ctx, ifindex);
+}
+
+static long
+netgpu_ioctl(struct file *file, unsigned cmd, unsigned long arg)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+
+	switch (cmd) {
+	case NETGPU_IOCTL_ATTACH_DEV:
+		return netgpu_attach_dev(ctx, (void __user *)arg);
+
+	case NETGPU_IOCTL_BIND_QUEUE:
+		return netgpu_bind_queue(ctx, (void __user *)arg);
+
+	case NETGPU_IOCTL_ADD_REGION:
+		return netgpu_add_region(ctx, (void __user *)arg);
+	}
+	return -ENOTTY;
+}
+
+static int
+netgpu_show(struct seq_file *seq_file, void *private)
+{
+	return 0;
+}
+
+static struct netgpu_ctx *
+netgpu_create_ctx(void)
+{
+	struct netgpu_ctx *ctx;
+	size_t sz;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->account_mem = !capable(CAP_IPC_LOCK);
+	ctx->user = get_uid(current_user());
+
+	sz = struct_size(ctx->napi_cache, page, NETGPU_CACHE_COUNT);
+	ctx->napi_cache = kmalloc(sz, GFP_KERNEL);
+	if (!ctx->napi_cache)
+		goto out;
+	ctx->napi_cache->next = NULL;
+
+	ctx->dummy_page = alloc_page(GFP_KERNEL);
+	if (!ctx->dummy_page)
+		goto out;
+
+	spin_lock_init(&ctx->pgcache_lock);
+	xa_init(&ctx->xa);
+	refcount_set(&ctx->ref, 1);
+	ctx->queue_id = -1;
+	ctx->any_cache_count = -1;
+
+	/* Set dummy page refs to MAX, with extra to hand out */
+	page_ref_add(ctx->dummy_page, NETGPU_REFC_MAX - 1);
+	ctx->page_extra_refc = NETGPU_REFC_EXTRA;
+
+	return (ctx);
+
+out:
+	free_uid(ctx->user);
+	kfree(ctx->napi_cache);
+	if (ctx->dummy_page)
+		put_page(ctx->dummy_page);
+	kfree(ctx);
+
+	return NULL;
+}
+
+static int
+netgpu_open(struct inode *inode, struct file *file)
+{
+	struct netgpu_ctx *ctx;
+	int err;
+
+	ctx = netgpu_create_ctx();
+	if (!ctx)
+		return -ENOMEM;
+
+	__module_get(THIS_MODULE);
+
+	/* miscdevice inits (but doesn't use) private_data.
+	 * single_open wants to use it, so set to NULL first.
+	 */
+	file->private_data = NULL;
+	err = single_open(file, netgpu_show, ctx);
+	if (err)
+		goto out;
+
+	mutex_lock(&netgpu_lock);
+	ctx->index = ++netgpu_index;
+	mutex_unlock(&netgpu_lock);
+
+	/* XXX retval... */
+	xa_store(&xa_netgpu, ctx->index, ctx, GFP_KERNEL);
+
+	return 0;
+
+out:
+	netgpu_free_ctx(ctx);
+
+	return err;
+}
+
+#ifdef USE_CUDA
+static void
+netgpu_cuda_free_page_range(struct netgpu_page_range *pr)
+{
+	unsigned long pfn, pfn_end;
+	struct page *page;
+
+	pfn_end = pr->pfn +
+		  ((pr->va_node.last + 1 - pr->va_node.start) >> PAGE_SHIFT);
+
+	for (pfn = pr->pfn; pfn < pfn_end; pfn++) {
+		page = pfn_to_page(pfn);
+		set_page_count(page, 0);
+	}
+	netgpu_free_pages(pr->res);
+	kfree(pr);
+}
+
+static void
+netgpu_cuda_release_resources(struct netgpu_region *r)
+{
+	struct pci_dev *pdev;
+	int ret;
+
+	if (r->dmamap) {
+		pdev = netdev2pci_dev(r->ctx->dev);
+		ret = nvidia_p2p_dma_unmap_pages(pdev, r->gpu_pgtbl, r->dmamap);
+		if (ret)
+			pr_err("nvidia_p2p_dma_unmap failed: %d\n", ret);
+	}
+}
+
+static void
+netgpu_cuda_free_region(struct netgpu_region *r)
+{
+	struct interval_tree_node *va_node;
+	int ret;
+
+	netgpu_cuda_release_resources(r);
+
+	region_remove_each(r, va_node)
+		netgpu_cuda_free_page_range(node2page_range(va_node));
+
+	/* NB: this call is a NOP in the current code */
+	ret = nvidia_p2p_free_page_table(r->gpu_pgtbl);
+	if (ret)
+		pr_err("nvidia_p2p_free_page_table error %d\n", ret);
+
+	/* erase if initial store was successful */
+	if (r->ctx) {
+		xa_store_range(&r->ctx->xa, r->start >> PAGE_SHIFT,
+			       (r->start + r->len) >> PAGE_SHIFT,
+			       NULL, GFP_KERNEL);
+		netgpu_free_ctx(r->ctx);
+	}
+
+	kfree(r);
+}
+#endif
+
+static void
+netgpu_host_free_region(struct netgpu_ctx *ctx, struct netgpu_region *r)
+{
+	unsigned nr_pages;
+
+	if (!r->host_memory)
+		return;
+
+	nr_pages = r->len / PAGE_SIZE;
+
+	xa_store_range(&ctx->xa, r->start >> PAGE_SHIFT,
+		      (r->start + r->len) >> PAGE_SHIFT,
+		      NULL, GFP_KERNEL);
+
+	if (ctx->account_mem)
+		netgpu_host_unaccount_mem(ctx->user, nr_pages);
+	netgpu_host_release_pages(r, nr_pages);
+	kvfree(r->page);
+	kvfree(r->pginfo);
+	kfree(r);
+}
+
+static void
+__netgpu_free_ctx(struct netgpu_ctx *ctx)
+{
+	struct netgpu_region *r;
+	unsigned long index;
+
+	xa_for_each(&ctx->xa, index, r)
+		netgpu_host_free_region(ctx, r);
+
+	xa_destroy(&ctx->xa);
+
+	netgpu_free_cache(ctx);
+	free_uid(ctx->user);
+	ctx->page_extra_refc += (NETGPU_REFC_SELF - 1);
+	page_ref_sub(ctx->dummy_page, ctx->page_extra_refc);
+	put_page(ctx->dummy_page);
+	if (ctx->dev)
+		dev_put(ctx->dev);
+	kfree(ctx);
+
+	module_put(THIS_MODULE);
+}
+
+static void
+netgpu_free_ctx(struct netgpu_ctx *ctx)
+{
+	if (refcount_dec_and_test(&ctx->ref))
+		__netgpu_free_ctx(ctx);
+}
+
+static int
+netgpu_release(struct inode *inode, struct file *file)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+	int ret;
+
+	if (ctx->queue_id != -1)
+		netgpu_close_queue(ctx, ctx->queue_id);
+
+	xa_erase(&xa_netgpu, ctx->index);
+
+	netgpu_free_ctx(ctx);
+
+	ret = single_release(inode, file);
+
+	return ret;
+}
+
+static void *
+netgpu_validate_mmap_request(struct file *file, loff_t pgoff, size_t sz)
+{
+	struct netgpu_ctx *ctx = netgpu_file_to_ctx(file);
+	loff_t offset = pgoff << PAGE_SHIFT;
+	struct page *page;
+	void *ptr;
+
+	/* each returned ptr is a separate allocation. */
+	switch (offset) {
+	case NETGPU_OFF_FILL_ID:
+		ptr = ctx->fill.map_ptr;
+		break;
+	case NETGPU_OFF_RX_ID:
+		ptr = ctx->rx.map_ptr;
+		break;
+	default:
+		return ERR_PTR(-EINVAL);
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > page_size(page))
+		return ERR_PTR(-EINVAL);
+
+	return ptr;
+}
+
+static int
+netgpu_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	size_t sz = vma->vm_end - vma->vm_start;
+	unsigned long pfn;
+	void *ptr;
+
+	ptr = netgpu_validate_mmap_request(file, vma->vm_pgoff, sz);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+static const struct file_operations netgpu_fops = {
+	.owner =		THIS_MODULE,
+	.open =			netgpu_open,
+	.mmap =			netgpu_mmap,
+	.unlocked_ioctl =	netgpu_ioctl,
+	.release =		netgpu_release,
+};
+
+static struct miscdevice netgpu_miscdev = {
+	.minor		= MISC_DYNAMIC_MINOR,
+	.name		= "netgpu",
+	.fops		= &netgpu_fops,
+};
+
+static int __init
+netgpu_init(void)
+{
+	mutex_init(&netgpu_lock);
+	misc_register(&netgpu_miscdev);
+
+	return 0;
+}
+
+static void __exit
+netgpu_fini(void)
+{
+	misc_deregister(&netgpu_miscdev);
+}
+
+module_init(netgpu_init);
+module_exit(netgpu_fini);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("jlemon@flugsvamp.com");
diff --git a/include/uapi/misc/netgpu.h b/include/uapi/misc/netgpu.h
new file mode 100644
index 000000000000..ca3338464218
--- /dev/null
+++ b/include/uapi/misc/netgpu.h
@@ -0,0 +1,43 @@
+#pragma once
+
+#include <linux/ioctl.h>
+
+/* VA memory provided by a specific PCI device. */
+struct dma_region {
+	struct iovec iov;
+	unsigned host_memory : 1;
+};
+
+#define NETGPU_OFF_FILL_ID	(0ULL << 12)
+#define NETGPU_OFF_RX_ID	(1ULL << 12)
+
+struct netgpu_queue_offsets {
+	unsigned prod;
+	unsigned cons;
+	unsigned desc;
+	unsigned resv;
+};
+
+struct netgpu_user_queue {
+	unsigned elt_sz;
+	unsigned entries;
+	unsigned mask;
+	unsigned map_sz;
+	unsigned map_off;
+	struct netgpu_queue_offsets off;
+};
+
+struct netgpu_params {
+	unsigned flags;
+	unsigned ifindex;
+	unsigned queue_id;
+	unsigned resv;
+	struct netgpu_user_queue fill;
+	struct netgpu_user_queue rx;
+};
+
+#define NETGPU_IOCTL_ATTACH_DEV		_IOR(0, 1, int)
+#define NETGPU_IOCTL_BIND_QUEUE		_IOWR(0, 2, struct netgpu_params)
+#define NETGPU_IOCTL_SETUP_RING		_IOWR(0, 2, struct netgpu_params)
+#define NETGPU_IOCTL_ADD_REGION		_IOW(0, 3, struct dma_region)
+
-- 
2.24.1


^ permalink raw reply related
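
Note on the queue sizing in netgpu_validate_queue() in the patch above: the
entry count is rounded up to a power of two and the mask is set to count - 1.
The following is a minimal userspace sketch of that idea, not the patch's
code. The helper names ring_roundup_pow2() and ring_slot() are hypothetical,
the kernel's roundup_pow_of_two() is replaced by a plain loop, and it assumes
the producer/consumer counters are free-running, which this excerpt does not
show.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's roundup_pow_of_two().
 * Returns 0 when n is 0 or when the result would not fit in 32 bits,
 * so the caller can reject the request.
 */
uint32_t ring_roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	if (n == 0 || n > (UINT32_C(1) << 31))
		return 0;
	while (p < n)
		p <<= 1;
	return p;
}

/* Map a counter onto a slot index; count must be a power of two. */
uint32_t ring_slot(uint32_t counter, uint32_t count)
{
	return counter & (count - 1);
}
```

With a power-of-two count, the slot lookup is a single AND rather than a
modulo, which is presumably why the patch stores q->mask next to q->entries.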

* [RFC PATCH 09/21] include: add definitions for netgpu
From: Jonathan Lemon @ 2020-06-18 16:09 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, axboe
In-Reply-To: <20200618160941.879717-1-jonathan.lemon@gmail.com>

This adds the netgpu structure (which arguably should be private),
and some cruft to support using netgpu as a loadable module, which
should disappear.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/net/netgpu.h | 65 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 include/net/netgpu.h

diff --git a/include/net/netgpu.h b/include/net/netgpu.h
new file mode 100644
index 000000000000..fee84ba3db78
--- /dev/null
+++ b/include/net/netgpu.h
@@ -0,0 +1,65 @@
+#pragma once
+
+struct net_device;
+#include <uapi/misc/shqueue.h>		/* XXX */
+
+struct netgpu_pgcache {
+	struct netgpu_pgcache *next;
+	struct page *page[];
+};
+
+struct netgpu_ctx {
+	struct xarray xa;		/* contains regions */
+	unsigned int index;
+	refcount_t ref;
+	struct shared_queue fill;
+	struct shared_queue rx;
+	struct net_device *dev;
+	struct netgpu_pgcache *napi_cache;
+	struct netgpu_pgcache *spare_cache;
+	struct netgpu_pgcache *any_cache;
+	spinlock_t pgcache_lock;
+	struct page *dummy_page;
+	unsigned page_extra_refc;
+	int queue_id;
+	int napi_cache_count;
+	int any_cache_count;
+	struct user_struct *user;
+	unsigned account_mem : 1;
+};
+
+int netgpu_get_page(struct netgpu_ctx *ctx, struct page **page,
+		    dma_addr_t *dma);
+void netgpu_put_page(struct netgpu_ctx *ctx, struct page *page, bool napi);
+int netgpu_get_pages(struct sock *sk, struct page **pages, unsigned long addr,
+		 int count);
+
+/*---------------------------------------------------------------------------*/
+/* XXX temporary development support */
+
+extern int (*fn_netgpu_get_page)(struct netgpu_ctx *ctx,
+			  struct page **page, dma_addr_t *dma);
+extern void (*fn_netgpu_put_page)(struct netgpu_ctx *, struct page *, bool);
+extern int (*fn_netgpu_get_pages)(struct sock *, struct page **,
+                           unsigned long, int);
+extern struct netgpu_ctx *g_ctx;
+
+static inline int
+__netgpu_get_page(struct netgpu_ctx *ctx,
+                  struct page **page, dma_addr_t *dma)
+{
+        return fn_netgpu_get_page(ctx, page, dma);
+}
+
+static inline void
+__netgpu_put_page(struct netgpu_ctx *ctx, struct page *page, bool napi)
+{
+        return fn_netgpu_put_page(ctx, page, napi);
+}
+
+static inline int
+__netgpu_get_pages(struct sock *sk, struct page **pages,
+                   unsigned long addr, int count)
+{
+        return fn_netgpu_get_pages(sk, pages, addr, count);
+}
-- 
2.24.1


^ permalink raw reply related

* RE: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Ioana Ciornei @ 2020-06-18 16:17 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Vladimir Oltean,
	Claudiu Manoil, Alexandru Marginean, michael@walle.cc,
	andrew@lunn.ch, f.fainelli@gmail.com
In-Reply-To: <20200618140623.GC1551@shell.armlinux.org.uk>

> Subject: Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
> 
> On Thu, Jun 18, 2020 at 03:08:36PM +0300, Ioana Ciornei wrote:
> > Add a Lynx PCS MDIO module which exposes the necessary operations to
> > drive the PCS using PHYLINK.
> >
> > The majority of the code is extracted from the Felix DSA driver, which
> > will be also modified in a later patch, and exposed as a separate
> > module for code reusability purposes.
> >
> > At the moment, USXGMII (only with in-band AN and speeds up to 2500),
> > SGMII, QSGMII and 2500Base-X (only w/o in-band AN) are supported by
> > the Lynx PCS MDIO module since these were also supported by Felix.
> >
> > The module can only be enabled by the drivers in need and not user
> > selectable.
> 
> Is this the same PCS found in the LX2160A?  It looks very similar.
> 

Yes, it is.
I already tested these protocols on LX2160A (and some other DPAA2 SoCs).
The idea is to have this patch set without any functional changes accepted and
then I will wire up dpaa2-eth as well into this.

> > +/* 2500Base-X is SerDes protocol 7 on Felix and 6 on ENETC. It is a SerDes lane
> > + * clocked at 3.125 GHz which encodes symbols with 8b/10b and does not have
> > + * auto-negotiation of any link parameters. Electrically it is compatible with
> > + * a single lane of XAUI.
> > + * The hardware reference manual wants to call this mode SGMII, but it isn't
> > + * really, since the fundamental features of SGMII:
> > + * - Downgrading the link speed by duplicating symbols
> > + * - Auto-negotiation
> > + * are not there.
> 
> I welcome that others are waking up to the industry wide obfuscation of
> terminology surrounding "SGMII" and "1000base-X", and calling it out where it is
> blatantly incorrectly described in documentation.

I will not take the credit for this since this is mainly just a comment being moved.

> 
> > + * The speed is configured at 1000 in the IF_MODE because the clock frequency
> > + * is actually given by a PLL configured in the Reset Configuration Word (RCW).
> > + * Since there is no difference between fixed speed SGMII w/o AN and 802.3z w/o
> > + * AN, we call this PHY interface type 2500Base-X. In case a PHY negotiates a
> > + * lower link speed on line side, the system-side interface remains fixed at
> > + * 2500 Mbps and we do rate adaptation through pause frames.
> 
> We have systems that do have AN with 2500base-X however - which is what you
> want when you couple two potentially remote systems over a fibre cable.  The
> AN in 802.3z (1000base-X) is used to negotiate:
> 
> - duplex
> - pause mode
> 
> although in practice, half-duplex is not supported by lots of hardware, which
> leaves just pause mode.  It is useful to have pause mode negotiation remain
> present, whether it's 1000base-X or 2500base-X, but obviously within the
> hardware boundaries.
> 
> I suspect the hardware is capable of supporting 802.3z AN when operating at
> 2500base-X, but not the SGMII symbol duplication for slower speeds.
> 

I don't have a definitive answer to this right now; I'll have to actually test it
if I can get my hands on some hardware for this setup.

> > +struct mdio_lynx_pcs *mdio_lynx_pcs_create(struct mdio_device *mdio_dev)
> > +{
> > +	struct mdio_lynx_pcs *pcs;
> > +
> > +	if (WARN_ON(!mdio_dev))
> > +		return NULL;
> > +
> > +	pcs = kzalloc(sizeof(*pcs), GFP_KERNEL);
> > +	if (!pcs)
> > +		return NULL;
> > +
> > +	pcs->dev = mdio_dev;
> > +	pcs->an_restart = lynx_pcs_an_restart;
> > +	pcs->get_state = lynx_pcs_get_state;
> > +	pcs->link_up = lynx_pcs_link_up;
> > +	pcs->config = lynx_pcs_config;
> 
> We really should not have these private structure interfaces.  Private
> structure-based, driver-specific interfaces really don't constitute a sane
> approach to interface design.
> 
> Would it work if there was a "struct mdio_device" added to the phylink_config
> structure, and then you could have the phylink_pcs_ops embedded into this
> driver?

I think that would restrict the usage too much.
I am afraid there will be instances where the PCS is not recognizable by PHY_ID,
so there would be no way of knowing which driver should probe which mdio_device.
Also, I would leave it to the driver to choose whether or not to use the
functions exported by Lynx.

> 
> If not, then we need some kind of mdio_pcs_device that offers this kind of
> functionality.
> 

Maybe we can meet in the middle?

What if we export the helper functions directly as symbols which can
be used by the driver without any mdio_lynx_pcs in the middle
(just the mdio_device passed to the function)?
This would be exactly how phylink_mii_c22_pcs_[an_restart/config] are currently
used.

We can somehow standardize the function prototypes (which will likely mean
mdio_device instead of the phylink_pcs_ops's phylink_config).

Ioana

^ permalink raw reply
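
Note on the 2500Base-X comment quoted in the message above: the 2500 Mbps
figure follows from the 3.125 GHz lane clock and 8b/10b coding, where every
10 line bits carry 8 data bits. The sketch below only illustrates that
arithmetic. The function name payload_mbps_8b10b() is hypothetical, and the
1.25 GBd value for 1000base-X in the usage note is added here for comparison
and does not come from the thread.

```c
#include <stdint.h>

/* Payload rate in Mbps for a lane using 8b/10b coding:
 * every 10 line bits carry 8 data bits.
 */
uint32_t payload_mbps_8b10b(uint32_t line_rate_mbaud)
{
	return (uint32_t)((uint64_t)line_rate_mbaud * 8 / 10);
}
```

Calling payload_mbps_8b10b(3125) gives the 2500 Mbps of 2500Base-X, and the
same function applied to a 1250 MBd lane gives the 1000 Mbps of 1000base-X.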

* Re: [PATCH net] selftests/net: report etf errors correctly
From: Willem de Bruijn @ 2020-06-18 16:18 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Willem de Bruijn, Network Development, David Miller
In-Reply-To: <20200618085416.48b44e51@kicinski-fedora-PC1C0HJN>

On Thu, Jun 18, 2020 at 11:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 18 Jun 2020 10:55:49 -0400 Willem de Bruijn wrote:
> > +             switch (err->ee_errno) {
> > +             case ECANCELED:
> > +                     if (err->ee_code != SO_EE_CODE_TXTIME_MISSED)
> > +                             error(1, 0, "errqueue: unknown ECANCELED %u\n",
> > +                                         err->ee_code);
> > +                     reason = "missed txtime";
> > +             break;
> > +             case EINVAL:
> > +                     if (err->ee_code != SO_EE_CODE_TXTIME_INVALID_PARAM)
> > +                             error(1, 0, "errqueue: unknown EINVAL %u\n",
> > +                                         err->ee_code);
> > +                     reason = "invalid txtime";
> > +             break;
> > +             default:
> > +                     error(1, 0, "errqueue: errno %u code %u\n",
> > +                           err->ee_errno, err->ee_code);
> > +             };
> >
> >               tstamp = ((int64_t) err->ee_data) << 32 | err->ee_info;
> >               tstamp -= (int64_t) glob_tstart;
> >               tstamp /= 1000 * 1000;
> > -             fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped\n",
> > -                             data[ret - 1], tstamp);
> > +             fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
> > +                             data[ret - 1], tstamp, reason);
>
> Hi Willem! Checkpatch is grumpy about some misalignment here:
>
> CHECK: Alignment should match open parenthesis
> #67: FILE: tools/testing/selftests/net/so_txtime.c:187:
> +                               error(1, 0, "errqueue: unknown ECANCELED %u\n",
> +                                           err->ee_code);
>
> CHECK: Alignment should match open parenthesis
> #73: FILE: tools/testing/selftests/net/so_txtime.c:193:
> +                               error(1, 0, "errqueue: unknown EINVAL %u\n",
> +                                           err->ee_code);
>
> CHECK: Alignment should match open parenthesis
> #87: FILE: tools/testing/selftests/net/so_txtime.c:205:
> +               fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
> +                               data[ret - 1], tstamp, reason);

Thanks for the heads-up, Jakub.

I decided to follow the convention in the file, which is to align with
the start of the string.

Given that, do you want me to resubmit with the revised offset? I'm
fine either way, of course.

Also, which incantation of checkpatch do you use? I did run
checkpatch, without extra args, and it did not warn me about this.

^ permalink raw reply

* Re: [PATCH net] net: dsa: bcm_sf2: Fix node reference count
From: Florian Fainelli @ 2020-06-18 16:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Vivien Didelot, David S. Miller, Jakub Kicinski,
	open list
In-Reply-To: <20200618125640.GL249144@lunn.ch>



On 6/18/2020 5:56 AM, Andrew Lunn wrote:
> On Wed, Jun 17, 2020 at 08:42:44PM -0700, Florian Fainelli wrote:
>> of_find_node_by_name() will do an of_node_put() on the "from" argument.
> 
>> Fixes: afa3b592953b ("net: dsa: bcm_sf2: Ensure correct sub-node is parsed")
>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>> ---
>>  drivers/net/dsa/bcm_sf2.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
>> index c1bd21e4b15c..9f62ba3e4345 100644
>> --- a/drivers/net/dsa/bcm_sf2.c
>> +++ b/drivers/net/dsa/bcm_sf2.c
>> @@ -1154,6 +1154,8 @@ static int bcm_sf2_sw_probe(struct platform_device *pdev)
>>  	set_bit(0, priv->cfp.used);
>>  	set_bit(0, priv->cfp.unique);
>>  
>> +	/* Balance of_node_put() done by of_find_node_by_name() */
>> +	of_node_get(dn);
>>  	ports = of_find_node_by_name(dn, "ports");
> 
> That of_find_node_by_name() does a put is not very intuitive.
> Maybe document that as well in the kerneldocs?

Yes, that is the plan. Most callers pass a NULL "from" argument, but that is a
bit silly: if you know what the Device Tree looks like, you can get to the
target node more quickly. Thanks.

> 
> Reviewed-by: Andrew Lunn <andrew@lunn.ch>
> 
>     Andrew
> 

-- 
Florian

^ permalink raw reply
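
Note on the reference counting discussed in the message above:
of_find_node_by_name() drops a reference on its "from" argument and returns
the match with a reference held. The following is a toy userspace model of
that behaviour, assuming a simple linked list in place of the real device
tree. Every toy_* name is hypothetical, and a NULL "from" (which the kernel
treats as "search all nodes") is not modelled.

```c
#include <stddef.h>
#include <string.h>

/* Toy node with a reference count; a stand-in for struct device_node. */
struct toy_node {
	const char *name;
	int refcount;
	struct toy_node *next;	/* next node in search order */
};

struct toy_node *toy_node_get(struct toy_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

void toy_node_put(struct toy_node *n)
{
	if (n)
		n->refcount--;
}

/* Models the behaviour discussed above: search the nodes after 'from',
 * take a reference on the match, and drop one reference on 'from'.
 */
struct toy_node *toy_find_by_name(struct toy_node *from, const char *name)
{
	struct toy_node *n;

	for (n = from ? from->next : NULL; n; n = n->next)
		if (strcmp(n->name, name) == 0)
			break;
	toy_node_get(n);
	toy_node_put(from);
	return n;
}

/* Caller pattern from the patch: take an extra reference first so the
 * caller's own reference on 'dn' survives the lookup.
 */
struct toy_node *toy_find_balanced(struct toy_node *dn, const char *name)
{
	toy_node_get(dn);
	return toy_find_by_name(dn, name);
}
```

In this model, toy_find_balanced() leaves the caller's count on dn unchanged,
while calling toy_find_by_name() directly consumes one reference, which is
the imbalance the patch corrects.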

* Re: [PATCH net] selftests/net: report etf errors correctly
From: Jakub Kicinski @ 2020-06-18 16:36 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, David Miller
In-Reply-To: <CA+FuTSeLneTOB10Vd+wO2LFmU9eY_zQJJ0QvX7JbCW9C1ef=ew@mail.gmail.com>

On Thu, 18 Jun 2020 12:18:01 -0400 Willem de Bruijn wrote:
> On Thu, Jun 18, 2020 at 11:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Thu, 18 Jun 2020 10:55:49 -0400 Willem de Bruijn wrote:  
> > > +             switch (err->ee_errno) {
> > > +             case ECANCELED:
> > > +                     if (err->ee_code != SO_EE_CODE_TXTIME_MISSED)
> > > +                             error(1, 0, "errqueue: unknown ECANCELED %u\n",
> > > +                                         err->ee_code);
> > > +                     reason = "missed txtime";
> > > +             break;
> > > +             case EINVAL:
> > > +                     if (err->ee_code != SO_EE_CODE_TXTIME_INVALID_PARAM)
> > > +                             error(1, 0, "errqueue: unknown EINVAL %u\n",
> > > +                                         err->ee_code);
> > > +                     reason = "invalid txtime";
> > > +             break;
> > > +             default:
> > > +                     error(1, 0, "errqueue: errno %u code %u\n",
> > > +                           err->ee_errno, err->ee_code);
> > > +             };
> > >
> > >               tstamp = ((int64_t) err->ee_data) << 32 | err->ee_info;
> > >               tstamp -= (int64_t) glob_tstart;
> > >               tstamp /= 1000 * 1000;
> > > -             fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped\n",
> > > -                             data[ret - 1], tstamp);
> > > +             fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
> > > +                             data[ret - 1], tstamp, reason);  
> >
> > Hi Willem! Checkpatch is grumpy about some misalignment here:
> >
> > CHECK: Alignment should match open parenthesis
> > #67: FILE: tools/testing/selftests/net/so_txtime.c:187:
> > +                               error(1, 0, "errqueue: unknown ECANCELED %u\n",
> > +                                           err->ee_code);
> >
> > CHECK: Alignment should match open parenthesis
> > #73: FILE: tools/testing/selftests/net/so_txtime.c:193:
> > +                               error(1, 0, "errqueue: unknown EINVAL %u\n",
> > +                                           err->ee_code);
> >
> > CHECK: Alignment should match open parenthesis
> > #87: FILE: tools/testing/selftests/net/so_txtime.c:205:
> > +               fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
> > +                               data[ret - 1], tstamp, reason);  
> 
> Thanks for the heads-up, Jakub.
> 
> I decided to follow the convention in the file, which is to align with
> the start of the string.

Ack, I remember the selftest was added with a larger series so I didn't
want to complain about minutia :)

> Given that, do you want me to resubmit with the revised offset? I'm
> fine either way, of course.

No strong feelings, perhaps it's fine if the rest of the file is
like that already.

> Also, which incantation of checkpatch do you use? I did run
> checkpatch, without extra args, and it did not warn me about this.

I run with --strict, and pick the warnings which make sense.

^ permalink raw reply

* Re: [PATCH v5 3/3] net: phy: mscc: handle the clkout control on some phy variants
From: Russell King - ARM Linux admin @ 2020-06-18 16:40 UTC (permalink / raw)
  To: Heiko Stübner
  Cc: Andrew Lunn, davem, kuba, robh+dt, f.fainelli, hkallweit1, netdev,
	devicetree, linux-kernel, christoph.muellner
In-Reply-To: <1723854.ZAnHLLU950@diego>

On Thu, Jun 18, 2020 at 06:01:29PM +0200, Heiko Stübner wrote:
> Am Donnerstag, 18. Juni 2020, 17:47:48 CEST schrieb Russell King - ARM Linux admin:
> > On Thu, Jun 18, 2020 at 05:41:54PM +0200, Heiko Stübner wrote:
> > > Though I'm not sure how this fits in the whole bringup of ethernet phys.
> > > Like the phy is dependent on the underlying ethernet controller to
> > > actually turn it on.
> > > 
> > > I guess we should check the phy-state and if it's not accessible, just
> > > keep the values and if it's in a suitable state do the configuration.
> > > 
> > > Calling a vsc8531_config_clkout() from both the vsc8531_config_init()
> > > as well as the clk_(un-)prepare  and clk_set_rate functions and being
> > > protected by a check against phy_is_started() ?
> > 
> > It sounds like it doesn't actually fit the clk API paradigm then.  I
> > see that Rob suggested it, and from the DT point of view, it makes
> > complete sense, but then if the hardware can't actually be used in
> > the way the clk API expects it to be used, then there's a semantic
> > problem.
> > 
> > What is this clock used for?
> 
> It provides a source for the mac-clk for the actual transfers, here to
> provide the 125MHz clock needed for the RGMII interface.
> 
> So right now the old rk3368-lion devicetree just declares a stub
> fixed-clock and instructs the soc's clock controller to use it [0].
> And in the cover-letter here, I show the updated variant using
> the clock defined here.
> 
> 
> I've added the idea from my previous mail as shown below [1],
> which would take the phy-state into account.
> 
> But I guess I'll wait for more input before spamming people with v6.

Let's get a handle on exactly what this is.

The RGMII bus has two clocks: RXC and TXC, which are clocked at one
of 125MHz, 25MHz or 2.5MHz depending on the RGMII data rate.  Some
PHYs replace TXC with GTX clock, which always runs at 125MHz.  These
clocks are not what you're referring to.

You are referring to another commonly provided clock between the MAC
and the PHY, something which is not unique to your PHY.

We seem to be heading down a path where different PHYs end up doing
different things in DT for what is basically the same hardware setup,
which really isn't good. :(

We have at803x using:

qca,clk-out-frequency
qca,clk-out-strength
qca,keep-pll-enabled

which are used to control the CLK_25M output pin on the device, which
may be used to provide a reference clock for the MAC side, selecting
between 25M, 50M, 62.5M and 125MHz.  This was introduced in November
2019, so not that long ago.

Broadcom PHYs configure their 125MHz clock through the PHY device
flags passed from the MAC at attach/connect time.

There's the dp83867 and dp83869 configuration (I'm not sure I can
make sense of it from reading the code) using ti,clk-output-sel -
but it looks like it's the same kind of thing.  Introduced February
2018 into one driver, and November 2019 in the other.

It seems the Micrel PHYs produce a 125MHz clock irrespective of any
configuration (maybe configured by firmware, or hardware strapping?)

So it seems we have four ways of doing the same thing today, and now
the suggestion is to implement a fifth different way.  I think there
needs to be some consolidation here, maybe choosing one approach and
sticking with it.

Hence, I disagree with Rob - we don't need a fifth approach, we need
to choose one approach and decide that's our policy for this and
apply it evenly across the board, rather than making up something
different each time a new PHY comes along.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply
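
Note on the RGMII clocks described in the message above: RXC and TXC run at
125MHz, 25MHz or 2.5MHz depending on the data rate. The sketch below spells
out that mapping under the usual assumption that those rates correspond to
1000, 100 and 10 Mbps links. The function name rgmii_clk_hz() is hypothetical
and is not taken from any PHY driver.

```c
#include <stdint.h>

/* RGMII RXC/TXC frequency in Hz for a given link speed in Mbps.
 * Returns 0 for speeds RGMII does not define.
 */
uint32_t rgmii_clk_hz(unsigned int speed_mbps)
{
	switch (speed_mbps) {
	case 1000:
		return 125000000;
	case 100:
		return 25000000;
	case 10:
		return 2500000;
	default:
		return 0;
	}
}
```

This is separate from the fixed reference clock output (CLK_25M, clkout and
similar) that the thread is actually about; that one does not follow the
link speed.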

* [PATCH net v2] selftests/net: report etf errors correctly
From: Willem de Bruijn @ 2020-06-18 16:40 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The ETF qdisc can queue skbs that it could not pace on the errqueue.

Address a few issues in the selftest:

- recv buffer size was too small, and incorrectly calculated
- compared errno to ee_code instead of ee_errno
- missed invalid request error type

v2:
  - fix a few checkpatch --strict indentation warnings

Fixes: ea6a547669b3 ("selftests/net: make so_txtime more robust to timer variance")
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/so_txtime.c | 33 +++++++++++++++++++------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/so_txtime.c b/tools/testing/selftests/net/so_txtime.c
index 383bac05ac32..ceaad78e9667 100644
--- a/tools/testing/selftests/net/so_txtime.c
+++ b/tools/testing/selftests/net/so_txtime.c
@@ -15,8 +15,9 @@
 #include <inttypes.h>
 #include <linux/net_tstamp.h>
 #include <linux/errqueue.h>
+#include <linux/if_ether.h>
 #include <linux/ipv6.h>
-#include <linux/tcp.h>
+#include <linux/udp.h>
 #include <stdbool.h>
 #include <stdlib.h>
 #include <stdio.h>
@@ -140,8 +141,8 @@ static void do_recv_errqueue_timeout(int fdt)
 {
 	char control[CMSG_SPACE(sizeof(struct sock_extended_err)) +
 		     CMSG_SPACE(sizeof(struct sockaddr_in6))] = {0};
-	char data[sizeof(struct ipv6hdr) +
-		  sizeof(struct tcphdr) + 1];
+	char data[sizeof(struct ethhdr) + sizeof(struct ipv6hdr) +
+		  sizeof(struct udphdr) + 1];
 	struct sock_extended_err *err;
 	struct msghdr msg = {0};
 	struct iovec iov = {0};
@@ -159,6 +160,8 @@ static void do_recv_errqueue_timeout(int fdt)
 	msg.msg_controllen = sizeof(control);
 
 	while (1) {
+		const char *reason;
+
 		ret = recvmsg(fdt, &msg, MSG_ERRQUEUE);
 		if (ret == -1 && errno == EAGAIN)
 			break;
@@ -176,14 +179,30 @@ static void do_recv_errqueue_timeout(int fdt)
 		err = (struct sock_extended_err *)CMSG_DATA(cm);
 		if (err->ee_origin != SO_EE_ORIGIN_TXTIME)
 			error(1, 0, "errqueue: origin 0x%x\n", err->ee_origin);
-		if (err->ee_code != ECANCELED)
-			error(1, 0, "errqueue: code 0x%x\n", err->ee_code);
+
+		switch (err->ee_errno) {
+		case ECANCELED:
+			if (err->ee_code != SO_EE_CODE_TXTIME_MISSED)
+				error(1, 0, "errqueue: unknown ECANCELED %u\n",
+				      err->ee_code);
+			reason = "missed txtime";
+		break;
+		case EINVAL:
+			if (err->ee_code != SO_EE_CODE_TXTIME_INVALID_PARAM)
+				error(1, 0, "errqueue: unknown EINVAL %u\n",
+				      err->ee_code);
+			reason = "invalid txtime";
+		break;
+		default:
+			error(1, 0, "errqueue: errno %u code %u\n",
+			      err->ee_errno, err->ee_code);
+		};
 
 		tstamp = ((int64_t) err->ee_data) << 32 | err->ee_info;
 		tstamp -= (int64_t) glob_tstart;
 		tstamp /= 1000 * 1000;
-		fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped\n",
-				data[ret - 1], tstamp);
+		fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
+			data[ret - 1], tstamp, reason);
 
 		msg.msg_flags = 0;
 		msg.msg_controllen = sizeof(control);
-- 
2.27.0.290.gba653c62da-goog


^ permalink raw reply related

* Re: [PATCH net] selftests/net: report etf errors correctly
From: Willem de Bruijn @ 2020-06-18 16:43 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Willem de Bruijn, Network Development, David Miller
In-Reply-To: <20200618093625.0bb5ac61@kicinski-fedora-PC1C0HJN>

On Thu, Jun 18, 2020 at 12:36 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 18 Jun 2020 12:18:01 -0400 Willem de Bruijn wrote:
> > On Thu, Jun 18, 2020 at 11:54 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Thu, 18 Jun 2020 10:55:49 -0400 Willem de Bruijn wrote:
> > > > +             switch (err->ee_errno) {
> > > > +             case ECANCELED:
> > > > +                     if (err->ee_code != SO_EE_CODE_TXTIME_MISSED)
> > > > +                             error(1, 0, "errqueue: unknown ECANCELED %u\n",
> > > > +                                         err->ee_code);
> > > > +                     reason = "missed txtime";
> > > > +             break;
> > > > +             case EINVAL:
> > > > +                     if (err->ee_code != SO_EE_CODE_TXTIME_INVALID_PARAM)
> > > > +                             error(1, 0, "errqueue: unknown EINVAL %u\n",
> > > > +                                         err->ee_code);
> > > > +                     reason = "invalid txtime";
> > > > +             break;
> > > > +             default:
> > > > +                     error(1, 0, "errqueue: errno %u code %u\n",
> > > > +                           err->ee_errno, err->ee_code);
> > > > +             };
> > > >
> > > >               tstamp = ((int64_t) err->ee_data) << 32 | err->ee_info;
> > > >               tstamp -= (int64_t) glob_tstart;
> > > >               tstamp /= 1000 * 1000;
> > > > -             fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped\n",
> > > > -                             data[ret - 1], tstamp);
> > > > +             fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
> > > > +                             data[ret - 1], tstamp, reason);
> > >
> > > Hi Willem! Checkpatch is grumpy about some misalignment here:
> > >
> > > CHECK: Alignment should match open parenthesis
> > > #67: FILE: tools/testing/selftests/net/so_txtime.c:187:
> > > +                               error(1, 0, "errqueue: unknown ECANCELED %u\n",
> > > +                                           err->ee_code);
> > >
> > > CHECK: Alignment should match open parenthesis
> > > #73: FILE: tools/testing/selftests/net/so_txtime.c:193:
> > > +                               error(1, 0, "errqueue: unknown EINVAL %u\n",
> > > +                                           err->ee_code);
> > >
> > > CHECK: Alignment should match open parenthesis
> > > #87: FILE: tools/testing/selftests/net/so_txtime.c:205:
> > > +               fprintf(stderr, "send: pkt %c at %" PRId64 "ms dropped: %s\n",
> > > +                               data[ret - 1], tstamp, reason);
> >
> > Thanks for the heads-up, Jakub.
> >
> > I decided to follow the convention in the file, which is to align with
> > the start of the string.
>
> Ack, I remember the selftest was added with a larger series so I didn't
> want to complain about minutia :)
>
> > Given that, do you want me to resubmit with the revised offset? I'm
> > fine either way, of course.
>
> No strong feelings, perhaps it's fine if the rest of the file is
> like that already.

We'll have to standardize at some point anyway. Sent a v2.

>
> > Also, which incantation of checkpatch do you use? I did run
> > checkpatch, without extra args, and it did not warn me about this.
>
> I run with --strict, and pick the warnings which make sense.

Ah, thanks. I've updated my bash alias to do the same from now on. The
PRId64 CamelCase warning is a false positive I'll have to leave as is.
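
For readers following the errqueue hunk quoted above, here is a minimal sketch of how the dropped packet's timestamp is rebuilt from the two 32-bit fields of the extended error. The helper name txtime_rel_ms() and its parameters are made up for illustration; only the arithmetic mirrors the selftest.

```c
#include <stdint.h>

/* Rebuild the 64-bit txtime from the two 32-bit halves reported in
 * struct sock_extended_err (ee_data = upper half, ee_info = lower half),
 * then express it in milliseconds relative to a start time given in ns.
 * Hypothetical helper, not part of so_txtime.c.
 */
static int64_t txtime_rel_ms(uint32_t ee_data, uint32_t ee_info,
			     uint64_t tstart_ns)
{
	int64_t tstamp = (int64_t)(((uint64_t)ee_data << 32) | ee_info);

	tstamp -= (int64_t)tstart_ns;
	return tstamp / (1000 * 1000);	/* ns -> ms */
}
```

With a start time of 0, ee_data = 0 and ee_info = 5000000 this yields 5 ms.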

^ permalink raw reply

* Re: [PATCH 13/29] dt: fix broken links due to txt->yaml renames
From: Rob Herring @ 2020-06-18 16:44 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Matthias Brugger, linux-mediatek, linux-mips, Sean Wang,
	Liam Girdwood, linux-bluetooth, Linux Doc Mailing List,
	devicetree, Rob Herring, linux-rockchip, Daniel Vetter,
	David S. Miller, netdev, Mark Brown, David Airlie, linux-kernel,
	dri-devel, Thomas Bogendoerfer, Jonathan Corbet, linux-arm-kernel,
	Sandy Huang, alsa-devel, Arnaud Pouliquen, Heiko Stübner,
	Jakub Kicinski
In-Reply-To: <0e4a7f0b7efcc8109c8a41a2e13c8adde4d9c6b9.1592203542.git.mchehab+huawei@kernel.org>

On Mon, 15 Jun 2020 08:46:52 +0200, Mauro Carvalho Chehab wrote:
> There are some new broken doc links due to yaml renames
> at DT. Developers should really run:
> 
> 	./scripts/documentation-file-ref-check
> 
> in order to solve those issues while submitting patches.
> This tool can even fix most of the issues with:
> 
> 	./scripts/documentation-file-ref-check --fix
> 
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
>  Documentation/devicetree/bindings/display/bridge/sii902x.txt  | 2 +-
>  .../devicetree/bindings/display/rockchip/rockchip-drm.yaml    | 2 +-
>  Documentation/devicetree/bindings/net/mediatek-bluetooth.txt  | 2 +-
>  Documentation/devicetree/bindings/sound/audio-graph-card.txt  | 2 +-
>  Documentation/devicetree/bindings/sound/st,sti-asoc-card.txt  | 2 +-
>  Documentation/mips/ingenic-tcu.rst                            | 2 +-
>  MAINTAINERS                                                   | 4 ++--
>  7 files changed, 8 insertions(+), 8 deletions(-)
> 

Applied, thanks!

^ permalink raw reply

* Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Russell King - ARM Linux admin @ 2020-06-18 16:55 UTC (permalink / raw)
  To: Ioana Ciornei
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Vladimir Oltean,
	Claudiu Manoil, Alexandru Marginean, michael@walle.cc,
	andrew@lunn.ch, f.fainelli@gmail.com
In-Reply-To: <VI1PR0402MB387191C53CE915E5AC060669E09B0@VI1PR0402MB3871.eurprd04.prod.outlook.com>

On Thu, Jun 18, 2020 at 04:17:56PM +0000, Ioana Ciornei wrote:
> > > +struct mdio_lynx_pcs *mdio_lynx_pcs_create(struct mdio_device
> > > +*mdio_dev) {
> > > +	struct mdio_lynx_pcs *pcs;
> > > +
> > > +	if (WARN_ON(!mdio_dev))
> > > +		return NULL;
> > > +
> > > +	pcs = kzalloc(sizeof(*pcs), GFP_KERNEL);
> > > +	if (!pcs)
> > > +		return NULL;
> > > +
> > > +	pcs->dev = mdio_dev;
> > > +	pcs->an_restart = lynx_pcs_an_restart;
> > > +	pcs->get_state = lynx_pcs_get_state;
> > > +	pcs->link_up = lynx_pcs_link_up;
> > > +	pcs->config = lynx_pcs_config;
> > 
> > We really should not have these private structure interfaces.  Private structure-
> > based driver specific interfaces really don't constitute a sane approach to
> > interface design.
> > 
> > > Would it work if there was a "struct mdio_device" added to the phylink_config
> > structure, and then you could have the phylink_pcs_ops embedded into this
> > driver?
> 
> I think that would restrict too much the usage.
> I am afraid there will be instances where the PCS is not recognizable by PHY_ID,
> thus no way of knowing which driver to probe which mdio_device.
> Also, I would leave to the driver the choice of using (or not) the functions 
> exported by Lynx.

I think you've taken my point way too far.  What I'm complaining about
is the indirection of lynx_pcs_an_restart() et al. through a driver-
private structure.

In order to access lynx_pcs_an_restart(), we need to dereference
struct mdio_lynx_pcs, which is a structure specific to this lynx PCS
driver.  What this leads to is users doing this:

	if (pcs_is_lynx) {
		struct mdio_lynx_pcs *pcs = foo->bar;

		pcs->an_restart(...);
	} else if (pcs_is_something_else) {
		struct mdio_somethingelse_pcs *pcs = foo->bar;

		pcs->an_restart(...);
	}

which really does not scale.  A step forward would be:

	if (pcs_is_lynx) {
		lynx_pcs_an_restart(...);
	} else if (pcs_is_something_else) {
		something_else_pcs_an_restart(...);
	}

but that also scales horribly.

Even better would be to have a generic set of operations for PCS
devices that can be declared in the lynx PCS driver and used
externally... like, maybe struct phylink_pcs_ops, which is made
globally visible for MAC drivers to use with phylink_add_pcs().

Or maybe a function in this lynx PCS driver that calls phylink_add_pcs()
itself with its own PCS operations, and maybe also sets a field in
struct phylink_config for the PCS mdio device?

Or something like that - just some way that doesn't force us down
a path that we end up forcing people into code that has to decide
what sort of PCS we have at runtime in all these method paths.

> What if we directly export the helper functions directly as symbols which can
> be used by the driver without any mdio_lynx_pcs in the middle
> (just the mdio_device passed to the function).
> This would be exactly as phylink_mii_c22_pcs_[an_restart/config] are currently
> used.

The difference is that phylink_mii_c22_pcs_* are designed as library
functions - functions that are very likely to be re-used for multiple
different PCS (because the format, location, and access method of
these registers is defined by IEEE 802.3).  It's a bit like phylib's
configuration of autoneg - we don't have all the individual drivers
doing that, we have core code that does that for us in the form of
helpers.
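
As a rough illustration of the "generic set of operations" idea, here is a standalone sketch. It is not phylink code: struct pcs_dev, struct pcs_ops and pcs_an_restart() are invented names, and the real struct phylink_pcs_ops has a different layout.

```c
#include <stddef.h>

struct pcs_dev;				/* hypothetical PCS handle */

/* Generic operations every PCS driver would fill in (illustrative). */
struct pcs_ops {
	int (*an_restart)(struct pcs_dev *pcs);
};

struct pcs_dev {
	const struct pcs_ops *ops;
	int restarts;			/* counts calls, for demonstration */
};

/* Driver side: the Lynx-specific implementation stays private. */
static int lynx_an_restart(struct pcs_dev *pcs)
{
	pcs->restarts++;
	return 0;
}

static const struct pcs_ops lynx_ops = {
	.an_restart = lynx_an_restart,
};

/* Caller side: no "if (pcs_is_lynx)" branch, just dispatch via ops. */
static int pcs_an_restart(struct pcs_dev *pcs)
{
	if (!pcs || !pcs->ops || !pcs->ops->an_restart)
		return -1;
	return pcs->ops->an_restart(pcs);
}
```

The caller only ever sees struct pcs_ops, so adding a second PCS type means adding another ops instance rather than another branch in every method path.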

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [RFC PATCH 3/9] net: dsa: hellcreek: Add PTP clock support
From: Andrew Lunn @ 2020-06-18 17:23 UTC (permalink / raw)
  To: Kurt Kanzenbach
  Cc: Vivien Didelot, Florian Fainelli, David S. Miller, Jakub Kicinski,
	netdev, Rob Herring, devicetree, Sebastian Andrzej Siewior,
	Richard Cochran, Kamil Alkhouri, ilias.apalodimas
In-Reply-To: <20200618064029.32168-4-kurt@linutronix.de>

> +static u64 __hellcreek_ptp_clock_read(struct hellcreek *hellcreek)
> +{
> +	u16 nsl, nsh, secl, secm, sech;
> +
> +	/* Take a snapshot */
> +	hellcreek_ptp_write(hellcreek, PR_COMMAND_C_SS, PR_COMMAND_C);
> +
> +	/* The time of the day is saved as 96 bits. However, due to hardware
> +	 * limitations the seconds are not or only partly kept in the PTP
> +	 * core. That's why only the nanoseconds are used and the seconds are
> +	 * tracked in software. Anyway due to internal locking all five
> +	 * registers should be read.
> +	 */
> +	sech = hellcreek_ptp_read(hellcreek, PR_SS_SYNC_DATA_C);
> +	secm = hellcreek_ptp_read(hellcreek, PR_SS_SYNC_DATA_C);
> +	secl = hellcreek_ptp_read(hellcreek, PR_SS_SYNC_DATA_C);
> +	nsh  = hellcreek_ptp_read(hellcreek, PR_SS_SYNC_DATA_C);
> +	nsl  = hellcreek_ptp_read(hellcreek, PR_SS_SYNC_DATA_C);
> +
> +	return (u64)nsl | ((u64)nsh << 16);

Hi Kurt

What are the hardware limitations? There seems to be 48 bits for
seconds? That allows for 8925104 years?

> +static u64 __hellcreek_ptp_gettime(struct hellcreek *hellcreek)
> +{
> +	u64 ns;
> +
> +	ns = __hellcreek_ptp_clock_read(hellcreek);
> +	if (ns < hellcreek->last_ts)
> +		hellcreek->seconds++;
> +	hellcreek->last_ts = ns;
> +	ns += hellcreek->seconds * NSEC_PER_SEC;

So the assumption is, this gets called at least once per second. And
if that does not happen, there is no recovery. The second is lost.

I'm just wondering if there is something more robust using what the
hardware does provide, even if the hardware is not perfect.
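
To make the concern concrete, here is a small userspace sketch of the same bookkeeping. It assumes the nanosecond counter wraps at one second; struct soft_clock and soft_clock_update() are illustrative names, not driver code.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct soft_clock {
	uint64_t last_ns;	/* last nanosecond value read from hardware */
	uint64_t seconds;	/* seconds counted in software */
};

/* Fold a fresh hardware nanosecond reading (0..NSEC_PER_SEC-1) into
 * the software clock. A reading smaller than the previous one is
 * taken to mean exactly one wrap, which is the fragile assumption.
 */
static uint64_t soft_clock_update(struct soft_clock *c, uint64_t hw_ns)
{
	if (hw_ns < c->last_ns)
		c->seconds++;
	c->last_ns = hw_ns;
	return c->seconds * NSEC_PER_SEC + hw_ns;
}
```

If the readings are 0.9s, then 0.1s, the wrap is detected. If the next reading is 0.2s but more than a full second has actually passed, the code cannot tell and the extra second is silently lost.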

	 Andrew

^ permalink raw reply

* [PATCH net] ionic: tame the watchdog timer on reconfig
From: Shannon Nelson @ 2020-06-18 17:29 UTC (permalink / raw)
  To: netdev, davem; +Cc: Shannon Nelson

Even with moving netif_tx_disable() to an earlier point when
taking down the queues for a reconfiguration, we still end
up with the occasional netdev watchdog Tx Timeout complaint.
The old method of using netif_trans_update() works fine for
queue 0, but has no effect on the remaining queues.  Using
netif_device_detach() allows us to signal to the watchdog to
ignore us for the moment.

Fixes: beead698b173 ("ionic: Add the basic NDO callbacks for netdev support")
Signed-off-by: Shannon Nelson <snelson@pensando.io>
---
 drivers/net/ethernet/pensando/ionic/ionic_lif.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/pensando/ionic/ionic_lif.c b/drivers/net/ethernet/pensando/ionic/ionic_lif.c
index 8f29ef133743..aaa00edd9d5b 100644
--- a/drivers/net/ethernet/pensando/ionic/ionic_lif.c
+++ b/drivers/net/ethernet/pensando/ionic/ionic_lif.c
@@ -1694,15 +1694,15 @@ static void ionic_stop_queues(struct ionic_lif *lif)
 	if (!test_and_clear_bit(IONIC_LIF_F_UP, lif->state))
 		return;
 
-	ionic_txrx_disable(lif);
 	netif_tx_disable(lif->netdev);
+	ionic_txrx_disable(lif);
 }
 
 int ionic_stop(struct net_device *netdev)
 {
 	struct ionic_lif *lif = netdev_priv(netdev);
 
-	if (!netif_device_present(netdev))
+	if (test_bit(IONIC_LIF_F_FW_RESET, lif->state))
 		return 0;
 
 	ionic_stop_queues(lif);
@@ -1985,18 +1985,19 @@ int ionic_reset_queues(struct ionic_lif *lif)
 	bool running;
 	int err = 0;
 
-	/* Put off the next watchdog timeout */
-	netif_trans_update(lif->netdev);
-
 	err = ionic_wait_for_bit(lif, IONIC_LIF_F_QUEUE_RESET);
 	if (err)
 		return err;
 
 	running = netif_running(lif->netdev);
-	if (running)
+	if (running) {
+		netif_device_detach(lif->netdev);
 		err = ionic_stop(lif->netdev);
-	if (!err && running)
+	}
+	if (!err && running) {
 		ionic_open(lif->netdev);
+		netif_device_attach(lif->netdev);
+	}
 
 	clear_bit(IONIC_LIF_F_QUEUE_RESET, lif->state);
 
-- 
2.17.1


^ permalink raw reply related

* [RFC PATCH] net/sched: add indirect call wrapper hint.
From: Paolo Abeni @ 2020-06-18 17:31 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet

The sched layer can use several indirect calls per
packet, with non-work-conserving qdiscs being
more affected due to the lack of the BYPASS path.

This change tries to improve the situation using
the indirect call wrappers infrastructure for the
qdisc enqueue and dequeue indirect calls.

To cope with non-trivial scenarios, a compile-time knob is
introduced, so that the qdisc used by ICW can be different
from the default one.

Tested with pktgen over qdisc, with CONFIG_HINT_FQ_CODEL=y:

qdisc		threads vanilla	patched delta
		nr	Kpps	Kpps	%
pfifo_fast	1	3300	3700	12
pfifo_fast	2	3940	4070	3
fq_codel	1	3840	4110	7
fq_codel	2	1920	2260	17
fq		1	2230	2210	-1
fq		2	1530	1540	1

In the first 4 test-cases the sch hint kicks in and we observe
a measurable gain. The last 2 tests show scenarios where ICW
does not avoid the indirect calls and just adds more branches:
deltas are within noise range.
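
For context, a simplified userspace sketch of what an indirect call wrapper does: compare the function pointer against the expected target and, on a match, issue a direct call. SKETCH_INDIRECT_CALL_1 and struct queue are stand-ins made up for this example; the kernel macro additionally uses likely() and is only active with retpolines enabled.

```c
/* Simplified stand-in for the kernel's INDIRECT_CALL_1(): if the
 * pointer matches the expected target, call it directly; otherwise
 * fall back to the plain indirect call.
 */
#define SKETCH_INDIRECT_CALL_1(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

struct queue {
	int (*enqueue)(struct queue *q, int pkt);
	int len;
};

/* The "hinted" implementation: reached through a direct call. */
static int fast_enqueue(struct queue *q, int pkt)
{
	(void)pkt;
	return ++q->len;
}

/* Any other implementation: still works, via the indirect fallback. */
static int other_enqueue(struct queue *q, int pkt)
{
	(void)pkt;
	q->len += 2;
	return q->len;
}

static int do_enqueue(struct queue *q, int pkt)
{
	return SKETCH_INDIRECT_CALL_1(q->enqueue, fast_enqueue, q, pkt);
}
```

When the hint does not match (the fq rows in the table), the comparison is pure overhead, which is consistent with the deltas staying within noise.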

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 include/net/sch_hint.h   | 51 ++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c           |  5 ++--
 net/sched/Kconfig        | 33 ++++++++++++++++++++++++++
 net/sched/sch_codel.c    | 10 +++++---
 net/sched/sch_fq.c       | 11 ++++++---
 net/sched/sch_fq_codel.c | 11 ++++++---
 net/sched/sch_generic.c  | 14 +++++++----
 net/sched/sch_sfq.c      | 10 +++++---
 8 files changed, 126 insertions(+), 19 deletions(-)
 create mode 100644 include/net/sch_hint.h

diff --git a/include/net/sch_hint.h b/include/net/sch_hint.h
new file mode 100644
index 000000000000..d7ede93d01f7
--- /dev/null
+++ b/include/net/sch_hint.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_SCHED_HINT_H
+#define __NET_SCHED_HINT_H
+
+#include <linux/indirect_call_wrapper.h>
+
+/* pfifo_fast is a bit special: we can use it for the 'nolock' path
+ * regardless of the specified hint
+ */
+INDIRECT_CALLABLE_DECLARE(int pfifo_fast_enqueue(struct sk_buff *skb,
+						struct Qdisc *qdisc,
+						struct sk_buff **to_free));
+
+#define Q_NOLOCK_ENQUEUE(skb, q, to_free) \
+	INDIRECT_CALL_1(q->enqueue, pfifo_fast_enqueue, skb, q, to_free)
+
+#if IS_ENABLED(CONFIG_HINT_FQ_CODEL)
+#define NET_SCHED_ENQUEUE fq_codel_enqueue
+#define NET_SCHED_DEQUEUE fq_codel_dequeue
+#define FQ_CODEL_SCOPE INDIRECT_CALLABLE_SCOPE
+#elif IS_ENABLED(CONFIG_HINT_FQ)
+#define NET_SCHED_ENQUEUE fq_enqueue
+#define NET_SCHED_DEQUEUE fq_dequeue
+#define FQ_SCOPE INDIRECT_CALLABLE_SCOPE
+#elif IS_ENABLED(CONFIG_HINT_CODEL)
+#define NET_SCHED_ENQUEUE codel_qdisc_enqueue
+#define NET_SCHED_DEQUEUE codel_qdisc_dequeue
+#define CODEL_SCOPE INDIRECT_CALLABLE_SCOPE
+#elif IS_ENABLED(CONFIG_HINT_SFQ)
+#define NET_SCHED_ENQUEUE sfq_enqueue
+#define NET_SCHED_DEQUEUE sfq_dequeue
+#define SFQ_SCOPE INDIRECT_CALLABLE_SCOPE
+#endif
+
+#if defined(NET_SCHED_DEQUEUE)
+INDIRECT_CALLABLE_DECLARE(int NET_SCHED_ENQUEUE(struct sk_buff *skb,
+						struct Qdisc *qdisc,
+						struct sk_buff **to_free));
+INDIRECT_CALLABLE_DECLARE(struct sk_buff *NET_SCHED_DEQUEUE(struct Qdisc *q));
+
+#define Q_LOCK_ENQUEUE(skb, q, to_free) \
+	INDIRECT_CALL_1(q->enqueue, NET_SCHED_ENQUEUE, skb, q, to_free)
+#define Q_DEQUEUE(q) \
+	INDIRECT_CALL_2(q->dequeue, NET_SCHED_DEQUEUE, pfifo_fast_dequeue, q)
+
+#else
+#define Q_LOCK_ENQUEUE(skb, q, to_free) q->enqueue(skb, q, to_free)
+#define Q_DEQUEUE(q) INDIRECT_CALL_1(q->dequeue, pfifo_fast_dequeue, q)
+#endif
+
+#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 6bc2388141f6..064e4a86d502 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -143,6 +143,7 @@
 #include <linux/net_namespace.h>
 #include <linux/indirect_call_wrapper.h>
 #include <net/devlink.h>
+#include <net/sch_hint.h>
 
 #include "net-sysfs.h"
 
@@ -3743,7 +3744,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	qdisc_calculate_pkt_len(skb, q);
 
 	if (q->flags & TCQ_F_NOLOCK) {
-		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		rc = Q_NOLOCK_ENQUEUE(skb, q, &to_free) & NET_XMIT_MASK;
 		qdisc_run(q);
 
 		if (unlikely(to_free))
@@ -3786,7 +3787,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
-		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		rc = Q_LOCK_ENQUEUE(skb, q, &to_free) & NET_XMIT_MASK;
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 84badf00647e..a86e0ff4de26 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -484,6 +484,39 @@ config DEFAULT_NET_SCH
 	default "pfifo_fast"
 endif
 
+if RETPOLINE
+
+choice
+	prompt "Queuing discipline hint for retpoline"
+	default HINT_PFIFO_FAST if DEFAULT_PFIFO_FAST
+	default HINT_FQ if DEFAULT_FQ && NET_SCH_FQ=y
+	default HINT_FQ_CODEL if DEFAULT_FQ_CODEL && NET_SCH_FQ_CODEL=y
+	default HINT_CODEL if DEFAULT_CODEL && NET_SCH_CODEL=y
+	default HINT_SFQ if DEFAULT_SFQ && NET_SCH_SFQ=y
+	default HINT_PFIFO_FAST
+	help
+	  Specify the hint used by the queue discipline to
+	  avoid the retpoline indirect call.
+	  Usually should match the default queue discipline.
+
+	config HINT_FQ_CODEL
+		bool "Fair Queue Controlled Delay" if NET_SCH_FQ_CODEL=y
+
+	config HINT_FQ
+		bool "Fair Queue" if NET_SCH_FQ=y
+
+	config HINT_CODEL
+		bool "Controlled Delay" if NET_SCH_CODEL=y
+
+	config HINT_SFQ
+		bool "Stochastic Fair Queue" if NET_SCH_SFQ=y
+
+	config HINT_PFIFO_FAST
+		bool "Priority FIFO Fast"
+endchoice
+
+endif
+
 comment "Classification"
 
 config NET_CLS
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index 30169b3adbbb..d5a79bcc3bd3 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -51,7 +51,11 @@
 #include <net/codel.h>
 #include <net/codel_impl.h>
 #include <net/codel_qdisc.h>
+#include <net/sch_hint.h>
 
+#ifndef CODEL_SCOPE
+#define CODEL_SCOPE static
+#endif
 
 #define DEFAULT_CODEL_LIMIT 1000
 
@@ -86,7 +90,7 @@ static void drop_func(struct sk_buff *skb, void *ctx)
 	qdisc_qstats_drop(sch);
 }
 
-static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
+CODEL_SCOPE struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
 {
 	struct codel_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;
@@ -108,8 +112,8 @@ static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
 	return skb;
 }
 
-static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
-			       struct sk_buff **to_free)
+CODEL_SCOPE int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				    struct sk_buff **to_free)
 {
 	struct codel_sched_data *q;
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 8f06a808c59a..2660e8163a23 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -49,6 +49,11 @@
 #include <net/sock.h>
 #include <net/tcp_states.h>
 #include <net/tcp.h>
+#include <net/sch_hint.h>
+
+#ifndef FQ_SCOPE
+#define FQ_SCOPE static
+#endif
 
 struct fq_skb_cb {
 	u64	        time_to_send;
@@ -439,8 +444,8 @@ static bool fq_packet_beyond_horizon(const struct sk_buff *skb,
 	return unlikely((s64)skb->tstamp > (s64)(q->ktime_cache + q->horizon));
 }
 
-static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
-		      struct sk_buff **to_free)
+FQ_SCOPE int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
 	struct fq_flow *f;
@@ -523,7 +528,7 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 	}
 }
 
-static struct sk_buff *fq_dequeue(struct Qdisc *sch)
+FQ_SCOPE struct sk_buff *fq_dequeue(struct Qdisc *sch)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
 	struct fq_flow_head *head;
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 436160be9c18..c3fb13527781 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -22,6 +22,11 @@
 #include <net/codel.h>
 #include <net/codel_impl.h>
 #include <net/codel_qdisc.h>
+#include <net/sch_hint.h>
+
+#ifndef FQ_CODEL_SCOPE
+#define FQ_CODEL_SCOPE static
+#endif
 
 /*	Fair Queue CoDel.
  *
@@ -181,8 +186,8 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
 	return idx;
 }
 
-static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
-			    struct sk_buff **to_free)
+FQ_CODEL_SCOPE int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+				    struct sk_buff **to_free)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	unsigned int idx, prev_backlog, prev_qlen;
@@ -278,7 +283,7 @@ static void drop_func(struct sk_buff *skb, void *ctx)
 	qdisc_qstats_drop(sch);
 }
 
-static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
+FQ_CODEL_SCOPE struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 265a61d011df..1e3685be4975 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -30,6 +30,9 @@
 #include <trace/events/qdisc.h>
 #include <trace/events/net.h>
 #include <net/xfrm.h>
+#include <net/sch_hint.h>
+
+static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc);
 
 /* Qdisc to use by default */
 const struct Qdisc_ops *default_qdisc_ops = &pfifo_fast_ops;
@@ -157,7 +160,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
 
 	while (bytelimit > 0) {
-		struct sk_buff *nskb = q->dequeue(q);
+		struct sk_buff *nskb = Q_DEQUEUE(q);
 
 		if (!nskb)
 			break;
@@ -182,7 +185,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
 	int cnt = 0;
 
 	do {
-		nskb = q->dequeue(q);
+		nskb = Q_DEQUEUE(q);
 		if (!nskb)
 			break;
 		if (unlikely(skb_get_queue_mapping(nskb) != mapping)) {
@@ -260,7 +263,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			return NULL;
 		goto bulk;
 	}
-	skb = q->dequeue(q);
+	skb = Q_DEQUEUE(q);
 	if (skb) {
 bulk:
 		if (qdisc_may_bulk(q))
@@ -614,8 +617,9 @@ static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
 	return &priv->q[band];
 }
 
-static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
-			      struct sk_buff **to_free)
+INDIRECT_CALLABLE_SCOPE int pfifo_fast_enqueue(struct sk_buff *skb,
+					       struct Qdisc *qdisc,
+					       struct sk_buff **to_free)
 {
 	int band = prio2band[skb->priority & TC_PRIO_MAX];
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 5a6def5e4e6d..e120dc2aae72 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -21,6 +21,11 @@
 #include <net/pkt_sched.h>
 #include <net/pkt_cls.h>
 #include <net/red.h>
+#include <net/sch_hint.h>
+
+#ifndef SFQ_SCOPE
+#define SFQ_SCOPE static
+#endif
 
 
 /*	Stochastic Fairness Queuing algorithm.
@@ -342,7 +347,7 @@ static int sfq_headdrop(const struct sfq_sched_data *q)
 	return q->headdrop;
 }
 
-static int
+SFQ_SCOPE int
 sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
@@ -476,8 +481,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 	return NET_XMIT_SUCCESS;
 }
 
-static struct sk_buff *
-sfq_dequeue(struct Qdisc *sch)
+SFQ_SCOPE struct sk_buff *sfq_dequeue(struct Qdisc *sch)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;
-- 
2.26.2


^ permalink raw reply related

* Re: net/mlx5e: bind() always returns EINVAL with XDP_ZEROCOPY
From: Kal Cutter Conley @ 2020-06-18 17:31 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Saeed Mahameed, brouer@redhat.com, Maxim Mikityanskiy,
	magnus.karlsson@intel.com, toke.hoiland-jorgensen@kau.se,
	xdp-newbies@vger.kernel.org, Tariq Toukan, gospo@broadcom.com,
	jakub.kicinski@netronome.com, netdev@vger.kernel.org,
	bjorn.topel@intel.com
In-Reply-To: <20200618150347.ihtdvsfuurgfka7i@bsd-mbp.dhcp.thefacebook.com>

On Thu, Jun 18, 2020 at 5:23 PM Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On Sun, Jun 14, 2020 at 10:55:30AM +0200, Kal Cutter Conley wrote:
> > Hi Saeed,
> > Thanks for explaining the reasoning behind the special mlx5 queue
> > numbering with XDP zerocopy.
> >
> > We have a process using AF_XDP that also shares the network interface
> > with other processes on the system. ethtool rx flow classification
> > rules are used to route the traffic to the appropriate XSK queue
> > N..(2N-1). The issue is these queues are only valid as long as they are
> > active (as far as I can tell). This means if my AF_XDP process dies
> > other processes no longer receive ingress traffic routed over queues
> > N..(2N-1) even though my XDP program is still loaded and would happily
> > always return XDP_PASS. Other drivers do not have this usability issue
> > because they use queues that are always valid. Is there a simple
> > workaround for this issue? It seems to me queues N..(2N-1) should
> > simply map to 0..(N-1) when they are not active?
>
> If your XDP program returns XDP_PASS, the packet should be delivered to
> the xsk socket.  If the application isn't running, where would it go?

XDP_PASS means the packet is passed to the normal network stack for
processing. XDP_REDIRECT means the packet should be delivered to the
xsk socket.

>
> I do agree that the usability of this can be improved.  What if the flow
> rules are inserted and removed along with queue creation/destruction?

The problem is the mlx5 driver allows flow rules to be set on
N..(2N-1) at any time; even when no XDP program is loaded. Given this
fact, it would be totally weird if they just suddenly disappeared the
first time the queues go inactive. That's why I suggested that they
just always map to queues 0..(N-1) when they are not active. This way,
at least it's less surprising. What do people think?
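
A sketch of the mapping I have in mind, purely to illustrate the suggestion. effective_rx_queue() is a hypothetical helper, not anything in mlx5; n is the number of regular channels and xsk_active says whether the XSK queue is currently bound.

```c
#include <stdbool.h>

/* Hypothetical mapping: regular queues are 0..n-1, XSK queues are
 * n..2n-1. While the XSK queue is inactive, traffic steered to it
 * falls back to the regular queue of the same channel.
 */
static unsigned int effective_rx_queue(unsigned int q, unsigned int n,
				       bool xsk_active)
{
	if (q >= n && q < 2 * n && !xsk_active)
		return q - n;
	return q;
}
```

With n = 4, a flow rule pointing at queue 5 would land on queue 1 while no socket is bound, and on queue 5 again once it is.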

> Jonathan

Kal

^ permalink raw reply

* RE: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
From: Ioana Ciornei @ 2020-06-18 17:34 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Vladimir Oltean,
	Claudiu Manoil, Alexandru Marginean, michael@walle.cc,
	andrew@lunn.ch, f.fainelli@gmail.com
In-Reply-To: <20200618165510.GG1551@shell.armlinux.org.uk>

> Subject: Re: [PATCH net-next 4/5] net: phy: add Lynx PCS MDIO module
> 
> On Thu, Jun 18, 2020 at 04:17:56PM +0000, Ioana Ciornei wrote:
> > > > +struct mdio_lynx_pcs *mdio_lynx_pcs_create(struct mdio_device
> > > > +*mdio_dev) {
> > > > +	struct mdio_lynx_pcs *pcs;
> > > > +
> > > > +	if (WARN_ON(!mdio_dev))
> > > > +		return NULL;
> > > > +
> > > > +	pcs = kzalloc(sizeof(*pcs), GFP_KERNEL);
> > > > +	if (!pcs)
> > > > +		return NULL;
> > > > +
> > > > +	pcs->dev = mdio_dev;
> > > > +	pcs->an_restart = lynx_pcs_an_restart;
> > > > +	pcs->get_state = lynx_pcs_get_state;
> > > > +	pcs->link_up = lynx_pcs_link_up;
> > > > +	pcs->config = lynx_pcs_config;
> > >
> > > We really should not have these private structure interfaces.
> > > Private structure- based driver specific interfaces really don't
> > > constitute a sane approach to interface design.
> > >
> > > > Would it work if there was a "struct mdio_device" added to the
> > > phylink_config structure, and then you could have the
> > > phylink_pcs_ops embedded into this driver?
> >
> > I think that would restrict too much the usage.
> > I am afraid there will be instances where the PCS is not recognizable
> > by PHY_ID, thus no way of knowing which driver to probe which mdio_device.
> > Also, I would leave to the driver the choice of using (or not) the
> > functions exported by Lynx.
> 
> I think you've taken my point way too far.  What I'm complaining about is the
> indirection of lynx_pcs_an_restart() et al. through a driver-private structure.
> 
> In order to access lynx_pcs_an_restart(), we need to dereference struct
> mdio_lynx_pcs, which is a structure specific to this lynx PCS driver.  What this
> leads to is users doing this:
> 
> 	if (pcs_is_lynx) {
> 		struct mdio_lynx_pcs *pcs = foo->bar;
> 
> 		pcs->an_restart(...);
> 	} else if (pcs_is_something_else) {
> 		struct mdio_somethingelse_pcs *pcs = foo->bar;
> 
> 		pcs->an_restart(...);
> 	}
> 
> which really does not scale.  A step forward would be:
> 
> 	if (pcs_is_lynx) {
> 		lynx_pcs_an_restart(...);
> 	} else if (pcs_is_something_else) {
> 		something_else_pcs_an_restart(...);
> 	}
> 
> but that also scales horribly.

This is what I was proposing. I can of course take the indirection away
and just export the functions.

Are there really instances where the ethernet driver has to manage multiple
different types of PCSs? I am not sure this type of snippet of code is really
going to occur.

> 
> Even better would be to have a generic set of operations for PCS devices that
> can be declared in the lynx PCS driver and used externally... like, maybe struct
> phylink_pcs_ops, which is made globally visible for MAC drivers to use with
> phylink_add_pcs().
> 
> Or maybe a function in this lynx PCS driver that calls phylink_add_pcs() itself with
> its own PCS operations, and maybe also sets a field in struct phylink_config for
> the PCS mdio device?
>

I am not sure how this would work with Felix and DSA drivers in general since the
DSA core is hiding the phylink_pcs_ops from the actual switch driver.

> Or something like that - just some way that doesn't force us down a path that
> we end up forcing people into code that has to decide what sort of PCS we have
> at runtime in all these method paths.

I get what you are saying but I do not know of any drivers that actually need this
distinction at runtime.

Ioana

> 
> > What if we directly export the helper functions directly as symbols
> > which can be used by the driver without any mdio_lynx_pcs in the
> > middle (just the mdio_device passed to the function).
> > This would be exactly as phylink_mii_c22_pcs_[an_restart/config] are
> > currently used.
> 
> The difference is that phylink_mii_c22_pcs_* are designed as library functions -
> functions that are very likely to be re-used for multiple different PCS (because
> the format, location, and access method of these registers is defined by IEEE
> 802.3).  It's a bit like phylib's configuration of autoneg - we don't have all the
> individual drivers doing that, we have core code that does that for us in the form
> of helpers.
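
For illustration only, the "generic set of operations" idea discussed above
can be modelled in plain userspace C. Every name here (struct pcs, struct
pcs_ops, mac_an_restart and the two example back ends) is hypothetical and is
not the phylink or lynx PCS API. The sketch only shows how an ops table
removes the per-type if/else from the caller:

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct pcs;

struct pcs_ops {
	const char *name;
	int (*an_restart)(struct pcs *pcs);
};

struct pcs {
	const struct pcs_ops *ops;
	int restarts;	/* bumped by an_restart, only to make the call visible */
};

/* Back end A: stands in for a lynx-style PCS. */
int lynx_an_restart(struct pcs *pcs)
{
	pcs->restarts += 1;
	return 0;
}

/* Back end B: stands in for some other PCS type. */
int other_an_restart(struct pcs *pcs)
{
	pcs->restarts += 10;
	return 0;
}

const struct pcs_ops lynx_ops = {
	.name = "lynx",
	.an_restart = lynx_an_restart,
};

const struct pcs_ops other_ops = {
	.name = "other",
	.an_restart = other_an_restart,
};

/* The MAC-side caller never asks which kind of PCS it was handed. */
int mac_an_restart(struct pcs *pcs)
{
	if (!pcs || !pcs->ops || !pcs->ops->an_restart)
		return -1;
	return pcs->ops->an_restart(pcs);
}
```

A caller would fill in a struct pcs with either ops table once at setup time
and then call mac_an_restart() on it. Adding a third PCS type means adding one
more ops table, with no change to the caller.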
> 



* Re: [RFC PATCH 6/9] net: dsa: hellcreek: Add debugging mechanisms
From: Andrew Lunn @ 2020-06-18 17:34 UTC (permalink / raw)
  To: Kurt Kanzenbach
  Cc: Vivien Didelot, Florian Fainelli, David S. Miller, Jakub Kicinski,
	netdev, Rob Herring, devicetree, Sebastian Andrzej Siewior,
	Richard Cochran, Kamil Alkhouri, ilias.apalodimas
In-Reply-To: <20200618064029.32168-7-kurt@linutronix.de>

On Thu, Jun 18, 2020 at 08:40:26AM +0200, Kurt Kanzenbach wrote:
> The switch has registers which are useful for debugging issues:

debugfs is not particularly liked. Please try to find other means
where possible. Memory usage fits nicely into devlink. See mv88e6xxx
which exports the ATU fill for example. Are the trace registers counters?

> +static int hellcreek_debugfs_init(struct hellcreek *hellcreek)
> +{
> +	struct dentry *file;
> +
> +	hellcreek->debug_dir = debugfs_create_dir(dev_name(hellcreek->dev),
> +						  NULL);
> +	if (!hellcreek->debug_dir)
> +		return -ENOMEM;

Just a general comment. You should not check the return value from any
debugfs call, since it is totally optional. It will also do the right
thing if the previous call has failed. There are numerous emails from
GregKH about this.

       Andrew


* KMSAN: uninit-value in hash_ip6_add
From: syzbot @ 2020-06-18 17:36 UTC (permalink / raw)
  To: coreteam, davem, fw, glider, gregkh, jeremy, kadlec, kstewart,
	kuba, linux-kernel, netdev, netfilter-devel, pablo,
	syzkaller-bugs, tglx

Hello,

syzbot found the following crash on:

HEAD commit:    f0d5ec90 kmsan: apply __no_sanitize_memory to dotraplinkag..
git tree:       https://github.com/google/kmsan.git master
console output: https://syzkaller.appspot.com/x/log.txt?x=126592fa100000
kernel config:  https://syzkaller.appspot.com/x/.config?x=86e4f8af239686c6
dashboard link: https://syzkaller.appspot.com/bug?extid=89bacaf2be1277d1e6de
compiler:       clang version 10.0.0 (https://github.com/llvm/llvm-project/ c2443155a0fb245c8f17f2c1c72b6ea391e86e81)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+89bacaf2be1277d1e6de@syzkaller.appspotmail.com

=====================================================
BUG: KMSAN: uninit-value in __read_once_size include/linux/compiler.h:206 [inline]
BUG: KMSAN: uninit-value in hash_ip6_add+0x14eb/0x30c0 net/netfilter/ipset/ip_set_hash_gen.h:892
CPU: 1 PID: 31730 Comm: syz-executor.3 Not tainted 5.7.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 __read_once_size include/linux/compiler.h:206 [inline]
 hash_ip6_add+0x14eb/0x30c0 net/netfilter/ipset/ip_set_hash_gen.h:892
 hash_ip6_uadt+0x8e6/0xad0 net/netfilter/ipset/ip_set_hash_ip.c:267
 call_ad+0x2dc/0xbc0 net/netfilter/ipset/ip_set_core.c:1732
 ip_set_ad+0xad2/0x1110 net/netfilter/ipset/ip_set_core.c:1820
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1845
 nfnetlink_rcv_msg+0xb86/0xcf0 net/netfilter/nfnetlink.c:229
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2469
 nfnetlink_rcv+0x3b5/0x3ab0 net/netfilter/nfnetlink.c:563
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x1246/0x14d0 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2362
 ___sys_sendmsg net/socket.c:2416 [inline]
 __sys_sendmsg+0x623/0x750 net/socket.c:2449
 __do_sys_sendmsg net/socket.c:2458 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2456
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2456
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45ca59
Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fea85396c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004fe260 RCX: 000000000045ca59
RDX: 0000000000000000 RSI: 00000000200002c0 RDI: 0000000000000003
RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000942 R14: 00000000004cc0a6 R15: 00007fea853976d4

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:310
 __msan_chain_origin+0x50/0x90 mm/kmsan/kmsan_instr.c:165
 ip6_netmask include/linux/netfilter/ipset/pfxlen.h:49 [inline]
 hash_ip6_netmask net/netfilter/ipset/ip_set_hash_ip.c:185 [inline]
 hash_ip6_uadt+0x9df/0xad0 net/netfilter/ipset/ip_set_hash_ip.c:263
 call_ad+0x2dc/0xbc0 net/netfilter/ipset/ip_set_core.c:1732
 ip_set_ad+0xad2/0x1110 net/netfilter/ipset/ip_set_core.c:1820
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1845
 nfnetlink_rcv_msg+0xb86/0xcf0 net/netfilter/nfnetlink.c:229
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2469
 nfnetlink_rcv+0x3b5/0x3ab0 net/netfilter/nfnetlink.c:563
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x1246/0x14d0 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2362
 ___sys_sendmsg net/socket.c:2416 [inline]
 __sys_sendmsg+0x623/0x750 net/socket.c:2449
 __do_sys_sendmsg net/socket.c:2458 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2456
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2456
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:310
 kmsan_memcpy_memmove_metadata+0x272/0x2e0 mm/kmsan/kmsan.c:247
 kmsan_memcpy_metadata+0xb/0x10 mm/kmsan/kmsan.c:267
 __msan_memcpy+0x43/0x50 mm/kmsan/kmsan_instr.c:116
 ip_set_get_ipaddr6+0x26a/0x300 net/netfilter/ipset/ip_set_core.c:325
 hash_ip6_uadt+0x450/0xad0 net/netfilter/ipset/ip_set_hash_ip.c:255
 call_ad+0x2dc/0xbc0 net/netfilter/ipset/ip_set_core.c:1732
 ip_set_ad+0xad2/0x1110 net/netfilter/ipset/ip_set_core.c:1820
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1845
 nfnetlink_rcv_msg+0xb86/0xcf0 net/netfilter/nfnetlink.c:229
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2469
 nfnetlink_rcv+0x3b5/0x3ab0 net/netfilter/nfnetlink.c:563
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x1246/0x14d0 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2362
 ___sys_sendmsg net/socket.c:2416 [inline]
 __sys_sendmsg+0x623/0x750 net/socket.c:2449
 __do_sys_sendmsg net/socket.c:2458 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2456
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2456
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:80
 slab_alloc_node mm/slub.c:2802 [inline]
 __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4436
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:210
 alloc_skb include/linux/skbuff.h:1083 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1175 [inline]
 netlink_sendmsg+0x7d3/0x14d0 net/netlink/af_netlink.c:1893
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2362
 ___sys_sendmsg net/socket.c:2416 [inline]
 __sys_sendmsg+0x623/0x750 net/socket.c:2449
 __do_sys_sendmsg net/socket.c:2458 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2456
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2456
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
=====================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


* Re: [RFC PATCH 9/9] dt-bindings: net: dsa: Add documentation for Hellcreek switches
From: Florian Fainelli @ 2020-06-18 17:40 UTC (permalink / raw)
  To: Kurt Kanzenbach, Andrew Lunn, Vivien Didelot
  Cc: David S. Miller, Jakub Kicinski, netdev, Rob Herring, devicetree,
	Sebastian Andrzej Siewior, Richard Cochran, Kamil Alkhouri,
	ilias.apalodimas
In-Reply-To: <20200618064029.32168-10-kurt@linutronix.de>



On 6/17/2020 11:40 PM, Kurt Kanzenbach wrote:
> Add basic documentation and example.
> 
> Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
> ---
>  .../devicetree/bindings/net/dsa/hellcreek.txt | 72 +++++++++++++++++++
>  1 file changed, 72 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/dsa/hellcreek.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/dsa/hellcreek.txt b/Documentation/devicetree/bindings/net/dsa/hellcreek.txt
> new file mode 100644
> index 000000000000..9ea6494dc554
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/dsa/hellcreek.txt

This should be a YAML binding and we should also convert the DSA binding
to YAML one day.

> @@ -0,0 +1,72 @@
> +Hirschmann hellcreek switch driver
> +==================================
> +
> +Required properties:
> +
> +- compatible:
> +	Must be one of:
> +	- "hirschmann,hellcreek"
> +
> +See Documentation/devicetree/bindings/net/dsa/dsa.txt for the list of standard
> +DSA required and optional properties.
> +
> +Example
> +-------
> +
> +Ethernet switch connected memory mapped to the host, CPU port wired to gmac0:
> +
> +soc {
> +        switch0: switch@0xff240000 {

Please remove the leading 0x from the unit address.
-- 
Florian


* Re: [PATCH v1 2/3] net/fsl: acpize xgmac_mdio
From: Calvin Johnson @ 2020-06-18 17:42 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Jeremy Linton, Andrew Lunn, Russell King - ARM Linux admin, Jon,
	Cristi Sovaiala, Ioana Ciornei, Florian Fainelli, Madalin Bucur,
	netdev, linux.cj
In-Reply-To: <CAHp75Vdn=t2UQpCP_kpOyyX_L6kvJ-=vtWp2t87PbYBbJOczTA@mail.gmail.com>

Hi,

On Thu, Jun 18, 2020 at 07:00:20PM +0300, Andy Shevchenko wrote:
> On Thu, Jun 18, 2020 at 6:46 PM Jeremy Linton <jeremy.linton@arm.com> wrote:
> > On 6/17/20 12:34 PM, Andrew Lunn wrote:
> > > On Wed, Jun 17, 2020 at 10:45:34PM +0530, Calvin Johnson wrote:
> > >> From: Jeremy Linton <jeremy.linton@arm.com>
> > >
> > >> +static const struct acpi_device_id xgmac_acpi_match[] = {
> > >> +    { "NXP0006", (kernel_ulong_t)NULL },
> > >
> > > Hi Jeremy
> > >
> > > What exactly does NXP0006 represent? An XGMAC MDIO bus master? Some
> > > NXP MDIO bus master? An XGMAC Ethernet controller which has an NXP
> > > MDIO bus master? A cluster of Ethernet controllers?
> >
> > Strictly speaking it's an NXP-defined (they own the "NXP" prefix per
> > https://uefi.org/pnp_id_list) ID. So, they have tied it to a specific
> > bit of hardware. In this case it appears to be a shared MDIO master
> > which isn't directly contained in an Ethernet controller. It's somewhat
> > similar to a "nxp,xxxxx" compatible ID, depending on how they are using
> > it to identify an ACPI device object (_HID()/_CID()).
> >
> > So AFAIK, this is all valid ACPI usage as long as the ID maps to a
> > unique device/object.
> >
> > >
> > > Is this documented somewhere? In the DT world we have a clear
> > > documentation for all the compatible strings. Is there anything
> > > similar in the ACPI world for these magic numbers?
> >
> > Sadly not fully. The mentioned PNP and ACPI
> > (https://uefi.org/acpi_id_list) ID lists are requested and registered
> > to a given organization. But, once the prefix is owned, it becomes the
> > responsibility of that organization to assign & manage the IDs with
> > their prefix. There are various individuals, etc. who have collected
> > lists, though like PCI IDs, there aren't any formal publishing requirements.
> 
> And here is the question, do we have (in form of email or other means)
> an official response from NXP about above mentioned ID?

At NXP, we've assigned NXP IDs to various controllers and "NXP0006" is
assigned to the xgmac_mdio controller.

Regards
Calvin


* Re: [RFC PATCH 7/9] net: dsa: hellcreek: Add PTP status LEDs
From: Andrew Lunn @ 2020-06-18 17:46 UTC (permalink / raw)
  To: Kurt Kanzenbach
  Cc: Vivien Didelot, Florian Fainelli, David S. Miller, Jakub Kicinski,
	netdev, Rob Herring, devicetree, Sebastian Andrzej Siewior,
	Richard Cochran, Kamil Alkhouri, ilias.apalodimas
In-Reply-To: <20200618064029.32168-8-kurt@linutronix.de>

On Thu, Jun 18, 2020 at 08:40:27AM +0200, Kurt Kanzenbach wrote:
> The switch has two controllable I/Os which are usually connected to LEDs. This
> is useful to immediately visually see the PTP status.
> 
> These provide two signals:
> 
>  * is_gm
> 
>    This LED can be activated if the current device is the grand master in that
>    PTP domain.
> 
>  * sync_good
> 
>    This LED can be activated if the current device is in sync with the network
>    time.
> 
> Expose these via the LED framework to be controlled via user space
> e.g. linuxptp.

Hi Kurt

Is the hardware driving these signals at all? Or are these just
suggested names in the documentation? Would it be an issue to have
user space configure them to use the heartbeat trigger, etc.?

     Andrew


* Re: [PATCH bpf-next 1/9] libbpf: generalize libbpf externs support
From: Andrii Nakryiko @ 2020-06-18 17:51 UTC (permalink / raw)
  To: Hao Luo
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Arnaldo Carvalho de Melo, Song Liu,
	Quentin Monnet
In-Reply-To: <CA+khW7i2vjHuqExnkgAYMeHe9e556pUccjZXti3DxuTjPjiQQQ@mail.gmail.com>

On Thu, Jun 18, 2020 at 12:38 AM Hao Luo <haoluo@google.com> wrote:
>
> On Wed, Jun 17, 2020 at 9:21 AM Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > Switch existing Kconfig externs to be just one of few possible kinds of more
> > generic externs. This refactoring is in preparation for ksymbol extern
> > support, added in the follow up patch. There are no functional changes
> > intended.
> >
> > Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> > ---
>
> [...]
>
> > @@ -5572,30 +5635,33 @@ static int bpf_object__resolve_externs(struct bpf_object *obj,
> >  {
> >         bool need_config = false;
> >         struct extern_desc *ext;
> > +       void *kcfg_data;
> >         int err, i;
> > -       void *data;
> >
> >         if (obj->nr_extern == 0)
> >                 return 0;
> >
> > -       data = obj->maps[obj->kconfig_map_idx].mmaped;
> > +       if (obj->kconfig_map_idx >= 0)
> > +               kcfg_data = obj->maps[obj->kconfig_map_idx].mmaped;
> >
> >         for (i = 0; i < obj->nr_extern; i++) {
> >                 ext = &obj->externs[i];
> >
> > -               if (strcmp(ext->name, "LINUX_KERNEL_VERSION") == 0) {
> > -                       void *ext_val = data + ext->data_off;
> > +               if (ext->type == EXT_KCFG &&
> > +                   strcmp(ext->name, "LINUX_KERNEL_VERSION") == 0) {
> > +                       void *ext_val = kcfg_data + ext->kcfg.data_off;
> >                         __u32 kver = get_kernel_version();
> >
> >                         if (!kver) {
> >                                 pr_warn("failed to get kernel version\n");
> >                                 return -EINVAL;
> >                         }
> > -                       err = set_ext_value_num(ext, ext_val, kver);
> > +                       err = set_kcfg_value_num(ext, ext_val, kver);
> >                         if (err)
> >                                 return err;
> > -                       pr_debug("extern %s=0x%x\n", ext->name, kver);
> > -               } else if (strncmp(ext->name, "CONFIG_", 7) == 0) {
> > +                       pr_debug("extern (kcfg) %s=0x%x\n", ext->name, kver);
> > +               } else if (ext->type == EXT_KCFG &&
> > +                          strncmp(ext->name, "CONFIG_", 7) == 0) {
> >                         need_config = true;
> >                 } else {
> >                         pr_warn("unrecognized extern '%s'\n", ext->name);
>
> Ah, we need to initialize kcfg_data, otherwise the compiler will give
> a warning on potentially uninitialized data.

yep, good catch
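
A minimal sketch of the fix pattern being agreed on here, as a userspace
model rather than the actual libbpf change: the pointer is only assigned when
a Kconfig map exists, so it is initialized to NULL up front and checked before
use. struct obj_model and kcfg_value_ptr() are hypothetical stand-ins, not
libbpf names.

```c
#include <stddef.h>

/* Hypothetical stand-in for the object being resolved; not libbpf's struct. */
struct obj_model {
	int kconfig_map_idx;	/* -1 when the object has no Kconfig map */
	char *maps[2];		/* stands in for obj->maps[i].mmaped */
};

/* Returns the address an extern value would be written to, or NULL. */
char *kcfg_value_ptr(const struct obj_model *obj, size_t data_off)
{
	/* Initialized up front: the branch below does not always run. */
	char *kcfg_data = NULL;

	if (obj->kconfig_map_idx >= 0)
		kcfg_data = obj->maps[obj->kconfig_map_idx];

	/* Without a Kconfig map there is nothing to offset into. */
	if (!kcfg_data)
		return NULL;

	return kcfg_data + data_off;
}
```

With the initializer in place the compiler can see a defined value on every
path, and a caller that reaches this code without a Kconfig map gets NULL
instead of an offset from garbage.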


* Re: [PATCH bpf-next 8/9] tools/bpftool: show info for processes holding BPF map/prog/link/btf FDs
From: Andrii Nakryiko @ 2020-06-18 17:53 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Andrii Nakryiko, bpf, Networking, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Hao Luo, Arnaldo Carvalho de Melo,
	Song Liu
In-Reply-To: <f8ba3a62-0bca-a2b3-9b17-1209c6cf42bb@isovalent.com>

On Thu, Jun 18, 2020 at 12:51 AM Quentin Monnet <quentin@isovalent.com> wrote:
>
> 2020-06-17 23:01 UTC-0700 ~ Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > On Wed, Jun 17, 2020 at 5:24 PM Quentin Monnet <quentin@isovalent.com> wrote:
> >>
> >> 2020-06-17 09:18 UTC-0700 ~ Andrii Nakryiko <andriin@fb.com>
> >>> Add bpf_iter-based way to find all the processes that hold open FDs against
> >>> BPF object (map, prog, link, btf). bpftool always attempts to discover this,
> >>> but will silently give up if kernel doesn't yet support bpf_iter BPF programs.
> >>> Process name and PID are emitted for each process (task group).
> >>>
> >>> Sample output for each of 4 BPF objects:
> >>>
> >>> $ sudo ./bpftool prog show
> >>> 2694: cgroup_device  tag 8c42dee26e8cd4c2  gpl
> >>>         loaded_at 2020-06-16T15:34:32-0700  uid 0
> >>>         xlated 648B  jited 409B  memlock 4096B
> >>>         pids systemd(1)
> >>> 2907: cgroup_skb  name egress  tag 9ad187367cf2b9e8  gpl
> >>>         loaded_at 2020-06-16T18:06:54-0700  uid 0
> >>>         xlated 48B  jited 59B  memlock 4096B  map_ids 2436
> >>>         btf_id 1202
> >>>         pids test_progs(2238417), test_progs(2238445)
> >>>
> >>> $ sudo ./bpftool map show
> >>> 2436: array  name test_cgr.bss  flags 0x400
> >>>         key 4B  value 8B  max_entries 1  memlock 8192B
> >>>         btf_id 1202
> >>>         pids test_progs(2238417), test_progs(2238445)
> >>> 2445: array  name pid_iter.rodata  flags 0x480
> >>>         key 4B  value 4B  max_entries 1  memlock 8192B
> >>>         btf_id 1214  frozen
> >>>         pids bpftool(2239612)
> >>>
> >>> $ sudo ./bpftool link show
> >>> 61: cgroup  prog 2908
> >>>         cgroup_id 375301  attach_type egress
> >>>         pids test_progs(2238417), test_progs(2238445)
> >>> 62: cgroup  prog 2908
> >>>         cgroup_id 375344  attach_type egress
> >>>         pids test_progs(2238417), test_progs(2238445)
> >>>
> >>> $ sudo ./bpftool btf show
> >>> 1202: size 1527B  prog_ids 2908,2907  map_ids 2436
> >>>         pids test_progs(2238417), test_progs(2238445)
> >>> 1242: size 34684B
> >>>         pids bpftool(2258892)
> >>>
> >>> Signed-off-by: Andrii Nakryiko <andriin@fb.com>
> >>> ---
> >>
> >> [...]
> >>
> >>> diff --git a/tools/bpf/bpftool/pids.c b/tools/bpf/bpftool/pids.c
> >>> new file mode 100644
> >>> index 000000000000..3474a91743ff
> >>> --- /dev/null
> >>> +++ b/tools/bpf/bpftool/pids.c
> >>> @@ -0,0 +1,229 @@
> >>
> >> [...]
> >>
> >>> +int build_obj_refs_table(struct obj_refs_table *table, enum bpf_obj_type type)
> >>> +{
> >>> +     char buf[4096];
> >>> +     struct pid_iter_bpf *skel;
> >>> +     struct pid_iter_entry *e;
> >>> +     int err, ret, fd = -1, i;
> >>> +     libbpf_print_fn_t default_print;
> >>> +
> >>> +     hash_init(table->table);
> >>> +     set_max_rlimit();
> >>> +
> >>> +     skel = pid_iter_bpf__open();
> >>> +     if (!skel) {
> >>> +             p_err("failed to open PID iterator skeleton");
> >>> +             return -1;
> >>> +     }
> >>> +
> >>> +     skel->rodata->obj_type = type;
> >>> +
> >>> +     /* we don't want output polluted with libbpf errors if bpf_iter is not
> >>> +      * supported
> >>> +      */
> >>> +     default_print = libbpf_set_print(libbpf_print_none);
> >>> +     err = pid_iter_bpf__load(skel);
> >>> +     libbpf_set_print(default_print);
> >>> +     if (err) {
> >>> +             /* too bad, kernel doesn't support BPF iterators yet */
> >>> +             err = 0;
> >>> +             goto out;
> >>> +     }
> >>> +     err = pid_iter_bpf__attach(skel);
> >>> +     if (err) {
> >>> +             /* if we loaded above successfully, attach has to succeed */
> >>> +             p_err("failed to attach PID iterator: %d", err);
> >>
> >> Nit: What about using strerror(err) for the error messages, here and
> >> below? It's easier to read than an integer value.
> >
> > I'm actually against it. Just a pure string message for error is often
> > quite confusing. It's an extra step, and sometimes quite a quest in
> > itself, to find what's the integer value of errno it was, just to try
> > to understand what kind of error it actually is. So I certainly prefer
> > having integer value, optionally with a string error message.
> >
> > But that's too much hassle for this "shouldn't happen" type of errors.
> > If they happen, the user is unlikely to infer anything useful and fix
> > the problem by themselves. They will most probably have to ask on the
> > mailing list and paste error messages verbatim and let people like me
> > and you try to guess what's going on. In such cases, having an errno
> > number is much more helpful.
>
> Ok, fair enough.
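
As a rough illustration of the "integer value, optionally with a string
error message" format mentioned above, the two can be combined in one message.
format_err() is a hypothetical helper written for this sketch, not a bpftool
or libbpf function; it assumes err is a negative errno value, as in
err = -errno.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes "<what>: <err> (<text>)" into buf.
 * err is expected to be a negative errno value.
 * Returns what snprintf() returns.
 */
int format_err(char *buf, size_t len, const char *what, int err)
{
	return snprintf(buf, len, "%s: %d (%s)", what, err, strerror(-err));
}
```

The number stays first so it can be pasted into a bug report verbatim, and
the text in parentheses is only a convenience for the reader.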
>
> >>> +             goto out;
> >>> +     }
> >>> +
> >>> +     fd = bpf_iter_create(bpf_link__fd(skel->links.iter));
> >>> +     if (fd < 0) {
> >>> +             err = -errno;
> >>> +             p_err("failed to create PID iterator session: %d", err);
> >>> +             goto out;
> >>> +     }
> >>> +
> >>> +     while (true) {
> >>> +             ret = read(fd, buf, sizeof(buf));
> >>> +             if (ret < 0) {
> >>> +                     err = -errno;
> >>> +                     p_err("failed to read PID iterator output: %d", err);
> >>> +                     goto out;
> >>> +             }
> >>> +             if (ret == 0)
> >>> +                     break;
> >>> +             if (ret % sizeof(*e)) {
> >>> +                     err = -EINVAL;
> >>> +                     p_err("invalid PID iterator output format");
> >>> +                     goto out;
> >>> +             }
> >>> +             ret /= sizeof(*e);
> >>> +
> >>> +             e = (void *)buf;
> >>> +             for (i = 0; i < ret; i++, e++) {
> >>> +                     add_ref(table, e);
> >>> +             }
> >>> +     }
> >>> +     err = 0;
> >>> +out:
> >>> +     if (fd >= 0)
> >>> +             close(fd);
> >>> +     pid_iter_bpf__destroy(skel);
> >>> +     return err;
> >>> +}
> >>
> >> [...]
> >>
> >>> diff --git a/tools/bpf/bpftool/skeleton/pid_iter.bpf.c b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
> >>> new file mode 100644
> >>> index 000000000000..f560e48add07
> >>> --- /dev/null
> >>> +++ b/tools/bpf/bpftool/skeleton/pid_iter.bpf.c
> >>> @@ -0,0 +1,80 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>
> >> This would make it the only file not dual-licensed GPL/BSD in bpftool.
> >> We've had issues with that before [0], although linking to libbfd is no
> >> more a hard requirement. But I see you used a dual-license in the
> >> corresponding header file pid_iter.h, so is the single license
> >> intentional here? Or would you consider GPL/BSD?
> >>
> >
> > The other BPF program (skeleton/profiler.bpf.c) is also GPL-2.0, we
> > should probably switch both.
>
> Oh I missed that one :(. Yeah, if this is possible, that would be great!
>
> >> [0] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=896165#38
> >>
> >>> +// Copyright (c) 2020 Facebook
> >>> +#include <vmlinux.h>
> >>> +#include <bpf/bpf_helpers.h>
> >>> +#include <bpf/bpf_core_read.h>
> >>> +#include <bpf/bpf_tracing.h>
> >>> +#include "pid_iter.h"
> >>
> >> [...]
> >>
> >>> +
> >>> +char LICENSE[] SEC("license") = "GPL";
> >
> > I wonder if leaving this as GPL would be ok, if the source code itself
> > is dual GPL/BSD?
>
> If the concern is to pass the verifier, it accepts a handful of
> different strings (see include/linux/license.h), one of which is "Dual
> BSD/GPL" and should probably be used in that case. Or did you have
> something else in mind?

Oh, awesome, wasn't aware of this. I'll use "Dual BSD/GPL" then.
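
For reference, a userspace model of the check being discussed. The list of
strings is reproduced from memory of license_is_gpl_compatible() in
include/linux/license.h and should be verified against the tree in use;
is_gpl_compatible_model() is a hypothetical name for this sketch, not a
kernel symbol.

```c
#include <stddef.h>
#include <string.h>

/* Strings as recalled from include/linux/license.h; verify before relying on them. */
static const char *const gpl_compatible[] = {
	"GPL",
	"GPL v2",
	"GPL and additional rights",
	"Dual BSD/GPL",
	"Dual MIT/GPL",
	"Dual MPL/GPL",
};

/* Returns 1 if the license string matches one of the entries exactly, else 0. */
int is_gpl_compatible_model(const char *license)
{
	size_t i;

	for (i = 0; i < sizeof(gpl_compatible) / sizeof(gpl_compatible[0]); i++) {
		if (strcmp(license, gpl_compatible[i]) == 0)
			return 1;
	}
	return 0;
}
```

The match is an exact string comparison, so "Dual BSD/GPL" passes while a
variant spelling such as "GPL/BSD" would not.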


* Re: [PATCH v1 2/3] net/fsl: acpize xgmac_mdio
From: Andy Shevchenko @ 2020-06-18 17:55 UTC (permalink / raw)
  To: Calvin Johnson
  Cc: Jeremy Linton, Andrew Lunn, Russell King - ARM Linux admin, Jon,
	Cristi Sovaiala, Ioana Ciornei, Florian Fainelli, Madalin Bucur,
	netdev, linux.cj
In-Reply-To: <20200618174252.GA9430@lsv03152.swis.in-blr01.nxp.com>

On Thu, Jun 18, 2020 at 8:43 PM Calvin Johnson
<calvin.johnson@oss.nxp.com> wrote:
> On Thu, Jun 18, 2020 at 07:00:20PM +0300, Andy Shevchenko wrote:
> > On Thu, Jun 18, 2020 at 6:46 PM Jeremy Linton <jeremy.linton@arm.com> wrote:

...

> > And here is the question, do we have (in form of email or other means)
> > an official response from NXP about above mentioned ID?
>
> At NXP, we've assigned NXP IDs to various controllers and "NXP0006" is
> assigned to the xgmac_mdio controller.

Thanks!

-- 
With Best Regards,
Andy Shevchenko
