Netdev List
* Re: [PATCH bpf-next v3 4/4] bpf_trace: pass array of u64 values in kprobe_multi.addrs
From: Jiri Olsa @ 2022-05-17  9:12 UTC (permalink / raw)
  To: Eugene Syromiatnikov
  Cc: Masami Hiramatsu, Steven Rostedt, Ingo Molnar, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, netdev, bpf,
	linux-kernel, Shuah Khan, linux-kselftest
In-Reply-To: <6ef675aeeea442fa8fc168cd1cb4e4e474f65a3f.1652772731.git.esyr@redhat.com>

On Tue, May 17, 2022 at 09:36:47AM +0200, Eugene Syromiatnikov wrote:
> With the interface as defined, it is impossible to pass 64-bit kernel
> addresses from a 32-bit userspace process in BPF_LINK_TYPE_KPROBE_MULTI,
> which severely limits the usability of the interface. Change the ABI
> to accept an array of u64 values instead of (kernel? user?) longs.
> Interestingly, the rest of the libbpf infrastructure uses 64-bit values
> for kallsyms addresses already, so this patch also eliminates
> the sym_addr cast in tools/lib/bpf/libbpf.c:resolve_kprobe_multi_cb().

so the problem is when we have 32bit user space on 64bit kernel, right?

I think we should keep addrs as longs in uapi and have the kernel figure out
whether it needs to read u32 or u64, like you did for symbols in the previous patch
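
A rough userspace sketch of that idea, for illustration only: the element
width of the caller's array is chosen at run time and every value is widened
or range-checked into a native unsigned long. The function name read_addrs()
and the explicit elem_size parameter are hypothetical; in the kernel the
width would presumably be derived from something like in_compat_syscall(),
and the reads would go through get_user()/copy_from_user() rather than
memcpy().

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model: copy cnt addresses from a buffer whose elements are
 * either 4 or 8 bytes wide into an array of native unsigned long.
 */
static int read_addrs(unsigned long *dst, const void *src, size_t cnt,
		      size_t elem_size)
{
	size_t i;

	for (i = 0; i < cnt; i++) {
		if (elem_size == sizeof(uint32_t)) {
			uint32_t v;

			/* 32-bit caller: widen each element. */
			memcpy(&v, (const char *)src + i * sizeof(v), sizeof(v));
			dst[i] = v;
		} else if (elem_size == sizeof(uint64_t)) {
			uint64_t v;

			memcpy(&v, (const char *)src + i * sizeof(v), sizeof(v));
			/* Reject values that do not fit a native long. */
			if (v > ULONG_MAX)
				return -EINVAL;
			dst[i] = (unsigned long)v;
		} else {
			return -EINVAL;
		}
	}
	return 0;
}
```

With this shape the uapi type can stay as it is, and only the copy-in path
needs to know which width the calling task uses.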

we'll also need to fix bpf_kprobe_multi_cookie_swap because it assumes
64bit user space pointers

would be great if we could have a selftest for this

thanks,
jirka

> 
> Fixes: 0dcac272540613d4 ("bpf: Add multi kprobe link")
> Fixes: 5117c26e877352bc ("libbpf: Add bpf_link_create support for multi kprobes")
> Fixes: ddc6b04989eb0993 ("libbpf: Add bpf_program__attach_kprobe_multi_opts function")
> Fixes: f7a11eeccb111854 ("selftests/bpf: Add kprobe_multi attach test")
> Fixes: 9271a0c7ae7a9147 ("selftests/bpf: Add attach test for bpf_program__attach_kprobe_multi_opts")
> Fixes: 2c6401c966ae1fbe ("selftests/bpf: Add kprobe_multi bpf_cookie test")
> Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
> ---
>  kernel/trace/bpf_trace.c                           | 25 ++++++++++++++++++----
>  tools/lib/bpf/bpf.h                                |  2 +-
>  tools/lib/bpf/libbpf.c                             |  8 +++----
>  tools/lib/bpf/libbpf.h                             |  2 +-
>  .../testing/selftests/bpf/prog_tests/bpf_cookie.c  |  2 +-
>  .../selftests/bpf/prog_tests/kprobe_multi_test.c   |  8 +++----
>  6 files changed, 32 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 9d3028a..30a15b3 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2454,7 +2454,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>  	void __user *ucookies;
>  	unsigned long *addrs;
>  	u32 flags, cnt, size, cookies_size;
> -	void __user *uaddrs;
> +	u64 __user *uaddrs;
>  	u64 *cookies = NULL;
>  	void __user *usyms;
>  	int err;
> @@ -2486,9 +2486,26 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr
>  		return -ENOMEM;
>  
>  	if (uaddrs) {
> -		if (copy_from_user(addrs, uaddrs, size)) {
> -			err = -EFAULT;
> -			goto error;
> +		if (sizeof(*addrs) == sizeof(*uaddrs)) {
> +			if (copy_from_user(addrs, uaddrs, size)) {
> +				err = -EFAULT;
> +				goto error;
> +			}
> +		} else {
> +			u32 i;
> +			u64 addr;
> +
> +			for (i = 0; i < cnt; i++) {
> +				if (get_user(addr, uaddrs + i)) {
> +					err = -EFAULT;
> +					goto error;
> +				}
> +				if (addr > ULONG_MAX) {
> +					err = -EINVAL;
> +					goto error;
> +				}
> +				addrs[i] = addr;
> +			}
>  		}
>  	} else {
>  		struct user_syms us;
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 2e0d373..da9c6037 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -418,7 +418,7 @@ struct bpf_link_create_opts {
>  			__u32 flags;
>  			__u32 cnt;
>  			const char **syms;
> -			const unsigned long *addrs;
> +			const __u64 *addrs;
>  			const __u64 *cookies;
>  		} kprobe_multi;
>  		struct {
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ef7f302..35fa9c5 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -10737,7 +10737,7 @@ static bool glob_match(const char *str, const char *pat)
>  
>  struct kprobe_multi_resolve {
>  	const char *pattern;
> -	unsigned long *addrs;
> +	__u64 *addrs;
>  	size_t cap;
>  	size_t cnt;
>  };
> @@ -10752,12 +10752,12 @@ resolve_kprobe_multi_cb(unsigned long long sym_addr, char sym_type,
>  	if (!glob_match(sym_name, res->pattern))
>  		return 0;
>  
> -	err = libbpf_ensure_mem((void **) &res->addrs, &res->cap, sizeof(unsigned long),
> +	err = libbpf_ensure_mem((void **) &res->addrs, &res->cap, sizeof(__u64),
>  				res->cnt + 1);
>  	if (err)
>  		return err;
>  
> -	res->addrs[res->cnt++] = (unsigned long) sym_addr;
> +	res->addrs[res->cnt++] = sym_addr;
>  	return 0;
>  }
>  
> @@ -10772,7 +10772,7 @@ bpf_program__attach_kprobe_multi_opts(const struct bpf_program *prog,
>  	};
>  	struct bpf_link *link = NULL;
>  	char errmsg[STRERR_BUFSIZE];
> -	const unsigned long *addrs;
> +	const __u64 *addrs;
>  	int err, link_fd, prog_fd;
>  	const __u64 *cookies;
>  	const char **syms;
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 9e9a3fd..76e171d 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -489,7 +489,7 @@ struct bpf_kprobe_multi_opts {
>  	/* array of function symbols to attach */
>  	const char **syms;
>  	/* array of function addresses to attach */
> -	const unsigned long *addrs;
> +	const __u64 *addrs;
>  	/* array of user-provided values fetchable through bpf_get_attach_cookie */
>  	const __u64 *cookies;
>  	/* number of elements in syms/addrs/cookies arrays */
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c b/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c
> index 83ef55e3..e843840 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_cookie.c
> @@ -140,7 +140,7 @@ static void kprobe_multi_link_api_subtest(void)
>  	cookies[6] = 7;
>  	cookies[7] = 8;
>  
> -	opts.kprobe_multi.addrs = (const unsigned long *) &addrs;
> +	opts.kprobe_multi.addrs = (const __u64 *) &addrs;
>  	opts.kprobe_multi.cnt = ARRAY_SIZE(addrs);
>  	opts.kprobe_multi.cookies = (const __u64 *) &cookies;
>  	prog_fd = bpf_program__fd(skel->progs.test_kprobe);
> diff --git a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> index 586dc52..7646112 100644
> --- a/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> +++ b/tools/testing/selftests/bpf/prog_tests/kprobe_multi_test.c
> @@ -108,7 +108,7 @@ static void test_link_api_addrs(void)
>  	GET_ADDR("bpf_fentry_test7", addrs[6]);
>  	GET_ADDR("bpf_fentry_test8", addrs[7]);
>  
> -	opts.kprobe_multi.addrs = (const unsigned long*) addrs;
> +	opts.kprobe_multi.addrs = (const __u64 *) addrs;
>  	opts.kprobe_multi.cnt = ARRAY_SIZE(addrs);
>  	test_link_api(&opts);
>  }
> @@ -186,7 +186,7 @@ static void test_attach_api_addrs(void)
>  	GET_ADDR("bpf_fentry_test7", addrs[6]);
>  	GET_ADDR("bpf_fentry_test8", addrs[7]);
>  
> -	opts.addrs = (const unsigned long *) addrs;
> +	opts.addrs = (const __u64 *) addrs;
>  	opts.cnt = ARRAY_SIZE(addrs);
>  	test_attach_api(NULL, &opts);
>  }
> @@ -244,7 +244,7 @@ static void test_attach_api_fails(void)
>  		goto cleanup;
>  
>  	/* fail_2 - both addrs and syms set */
> -	opts.addrs = (const unsigned long *) addrs;
> +	opts.addrs = (const __u64 *) addrs;
>  	opts.syms = syms;
>  	opts.cnt = ARRAY_SIZE(syms);
>  	opts.cookies = NULL;
> @@ -258,7 +258,7 @@ static void test_attach_api_fails(void)
>  		goto cleanup;
>  
>  	/* fail_3 - pattern and addrs set */
> -	opts.addrs = (const unsigned long *) addrs;
> +	opts.addrs = (const __u64 *) addrs;
>  	opts.syms = NULL;
>  	opts.cnt = ARRAY_SIZE(syms);
>  	opts.cookies = NULL;
> -- 
> 2.1.4
> 


* [PATCH 01/12] net: mana: Add support for auxiliary device
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

In preparation for supporting the MANA RDMA driver, add support for an
auxiliary device in the Ethernet driver. The RDMA device is modeled as an
auxiliary device of the Ethernet device.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma.h    |  2 +
 drivers/net/ethernet/microsoft/mana/mana.h    |  6 ++
 drivers/net/ethernet/microsoft/mana/mana_en.c | 83 ++++++++++++++++++-
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index 41ecd156e95f..d815d323be87 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -204,6 +204,8 @@ struct gdma_dev {
 
 	/* GDMA driver specific pointer */
 	void *driver_data;
+
+	struct auxiliary_device *adev;
 };
 
 #define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/drivers/net/ethernet/microsoft/mana/mana.h
index d36405af9432..51bff91b63ee 100644
--- a/drivers/net/ethernet/microsoft/mana/mana.h
+++ b/drivers/net/ethernet/microsoft/mana/mana.h
@@ -6,6 +6,7 @@
 
 #include "gdma.h"
 #include "hw_channel.h"
+#include <linux/auxiliary_bus.h>
 
 /* Microsoft Azure Network Adapter (MANA)'s definitions
  *
@@ -561,4 +562,9 @@ struct mana_tx_package {
 	struct gdma_posted_wqe_info wqe_info;
 };
 
+struct mana_adev {
+	struct auxiliary_device adev;
+	struct gdma_dev *mdev;
+};
+
 #endif /* _MANA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b7d3ba1b4d17..c706bf943e49 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -13,6 +13,18 @@
 
 #include "mana.h"
 
+static DEFINE_IDA(mana_adev_ida);
+
+int mana_adev_idx_alloc(void)
+{
+	return ida_alloc(&mana_adev_ida, GFP_KERNEL);
+}
+
+void mana_adev_idx_free(int idx)
+{
+	ida_free(&mana_adev_ida, idx);
+}
+
 /* Microsoft Azure Network Adapter (MANA) functions */
 
 static int mana_open(struct net_device *ndev)
@@ -1960,6 +1972,70 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	return err;
 }
 
+static void adev_release(struct device *dev)
+{
+	struct mana_adev *madev = container_of(dev, struct mana_adev, adev.dev);
+
+	kfree(madev);
+}
+
+static void remove_adev(struct gdma_dev *gd)
+{
+	struct auxiliary_device *adev = gd->adev;
+	int id = adev->id;
+
+	auxiliary_device_delete(adev);
+	auxiliary_device_uninit(adev);
+
+	mana_adev_idx_free(id);
+	gd->adev = NULL;
+}
+
+static int add_adev(struct gdma_dev *gd)
+{
+	int ret = 0;
+	struct mana_adev *madev;
+	struct auxiliary_device *adev;
+
+	madev = kzalloc(sizeof(*madev), GFP_KERNEL);
+	if (!madev)
+		return -ENOMEM;
+
+	adev = &madev->adev;
+	adev->id = mana_adev_idx_alloc();
+	if (adev->id < 0) {
+		ret = adev->id;
+		goto idx_fail;
+	}
+
+	adev->name = "rdma";
+	adev->dev.parent = gd->gdma_context->dev;
+	adev->dev.release = adev_release;
+	madev->mdev = gd;
+
+	ret = auxiliary_device_init(adev);
+	if (ret)
+		goto init_fail;
+
+	ret = auxiliary_device_add(adev);
+	if (ret)
+		goto add_fail;
+
+	gd->adev = adev;
+	return 0;
+
+add_fail:
+	auxiliary_device_uninit(adev);
+
+init_fail:
+	mana_adev_idx_free(adev->id);
+
+idx_fail:
+	kfree(madev);
+
+	return ret;
+}
+
 int mana_probe(struct gdma_dev *gd, bool resuming)
 {
 	struct gdma_context *gc = gd->gdma_context;
@@ -2027,6 +2103,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
 				break;
 		}
 	}
+
+	err = add_adev(gd);
 out:
 	if (err)
 		mana_remove(gd, false);
@@ -2043,6 +2121,10 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 	int err;
 	int i;
 
+	/* adev currently doesn't support suspending, always remove it */
+	if (gd->adev)
+		remove_adev(gd);
+
 	for (i = 0; i < ac->num_ports; i++) {
 		ndev = ac->ports[i];
 		if (!ndev) {
@@ -2075,7 +2157,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
 	}
 
 	mana_destroy_eq(ac);
-
 out:
 	mana_gd_deregister_device(gd);
 
-- 
2.17.1
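
For readers unfamiliar with the pattern used by adev_release() above, here is
a minimal userspace sketch of how a wrapper structure is recovered from an
embedded member. The structure definitions are simplified stand-ins, not the
kernel's, and to_mana_adev() is a hypothetical helper name. The nested member
designator in offsetof() is accepted by GCC and Clang, which is what the
kernel relies on.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for the kernel structures (assumptions). */
struct device {
	void (*release)(struct device *dev);
};

struct auxiliary_device {
	int id;
	struct device dev;
};

struct gdma_dev {
	int dummy;
};

struct mana_adev {
	struct auxiliary_device adev;
	struct gdma_dev *mdev;
};

/* Recover the wrapper from a pointer to the embedded struct device. */
static struct mana_adev *to_mana_adev(struct device *dev)
{
	return container_of(dev, struct mana_adev, adev.dev);
}
```

Because the release callback only receives the struct device pointer, this
pointer arithmetic is what lets it find and free the enclosing allocation.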



* [PATCH 02/12] net: mana: Record the physical address for doorbell page region
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

To support an RDMA device with multiple user contexts, each with its own
doorbell page, record the start address of the doorbell page region so the
RDMA driver can use it to allocate user context doorbell IDs.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma.h      | 2 ++
 drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index d815d323be87..c724ca410fcb 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -350,9 +350,11 @@ struct gdma_context {
 	struct completion	eq_test_event;
 	u32			test_event_eq_id;
 
+	phys_addr_t		bar0_pa;
 	void __iomem		*bar0_va;
 	void __iomem		*shm_base;
 	void __iomem		*db_page_base;
+	phys_addr_t		phys_db_page_base;
 	u32 db_page_size;
 
 	/* Shared memory chanenl (used to bootstrap HWC) */
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 49b85ca578b0..9fafaa0c8e76 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -27,6 +27,9 @@ static void mana_gd_init_registers(struct pci_dev *pdev)
 	gc->db_page_base = gc->bar0_va +
 				mana_gd_r64(gc, GDMA_REG_DB_PAGE_OFFSET);
 
+	gc->phys_db_page_base = gc->bar0_pa +
+				mana_gd_r64(gc, GDMA_REG_DB_PAGE_OFFSET);
+
 	gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
 }
 
@@ -1335,6 +1338,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	mutex_init(&gc->eq_test_event_mutex);
 	pci_set_drvdata(pdev, gc);
+	gc->bar0_pa = pci_resource_start(pdev, 0);
 
 	bar0_va = pci_iomap(pdev, bar, 0);
 	if (!bar0_va)
-- 
2.17.1



* [PATCH 03/12] net: mana: Handle vport sharing between devices
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

For outgoing packets, the PF requires the VF to configure the vport with
corresponding protection domain and doorbell ID for the kernel or user
context. The vport can't be shared between different contexts.

Implement the logic to exclusively take over the vport by either the
Ethernet device or RDMA device.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana.h    |  4 ++++
 drivers/net/ethernet/microsoft/mana/mana_en.c | 19 +++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/drivers/net/ethernet/microsoft/mana/mana.h
index 51bff91b63ee..26f14fcb6a61 100644
--- a/drivers/net/ethernet/microsoft/mana/mana.h
+++ b/drivers/net/ethernet/microsoft/mana/mana.h
@@ -375,6 +375,7 @@ struct mana_port_context {
 	unsigned int num_queues;
 
 	mana_handle_t port_handle;
+	atomic_t port_use_count;
 
 	u16 port_idx;
 
@@ -567,4 +568,7 @@ struct mana_adev {
 	struct gdma_dev *mdev;
 };
 
+int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
+		   u32 doorbell_pg_id);
+void mana_uncfg_vport(struct mana_port_context *apc);
 #endif /* _MANA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c706bf943e49..4f7a50ace9f6 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -530,13 +530,25 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
 	return 0;
 }
 
-static int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
-			  u32 doorbell_pg_id)
+void mana_uncfg_vport(struct mana_port_context *apc)
+{
+	atomic_dec(&apc->port_use_count);
+}
+EXPORT_SYMBOL_GPL(mana_uncfg_vport);
+
+int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
+		   u32 doorbell_pg_id)
 {
 	struct mana_config_vport_resp resp = {};
 	struct mana_config_vport_req req = {};
 	int err;
 
+	/* Ethernet driver and IB driver can't take the port at the same time */
+	if (atomic_inc_return(&apc->port_use_count) != 1) {
+		atomic_dec(&apc->port_use_count);
+		return -ENODEV;
+	}
+
 	mana_gd_init_req_hdr(&req.hdr, MANA_CONFIG_VPORT_TX,
 			     sizeof(req), sizeof(resp));
 	req.vport = apc->port_handle;
@@ -566,6 +578,7 @@ static int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
 out:
 	return err;
 }
+EXPORT_SYMBOL_GPL(mana_cfg_vport);
 
 static int mana_cfg_vport_steering(struct mana_port_context *apc,
 				   enum TRI_STATE rx,
@@ -1678,6 +1691,8 @@ static void mana_destroy_vport(struct mana_port_context *apc)
 	}
 
 	mana_destroy_txq(apc);
+
+	mana_uncfg_vport(apc);
 }
 
 static int mana_create_vport(struct mana_port_context *apc,
-- 
2.17.1



* [PATCH 04/12] net: mana: Add functions for allocating doorbell page from GDMA
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

The RDMA device needs to allocate doorbell pages for each user context.
Implement those functions and expose them for use by the RDMA driver.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma.h    | 29 ++++++++++
 .../net/ethernet/microsoft/mana/gdma_main.c   | 54 +++++++++++++++++++
 2 files changed, 83 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index c724ca410fcb..f945755760dc 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -22,11 +22,15 @@ enum gdma_request_type {
 	GDMA_GENERATE_TEST_EQE		= 10,
 	GDMA_CREATE_QUEUE		= 12,
 	GDMA_DISABLE_QUEUE		= 13,
+	GDMA_ALLOCATE_RESOURCE_RANGE	= 22,
+	GDMA_DESTROY_RESOURCE_RANGE	= 24,
 	GDMA_CREATE_DMA_REGION		= 25,
 	GDMA_DMA_REGION_ADD_PAGES	= 26,
 	GDMA_DESTROY_DMA_REGION		= 27,
 };
 
+#define GDMA_RESOURCE_DOORBELL_PAGE	27
+
 enum gdma_queue_type {
 	GDMA_INVALID_QUEUE,
 	GDMA_SQ,
@@ -568,6 +572,26 @@ struct gdma_register_device_resp {
 	u32 db_id;
 }; /* HW DATA */
 
+struct gdma_allocate_resource_range_req {
+	struct gdma_req_hdr hdr;
+	u32 resource_type;
+	u32 num_resources;
+	u32 alignment;
+	u32 allocated_resources;
+};
+
+struct gdma_allocate_resource_range_resp {
+	struct gdma_resp_hdr hdr;
+	u32 allocated_resources;
+};
+
+struct gdma_destroy_resource_range_req {
+	struct gdma_req_hdr hdr;
+	u32 resource_type;
+	u32 num_resources;
+	u32 allocated_resources;
+};
+
 /* GDMA_CREATE_QUEUE */
 struct gdma_create_queue_req {
 	struct gdma_req_hdr hdr;
@@ -676,4 +700,9 @@ void mana_gd_free_memory(struct gdma_mem_info *gmi);
 
 int mana_gd_send_request(struct gdma_context *gc, u32 req_len, const void *req,
 			 u32 resp_len, void *resp);
+
+int mana_gd_allocate_doorbell_page(struct gdma_context *gc, int *doorbell_page);
+
+int mana_gd_destroy_doorbell_page(struct gdma_context *gc, int doorbell_page);
+
 #endif /* _GDMA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 9fafaa0c8e76..86ffe0e39df0 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -153,6 +153,60 @@ void mana_gd_free_memory(struct gdma_mem_info *gmi)
 			  gmi->dma_handle);
 }
 
+int mana_gd_destroy_doorbell_page(struct gdma_context *gc, int doorbell_page)
+{
+	struct gdma_destroy_resource_range_req req = {};
+	struct gdma_resp_hdr resp = {};
+	int err;
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_RESOURCE_RANGE,
+			     sizeof(req), sizeof(resp));
+
+	req.resource_type = GDMA_RESOURCE_DOORBELL_PAGE;
+	req.num_resources = 1;
+	req.allocated_resources = doorbell_page;
+
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+	if (err || resp.status) {
+		dev_err(gc->dev,
+			"Failed to destroy doorbell page: ret %d, 0x%x\n",
+			err, resp.status);
+		return err ? err : -EPROTO;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(mana_gd_destroy_doorbell_page);
+
+int mana_gd_allocate_doorbell_page(struct gdma_context *gc,
+				   int *doorbell_page)
+{
+	struct gdma_allocate_resource_range_req req = {};
+	struct gdma_allocate_resource_range_resp resp = {};
+	int err;
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_ALLOCATE_RESOURCE_RANGE,
+			     sizeof(req), sizeof(resp));
+
+	req.resource_type = GDMA_RESOURCE_DOORBELL_PAGE;
+	req.num_resources = 1;
+	req.alignment = 0;
+	req.allocated_resources = 0; // have GDMA start searching from 0
+
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+	if (err || resp.hdr.status) { // resp.hdr.status should be >=0
+		dev_err(gc->dev,
+			"Failed to allocate doorbell page: ret %d, 0x%x\n",
+			err, resp.hdr.status);
+		return err ? err : -EPROTO;
+	}
+
+	*doorbell_page = resp.allocated_resources;
+
+	return 0;
+}
+EXPORT_SYMBOL(mana_gd_allocate_doorbell_page);
+
 static int mana_gd_create_hw_eq(struct gdma_context *gc,
 				struct gdma_queue *queue)
 {
-- 
2.17.1



* [PATCH 10/12] net: mana: Define max values for SGL entries
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

The maximum number of SGL entries should be computed from the maximum
WQE size for the intended queue type, with the corresponding OOB data
size. This guarantees the hardware queue can successfully queue requests
up to the queue depth exposed to the upper layer.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 2 +-
 include/linux/mana/gdma.h                     | 7 +++++++
 include/linux/mana/mana.h                     | 4 +---
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 928b14a7ee1f..6eb5eca5524d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -187,7 +187,7 @@ int mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	pkg.wqe_req.client_data_unit = 0;
 
 	pkg.wqe_req.num_sge = 1 + skb_shinfo(skb)->nr_frags;
-	WARN_ON_ONCE(pkg.wqe_req.num_sge > 30);
+	WARN_ON_ONCE(pkg.wqe_req.num_sge > MAX_TX_WQE_SGL_ENTRIES);
 
 	if (pkg.wqe_req.num_sge <= ARRAY_SIZE(pkg.sgl_array)) {
 		pkg.wqe_req.sgl = pkg.sgl_array;
diff --git a/include/linux/mana/gdma.h b/include/linux/mana/gdma.h
index bc8cd9528937..d6a970118f4c 100644
--- a/include/linux/mana/gdma.h
+++ b/include/linux/mana/gdma.h
@@ -436,6 +436,13 @@ struct gdma_wqe {
 #define MAX_TX_WQE_SIZE 512
 #define MAX_RX_WQE_SIZE 256
 
+#define MAX_TX_WQE_SGL_ENTRIES	((GDMA_MAX_SQE_SIZE - \
+			sizeof(struct gdma_sge) - INLINE_OOB_SMALL_SIZE) / \
+			sizeof(struct gdma_sge))
+
+#define MAX_RX_WQE_SGL_ENTRIES	((GDMA_MAX_RQE_SIZE - \
+			sizeof(struct gdma_sge)) / sizeof(struct gdma_sge))
+
 struct gdma_cqe {
 	u32 cqe_data[GDMA_COMP_DATA_SIZE / 4];
 
diff --git a/include/linux/mana/mana.h b/include/linux/mana/mana.h
index 29e14ad8b930..1cf77a03bff2 100644
--- a/include/linux/mana/mana.h
+++ b/include/linux/mana/mana.h
@@ -264,8 +264,6 @@ struct mana_cq {
 	int budget;
 };
 
-#define GDMA_MAX_RQE_SGES 15
-
 struct mana_recv_buf_oob {
 	/* A valid GDMA work request representing the data buffer. */
 	struct gdma_wqe_request wqe_req;
@@ -275,7 +273,7 @@ struct mana_recv_buf_oob {
 
 	/* SGL of the buffer going to be sent has part of the work request. */
 	u32 num_sge;
-	struct gdma_sge sgl[GDMA_MAX_RQE_SGES];
+	struct gdma_sge sgl[MAX_RX_WQE_SGL_ENTRIES];
 
 	/* Required to store the result of mana_gd_post_work_request.
 	 * gdma_posted_wqe_info.wqe_size_in_bu is required for progressing the
-- 
2.17.1
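
A worked instance of the two macros, as a standalone sketch. The values of
GDMA_MAX_SQE_SIZE, GDMA_MAX_RQE_SIZE and INLINE_OOB_SMALL_SIZE and the 16-byte
layout of struct gdma_sge are assumptions here, since the patch references
them without showing their definitions. Under those assumptions the formulas
reproduce the literals 30 and 15 that the patch removes.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values; the patch uses these names but does not show them. */
#define GDMA_MAX_SQE_SIZE	512
#define GDMA_MAX_RQE_SIZE	256
#define INLINE_OOB_SMALL_SIZE	8

/* Assumed 16-byte layout of one scatter-gather entry. */
struct gdma_sge {
	uint64_t address;
	uint32_t mem_key;
	uint32_t size;
};

#define MAX_TX_WQE_SGL_ENTRIES	((GDMA_MAX_SQE_SIZE - \
			sizeof(struct gdma_sge) - INLINE_OOB_SMALL_SIZE) / \
			sizeof(struct gdma_sge))

#define MAX_RX_WQE_SGL_ENTRIES	((GDMA_MAX_RQE_SIZE - \
			sizeof(struct gdma_sge)) / sizeof(struct gdma_sge))

/* (512 - 16 - 8) / 16 = 30 with integer division; (256 - 16) / 16 = 15. */
_Static_assert(MAX_TX_WQE_SGL_ENTRIES == 30, "TX SGL entry count");
_Static_assert(MAX_RX_WQE_SGL_ENTRIES == 15, "RX SGL entry count");
```

Deriving the limits from the sizes means they follow automatically if the
maximum WQE size or the SGE layout ever changes.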



* [PATCH 11/12] net: mana: Define and process GDMA response code GDMA_STATUS_MORE_ENTRIES
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

When doing memory registration, the PF may respond with
GDMA_STATUS_MORE_ENTRIES to indicate that a follow-up request is needed.
This is not an error and should be processed as expected.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/hw_channel.c | 2 +-
 include/linux/mana/gdma.h                        | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 609cd714dcc0..a80c14676c75 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -820,7 +820,7 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
 		goto out;
 	}
 
-	if (ctx->status_code) {
+	if (ctx->status_code && ctx->status_code != GDMA_STATUS_MORE_ENTRIES) {
 		dev_err(hwc->dev, "HWC: Failed hw_channel req: 0x%x\n",
 			ctx->status_code);
 		err = -EPROTO;
diff --git a/include/linux/mana/gdma.h b/include/linux/mana/gdma.h
index d6a970118f4c..d40f1dffca5c 100644
--- a/include/linux/mana/gdma.h
+++ b/include/linux/mana/gdma.h
@@ -9,6 +9,8 @@
 
 #include "shm_channel.h"
 
+#define GDMA_STATUS_MORE_ENTRIES	((u32)0x00000105L)
+
 /* Structures labeled with "HW DATA" are exchanged with the hardware. All of
  * them are naturally aligned and hence don't need __packed.
  */
-- 
2.17.1
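
The changed condition in mana_hwc_send_request() amounts to the following
status mapping, shown as a small standalone sketch. The helper name
hwc_status_to_errno() is hypothetical and uint32_t stands in for the kernel's
u32; the constant value is taken from the patch.

```c
#include <errno.h>
#include <stdint.h>

#define GDMA_STATUS_MORE_ENTRIES	((uint32_t)0x00000105)

/* Map a response status code to 0 or a negative errno. */
static int hwc_status_to_errno(uint32_t status_code)
{
	/* Zero is success; MORE_ENTRIES asks for a follow-up, not a failure. */
	if (status_code && status_code != GDMA_STATUS_MORE_ENTRIES)
		return -EPROTO;
	return 0;
}
```

A caller that needs to send the follow-up request still has to inspect the
raw status code itself, since both success cases map to 0 here.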



* [PATCH 12/12] RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

Add an RDMA VF driver for Microsoft Azure Network Adapter (MANA).

Signed-off-by: Long Li <longli@microsoft.com>
---
 MAINTAINERS                             |   3 +
 drivers/infiniband/Kconfig              |   1 +
 drivers/infiniband/hw/Makefile          |   1 +
 drivers/infiniband/hw/mana/Kconfig      |   7 +
 drivers/infiniband/hw/mana/Makefile     |   4 +
 drivers/infiniband/hw/mana/cq.c         |  74 +++
 drivers/infiniband/hw/mana/main.c       | 679 ++++++++++++++++++++++++
 drivers/infiniband/hw/mana/mana_ib.h    | 145 +++++
 drivers/infiniband/hw/mana/mr.c         | 133 +++++
 drivers/infiniband/hw/mana/qp.c         | 466 ++++++++++++++++
 drivers/infiniband/hw/mana/wq.c         | 111 ++++
 include/linux/mana/mana.h               |   3 +
 include/uapi/rdma/ib_user_ioctl_verbs.h |   1 +
 include/uapi/rdma/mana-abi.h            |  68 +++
 14 files changed, 1696 insertions(+)
 create mode 100644 drivers/infiniband/hw/mana/Kconfig
 create mode 100644 drivers/infiniband/hw/mana/Makefile
 create mode 100644 drivers/infiniband/hw/mana/cq.c
 create mode 100644 drivers/infiniband/hw/mana/main.c
 create mode 100644 drivers/infiniband/hw/mana/mana_ib.h
 create mode 100644 drivers/infiniband/hw/mana/mr.c
 create mode 100644 drivers/infiniband/hw/mana/qp.c
 create mode 100644 drivers/infiniband/hw/mana/wq.c
 create mode 100644 include/uapi/rdma/mana-abi.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 268c68dc40dc..5185532c0fd2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9078,6 +9078,7 @@ M:	Haiyang Zhang <haiyangz@microsoft.com>
 M:	Stephen Hemminger <sthemmin@microsoft.com>
 M:	Wei Liu <wei.liu@kernel.org>
 M:	Dexuan Cui <decui@microsoft.com>
+M:	Long Li <longli@microsoft.com>
 L:	linux-hyperv@vger.kernel.org
 S:	Supported
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
@@ -9095,6 +9096,7 @@ F:	arch/x86/kernel/cpu/mshyperv.c
 F:	drivers/clocksource/hyperv_timer.c
 F:	drivers/hid/hid-hyperv.c
 F:	drivers/hv/
+F:	drivers/infiniband/hw/mana/
 F:	drivers/input/serio/hyperv-keyboard.c
 F:	drivers/iommu/hyperv-iommu.c
 F:	drivers/net/ethernet/microsoft/
@@ -9110,6 +9112,7 @@ F:	include/clocksource/hyperv_timer.h
 F:	include/linux/hyperv.h
 F:	include/mana/
 F:	include/uapi/linux/hyperv.h
+F:	include/uapi/rdma/mana-abi.h
 F:	net/vmw_vsock/hyperv_transport.c
 F:	tools/hv/
 
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 33d3ce9c888e..a062c662ecff 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -83,6 +83,7 @@ source "drivers/infiniband/hw/qib/Kconfig"
 source "drivers/infiniband/hw/cxgb4/Kconfig"
 source "drivers/infiniband/hw/efa/Kconfig"
 source "drivers/infiniband/hw/irdma/Kconfig"
+source "drivers/infiniband/hw/mana/Kconfig"
 source "drivers/infiniband/hw/mlx4/Kconfig"
 source "drivers/infiniband/hw/mlx5/Kconfig"
 source "drivers/infiniband/hw/ocrdma/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index fba0b3be903e..f62e9e00c780 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_INFINIBAND_QIB)		+= qib/
 obj-$(CONFIG_INFINIBAND_CXGB4)		+= cxgb4/
 obj-$(CONFIG_INFINIBAND_EFA)		+= efa/
 obj-$(CONFIG_INFINIBAND_IRDMA)		+= irdma/
+obj-$(CONFIG_MANA_INFINIBAND)		+= mana/
 obj-$(CONFIG_MLX4_INFINIBAND)		+= mlx4/
 obj-$(CONFIG_MLX5_INFINIBAND)		+= mlx5/
 obj-$(CONFIG_INFINIBAND_OCRDMA)		+= ocrdma/
diff --git a/drivers/infiniband/hw/mana/Kconfig b/drivers/infiniband/hw/mana/Kconfig
new file mode 100644
index 000000000000..b3ff03a23257
--- /dev/null
+++ b/drivers/infiniband/hw/mana/Kconfig
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config MANA_INFINIBAND
+	tristate "Microsoft Azure Network Adapter support"
+	depends on NETDEVICES && ETHERNET && PCI && MICROSOFT_MANA
+	help
+	  This driver provides low-level RDMA support for
+	  Microsoft Azure Network Adapter (MANA).
diff --git a/drivers/infiniband/hw/mana/Makefile b/drivers/infiniband/hw/mana/Makefile
new file mode 100644
index 000000000000..a799fe264c5a
--- /dev/null
+++ b/drivers/infiniband/hw/mana/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_MANA_INFINIBAND) += mana_ib.o
+
+mana_ib-y := main.o wq.o qp.o cq.o mr.o
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
new file mode 100644
index 000000000000..0eac77c97658
--- /dev/null
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
+		struct ib_udata *udata)
+{
+	struct mana_ib_create_cq ucmd = {};
+	struct ib_device *ibdev = ibcq->device;
+	struct mana_ib_dev *mdev =
+		container_of(ibdev, struct mana_ib_dev, ib_dev);
+	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+	int err;
+
+	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+	if (err) {
+		pr_err("Failed to copy from udata for create cq, %d\n", err);
+		return -EFAULT;
+	}
+
+	if (attr->cqe > MAX_SEND_BUFFERS_PER_QUEUE) {
+		pr_err("CQE %d exceeding limit\n", attr->cqe);
+		return -EINVAL;
+	}
+	cq->cqe = attr->cqe;
+
+	pr_debug("ucmd buf_addr 0x%llx\n", ucmd.buf_addr);
+
+	cq->umem = ib_umem_get(ibdev, ucmd.buf_addr,
+			       cq->cqe * COMP_ENTRY_SIZE,
+			       IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(cq->umem)) {
+		err = PTR_ERR(cq->umem);
+		pr_err("Failed to get umem for create cq, err %d\n", err);
+		return err;
+	}
+
+	err = mana_ib_gd_create_dma_region(mdev, cq->umem, &cq->gdma_region,
+					   PAGE_SIZE);
+	if (err) {
+		pr_err("Failed to create dma region for create cq, %d\n", err);
+		goto err_release_umem;
+	}
+
+	pr_debug("%s: mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+		__func__, err, cq->gdma_region);
+
+	/*
+	 * The CQ ID is not known at this time.
+	 * The ID is generated at create_qp.
+	 */
+
+	return 0;
+
+err_release_umem:
+	ib_umem_release(cq->umem);
+	return err;
+}
+
+int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
+{
+	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+	struct ib_device *ibdev = ibcq->device;
+	struct mana_ib_dev *mdev =
+		container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
+	ib_umem_release(cq->umem);
+
+	return 0;
+}
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
new file mode 100644
index 000000000000..e288495e3ede
--- /dev/null
+++ b/drivers/infiniband/hw/mana/main.c
@@ -0,0 +1,679 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+MODULE_DESCRIPTION("Microsoft Azure Network Adapter IB driver");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static const struct auxiliary_device_id mana_id_table[] = {
+	{ .name = "mana.rdma", },
+	{},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mana_id_table);
+
+void mana_ib_uncfg_vport(struct mana_ib_dev *dev,
+			 struct mana_ib_pd *pd, u32 port)
+{
+	struct gdma_dev *gd = dev->gdma_dev;
+	struct mana_context *mc = gd->driver_data;
+	struct net_device *ndev;
+	struct mana_port_context *mpc;
+
+	ndev = mc->ports[port];
+	mpc = netdev_priv(ndev);
+
+	if (atomic_dec_and_test(&pd->vport_use_count))
+		mana_uncfg_vport(mpc);
+}
+
+int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
+		      u32 doorbell_id)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct mana_context *mc = mdev->driver_data;
+	struct net_device *ndev = mc->ports[port];
+	struct mana_port_context *mpc = netdev_priv(ndev);
+
+	int err;
+
+	if (atomic_inc_return(&pd->vport_use_count) > 1) {
+		pr_debug("Skip as this PD already has a vport configured\n");
+		return 0;
+	}
+
+	err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
+	if (err) {
+		pr_err("mana_cfg_vport err %d\n", err);
+		atomic_dec(&pd->vport_use_count);
+		return err;
+	}
+
+	pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
+	pd->tx_vp_offset = mpc->tx_vp_offset;
+
+	pr_debug("vport handle %llx pdid %x doorbell_id %x "
+		 "tx_shortform_allowed %d tx_vp_offset %u\n",
+		 mpc->port_handle, pd->pdn, doorbell_id,
+		 pd->tx_shortform_allowed, pd->tx_vp_offset);
+
+	return 0;
+}
+
+static int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+	struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+	struct ib_device *ibdev = ibpd->device;
+	struct mana_ib_dev *dev =
+		container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+	int ret;
+	enum gdma_pd_flags flags = 0;
+
+	// Set flags if this is a kernel request
+	if (ibpd->uobject == NULL)
+		flags = GDMA_PD_FLAG_ALLOW_GPA_MR | GDMA_PD_FLAG_ALLOW_FMR_MR;
+
+	ret = mana_ib_gd_create_pd(dev, &pd->pd_handle, &pd->pdn, flags);
+	if (ret)
+		pr_err("Failed to get pd id, err %d\n", ret);
+
+	return ret;
+}
+
+static int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+	struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+	struct ib_device *ibdev = ibpd->device;
+	struct mana_ib_dev *dev = container_of(
+			ibdev, struct mana_ib_dev, ib_dev);
+
+	return mana_ib_gd_destroy_pd(dev, pd->pd_handle);
+}
+
+static int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
+				  struct ib_udata *udata)
+{
+	struct mana_ib_ucontext *ucontext =
+		container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+	struct ib_device *ibdev = ibcontext->device;
+	struct mana_ib_dev *mdev =
+			container_of(ibdev, struct mana_ib_dev, ib_dev);
+	struct gdma_dev *dev = mdev->gdma_dev;
+	struct gdma_context *gc = dev->gdma_context;
+	int doorbell_page;
+	int ret;
+
+	// Allocate a doorbell page index
+	ret = mana_gd_allocate_doorbell_page(gc, &doorbell_page);
+	if (ret) {
+		pr_err("Failed to allocate doorbell page %d\n", ret);
+		return -ENOMEM;
+	}
+
+	pr_debug("Doorbell page allocated %d\n", doorbell_page);
+
+	ucontext->doorbell = doorbell_page;
+
+	return 0;
+}
+
+static void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
+{
+	struct mana_ib_ucontext *mana_ucontext =
+		container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+	struct ib_device *ibdev = ibcontext->device;
+	struct mana_ib_dev *mdev =
+		container_of(ibdev, struct mana_ib_dev, ib_dev);
+	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
+	int ret;
+
+	ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
+	if (ret)
+		pr_err("Failed to destroy doorbell page %d\n", ret);
+}
+
+static inline enum atb_page_size mana_ib_get_atb_page_size(u64 page_sz)
+{
+	int pos = 0;
+
+	page_sz = (page_sz >> 12); //start with 4k
+
+	while (page_sz) {
+		pos++;
+		page_sz = (page_sz >> 1);
+	}
+	return (enum atb_page_size)(pos - 1);
+}
+
+static int _mana_ib_gd_create_dma_region(struct mana_ib_dev *dev,
+					 const dma_addr_t *page_addr_array,
+					 size_t num_pages_total,
+					 u64 address, u64 length,
+					 mana_handle_t *gdma_region,
+					 u64 page_sz)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct gdma_context *gc = mdev->gdma_context;
+	struct hw_channel_context *hwc = gc->hwc.driver_data;
+	size_t num_pages_cur, num_pages_to_handle;
+	unsigned int create_req_msg_size;
+	unsigned int i;
+	struct gdma_dma_region_add_pages_req *add_req = NULL;
+	int err;
+
+	struct gdma_create_dma_region_req *create_req;
+	struct gdma_create_dma_region_resp create_resp = {};
+
+	size_t max_pgs_create_cmd = (hwc->max_req_msg_size -
+				     sizeof(*create_req)) / sizeof(u64);
+
+	num_pages_to_handle = min_t(size_t, num_pages_total,
+				    max_pgs_create_cmd);
+	create_req_msg_size = struct_size(create_req, page_addr_list,
+					  num_pages_to_handle);
+
+	create_req = kzalloc(create_req_msg_size, GFP_KERNEL);
+	if (!create_req)
+		return -ENOMEM;
+
+	mana_gd_init_req_hdr(&create_req->hdr, GDMA_CREATE_DMA_REGION,
+			     create_req_msg_size, sizeof(create_resp));
+
+	create_req->length = length;
+	create_req->offset_in_page = address & (page_sz - 1);
+	create_req->gdma_page_type = mana_ib_get_atb_page_size(page_sz);
+	create_req->page_count = num_pages_total;
+	create_req->page_addr_list_len = num_pages_to_handle;
+
+	pr_debug("size_dma_region %llu num_pages_total %zu, "
+		 "page_sz 0x%llx offset_in_page %u\n",
+		 length, num_pages_total, page_sz, create_req->offset_in_page);
+
+	pr_debug("num_pages_to_handle %zu, gdma_page_type %u\n",
+		 num_pages_to_handle, create_req->gdma_page_type);
+
+	for (i = 0; i < num_pages_to_handle; ++i) {
+		dma_addr_t cur_addr = page_addr_array[i];
+
+		create_req->page_addr_list[i] = cur_addr;
+
+		pr_debug("page num %u cur_addr %pad\n", i, &cur_addr);
+	}
+
+	err = mana_gd_send_request(gc, create_req_msg_size, create_req,
+				   sizeof(create_resp), &create_resp);
+	kfree(create_req);
+	if (err || create_resp.hdr.status) {
+		dev_err(gc->dev, "Failed to create DMA region: %d, 0x%x\n",
+			err, create_resp.hdr.status);
+		if (!err)
+			err = -EPROTO;
+		goto error;
+	}
+	*gdma_region = create_resp.dma_region_handle;
+	pr_debug("Created DMA region with handle 0x%llx\n", *gdma_region);
+
+	num_pages_cur = num_pages_to_handle;
+
+	if (num_pages_cur < num_pages_total) {
+
+		unsigned int add_req_msg_size;
+		size_t max_pgs_add_cmd = (hwc->max_req_msg_size -
+					  sizeof(*add_req)) / sizeof(u64);
+
+		num_pages_to_handle = min_t(size_t,
+					    num_pages_total - num_pages_cur,
+					    max_pgs_add_cmd);
+
+		// Calculate the max num of pages that will be handled
+		add_req_msg_size = struct_size(add_req, page_addr_list,
+					       num_pages_to_handle);
+
+		add_req = kmalloc(add_req_msg_size, GFP_KERNEL);
+		if (!add_req) {
+			err = -ENOMEM;
+			goto error;
+		}
+
+		while (num_pages_cur < num_pages_total) {
+			struct gdma_general_resp add_resp = {};
+			u32 expected_status;
+			int expected_ret;
+
+			if (num_pages_cur + num_pages_to_handle <
+					num_pages_total) {
+				// This value means that more pages are needed
+				expected_status = GDMA_STATUS_MORE_ENTRIES;
+				expected_ret = 0x0;
+			} else {
+				expected_status = 0x0;
+				expected_ret = 0x0;
+			}
+
+			memset(add_req, 0, add_req_msg_size);
+
+			mana_gd_init_req_hdr(&add_req->hdr,
+					     GDMA_DMA_REGION_ADD_PAGES,
+					     add_req_msg_size,
+					     sizeof(add_resp));
+			add_req->dma_region_handle = *gdma_region;
+			add_req->page_addr_list_len = num_pages_to_handle;
+
+			for (i = 0; i < num_pages_to_handle; ++i) {
+				dma_addr_t cur_addr =
+					page_addr_array[num_pages_cur + i];
+
+				add_req->page_addr_list[i] = cur_addr;
+
+				pr_debug("page_addr_list %zu addr %pad\n",
+					 num_pages_cur + i, &cur_addr);
+			}
+
+			err = mana_gd_send_request(gc, add_req_msg_size,
+						   add_req, sizeof(add_resp),
+						   &add_resp);
+			if (err != expected_ret ||
+			    add_resp.hdr.status != expected_status) {
+				dev_err(gc->dev,
+					"Failed to put DMA pages %u: %d,0x%x\n",
+					i, err, add_resp.hdr.status);
+				err = -EPROTO;
+				goto free_req;
+			}
+
+			num_pages_cur += num_pages_to_handle;
+			num_pages_to_handle = min_t(size_t,
+						    num_pages_total -
+							num_pages_cur,
+						    max_pgs_add_cmd);
+			add_req_msg_size = struct_size(add_req, page_addr_list,
+						       num_pages_to_handle);
+		}
+free_req:
+		kfree(add_req);
+	}
+
+error:
+	return err;
+}
+
+
+int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct ib_umem *umem,
+				 mana_handle_t *dma_region_handle, u64 page_sz)
+{
+	size_t num_pages = ib_umem_num_dma_blocks(umem, page_sz);
+	struct ib_block_iter biter;
+	dma_addr_t *page_addr_array;
+	unsigned int i = 0;
+	int err;
+
+	pr_debug("num pages %zu umem->address 0x%lx\n",
+		 num_pages, umem->address);
+
+	page_addr_array = kmalloc_array(num_pages,
+					sizeof(*page_addr_array), GFP_KERNEL);
+	if (!page_addr_array)
+		return -ENOMEM;
+
+	rdma_umem_for_each_dma_block(umem, &biter, page_sz)
+		page_addr_array[i++] = rdma_block_iter_dma_address(&biter);
+
+	err = _mana_ib_gd_create_dma_region(dev, page_addr_array, num_pages,
+					    umem->address, umem->length,
+					    dma_region_handle, page_sz);
+
+	kfree(page_addr_array);
+
+	return err;
+}
+
+int mana_ib_gd_destroy_dma_region(struct mana_ib_dev *dev, u64 gdma_region)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct gdma_context *gc = mdev->gdma_context;
+
+	pr_debug("%s: destroy dma region 0x%llx\n", __func__, gdma_region);
+
+	return mana_gd_destroy_dma_region(gc, gdma_region);
+}
+
+int mana_ib_gd_create_pd(struct mana_ib_dev *dev, u64 *pd_handle, u32 *pd_id,
+			 enum gdma_pd_flags flags)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct gdma_context *gc = mdev->gdma_context;
+	int err;
+
+	struct gdma_create_pd_req req = {};
+	struct gdma_create_pd_resp resp = {};
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_PD,
+			     sizeof(req), sizeof(resp));
+
+	req.flags = flags;
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+
+	if (!err && !resp.hdr.status) {
+		*pd_handle = resp.pd_handle;
+		*pd_id = resp.pd_id;
+		pr_debug("pd_handle 0x%llx pd_id %d\n", *pd_handle, *pd_id);
+	} else {
+		pr_err("Failed to get pd_id err %d status %u\n",
+		       err, resp.hdr.status);
+		if (!err)
+			err = -EPROTO;
+	}
+	return err;
+}
+
+int mana_ib_gd_destroy_pd(struct mana_ib_dev *dev, u64 pd_handle)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct gdma_context *gc = mdev->gdma_context;
+	int err;
+
+	struct gdma_destroy_pd_req req = {};
+	struct gdma_destory_pd_resp resp = {};
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_PD,
+			     sizeof(req), sizeof(resp));
+
+	req.pd_handle = pd_handle;
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+
+	if (err || resp.hdr.status) {
+		pr_err("Failed to destroy pd_handle 0x%llx err %d status %u\n",
+				pd_handle, err, resp.hdr.status);
+		if (!err)
+			err = -EPROTO;
+	}
+
+	return err;
+}
+
+int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr *mr,
+			 struct gdma_create_mr_params *mr_params)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct gdma_context *gc = mdev->gdma_context;
+	int err;
+
+	struct gdma_create_mr_request req = {};
+	struct gdma_create_mr_response resp = {};
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_MR,
+			     sizeof(req), sizeof(resp));
+	req.pd_handle = mr_params->pd_handle;
+
+	switch (mr_params->mr_type) {
+	case GDMA_MR_TYPE_GVA:
+		req.mr_type = GDMA_MR_TYPE_GVA;
+		req.gva.dma_region_handle = mr_params->gva.dma_region_handle;
+		req.gva.virtual_address = mr_params->gva.virtual_address;
+		req.gva.access_flags = mr_params->gva.access_flags;
+		break;
+
+	case GDMA_MR_TYPE_GPA:
+		req.mr_type = GDMA_MR_TYPE_GPA;
+		req.gpa.access_flags = mr_params->gpa.access_flags;
+		break;
+
+	case GDMA_MR_TYPE_FMR:
+		req.mr_type = GDMA_MR_TYPE_FMR;
+		req.fmr.page_size = mr_params->fmr.page_size;
+		req.fmr.reserved_pte_count = mr_params->fmr.reserved_pte_count;
+		break;
+
+	default:
+		pr_warn("invalid param (GDMA_MR_TYPE) passed, "
+			"mr_type %d\n", mr_params->mr_type);
+		err = -EINVAL;
+		goto error;
+	}
+
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+	if (err || resp.hdr.status) {
+		pr_err("Failed to create mr %d, %u\n", err, resp.hdr.status);
+		if (!err)
+			err = -EPROTO;
+		goto error;
+	}
+
+	mr->ibmr.lkey = resp.lkey;
+	mr->ibmr.rkey = resp.rkey;
+	mr->mr_handle = resp.mr_handle;
+
+	return 0;
+error:
+	return err;
+}
+
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, mana_handle_t mr_handle)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct gdma_context *gc = mdev->gdma_context;
+	int err;
+
+	struct gdma_destroy_mr_response resp = {};
+	struct gdma_destroy_mr_request req = {};
+
+	mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_MR,
+			     sizeof(req), sizeof(resp));
+
+	req.mr_handle = mr_handle;
+
+	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+	if (err || resp.hdr.status) {
+		dev_err(gc->dev, "Failed to destroy MR: %d, 0x%x\n", err,
+			resp.hdr.status);
+		if (!err)
+			err = -EPROTO;
+		return err;
+	}
+
+	return 0;
+}
+
+
+static int mana_ib_mmap(struct ib_ucontext *ibcontext, struct vm_area_struct *vma)
+{
+	struct mana_ib_ucontext *mana_ucontext =
+		container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+	struct ib_device *ibdev = ibcontext->device;
+	struct mana_ib_dev *mdev =
+		container_of(ibdev, struct mana_ib_dev, ib_dev);
+	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
+	pgprot_t prot;
+	phys_addr_t pfn;
+	int ret;
+
+	// map to the page indexed by ucontext->doorbell
+	pfn = (gc->phys_db_page_base +
+	       gc->db_page_size * mana_ucontext->doorbell) >> PAGE_SHIFT;
+	prot = pgprot_writecombine(vma->vm_page_prot);
+
+	ret = rdma_user_mmap_io(ibcontext, vma, pfn, gc->db_page_size,
+			prot, NULL);
+	if (ret)
+		pr_err("can't rdma_user_mmap_io ret %d\n", ret);
+	else
+		pr_debug("mapped I/O pfn 0x%llx page_size %u, ret %d\n",
+			 pfn, gc->db_page_size, ret);
+
+	return ret;
+}
+
+static int mana_ib_get_port_immutable(struct ib_device *ibdev, u32 port_num,
+				      struct ib_port_immutable *immutable)
+{
+	/*
+	 * This version only supports RAW_PACKET
+	 * other values need to be filled for other types
+	 */
+	immutable->core_cap_flags = RDMA_CORE_PORT_RAW_PACKET;
+
+	return 0;
+}
+
+static int mana_ib_query_device(struct ib_device *ibdev,
+				struct ib_device_attr *props,
+				struct ib_udata *uhw)
+{
+	props->max_qp = MANA_MAX_NUM_QUEUES;
+	props->max_qp_wr = MAX_SEND_BUFFERS_PER_QUEUE;
+
+	/*
+	 * max_cqe could be potentially much bigger.
+	 * As this version of the driver only supports RAW QP, set it to the same
+	 * value as max_qp_wr
+	 */
+	props->max_cqe = MAX_SEND_BUFFERS_PER_QUEUE;
+
+	props->max_mr_size = MANA_IB_MAX_MR_SIZE;
+	props->max_mr = INT_MAX;
+	props->max_send_sge = MAX_TX_WQE_SGL_ENTRIES;
+	props->max_recv_sge = MAX_RX_WQE_SGL_ENTRIES;
+
+	return 0;
+}
+
+int mana_ib_query_port(struct ib_device *ibdev, u32 port,
+		       struct ib_port_attr *props)
+{
+	/* This version doesn't return port properties */
+	return 0;
+}
+
+static int mana_ib_query_gid(struct ib_device *ibdev, u32 port, int index,
+			     union ib_gid *gid)
+{
+	/* This version doesn't return GID properties */
+	return 0;
+}
+
+static void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext)
+{
+}
+
+static const struct ib_device_ops mana_ib_dev_ops = {
+	.owner = THIS_MODULE,
+	.driver_id = RDMA_DRIVER_MANA,
+	.uverbs_abi_ver = MANA_IB_UVERBS_ABI_VERSION,
+
+	.alloc_pd = mana_ib_alloc_pd,
+	.dealloc_pd = mana_ib_dealloc_pd,
+
+	.alloc_ucontext = mana_ib_alloc_ucontext,
+	.dealloc_ucontext = mana_ib_dealloc_ucontext,
+
+	.create_cq = mana_ib_create_cq,
+	.destroy_cq = mana_ib_destroy_cq,
+
+	.create_qp = mana_ib_create_qp,
+	.modify_qp = mana_ib_modify_qp,
+	.destroy_qp = mana_ib_destroy_qp,
+
+	.disassociate_ucontext = mana_ib_disassociate_ucontext,
+
+	.mmap = mana_ib_mmap,
+
+	.reg_user_mr = mana_ib_reg_user_mr,
+	.dereg_mr = mana_ib_dereg_mr,
+
+	.create_wq = mana_ib_create_wq,
+	.modify_wq = mana_ib_modify_wq,
+	.destroy_wq = mana_ib_destroy_wq,
+
+	.create_rwq_ind_table = mana_ib_create_rwq_ind_table,
+	.destroy_rwq_ind_table = mana_ib_destroy_rwq_ind_table,
+
+	.get_port_immutable = mana_ib_get_port_immutable,
+	.query_device = mana_ib_query_device,
+	.query_port = mana_ib_query_port,
+	.query_gid = mana_ib_query_gid,
+
+	INIT_RDMA_OBJ_SIZE(ib_cq, mana_ib_cq, ibcq),
+	INIT_RDMA_OBJ_SIZE(ib_pd, mana_ib_pd, ibpd),
+	INIT_RDMA_OBJ_SIZE(ib_qp, mana_ib_qp, ibqp),
+	INIT_RDMA_OBJ_SIZE(ib_ucontext, mana_ib_ucontext, ibucontext),
+	INIT_RDMA_OBJ_SIZE(ib_rwq_ind_table, mana_ib_rwq_ind_table,
+			   ib_ind_table),
+};
+
+static int mana_ib_probe(struct auxiliary_device *adev,
+			 const struct auxiliary_device_id *id)
+{
+	struct mana_adev *madev = container_of(adev, struct mana_adev, adev);
+	struct gdma_dev *mdev = madev->mdev;
+	struct mana_context *mc = mdev->driver_data;
+	struct mana_ib_dev *dev;
+	int ret = 0;
+
+	dev = ib_alloc_device(mana_ib_dev, ib_dev);
+	if (!dev)
+		return -ENOMEM;
+
+
+	ib_set_device_ops(&dev->ib_dev, &mana_ib_dev_ops);
+
+	dev->ib_dev.phys_port_cnt = mc->num_ports;
+
+	pr_debug("mdev=%p id=%d num_ports=%d\n",
+			mdev, mdev->dev_id.as_uint32,
+			dev->ib_dev.phys_port_cnt);
+
+	dev->gdma_dev = mdev;
+	dev->ib_dev.node_type = RDMA_NODE_IB_CA;
+
+	/*
+	 * num_comp_vectors needs to be set to the max MSIX index
+	 * when interrupts and event queues are implemented
+	 */
+	dev->ib_dev.num_comp_vectors = 1;
+	dev->ib_dev.dev.parent = mdev->gdma_context->dev;
+
+	ret = ib_register_device(&dev->ib_dev, "mana_%d",
+			mdev->gdma_context->dev);
+	if (ret) {
+		ib_dealloc_device(&dev->ib_dev);
+		return ret;
+	}
+
+	dev_set_drvdata(&adev->dev, dev);
+
+	return 0;
+}
+
+static void mana_ib_remove(struct auxiliary_device *adev)
+{
+	struct mana_ib_dev *dev = dev_get_drvdata(&adev->dev);
+
+	ib_unregister_device(&dev->ib_dev);
+	ib_dealloc_device(&dev->ib_dev);
+}
+
+static struct auxiliary_driver mana_driver = {
+	.name = "rdma",
+	.probe = mana_ib_probe,
+	.remove = mana_ib_remove,
+	.id_table = mana_id_table,
+};
+
+static int __init mana_ib_init(void)
+{
+	int ret = auxiliary_driver_register(&mana_driver);
+
+	return ret;
+}
+
+static void __exit mana_ib_cleanup(void)
+{
+	auxiliary_driver_unregister(&mana_driver);
+}
+
+module_init(mana_ib_init);
+module_exit(mana_ib_cleanup);
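Note on the DMA region setup in main.c above: the page size passed to the hardware is encoded as an index (log2 of the page size divided by 4K), and a region with more pages than fit in one create request is completed with follow-up add-pages requests. The following is a minimal userspace sketch of that arithmetic only, not the driver code. The names atb_page_size_index() and dma_region_request_count() and the max_create/max_add parameters are hypothetical stand-ins for the per-message page limits the driver derives from hwc->max_req_msg_size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the page size encoding: returns
 * floor(log2(page_sz / 4K)), i.e. 0 for 4K, 1 for 8K, ... 9 for 2M.
 * Sizes below 4K yield -1, because the loop never runs and pos stays 0.
 */
static int atb_page_size_index(uint64_t page_sz)
{
	int pos = 0;

	page_sz >>= 12;		/* start with 4K */
	while (page_sz) {
		pos++;
		page_sz >>= 1;
	}
	return pos - 1;
}

/*
 * Hypothetical model of the request chunking: one create request that
 * carries up to max_create page addresses, followed by as many add-pages
 * requests as needed, each carrying up to max_add page addresses.
 */
static size_t dma_region_request_count(size_t num_pages, size_t max_create,
				       size_t max_add)
{
	size_t remaining;

	assert(max_add > 0);	/* assumed: at least one page fits per request */

	if (num_pages <= max_create)
		return 1;

	remaining = num_pages - max_create;
	return 1 + (remaining + max_add - 1) / max_add;	/* round up */
}
```

For example, with room for 16 addresses in the create request and 8 in each add-pages request, a 33-page region needs one create request plus three add-pages requests. Only the last of those is expected to complete with status 0; the earlier ones are expected to return GDMA_STATUS_MORE_ENTRIES.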
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
new file mode 100644
index 000000000000..0f2ec882f0a2
--- /dev/null
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -0,0 +1,145 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2022 Microsoft Corporation. All rights reserved.
+ */
+
+#ifndef _MANA_IB_H_
+#define _MANA_IB_H_
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_mad.h>
+#include <rdma/ib_umem.h>
+#include <linux/auxiliary_bus.h>
+#include <rdma/mana-abi.h>
+
+#include <linux/mana/mana.h>
+
+#define PAGE_SZ_BM (SZ_4K | SZ_8K | SZ_16K | SZ_32K | SZ_64K | SZ_128K \
+		    | SZ_256K | SZ_512K | SZ_1M | SZ_2M)
+
+// Maximum size of a memory registration is 1G bytes
+#define MANA_IB_MAX_MR_SIZE	(1024 * 1024 * 1024)
+
+struct mana_ib_dev {
+	struct ib_device        ib_dev;
+	struct gdma_dev         *gdma_dev;
+};
+
+struct mana_ib_wq {
+	struct ib_wq    ibwq;
+	struct ib_umem  *umem;
+	int             wqe;
+	u32		wq_buf_size;
+	u64             gdma_region;
+	u64             id;
+	mana_handle_t	rx_object;
+};
+
+struct mana_ib_pd {
+	struct ib_pd	ibpd;
+	u32		pdn;
+	mana_handle_t	pd_handle;
+	atomic_t	vport_use_count;
+	bool		tx_shortform_allowed;
+	u32		tx_vp_offset;
+};
+
+struct mana_ib_mr {
+	struct ib_mr	ibmr;
+	struct ib_umem	*umem;
+	mana_handle_t	mr_handle;
+};
+
+struct mana_ib_cq {
+	struct ib_cq	ibcq;
+	struct ib_umem	*umem;
+	int		cqe;
+	u64		gdma_region;
+	u64		id;
+};
+
+struct mana_ib_qp {
+	struct ib_qp ibqp;
+
+	// Send queue info
+	struct ib_umem *sq_umem;
+	int	sqe;
+	u64	sq_gdma_region;
+	u64	sq_id;
+
+	// Set if this QP uses ind_table for receive queues
+
+	mana_handle_t tx_object;
+
+	// the port on the IB device, starting with 1
+	u32	port;
+};
+
+struct mana_ib_ucontext {
+	struct ib_ucontext	ibucontext;
+	u32 doorbell;
+};
+
+struct mana_ib_rwq_ind_table {
+	struct ib_rwq_ind_table ib_ind_table;
+};
+
+int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev,
+				 struct ib_umem *umem,
+				 mana_handle_t *gdma_region, u64 page_sz);
+
+int mana_ib_gd_destroy_dma_region(struct mana_ib_dev *dev,
+				  mana_handle_t gdma_region);
+
+struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
+				struct ib_wq_init_attr *init_attr,
+				struct ib_udata *udata);
+
+int mana_ib_modify_wq(struct ib_wq *wq, struct ib_wq_attr *wq_attr,
+		      u32 wq_attr_mask, struct ib_udata *udata);
+
+int mana_ib_destroy_wq(struct ib_wq *ibwq, struct ib_udata *udata);
+
+int mana_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
+				 struct ib_rwq_ind_table_init_attr *init_attr,
+				 struct ib_udata *udata);
+
+int mana_ib_destroy_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_tbl);
+
+struct ib_mr *mana_ib_get_dma_mr(struct ib_pd *ibpd, int access_flags);
+
+struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
+				  u64 iova, int access_flags,
+				  struct ib_udata *udata);
+
+int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
+
+int mana_ib_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *qp_init_attr,
+		      struct ib_udata *udata);
+
+
+int mana_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		      int attr_mask, struct ib_udata *udata);
+
+int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata);
+
+int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port_id,
+		      struct mana_ib_pd *pd, u32 doorbell_id);
+void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
+			 u32 port);
+
+int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
+		      struct ib_udata *udata);
+
+int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
+
+int mana_ib_gd_create_pd(struct mana_ib_dev *dev, u64 *pd_handle, u32 *pd_id,
+			 enum gdma_pd_flags flags);
+
+int mana_ib_gd_destroy_pd(struct mana_ib_dev *dev, u64 pd_handle);
+
+int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr *mr,
+			 struct gdma_create_mr_params *mr_params);
+
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, mana_handle_t mr_handle);
+#endif
diff --git a/drivers/infiniband/hw/mana/mr.c b/drivers/infiniband/hw/mana/mr.c
new file mode 100644
index 000000000000..691f9ec734c7
--- /dev/null
+++ b/drivers/infiniband/hw/mana/mr.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+#define VALID_MR_FLAGS (IB_ACCESS_LOCAL_WRITE | \
+			IB_ACCESS_REMOTE_WRITE | \
+			IB_ACCESS_REMOTE_READ)
+
+static enum gdma_mr_access_flags
+mana_ib_verbs_to_gdma_access_flags(int access_flags)
+{
+	enum gdma_mr_access_flags flags = GDMA_ACCESS_FLAG_LOCAL_READ;
+
+	if (access_flags & IB_ACCESS_LOCAL_WRITE)
+		flags |= GDMA_ACCESS_FLAG_LOCAL_WRITE;
+
+	if (access_flags & IB_ACCESS_REMOTE_WRITE)
+		flags |= GDMA_ACCESS_FLAG_REMOTE_WRITE;
+
+	if (access_flags & IB_ACCESS_REMOTE_READ)
+		flags |= GDMA_ACCESS_FLAG_REMOTE_READ;
+
+	return flags;
+}
+struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 length,
+				  u64 iova, int access_flags,
+				  struct ib_udata *udata)
+{
+	struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+	struct ib_device *ibdev = ibpd->device;
+	struct mana_ib_dev *dev = container_of(
+			ibdev, struct mana_ib_dev, ib_dev);
+	struct mana_ib_mr *mr;
+	gdma_obj_handle_t dma_region_handle;
+	struct gdma_create_mr_params mr_params = {};
+	u64 page_sz = PAGE_SIZE;
+	int err;
+
+	pr_debug("start 0x%llx, iova 0x%llx length 0x%llx access_flags 0x%x\n",
+		start, iova, length, access_flags);
+
+	if (access_flags & ~VALID_MR_FLAGS)
+		return ERR_PTR(-EINVAL);
+
+	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	mr->umem = ib_umem_get(ibdev, start, length, access_flags);
+	if (IS_ERR(mr->umem)) {
+		err = PTR_ERR(mr->umem);
+		pr_err("Failed to get umem for register user-mr, %d\n", err);
+		goto err_free;
+	}
+
+	page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
+	if (unlikely(!page_sz)) {
+		pr_err("Failed to get best page size\n");
+		err = -EOPNOTSUPP;
+		goto err_umem;
+	}
+	pr_debug("Page size chosen %llu\n", page_sz);
+
+	err = mana_ib_gd_create_dma_region(dev, mr->umem, &dma_region_handle,
+					   page_sz);
+	if (err) {
+		pr_err("Failed to create dma region for register user-mr, %d\n",
+		       err);
+		goto err_umem;
+	}
+
+	pr_debug("mana_ib_gd_create_dma_region ret %d gdma_region %llx\n",
+		 err, dma_region_handle);
+
+	mr_params.pd_handle = pd->pd_handle;
+	mr_params.mr_type = GDMA_MR_TYPE_GVA;
+	mr_params.gva.dma_region_handle = dma_region_handle;
+	mr_params.gva.virtual_address = iova;
+	mr_params.gva.access_flags =
+		mana_ib_verbs_to_gdma_access_flags(access_flags);
+
+	err = mana_ib_gd_create_mr(dev, mr, &mr_params);
+	if (err)
+		goto err_dma_region;
+
+	/*
+	 * There is no need to keep track of dma_region_handle after MR is
+	 * successfully created. The dma_region_handle is tracked in the PF
+	 * as part of the lifecycle of this MR.
+	 */
+
+	mr->ibmr.length = length;
+	mr->ibmr.page_size = page_sz;
+	return &mr->ibmr;
+
+err_dma_region:
+	mana_gd_destroy_dma_region(dev->gdma_dev->gdma_context,
+				   dma_region_handle);
+
+err_umem:
+	ib_umem_release(mr->umem);
+
+err_free:
+	kfree(mr);
+	return ERR_PTR(err);
+}
+
+int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
+{
+	struct mana_ib_mr *mr =
+		container_of(ibmr, struct mana_ib_mr, ibmr);
+	struct ib_device *ibdev = ibmr->device;
+	struct mana_ib_dev *dev =
+		container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+	int err;
+
+	err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
+	if (err)
+		return err;
+
+	if (mr->umem)
+		ib_umem_release(mr->umem);
+
+	kfree(mr);
+
+	return 0;
+}
+
+
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
new file mode 100644
index 000000000000..75ab983c3f5c
--- /dev/null
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -0,0 +1,466 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev, struct net_device *ndev,
+			       mana_handle_t default_rxobj,
+			       mana_handle_t ind_table[], u32 log_ind_tbl_size,
+			       u32 rx_hash_key_len, u8 *rx_hash_key)
+{
+	struct gdma_dev *mdev = dev->gdma_dev;
+	struct gdma_context *gc = mdev->gdma_context;
+	struct mana_port_context *mpc = netdev_priv(ndev);
+
+	struct mana_cfg_rx_steer_req *req = NULL;
+	struct mana_cfg_rx_steer_resp resp = {};
+	u32 req_buf_size;
+	int err;
+	mana_handle_t *req_indir_tab;
+	int i;
+
+	req_buf_size = sizeof(*req) +
+		sizeof(mana_handle_t) * MANA_INDIRECT_TABLE_SIZE;
+	req = kzalloc(req_buf_size, GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	mana_gd_init_req_hdr(&req->hdr, MANA_CONFIG_VPORT_RX, req_buf_size,
+			     sizeof(resp));
+
+	req->vport = mpc->port_handle;
+	req->rx_enable = 1;
+	req->update_default_rxobj = 1;
+	req->default_rxobj = default_rxobj;
+	req->hdr.dev_id = mdev->dev_id;
+
+	/* If the indirection table has more than one entry, enable RSS */
+	if (log_ind_tbl_size)
+		req->rss_enable = true;
+
+	req->num_indir_entries = MANA_INDIRECT_TABLE_SIZE;
+	req->indir_tab_offset = sizeof(*req);
+	req->update_indir_tab = true;
+
+	req_indir_tab = (mana_handle_t *)(req + 1);
+	/*
+	 * The ind table passed to the hardware must have
+	 * MANA_INDIRECT_TABLE_SIZE entries. Adjust the verb
+	 * ind_table to MANA_INDIRECT_TABLE_SIZE if required
+	 */
+	pr_debug("ind table size %u\n", 1 << log_ind_tbl_size);
+	for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++) {
+		req_indir_tab[i] = ind_table[i % (1 << log_ind_tbl_size)];
+		pr_debug("index %u handle 0x%llx\n", i, req_indir_tab[i]);
+	}
+
+	req->update_hashkey = true;
+	if (rx_hash_key_len)
+		memcpy(req->hashkey, rx_hash_key, rx_hash_key_len);
+	else
+		netdev_rss_key_fill(req->hashkey, MANA_HASH_KEY_SIZE);
+
+	pr_debug("vport handle %llu default_rxobj 0x%llx\n",
+		 req->vport, default_rxobj);
+
+	err = mana_gd_send_request(gc, req_buf_size, req, sizeof(resp), &resp);
+	if (err) {
+		netdev_err(ndev, "Failed to configure vPort RX: %d\n", err);
+		goto out;
+	}
+
+	if (resp.hdr.status) {
+		netdev_err(ndev, "vPort RX configuration failed: 0x%x\n",
+			   resp.hdr.status);
+		err = -EPROTO;
+	}
+
+out:
+	kfree(req);
+	return err;
+}
+
+
+static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
+			  struct ib_qp_init_attr *attr, struct ib_udata *udata)
+{
+	struct mana_ib_dev *mdev =
+		container_of(pd->device, struct mana_ib_dev, ib_dev);
+	struct gdma_dev *gd = mdev->gdma_dev;
+	struct mana_context *mc = gd->driver_data;
+	struct net_device *ndev;
+	struct mana_port_context *mpc;
+	struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
+	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+	struct ib_wq *ibwq;
+	struct mana_ib_wq *wq;
+	struct ib_cq *ibcq;
+	struct mana_ib_cq *cq;
+	int i = 0, ret;
+	u32 port;
+	mana_handle_t *mana_ind_table;
+
+	struct mana_ib_create_qp_rss ucmd = {};
+	struct mana_ib_create_qp_rss_resp resp = {};
+
+	ret = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+	if (ret) {
+		pr_err("Failed to copy from udata for create rss-qp, err %d\n",
+		       ret);
+		return -EFAULT;
+	}
+
+	if (attr->cap.max_recv_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
+		pr_err("Requested max_recv_wr %u exceeding limit.\n",
+		       attr->cap.max_recv_wr);
+		return -EINVAL;
+	}
+
+	if (attr->cap.max_recv_sge > MAX_RX_WQE_SGL_ENTRIES) {
+		pr_err("Requested max_recv_sge %u exceeding limit.\n",
+		       attr->cap.max_recv_sge);
+		return -EINVAL;
+	}
+
+	if (ucmd.rx_hash_function != MANA_IB_RX_HASH_FUNC_TOEPLITZ) {
+		pr_err("RX Hash function is not supported, %d\n",
+		       ucmd.rx_hash_function);
+		return -EINVAL;
+	}
+
+	// IB ports start with 1, MANA Ethernet ports start with 0
+	port = ucmd.port;
+	if (port < 1 || port > mc->num_ports) {
+		pr_err("Invalid port %u in creating qp\n", port);
+		return -EINVAL;
+	}
+	ndev = mc->ports[port - 1];
+	mpc = netdev_priv(ndev);
+
+	pr_debug("rx_hash_function %d port %d\n", ucmd.rx_hash_function, port);
+
+	mana_ind_table = kcalloc(1 << ind_tbl->log_ind_tbl_size,
+				 sizeof(mana_handle_t),
+				 GFP_KERNEL);
+	if (!mana_ind_table) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	qp->port = port;
+
+	for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
+		struct mana_obj_spec wq_spec = {};
+		struct mana_obj_spec cq_spec = {};
+
+		ibwq = ind_tbl->ind_tbl[i];
+		wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+
+		ibcq = ibwq->cq;
+		cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+
+		wq_spec.gdma_region = wq->gdma_region;
+		wq_spec.queue_size = wq->wq_buf_size;
+
+		cq_spec.gdma_region = cq->gdma_region;
+		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
+		cq_spec.modr_ctx_id = 0;
+		cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+
+		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
+				&wq_spec, &cq_spec, &wq->rx_object);
+		if (ret)
+			goto fail;
+
+		/* The GDMA regions are now owned by the WQ object */
+		wq->gdma_region = GDMA_INVALID_DMA_REGION;
+		cq->gdma_region = GDMA_INVALID_DMA_REGION;
+
+		wq->id = wq_spec.queue_index;
+		cq->id = cq_spec.queue_index;
+
+		pr_debug("ret %d rx_object 0x%llx wq id %llu cq id %llu\n",
+				ret, wq->rx_object, wq->id, cq->id);
+
+		resp.entries[i].cqid = cq->id;
+		resp.entries[i].wqid = wq->id;
+
+		mana_ind_table[i] = wq->rx_object;
+	}
+	resp.num_entries = i;
+
+	ret = mana_ib_cfg_vport_steering(mdev, ndev, wq->rx_object,
+					 mana_ind_table,
+					 ind_tbl->log_ind_tbl_size,
+					 ucmd.rx_hash_key_len,
+					 ucmd.rx_hash_key);
+	if (ret)
+		goto fail;
+
+	if (udata) {
+		ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
+		if (ret) {
+			pr_err("Failed to copy to udata create rss-qp, %d\n",
+			       ret);
+			goto fail;
+		}
+	}
+
+	kfree(mana_ind_table);
+
+	return 0;
+
+fail:
+	while (i-- > 0) {
+		ibwq = ind_tbl->ind_tbl[i];
+		wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+		mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
+	}
+
+	kfree(mana_ind_table);
+
+	return ret;
+}
+
+int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
+			  struct ib_qp_init_attr *attr,
+			  struct ib_udata *udata)
+{
+	struct ib_ucontext *ib_ucontext = ibpd->uobject->context;
+	struct mana_ib_ucontext *mana_ucontext =
+		container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
+	struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+	struct mana_ib_create_qp ucmd = {};
+	struct mana_ib_create_qp_resp resp = {};
+	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+	struct mana_ib_cq *send_cq =
+		container_of(attr->send_cq, struct mana_ib_cq, ibcq);
+	struct mana_ib_dev *mdev =
+		container_of(ibpd->device, struct mana_ib_dev, ib_dev);
+	struct gdma_dev *gd = mdev->gdma_dev;
+	struct mana_context *mc = gd->driver_data;
+	struct net_device *ndev;
+	struct mana_port_context *mpc;
+	struct mana_obj_spec wq_spec = {};
+	struct mana_obj_spec cq_spec = {};
+	int err;
+	u32 port;
+
+	struct ib_umem *umem;
+
+	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+	if (err) {
+		pr_err("Failed to copy from udata create qp-raw, %d\n", err);
+		return -EFAULT;
+	}
+
+	// IB ports start with 1, MANA Ethernet ports start with 0
+	port = ucmd.port;
+	if (port < 1 || port > mc->num_ports)
+		return -EINVAL;
+
+	if (attr->cap.max_send_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
+		pr_err("Requested max_send_wr %u exceeding limit\n",
+		       attr->cap.max_send_wr);
+		return -EINVAL;
+	}
+
+	if (attr->cap.max_send_sge > MAX_TX_WQE_SGL_ENTRIES) {
+		pr_err("Requested max_send_sge %u exceeding limit\n",
+		       attr->cap.max_send_sge);
+		return -EINVAL;
+	}
+
+	ndev = mc->ports[port - 1];
+	mpc = netdev_priv(ndev);
+	pr_debug("port %u ndev %p mpc %p\n", port, ndev, mpc);
+
+	err = mana_ib_cfg_vport(mdev, port - 1, pd, mana_ucontext->doorbell);
+	if (err) {
+		pr_err("cfg vport failed err %d\n", err);
+		return -ENODEV;
+	}
+
+	qp->port = port;
+
+	pr_debug("ucmd sq_buf_addr 0x%llx port %u\n",
+		 ucmd.sq_buf_addr, ucmd.port);
+
+	umem = ib_umem_get(ibpd->device, ucmd.sq_buf_addr, ucmd.sq_buf_size,
+			   IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(umem)) {
+		err = PTR_ERR(umem);
+		pr_err("Failed to get umem for create qp-raw, err %d\n", err);
+		goto err_free_vport;
+	}
+	qp->sq_umem = umem;
+
+	err = mana_ib_gd_create_dma_region(mdev, qp->sq_umem,
+			&qp->sq_gdma_region, PAGE_SIZE);
+	if (err) {
+		pr_err("Failed to create dma region for create qp-raw, %d\n",
+		       err);
+		goto err_release_umem;
+	}
+
+	pr_debug("%s: mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+		 __func__, err, qp->sq_gdma_region);
+
+	// Create a WQ on the same port handle used by the Ethernet
+	wq_spec.gdma_region = qp->sq_gdma_region;
+	wq_spec.queue_size = ucmd.sq_buf_size;
+
+	cq_spec.gdma_region = send_cq->gdma_region;
+	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
+	cq_spec.modr_ctx_id = 0;
+	cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+
+	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ,
+			&wq_spec, &cq_spec, &qp->tx_object);
+	if (err) {
+		pr_err("Failed to create wq for create raw-qp, err %d\n", err);
+		goto err_destroy_dma_region;
+	}
+
+	/* The GDMA regions are now owned by the WQ object */
+	qp->sq_gdma_region = GDMA_INVALID_DMA_REGION;
+	send_cq->gdma_region = GDMA_INVALID_DMA_REGION;
+
+	qp->sq_id = wq_spec.queue_index;
+	send_cq->id = cq_spec.queue_index;
+
+	pr_debug("ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n",
+			err, qp->tx_object, qp->sq_id, send_cq->id);
+
+	resp.sqid = qp->sq_id;
+	resp.cqid = send_cq->id;
+	resp.tx_vp_offset = pd->tx_vp_offset;
+
+	if (udata) {
+		err = ib_copy_to_udata(udata, &resp, sizeof(resp));
+		if (err) {
+			pr_err("Failed to copy udata for create qp-raw, %d\n",
+			       err);
+			goto err_destroy_wq_obj;
+		}
+	}
+
+	return 0;
+
+err_destroy_wq_obj:
+	mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
+
+err_destroy_dma_region:
+	mana_ib_gd_destroy_dma_region(mdev, qp->sq_gdma_region);
+
+err_release_umem:
+	ib_umem_release(umem);
+
+err_free_vport:
+	mana_ib_uncfg_vport(mdev, pd, port - 1);
+
+	return err;
+}
+
+int mana_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
+		      struct ib_udata *udata)
+{
+	switch (attr->qp_type) {
+
+	case IB_QPT_RAW_PACKET:
+		// When rwq_ind_tbl is used, it's for creating WQs for RSS
+		if (attr->rwq_ind_tbl)
+			return mana_ib_create_qp_rss(ibqp, ibqp->pd, attr, udata);
+
+		return mana_ib_create_qp_raw(ibqp, ibqp->pd, attr, udata);
+	default:
+		// Creating QP other than IB_QPT_RAW_PACKET is not supported
+		pr_err("Creating QP type %u not supported\n", attr->qp_type);
+	}
+
+	return -EINVAL;
+}
+
+int mana_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		int attr_mask, struct ib_udata *udata)
+{
+	// modify_qp is not supported by this version of the driver
+	return -EOPNOTSUPP;
+}
+
+static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
+				  struct ib_rwq_ind_table *ind_tbl,
+				  struct ib_udata *udata)
+{
+	struct mana_ib_dev *mdev =
+		container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+	struct gdma_dev *gd = mdev->gdma_dev;
+	struct mana_context *mc = gd->driver_data;
+	struct net_device *ndev;
+	struct mana_port_context *mpc;
+	struct ib_wq *ibwq;
+	struct mana_ib_wq *wq;
+	int i;
+
+	ndev = mc->ports[qp->port - 1];
+	mpc = netdev_priv(ndev);
+	pr_debug("ndev %p mpc %p\n", ndev, mpc);
+
+	for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
+		ibwq = ind_tbl->ind_tbl[i];
+		wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+		pr_debug("wq->rx_object %llu\n", wq->rx_object);
+		mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
+	}
+
+	return 0;
+}
+
+int mana_ib_destroy_qp_raw(struct mana_ib_qp *qp, struct ib_udata *udata)
+{
+	struct mana_ib_dev *mdev =
+		container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+	struct gdma_dev *gd = mdev->gdma_dev;
+	struct mana_context *mc = gd->driver_data;
+	struct net_device *ndev;
+	struct mana_port_context *mpc;
+	struct ib_pd *ibpd = qp->ibqp.pd;
+	struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+
+	ndev = mc->ports[qp->port - 1];
+	mpc = netdev_priv(ndev);
+	pr_debug("ndev %p mpc %p qp->tx_object %llu\n",
+			ndev, mpc, qp->tx_object);
+
+	mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
+
+	if (qp->sq_umem) {
+		mana_ib_gd_destroy_dma_region(mdev, qp->sq_gdma_region);
+		ib_umem_release(qp->sq_umem);
+	}
+
+	mana_ib_uncfg_vport(mdev, pd, qp->port - 1);
+
+	return 0;
+}
+
+int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
+{
+	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+
+	switch (ibqp->qp_type) {
+	case IB_QPT_RAW_PACKET:
+		if (ibqp->rwq_ind_tbl)
+			return mana_ib_destroy_qp_rss(qp, ibqp->rwq_ind_tbl,
+						      udata);
+
+		return mana_ib_destroy_qp_raw(qp, udata);
+
+	default:
+		pr_debug("Unexpected QP type %u\n", ibqp->qp_type);
+	}
+
+	return -ENOENT;
+}
diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
new file mode 100644
index 000000000000..945aa163c452
--- /dev/null
+++ b/drivers/infiniband/hw/mana/wq.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
+				struct ib_wq_init_attr *init_attr,
+				struct ib_udata *udata)
+{
+	struct ib_umem *umem;
+	struct mana_ib_dev *mdev = container_of(pd->device,
+						struct mana_ib_dev, ib_dev);
+	struct mana_ib_create_wq ucmd = { };
+	struct mana_ib_wq *wq;
+	int err;
+
+	pr_debug("udata->inlen %zu\n", udata->inlen);
+	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+	if (err) {
+		pr_err("Failed to copy from udata for create wq, %d\n", err);
+		return ERR_PTR(-EFAULT);
+	}
+
+	wq = kzalloc(sizeof(*wq), GFP_KERNEL);
+	if (!wq)
+		return ERR_PTR(-ENOMEM);
+
+	pr_debug("ucmd wq_buf_addr 0x%llx\n", ucmd.wq_buf_addr);
+
+	umem = ib_umem_get(pd->device, ucmd.wq_buf_addr, ucmd.wq_buf_size,
+			   IB_ACCESS_LOCAL_WRITE);
+	if (IS_ERR(umem)) {
+		err = PTR_ERR(umem);
+		pr_err("Failed to get umem for create wq, err %d\n", err);
+		goto err_free_wq;
+	}
+
+	wq->umem = umem;
+	wq->wqe = init_attr->max_wr;
+	wq->wq_buf_size = ucmd.wq_buf_size;
+	wq->rx_object = INVALID_MANA_HANDLE;
+
+	err = mana_ib_gd_create_dma_region(mdev, wq->umem, &wq->gdma_region,
+					   PAGE_SIZE);
+	if (err) {
+		pr_err("Failed to create dma region for create wq, %d\n", err);
+		goto err_release_umem;
+	}
+
+	pr_debug("%s: mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+			__func__, err, wq->gdma_region);
+
+	// The WQ ID is assigned at wq_create time; its value is not known yet
+
+	return &wq->ibwq;
+
+err_release_umem:
+	ib_umem_release(umem);
+
+err_free_wq:
+	kfree(wq);
+
+	return ERR_PTR(err);
+}
+
+
+int mana_ib_modify_wq(struct ib_wq *wq, struct ib_wq_attr *wq_attr,
+		      u32 wq_attr_mask, struct ib_udata *udata)
+{
+	// modify_wq is not supported by this version of the driver
+	return -EOPNOTSUPP;
+}
+
+int mana_ib_destroy_wq(struct ib_wq *ibwq, struct ib_udata *udata)
+{
+	struct mana_ib_wq *wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+	struct ib_device *ib_dev = ibwq->device;
+	struct mana_ib_dev *mdev = container_of(ib_dev, struct mana_ib_dev,
+						ib_dev);
+
+	mana_ib_gd_destroy_dma_region(mdev, wq->gdma_region);
+	ib_umem_release(wq->umem);
+
+	kfree(wq);
+
+	return 0;
+}
+
+int mana_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
+				 struct ib_rwq_ind_table_init_attr *init_attr,
+				 struct ib_udata *udata)
+{
+	pr_debug("udata->inlen %zu\n", udata->inlen);
+
+	/*
+	 * There is no additional data in ind_table to be maintained by this
+	 * driver, do nothing
+	 */
+	return 0;
+}
+
+int mana_ib_destroy_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_tbl)
+{
+	/*
+	 * There is no additional data in ind_table to be maintained by this
+	 * driver, do nothing
+	 */
+	return 0;
+}
diff --git a/include/linux/mana/mana.h b/include/linux/mana/mana.h
index 1cf77a03bff2..114698f682cf 100644
--- a/include/linux/mana/mana.h
+++ b/include/linux/mana/mana.h
@@ -403,6 +403,9 @@ int mana_bpf(struct net_device *ndev, struct netdev_bpf *bpf);
 
 extern const struct ethtool_ops mana_ethtool_ops;
 
+/* A CQ can be created not associated with any EQ */
+#define GDMA_CQ_NO_EQ  0xffff
+
 struct mana_obj_spec {
 	u32 queue_index;
 	u64 gdma_region;
diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 3072e5d6b692..081aabf536dc 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -250,6 +250,7 @@ enum rdma_driver_id {
 	RDMA_DRIVER_QIB,
 	RDMA_DRIVER_EFA,
 	RDMA_DRIVER_SIW,
+	RDMA_DRIVER_MANA,
 };
 
 enum ib_uverbs_gid_type {
diff --git a/include/uapi/rdma/mana-abi.h b/include/uapi/rdma/mana-abi.h
new file mode 100644
index 000000000000..4e40f70a0601
--- /dev/null
+++ b/include/uapi/rdma/mana-abi.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#ifndef MANA_ABI_USER_H
+#define MANA_ABI_USER_H
+
+#include <linux/types.h>
+#include <rdma/ib_user_ioctl_verbs.h>
+
+#include <linux/mana/mana.h>
+
+/*
+ * Increment this value if any changes that break userspace ABI
+ * compatibility are made.
+ */
+
+#define MANA_IB_UVERBS_ABI_VERSION              1
+
+struct mana_ib_create_cq {
+	__aligned_u64 buf_addr;
+};
+
+struct mana_ib_create_qp {
+	__aligned_u64 sq_buf_addr;
+	__u32	sq_buf_size;
+	__u32	port;
+};
+
+struct mana_ib_create_qp_resp {
+	__u32 sqid;
+	__u32 cqid;
+	__u32 tx_vp_offset;
+	__u32 reserved;
+};
+
+struct mana_ib_create_wq {
+	__aligned_u64 wq_buf_addr;
+	__u32	wq_buf_size;
+	__u32   reserved;
+};
+
+/* RX Hash function flags */
+enum mana_ib_rx_hash_function_flags {
+	MANA_IB_RX_HASH_FUNC_TOEPLITZ	= 1 << 0,
+};
+
+struct mana_ib_create_qp_rss {
+	__aligned_u64 rx_hash_fields_mask;
+	__u8    rx_hash_function;
+	__u8    reserved[7];
+	__u32	rx_hash_key_len;
+	__u8    rx_hash_key[40];
+	__u32	port;
+};
+
+struct rss_resp_entry {
+	__u32   cqid;
+	__u32   wqid;
+};
+
+struct mana_ib_create_qp_rss_resp {
+	__aligned_u64 num_entries;
+	struct rss_resp_entry entries[MANA_MAX_NUM_QUEUES];
+};
+
+#endif
-- 
2.17.1


^ permalink raw reply related

* [PATCH 09/12] net: mana: Move header files to a common location
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

In preparation for adding the MANA RDMA driver, move all the required
header files to a common location for use by both the Ethernet and RDMA
drivers.

Signed-off-by: Long Li <longli@microsoft.com>
---
 MAINTAINERS                                                   | 1 +
 drivers/net/ethernet/microsoft/mana/gdma_main.c               | 2 +-
 drivers/net/ethernet/microsoft/mana/hw_channel.c              | 4 ++--
 drivers/net/ethernet/microsoft/mana/mana_bpf.c                | 2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c                 | 2 +-
 drivers/net/ethernet/microsoft/mana/mana_ethtool.c            | 2 +-
 drivers/net/ethernet/microsoft/mana/shm_channel.c             | 2 +-
 {drivers/net/ethernet/microsoft => include/linux}/mana/gdma.h | 0
 .../ethernet/microsoft => include/linux}/mana/hw_channel.h    | 0
 {drivers/net/ethernet/microsoft => include/linux}/mana/mana.h | 0
 .../ethernet/microsoft => include/linux}/mana/shm_channel.h   | 0
 11 files changed, 8 insertions(+), 7 deletions(-)
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/gdma.h (100%)
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/hw_channel.h (100%)
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/mana.h (100%)
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/shm_channel.h (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 40fa1955ca3f..268c68dc40dc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9108,6 +9108,7 @@ F:	include/asm-generic/hyperv-tlfs.h
 F:	include/asm-generic/mshyperv.h
 F:	include/clocksource/hyperv_timer.h
 F:	include/linux/hyperv.h
+F:	include/linux/mana/
 F:	include/uapi/linux/hyperv.h
 F:	net/vmw_vsock/hyperv_transport.c
 F:	tools/hv/
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 9c93d7a403ea..96edf8491ebd 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -6,7 +6,7 @@
 #include <linux/utsname.h>
 #include <linux/version.h>
 
-#include "mana.h"
+#include <linux/mana/mana.h>
 
 static u32 mana_gd_r32(struct gdma_context *g, u64 offset)
 {
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 078d6a5a0768..609cd714dcc0 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -1,8 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /* Copyright (c) 2021, Microsoft Corporation. */
 
-#include "gdma.h"
-#include "hw_channel.h"
+#include <linux/mana/gdma.h>
+#include <linux/mana/hw_channel.h>
 
 static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id)
 {
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 1d2f948b5c00..7476f21e5f37 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -8,7 +8,7 @@
 #include <linux/bpf_trace.h>
 #include <net/xdp.h>
 
-#include "mana.h"
+#include <linux/mana/mana.h>
 
 void mana_xdp_tx(struct sk_buff *skb, struct net_device *ndev)
 {
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 6bb38c90b008..928b14a7ee1f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -11,7 +11,7 @@
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
 
-#include "mana.h"
+#include <linux/mana/mana.h>
 
 static DEFINE_IDA(mana_adev_ida);
 
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index e13f2453eabb..c2ecb5154139 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -5,7 +5,7 @@
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
 
-#include "mana.h"
+#include <linux/mana/mana.h>
 
 static const struct {
 	char name[ETH_GSTRING_LEN];
diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c b/drivers/net/ethernet/microsoft/mana/shm_channel.c
index da255da62176..161a4e6ba32a 100644
--- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
@@ -6,7 +6,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 
-#include "shm_channel.h"
+#include <linux/mana/shm_channel.h>
 
 #define PAGE_FRAME_L48_WIDTH_BYTES 6
 #define PAGE_FRAME_L48_WIDTH_BITS (PAGE_FRAME_L48_WIDTH_BYTES * 8)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/include/linux/mana/gdma.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/gdma.h
rename to include/linux/mana/gdma.h
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.h b/include/linux/mana/hw_channel.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/hw_channel.h
rename to include/linux/mana/hw_channel.h
diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/include/linux/mana/mana.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/mana.h
rename to include/linux/mana/mana.h
diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.h b/include/linux/mana/shm_channel.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/shm_channel.h
rename to include/linux/mana/shm_channel.h
-- 
2.17.1


^ permalink raw reply related

* [PATCH 06/12] net: mana: Define data structures for protection domain and memory registration
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

The MANA hardware supports protection domains and memory registration for
use in an RDMA environment. Add those definitions and expose them for use
by the RDMA driver.

Signed-off-by: Long Li <longli@microsoft.com>
---
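Not part of the commit: a minimal userspace sketch of how the new MR
parameter union might be filled for a GVA memory region. The enums are
copied from this patch and the struct mirrors gdma_create_mr_params (FMR arm
omitted), with stdint types substituted for the kernel ones. The names
struct mr_params and fill_gva_mr_params() are hypothetical and exist only
for illustration; the driver's own helpers may look different.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for the kernel typedef added by this patch. */
typedef uint64_t gdma_obj_handle_t;

enum gdma_mr_access_flags {
	GDMA_ACCESS_FLAG_LOCAL_READ = (1 << 0),
	GDMA_ACCESS_FLAG_LOCAL_WRITE = (1 << 1),
	GDMA_ACCESS_FLAG_REMOTE_READ = (1 << 2),
	GDMA_ACCESS_FLAG_REMOTE_WRITE = (1 << 3),
	GDMA_ACCESS_FLAG_REMOTE_ATOMIC = (1 << 4),
};

enum gdma_mr_type {
	GDMA_MR_TYPE_GPA = 1,
	GDMA_MR_TYPE_GVA,
	GDMA_MR_TYPE_FMR,
};

/* Hypothetical mirror of struct gdma_create_mr_params, GVA/GPA arms only. */
struct mr_params {
	gdma_obj_handle_t pd_handle;
	enum gdma_mr_type mr_type;
	union {
		struct {
			gdma_obj_handle_t dma_region_handle;
			uint64_t virtual_address;
			enum gdma_mr_access_flags access_flags;
		} gva;
		struct {
			enum gdma_mr_access_flags access_flags;
		} gpa;
	};
};

/* Hypothetical helper: describe a GVA MR over an existing DMA region. */
static void fill_gva_mr_params(struct mr_params *p, gdma_obj_handle_t pd,
			       gdma_obj_handle_t region, uint64_t va,
			       int access)
{
	assert(p);

	/* Zero everything first so unused union bytes are well defined. */
	memset(p, 0, sizeof(*p));
	p->pd_handle = pd;
	p->mr_type = GDMA_MR_TYPE_GVA;
	p->gva.dma_region_handle = region;
	p->gva.virtual_address = va;
	p->gva.access_flags = (enum gdma_mr_access_flags)access;
}
```

The mr_type field selects which arm of the union is meaningful, so a caller
would set it before touching gva, gpa or fmr. Access flags are a bitmask and
can be OR-ed together.
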
 drivers/net/ethernet/microsoft/mana/gdma.h    | 149 +++++++++++++++++-
 .../net/ethernet/microsoft/mana/gdma_main.c   |  26 +--
 drivers/net/ethernet/microsoft/mana/mana_en.c |  16 +-
 3 files changed, 168 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index f945755760dc..bc8cd9528937 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -27,6 +27,10 @@ enum gdma_request_type {
 	GDMA_CREATE_DMA_REGION		= 25,
 	GDMA_DMA_REGION_ADD_PAGES	= 26,
 	GDMA_DESTROY_DMA_REGION		= 27,
+	GDMA_CREATE_PD			= 29,
+	GDMA_DESTROY_PD			= 30,
+	GDMA_CREATE_MR			= 31,
+	GDMA_DESTROY_MR			= 32,
 };
 
 #define GDMA_RESOURCE_DOORBELL_PAGE	27
@@ -59,6 +63,8 @@ enum {
 	GDMA_DEVICE_MANA	= 2,
 };
 
+typedef u64 gdma_obj_handle_t;
+
 struct gdma_resource {
 	/* Protect the bitmap */
 	spinlock_t lock;
@@ -192,7 +198,7 @@ struct gdma_mem_info {
 	u64 length;
 
 	/* Allocated by the PF driver */
-	u64 gdma_region;
+	gdma_obj_handle_t dma_region_handle;
 };
 
 #define REGISTER_ATB_MST_MKEY_LOWER_SIZE 8
@@ -599,7 +605,7 @@ struct gdma_create_queue_req {
 	u32 reserved1;
 	u32 pdid;
 	u32 doolbell_id;
-	u64 gdma_region;
+	gdma_obj_handle_t gdma_region;
 	u32 reserved2;
 	u32 queue_size;
 	u32 log2_throttle_limit;
@@ -626,6 +632,28 @@ struct gdma_disable_queue_req {
 	u32 alloc_res_id_on_creation;
 }; /* HW DATA */
 
+enum atb_page_size {
+	ATB_PAGE_SIZE_4K,
+	ATB_PAGE_SIZE_8K,
+	ATB_PAGE_SIZE_16K,
+	ATB_PAGE_SIZE_32K,
+	ATB_PAGE_SIZE_64K,
+	ATB_PAGE_SIZE_128K,
+	ATB_PAGE_SIZE_256K,
+	ATB_PAGE_SIZE_512K,
+	ATB_PAGE_SIZE_1M,
+	ATB_PAGE_SIZE_2M,
+	ATB_PAGE_SIZE_MAX,
+};
+
+enum gdma_mr_access_flags {
+	GDMA_ACCESS_FLAG_LOCAL_READ = (1 << 0),
+	GDMA_ACCESS_FLAG_LOCAL_WRITE = (1 << 1),
+	GDMA_ACCESS_FLAG_REMOTE_READ = (1 << 2),
+	GDMA_ACCESS_FLAG_REMOTE_WRITE = (1 << 3),
+	GDMA_ACCESS_FLAG_REMOTE_ATOMIC = (1 << 4),
+};
+
 /* GDMA_CREATE_DMA_REGION */
 struct gdma_create_dma_region_req {
 	struct gdma_req_hdr hdr;
@@ -652,14 +680,14 @@ struct gdma_create_dma_region_req {
 
 struct gdma_create_dma_region_resp {
 	struct gdma_resp_hdr hdr;
-	u64 gdma_region;
+	gdma_obj_handle_t dma_region_handle;
 }; /* HW DATA */
 
 /* GDMA_DMA_REGION_ADD_PAGES */
 struct gdma_dma_region_add_pages_req {
 	struct gdma_req_hdr hdr;
 
-	u64 gdma_region;
+	gdma_obj_handle_t dma_region_handle;
 
 	u32 page_addr_list_len;
 	u32 reserved3;
@@ -671,9 +699,117 @@ struct gdma_dma_region_add_pages_req {
 struct gdma_destroy_dma_region_req {
 	struct gdma_req_hdr hdr;
 
-	u64 gdma_region;
+	gdma_obj_handle_t dma_region_handle;
 }; /* HW DATA */
 
+enum gdma_pd_flags {
+	GDMA_PD_FLAG_ALLOW_GPA_MR = (1 << 0),
+	GDMA_PD_FLAG_ALLOW_FMR_MR = (1 << 1),
+};
+
+struct gdma_create_pd_req {
+	struct gdma_req_hdr hdr;
+	enum gdma_pd_flags flags;
+	u32 reserved;
+};
+
+struct gdma_create_pd_resp {
+	struct gdma_resp_hdr hdr;
+	gdma_obj_handle_t pd_handle;
+	u32 pd_id;
+	u32 reserved;
+};
+
+struct gdma_destroy_pd_req {
+	struct gdma_req_hdr hdr;
+	gdma_obj_handle_t pd_handle;
+};
+
+struct gdma_destory_pd_resp {
+	struct gdma_resp_hdr hdr;
+};
+
+enum gdma_mr_type {
+	/*
+	 * Guest Physical Address - MRs of this type allow access
+	 * to any DMA-mapped memory using bus-logical address
+	 */
+	GDMA_MR_TYPE_GPA = 1,
+
+	/*
+	 * Guest Virtual Address - MRs of this type allow access
+	 * to memory mapped by PTEs associated with this MR using a virtual
+	 * address that is set up in the MST
+	 */
+	GDMA_MR_TYPE_GVA,
+
+	/*
+	 * Fast Memory Register - Like GVA but the MR is initially put in the
+	 * FREE state (as opposed to Valid), and the specified number of
+	 * PTEs are reserved for future fast memory reservations.
+	 */
+	GDMA_MR_TYPE_FMR,
+};
+
+struct gdma_create_mr_params {
+	gdma_obj_handle_t pd_handle;
+	enum gdma_mr_type mr_type;
+	union {
+		struct {
+			gdma_obj_handle_t dma_region_handle;
+			u64 virtual_address;
+			enum gdma_mr_access_flags access_flags;
+		} gva;
+		struct {
+			enum gdma_mr_access_flags access_flags;
+		} gpa;
+		struct {
+			enum atb_page_size page_size;
+			u32  reserved_pte_count;
+		} fmr;
+	};
+};
+
+struct gdma_create_mr_request {
+	struct gdma_req_hdr hdr;
+	gdma_obj_handle_t pd_handle;
+	enum gdma_mr_type mr_type;
+	u32 reserved;
+
+	union {
+		struct {
+			enum gdma_mr_access_flags access_flags;
+		} gpa;
+
+		struct {
+			gdma_obj_handle_t dma_region_handle;
+			u64 virtual_address;
+			enum gdma_mr_access_flags access_flags;
+		} gva;
+
+		struct {
+			enum atb_page_size page_size;
+			u32 reserved_pte_count;
+		} fmr;
+	};
+};
+
+struct gdma_create_mr_response {
+	struct gdma_resp_hdr hdr;
+	gdma_obj_handle_t mr_handle;
+	u32 lkey;
+	u32 rkey;
+};
+
+struct gdma_destroy_mr_request {
+	struct gdma_req_hdr hdr;
+	gdma_obj_handle_t mr_handle;
+};
+
+struct gdma_destroy_mr_response {
+	struct gdma_resp_hdr hdr;
+};
+
 int mana_gd_verify_vf_version(struct pci_dev *pdev);
 
 int mana_gd_register_device(struct gdma_dev *gd);
@@ -705,4 +841,7 @@ int mana_gd_allocate_doorbell_page(struct gdma_context *gc, int *doorbell_page);
 
 int mana_gd_destroy_doorbell_page(struct gdma_context *gc, int doorbell_page);
 
+int mana_gd_destroy_dma_region(struct gdma_context *gc,
+			       gdma_obj_handle_t dma_region_handle);
+
 #endif /* _GDMA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 426087688480..55c4059ac870 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -224,7 +224,7 @@ static int mana_gd_create_hw_eq(struct gdma_context *gc,
 	req.type = queue->type;
 	req.pdid = queue->gdma_dev->pdid;
 	req.doolbell_id = queue->gdma_dev->doorbell;
-	req.gdma_region = queue->mem_info.gdma_region;
+	req.gdma_region = queue->mem_info.dma_region_handle;
 	req.queue_size = queue->queue_size;
 	req.log2_throttle_limit = queue->eq.log2_throttle_limit;
 	req.eq_pci_msix_index = queue->eq.msix_index;
@@ -238,7 +238,7 @@ static int mana_gd_create_hw_eq(struct gdma_context *gc,
 
 	queue->id = resp.queue_index;
 	queue->eq.disable_needed = true;
-	queue->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
+	queue->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
 	return 0;
 }
 
@@ -692,24 +692,30 @@ int mana_gd_create_hwc_queue(struct gdma_dev *gd,
 	return err;
 }
 
-static void mana_gd_destroy_dma_region(struct gdma_context *gc, u64 gdma_region)
+int mana_gd_destroy_dma_region(struct gdma_context *gc,
+			       gdma_obj_handle_t dma_region_handle)
 {
 	struct gdma_destroy_dma_region_req req = {};
 	struct gdma_general_resp resp = {};
 	int err;
 
-	if (gdma_region == GDMA_INVALID_DMA_REGION)
-		return;
+	if (dma_region_handle == GDMA_INVALID_DMA_REGION)
+		return 0;
 
 	mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_DMA_REGION, sizeof(req),
 			     sizeof(resp));
-	req.gdma_region = gdma_region;
+	req.dma_region_handle = dma_region_handle;
 
 	err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
-	if (err || resp.hdr.status)
+	if (err || resp.hdr.status) {
 		dev_err(gc->dev, "Failed to destroy DMA region: %d, 0x%x\n",
 			err, resp.hdr.status);
+		return -EPROTO;
+	}
+
+	return 0;
 }
+EXPORT_SYMBOL(mana_gd_destroy_dma_region);
 
 static int mana_gd_create_dma_region(struct gdma_dev *gd,
 				     struct gdma_mem_info *gmi)
@@ -754,14 +760,14 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
 	if (err)
 		goto out;
 
-	if (resp.hdr.status || resp.gdma_region == GDMA_INVALID_DMA_REGION) {
+	if (resp.hdr.status || resp.dma_region_handle == GDMA_INVALID_DMA_REGION) {
 		dev_err(gc->dev, "Failed to create DMA region: 0x%x\n",
 			resp.hdr.status);
 		err = -EPROTO;
 		goto out;
 	}
 
-	gmi->gdma_region = resp.gdma_region;
+	gmi->dma_region_handle = resp.dma_region_handle;
 out:
 	kfree(req);
 	return err;
@@ -884,7 +890,7 @@ void mana_gd_destroy_queue(struct gdma_context *gc, struct gdma_queue *queue)
 		return;
 	}
 
-	mana_gd_destroy_dma_region(gc, gmi->gdma_region);
+	mana_gd_destroy_dma_region(gc, gmi->dma_region_handle);
 	mana_gd_free_memory(gmi);
 	kfree(queue);
 }
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 4f7a50ace9f6..dc9fcb99e937 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1364,10 +1364,10 @@ static int mana_create_txq(struct mana_port_context *apc,
 		memset(&wq_spec, 0, sizeof(wq_spec));
 		memset(&cq_spec, 0, sizeof(cq_spec));
 
-		wq_spec.gdma_region = txq->gdma_sq->mem_info.gdma_region;
+		wq_spec.gdma_region = txq->gdma_sq->mem_info.dma_region_handle;
 		wq_spec.queue_size = txq->gdma_sq->queue_size;
 
-		cq_spec.gdma_region = cq->gdma_cq->mem_info.gdma_region;
+		cq_spec.gdma_region = cq->gdma_cq->mem_info.dma_region_handle;
 		cq_spec.queue_size = cq->gdma_cq->queue_size;
 		cq_spec.modr_ctx_id = 0;
 		cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
@@ -1382,8 +1382,8 @@ static int mana_create_txq(struct mana_port_context *apc,
 		txq->gdma_sq->id = wq_spec.queue_index;
 		cq->gdma_cq->id = cq_spec.queue_index;
 
-		txq->gdma_sq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
-		cq->gdma_cq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
+		txq->gdma_sq->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
+		cq->gdma_cq->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
 
 		txq->gdma_txq_id = txq->gdma_sq->id;
 
@@ -1594,10 +1594,10 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	memset(&wq_spec, 0, sizeof(wq_spec));
 	memset(&cq_spec, 0, sizeof(cq_spec));
-	wq_spec.gdma_region = rxq->gdma_rq->mem_info.gdma_region;
+	wq_spec.gdma_region = rxq->gdma_rq->mem_info.dma_region_handle;
 	wq_spec.queue_size = rxq->gdma_rq->queue_size;
 
-	cq_spec.gdma_region = cq->gdma_cq->mem_info.gdma_region;
+	cq_spec.gdma_region = cq->gdma_cq->mem_info.dma_region_handle;
 	cq_spec.queue_size = cq->gdma_cq->queue_size;
 	cq_spec.modr_ctx_id = 0;
 	cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
@@ -1610,8 +1610,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	rxq->gdma_rq->id = wq_spec.queue_index;
 	cq->gdma_cq->id = cq_spec.queue_index;
 
-	rxq->gdma_rq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
-	cq->gdma_cq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
+	rxq->gdma_rq->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
+	cq->gdma_cq->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
 
 	rxq->gdma_id = rxq->gdma_rq->id;
 	cq->gdma_id = cq->gdma_cq->id;
-- 
2.17.1


^ permalink raw reply related

* [PATCH 07/12] net: mana: Export Work Queue functions for use by RDMA driver
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

An RDMA device may need to create Ethernet device queues for use by Queue
Pair type RAW. This allows a user-mode context to access Ethernet hardware
queues. Export the supporting functions for use by the RDMA driver.

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c |  1 +
 drivers/net/ethernet/microsoft/mana/mana.h      |  9 +++++++++
 drivers/net/ethernet/microsoft/mana/mana_en.c   | 16 +++++++++-------
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 55c4059ac870..9c93d7a403ea 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -125,6 +125,7 @@ int mana_gd_send_request(struct gdma_context *gc, u32 req_len, const void *req,
 
 	return mana_hwc_send_request(hwc, req_len, req, resp_len, resp);
 }
+EXPORT_SYMBOL(mana_gd_send_request);
 
 int mana_gd_alloc_memory(struct gdma_context *gc, unsigned int length,
 			 struct gdma_mem_info *gmi)
diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/drivers/net/ethernet/microsoft/mana/mana.h
index 26f14fcb6a61..29e14ad8b930 100644
--- a/drivers/net/ethernet/microsoft/mana/mana.h
+++ b/drivers/net/ethernet/microsoft/mana/mana.h
@@ -568,6 +568,15 @@ struct mana_adev {
 	struct gdma_dev *mdev;
 };
 
+int mana_create_wq_obj(struct mana_port_context *apc,
+		       mana_handle_t vport,
+		       u32 wq_type, struct mana_obj_spec *wq_spec,
+		       struct mana_obj_spec *cq_spec,
+		       mana_handle_t *wq_obj);
+
+void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
+			 mana_handle_t wq_obj);
+
 int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
 		   u32 doorbell_pg_id);
 void mana_uncfg_vport(struct mana_port_context *apc);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index dc9fcb99e937..b4af85e81834 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -644,11 +644,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
 	return err;
 }
 
-static int mana_create_wq_obj(struct mana_port_context *apc,
-			      mana_handle_t vport,
-			      u32 wq_type, struct mana_obj_spec *wq_spec,
-			      struct mana_obj_spec *cq_spec,
-			      mana_handle_t *wq_obj)
+int mana_create_wq_obj(struct mana_port_context *apc,
+		       mana_handle_t vport,
+		       u32 wq_type, struct mana_obj_spec *wq_spec,
+		       struct mana_obj_spec *cq_spec,
+		       mana_handle_t *wq_obj)
 {
 	struct mana_create_wqobj_resp resp = {};
 	struct mana_create_wqobj_req req = {};
@@ -697,9 +697,10 @@ static int mana_create_wq_obj(struct mana_port_context *apc,
 out:
 	return err;
 }
+EXPORT_SYMBOL_GPL(mana_create_wq_obj);
 
-static void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
-				mana_handle_t wq_obj)
+void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
+			 mana_handle_t wq_obj)
 {
 	struct mana_destroy_wqobj_resp resp = {};
 	struct mana_destroy_wqobj_req req = {};
@@ -724,6 +725,7 @@ static void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
 		netdev_err(ndev, "Failed to destroy WQ object: %d, 0x%x\n", err,
 			   resp.hdr.status);
 }
+EXPORT_SYMBOL_GPL(mana_destroy_wq_obj);
 
 static void mana_destroy_eq(struct mana_context *ac)
 {
-- 
2.17.1


^ permalink raw reply related

* [PATCH 08/12] net: mana: Record port number in netdev
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

The port number is useful for a user-mode application to identify this
net device based on port index. Set it to the correct value in ndev.
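
For illustration only, a user-mode program could read the value back
through the standard sysfs attribute /sys/class/net/<ifname>/dev_port.
The helper names below (dev_port_path, read_dev_port) are made up for
this sketch and are not part of the patch:

```c
#include <assert.h>
#include <stdio.h>

/* Build "/sys/class/net/<ifname>/dev_port" in buf.
 * Returns 0 on success, -1 if the path does not fit in len bytes.
 */
int dev_port_path(char *buf, size_t len, const char *ifname)
{
	int n;

	assert(buf != NULL && ifname != NULL);
	n = snprintf(buf, len, "/sys/class/net/%s/dev_port", ifname);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the decimal port index from a dev_port file.
 * Returns 0 on success, -1 if the file is missing or unparsable.
 */
int read_dev_port(const char *path, unsigned int *port)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%u", port) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would build the path for each candidate interface and compare
the value read against the port index it is looking for.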

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b4af85e81834..6bb38c90b008 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1952,6 +1952,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
 	ndev->max_mtu = ndev->mtu;
 	ndev->min_mtu = ndev->mtu;
 	ndev->needed_headroom = MANA_HEADROOM;
+	ndev->dev_port = port_idx;
 	SET_NETDEV_DEV(ndev, gc->dev);
 
 	netif_carrier_off(ndev);
-- 
2.17.1


^ permalink raw reply related

* [PATCH 05/12] net: mana: Set the DMA device max page size
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li
In-Reply-To: <1652778276-2986-1-git-send-email-longli@linuxonhyperv.com>

From: Long Li <longli@microsoft.com>

The system chooses a default 64K page size if the device does not specify
the max page size it can handle for DMA. This does not work well when the
device is registering a large chunk of memory, where a larger page size
is more efficient.

Set it to the maximum hardware supported page size.
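
As a rough back-of-the-envelope illustration (not part of the patch),
the sketch below counts how many DMA segments are needed to cover a
region under a given size limit. It assumes the best case, where the
memory is contiguous enough for every segment to reach the limit;
segments_needed is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Number of segments needed to cover `bytes` when each segment is
 * capped at `max_seg` bytes (rounds up for a partial last segment).
 */
uint64_t segments_needed(uint64_t bytes, uint64_t max_seg)
{
	assert(max_seg > 0);
	return (bytes + max_seg - 1) / max_seg;
}
```

Under that assumption, registering 1 GiB takes 16384 segments with a
64K limit and 512 segments with a 2M limit.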

Signed-off-by: Long Li <longli@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/gdma_main.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 86ffe0e39df0..426087688480 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1385,6 +1385,13 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto release_region;
 
+	// The max GDMA HW supported page size is 2M
+	err = dma_set_max_seg_size(&pdev->dev, SZ_2M);
+	if (err) {
+		dev_err(&pdev->dev, "Failed to set dma device segment size\n");
+		goto release_region;
+	}
+
 	err = -ENOMEM;
 	gc = vzalloc(sizeof(*gc));
 	if (!gc)
-- 
2.17.1


^ permalink raw reply related

* [PATCH 00/12] Introduce Microsoft Azure Network Adapter (MANA) RDMA driver
From: longli @ 2022-05-17  9:04 UTC (permalink / raw)
  To: K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Wei Liu,
	Dexuan Cui, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jason Gunthorpe, Leon Romanovsky
  Cc: linux-hyperv, netdev, linux-kernel, linux-rdma, Long Li

From: Long Li <longli@microsoft.com>

This patchset implements an RDMA driver for Microsoft Azure Network
Adapter (MANA). In MANA, the RDMA device is modeled as an auxiliary device
to the Ethernet device.

The first 11 patches modify the MANA Ethernet driver to support the RDMA
driver. The last patch implements the RDMA driver.

Long Li (12):
  net: mana: Add support for auxiliary device
  net: mana: Record the physical address for doorbell page region
  net: mana: Handle vport sharing between devices
  net: mana: Add functions for allocating doorbell page from GDMA
  net: mana: Set the DMA device max page size
  net: mana: Define data structures for protection domain and memory
    registration
  net: mana: Export Work Queue functions for use by RDMA driver
  net: mana: Record port number in netdev
  net: mana: Move header files to a common location
  net: mana: Define max values for SGL entries
  net: mana: Define and process GDMA response code
    GDMA_STATUS_MORE_ENTRIES
  RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter

 MAINTAINERS                                   |   4 +
 drivers/infiniband/Kconfig                    |   1 +
 drivers/infiniband/hw/Makefile                |   1 +
 drivers/infiniband/hw/mana/Kconfig            |   7 +
 drivers/infiniband/hw/mana/Makefile           |   4 +
 drivers/infiniband/hw/mana/cq.c               |  74 ++
 drivers/infiniband/hw/mana/main.c             | 679 ++++++++++++++++++
 drivers/infiniband/hw/mana/mana_ib.h          | 145 ++++
 drivers/infiniband/hw/mana/mr.c               | 133 ++++
 drivers/infiniband/hw/mana/qp.c               | 466 ++++++++++++
 drivers/infiniband/hw/mana/wq.c               | 111 +++
 .../net/ethernet/microsoft/mana/gdma_main.c   |  94 ++-
 .../net/ethernet/microsoft/mana/hw_channel.c  |   6 +-
 .../net/ethernet/microsoft/mana/mana_bpf.c    |   2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 139 +++-
 .../ethernet/microsoft/mana/mana_ethtool.c    |   2 +-
 .../net/ethernet/microsoft/mana/shm_channel.c |   2 +-
 .../microsoft => include/linux}/mana/gdma.h   | 191 ++++-
 .../linux}/mana/hw_channel.h                  |   0
 .../microsoft => include/linux}/mana/mana.h   |  26 +-
 .../linux}/mana/shm_channel.h                 |   0
 include/uapi/rdma/ib_user_ioctl_verbs.h       |   1 +
 include/uapi/rdma/mana-abi.h                  |  68 ++
 23 files changed, 2111 insertions(+), 45 deletions(-)
 create mode 100644 drivers/infiniband/hw/mana/Kconfig
 create mode 100644 drivers/infiniband/hw/mana/Makefile
 create mode 100644 drivers/infiniband/hw/mana/cq.c
 create mode 100644 drivers/infiniband/hw/mana/main.c
 create mode 100644 drivers/infiniband/hw/mana/mana_ib.h
 create mode 100644 drivers/infiniband/hw/mana/mr.c
 create mode 100644 drivers/infiniband/hw/mana/qp.c
 create mode 100644 drivers/infiniband/hw/mana/wq.c
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/gdma.h (77%)
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/hw_channel.h (100%)
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/mana.h (94%)
 rename {drivers/net/ethernet/microsoft => include/linux}/mana/shm_channel.h (100%)
 create mode 100644 include/uapi/rdma/mana-abi.h

-- 
2.17.1


^ permalink raw reply

* Re: linux-next: build warning after merge of the net-next tree
From: Stephen Rothwell @ 2022-05-17  9:03 UTC (permalink / raw)
  To: David Miller, Networking
  Cc: Florian Westphal, Pablo Neira Ayuso, Linux Kernel Mailing List,
	Linux Next Mailing List
In-Reply-To: <20220517110303.723a7148@canb.auug.org.au>

[-- Attachment #1: Type: text/plain, Size: 1708 bytes --]

Hi all,

On Tue, 17 May 2022 11:03:03 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> After merging the net-next tree, today's linux-next build (powerpc
> ppc64_defconfig) produced this warning:
> 
> net/netfilter/nf_conntrack_netlink.c:1717:12: warning: 'ctnetlink_dump_one_entry' defined but not used [-Wunused-function]
>  1717 | static int ctnetlink_dump_one_entry(struct sk_buff *skb,
>       |            ^~~~~~~~~~~~~~~~~~~~~~~~
> 
> Introduced by commit
> 
>   8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list")

For my i386 defconfig build this became an error, so I have applied
the following patch for today.
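
The shape of the fix can be sketched outside the kernel as follows.
This is only an illustration: DEMO_EVENTS is a made-up stand-in for
CONFIG_NF_CONNTRACK_EVENTS, the function names are hypothetical, and
it assumes the helper's only remaining caller is built under the same
option:

```c
#include <assert.h>

#ifdef DEMO_EVENTS
/* Static helper whose only caller is compiled under DEMO_EVENTS.
 * Without the guard it would be "defined but not used" whenever the
 * option is off.
 */
static int dump_one_entry(int id)
{
	return id + 1;
}
#endif

int dump_all(int id)
{
	assert(id >= 0);	/* ids are non-negative in this demo */
#ifdef DEMO_EVENTS
	return dump_one_entry(id);
#else
	return 0;	/* nothing to dump when the option is off */
#endif
}
```

Built with -Wall -Werror and without -DDEMO_EVENTS this compiles
cleanly; removing the #ifdef around the helper brings back the
-Wunused-function warning, which -Werror then turns into an error.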

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Tue, 17 May 2022 18:58:43 +1000
Subject: [PATCH] fix up for "netfilter: conntrack: remove unconfirmed list"

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 net/netfilter/nf_conntrack_netlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index e768f59741a6..722af5e309ba 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1714,6 +1714,7 @@ static int ctnetlink_done_list(struct netlink_callback *cb)
 	return 0;
 }
 
+#ifdef CONFIG_NF_CONNTRACK_EVENTS
 static int ctnetlink_dump_one_entry(struct sk_buff *skb,
 				    struct netlink_callback *cb,
 				    struct nf_conn *ct,
@@ -1754,6 +1755,7 @@ static int ctnetlink_dump_one_entry(struct sk_buff *skb,
 
 	return res;
 }
+#endif
 
 static int
 ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb)
-- 
2.35.1
-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply related

* Re: [PATCH] net: systemport: Fix an error handling path in bcm_sysport_probe()
From: patchwork-bot+netdevbpf @ 2022-05-17  9:00 UTC (permalink / raw)
  To: Christophe JAILLET
  Cc: f.fainelli, bcm-kernel-feedback-list, davem, edumazet, kuba,
	pabeni, linux-kernel, kernel-janitors, netdev
In-Reply-To: <99d70634a81c229885ae9e4ee69b2035749f7edc.1652634040.git.christophe.jaillet@wanadoo.fr>

Hello:

This patch was applied to netdev/net.git (master)
by Paolo Abeni <pabeni@redhat.com>:

On Sun, 15 May 2022 19:01:56 +0200 you wrote:
> if devm_clk_get_optional() fails, we still need to go through the error
> handling path.
> 
> Add the missing goto.
> 
> Fixes: 6328a126896ea ("net: systemport: Manage Wake-on-LAN clock")
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
> 
> [...]

Here is the summary with links:
  - net: systemport: Fix an error handling path in bcm_sysport_probe()
    https://git.kernel.org/netdev/net/c/ef6b1cd11962

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v5 11/15] seltests/landlock: connect() with AF_UNSPEC tests
From: Mickaël Salaün @ 2022-05-17  8:55 UTC (permalink / raw)
  To: Konstantin Meskhidze
  Cc: willemdebruijn.kernel, linux-security-module, netdev,
	netfilter-devel, yusongping, anton.sirazetdinov
In-Reply-To: <20220516152038.39594-12-konstantin.meskhidze@huawei.com>

I guess these tests would also work with IPv6. You can then use the 
"alternative" tests I explained.

On 16/05/2022 17:20, Konstantin Meskhidze wrote:
> Adds two selftests for the connect() action with
> the AF_UNSPEC family flag.
> The first one, with no Landlock restrictions,
> allows disconnecting an already connected socket
> with connect(..., AF_UNSPEC, ...):
>      - connect_afunspec_no_restictions;
> The second one refuses to let a landlocked process
> disconnect an already connected socket:
>      - connect_afunspec_with_restictions;
> 
> Signed-off-by: Konstantin Meskhidze <konstantin.meskhidze@huawei.com>
> ---
> 
> Changes since v3:
> * Add connect_afunspec_no_restictions test.
> * Add connect_afunspec_with_restictions test.
> 
> Changes since v4:
> * Refactoring code with self->port, self->addr4 variables.
> * Adds bind() hook check with AF_UNSPEC family.
> 
> ---
>   tools/testing/selftests/landlock/net_test.c | 121 ++++++++++++++++++++
>   1 file changed, 121 insertions(+)
> 
> diff --git a/tools/testing/selftests/landlock/net_test.c b/tools/testing/selftests/landlock/net_test.c
> index cf914d311eb3..bf8e49466d1d 100644
> --- a/tools/testing/selftests/landlock/net_test.c
> +++ b/tools/testing/selftests/landlock/net_test.c
> @@ -449,6 +449,7 @@ TEST_F_FORK(socket_test, connect_with_restrictions_ip6) {
>   	int new_fd;
>   	int sockfd_1, sockfd_2;
>   	pid_t child_1, child_2;
> +
>   	int status;
> 
>   	struct landlock_ruleset_attr ruleset_attr = {
> @@ -467,10 +468,12 @@ TEST_F_FORK(socket_test, connect_with_restrictions_ip6) {
> 
>   	const int ruleset_fd = landlock_create_ruleset(&ruleset_attr,
>   			sizeof(ruleset_attr), 0);
> +

Please no…


>   	ASSERT_LE(0, ruleset_fd);
> 
>   	/* Allows connect and bind operations to the port[0] socket */
>   	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_SERVICE,
> +

ditto

>   				&net_service_1, 0));
>   	/* Allows connect and deny bind operations to the port[1] socket */
>   	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_SERVICE,
> @@ -480,6 +483,7 @@ TEST_F_FORK(socket_test, connect_with_restrictions_ip6) {
>   	enforce_ruleset(_metadata, ruleset_fd);
> 
>   	/* Creates a server socket 1 */
> +
>   	sockfd_1 = create_socket(_metadata, true, false);
>   	ASSERT_LE(0, sockfd_1);
> 
> @@ -556,4 +560,121 @@ TEST_F_FORK(socket_test, connect_with_restrictions_ip6) {
>   	ASSERT_EQ(1, WIFEXITED(status));
>   	ASSERT_EQ(EXIT_SUCCESS, WEXITSTATUS(status));
>   }
> +
> +TEST_F_FORK(socket_test, connect_afunspec_no_restictions) {
> +
> +	int sockfd;
> +	pid_t child;
> +	int status;
> +
> +	/* Creates a server socket 1 */
> +	sockfd = create_socket(_metadata, false, false);
> +	ASSERT_LE(0, sockfd);
> +
> +	/* Binds the socket 1 to address with port[0] with AF_UNSPEC family */
> +	self->addr4[0].sin_family = AF_UNSPEC;
> +	ASSERT_EQ(0, bind(sockfd, (struct sockaddr *)&self->addr4[0], sizeof(self->addr4[0])));
> +
> +	/* Makes connection to socket with port[0] */
> +	ASSERT_EQ(0, connect(sockfd, (struct sockaddr *)&self->addr4[0],
> +						   sizeof(self->addr4[0])));
> +
> +	child = fork();
> +	ASSERT_LE(0, child);
> +	if (child == 0) {
> +		struct sockaddr addr_unspec = {.sa_family = AF_UNSPEC};
> +
> +		/* Child tries to disconnect already connected socket */
> +		ASSERT_EQ(0, connect(sockfd, (struct sockaddr *)&addr_unspec,
> +						sizeof(addr_unspec)));
> +		_exit(_metadata->passed ? EXIT_SUCCESS : EXIT_FAILURE);
> +		return;
> +	}
> +	/* Closes listening socket 1 for the parent*/
> +	ASSERT_EQ(0, close(sockfd));
> +
> +	ASSERT_EQ(child, waitpid(child, &status, 0));
> +	ASSERT_EQ(1, WIFEXITED(status));
> +	ASSERT_EQ(EXIT_SUCCESS, WEXITSTATUS(status));
> +}
> +
> +TEST_F_FORK(socket_test, connect_afunspec_with_restictions) {
> +
> +	int sockfd;
> +	pid_t child;
> +	int status;
> +
> +	struct landlock_ruleset_attr ruleset_attr_1 = {
> +		.handled_access_net = LANDLOCK_ACCESS_NET_BIND_TCP,
> +	};
> +	struct landlock_net_service_attr net_service_1 = {
> +		.allowed_access = LANDLOCK_ACCESS_NET_BIND_TCP,
> +
> +		.port = self->port[0],
> +	};
> +
> +	struct landlock_ruleset_attr ruleset_attr_2 = {
> +		.handled_access_net = LANDLOCK_ACCESS_NET_BIND_TCP |
> +				      LANDLOCK_ACCESS_NET_CONNECT_TCP,
> +	};
> +	struct landlock_net_service_attr net_service_2 = {
> +		.allowed_access = LANDLOCK_ACCESS_NET_BIND_TCP |
> +				  LANDLOCK_ACCESS_NET_CONNECT_TCP,
> +
> +		.port = self->port[0],
> +	};
> +
> +	const int ruleset_fd_1 = landlock_create_ruleset(&ruleset_attr_1,
> +					sizeof(ruleset_attr_1), 0);
> +	ASSERT_LE(0, ruleset_fd_1);
> +
> +	/* Allows bind operations to the port[0] socket */
> +	ASSERT_EQ(0, landlock_add_rule(ruleset_fd_1, LANDLOCK_RULE_NET_SERVICE,
> +				       &net_service_1, 0));
> +
> +	/* Enforces the ruleset. */
> +	enforce_ruleset(_metadata, ruleset_fd_1);
> +
> +	/* Creates a server socket 1 */
> +	sockfd = create_socket(_metadata, false, false);
> +	ASSERT_LE(0, sockfd);
> +
> +	/* Binds the socket 1 to address with port[0] with AF_UNSPEC family */
> +	self->addr4[0].sin_family = AF_UNSPEC;
> +	ASSERT_EQ(0, bind(sockfd, (struct sockaddr *)&self->addr4[0], sizeof(self->addr4[0])));
> +
> +	/* Makes connection to socket with port[0] */
> +	ASSERT_EQ(0, connect(sockfd, (struct sockaddr *)&self->addr4[0],
> +						   sizeof(self->addr4[0])));
> +
> +	const int ruleset_fd_2 = landlock_create_ruleset(&ruleset_attr_2,
> +					sizeof(ruleset_attr_2), 0);
> +	ASSERT_LE(0, ruleset_fd_2);
> +
> +	/* Allows connect and bind operations to the port[0] socket */
> +	ASSERT_EQ(0, landlock_add_rule(ruleset_fd_2, LANDLOCK_RULE_NET_SERVICE,
> +				       &net_service_2, 0));
> +
> +	/* Enforces the ruleset. */
> +	enforce_ruleset(_metadata, ruleset_fd_2);
> +
> +	child = fork();
> +	ASSERT_LE(0, child);
> +	if (child == 0) {
> +		struct sockaddr addr_unspec = {.sa_family = AF_UNSPEC};
> +
> +		/* Child tries to disconnect already connected socket */
> +		ASSERT_EQ(-1, connect(sockfd, (struct sockaddr *)&addr_unspec,
> +						sizeof(addr_unspec)));
> +		ASSERT_EQ(EACCES, errno);
> +		_exit(_metadata->passed ? EXIT_SUCCESS : EXIT_FAILURE);
> +		return;
> +	}
> +	/* Closes listening socket 1 for the parent*/
> +	ASSERT_EQ(0, close(sockfd));
> +
> +	ASSERT_EQ(child, waitpid(child, &status, 0));
> +	ASSERT_EQ(1, WIFEXITED(status));
> +	ASSERT_EQ(EXIT_SUCCESS, WEXITSTATUS(status));
> +}
>   TEST_HARNESS_MAIN
> --
> 2.25.1
> 

^ permalink raw reply

* [PATCH v5 3/3] ARM: dts: imx6qdl-sr-som: update phy configuration for som revision 1.9
From: Josua Mayer @ 2022-05-17  8:54 UTC (permalink / raw)
  To: netdev
  Cc: alvaro.karsz, Josua Mayer, Russell King, Rob Herring,
	Krzysztof Kozlowski, Shawn Guo, Sascha Hauer,
	Pengutronix Kernel Team, Fabio Estevam, NXP Linux Team
In-Reply-To: <20220517085431.3895-1-josua@solid-run.com>

Since SoM revision 1.9 the PHY has been replaced with an ADIN1300.
Add an entry for it next to the original.

As Russell King pointed out, additional phy nodes cause warnings like:
mdio_bus 2188000.ethernet-1: MDIO device at address 1 is missing
To avoid this the new node has its status set to disabled. U-Boot will
be modified to enable the appropriate phy node after probing.

The existing ar8035 nodes have to stay enabled by default to avoid
breaking existing systems when they update Linux only.

Co-developed-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Signed-off-by: Josua Mayer <josua@solid-run.com>
---
V2 -> V3: new phy node status set disabled
V1 -> V2: changed dts property name

 arch/arm/boot/dts/imx6qdl-sr-som.dtsi | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/arm/boot/dts/imx6qdl-sr-som.dtsi b/arch/arm/boot/dts/imx6qdl-sr-som.dtsi
index f86efd0ccc40..ce543e325cd3 100644
--- a/arch/arm/boot/dts/imx6qdl-sr-som.dtsi
+++ b/arch/arm/boot/dts/imx6qdl-sr-som.dtsi
@@ -83,6 +83,16 @@ ethernet-phy@4 {
 			qca,clk-out-frequency = <125000000>;
 			qca,smarteee-tw-us-1g = <24>;
 		};
+
+		/*
+		 * ADIN1300 (som rev 1.9 or later) is always at address 1. It
+		 * will be enabled automatically by U-Boot if detected.
+		 */
+		ethernet-phy@1 {
+			reg = <1>;
+			adi,phy-output-clock = "125mhz-free-running";
+			status = "disabled";
+		};
 	};
 };
 
-- 
2.35.3


^ permalink raw reply related

* [PATCH v5 2/3] net: phy: adin: add support for clock output
From: Josua Mayer @ 2022-05-17  8:54 UTC (permalink / raw)
  To: netdev
  Cc: alvaro.karsz, Josua Mayer, Michael Hennerich, Andrew Lunn,
	Heiner Kallweit, Russell King, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
In-Reply-To: <20220517085431.3895-1-josua@solid-run.com>

The ADIN1300 supports generating certain clocks on its GP_CLK pin, as
well as providing the reference clock on CLK25_REF.

Add support for selecting the clock via device-tree properties.

Technically the phy also supports a recovered 125MHz clock for
synchronous ethernet. SyncE should be configured dynamically at
runtime, however Linux does not currently have a toggle for this,
so support is explicitly omitted.
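
The mapping from property values to selector bits can be sketched in
user space as below. The bit positions and property strings are taken
from this patch; the function name clk_out_sel and the shortened macro
names are only for this illustration, and the real driver writes the
result with phy_modify_mmd() rather than returning it:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BIT(n)			(1u << (n))
#define CLK_CFG_MASK		0x3fu	/* bits 5..0 of the clock config register */
#define CLK_CFG_FREE_125	BIT(4)
#define CLK_CFG_REF_EN		BIT(3)
#define CLK_CFG_HRT_FREE	BIT(1)
#define CLK_CFG_25		BIT(0)

/* Translate the "adi,phy-output-clock" string (NULL if absent) and the
 * "adi,phy-output-reference-clock" flag into selector bits.
 * Returns the bits, or -1 for an unknown string.
 */
int clk_out_sel(const char *val, int ref_clock_en)
{
	unsigned int sel = 0;

	if (val == NULL)
		;	/* property not present: leave GP_CLK disabled */
	else if (strcmp(val, "25mhz-reference") == 0)
		sel |= CLK_CFG_25;
	else if (strcmp(val, "125mhz-free-running") == 0)
		sel |= CLK_CFG_FREE_125;
	else if (strcmp(val, "adaptive-free-running") == 0)
		sel |= CLK_CFG_HRT_FREE;
	else
		return -1;

	if (ref_clock_en)
		sel |= CLK_CFG_REF_EN;

	assert(sel <= CLK_CFG_MASK);
	return (int)sel;
}
```

For example, "125mhz-free-running" without the reference clock flag
yields 0x10, and an absent property with the flag set yields 0x08.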

Co-developed-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Signed-off-by: Alvaro Karsz <alvaro.karsz@solid-run.com>
Signed-off-by: Josua Mayer <josua@solid-run.com>
---
V4 -> V5: removed recovered clock options
V3 -> V4: fix coding style violations reported by Andrew and checkpatch
V2 -> V3: fix integer-as-null-pointer compiler warning
V1 -> V2: revised dts property name for clock(s)
V1 -> V2: implemented all 6 bits in the clock configuration register

 drivers/net/phy/adin.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/drivers/net/phy/adin.c b/drivers/net/phy/adin.c
index 5ce6da62cc8e..ee374a85544a 100644
--- a/drivers/net/phy/adin.c
+++ b/drivers/net/phy/adin.c
@@ -99,6 +99,15 @@
 #define ADIN1300_GE_SOFT_RESET_REG		0xff0c
 #define   ADIN1300_GE_SOFT_RESET		BIT(0)
 
+#define ADIN1300_GE_CLK_CFG_REG			0xff1f
+#define   ADIN1300_GE_CLK_CFG_MASK		GENMASK(5, 0)
+#define   ADIN1300_GE_CLK_CFG_RCVR_125		BIT(5)
+#define   ADIN1300_GE_CLK_CFG_FREE_125		BIT(4)
+#define   ADIN1300_GE_CLK_CFG_REF_EN		BIT(3)
+#define   ADIN1300_GE_CLK_CFG_HRT_RCVR		BIT(2)
+#define   ADIN1300_GE_CLK_CFG_HRT_FREE		BIT(1)
+#define   ADIN1300_GE_CLK_CFG_25		BIT(0)
+
 #define ADIN1300_GE_RGMII_CFG_REG		0xff23
 #define   ADIN1300_GE_RGMII_RX_MSK		GENMASK(8, 6)
 #define   ADIN1300_GE_RGMII_RX_SEL(x)		\
@@ -433,6 +442,33 @@ static int adin_set_tunable(struct phy_device *phydev,
 	}
 }
 
+static int adin_config_clk_out(struct phy_device *phydev)
+{
+	struct device *dev = &phydev->mdio.dev;
+	const char *val = NULL;
+	u8 sel = 0;
+
+	device_property_read_string(dev, "adi,phy-output-clock", &val);
+	if (!val) {
+		/* property not present, do not enable GP_CLK pin */
+	} else if (strcmp(val, "25mhz-reference") == 0) {
+		sel |= ADIN1300_GE_CLK_CFG_25;
+	} else if (strcmp(val, "125mhz-free-running") == 0) {
+		sel |= ADIN1300_GE_CLK_CFG_FREE_125;
+	} else if (strcmp(val, "adaptive-free-running") == 0) {
+		sel |= ADIN1300_GE_CLK_CFG_HRT_FREE;
+	} else {
+		phydev_err(phydev, "invalid adi,phy-output-clock\n");
+		return -EINVAL;
+	}
+
+	if (device_property_read_bool(dev, "adi,phy-output-reference-clock"))
+		sel |= ADIN1300_GE_CLK_CFG_REF_EN;
+
+	return phy_modify_mmd(phydev, MDIO_MMD_VEND1, ADIN1300_GE_CLK_CFG_REG,
+			      ADIN1300_GE_CLK_CFG_MASK, sel);
+}
+
 static int adin_config_init(struct phy_device *phydev)
 {
 	int rc;
@@ -455,6 +491,10 @@ static int adin_config_init(struct phy_device *phydev)
 	if (rc < 0)
 		return rc;
 
+	rc = adin_config_clk_out(phydev);
+	if (rc < 0)
+		return rc;
+
 	phydev_dbg(phydev, "PHY is using mode '%s'\n",
 		   phy_modes(phydev->interface));
 
-- 
2.35.3


^ permalink raw reply related

* [PATCH v5 1/3] dt-bindings: net: adin: document phy clock output properties
From: Josua Mayer @ 2022-05-17  8:54 UTC (permalink / raw)
  To: netdev
  Cc: alvaro.karsz, Josua Mayer, Michael Hennerich, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Alexandru Ardelean
In-Reply-To: <20220517085143.3749-1-josua@solid-run.com>

The ADIN1300 supports generating certain clocks on its GP_CLK pin, as
well as providing the reference clock on CLK25_REF.

Add DT properties to configure both pins.

Technically the phy also supports a recovered 125MHz clock for
synchronous ethernet. However SyncE should be configured dynamically at
runtime, so it is explicitly omitted in this binding.

Signed-off-by: Josua Mayer <josua@solid-run.com>
---
V4 -> V5: removed recovered clock options
V3 -> V4: changed type of adi,phy-output-reference-clock to boolean
V1 -> V2: changed clkout property to enum
V1 -> V2: added property for CLK25_REF pin

 .../devicetree/bindings/net/adi,adin.yaml         | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/adi,adin.yaml b/Documentation/devicetree/bindings/net/adi,adin.yaml
index 1129f2b58e98..77750df0c2c4 100644
--- a/Documentation/devicetree/bindings/net/adi,adin.yaml
+++ b/Documentation/devicetree/bindings/net/adi,adin.yaml
@@ -36,6 +36,21 @@ properties:
     enum: [ 4, 8, 12, 16, 20, 24 ]
     default: 8
 
+  adi,phy-output-clock:
+    description: Select clock output on GP_CLK pin. Two clocks are available:
+      A 25MHz reference and a free-running 125MHz.
+      The phy can alternatively automatically switch between the reference and
+      the 125MHz clocks based on its internal state.
+    $ref: /schemas/types.yaml#/definitions/string
+    enum:
+      - 25mhz-reference
+      - 125mhz-free-running
+      - adaptive-free-running
+
+  adi,phy-output-reference-clock:
+    description: Enable 25MHz reference clock output on CLK25_REF pin.
+    type: boolean
+
 unevaluatedProperties: false
 
 examples:
-- 
2.35.3


^ permalink raw reply related

* [PATCH v5 0/3] adin: add support for clock output
From: Josua Mayer @ 2022-05-17  8:51 UTC (permalink / raw)
  To: netdev; +Cc: alvaro.karsz, Josua Mayer

This patch series adds support for configuring the two clock outputs of adin
1200 and 1300 PHYs. Certain network controllers require an external reference
clock which can be provided by the PHY.

One of the replies to v1 was asking why the common clock framework isn't used.
Currently no PHY driver has implemented providing a clock to the network
controller. Instead they rely on vendor extensions to make the appropriate
configuration. For example ar8035 uses qca,clk-out-frequency - this patchset
aimed to replicate the same functionality.

Finally the 125MHz free-running clock is enabled in the device-tree for
SolidRun i.MX6 SoMs, to support revisions 1.9 and later, where the original phy
has been replaced with an adin 1300.
To avoid introducing new warning messages during boot for SoMs before rev 1.9,
the status field of the new phy node is disabled by default, and will be
enabled by U-Boot on demand.

Changes since v4:
- removed recovered clock options

Changes since v3:
- fix coding style violations reported by Andrew and checkpatch
- changed type of adi,phy-output-reference-clock from flag to boolean

Changes since v2:
- set new phy node status to disabled
- fix integer-as-null-pointer compiler warning
  Reported-by: kernel test robot <lkp@intel.com>

Changes since v1:
- renamed device-tree property and changed to enum
- added device-tree property for second clock output
- implemented all bits from the clock configuration register

Josua Mayer (3):
  dt-bindings: net: adin: document phy clock output properties
  net: phy: adin: add support for clock output
  ARM: dts: imx6qdl-sr-som: update phy configuration for som revision
    1.9

 .../devicetree/bindings/net/adi,adin.yaml     | 15 +++++++
 arch/arm/boot/dts/imx6qdl-sr-som.dtsi         | 10 +++++
 drivers/net/phy/adin.c                        | 40 +++++++++++++++++++
 3 files changed, 65 insertions(+)

-- 
2.35.3


^ permalink raw reply

* Re: [PATCH v5 08/15] landlock: TCP network hooks implementation
From: Mickaël Salaün @ 2022-05-17  8:51 UTC (permalink / raw)
  To: Konstantin Meskhidze
  Cc: willemdebruijn.kernel, linux-security-module, netdev,
	netfilter-devel, yusongping, anton.sirazetdinov
In-Reply-To: <20220516152038.39594-9-konstantin.meskhidze@huawei.com>


On 16/05/2022 17:20, Konstantin Meskhidze wrote:
> Support of socket_bind() and socket_connect() hooks.
> It's possible to restrict binding and connecting of TCP
> sockets to particular ports. It's just a basic idea of
> how Landlock could support network confinement.
> 
> Signed-off-by: Konstantin Meskhidze <konstantin.meskhidze@huawei.com>
> ---
> 
> Changes since v3:
> * Split commit.
> * Add SECURITY_NETWORK in config.
> * Add IS_ENABLED(CONFIG_INET) if a kernel has no INET configuration.
> * Add hook_socket_bind and hook_socket_connect hooks.
> 
> Changes since v4:
> * Factors out CONFIG_INET into make file.
> * Refactoring check_socket_access().
> * Adds helper get_port().
> * Adds CONFIG_IPV6 in  get_port(), hook_socket_bind/connect
> functions to support AF_INET6 family.
> * Adds AF_UNSPEC family support in hook_socket_bind/connect
> functions.
> * Refactoring add_rule_net_service() and landlock_add_rule
> syscall to support network rule inserting.
> * Refactoring init_layer_masks() to support network rules.
> 
> ---
>   security/landlock/Kconfig    |   1 +
>   security/landlock/Makefile   |   2 +
>   security/landlock/net.c      | 159 +++++++++++++++++++++++++++++++++++
>   security/landlock/net.h      |  25 ++++++
>   security/landlock/ruleset.c  |  15 +++-
>   security/landlock/setup.c    |   2 +
>   security/landlock/syscalls.c |  63 ++++++++++++--
>   7 files changed, 261 insertions(+), 6 deletions(-)
>   create mode 100644 security/landlock/net.c
>   create mode 100644 security/landlock/net.h
> 
> diff --git a/security/landlock/Kconfig b/security/landlock/Kconfig
> index 8e33c4e8ffb8..10c099097533 100644
> --- a/security/landlock/Kconfig
> +++ b/security/landlock/Kconfig
> @@ -3,6 +3,7 @@
>   config SECURITY_LANDLOCK
>   	bool "Landlock support"
>   	depends on SECURITY && !ARCH_EPHEMERAL_INODES
> +	select SECURITY_NETWORK
>   	select SECURITY_PATH
>   	help
>   	  Landlock is a sandboxing mechanism that enables processes to restrict
> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
> index 7bbd2f413b3e..53d3c92ae22e 100644
> --- a/security/landlock/Makefile
> +++ b/security/landlock/Makefile
> @@ -2,3 +2,5 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
> 
>   landlock-y := setup.o syscalls.o object.o ruleset.o \
>   	cred.o ptrace.o fs.o
> +
> +landlock-$(CONFIG_INET) += net.o
> \ No newline at end of file
> diff --git a/security/landlock/net.c b/security/landlock/net.c
> new file mode 100644
> index 000000000000..9302e5891991
> --- /dev/null
> +++ b/security/landlock/net.c
> @@ -0,0 +1,159 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Landlock LSM - Network management and hooks
> + *
> + * Copyright (C) 2022 Huawei Tech. Co., Ltd.
> + */
> +
> +#include <linux/in.h>
> +#include <linux/net.h>
> +#include <linux/socket.h>
> +#include <net/ipv6.h>
> +
> +#include "cred.h"
> +#include "limits.h"
> +#include "net.h"
> +
> +int landlock_append_net_rule(struct landlock_ruleset *const ruleset,
> +			     u16 port, u32 access_rights)
> +{
> +	int err;
> +
> +	/* Transforms relative access rights to absolute ones. */
> +	access_rights |= LANDLOCK_MASK_ACCESS_NET &
> +			 ~landlock_get_net_access_mask(ruleset, 0);
> +
> +	BUILD_BUG_ON(sizeof(port) > sizeof(uintptr_t));
> +	mutex_lock(&ruleset->lock);
> +	err = landlock_insert_rule(ruleset, NULL, port,
> +				access_rights, LANDLOCK_RULE_NET_SERVICE);
> +	mutex_unlock(&ruleset->lock);
> +
> +	return err;
> +}
> +
> +static int check_socket_access(const struct landlock_ruleset *const domain,
> +			       u16 port, access_mask_t access_request)
> +{
> +	bool allowed = false;
> +	layer_mask_t layer_masks[LANDLOCK_NUM_ACCESS_NET] = {};
> +	const struct landlock_rule *rule;
> +	access_mask_t handled_access;
> +
> +	if (WARN_ON_ONCE(!domain))
> +		return 0;
> +	if (WARN_ON_ONCE(domain->num_layers < 1))
> +		return -EACCES;
> +
> +	rule = landlock_find_rule(domain, port,
> +					LANDLOCK_RULE_NET_SERVICE);
> +
> +	handled_access = init_layer_masks(domain, access_request,
> +			&layer_masks, sizeof(layer_masks),
> +			LANDLOCK_RULE_NET_SERVICE);
> +	allowed = unmask_layers(rule, handled_access,
> +			&layer_masks, ARRAY_SIZE(layer_masks));
> +
> +	return allowed ? 0 : -EACCES;
> +}
> +
> +static u16 get_port(const struct sockaddr *const address)
> +{
> +	/* Gets port value in host byte order. */
> +	switch (address->sa_family) {
> +	case AF_UNSPEC:

Are you sure about that?

Please write a test for this case.


> +	case AF_INET:
> +	{

You don't need these braces (except if it is required by checkpatch.pl).


> +		const struct sockaddr_in *const sockaddr =
> +					(struct sockaddr_in *)address;
> +		return ntohs(sockaddr->sin_port);
> +	}
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case AF_INET6:
> +	{
> +		const struct sockaddr_in6 *const sockaddr_ip6 =
> +					(struct sockaddr_in6 *)address;
> +		return ntohs(sockaddr_ip6->sin6_port);
> +	}
> +#endif
> +	}

You missed part of my patch… We should not extract the port for a
protocol we don't know, hence the WARN_ON_ONCE().
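
To illustrate the point above, here is a minimal userspace sketch, not the
kernel helper itself. The name sketch_get_port() and the "known" out-parameter
are hypothetical; treating AF_UNSPEC as an unknown family is an assumption made
only for this example. The default branch marks where the kernel version would
WARN_ON_ONCE().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Hypothetical userspace stand-in for get_port(). Returns the port in host
 * byte order. For a family it does not know, it clears *known and returns 0;
 * that is the spot where the kernel helper would WARN_ON_ONCE().
 */
uint16_t sketch_get_port(const struct sockaddr *address, bool *known)
{
	assert(address && known);

	*known = true;
	switch (address->sa_family) {
	case AF_INET:
		return ntohs(((const struct sockaddr_in *)address)->sin_port);
	case AF_INET6:
		return ntohs(((const struct sockaddr_in6 *)address)->sin6_port);
	default:
		/* Unknown family (AF_UNSPEC included, by assumption). */
		*known = false;
		return 0;
	}
}
```

Because the cases only contain a return statement and no declarations, no
braces are needed around them.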


> +	return 0;
> +}
> +
> +static int hook_socket_bind(struct socket *sock, struct sockaddr *address,
> +			    int addrlen)
> +{
> +	const struct landlock_ruleset *const dom =
> +						landlock_get_current_domain();
> +
> +	if (!dom)
> +		return 0;
> +
> +	/* Check if it's a TCP socket */
> +	if (sock->type != SOCK_STREAM)
> +		return 0;
> +
> +	/* Get port value in host byte order */

I moved/removed this in my patch against v4 for a reason. Please ask if
you don't understand or don't agree with something I said.


> +	switch (address->sa_family) {
> +	case AF_UNSPEC:

Is this correct?

Please write a test for this case.

> +	case AF_INET:
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case AF_INET6:
> +#endif
> +		return check_socket_access(dom, get_port(address),
> +					LANDLOCK_ACCESS_NET_BIND_TCP);
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static int hook_socket_connect(struct socket *sock, struct sockaddr *address,
> +				int addrlen)
> +{
> +	const struct landlock_ruleset *const dom =
> +						landlock_get_current_domain();
> +
> +	if (!dom)
> +		return 0;
> +
> +	/* Check if it's a TCP socket */
> +	if (sock->type != SOCK_STREAM)
> +		return 0;
> +
> +	/* Get port value in host byte order */
> +	switch (address->sa_family) {
> +	case AF_INET:
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case AF_INET6:
> +#endif
> +		return check_socket_access(dom, get_port(address),
> +					   LANDLOCK_ACCESS_NET_CONNECT_TCP);
> +	case AF_UNSPEC:
> +	{
> +		u16 i;
> +		/*
> +		 * If just in a layer a mask supports connect access,
> +		 * the socket_connect() hook with AF_UNSPEC family flag
> +		 * must be banned. This prevents from disconnecting already
> +		 * connected sockets.
> +		 */
> +		for (i = 0; i < dom->num_layers; i++) {
> +			if (landlock_get_net_access_mask(dom, i) &
> +				LANDLOCK_ACCESS_NET_CONNECT_TCP)
> +				return -EACCES;
> +		}
> +	}
> +	}
> +	return 0;
> +}
> +
> +static struct security_hook_list landlock_hooks[] __lsm_ro_after_init = {
> +	LSM_HOOK_INIT(socket_bind, hook_socket_bind),
> +	LSM_HOOK_INIT(socket_connect, hook_socket_connect),
> +};
> +
> +__init void landlock_add_net_hooks(void)
> +{
> +	security_add_hooks(landlock_hooks, ARRAY_SIZE(landlock_hooks),
> +			LANDLOCK_NAME);
> +}
> diff --git a/security/landlock/net.h b/security/landlock/net.h
> new file mode 100644
> index 000000000000..da5ce8fa04cc
> --- /dev/null
> +++ b/security/landlock/net.h
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Landlock LSM - Network management and hooks
> + *
> + * Copyright (C) 2022 Huawei Tech. Co., Ltd.
> + */
> +
> +#ifndef _SECURITY_LANDLOCK_NET_H
> +#define _SECURITY_LANDLOCK_NET_H
> +
> +#include "common.h"
> +#include "ruleset.h"
> +#include "setup.h"
> +
> +#if IS_ENABLED(CONFIG_INET)
> +__init void landlock_add_net_hooks(void);
> +
> +int landlock_append_net_rule(struct landlock_ruleset *const ruleset,
> +				u16 port, u32 access_hierarchy);
> +#else /* IS_ENABLED(CONFIG_INET) */
> +static inline void landlock_add_net_hooks(void)
> +{}
> +#endif /* IS_ENABLED(CONFIG_INET) */
> +
> +#endif /* _SECURITY_LANDLOCK_NET_H */
> diff --git a/security/landlock/ruleset.c b/security/landlock/ruleset.c
> index ea9ecb3f471a..317cf98890f6 100644
> --- a/security/landlock/ruleset.c
> +++ b/security/landlock/ruleset.c
> @@ -671,7 +671,7 @@ access_mask_t get_handled_accesses(
>   		}
>   		break;
>   	default:
> -		break;
> +		return 0;

Why?


>   	}
>   	return access_dom;
>   }
> @@ -763,6 +763,19 @@ access_mask_t init_layer_masks(const struct landlock_ruleset *const domain,
>   				}
>   			}
>   			break;
> +		case LANDLOCK_RULE_NET_SERVICE:
> +			for_each_set_bit(access_bit, &access_req,
> +					LANDLOCK_NUM_ACCESS_NET) {
> +				if (landlock_get_net_access_mask(domain,
> +								 layer_level) &
> +						BIT_ULL(access_bit)) {
> +					(*layer_masks)[access_bit] |=
> +						BIT_ULL(layer_level);
> +					handled_accesses |=
> +							   BIT_ULL(access_bit);
> +				}
> +			}
> +			break;
>   		default:
>   			return 0;
>   		}
> diff --git a/security/landlock/setup.c b/security/landlock/setup.c
> index f8e8e980454c..8059dc0b47d3 100644
> --- a/security/landlock/setup.c
> +++ b/security/landlock/setup.c
> @@ -14,6 +14,7 @@
>   #include "fs.h"
>   #include "ptrace.h"
>   #include "setup.h"
> +#include "net.h"
> 
>   bool landlock_initialized __lsm_ro_after_init = false;
> 
> @@ -28,6 +29,7 @@ static int __init landlock_init(void)
>   	landlock_add_cred_hooks();
>   	landlock_add_ptrace_hooks();
>   	landlock_add_fs_hooks();
> +	landlock_add_net_hooks();
>   	landlock_initialized = true;
>   	pr_info("Up and running.\n");
>   	return 0;
> diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> index 812541f4e155..9454c6361011 100644
> --- a/security/landlock/syscalls.c
> +++ b/security/landlock/syscalls.c
> @@ -29,6 +29,7 @@
>   #include "cred.h"
>   #include "fs.h"
>   #include "limits.h"
> +#include "net.h"
>   #include "ruleset.h"
>   #include "setup.h"
> 
> @@ -74,7 +75,8 @@ static void build_check_abi(void)
>   {
>   	struct landlock_ruleset_attr ruleset_attr;
>   	struct landlock_path_beneath_attr path_beneath_attr;
> -	size_t ruleset_size, path_beneath_size;
> +	struct landlock_net_service_attr net_service_attr;
> +	size_t ruleset_size, path_beneath_size, net_service_size;
> 
>   	/*
>   	 * For each user space ABI structures, first checks that there is no
> @@ -90,6 +92,11 @@ static void build_check_abi(void)
>   	path_beneath_size += sizeof(path_beneath_attr.parent_fd);
>   	BUILD_BUG_ON(sizeof(path_beneath_attr) != path_beneath_size);
>   	BUILD_BUG_ON(sizeof(path_beneath_attr) != 12);
> +
> +	net_service_size = sizeof(net_service_attr.allowed_access);
> +	net_service_size += sizeof(net_service_attr.port);
> +	BUILD_BUG_ON(sizeof(net_service_attr) != net_service_size);
> +	BUILD_BUG_ON(sizeof(net_service_attr) != 10);
>   }
> 
>   /* Ruleset handling */
> @@ -299,9 +306,9 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset,
>   	 * Informs about useless rule: empty allowed_access (i.e. deny rules)
>   	 * are ignored in path walks.
>   	 */
> -	if (!path_beneath_attr.allowed_access) {
> +	if (!path_beneath_attr.allowed_access)

Why?


>   		return -ENOMSG;
> -	}
> +
>   	/*
>   	 * Checks that allowed_access matches the @ruleset constraints
>   	 * (ruleset->access_masks[0] is automatically upgraded to 64-bits).
> @@ -323,13 +330,54 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset,
>   	return err;
>   }
> 
> +static int add_rule_net_service(struct landlock_ruleset *ruleset,
> +				const void *const rule_attr)
> +{
> +#if IS_ENABLED(CONFIG_INET)
> +	struct landlock_net_service_attr net_service_attr;
> +	int res;
> +	u32 mask;
> +
> +	/* Copies raw user space buffer, only one type for now. */
> +	res = copy_from_user(&net_service_attr, rule_attr,
> +			sizeof(net_service_attr));
> +	if (res)
> +		return -EFAULT;
> +
> +	/*
> +	 * Informs about useless rule: empty allowed_access (i.e. deny rules)
> +	 * are ignored by network actions
> +	 */
> +	if (!net_service_attr.allowed_access)
> +		return -ENOMSG;
> +
> +	/*
> +	 * Checks that allowed_access matches the @ruleset constraints
> +	 * (ruleset->access_masks[0] is automatically upgraded to 64-bits).
> +	 */
> +	mask = landlock_get_net_access_mask(ruleset, 0);
> +	if ((net_service_attr.allowed_access | mask) != mask)
> +		return -EINVAL;
> +
> +	/* Denies inserting a rule with port 0 */
> +	if (net_service_attr.port == 0)
> +		return -EINVAL;
> +
> +	/* Imports the new rule. */
> +	return landlock_append_net_rule(ruleset, net_service_attr.port,
> +				       net_service_attr.allowed_access);
> +#else /* IS_ENABLED(CONFIG_INET) */
> +	return -EAFNOSUPPORT;
> +#endif /* IS_ENABLED(CONFIG_INET) */
> +}
> +
>   /**
>    * sys_landlock_add_rule - Add a new rule to a ruleset
>    *
>    * @ruleset_fd: File descriptor tied to the ruleset that should be extended
>    *		with the new rule.
> - * @rule_type: Identify the structure type pointed to by @rule_attr (only
> - *             LANDLOCK_RULE_PATH_BENEATH for now).
> + * @rule_type: Identify the structure type pointed to by @rule_attr:
> + *             LANDLOCK_RULE_PATH_BENEATH or LANDLOCK_RULE_NET_SERVICE.
>    * @rule_attr: Pointer to a rule (only of type &struct
>    *             landlock_path_beneath_attr for now).
>    * @flags: Must be 0.
> @@ -340,6 +388,8 @@ static int add_rule_path_beneath(struct landlock_ruleset *const ruleset,
>    * Possible returned errors are:
>    *
>    * - EOPNOTSUPP: Landlock is supported by the kernel but disabled at boot time;
> + * - EAFNOSUPPORT: @rule_type is LANDLOCK_RULE_NET_SERVICE but TCP/IP is not
> + *   supported by the running kernel;
>    * - EINVAL: @flags is not 0, or inconsistent access in the rule (i.e.
>    *   &landlock_path_beneath_attr.allowed_access is not a subset of the rule's
>    *   accesses);
> @@ -375,6 +425,9 @@ SYSCALL_DEFINE4(landlock_add_rule,
>   	case LANDLOCK_RULE_PATH_BENEATH:
>   		err = add_rule_path_beneath(ruleset, rule_attr);
>   		break;
> +	case LANDLOCK_RULE_NET_SERVICE:
> +		err = add_rule_net_service(ruleset, rule_attr);
> +		break;
>   	default:
>   		err = -EINVAL;
>   		break;
> --
> 2.25.1
> 


* Re: [PATCH net-next v2] net: wwan: t7xx: fix GFP_KERNEL usage in spin_lock context
From: Loic Poulain @ 2022-05-17  8:50 UTC (permalink / raw)
  To: Ziyang Xuan
  Cc: chandrashekar.devegowda, linuxwwan, chiranjeevi.rapolu,
	haijun.liu, m.chetan.kumar, ricardo.martinez, ryazanov.s.a,
	johannes, davem, edumazet, kuba, pabeni, netdev
In-Reply-To: <20220517064821.3966990-1-william.xuanziyang@huawei.com>

Hi Ziyang,

On Tue, 17 May 2022 at 08:30, Ziyang Xuan <william.xuanziyang@huawei.com> wrote:
>
> t7xx_cldma_clear_rxq() call t7xx_cldma_alloc_and_map_skb() in spin_lock
> context, But __dev_alloc_skb() in t7xx_cldma_alloc_and_map_skb() uses
> GFP_KERNEL, that will introduce scheduling factor in spin_lock context.
>
> Because t7xx_cldma_clear_rxq() is called after stopping CLDMA, so we can
> remove the spin_lock from t7xx_cldma_clear_rxq().
>
> Fixes: 39d439047f1d ("net: wwan: t7xx: Add control DMA interface")
> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
> ---

You should normally indicate what changed in this v2.

>  drivers/net/wwan/t7xx/t7xx_hif_cldma.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/wwan/t7xx/t7xx_hif_cldma.c b/drivers/net/wwan/t7xx/t7xx_hif_cldma.c
> index 46066dcd2607..7493285a9606 100644
> --- a/drivers/net/wwan/t7xx/t7xx_hif_cldma.c
> +++ b/drivers/net/wwan/t7xx/t7xx_hif_cldma.c
> @@ -782,10 +782,12 @@ static int t7xx_cldma_clear_rxq(struct cldma_ctrl *md_ctrl, int qnum)
>         struct cldma_queue *rxq = &md_ctrl->rxq[qnum];
>         struct cldma_request *req;
>         struct cldma_gpd *gpd;
> -       unsigned long flags;
>         int ret = 0;
>
> -       spin_lock_irqsave(&rxq->ring_lock, flags);
> +       /* CLDMA has been stopped. There is not any CLDMA IRQ, holding
> +        * ring_lock is not needed.

Explaining why locking is not needed here makes sense, but the next
sentence is redundant:


>  Thus we can use functions that may
> +        * introduce scheduling.
> +        */
>         t7xx_cldma_q_reset(rxq);
>         list_for_each_entry(req, &rxq->tr_ring->gpd_ring, entry) {
>                 gpd = req->gpd;
> @@ -808,7 +810,6 @@ static int t7xx_cldma_clear_rxq(struct cldma_ctrl *md_ctrl, int qnum)
>
>                 t7xx_cldma_gpd_set_data_ptr(req->gpd, req->mapped_buff);
>         }
> -       spin_unlock_irqrestore(&rxq->ring_lock, flags);
>
>         return ret;
>  }
> --
> 2.25.1
>


* Re: [PATCH v4 1/3] dt-bindings: net: adin: document phy clock
From: Josua Mayer @ 2022-05-17  8:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Michael Walle, alexandru.ardelean, alvaro.karsz, davem, edumazet,
	krzysztof.kozlowski+dt, michael.hennerich, netdev, pabeni,
	robh+dt
In-Reply-To: <20220516154044.29361acc@kernel.org>

Am 17.05.22 um 01:40 schrieb Jakub Kicinski:
> On Mon, 16 May 2022 22:48:20 +0300 Josua Mayer wrote:
>> So I can imagine to change the bindings as follows:
>> 1. remove the -recovered variants
>> 2. add an explicit note in the commit message that the recovered clock
>> is not implemented because we do not have infrastructure for SyncE
>> 3. keep the -free-running suffix, we should imo only hide it on the day
>> SyncE can be toggled by another means.
> 
> SGTM, thanks!

Thank you for your comments, I am sending v5 shortly!


* Re: [PATCH net v2] netfilter: nf_flow_table: fix teardown flow timeout
From: Sven Auhagen @ 2022-05-17  8:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Oz Shlomo, Felix Fietkau, netdev, netfilter-devel,
	Florian Westphal, Paul Blakey
In-Reply-To: <YoNdg/5IBucYJ+hi@salvia>

On Tue, May 17, 2022 at 10:32:03AM +0200, Pablo Neira Ayuso wrote:
> On Mon, May 16, 2022 at 08:23:10PM +0200, Sven Auhagen wrote:
> > On Mon, May 16, 2022 at 07:50:09PM +0200, Pablo Neira Ayuso wrote:
> > > On Mon, May 16, 2022 at 03:02:13PM +0200, Sven Auhagen wrote:
> > > > On Mon, May 16, 2022 at 02:43:06PM +0200, Pablo Neira Ayuso wrote:
> > > > > On Mon, May 16, 2022 at 02:23:00PM +0200, Sven Auhagen wrote:
> > > > > > On Mon, May 16, 2022 at 02:13:03PM +0200, Pablo Neira Ayuso wrote:
> > > > > > > On Mon, May 16, 2022 at 12:56:41PM +0200, Pablo Neira Ayuso wrote:
> > > > > > > > On Thu, May 12, 2022 at 09:28:03PM +0300, Oz Shlomo wrote:
> > > [...]
> > > > > > > > [...]
> > > > > > > > > diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> > > > > > > > > index 0164e5f522e8..324fdb62c08b 100644
> > > > > > > > > --- a/net/netfilter/nf_conntrack_core.c
> > > > > > > > > +++ b/net/netfilter/nf_conntrack_core.c
> > > > > > > > > @@ -1477,7 +1477,8 @@ static void gc_worker(struct work_struct *work)
> > > > > > > > >  			tmp = nf_ct_tuplehash_to_ctrack(h);
> > > > > > > > >
> > > > > > > > >  			if (test_bit(IPS_OFFLOAD_BIT, &tmp->status)) {
> > > > > > > > > -				nf_ct_offload_timeout(tmp);
> > > > > > > >
> > > > > > > > Hm, it is the trick to avoid checking for IPS_OFFLOAD from the packet
> > > > > > > > path that triggers the race, ie. nf_ct_is_expired()
> > > > > > > >
> > > > > > > > The flowtable ct fixup races with conntrack gc collector.
> > > > > > > >
> > > > > > > > Clearing IPS_OFFLOAD might result in offloading the entry again for
> > > > > > > > the closing packets.
> > > > > > > >
> > > > > > > > Probably clear IPS_OFFLOAD from teardown, and skip offload if flow is
> > > > > > > > in a TCP state that represent closure?
> > > > > > > >
> > > > > > > >   		if (unlikely(!tcph || tcph->fin || tcph->rst))
> > > > > > > >   			goto out;
> > > > > > > >
> > > > > > > > this is already the intention in the existing code.
> > > > > > >
> > > > > > > I'm attaching an incomplete sketch patch. My goal is to avoid the
> > > > > > > extra IPS_ bit.
> > > > > >
> > > > > > You might create a race with ct gc that will remove the ct
> > > > > > if it is in close or end of close and before flow offload teardown is running
> > > > > > so flow offload teardown might access memory that was freed.
> > > > >
> > > > > flow object holds a reference to the ct object until it is released,
> > > > > no use-after-free can happen.
> > > > >
> > > >
> > > > Also if nf_ct_delete is called before flowtable delete?
> > > > Can you let me know why?
> > >
> > > nf_ct_delete() removes the conntrack object from lists and it
> > > decrements the reference counter by one.
> > >
> > > flow_offload_free() also calls nf_ct_put(). flow_offload_alloc() bumps
> > > the reference count on the conntrack object before creating the flow.
> > >
> > > > > > It is not a very likely scenario but nevertheless it might happen now
> > > > > > since the IPS_OFFLOAD_BIT is not set and the state might just time out.
> > > > > >
> > > > > > If someone sets a very small TCP CLOSE timeout it gets more likely.
> > > > > >
> > > > > > So Oz and myself were debating about three possible cases/problems:
> > > > > >
> > > > > > 1. ct gc sets timeout even though the state is in CLOSE/FIN because the
> > > > > > IPS_OFFLOAD is still set but the flow is in teardown
> > > > > > 2. ct gc removes the ct because the IPS_OFFLOAD is not set and
> > > > > > the CLOSE timeout is reached before the flow offload del
> > > > >
> > > > > OK.
> > > > >
> > > > > > 3. tcp ct is always set to ESTABLISHED with a very long timeout
> > > > > > in flow offload teardown/delete even though the state is already
> > > > > > CLOSED.
> > > > > >
> > > > > > Also as a remark we can not assume that the FIN or RST packet is hitting
> > > > > > flow table teardown as the packet might get bumped to the slow path in
> > > > > > nftables.
> > > > >
> > > > > I assume this remark is related to 3.?
> > > >
> > > > Yes, exactly.
> > > >
> > > > > if IPS_OFFLOAD is unset, then conntrack would update the state
> > > > > according to this FIN or RST.
> > > >
> > > > It will move to a different TCP state anyways only the ct state
> > > > will be at IPS_OFFLOAD_BIT and prevent it from being garbage collected.
> > > > The timeout will be bumped back up as long as IPS_OFFLOAD_BIT is set
> > > > even though TCP might already be CLOSED.
> >
> > I see what you are trying to do here, I have some remarks:
> >
> > >
> > > If teardown fixes the ct state and timeout to established, and IPS_OFFLOAD is
> > > unset, then the packet is passed up in a consistent state.
> > >
> > > I made a patch, it is based on yours, it's attached:
> > >
> > > - If flow timeout expires or rst/fin is seen, ct state and timeout is
> > >   fixed up (to established state) and IPS_OFFLOAD is unset.
> > >
> > > - If rst/fin packet is seen, ct state and timeout is fixed up (to
> > >   established state) and IPS_OFFLOAD is unset. The packet continues
> > >   its travel up to the classic path, so conntrack triggers the
> > >   transition from established to one of the close states.
> > >
> > > For the case 1., IPS_OFFLOAD is not set anymore, so conntrack gc
> > > cannot race to reset the ct timeout anymore.
> > >
> > > For the case 2., if gc conntrack ever removes the ct entry, then the
> > > IPS_DYING bit is set, which implicitly triggers the teardown state
> > > from the flowtable gc. The flowtable still holds a reference to the
> > > ct object, so no UAF can happen.
> > >
> > > For the case 3. the conntrack is set to ESTABLISHED with a long
> > > timeout, yes. This is to deal with the two possible cases:
> > >
> > > a) flowtable timeout expired, so conntrack recovers control on the
> > >    flow.
> > > b) tcp rst/fin will take back the packet to slow path. The ct has been
> > >    fixed up to established state so it will transition to one of the
> > >    close states.
> > >
> > > Am I missing anything?
> >
> > You should not fixup the tcp state back to established.
> > If flow_offload_teardown is not called because a packet got bumped up to the slow path
> > and you call flow_offload_teardown from nf_flow_offload_gc_step, the tcp state might already
> > be in CLOSE state and you just moved it back to established.
> 
> OK.
> 
> > The entire function flow_offload_fixup_tcp can go away if we only allow established tcp states
> > in the flowtable.
> 
> I'm keeping it, but I remove the reset of the tcp state.
> 
> > Same goes for the timeout. The timeout should really be set to the current tcp state
> > ct->proto.tcp->state which might not be established anymore.
> 
> OK.
> 
> > For me the question remains, why can the ct gc not remove the ct when nf_ct_delete
> > is called before flow_offload_del is called?
> 
> nf_ct_delete() removes indeed the entry from the conntrack table, then
> it calls nf_ct_put() which decrements the refcnt. Given that the
> flowtable holds a reference to the conntrack object...
> 
>  struct flow_offload *flow_offload_alloc(struct nf_conn *ct)
>  {
>         struct flow_offload *flow;
> 
>         if (unlikely(nf_ct_is_dying(ct) ||
>             !refcount_inc_not_zero(&ct->ct_general.use)))
>                 return NULL;
> 
> ... use-after-free cannot happen. Note that flow_offload_free() calls
> nf_ct_put(flow->ct), so at this point the ct object is released.
> 
> Is this your concern?

Ah yes, thank you.
I did not catch the refcount_inc_not_zero call.
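
The reference-counting argument can be sketched in userspace with C11 atomics.
This is only an illustration under stated assumptions: struct sketch_ct,
sketch_inc_not_zero() and sketch_put() are hypothetical stand-ins, not the
kernel's refcount_t or nf_conn API, and "freed" is a flag rather than a real
free().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a conntrack entry; only the refcount matters. */
struct sketch_ct {
	atomic_uint use;
	bool freed;	/* set instead of calling free(), so it can be inspected */
};

/* Takes a reference unless the count has already dropped to zero. */
bool sketch_inc_not_zero(struct sketch_ct *ct)
{
	unsigned int old = atomic_load(&ct->use);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&ct->use, &old, old + 1))
			return true;
	}
	return false;
}

/* Drops a reference; the last put marks the object as released. */
void sketch_put(struct sketch_ct *ct)
{
	unsigned int old = atomic_fetch_sub(&ct->use, 1);

	assert(old != 0);	/* a put without a matching reference is a bug */
	if (old == 1)
		ct->freed = true;
}
```

With the table holding one reference and the flow taking a second one at
allocation time, the put done on conntrack deletion leaves the count at one, so
the object is only released by the put done when the flow is freed.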

> 
> > Also you probably want to move the IPS_OFFLOAD_BIT to the beginning of
> > flow_offload_teardown just to make sure that the ct gc is not bumping up the ct timeout
> > while it is changed in flow_offload_fixup_ct.
> 
> Done.
> 
> See patch attached.
> >
> >

The patch looks good to me, with one remark.

This has to be

-		if (unlikely(!tcph || tcph->fin || tcph->rst))
+		if (unlikely(!tcph || tcph->fin || tcph->rst ||
+			     !nf_conntrack_tcp_established(&ct->proto.tcp)))
 			goto out;

You currently goto out when the TCP state is established, but you
want the opposite: goto out when it is not established.

I think this will cover all cases.
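
As a sanity check of the condition, here is a small userspace sketch of the
intended predicate. The struct and function names are hypothetical, and the
"established" flag stands in for the result of
nf_conntrack_tcp_established(&ct->proto.tcp); this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the TCP header flags. */
struct sketch_tcphdr {
	bool fin;
	bool rst;
};

/* Returns true when the flow must not be offloaded (the "goto out" path). */
bool sketch_skip_offload(const struct sketch_tcphdr *tcph, bool established)
{
	return !tcph || tcph->fin || tcph->rst || !established;
}

/* Usage: only an established flow without FIN/RST is offloaded. */
void sketch_skip_offload_examples(void)
{
	const struct sketch_tcphdr plain = { .fin = false, .rst = false };
	const struct sketch_tcphdr rst = { .fin = false, .rst = true };

	assert(!sketch_skip_offload(&plain, true));
	assert(sketch_skip_offload(&plain, false));
	assert(sketch_skip_offload(&rst, true));
}
```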

Best
Sven



