* [RFC PATCH 00/13] Ultra Ethernet driver introduction
@ 2025-03-06 23:01 Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 01/13] drivers: ultraeth: add initial skeleton and kconfig option Nikolay Aleksandrov
` (14 more replies)
0 siblings, 15 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Hi all,
This patch set introduces minimal Ultra Ethernet driver infrastructure and
the lowest Ultra Ethernet sublayer - the Packet Delivery Sublayer (PDS),
which underpins the entire communication model of the Ultra Ethernet
Transport[1] (UET). Ultra Ethernet is a new RDMA transport designed for
efficient AI and HPC communication. The specifications are still being
ironed out and the first public versions should be available soon. As there
isn't any UET hardware available yet, we introduce a software device model
which implements the lowest sublayer of the spec - the PDS. The code is
still in an early, experimental stage; it aims to start a discussion on the
kernel implementation and to show how we plan to organize it.
The PDS is responsible for establishing dynamic connections, called Packet
Delivery Contexts (PDCs), between Fabric Endpoints (FEPs), and for packet
reliability, ordering, duplicate elimination and congestion management.
The PDS packet ordering is defined by a mode (sketched as an enum below)
which can be one of:
- Reliable, Ordered Delivery (ROD)
- Reliable, Unordered Delivery (RUD)
- Reliable, Unordered Delivery for Idempotent Operations (RUDI)
- Unreliable, Unordered Delivery (UUD)
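Purely for illustration (the spec is not yet public, so these identifiers
and their encoding are made up, not taken from it), the four modes could be
modeled as:

	enum uet_pds_mode {
		UET_PDS_MODE_ROD,	/* Reliable, Ordered Delivery */
		UET_PDS_MODE_RUD,	/* Reliable, Unordered Delivery */
		UET_PDS_MODE_RUDI,	/* RUD for Idempotent Operations */
		UET_PDS_MODE_UUD,	/* Unreliable, Unordered Delivery */
	};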
This set implements the RUD mode of communication with Packet Sequence
Number (PSN) tracking, retransmits, idle timeouts, coalescing and selective
ACKs. It adds support for generating and processing Request, ACK, NACK and
Control packet types. Communication is done over UDP, so all Ultra Ethernet
headers are carried on top of UDP packets. Packets are tracked by PSNs
uniquely assigned within a PDC; the PSN window sizes are currently static.
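To illustrate the receive-side idea (a sketch with made-up names and a
made-up window size, not code from this set): a packet is accepted only if
its PSN falls inside the window and was not already seen, with a bitmap
providing the duplicate elimination:

	#define PSN_WINDOW_SIZE 256	/* hypothetical static window */

	struct psn_window {
		u32 base;				/* left edge of the window */
		DECLARE_BITMAP(seen, PSN_WINDOW_SIZE);	/* PSNs received so far */
	};

	/* returns true if @psn is in-window and not a duplicate */
	static bool psn_accept(struct psn_window *w, u32 psn)
	{
		u32 off = psn - w->base;	/* unsigned math handles wraparound */

		if (off >= PSN_WINDOW_SIZE)
			return false;		/* outside the static window */
		return !test_and_set_bit(off, w->seen);
	}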
In this RFC all of the code lives in a single kernel module under
drivers/ultraeth/, guarded by a new kconfig option CONFIG_ULTRAETH. The
plan is to split that into a core Ultra Ethernet module (ultraeth.ko),
responsible for managing the UET contexts, jobs and all other
common/generic UET configuration, and the software UET device model
(uecon.ko), which implements the UET protocols in software (e.g. the PDS
will be part of uecon) and is represented by a UDP tunnel network device;
a layout sketch follows the list below. Note that critical pieces are
still missing and will be in place by the time we send the first version,
such as:
- Ultra Ethernet specs will be publicly available
- missing UET sublayers critical for communication
- more complete user API
- kernel UET device API
- memory management
- IPv6
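To make the planned split concrete, the intended module layout is roughly
(module names as planned above, contents approximate):

	ultraeth.ko - core: UET contexts, jobs and the generic netlink
	              configuration interface
	uecon.ko    - software UET device model: a UDP tunnel net device
	              implementing the UET protocols in software (the PDS
	              first, the remaining sublayers later)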
The last patch is a hack which adds a custom character device used to test
communication and basic PDS functionality; for the first version of this
set we would rather extend and re-use some of the InfiniBand
infrastructure.
This set will also be used to better illustrate the UET code and concepts
for the "Networking For AI BoF"[2] at the upcoming Netdev 0x19 conference
in Zagreb, Croatia.
Thank you,
Nik
[1] https://ultraethernet.org/
[2] https://netdevconf.info/0x19/sessions/bof/networking-for-ai-bof.html
Alex Badea (1):
HACK: drivers: ultraeth: add char device
Nikolay Aleksandrov (12):
drivers: ultraeth: add initial skeleton and kconfig option
drivers: ultraeth: add context support
drivers: ultraeth: add new genl family
drivers: ultraeth: add job support
drivers: ultraeth: add tunnel udp device support
drivers: ultraeth: add initial PDS infrastructure
drivers: ultraeth: add request and ack receive support
drivers: ultraeth: add request transmit support
drivers: ultraeth: add support for coalescing ack
drivers: ultraeth: add sack support
drivers: ultraeth: add nack support
drivers: ultraeth: add initiator and target idle timeout support
Documentation/netlink/specs/rt_link.yaml | 14 +
Documentation/netlink/specs/ultraeth.yaml | 218 ++++
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/ultraeth/Kconfig | 11 +
drivers/ultraeth/Makefile | 4 +
drivers/ultraeth/uecon.c | 324 ++++++
drivers/ultraeth/uet_chardev.c | 264 +++++
drivers/ultraeth/uet_context.c | 274 +++++
drivers/ultraeth/uet_job.c | 456 +++++++++
drivers/ultraeth/uet_main.c | 41 +
drivers/ultraeth/uet_netlink.c | 113 +++
drivers/ultraeth/uet_netlink.h | 29 +
drivers/ultraeth/uet_pdc.c | 1122 +++++++++++++++++++++
drivers/ultraeth/uet_pds.c | 481 +++++++++
include/net/ultraeth/uecon.h | 28 +
include/net/ultraeth/uet_chardev.h | 11 +
include/net/ultraeth/uet_context.h | 47 +
include/net/ultraeth/uet_job.h | 80 ++
include/net/ultraeth/uet_pdc.h | 170 ++++
include/net/ultraeth/uet_pds.h | 110 ++
include/uapi/linux/if_link.h | 8 +
include/uapi/linux/ultraeth.h | 536 ++++++++++
include/uapi/linux/ultraeth_nl.h | 116 +++
24 files changed, 4460 insertions(+)
create mode 100644 Documentation/netlink/specs/ultraeth.yaml
create mode 100644 drivers/ultraeth/Kconfig
create mode 100644 drivers/ultraeth/Makefile
create mode 100644 drivers/ultraeth/uecon.c
create mode 100644 drivers/ultraeth/uet_chardev.c
create mode 100644 drivers/ultraeth/uet_context.c
create mode 100644 drivers/ultraeth/uet_job.c
create mode 100644 drivers/ultraeth/uet_main.c
create mode 100644 drivers/ultraeth/uet_netlink.c
create mode 100644 drivers/ultraeth/uet_netlink.h
create mode 100644 drivers/ultraeth/uet_pdc.c
create mode 100644 drivers/ultraeth/uet_pds.c
create mode 100644 include/net/ultraeth/uecon.h
create mode 100644 include/net/ultraeth/uet_chardev.h
create mode 100644 include/net/ultraeth/uet_context.h
create mode 100644 include/net/ultraeth/uet_job.h
create mode 100644 include/net/ultraeth/uet_pdc.h
create mode 100644 include/net/ultraeth/uet_pds.h
create mode 100644 include/uapi/linux/ultraeth.h
create mode 100644 include/uapi/linux/ultraeth_nl.h
--
2.48.1
* [RFC PATCH 01/13] drivers: ultraeth: add initial skeleton and kconfig option
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 02/13] drivers: ultraeth: add context support Nikolay Aleksandrov
` (13 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Create drivers/ultraeth/ for the upcoming new Ultra Ethernet driver and add
a new Kconfig option for it.
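For reference, one way to build and load the result (the standard kernel
build flow, nothing driver-specific assumed):

	$ make menuconfig		# Device Drivers -> "Ultra Ethernet core" (=m)
	$ make M=drivers/ultraeth	# build only this directory
	# insmod drivers/ultraeth/ultraeth.ko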
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/Kconfig | 2 ++
drivers/Makefile | 1 +
drivers/ultraeth/Kconfig | 11 +++++++++++
drivers/ultraeth/Makefile | 3 +++
drivers/ultraeth/uet_main.c | 19 +++++++++++++++++++
5 files changed, 36 insertions(+)
create mode 100644 drivers/ultraeth/Kconfig
create mode 100644 drivers/ultraeth/Makefile
create mode 100644 drivers/ultraeth/uet_main.c
diff --git a/drivers/Kconfig b/drivers/Kconfig
index 7bdad836fc62..df3369781d37 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -245,4 +245,6 @@ source "drivers/cdx/Kconfig"
source "drivers/dpll/Kconfig"
+source "drivers/ultraeth/Kconfig"
+
endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 45d1c3e630f7..47848677605a 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -195,3 +195,4 @@ obj-$(CONFIG_CDX_BUS) += cdx/
obj-$(CONFIG_DPLL) += dpll/
obj-$(CONFIG_S390) += s390/
+obj-$(CONFIG_ULTRAETH) += ultraeth/
diff --git a/drivers/ultraeth/Kconfig b/drivers/ultraeth/Kconfig
new file mode 100644
index 000000000000..a769c6118f2f
--- /dev/null
+++ b/drivers/ultraeth/Kconfig
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+config ULTRAETH
+ tristate "Ultra Ethernet core"
+ depends on INET
+ depends on IPV6 || !IPV6
+ select NET_UDP_TUNNEL
+ select GRO_CELLS
+ help
+ To compile this driver as a module, choose M here: the module
+ will be called ultraeth.
diff --git a/drivers/ultraeth/Makefile b/drivers/ultraeth/Makefile
new file mode 100644
index 000000000000..e30373d4b5dc
--- /dev/null
+++ b/drivers/ultraeth/Makefile
@@ -0,0 +1,3 @@
+obj-$(CONFIG_ULTRAETH) += ultraeth.o
+
+ultraeth-objs := uet_main.o
diff --git a/drivers/ultraeth/uet_main.c b/drivers/ultraeth/uet_main.c
new file mode 100644
index 000000000000..0d74175fc047
--- /dev/null
+++ b/drivers/ultraeth/uet_main.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+
+static int __init uet_init(void)
+{
+ return 0;
+}
+
+static void __exit uet_exit(void)
+{
+}
+
+module_init(uet_init);
+module_exit(uet_exit);
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Ultra Ethernet core");
--
2.48.1
* [RFC PATCH 02/13] drivers: ultraeth: add context support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 01/13] drivers: ultraeth: add initial skeleton and kconfig option Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 03/13] drivers: ultraeth: add new genl family Nikolay Aleksandrov
` (12 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
The Ultra Ethernet context is at the root of the object hierarchy and must
be created first. UET contexts are identified by host-unique ids assigned
on creation and are protected by a reference counter.
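A minimal in-kernel usage sketch of the API added here (error handling
trimmed; id 0 is arbitrary):

	int err = uet_context_create(0);	/* or -1 to auto-assign an id */

	if (!err) {
		struct uet_context *ctx;

		ctx = uet_context_get_by_id(0);	/* takes a reference */
		if (ctx) {
			/* ... use the context ... */
			uet_context_put(ctx);	/* last put wakes a waiting destroy */
		}
		uet_context_destroy(0);		/* waits for remaining references */
	}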
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/Makefile | 2 +-
drivers/ultraeth/uet_context.c | 149 +++++++++++++++++++++++++++++
drivers/ultraeth/uet_main.c | 2 +
include/net/ultraeth/uet_context.h | 27 ++++++
4 files changed, 179 insertions(+), 1 deletion(-)
create mode 100644 drivers/ultraeth/uet_context.c
create mode 100644 include/net/ultraeth/uet_context.h
diff --git a/drivers/ultraeth/Makefile b/drivers/ultraeth/Makefile
index e30373d4b5dc..dc0c07eeef65 100644
--- a/drivers/ultraeth/Makefile
+++ b/drivers/ultraeth/Makefile
@@ -1,3 +1,3 @@
obj-$(CONFIG_ULTRAETH) += ultraeth.o
-ultraeth-objs := uet_main.o
+ultraeth-objs := uet_main.o uet_context.o
diff --git a/drivers/ultraeth/uet_context.c b/drivers/ultraeth/uet_context.c
new file mode 100644
index 000000000000..1c74cd8bbd56
--- /dev/null
+++ b/drivers/ultraeth/uet_context.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+#include <net/ultraeth/uet_context.h>
+
+#define MAX_CONTEXT_ID 256
+static DECLARE_BITMAP(uet_context_ids, MAX_CONTEXT_ID);
+static LIST_HEAD(uet_context_list);
+static DEFINE_MUTEX(uet_context_lock);
+
+static int uet_context_get_new_id(int id)
+{
+ if (WARN_ON(id < -1 || id >= MAX_CONTEXT_ID))
+ return -EINVAL;
+
+ mutex_lock(&uet_context_lock);
+ if (id == -1)
+ id = find_first_zero_bit(uet_context_ids, MAX_CONTEXT_ID);
+ if (id < MAX_CONTEXT_ID) {
+ if (test_and_set_bit(id, uet_context_ids))
+ id = -EBUSY;
+ } else {
+ id = -ENOSPC;
+ }
+ mutex_unlock(&uet_context_lock);
+
+ return id;
+}
+
+static void uet_context_put_id(struct uet_context *ctx)
+{
+ clear_bit(ctx->id, uet_context_ids);
+}
+
+static void uet_context_link(struct uet_context *ctx)
+{
+ WARN_ON(!list_empty(&ctx->list));
+ list_add(&ctx->list, &uet_context_list);
+}
+
+static void uet_context_unlink(struct uet_context *ctx)
+{
+ list_del_init(&ctx->list);
+ if (refcount_dec_and_test(&ctx->refcnt))
+ return;
+
+ mutex_unlock(&uet_context_lock);
+ wait_event(ctx->refcnt_wait, refcount_read(&ctx->refcnt) == 0);
+ mutex_lock(&uet_context_lock);
+ WARN_ON(refcount_read(&ctx->refcnt) > 0);
+}
+
+static struct uet_context *uet_context_find(int id)
+{
+ struct uet_context *ctx;
+
+ if (!test_bit(id, uet_context_ids))
+ return NULL;
+
+ list_for_each_entry(ctx, &uet_context_list, list)
+ if (ctx->id == id)
+ return ctx;
+
+ return NULL;
+}
+
+struct uet_context *uet_context_get_by_id(int id)
+{
+ struct uet_context *ctx;
+
+ mutex_lock(&uet_context_lock);
+ ctx = uet_context_find(id);
+ if (ctx)
+ refcount_inc(&ctx->refcnt);
+ mutex_unlock(&uet_context_lock);
+
+ return ctx;
+}
+
+void uet_context_put(struct uet_context *ctx)
+{
+ if (refcount_dec_and_test(&ctx->refcnt))
+ wake_up(&ctx->refcnt_wait);
+}
+
+int uet_context_create(int id)
+{
+ struct uet_context *ctx;
+ int err = -ENOMEM;
+
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return err;
+
+ INIT_LIST_HEAD(&ctx->list);
+ init_waitqueue_head(&ctx->refcnt_wait);
+ refcount_set(&ctx->refcnt, 1);
+
+ ctx->id = uet_context_get_new_id(id);
+ if (ctx->id < 0) {
+ err = ctx->id;
+ goto ctx_id_err;
+ }
+
+ uet_context_link(ctx);
+
+ return 0;
+
+ctx_id_err:
+ kfree(ctx);
+
+ return err;
+}
+
+static void __uet_context_destroy(struct uet_context *ctx)
+{
+ uet_context_unlink(ctx);
+ uet_context_put_id(ctx);
+ kfree(ctx);
+}
+
+bool uet_context_destroy(int id)
+{
+ struct uet_context *ctx;
+ bool found = false;
+
+ mutex_lock(&uet_context_lock);
+ ctx = uet_context_find(id);
+ if (ctx) {
+ __uet_context_destroy(ctx);
+ found = true;
+ }
+ mutex_unlock(&uet_context_lock);
+
+ return found;
+}
+
+void uet_context_destroy_all(void)
+{
+ struct uet_context *ctx;
+
+ mutex_lock(&uet_context_lock);
+ while ((ctx = list_first_entry_or_null(&uet_context_list,
+ struct uet_context,
+ list)))
+ __uet_context_destroy(ctx);
+
+ WARN_ON(!list_empty(&uet_context_list));
+ mutex_unlock(&uet_context_lock);
+}
diff --git a/drivers/ultraeth/uet_main.c b/drivers/ultraeth/uet_main.c
index 0d74175fc047..0f8383c6aba0 100644
--- a/drivers/ultraeth/uet_main.c
+++ b/drivers/ultraeth/uet_main.c
@@ -3,6 +3,7 @@
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/types.h>
+#include <net/ultraeth/uet_context.h>
static int __init uet_init(void)
{
@@ -11,6 +12,7 @@ static int __init uet_init(void)
static void __exit uet_exit(void)
{
+ uet_context_destroy_all();
}
module_init(uet_init);
diff --git a/include/net/ultraeth/uet_context.h b/include/net/ultraeth/uet_context.h
new file mode 100644
index 000000000000..150ad2c9b456
--- /dev/null
+++ b/include/net/ultraeth/uet_context.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+
+#ifndef _UET_CONTEXT_H
+#define _UET_CONTEXT_H
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/refcount.h>
+#include <linux/wait.h>
+
+struct uet_context {
+ int id;
+ refcount_t refcnt;
+ wait_queue_head_t refcnt_wait;
+ struct list_head list;
+};
+
+struct uet_context *uet_context_get_by_id(int id);
+void uet_context_put(struct uet_context *ctx);
+
+int uet_context_create(int id);
+bool uet_context_destroy(int id);
+void uet_context_destroy_all(void);
+
+#endif /* _UET_CONTEXT_H */
--
2.48.1
* [RFC PATCH 03/13] drivers: ultraeth: add new genl family
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 01/13] drivers: ultraeth: add initial skeleton and kconfig option Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 02/13] drivers: ultraeth: add context support Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 04/13] drivers: ultraeth: add job support Nikolay Aleksandrov
` (11 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
The UE genl family is described by a ynl spec in
Documentation/netlink/specs/ultraeth.yaml. It supports listing, creating
and deleting contexts.
The corresponding files are auto-generated by ynl:
drivers/ultraeth/uet_netlink.c
drivers/ultraeth/uet_netlink.h
include/uapi/linux/ultraeth_nl.h
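For reference, the new family can be exercised with the in-tree ynl CLI,
e.g. (example id; the admin-perm commands need root):

	$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/ultraeth.yaml \
		--do context-new --json '{"id": 1}'
	$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/ultraeth.yaml \
		--dump context-get
	$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/ultraeth.yaml \
		--do context-del --json '{"id": 1}'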
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
Documentation/netlink/specs/ultraeth.yaml | 56 ++++++++++++++++++
drivers/ultraeth/Makefile | 2 +-
drivers/ultraeth/uet_context.c | 72 +++++++++++++++++++++++
drivers/ultraeth/uet_main.c | 5 +-
drivers/ultraeth/uet_netlink.c | 54 +++++++++++++++++
drivers/ultraeth/uet_netlink.h | 21 +++++++
include/uapi/linux/ultraeth_nl.h | 35 +++++++++++
7 files changed, 243 insertions(+), 2 deletions(-)
create mode 100644 Documentation/netlink/specs/ultraeth.yaml
create mode 100644 drivers/ultraeth/uet_netlink.c
create mode 100644 drivers/ultraeth/uet_netlink.h
create mode 100644 include/uapi/linux/ultraeth_nl.h
diff --git a/Documentation/netlink/specs/ultraeth.yaml b/Documentation/netlink/specs/ultraeth.yaml
new file mode 100644
index 000000000000..55ab4d9b82a9
--- /dev/null
+++ b/Documentation/netlink/specs/ultraeth.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+name: ultraeth
+protocol: genetlink
+uapi-header: linux/ultraeth_nl.h
+
+doc: Ultra Ethernet driver genetlink operations
+
+attribute-sets:
+ -
+ name: context
+ attributes:
+ -
+ name: id
+ type: s32
+ checks:
+ min: 0
+ max: 255
+ -
+ name: contexts
+ attributes:
+ -
+ name: context
+ type: nest
+ nested-attributes: context
+ multi-attr: true
+
+operations:
+ name-prefix: ultraeth-cmd-
+ list:
+ -
+ name: context-get
+ doc: dump ultraeth context information
+ attribute-set: context
+ dump:
+ reply:
+ attributes: &all-context-attrs
+ - id
+ -
+ name: context-new
+ doc: add new ultraeth context
+ attribute-set: context
+ flags: [ admin-perm ]
+ do:
+ request:
+ attributes:
+ - id
+ -
+ name: context-del
+ doc: delete ultraeth context
+ attribute-set: context
+ flags: [ admin-perm ]
+ do:
+ request:
+ attributes:
+ - id
diff --git a/drivers/ultraeth/Makefile b/drivers/ultraeth/Makefile
index dc0c07eeef65..599d91d205c1 100644
--- a/drivers/ultraeth/Makefile
+++ b/drivers/ultraeth/Makefile
@@ -1,3 +1,3 @@
obj-$(CONFIG_ULTRAETH) += ultraeth.o
-ultraeth-objs := uet_main.o uet_context.o
+ultraeth-objs := uet_main.o uet_context.o uet_netlink.o
diff --git a/drivers/ultraeth/uet_context.c b/drivers/ultraeth/uet_context.c
index 1c74cd8bbd56..2444fa3f35cd 100644
--- a/drivers/ultraeth/uet_context.c
+++ b/drivers/ultraeth/uet_context.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
#include <net/ultraeth/uet_context.h>
+#include "uet_netlink.h"
#define MAX_CONTEXT_ID 256
static DECLARE_BITMAP(uet_context_ids, MAX_CONTEXT_ID);
@@ -147,3 +148,74 @@ void uet_context_destroy_all(void)
WARN_ON(!list_empty(&uet_context_list));
mutex_unlock(&uet_context_lock);
}
+
+static int __nl_ctx_fill_one(struct sk_buff *skb,
+ const struct uet_context *ctx,
+ int cmd, u32 flags, u32 seq, u32 portid)
+{
+ void *hdr;
+
+ hdr = genlmsg_put(skb, portid, seq, &ultraeth_nl_family, flags, cmd);
+ if (!hdr)
+ return -EMSGSIZE;
+
+ if (nla_put_s32(skb, ULTRAETH_A_CONTEXT_ID, ctx->id))
+ goto out_err;
+
+ genlmsg_end(skb, hdr);
+ return 0;
+
+out_err:
+ genlmsg_cancel(skb, hdr);
+ return -EMSGSIZE;
+}
+
+int ultraeth_nl_context_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ int idx = 0, s_idx = cb->args[0], err = 0;
+ struct uet_context *ctx;
+
+ mutex_lock(&uet_context_lock);
+ list_for_each_entry(ctx, &uet_context_list, list) {
+ if (idx < s_idx) {
+ idx++;
+ continue;
+ }
+ err = __nl_ctx_fill_one(skb, ctx, ULTRAETH_CMD_CONTEXT_GET,
+ NLM_F_MULTI, cb->nlh->nlmsg_seq,
+ NETLINK_CB(cb->skb).portid);
+ if (err)
+ break;
+ idx++;
+ }
+ cb->args[0] = idx;
+ mutex_unlock(&uet_context_lock);
+
+ return err ? err : skb->len;
+}
+
+int ultraeth_nl_context_new_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ int id = -1;
+
+ if (info->attrs[ULTRAETH_A_CONTEXT_ID])
+ id = nla_get_s32(info->attrs[ULTRAETH_A_CONTEXT_ID]);
+
+ return uet_context_create(id);
+}
+
+int ultraeth_nl_context_del_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ bool destroyed = false;
+ int id;
+
+ if (!info->attrs[ULTRAETH_A_CONTEXT_ID]) {
+ NL_SET_ERR_MSG(info->extack, "UET context id must be specified");
+ return -EINVAL;
+ }
+
+ id = nla_get_s32(info->attrs[ULTRAETH_A_CONTEXT_ID]);
+ destroyed = uet_context_destroy(id);
+
+ return destroyed ? 0 : -ENOENT;
+}
diff --git a/drivers/ultraeth/uet_main.c b/drivers/ultraeth/uet_main.c
index 0f8383c6aba0..0ec1dc74abbb 100644
--- a/drivers/ultraeth/uet_main.c
+++ b/drivers/ultraeth/uet_main.c
@@ -5,13 +5,16 @@
#include <linux/types.h>
#include <net/ultraeth/uet_context.h>
+#include "uet_netlink.h"
+
static int __init uet_init(void)
{
- return 0;
+ return genl_register_family(&ultraeth_nl_family);
}
static void __exit uet_exit(void)
{
+ genl_unregister_family(&ultraeth_nl_family);
uet_context_destroy_all();
}
diff --git a/drivers/ultraeth/uet_netlink.c b/drivers/ultraeth/uet_netlink.c
new file mode 100644
index 000000000000..39e4aa6092a9
--- /dev/null
+++ b/drivers/ultraeth/uet_netlink.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+/* Do not edit directly, auto-generated from: */
+/* Documentation/netlink/specs/ultraeth.yaml */
+/* YNL-GEN kernel source */
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include "uet_netlink.h"
+
+#include <uapi/linux/ultraeth_nl.h>
+
+/* ULTRAETH_CMD_CONTEXT_NEW - do */
+static const struct nla_policy ultraeth_context_new_nl_policy[ULTRAETH_A_CONTEXT_ID + 1] = {
+ [ULTRAETH_A_CONTEXT_ID] = NLA_POLICY_RANGE(NLA_S32, 0, 255),
+};
+
+/* ULTRAETH_CMD_CONTEXT_DEL - do */
+static const struct nla_policy ultraeth_context_del_nl_policy[ULTRAETH_A_CONTEXT_ID + 1] = {
+ [ULTRAETH_A_CONTEXT_ID] = NLA_POLICY_RANGE(NLA_S32, 0, 255),
+};
+
+/* Ops table for ultraeth */
+static const struct genl_split_ops ultraeth_nl_ops[] = {
+ {
+ .cmd = ULTRAETH_CMD_CONTEXT_GET,
+ .dumpit = ultraeth_nl_context_get_dumpit,
+ .flags = GENL_CMD_CAP_DUMP,
+ },
+ {
+ .cmd = ULTRAETH_CMD_CONTEXT_NEW,
+ .doit = ultraeth_nl_context_new_doit,
+ .policy = ultraeth_context_new_nl_policy,
+ .maxattr = ULTRAETH_A_CONTEXT_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
+ {
+ .cmd = ULTRAETH_CMD_CONTEXT_DEL,
+ .doit = ultraeth_nl_context_del_doit,
+ .policy = ultraeth_context_del_nl_policy,
+ .maxattr = ULTRAETH_A_CONTEXT_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
+};
+
+struct genl_family ultraeth_nl_family __ro_after_init = {
+ .name = ULTRAETH_FAMILY_NAME,
+ .version = ULTRAETH_FAMILY_VERSION,
+ .netnsok = true,
+ .parallel_ops = true,
+ .module = THIS_MODULE,
+ .split_ops = ultraeth_nl_ops,
+ .n_split_ops = ARRAY_SIZE(ultraeth_nl_ops),
+};
diff --git a/drivers/ultraeth/uet_netlink.h b/drivers/ultraeth/uet_netlink.h
new file mode 100644
index 000000000000..9dd9df24513a
--- /dev/null
+++ b/drivers/ultraeth/uet_netlink.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/* Documentation/netlink/specs/ultraeth.yaml */
+/* YNL-GEN kernel header */
+
+#ifndef _LINUX_ULTRAETH_GEN_H
+#define _LINUX_ULTRAETH_GEN_H
+
+#include <net/netlink.h>
+#include <net/genetlink.h>
+
+#include <uapi/linux/ultraeth_nl.h>
+
+int ultraeth_nl_context_get_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb);
+int ultraeth_nl_context_new_doit(struct sk_buff *skb, struct genl_info *info);
+int ultraeth_nl_context_del_doit(struct sk_buff *skb, struct genl_info *info);
+
+extern struct genl_family ultraeth_nl_family;
+
+#endif /* _LINUX_ULTRAETH_GEN_H */
diff --git a/include/uapi/linux/ultraeth_nl.h b/include/uapi/linux/ultraeth_nl.h
new file mode 100644
index 000000000000..f3bdf8111623
--- /dev/null
+++ b/include/uapi/linux/ultraeth_nl.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/* Documentation/netlink/specs/ultraeth.yaml */
+/* YNL-GEN uapi header */
+
+#ifndef _UAPI_LINUX_ULTRAETH_NL_H
+#define _UAPI_LINUX_ULTRAETH_NL_H
+
+#define ULTRAETH_FAMILY_NAME "ultraeth"
+#define ULTRAETH_FAMILY_VERSION 1
+
+enum {
+ ULTRAETH_A_CONTEXT_ID = 1,
+
+ __ULTRAETH_A_CONTEXT_MAX,
+ ULTRAETH_A_CONTEXT_MAX = (__ULTRAETH_A_CONTEXT_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_CONTEXTS_CONTEXT = 1,
+
+ __ULTRAETH_A_CONTEXTS_MAX,
+ ULTRAETH_A_CONTEXTS_MAX = (__ULTRAETH_A_CONTEXTS_MAX - 1)
+};
+
+enum {
+ ULTRAETH_CMD_CONTEXT_GET = 1,
+ ULTRAETH_CMD_CONTEXT_NEW,
+ ULTRAETH_CMD_CONTEXT_DEL,
+
+ __ULTRAETH_CMD_MAX,
+ ULTRAETH_CMD_MAX = (__ULTRAETH_CMD_MAX - 1)
+};
+
+#endif /* _UAPI_LINUX_ULTRAETH_NL_H */
--
2.48.1
* [RFC PATCH 04/13] drivers: ultraeth: add job support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (2 preceding siblings ...)
2025-03-06 23:01 ` [RFC PATCH 03/13] drivers: ultraeth: add new genl family Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 05/13] drivers: ultraeth: add tunnel udp device support Nikolay Aleksandrov
` (10 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
A UE job identifies the distributed parallel application that a
communicating process belongs to. Jobs are assigned to the initiating
process and are part of addressing; they are present in all packets. Jobs
are expected to be assigned by a provisioning system, and job ids must be
unique within a UE context. Every UE context contains a job registry with
all current jobs, regardless of whether they're associated with a fabric
endpoint (FEP) or not. The Ultra Ethernet netlink spec is updated with job
support to create, delete and list jobs.
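For example, with the in-tree ynl CLI (made-up values; context 1 must
already exist):

	$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/ultraeth.yaml \
		--do job-new --json '{"context-id": 1, "id": 100, "service-name": "job0"}'
	$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/ultraeth.yaml \
		--dump job-get --json '{"context-id": 1}'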
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
Documentation/netlink/specs/ultraeth.yaml | 147 +++++++
drivers/ultraeth/Makefile | 2 +-
drivers/ultraeth/uet_context.c | 7 +
drivers/ultraeth/uet_job.c | 455 ++++++++++++++++++++++
drivers/ultraeth/uet_netlink.c | 59 +++
drivers/ultraeth/uet_netlink.h | 8 +
include/net/ultraeth/uet_context.h | 3 +
include/net/ultraeth/uet_job.h | 78 ++++
include/uapi/linux/ultraeth.h | 44 +++
include/uapi/linux/ultraeth_nl.h | 76 ++++
10 files changed, 878 insertions(+), 1 deletion(-)
create mode 100644 drivers/ultraeth/uet_job.c
create mode 100644 include/net/ultraeth/uet_job.h
create mode 100644 include/uapi/linux/ultraeth.h
diff --git a/Documentation/netlink/specs/ultraeth.yaml b/Documentation/netlink/specs/ultraeth.yaml
index 55ab4d9b82a9..e95c73a36892 100644
--- a/Documentation/netlink/specs/ultraeth.yaml
+++ b/Documentation/netlink/specs/ultraeth.yaml
@@ -24,6 +24,119 @@ attribute-sets:
type: nest
nested-attributes: context
multi-attr: true
+ -
+ name: fep-in-addr
+ attributes:
+ -
+ name: ip
+ type: binary
+ display-hint: ipv4
+ -
+ name: ip6
+ type: binary
+ byte-order: big-endian
+ display-hint: ipv6
+ -
+ name: family
+ type: u16
+ -
+ name: fep-address
+ attributes:
+ -
+ name: in-address
+ type: nest
+ nested-attributes: fep-in-addr
+ -
+ name: flags
+ type: u16
+ -
+ name: caps
+ type: u16
+ -
+ name: start-resource-index
+ type: u16
+ -
+ name: num-resource-indices
+ type: u16
+ -
+ name: initiator-id
+ type: u32
+ -
+ name: pid-on-fep
+ type: u16
+ -
+ name: padding
+ type: u16
+ -
+ name: version
+ type: u8
+ -
+ name: fep-entry
+ attributes:
+ -
+ name: address
+ type: nest
+ nested-attributes: fep-address
+ -
+ name: flist
+ attributes:
+ -
+ name: fep
+ type: nest
+ multi-attr: true
+ nested-attributes: fep-entry
+ -
+ name: job-req
+ attributes:
+ -
+ name: context-id
+ type: s32
+ -
+ name: id
+ type: u32
+ -
+ name: address
+ type: nest
+ nested-attributes: fep-address
+ -
+ name: service-name
+ type: string
+ -
+ name: job
+ attributes:
+ -
+ name: id
+ type: u32
+ -
+ name: address
+ type: nest
+ nested-attributes: fep-address
+ -
+ name: service-name
+ type: string
+ -
+ name: flist
+ type: nest
+ nested-attributes: flist
+ multi-attr: true
+ -
+ name: jlist
+ attributes:
+ -
+ name: job
+ type: nest
+ nested-attributes: job
+ multi-attr: true
+ -
+ name: jobs
+ attributes:
+ -
+ name: context-id
+ type: s32
+ -
+ name: jlist
+ type: nest
+ nested-attributes: jlist
operations:
name-prefix: ultraeth-cmd-
@@ -54,3 +167,37 @@ operations:
request:
attributes:
- id
+ -
+ name: job-get
+ doc: dump uecon context jobs
+ attribute-set: jobs
+ dump:
+ request:
+ attributes:
+ - context-id
+ reply:
+ attributes:
+ - context-id
+ - jlist
+ -
+ name: job-new
+ doc: add a new job to uecon context
+ attribute-set: job-req
+ flags: [ admin-perm ]
+ do:
+ request:
+ attributes:
+ - context-id
+ - id
+ - address
+ - service-name
+ -
+ name: job-del
+ doc: delete a job in uecon context
+ attribute-set: job-req
+ flags: [ admin-perm ]
+ do:
+ request:
+ attributes:
+ - context-id
+ - id
diff --git a/drivers/ultraeth/Makefile b/drivers/ultraeth/Makefile
index 599d91d205c1..bf41a62273f9 100644
--- a/drivers/ultraeth/Makefile
+++ b/drivers/ultraeth/Makefile
@@ -1,3 +1,3 @@
obj-$(CONFIG_ULTRAETH) += ultraeth.o
-ultraeth-objs := uet_main.o uet_context.o uet_netlink.o
+ultraeth-objs := uet_main.o uet_context.o uet_netlink.o uet_job.o
diff --git a/drivers/ultraeth/uet_context.c b/drivers/ultraeth/uet_context.c
index 2444fa3f35cd..3d738c02e992 100644
--- a/drivers/ultraeth/uet_context.c
+++ b/drivers/ultraeth/uet_context.c
@@ -102,10 +102,16 @@ int uet_context_create(int id)
goto ctx_id_err;
}
+ err = uet_jobs_init(&ctx->job_reg);
+ if (err)
+ goto ctx_jobs_err;
+
uet_context_link(ctx);
return 0;
+ctx_jobs_err:
+ uet_context_put_id(ctx);
ctx_id_err:
kfree(ctx);
@@ -115,6 +121,7 @@ int uet_context_create(int id)
static void __uet_context_destroy(struct uet_context *ctx)
{
uet_context_unlink(ctx);
+ uet_jobs_uninit(&ctx->job_reg);
uet_context_put_id(ctx);
kfree(ctx);
}
diff --git a/drivers/ultraeth/uet_job.c b/drivers/ultraeth/uet_job.c
new file mode 100644
index 000000000000..3a55a0f70749
--- /dev/null
+++ b/drivers/ultraeth/uet_job.c
@@ -0,0 +1,455 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/bug.h>
+#include <net/ipv6.h>
+#include <net/ultraeth/uet_context.h>
+
+#include "uet_netlink.h"
+
+static const struct rhashtable_params uet_job_registry_rht_params = {
+ .head_offset = offsetof(struct uet_job, rht_node),
+ .key_offset = offsetof(struct uet_job, id),
+ .key_len = sizeof(u32),
+ .nelem_hint = 128,
+ .automatic_shrinking = true,
+};
+
+int uet_jobs_init(struct uet_job_registry *jreg)
+{
+ int ret;
+
+ mutex_init(&jreg->jobs_lock);
+
+ ret = rhashtable_init(&jreg->jobs_hash, &uet_job_registry_rht_params);
+ if (ret)
+ mutex_destroy(&jreg->jobs_lock);
+
+ return ret;
+}
+
+static int __job_associate(struct uet_job *job, struct uet_fep *fep)
+{
+ lockdep_assert_held_once(&job->jreg->jobs_lock);
+
+ if (rcu_access_pointer(job->fep))
+ return -EBUSY;
+
+ WRITE_ONCE(fep->job_id, job->id);
+ rcu_assign_pointer(job->fep, fep);
+
+ return 0;
+}
+
+/* disassociate and close all PDCs related to the job */
+static void __job_disassociate(struct uet_job *job)
+{
+ struct uet_fep *fep;
+
+ fep = rcu_dereference_check(job->fep,
+ lockdep_is_held(&job->jreg->jobs_lock));
+ if (!fep)
+ return;
+
+ WRITE_ONCE(fep->job_id, 0);
+ RCU_INIT_POINTER(job->fep, NULL);
+ synchronize_rcu();
+}
+
+struct uet_job *uet_job_find(struct uet_job_registry *jreg, u32 id)
+{
+ return rhashtable_lookup_fast(&jreg->jobs_hash, &id,
+ uet_job_registry_rht_params);
+}
+
+static struct uet_job *uet_job_find_svc_name(struct uet_job_registry *jreg,
+ char *service_name)
+{
+ struct uet_job *job;
+
+ lockdep_assert_held_once(&jreg->jobs_lock);
+
+ hlist_for_each_entry(job, &jreg->jobs_list, hnode) {
+ if (!strcmp(job->service_name, service_name))
+ return job;
+ }
+
+ return NULL;
+}
+
+static void __uet_job_remove(struct uet_job *job)
+{
+ struct uet_job_registry *jreg = job->jreg;
+
+ __job_disassociate(job);
+ hlist_del_init_rcu(&job->hnode);
+ rhashtable_remove_fast(&jreg->jobs_hash, &job->rht_node,
+ uet_job_registry_rht_params);
+ kfree_rcu(job, rcu);
+}
+
+bool uet_job_remove(struct uet_job_registry *jreg, u32 job_id)
+{
+ bool removed = false;
+ struct uet_job *job;
+
+ mutex_lock(&jreg->jobs_lock);
+ job = uet_job_find(jreg, job_id);
+ if (job) {
+ __uet_job_remove(job);
+ removed = true;
+ }
+ mutex_unlock(&jreg->jobs_lock);
+
+ return removed;
+}
+
+void uet_jobs_uninit(struct uet_job_registry *jreg)
+{
+ struct hlist_node *tmp;
+ struct uet_job *job;
+
+ mutex_lock(&jreg->jobs_lock);
+ hlist_for_each_entry_safe(job, tmp, &jreg->jobs_list, hnode)
+ __uet_job_remove(job);
+ mutex_unlock(&jreg->jobs_lock);
+
+ rhashtable_destroy(&jreg->jobs_hash);
+ rcu_barrier();
+ mutex_destroy(&jreg->jobs_lock);
+}
+
+struct uet_job *uet_job_create(struct uet_job_registry *jreg,
+ struct uet_job_ctrl_addr_req *job_req)
+{
+ struct uet_job *job;
+ int ret;
+
+ if (job_req->job_id == 0)
+ return ERR_PTR(-EINVAL);
+
+ job = kzalloc(sizeof(*job), GFP_KERNEL);
+ if (!job)
+ return ERR_PTR(-ENOMEM);
+ mutex_lock(&jreg->jobs_lock);
+ if (uet_job_find_svc_name(jreg, job_req->service_name)) {
+ mutex_unlock(&jreg->jobs_lock);
+ kfree(job);
+ return ERR_PTR(-EEXIST);
+ }
+
+ job->jreg = jreg;
+ job->id = job_req->job_id;
+ strscpy(job->service_name, job_req->service_name, sizeof(job->service_name));
+
+ ret = rhashtable_lookup_insert_fast(&jreg->jobs_hash, &job->rht_node,
+ uet_job_registry_rht_params);
+ if (ret) {
+ kfree_rcu(job, rcu);
+ mutex_unlock(&jreg->jobs_lock);
+ return ERR_PTR(ret);
+ }
+ hlist_add_head_rcu(&job->hnode, &jreg->jobs_list);
+ mutex_unlock(&jreg->jobs_lock);
+
+ return job;
+}
+
+int uet_job_reg_associate(struct uet_job_registry *jreg, struct uet_fep *fep,
+ char *service_name)
+{
+ struct uet_job *job;
+ int ret = -ENOENT;
+
+ mutex_lock(&jreg->jobs_lock);
+ job = uet_job_find_svc_name(jreg, service_name);
+ if (job)
+ ret = __job_associate(job, fep);
+ mutex_unlock(&jreg->jobs_lock);
+
+ return ret;
+}
+
+void uet_job_reg_disassociate(struct uet_job_registry *jreg, u32 job_id)
+{
+ struct uet_job *job;
+
+ mutex_lock(&jreg->jobs_lock);
+ job = uet_job_find(jreg, job_id);
+ if (job)
+ __job_disassociate(job);
+ mutex_unlock(&jreg->jobs_lock);
+}
+
+/* returns <0 (error) or 1 (queued the skb) */
+int uet_job_fep_queue_skb(struct uet_context *ctx,
+ u32 job_id, struct sk_buff *skb,
+ __be32 remote_fep_addr)
+{
+ struct uet_job *job = uet_job_find(&ctx->job_reg, job_id);
+ struct uet_fep *fep;
+
+ if (!job)
+ return -ENOENT;
+
+ fep = rcu_dereference(job->fep);
+ if (!fep)
+ return -ENODEV;
+
+ skb_dst_drop(skb);
+ skb_queue_tail(&fep->rxq, skb);
+
+ return 1;
+}
+
+static int __nl_fep_addr_fill_one(struct sk_buff *skb,
+ const struct fep_in_address *fep_addr,
+ int fep_attr)
+{
+ struct nlattr *nest;
+ int attr, len;
+
+ if (!fep_addr->family)
+ return 0;
+
+ nest = nla_nest_start(skb, fep_attr);
+ if (!nest)
+ return -EMSGSIZE;
+
+ switch (fep_addr->family) {
+ case AF_INET:
+ attr = ULTRAETH_A_FEP_IN_ADDR_IP;
+ len = sizeof(fep_addr->ip);
+ break;
+ case AF_INET6:
+ attr = ULTRAETH_A_FEP_IN_ADDR_IP6;
+ len = sizeof(fep_addr->ip6);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ nla_nest_cancel(skb, nest);
+ return 0;
+ }
+
+ if (nla_put(skb, attr, len, &fep_addr->ip) ||
+ nla_put_u16(skb, ULTRAETH_A_FEP_IN_ADDR_FAMILY, fep_addr->family)) {
+ nla_nest_cancel(skb, nest);
+ return -EMSGSIZE;
+ }
+
+ nla_nest_end(skb, nest);
+
+ return 0;
+}
+
+static int __nl_uet_addr_fill_one(struct sk_buff *skb,
+ const struct fep_address *addr, int attr)
+{
+ struct nlattr *nest;
+
+ nest = nla_nest_start(skb, attr);
+ if (!nest)
+ return -EMSGSIZE;
+ if (__nl_fep_addr_fill_one(skb, &addr->in_address,
+ ULTRAETH_A_FEP_ADDRESS_IN_ADDRESS) ||
+ nla_put_u16(skb, ULTRAETH_A_FEP_ADDRESS_FLAGS, addr->flags) ||
+ nla_put_u16(skb, ULTRAETH_A_FEP_ADDRESS_CAPS, addr->fep_caps) ||
+ nla_put_u16(skb, ULTRAETH_A_FEP_ADDRESS_START_RESOURCE_INDEX,
+ addr->start_resource_index) ||
+ nla_put_u16(skb, ULTRAETH_A_FEP_ADDRESS_NUM_RESOURCE_INDICES,
+ addr->num_resource_indices) ||
+ nla_put_u32(skb, ULTRAETH_A_FEP_ADDRESS_INITIATOR_ID,
+ addr->initiator_id) ||
+ nla_put_u16(skb, ULTRAETH_A_FEP_ADDRESS_PID_ON_FEP,
+ addr->pid_on_fep) ||
+ nla_put_u8(skb, ULTRAETH_A_FEP_ADDRESS_VERSION, addr->version)) {
+ nla_nest_cancel(skb, nest);
+ return -EMSGSIZE;
+ }
+ nla_nest_end(skb, nest);
+
+ return 0;
+}
+
+static int __nl_fep_fill_one(struct sk_buff *skb,
+ const struct uet_fep *fep, int attr)
+{
+ struct nlattr *nest;
+
+ nest = nla_nest_start(skb, attr);
+ if (!nest)
+ return -EMSGSIZE;
+ if (__nl_uet_addr_fill_one(skb, &fep->addr, ULTRAETH_A_FEP_ENTRY_ADDRESS)) {
+ nla_nest_cancel(skb, nest);
+ return -EMSGSIZE;
+ }
+ nla_nest_end(skb, nest);
+
+ return 0;
+}
+
+static int __nl_job_feps_fill(struct sk_buff *skb, const struct uet_fep *fep)
+{
+ struct nlattr *nest;
+
+ nest = nla_nest_start(skb, ULTRAETH_A_JOB_FLIST);
+ if (!nest)
+ return -EMSGSIZE;
+ if (fep && __nl_fep_fill_one(skb, fep, ULTRAETH_A_FLIST_FEP)) {
+ nla_nest_cancel(skb, nest);
+ return -EMSGSIZE;
+ }
+ nla_nest_end(skb, nest);
+
+ return 0;
+}
+
+static int __nl_job_fill_one(struct sk_buff *skb, const struct uet_job *job)
+{
+ struct nlattr *nest;
+
+ nest = nla_nest_start(skb, ULTRAETH_A_JLIST_JOB);
+ if (!nest)
+ return -EMSGSIZE;
+
+ if (__nl_uet_addr_fill_one(skb, &job->addr, ULTRAETH_A_JOB_ADDRESS) ||
+ nla_put_u32(skb, ULTRAETH_A_JOB_ID, job->id) ||
+ nla_put_string(skb, ULTRAETH_A_JOB_SERVICE_NAME, job->service_name) ||
+ __nl_job_feps_fill(skb, rcu_dereference(job->fep))) {
+ nla_nest_cancel(skb, nest);
+ return -EMSGSIZE;
+ }
+
+ nla_nest_end(skb, nest);
+ return 0;
+}
+
+int ultraeth_nl_job_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ const struct genl_info *info = genl_info_dump(cb);
+ int idx = 0, s_idx = cb->args[0], err = 0;
+ struct uet_context *ctx;
+ struct uet_job *job;
+ struct nlattr *nest;
+ int context_id;
+ void *hdr;
+
+ if (!info->attrs[ULTRAETH_A_JOBS_CONTEXT_ID]) {
+ NL_SET_ERR_MSG(info->extack, "context id must be specified");
+ return -EINVAL;
+ }
+
+ context_id = nla_get_s32(info->attrs[ULTRAETH_A_JOBS_CONTEXT_ID]);
+ ctx = uet_context_get_by_id(context_id);
+ if (!ctx) {
+ NL_SET_ERR_MSG(info->extack, "context doesn't exist");
+ return -ENOENT;
+ }
+
+ /* filled all, return 0 */
+ if (s_idx == atomic_read(&ctx->job_reg.jobs_hash.nelems))
+ goto out_put;
+
+ err = -EMSGSIZE;
+ hdr = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
+ &ultraeth_nl_family, NLM_F_MULTI, ULTRAETH_CMD_JOB_GET);
+ if (!hdr)
+ goto out_put;
+ if (nla_put_s32(skb, ULTRAETH_A_JOBS_CONTEXT_ID, ctx->id))
+ goto out_end;
+ nest = nla_nest_start(skb, ULTRAETH_A_JOBS_JLIST);
+ if (!nest)
+ goto out_end;
+ err = 0;
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(job, &ctx->job_reg.jobs_list, hnode) {
+ if (idx < s_idx) {
+ idx++;
+ continue;
+ }
+ err = __nl_job_fill_one(skb, job);
+ if (err)
+ break;
+ idx++;
+ }
+ cb->args[0] = idx;
+ rcu_read_unlock();
+ nla_nest_end(skb, nest);
+out_end:
+ genlmsg_end(skb, hdr);
+out_put:
+ uet_context_put(ctx);
+
+ return err ? err : skb->len;
+}
+
+int ultraeth_nl_job_new_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct uet_job_ctrl_addr_req jreq;
+ struct uet_context *ctx;
+ int context_id, job_id;
+ struct uet_job *job;
+ char *service_name;
+ int ret = 0;
+
+ if (!info->attrs[ULTRAETH_A_JOB_REQ_CONTEXT_ID]) {
+ NL_SET_ERR_MSG(info->extack, "context id must be specified");
+ return -EINVAL;
+ }
+ if (!info->attrs[ULTRAETH_A_JOB_REQ_ID]) {
+ NL_SET_ERR_MSG(info->extack, "Job id must be specified");
+ return -EINVAL;
+ }
+ if (!info->attrs[ULTRAETH_A_JOB_REQ_SERVICE_NAME]) {
+ NL_SET_ERR_MSG(info->extack, "Job service name must be specified");
+ return -EINVAL;
+ }
+ service_name = nla_data(info->attrs[ULTRAETH_A_JOB_REQ_SERVICE_NAME]);
+ job_id = nla_get_u32(info->attrs[ULTRAETH_A_JOB_REQ_ID]);
+ context_id = nla_get_s32(info->attrs[ULTRAETH_A_JOB_REQ_CONTEXT_ID]);
+ ctx = uet_context_get_by_id(context_id);
+ if (!ctx) {
+ NL_SET_ERR_MSG(info->extack, "context doesn't exist");
+ return -ENOENT;
+ }
+
+ memset(&jreq, 0, sizeof(jreq));
+ jreq.job_id = job_id;
+ strscpy(jreq.service_name, service_name, sizeof(jreq.service_name));
+ job = uet_job_create(&ctx->job_reg, &jreq);
+ if (IS_ERR(job))
+ ret = PTR_ERR(job);
+
+ uet_context_put(ctx);
+
+ return ret;
+}
+
+int ultraeth_nl_job_del_doit(struct sk_buff *skb, struct genl_info *info)
+{
+ struct uet_context *ctx;
+ bool destroyed = false;
+ int context_id, job_id;
+
+ if (!info->attrs[ULTRAETH_A_JOB_REQ_CONTEXT_ID]) {
+ NL_SET_ERR_MSG(info->extack, "context id must be specified");
+ return -EINVAL;
+ }
+ if (!info->attrs[ULTRAETH_A_JOB_REQ_ID]) {
+ NL_SET_ERR_MSG(info->extack, "Job id must be specified");
+ return -EINVAL;
+ }
+ job_id = nla_get_u32(info->attrs[ULTRAETH_A_JOB_REQ_ID]);
+ context_id = nla_get_s32(info->attrs[ULTRAETH_A_JOB_REQ_CONTEXT_ID]);
+ ctx = uet_context_get_by_id(context_id);
+ if (!ctx) {
+ NL_SET_ERR_MSG(info->extack, "context doesn't exist");
+ return -ENOENT;
+ }
+
+ destroyed = uet_job_remove(&ctx->job_reg, job_id);
+ uet_context_put(ctx);
+
+ return destroyed ? 0 : -ENOENT;
+}
diff --git a/drivers/ultraeth/uet_netlink.c b/drivers/ultraeth/uet_netlink.c
index 39e4aa6092a9..7fdaf15e43e3 100644
--- a/drivers/ultraeth/uet_netlink.c
+++ b/drivers/ultraeth/uet_netlink.c
@@ -10,6 +10,25 @@
#include <uapi/linux/ultraeth_nl.h>
+/* Common nested types */
+const struct nla_policy ultraeth_fep_address_nl_policy[ULTRAETH_A_FEP_ADDRESS_VERSION + 1] = {
+ [ULTRAETH_A_FEP_ADDRESS_IN_ADDRESS] = NLA_POLICY_NESTED(ultraeth_fep_in_addr_nl_policy),
+ [ULTRAETH_A_FEP_ADDRESS_FLAGS] = { .type = NLA_U16, },
+ [ULTRAETH_A_FEP_ADDRESS_CAPS] = { .type = NLA_U16, },
+ [ULTRAETH_A_FEP_ADDRESS_START_RESOURCE_INDEX] = { .type = NLA_U16, },
+ [ULTRAETH_A_FEP_ADDRESS_NUM_RESOURCE_INDICES] = { .type = NLA_U16, },
+ [ULTRAETH_A_FEP_ADDRESS_INITIATOR_ID] = { .type = NLA_U32, },
+ [ULTRAETH_A_FEP_ADDRESS_PID_ON_FEP] = { .type = NLA_U16, },
+ [ULTRAETH_A_FEP_ADDRESS_PADDING] = { .type = NLA_U16, },
+ [ULTRAETH_A_FEP_ADDRESS_VERSION] = { .type = NLA_U8, },
+};
+
+const struct nla_policy ultraeth_fep_in_addr_nl_policy[ULTRAETH_A_FEP_IN_ADDR_FAMILY + 1] = {
+ [ULTRAETH_A_FEP_IN_ADDR_IP] = { .type = NLA_BINARY, },
+ [ULTRAETH_A_FEP_IN_ADDR_IP6] = { .type = NLA_BINARY, },
+ [ULTRAETH_A_FEP_IN_ADDR_FAMILY] = { .type = NLA_U16, },
+};
+
/* ULTRAETH_CMD_CONTEXT_NEW - do */
static const struct nla_policy ultraeth_context_new_nl_policy[ULTRAETH_A_CONTEXT_ID + 1] = {
[ULTRAETH_A_CONTEXT_ID] = NLA_POLICY_RANGE(NLA_S32, 0, 255),
@@ -20,6 +39,25 @@ static const struct nla_policy ultraeth_context_del_nl_policy[ULTRAETH_A_CONTEXT
[ULTRAETH_A_CONTEXT_ID] = NLA_POLICY_RANGE(NLA_S32, 0, 255),
};
+/* ULTRAETH_CMD_JOB_GET - dump */
+static const struct nla_policy ultraeth_job_get_nl_policy[ULTRAETH_A_JOBS_CONTEXT_ID + 1] = {
+ [ULTRAETH_A_JOBS_CONTEXT_ID] = { .type = NLA_S32, },
+};
+
+/* ULTRAETH_CMD_JOB_NEW - do */
+static const struct nla_policy ultraeth_job_new_nl_policy[ULTRAETH_A_JOB_REQ_SERVICE_NAME + 1] = {
+ [ULTRAETH_A_JOB_REQ_CONTEXT_ID] = { .type = NLA_S32, },
+ [ULTRAETH_A_JOB_REQ_ID] = { .type = NLA_U32, },
+ [ULTRAETH_A_JOB_REQ_ADDRESS] = NLA_POLICY_NESTED(ultraeth_fep_address_nl_policy),
+ [ULTRAETH_A_JOB_REQ_SERVICE_NAME] = { .type = NLA_NUL_STRING, },
+};
+
+/* ULTRAETH_CMD_JOB_DEL - do */
+static const struct nla_policy ultraeth_job_del_nl_policy[ULTRAETH_A_JOB_REQ_ID + 1] = {
+ [ULTRAETH_A_JOB_REQ_CONTEXT_ID] = { .type = NLA_S32, },
+ [ULTRAETH_A_JOB_REQ_ID] = { .type = NLA_U32, },
+};
+
/* Ops table for ultraeth */
static const struct genl_split_ops ultraeth_nl_ops[] = {
{
@@ -41,6 +79,27 @@ static const struct genl_split_ops ultraeth_nl_ops[] = {
.maxattr = ULTRAETH_A_CONTEXT_ID,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
+ {
+ .cmd = ULTRAETH_CMD_JOB_GET,
+ .dumpit = ultraeth_nl_job_get_dumpit,
+ .policy = ultraeth_job_get_nl_policy,
+ .maxattr = ULTRAETH_A_JOBS_CONTEXT_ID,
+ .flags = GENL_CMD_CAP_DUMP,
+ },
+ {
+ .cmd = ULTRAETH_CMD_JOB_NEW,
+ .doit = ultraeth_nl_job_new_doit,
+ .policy = ultraeth_job_new_nl_policy,
+ .maxattr = ULTRAETH_A_JOB_REQ_SERVICE_NAME,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
+ {
+ .cmd = ULTRAETH_CMD_JOB_DEL,
+ .doit = ultraeth_nl_job_del_doit,
+ .policy = ultraeth_job_del_nl_policy,
+ .maxattr = ULTRAETH_A_JOB_REQ_ID,
+ .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
+ },
};
struct genl_family ultraeth_nl_family __ro_after_init = {
diff --git a/drivers/ultraeth/uet_netlink.h b/drivers/ultraeth/uet_netlink.h
index 9dd9df24513a..6e7226f39ddf 100644
--- a/drivers/ultraeth/uet_netlink.h
+++ b/drivers/ultraeth/uet_netlink.h
@@ -11,10 +11,18 @@
#include <uapi/linux/ultraeth_nl.h>
+/* Common nested types */
+extern const struct nla_policy ultraeth_fep_address_nl_policy[ULTRAETH_A_FEP_ADDRESS_VERSION + 1];
+extern const struct nla_policy ultraeth_fep_in_addr_nl_policy[ULTRAETH_A_FEP_IN_ADDR_FAMILY + 1];
+
int ultraeth_nl_context_get_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
int ultraeth_nl_context_new_doit(struct sk_buff *skb, struct genl_info *info);
int ultraeth_nl_context_del_doit(struct sk_buff *skb, struct genl_info *info);
+int ultraeth_nl_job_get_dumpit(struct sk_buff *skb,
+ struct netlink_callback *cb);
+int ultraeth_nl_job_new_doit(struct sk_buff *skb, struct genl_info *info);
+int ultraeth_nl_job_del_doit(struct sk_buff *skb, struct genl_info *info);
extern struct genl_family ultraeth_nl_family;
diff --git a/include/net/ultraeth/uet_context.h b/include/net/ultraeth/uet_context.h
index 150ad2c9b456..7638c768597e 100644
--- a/include/net/ultraeth/uet_context.h
+++ b/include/net/ultraeth/uet_context.h
@@ -9,12 +9,15 @@
#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/wait.h>
+#include <net/ultraeth/uet_job.h>
struct uet_context {
int id;
refcount_t refcnt;
wait_queue_head_t refcnt_wait;
struct list_head list;
+
+ struct uet_job_registry job_reg;
};
struct uet_context *uet_context_get_by_id(int id);
diff --git a/include/net/ultraeth/uet_job.h b/include/net/ultraeth/uet_job.h
new file mode 100644
index 000000000000..fac1f0752a78
--- /dev/null
+++ b/include/net/ultraeth/uet_job.h
@@ -0,0 +1,78 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+
+#ifndef _UET_JOB_H
+#define _UET_JOB_H
+
+#include <linux/types.h>
+#include <linux/rhashtable.h>
+#include <linux/skbuff.h>
+#include <linux/mutex.h>
+#include <uapi/linux/ultraeth.h>
+
+struct uet_context;
+
+struct uet_job_registry {
+ struct mutex jobs_lock;
+ struct hlist_head jobs_list;
+ struct rhashtable jobs_hash;
+};
+
+struct uet_fep {
+ struct uet_context *context;
+ struct sk_buff_head rxq;
+ struct fep_address addr;
+ u32 job_id;
+};
+
+/**
+ * struct uet_job - single job
+ *
+ * @rht_node: link into the job registry's job hash table
+ * @hnode: link into the job registry's list
+ * @jreg: pointer to job registry (owner)
+ * @service_name: service name used for lookups on address req
+ * @addr: job specific address (XXX)
+ * @id: unique job id
+ * @rcu: used for freeing
+ *
+ * if @fep is set then the job is considered associated, i.e. there is
+ * an fd for the context's character device which is bound to this
+ * job (FEP)
+ */
+struct uet_job {
+ struct rhash_head rht_node;
+ struct hlist_node hnode;
+
+ struct uet_job_registry *jreg;
+
+ char service_name[UET_SVC_MAX_LEN];
+
+ struct fep_address addr;
+ struct uet_fep __rcu *fep;
+
+ u32 id;
+
+ struct rcu_head rcu;
+};
+
+struct uet_job_ctrl_addr_req {
+ char service_name[UET_SVC_MAX_LEN];
+ struct fep_in_address address;
+ __u32 job_id;
+ __u32 os_pid;
+ __u8 flags;
+};
+
+int uet_jobs_init(struct uet_job_registry *jreg);
+void uet_jobs_uninit(struct uet_job_registry *jreg);
+
+struct uet_job *uet_job_create(struct uet_job_registry *jreg,
+ struct uet_job_ctrl_addr_req *job_req);
+bool uet_job_remove(struct uet_job_registry *jreg, u32 job_id);
+struct uet_job *uet_job_find(struct uet_job_registry *jreg, u32 id);
+void uet_job_reg_disassociate(struct uet_job_registry *jreg, u32 job_id);
+int uet_job_reg_associate(struct uet_job_registry *jreg, struct uet_fep *fep,
+ char *service_name);
+int uet_job_fep_queue_skb(struct uet_context *ctx, u32 job_id,
+ struct sk_buff *skb, __be32 remote_fep_addr);
+#endif /* _UET_JOB_H */
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
new file mode 100644
index 000000000000..a6f244de6d75
--- /dev/null
+++ b/include/uapi/linux/ultraeth.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+
+#ifndef _UAPI_LINUX_ULTRAETH_H
+#define _UAPI_LINUX_ULTRAETH_H
+
+#include <asm/byteorder.h>
+#include <linux/types.h>
+
+#define UET_SVC_MAX_LEN 64
+
+enum {
+ UET_ADDR_F_VALID_FEP_CAP = (1 << 0),
+ UET_ADDR_F_VALID_ADDR = (1 << 1),
+ UET_ADDR_F_VALID_PID_ON_FEP = (1 << 2),
+ UET_ADDR_F_VALID_RI = (1 << 3),
+ UET_ADDR_F_VALID_INIT_ID = (1 << 4),
+ UET_ADDR_F_ADDRESS_MODE = (1 << 5),
+ UET_ADDR_F_ADDRESS_TYPE = (1 << 6),
+ UET_ADDR_F_MTU_LIMITED = (1 << 7),
+};
+
+#define UET_ADDR_FLAG_IP_VER (1 << 6)
+
+struct fep_in_address {
+ union {
+ __be32 ip;
+ __u8 ip6[16];
+ };
+ __u16 family;
+};
+
+struct fep_address {
+ struct fep_in_address in_address;
+
+ __u16 flags;
+ __u16 fep_caps;
+ __u16 start_resource_index;
+ __u16 num_resource_indices;
+ __u32 initiator_id;
+ __u16 pid_on_fep;
+ __u16 padding;
+ __u8 version;
+};
+#endif /* _UAPI_LINUX_ULTRAETH_H */
diff --git a/include/uapi/linux/ultraeth_nl.h b/include/uapi/linux/ultraeth_nl.h
index f3bdf8111623..d65521de196a 100644
--- a/include/uapi/linux/ultraeth_nl.h
+++ b/include/uapi/linux/ultraeth_nl.h
@@ -23,10 +23,86 @@ enum {
ULTRAETH_A_CONTEXTS_MAX = (__ULTRAETH_A_CONTEXTS_MAX - 1)
};
+enum {
+ ULTRAETH_A_FEP_IN_ADDR_IP = 1,
+ ULTRAETH_A_FEP_IN_ADDR_IP6,
+ ULTRAETH_A_FEP_IN_ADDR_FAMILY,
+
+ __ULTRAETH_A_FEP_IN_ADDR_MAX,
+ ULTRAETH_A_FEP_IN_ADDR_MAX = (__ULTRAETH_A_FEP_IN_ADDR_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_FEP_ADDRESS_IN_ADDRESS = 1,
+ ULTRAETH_A_FEP_ADDRESS_FLAGS,
+ ULTRAETH_A_FEP_ADDRESS_CAPS,
+ ULTRAETH_A_FEP_ADDRESS_START_RESOURCE_INDEX,
+ ULTRAETH_A_FEP_ADDRESS_NUM_RESOURCE_INDICES,
+ ULTRAETH_A_FEP_ADDRESS_INITIATOR_ID,
+ ULTRAETH_A_FEP_ADDRESS_PID_ON_FEP,
+ ULTRAETH_A_FEP_ADDRESS_PADDING,
+ ULTRAETH_A_FEP_ADDRESS_VERSION,
+
+ __ULTRAETH_A_FEP_ADDRESS_MAX,
+ ULTRAETH_A_FEP_ADDRESS_MAX = (__ULTRAETH_A_FEP_ADDRESS_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_FEP_ENTRY_ADDRESS = 1,
+
+ __ULTRAETH_A_FEP_ENTRY_MAX,
+ ULTRAETH_A_FEP_ENTRY_MAX = (__ULTRAETH_A_FEP_ENTRY_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_FLIST_FEP = 1,
+
+ __ULTRAETH_A_FLIST_MAX,
+ ULTRAETH_A_FLIST_MAX = (__ULTRAETH_A_FLIST_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_JOB_REQ_CONTEXT_ID = 1,
+ ULTRAETH_A_JOB_REQ_ID,
+ ULTRAETH_A_JOB_REQ_ADDRESS,
+ ULTRAETH_A_JOB_REQ_SERVICE_NAME,
+
+ __ULTRAETH_A_JOB_REQ_MAX,
+ ULTRAETH_A_JOB_REQ_MAX = (__ULTRAETH_A_JOB_REQ_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_JOB_ID = 1,
+ ULTRAETH_A_JOB_ADDRESS,
+ ULTRAETH_A_JOB_SERVICE_NAME,
+ ULTRAETH_A_JOB_FLIST,
+
+ __ULTRAETH_A_JOB_MAX,
+ ULTRAETH_A_JOB_MAX = (__ULTRAETH_A_JOB_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_JLIST_JOB = 1,
+
+ __ULTRAETH_A_JLIST_MAX,
+ ULTRAETH_A_JLIST_MAX = (__ULTRAETH_A_JLIST_MAX - 1)
+};
+
+enum {
+ ULTRAETH_A_JOBS_CONTEXT_ID = 1,
+ ULTRAETH_A_JOBS_JLIST,
+
+ __ULTRAETH_A_JOBS_MAX,
+ ULTRAETH_A_JOBS_MAX = (__ULTRAETH_A_JOBS_MAX - 1)
+};
+
enum {
ULTRAETH_CMD_CONTEXT_GET = 1,
ULTRAETH_CMD_CONTEXT_NEW,
ULTRAETH_CMD_CONTEXT_DEL,
+ ULTRAETH_CMD_JOB_GET,
+ ULTRAETH_CMD_JOB_NEW,
+ ULTRAETH_CMD_JOB_DEL,
__ULTRAETH_CMD_MAX,
ULTRAETH_CMD_MAX = (__ULTRAETH_CMD_MAX - 1)
--
2.48.1
* [RFC PATCH 05/13] drivers: ultraeth: add tunnel udp device support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (3 preceding siblings ...)
2025-03-06 23:01 ` [RFC PATCH 04/13] drivers: ultraeth: add job support Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 06/13] drivers: ultraeth: add initial PDS infrastructure Nikolay Aleksandrov
` (9 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Add a UE UDP tunnel device (uecon) which is created for each context. It
will be used to transmit and receive UE packets; currently all packets are
dropped. A default UDP port of 5432 is used at context creation, and it can
be changed at runtime while the net device is down.
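As a usage sketch, and purely hypothetical since it assumes iproute2 grows
support for the uecon link type added to rt_link.yaml here (the device name
is made up as well), changing the port could look like:

	# ip link set uecon0 down
	# ip link set uecon0 type uecon port 6000
	# ip link set uecon0 up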
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
Documentation/netlink/specs/rt_link.yaml | 14 +
Documentation/netlink/specs/ultraeth.yaml | 6 +
drivers/ultraeth/Makefile | 3 +-
drivers/ultraeth/uecon.c | 311 ++++++++++++++++++++++
drivers/ultraeth/uet_context.c | 12 +-
drivers/ultraeth/uet_main.c | 19 +-
include/net/ultraeth/uecon.h | 28 ++
include/net/ultraeth/uet_context.h | 2 +-
include/uapi/linux/if_link.h | 8 +
include/uapi/linux/ultraeth.h | 1 +
include/uapi/linux/ultraeth_nl.h | 2 +
11 files changed, 402 insertions(+), 4 deletions(-)
create mode 100644 drivers/ultraeth/uecon.c
create mode 100644 include/net/ultraeth/uecon.h
diff --git a/Documentation/netlink/specs/rt_link.yaml b/Documentation/netlink/specs/rt_link.yaml
index 31238455f8e9..747231b1fd6d 100644
--- a/Documentation/netlink/specs/rt_link.yaml
+++ b/Documentation/netlink/specs/rt_link.yaml
@@ -2272,6 +2272,17 @@ attribute-sets:
-
name: tailroom
type: u16
+ -
+ name: linkinfo-uecon-attrs
+ name-prefix: ifla-uecon-
+ attributes:
+ -
+ name: context-id
+ type: u32
+ -
+ name: port
+ type: u16
+ byte-order: big-endian
sub-messages:
-
@@ -2322,6 +2333,9 @@ sub-messages:
-
value: netkit
attribute-set: linkinfo-netkit-attrs
+ -
+ value: uecon
+ attribute-set: linkinfo-uecon-attrs
-
name: linkinfo-member-data-msg
formats:
diff --git a/Documentation/netlink/specs/ultraeth.yaml b/Documentation/netlink/specs/ultraeth.yaml
index e95c73a36892..847f748efa52 100644
--- a/Documentation/netlink/specs/ultraeth.yaml
+++ b/Documentation/netlink/specs/ultraeth.yaml
@@ -16,6 +16,12 @@ attribute-sets:
checks:
min: 0
max: 255
+ -
+ name: netdev-ifindex
+ type: s32
+ -
+ name: netdev-name
+ type: string
-
name: contexts
attributes:
diff --git a/drivers/ultraeth/Makefile b/drivers/ultraeth/Makefile
index bf41a62273f9..0035023876ab 100644
--- a/drivers/ultraeth/Makefile
+++ b/drivers/ultraeth/Makefile
@@ -1,3 +1,4 @@
obj-$(CONFIG_ULTRAETH) += ultraeth.o
-ultraeth-objs := uet_main.o uet_context.o uet_netlink.o uet_job.o
+ultraeth-objs := uet_main.o uet_context.o uet_netlink.o uet_job.o \
+ uecon.o
diff --git a/drivers/ultraeth/uecon.c b/drivers/ultraeth/uecon.c
new file mode 100644
index 000000000000..4b74680700af
--- /dev/null
+++ b/drivers/ultraeth/uecon.c
@@ -0,0 +1,311 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+#include <linux/etherdevice.h>
+#include <linux/if_link.h>
+#include <net/ipv6_stubs.h>
+#include <net/dst_metadata.h>
+#include <net/rtnetlink.h>
+#include <net/gro.h>
+#include <net/udp_tunnel.h>
+#include <net/ultraeth/uecon.h>
+#include <net/ultraeth/uet_context.h>
+
+static const struct nla_policy uecon_ndev_policy[IFLA_UECON_MAX + 1] = {
+ [IFLA_UECON_CONTEXT_ID] = { .type = NLA_REJECT,
+ .reject_message = "Context id attribute is read-only" },
+ [IFLA_UECON_PORT] = { .type = NLA_BE16 },
+};
+
+static netdev_tx_t uecon_ndev_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct uecon_ndev_priv *uecpriv = netdev_priv(dev);
+ struct ip_tunnel_info *info;
+ int err, min_headroom;
+ struct socket *sock;
+ struct rtable *rt;
+ bool use_cache;
+ __be32 saddr;
+ __be16 sport;
+
+ rcu_read_lock();
+ sock = rcu_dereference(uecpriv->sock);
+ if (!sock)
+ goto out_err;
+ info = skb_tunnel_info(skb);
+ if (!info)
+ goto out_err;
+ use_cache = ip_tunnel_dst_cache_usable(skb, info);
+ sport = uecpriv->udp_port;
+ rt = udp_tunnel_dst_lookup(skb, dev, dev_net(dev), 0, &saddr,
+ &info->key, sport,
+ info->key.tp_dst, info->key.tos,
+ use_cache ? &info->dst_cache : NULL);
+ if (IS_ERR(rt)) {
+ if (PTR_ERR(rt) == -ELOOP)
+ dev->stats.collisions++;
+ else if (PTR_ERR(rt) == -ENETUNREACH)
+ dev->stats.tx_carrier_errors++;
+
+ goto out_err;
+ }
+
+ skb_tunnel_check_pmtu(skb, &rt->dst,
+ sizeof(struct iphdr) + sizeof(struct udphdr),
+ false);
+ skb_scrub_packet(skb, false);
+
+ min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len +
+ sizeof(struct iphdr) + sizeof(struct udphdr);
+ err = skb_cow_head(skb, min_headroom);
+ if (unlikely(err)) {
+ dst_release(&rt->dst);
+ goto out_err;
+ }
+
+ err = udp_tunnel_handle_offloads(skb, false);
+ if (err) {
+ dst_release(&rt->dst);
+ goto out_err;
+ }
+
+ skb_reset_mac_header(skb);
+ skb_set_inner_protocol(skb, skb->protocol);
+
+ udp_tunnel_xmit_skb(rt, sock->sk, skb, saddr,
+ info->key.u.ipv4.dst, info->key.tos,
+ ip4_dst_hoplimit(&rt->dst), 0,
+ sport, info->key.tp_dst,
+ false, false);
+ rcu_read_unlock();
+
+ return NETDEV_TX_OK;
+
+out_err:
+ rcu_read_unlock();
+ dev_kfree_skb(skb);
+ dev->stats.tx_errors++;
+
+ return NETDEV_TX_OK;
+}
+
+static int uecon_ndev_encap_recv(struct sock *sk, struct sk_buff *skb)
+{
+ struct uecon_ndev_priv *uecpriv;
+ int len;
+
+ uecpriv = rcu_dereference_sk_user_data(sk);
+ if (!uecpriv)
+ goto drop;
+
+ if (skb->protocol != htons(ETH_P_IP))
+ goto drop;
+
+ /* we assume [ tnl ip hdr ] [ tnl udp hdr ] [ pdc hdr ] [ ses hdr ] */
+ if (iptunnel_pull_header(skb, sizeof(struct udphdr), htons(ETH_P_802_3), false))
+ goto drop_count;
+
+ skb_reset_mac_header(skb);
+ skb_reset_network_header(skb);
+ skb->pkt_type = PACKET_HOST;
+ skb->dev = uecpriv->dev;
+ len = skb->len;
+ consume_skb(skb);
+ dev_sw_netstats_rx_add(uecpriv->dev, len);
+
+ return 0;
+
+drop_count:
+ dev_core_stats_rx_dropped_inc(uecpriv->dev);
+drop:
+ kfree_skb(skb);
+ return 0;
+}
+
+static int uecon_ndev_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+ return 0;
+}
+
+static struct socket *uecon_ndev_create_sock(struct net_device *dev)
+{
+ struct uecon_ndev_priv *uecpriv = netdev_priv(dev);
+ struct udp_port_cfg udp_conf;
+ struct socket *sock;
+ int err;
+
+ memset(&udp_conf, 0, sizeof(udp_conf));
+ udp_conf.family = AF_INET;
+ udp_conf.local_udp_port = uecpriv->udp_port;
+ err = udp_sock_create(dev_net(dev), &udp_conf, &sock);
+ if (err < 0)
+ return ERR_PTR(err);
+
+ udp_allow_gso(sock->sk);
+
+ return sock;
+}
+
+static int uecon_ndev_open(struct net_device *dev)
+{
+ struct uecon_ndev_priv *uecpriv = netdev_priv(dev);
+ struct udp_tunnel_sock_cfg tunnel_cfg;
+ struct socket *sock;
+
+ sock = uecon_ndev_create_sock(dev);
+ if (IS_ERR(sock))
+ return PTR_ERR(sock);
+ memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
+ tunnel_cfg.sk_user_data = uecpriv;
+ tunnel_cfg.encap_type = 1;
+ tunnel_cfg.encap_rcv = uecon_ndev_encap_recv;
+ tunnel_cfg.encap_err_lookup = uecon_ndev_err_lookup;
+ setup_udp_tunnel_sock(dev_net(dev), sock, &tunnel_cfg);
+
+ rcu_assign_pointer(uecpriv->sock, sock);
+
+ return 0;
+}
+
+static int uecon_ndev_stop(struct net_device *dev)
+{
+ struct uecon_ndev_priv *uecpriv = netdev_priv(dev);
+ struct socket *sock = rtnl_dereference(uecpriv->sock);
+
+ rcu_assign_pointer(uecpriv->sock, NULL);
+ synchronize_rcu();
+ udp_tunnel_sock_release(sock);
+
+ return 0;
+}
+
+const struct net_device_ops uecon_netdev_ops = {
+ .ndo_open = uecon_ndev_open,
+ .ndo_stop = uecon_ndev_stop,
+ .ndo_start_xmit = uecon_ndev_xmit,
+ .ndo_get_stats64 = dev_get_tstats64,
+};
+
+static const struct device_type uecon_ndev_type = {
+ .name = "uecon",
+};
+
+static void uecon_ndev_setup(struct net_device *dev)
+{
+ struct uecon_ndev_priv *uecpriv = netdev_priv(dev);
+
+ dev->netdev_ops = &uecon_netdev_ops;
+ SET_NETDEV_DEVTYPE(dev, &uecon_ndev_type);
+
+ dev->features |= NETIF_F_VLAN_CHALLENGED | NETIF_F_SG | NETIF_F_HW_CSUM;
+ dev->features |= NETIF_F_FRAGLIST | NETIF_F_RXCSUM | NETIF_F_GSO_SOFTWARE;
+
+ dev->hw_features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_FRAGLIST;
+ dev->hw_features |= NETIF_F_RXCSUM | NETIF_F_GSO_SOFTWARE;
+
+ dev->priv_flags |= IFF_NO_QUEUE;
+
+ dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST;
+ dev->type = ARPHRD_NONE;
+
+ dev->min_mtu = IPV4_MIN_MTU;
+ /* No header for the time being, account for it later */
+ dev->max_mtu = IP_MAX_MTU;
+ dev->mtu = ETH_DATA_LEN;
+ dev->pcpu_stat_type = NETDEV_PCPU_STAT_TSTATS;
+
+ netif_keep_dst(dev);
+ uecpriv->dev = dev;
+}
+
+static int uecon_ndev_changelink(struct net_device *dev, struct nlattr *tb[],
+ struct nlattr *data[],
+ struct netlink_ext_ack *extack)
+{
+ if (dev->flags & IFF_UP) {
+ NL_SET_ERR_MSG_MOD(extack, "Cannot change uecon settings while the device is up");
+ return -EBUSY;
+ }
+
+ if (tb[IFLA_UECON_PORT]) {
+ struct uecon_ndev_priv *uecpriv = netdev_priv(dev);
+
+ uecpriv->udp_port = nla_get_be16(tb[IFLA_UECON_PORT]);
+ }
+
+ return 0;
+}
+
+static size_t uecon_ndev_get_size(const struct net_device *dev)
+{
+ return nla_total_size(sizeof(__u32)) + /* IFLA_UECON_CONTEXT_ID */
+ nla_total_size(sizeof(__be16)) + /* IFLA_UECON_PORT */
+ 0;
+}
+
+static int uecon_ndev_fill_info(struct sk_buff *skb, const struct net_device *dev)
+{
+ struct uecon_ndev_priv *uecpriv = netdev_priv(dev);
+
+ if (nla_put_u32(skb, IFLA_UECON_CONTEXT_ID, uecpriv->context->id) ||
+ nla_put_be16(skb, IFLA_UECON_PORT, uecpriv->udp_port))
+ return -EMSGSIZE;
+
+ return 0;
+}
+
+static struct rtnl_link_ops uecon_netdev_link_ops __read_mostly = {
+ .kind = "uecon",
+ .priv_size = sizeof(struct uecon_ndev_priv),
+ .setup = uecon_ndev_setup,
+ .get_size = uecon_ndev_get_size,
+ .fill_info = uecon_ndev_fill_info,
+ .changelink = uecon_ndev_changelink,
+ .policy = uecon_ndev_policy,
+ .maxtype = IFLA_UECON_MAX
+};
+
+int uecon_netdev_init(struct uet_context *ctx)
+{
+ struct net *net = current->nsproxy->net_ns;
+ struct uecon_ndev_priv *priv;
+ char ifname[IFNAMSIZ];
+ int ret;
+
+ snprintf(ifname, IFNAMSIZ, "uecon%d", ctx->id);
+ ctx->netdev = alloc_netdev(sizeof(struct uecon_ndev_priv), ifname,
+ NET_NAME_PREDICTABLE, uecon_ndev_setup);
+ if (!ctx->netdev)
+ return -ENOMEM;
+ priv = netdev_priv(ctx->netdev);
+
+ priv->context = ctx;
+ priv->dev = ctx->netdev;
+ priv->udp_port = htons(UECON_DEFAULT_PORT);
+ ctx->netdev->rtnl_link_ops = &uecon_netdev_link_ops;
+ dev_net_set(ctx->netdev, net);
+
+ ret = register_netdev(ctx->netdev);
+ if (ret) {
+ free_netdev(ctx->netdev);
+ ctx->netdev = NULL;
+ }
+
+ return ret;
+}
+
+void uecon_netdev_uninit(struct uet_context *ctx)
+{
+ unregister_netdev(ctx->netdev);
+ free_netdev(ctx->netdev);
+ ctx->netdev = NULL;
+}
+
+int uecon_rtnl_link_register(void)
+{
+ return rtnl_link_register(&uecon_netdev_link_ops);
+}
+
+void uecon_rtnl_link_unregister(void)
+{
+ rtnl_link_unregister(&uecon_netdev_link_ops);
+}
diff --git a/drivers/ultraeth/uet_context.c b/drivers/ultraeth/uet_context.c
index 3d738c02e992..e0d276cb1942 100644
--- a/drivers/ultraeth/uet_context.c
+++ b/drivers/ultraeth/uet_context.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
#include <net/ultraeth/uet_context.h>
+#include <net/ultraeth/uecon.h>
#include "uet_netlink.h"
#define MAX_CONTEXT_ID 256
@@ -106,10 +107,16 @@ int uet_context_create(int id)
if (err)
goto ctx_jobs_err;
+ err = uecon_netdev_init(ctx);
+ if (err)
+ goto ctx_netdev_err;
+
uet_context_link(ctx);
return 0;
+ctx_netdev_err:
+ uet_jobs_uninit(&ctx->job_reg);
ctx_jobs_err:
uet_context_put_id(ctx);
ctx_id_err:
@@ -121,6 +128,7 @@ int uet_context_create(int id)
static void __uet_context_destroy(struct uet_context *ctx)
{
uet_context_unlink(ctx);
+ uecon_netdev_uninit(ctx);
uet_jobs_uninit(&ctx->job_reg);
uet_context_put_id(ctx);
kfree(ctx);
@@ -166,7 +174,9 @@ static int __nl_ctx_fill_one(struct sk_buff *skb,
if (!hdr)
return -EMSGSIZE;
- if (nla_put_s32(skb, ULTRAETH_A_CONTEXT_ID, ctx->id))
+ if (nla_put_s32(skb, ULTRAETH_A_CONTEXT_ID, ctx->id) ||
+ nla_put_s32(skb, ULTRAETH_A_CONTEXT_NETDEV_IFINDEX, ctx->netdev->ifindex) ||
+ nla_put_string(skb, ULTRAETH_A_CONTEXT_NETDEV_NAME, ctx->netdev->name))
goto out_err;
genlmsg_end(skb, hdr);
diff --git a/drivers/ultraeth/uet_main.c b/drivers/ultraeth/uet_main.c
index 0ec1dc74abbb..c37f65978ecf 100644
--- a/drivers/ultraeth/uet_main.c
+++ b/drivers/ultraeth/uet_main.c
@@ -4,18 +4,35 @@
#include <linux/module.h>
#include <linux/types.h>
#include <net/ultraeth/uet_context.h>
+#include <net/ultraeth/uecon.h>
#include "uet_netlink.h"
static int __init uet_init(void)
{
- return genl_register_family(&ultraeth_nl_family);
+ int err;
+
+ err = genl_register_family(&ultraeth_nl_family);
+ if (err)
+ goto out_err;
+
+ err = uecon_rtnl_link_register();
+ if (err)
+ goto rtnl_link_err;
+
+ return 0;
+
+rtnl_link_err:
+ genl_unregister_family(&ultraeth_nl_family);
+out_err:
+ return err;
}
static void __exit uet_exit(void)
{
genl_unregister_family(&ultraeth_nl_family);
uet_context_destroy_all();
+ uecon_rtnl_link_unregister();
}
module_init(uet_init);
diff --git a/include/net/ultraeth/uecon.h b/include/net/ultraeth/uecon.h
new file mode 100644
index 000000000000..6316a0557b1f
--- /dev/null
+++ b/include/net/ultraeth/uecon.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+
+#ifndef _UECON_H
+#define _UECON_H
+#include <net/ip_tunnels.h>
+
+#define UECON_DEFAULT_PORT 5432
+
+struct socket;
+struct net_device;
+
+struct uecon_ndev_priv {
+ struct uet_context *context;
+ struct socket __rcu *sock;
+ struct net_device *dev;
+ __be16 udp_port;
+};
+
+extern const struct net_device_ops uecon_netdev_ops;
+int uecon_netdev_init(struct uet_context *ctx);
+void uecon_netdev_uninit(struct uet_context *ctx);
+
+int uecon_rtnl_link_register(void);
+void uecon_rtnl_link_unregister(void);
+
+int uecon_netdev_register(void);
+void uecon_netdev_unregister(void);
+#endif /* _UECON_H */
diff --git a/include/net/ultraeth/uet_context.h b/include/net/ultraeth/uet_context.h
index 7638c768597e..8210f69a1571 100644
--- a/include/net/ultraeth/uet_context.h
+++ b/include/net/ultraeth/uet_context.h
@@ -17,6 +17,7 @@ struct uet_context {
wait_queue_head_t refcnt_wait;
struct list_head list;
+ struct net_device *netdev;
struct uet_job_registry job_reg;
};
@@ -26,5 +27,4 @@ void uet_context_put(struct uet_context *ses_pl);
int uet_context_create(int id);
bool uet_context_destroy(int id);
void uet_context_destroy_all(void);
-
#endif /* _UET_CONTEXT_H */
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 318386cc5b0d..1ba189ecf9da 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1986,4 +1986,12 @@ enum {
#define IFLA_DSA_MAX (__IFLA_DSA_MAX - 1)
+enum {
+ IFLA_UECON_UNSPEC,
+ IFLA_UECON_CONTEXT_ID,
+ IFLA_UECON_PORT,
+ __IFLA_UECON_MAX
+};
+
+#define IFLA_UECON_MAX (__IFLA_UECON_MAX - 1)
#endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
index a6f244de6d75..a4ac25455aa0 100644
--- a/include/uapi/linux/ultraeth.h
+++ b/include/uapi/linux/ultraeth.h
@@ -6,6 +6,7 @@
#include <asm/byteorder.h>
#include <linux/types.h>
+#define UET_DEFAULT_PORT 5432
#define UET_SVC_MAX_LEN 64
enum {
diff --git a/include/uapi/linux/ultraeth_nl.h b/include/uapi/linux/ultraeth_nl.h
index d65521de196a..515044022906 100644
--- a/include/uapi/linux/ultraeth_nl.h
+++ b/include/uapi/linux/ultraeth_nl.h
@@ -11,6 +11,8 @@
enum {
ULTRAETH_A_CONTEXT_ID = 1,
+ ULTRAETH_A_CONTEXT_NETDEV_IFINDEX,
+ ULTRAETH_A_CONTEXT_NETDEV_NAME,
__ULTRAETH_A_CONTEXT_MAX,
ULTRAETH_A_CONTEXT_MAX = (__ULTRAETH_A_CONTEXT_MAX - 1)
--
2.48.1
* [RFC PATCH 06/13] drivers: ultraeth: add initial PDS infrastructure
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (4 preceding siblings ...)
2025-03-06 23:01 ` [RFC PATCH 05/13] drivers: ultraeth: add tunnel udp device support Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 07/13] drivers: ultraeth: add request and ack receive support Nikolay Aleksandrov
` (8 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Add PDS structures as described in the specifications, along with
helpers to access their fields; both are also exposed to user-space.
Add initial kernel PDS structures and routines to manage PDCs. PDC ids
are randomly allocated when a PDC gets created. Each PDC instance is
protected by its own spinlock, and lookups are done under RCU using two
rhashtables:
- one for our local PDC ids
- one for endpoints
Receiving a packet is currently a no-op.
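To make the fast-path use of the endpoint table concrete, here is a
minimal lookup sketch (the helper name is made up for illustration;
struct uet_pdc_key and uet_pds_pdcep_rht_params are introduced by this
patch):

	/* illustrative only: find the PDC assigned to an endpoint */
	static struct uet_pdc *pdcep_lookup_example(struct uet_pds *pds,
						    __be32 src_ip,
						    __be32 dst_ip,
						    u32 job_id)
	{
		struct uet_pdc_key key = {
			.src_ip = src_ip,
			.dst_ip = dst_ip,
			.job_id = job_id,
		};

		/* caller must hold rcu_read_lock() */
		return rhashtable_lookup_fast(&pds->pdcep_hash, &key,
					      uet_pds_pdcep_rht_params);
	}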
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/Makefile | 2 +-
drivers/ultraeth/uecon.c | 23 +-
drivers/ultraeth/uet_context.c | 7 +
drivers/ultraeth/uet_job.c | 1 +
drivers/ultraeth/uet_pdc.c | 124 ++++++++++
drivers/ultraeth/uet_pds.c | 159 +++++++++++++
include/net/ultraeth/uet_context.h | 14 ++
include/net/ultraeth/uet_pdc.h | 79 +++++++
include/net/ultraeth/uet_pds.h | 93 ++++++++
include/uapi/linux/ultraeth.h | 354 +++++++++++++++++++++++++++++
10 files changed, 850 insertions(+), 6 deletions(-)
create mode 100644 drivers/ultraeth/uet_pdc.c
create mode 100644 drivers/ultraeth/uet_pds.c
create mode 100644 include/net/ultraeth/uet_pdc.h
create mode 100644 include/net/ultraeth/uet_pds.h
diff --git a/drivers/ultraeth/Makefile b/drivers/ultraeth/Makefile
index 0035023876ab..f2d6a8569dbf 100644
--- a/drivers/ultraeth/Makefile
+++ b/drivers/ultraeth/Makefile
@@ -1,4 +1,4 @@
obj-$(CONFIG_ULTRAETH) += ultraeth.o
ultraeth-objs := uet_main.o uet_context.o uet_netlink.o uet_job.o \
- uecon.o
+ uecon.o uet_pdc.o uet_pds.o
diff --git a/drivers/ultraeth/uecon.c b/drivers/ultraeth/uecon.c
index 4b74680700af..38f930bf93ec 100644
--- a/drivers/ultraeth/uecon.c
+++ b/drivers/ultraeth/uecon.c
@@ -42,9 +42,9 @@ static netdev_tx_t uecon_ndev_xmit(struct sk_buff *skb, struct net_device *dev)
use_cache ? &info->dst_cache : NULL);
if (IS_ERR(rt)) {
if (PTR_ERR(rt) == -ELOOP)
- dev->stats.collisions++;
+ DEV_STATS_INC(dev, collisions);
else if (PTR_ERR(rt) == -ENETUNREACH)
- dev->stats.tx_carrier_errors++;
+ DEV_STATS_INC(dev, tx_carrier_errors);
goto out_err;
}
@@ -83,7 +83,7 @@ static netdev_tx_t uecon_ndev_xmit(struct sk_buff *skb, struct net_device *dev)
out_err:
rcu_read_unlock();
dev_kfree_skb(skb);
- dev->stats.tx_errors++;
+ DEV_STATS_INC(dev, tx_errors);
return NETDEV_TX_OK;
}
@@ -91,7 +91,11 @@ static netdev_tx_t uecon_ndev_xmit(struct sk_buff *skb, struct net_device *dev)
static int uecon_ndev_encap_recv(struct sock *sk, struct sk_buff *skb)
{
struct uecon_ndev_priv *uecpriv;
- int len;
+ __be32 saddr, daddr;
+ unsigned int len;
+ __be16 dport;
+ __u8 tos;
+ int ret;
uecpriv = rcu_dereference_sk_user_data(sk);
if (!uecpriv)
@@ -100,6 +104,11 @@ static int uecon_ndev_encap_recv(struct sock *sk, struct sk_buff *skb)
if (skb->protocol != htons(ETH_P_IP))
goto drop;
+ saddr = ip_hdr(skb)->saddr;
+ daddr = ip_hdr(skb)->daddr;
+ dport = udp_hdr(skb)->source;
+ tos = ip_hdr(skb)->tos;
+
/* we assume [ tnl ip hdr ] [ tnl udp hdr ] [ pdc hdr ] [ ses hdr ] */
if (iptunnel_pull_header(skb, sizeof(struct udphdr), htons(ETH_P_802_3), false))
goto drop_count;
@@ -109,7 +118,11 @@ static int uecon_ndev_encap_recv(struct sock *sk, struct sk_buff *skb)
skb->pkt_type = PACKET_HOST;
skb->dev = uecpriv->dev;
len = skb->len;
- consume_skb(skb);
+ ret = uet_pds_rx(&uecpriv->context->pds, skb, daddr, saddr, dport, tos);
+ if (ret < 0)
+ goto drop_count;
+ else if (ret == 0)
+ consume_skb(skb);
dev_sw_netstats_rx_add(uecpriv->dev, len);
return 0;
diff --git a/drivers/ultraeth/uet_context.c b/drivers/ultraeth/uet_context.c
index e0d276cb1942..6bdd72344e01 100644
--- a/drivers/ultraeth/uet_context.c
+++ b/drivers/ultraeth/uet_context.c
@@ -107,6 +107,10 @@ int uet_context_create(int id)
if (err)
goto ctx_jobs_err;
+ err = uet_pds_init(&ctx->pds);
+ if (err)
+ goto ctx_pds_err;
+
err = uecon_netdev_init(ctx);
if (err)
goto ctx_netdev_err;
@@ -116,6 +120,8 @@ int uet_context_create(int id)
return 0;
ctx_netdev_err:
+ uet_pds_uninit(&ctx->pds);
+ctx_pds_err:
uet_jobs_uninit(&ctx->job_reg);
ctx_jobs_err:
uet_context_put_id(ctx);
@@ -129,6 +135,7 @@ static void __uet_context_destroy(struct uet_context *ctx)
{
uet_context_unlink(ctx);
uecon_netdev_uninit(ctx);
+ uet_pds_uninit(&ctx->pds);
uet_jobs_uninit(&ctx->job_reg);
uet_context_put_id(ctx);
kfree(ctx);
diff --git a/drivers/ultraeth/uet_job.c b/drivers/ultraeth/uet_job.c
index 3a55a0f70749..4a421dd8e86c 100644
--- a/drivers/ultraeth/uet_job.c
+++ b/drivers/ultraeth/uet_job.c
@@ -55,6 +55,7 @@ static void __job_disassociate(struct uet_job *job)
WRITE_ONCE(fep->job_id, 0);
RCU_INIT_POINTER(job->fep, NULL);
synchronize_rcu();
+ uet_pds_clean_job(&fep->context->pds, job->id);
}
struct uet_job *uet_job_find(struct uet_job_registry *jreg, u32 id)
diff --git a/drivers/ultraeth/uet_pdc.c b/drivers/ultraeth/uet_pdc.c
new file mode 100644
index 000000000000..47cf4c3dee04
--- /dev/null
+++ b/drivers/ultraeth/uet_pdc.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <net/ultraeth/uet_context.h>
+#include <net/ultraeth/uet_pdc.h>
+
+/* use the same approach as nf nat: try a few rounds starting at a random offset */
+static bool uet_pdc_id_get(struct uet_pdc *pdc)
+{
+ int attempts = UET_PDC_ID_MAX_ATTEMPTS, i;
+
+ pdc->spdcid = get_random_u16();
+try_again:
+ for (i = 0; i < attempts; i++, pdc->spdcid++) {
+ if (uet_pds_pdcid_insert(pdc) == 0)
+ return true;
+ }
+
+ if (attempts > 16) {
+ attempts /= 2;
+ pdc->spdcid = get_random_u16();
+ goto try_again;
+ }
+
+ return false;
+}
+
+struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
+ u16 dpdcid, u16 pid_on_fep, u8 mode,
+ u8 tos, __be16 dport,
+ const struct uet_pdc_key *key, bool is_inbound)
+{
+ struct uet_pdc *pdc, *pdc_ins = ERR_PTR(-ENOMEM);
+ IP_TUNNEL_DECLARE_FLAGS(md_flags) = { };
+ int ret __maybe_unused;
+
+ switch (mode) {
+ case UET_PDC_MODE_RUD:
+ break;
+ case UET_PDC_MODE_ROD:
+ fallthrough;
+ case UET_PDC_MODE_RUDI:
+ fallthrough;
+ case UET_PDC_MODE_UUD:
+ fallthrough;
+ default:
+ return ERR_PTR(-EOPNOTSUPP);
+ }
+
+ pdc = kzalloc(sizeof(*pdc), GFP_ATOMIC);
+ if (!pdc)
+ goto err_alloc;
+ memcpy(&pdc->key, key, sizeof(*key));
+ pdc->pds = pds;
+ pdc->mode = mode;
+ pdc->is_initiator = !is_inbound;
+
+ if (!uet_pdc_id_get(pdc))
+ goto err_id_get;
+
+ spin_lock_init(&pdc->lock);
+
+ pdc->rx_base_psn = rx_base_psn;
+ pdc->tx_base_psn = rx_base_psn;
+ pdc->state = state;
+ pdc->dpdcid = dpdcid;
+ pdc->pid_on_fep = pid_on_fep;
+ pdc->metadata = __ip_tun_set_dst(key->src_ip, key->dst_ip, tos, 0, dport,
+ md_flags, 0, 0);
+ if (!pdc->metadata)
+ goto err_tun_dst;
+
+#ifdef CONFIG_DST_CACHE
+ ret = dst_cache_init(&pdc->metadata->u.tun_info.dst_cache, GFP_ATOMIC);
+ if (ret) {
+ pdc_ins = ERR_PTR(ret);
+ goto err_ep_insert;
+ }
+#endif
+ pdc->metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
+
+ if (is_inbound) {
+ /* this PDC is a result of packet Rx */
+ pdc_ins = pdc;
+ goto out;
+ }
+
+ pdc_ins = uet_pds_pdcep_insert(pdc);
+ if (!pdc_ins) {
+ pdc_ins = pdc;
+ } else {
+ /* someone beat us to it or there was an error, either way
+ * we free the newly created pdc and drop the ref
+ */
+ goto err_ep_insert;
+ }
+
+out:
+ return pdc_ins;
+
+err_ep_insert:
+ dst_release(&pdc->metadata->dst);
+err_tun_dst:
+ uet_pds_pdcid_remove(pdc);
+err_id_get:
+ kfree(pdc);
+err_alloc:
+ goto out;
+}
+
+void uet_pdc_free(struct uet_pdc *pdc)
+{
+ dst_release(&pdc->metadata->dst);
+ kfree(pdc);
+}
+
+void uet_pdc_destroy(struct uet_pdc *pdc)
+{
+ uet_pds_pdcep_remove(pdc);
+ uet_pds_pdcid_remove(pdc);
+ uet_pds_pdc_gc_queue(pdc);
+}
diff --git a/drivers/ultraeth/uet_pds.c b/drivers/ultraeth/uet_pds.c
new file mode 100644
index 000000000000..4aec61eeb230
--- /dev/null
+++ b/drivers/ultraeth/uet_pds.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/bug.h>
+
+#include <net/ultraeth/uet_context.h>
+#include <net/ultraeth/uet_pdc.h>
+
+static const struct rhashtable_params uet_pds_pdcid_rht_params = {
+ .head_offset = offsetof(struct uet_pdc, pdcid_node),
+ .key_offset = offsetof(struct uet_pdc, spdcid),
+ .key_len = sizeof(u16),
+ .nelem_hint = 2048,
+ .max_size = UET_PDC_MAX_ID,
+ .automatic_shrinking = true,
+};
+
+static const struct rhashtable_params uet_pds_pdcep_rht_params = {
+ .head_offset = offsetof(struct uet_pdc, pdcep_node),
+ .key_offset = offsetof(struct uet_pdc, key),
+ .key_len = sizeof(struct uet_pdc_key),
+ .nelem_hint = 2048,
+ .automatic_shrinking = true,
+};
+
+static void uet_pds_pdc_gc_flush(struct uet_pds *pds)
+{
+ HLIST_HEAD(deleted_head);
+ struct hlist_node *tmp;
+ struct uet_pdc *pdc;
+
+ spin_lock_bh(&pds->gc_lock);
+ hlist_move_list(&pds->pdc_gc_list, &deleted_head);
+ spin_unlock_bh(&pds->gc_lock);
+
+ synchronize_rcu();
+
+ hlist_for_each_entry_safe(pdc, tmp, &deleted_head, gc_node)
+ uet_pdc_free(pdc);
+}
+
+static void uet_pds_pdc_gc_work(struct work_struct *work)
+{
+ struct uet_pds *pds = container_of(work, struct uet_pds, pdc_gc_work);
+
+ uet_pds_pdc_gc_flush(pds);
+}
+
+void uet_pds_pdc_gc_queue(struct uet_pdc *pdc)
+{
+ struct uet_pds *pds = pdc->pds;
+
+ spin_lock_bh(&pds->gc_lock);
+ if (hlist_unhashed(&pdc->gc_node))
+ hlist_add_head(&pdc->gc_node, &pds->pdc_gc_list);
+ spin_unlock_bh(&pds->gc_lock);
+
+ queue_work(system_long_wq, &pds->pdc_gc_work);
+}
+
+int uet_pds_init(struct uet_pds *pds)
+{
+ int ret;
+
+ spin_lock_init(&pds->gc_lock);
+ INIT_HLIST_HEAD(&pds->pdc_gc_list);
+ INIT_WORK(&pds->pdc_gc_work, uet_pds_pdc_gc_work);
+
+ ret = rhashtable_init(&pds->pdcid_hash, &uet_pds_pdcid_rht_params);
+ if (ret)
+ goto err_pdcid_hash;
+
+ ret = rhashtable_init(&pds->pdcep_hash, &uet_pds_pdcep_rht_params);
+ if (ret)
+ goto err_pdcep_hash;
+
+ return 0;
+
+err_pdcep_hash:
+ rhashtable_destroy(&pds->pdcid_hash);
+err_pdcid_hash:
+ return ret;
+}
+
+struct uet_pdc *uet_pds_pdcep_insert(struct uet_pdc *pdc)
+{
+ struct uet_pds *pds = pdc->pds;
+
+ return rhashtable_lookup_get_insert_fast(&pds->pdcep_hash,
+ &pdc->pdcep_node,
+ uet_pds_pdcep_rht_params);
+}
+
+void uet_pds_pdcep_remove(struct uet_pdc *pdc)
+{
+ struct uet_pds *pds = pdc->pds;
+
+ rhashtable_remove_fast(&pds->pdcep_hash, &pdc->pdcep_node,
+ uet_pds_pdcep_rht_params);
+}
+
+int uet_pds_pdcid_insert(struct uet_pdc *pdc)
+{
+ struct uet_pds *pds = pdc->pds;
+
+ return rhashtable_insert_fast(&pds->pdcid_hash, &pdc->pdcid_node,
+ uet_pds_pdcid_rht_params);
+}
+
+void uet_pds_pdcid_remove(struct uet_pdc *pdc)
+{
+ struct uet_pds *pds = pdc->pds;
+
+ rhashtable_remove_fast(&pds->pdcid_hash, &pdc->pdcid_node,
+ uet_pds_pdcid_rht_params);
+}
+
+static void uet_pds_pdcep_hash_free(void *ptr, void *arg)
+{
+ struct uet_pdc *pdc = ptr;
+
+ uet_pdc_destroy(pdc);
+}
+
+void uet_pds_uninit(struct uet_pds *pds)
+{
+ rhashtable_free_and_destroy(&pds->pdcep_hash, uet_pds_pdcep_hash_free, NULL);
+ /* the above call should also release all PDC ids */
+ WARN_ON(atomic_read(&pds->pdcid_hash.nelems));
+ rhashtable_destroy(&pds->pdcid_hash);
+ uet_pds_pdc_gc_flush(pds);
+ cancel_work_sync(&pds->pdc_gc_work);
+ rcu_barrier();
+}
+
+void uet_pds_clean_job(struct uet_pds *pds, u32 job_id)
+{
+ struct rhashtable_iter iter;
+ struct uet_pdc *pdc;
+
+ rhashtable_walk_enter(&pds->pdcid_hash, &iter);
+ rhashtable_walk_start(&iter);
+ while ((pdc = rhashtable_walk_next(&iter))) {
+ if (pdc->key.job_id == job_id)
+ uet_pdc_destroy(pdc);
+ }
+ rhashtable_walk_stop(&iter);
+ rhashtable_walk_exit(&iter);
+}
+
+int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
+ __be32 remote_fep_addr, __be16 dport, __u8 tos)
+{
+ if (!pskb_may_pull(skb, sizeof(struct uet_prologue_hdr)))
+ return -EINVAL;
+
+ return 0;
+}
diff --git a/include/net/ultraeth/uet_context.h b/include/net/ultraeth/uet_context.h
index 8210f69a1571..76077df3bce6 100644
--- a/include/net/ultraeth/uet_context.h
+++ b/include/net/ultraeth/uet_context.h
@@ -10,6 +10,7 @@
#include <linux/refcount.h>
#include <linux/wait.h>
#include <net/ultraeth/uet_job.h>
+#include <net/ultraeth/uet_pds.h>
struct uet_context {
int id;
@@ -19,6 +20,7 @@ struct uet_context {
struct net_device *netdev;
struct uet_job_registry job_reg;
+ struct uet_pds pds;
};
struct uet_context *uet_context_get_by_id(int id);
@@ -27,4 +29,16 @@ void uet_context_put(struct uet_context *ses_pl);
int uet_context_create(int id);
bool uet_context_destroy(int id);
void uet_context_destroy_all(void);
+
+static inline struct uet_context *pds_context(const struct uet_pds *pds)
+{
+ return container_of(pds, struct uet_context, pds);
+}
+
+static inline struct net_device *pds_netdev(const struct uet_pds *pds)
+{
+ struct uet_context *ctx = pds_context(pds);
+
+ return ctx->netdev;
+}
#endif /* _UET_CONTEXT_H */
diff --git a/include/net/ultraeth/uet_pdc.h b/include/net/ultraeth/uet_pdc.h
new file mode 100644
index 000000000000..70f3c6aa03df
--- /dev/null
+++ b/include/net/ultraeth/uet_pdc.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+
+#ifndef _UECON_PDC_H
+#define _UECON_PDC_H
+
+#include <linux/rhashtable.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+#include <linux/limits.h>
+#include <linux/refcount.h>
+#include <net/dst.h>
+#include <net/dst_metadata.h>
+
+#define UET_PDC_ID_MAX_ATTEMPTS 128
+#define UET_PDC_MAX_ID U16_MAX
+#define UET_PDC_MPR 128
+
+#define UET_SKB_CB(skb) ((struct uet_skb_cb *)&((skb)->cb[0]))
+
+struct uet_skb_cb {
+ u32 psn;
+ __be32 remote_fep_addr;
+};
+
+enum {
+ UET_PDC_EP_STATE_CLOSED,
+ UET_PDC_EP_STATE_SYN_SENT,
+ UET_PDC_EP_STATE_NEW_ESTABLISHED,
+ UET_PDC_EP_STATE_ESTABLISHED,
+ UET_PDC_EP_STATE_QUIESCE,
+ UET_PDC_EP_STATE_ACK_WAIT,
+ UET_PDC_EP_STATE_CLOSE_ACK_WAIT
+};
+
+struct uet_pdc_key {
+ __be32 src_ip;
+ __be32 dst_ip;
+ u32 job_id;
+};
+
+enum {
+ UET_PDC_MODE_ROD,
+ UET_PDC_MODE_RUD,
+ UET_PDC_MODE_RUDI,
+ UET_PDC_MODE_UUD
+};
+
+struct uet_pdc {
+ struct rhash_head pdcid_node;
+ struct rhash_head pdcep_node;
+ struct uet_pdc_key key;
+ struct uet_pds *pds;
+
+ struct metadata_dst *metadata;
+
+ spinlock_t lock;
+ u32 psn_start;
+ u16 state;
+ u16 spdcid;
+ u16 dpdcid;
+ u16 pid_on_fep;
+ u8 tx_busy;
+ u8 mode;
+ bool is_initiator;
+
+ u32 rx_base_psn;
+ u32 tx_base_psn;
+
+ struct hlist_node gc_node;
+ struct rcu_head rcu;
+};
+
+struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
+ u16 dpdcid, u16 pid_on_fep, u8 mode,
+ u8 tos, __be16 dport,
+ const struct uet_pdc_key *key, bool is_inbound);
+void uet_pdc_destroy(struct uet_pdc *pdc);
+void uet_pdc_free(struct uet_pdc *pdc);
+#endif /* _UECON_PDC_H */
diff --git a/include/net/ultraeth/uet_pds.h b/include/net/ultraeth/uet_pds.h
new file mode 100644
index 000000000000..43f5748a318a
--- /dev/null
+++ b/include/net/ultraeth/uet_pds.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+
+#ifndef _UECON_PDS_H
+#define _UECON_PDS_H
+
+#include <linux/types.h>
+#include <linux/rhashtable.h>
+#include <uapi/linux/ultraeth.h>
+#include <linux/skbuff.h>
+
+/**
+ * struct uet_pds - Packet Delivery Sublayer state structure
+ *
+ * @pdcep_hash: a hash table mapping <src ip, dst ip, job id> to a PDC
+ * @pdcid_hash: a hash table mapping a local PDC id to a PDC
+ * @gc_lock: protects @pdc_gc_list
+ * @pdc_gc_list: list of destroyed PDCs waiting to be freed
+ * @pdc_gc_work: work item which frees the PDCs on @pdc_gc_list after
+ * an RCU grace period
+ *
+ * @pdcep_hash is used in the fast path to find the assigned PDC, @pdcid_hash
+ * is used when allocating a new PDC
+ */
+struct uet_pds {
+ struct rhashtable pdcep_hash;
+ struct rhashtable pdcid_hash;
+
+ spinlock_t gc_lock;
+ struct hlist_head pdc_gc_list;
+ struct work_struct pdc_gc_work;
+};
+
+struct uet_pdc *uet_pds_pdcep_insert(struct uet_pdc *pdc);
+void uet_pds_pdcep_remove(struct uet_pdc *pdc);
+
+int uet_pds_pdcid_insert(struct uet_pdc *pdc);
+void uet_pds_pdcid_remove(struct uet_pdc *pdc);
+
+int uet_pds_init(struct uet_pds *pds);
+void uet_pds_uninit(struct uet_pds *pds);
+
+void uet_pds_pdc_gc_queue(struct uet_pdc *pdc);
+void uet_pds_clean_job(struct uet_pds *pds, u32 job_id);
+
+int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
+ __be32 remote_fep_addr, __be16 dport, __u8 tos);
+
+static inline struct uet_prologue_hdr *pds_prologue_hdr(const struct sk_buff *skb)
+{
+ return (struct uet_prologue_hdr *)skb_network_header(skb);
+}
+
+static inline struct uet_pds_req_hdr *pds_req_hdr(const struct sk_buff *skb)
+{
+ return (struct uet_pds_req_hdr *)skb_network_header(skb);
+}
+
+static inline struct uet_pds_ack_hdr *pds_ack_hdr(const struct sk_buff *skb)
+{
+ return (struct uet_pds_ack_hdr *)skb_network_header(skb);
+}
+
+static inline struct uet_pds_nack_hdr *pds_nack_hdr(const struct sk_buff *skb)
+{
+ return (struct uet_pds_nack_hdr *)skb_network_header(skb);
+}
+
+static inline struct uet_pds_ack_ext_hdr *pds_ack_ext_hdr(const struct sk_buff *skb)
+{
+ return (struct uet_pds_ack_ext_hdr *)(pds_ack_hdr(skb) + 1);
+}
+
+static inline struct uet_ses_rsp_hdr *pds_ack_ses_rsp_hdr(const struct sk_buff *skb)
+{
+ /* TODO: ack_ext_hdr, CC_STATE, etc. */
+ return (struct uet_ses_rsp_hdr *)(pds_ack_hdr(skb) + 1);
+}
+
+static inline struct uet_ses_req_hdr *pds_req_ses_req_hdr(const struct sk_buff *skb)
+{
+ /* TODO: ack_ext_hdr, CC_STATE, etc. */
+ return (struct uet_ses_req_hdr *)(pds_req_hdr(skb) + 1);
+}
+
+static inline __be16 pds_ses_rsp_hdr_pack(__u8 opcode, __u8 version, __u8 list,
+ __u8 ses_rc)
+{
+ return cpu_to_be16((opcode & UET_SES_RSP_OPCODE_MASK) <<
+ UET_SES_RSP_OPCODE_SHIFT |
+ (version & UET_SES_RSP_VERSION_MASK) <<
+ UET_SES_RSP_VERSION_SHIFT |
+ (list & UET_SES_RSP_LIST_MASK) <<
+ UET_SES_RSP_LIST_SHIFT |
+ (ses_rc & UET_SES_RSP_RC_MASK) <<
+ UET_SES_RSP_RC_SHIFT);
+}
+#endif /* _UECON_PDS_H */
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
index a4ac25455aa0..6f3ee5ac8cf4 100644
--- a/include/uapi/linux/ultraeth.h
+++ b/include/uapi/linux/ultraeth.h
@@ -9,6 +9,360 @@
#define UET_DEFAULT_PORT 5432
#define UET_SVC_MAX_LEN 64
+/* types used for prologue's type field */
+enum {
+ UET_PDS_TYPE_RSVD0,
+ UET_PDS_TYPE_ENC_HDR,
+ UET_PDS_TYPE_RUD_REQ,
+ UET_PDS_TYPE_ROD_REQ,
+ UET_PDS_TYPE_RUDI_REQ,
+ UET_PDS_TYPE_RUDI_RESPONSE,
+ UET_PDS_TYPE_UUD_REQ,
+ UET_PDS_TYPE_ACK,
+ UET_PDS_TYPE_ACK_CC,
+ UET_PDS_TYPE_ACK_CCX,
+ UET_PDS_TYPE_NACK,
+ UET_PDS_TYPE_CTRL_MSG
+};
+
+/* ctl_type when type is UET_PDS_TYPE_CTRL_MSG (control message) */
+enum {
+ UET_CTL_TYPE_NOOP,
+ UET_CTL_TYPE_REQ_ACK,
+ UET_CTL_TYPE_CLEAR,
+ UET_CTL_TYPE_REQ_CLEAR,
+ UET_CTL_TYPE_CLOSE,
+ UET_CTL_TYPE_REQ_CLOSE,
+ UET_CTL_TYPE_PROBE,
+ UET_CTL_TYPE_CREDIT,
+ UET_CTL_TYPE_REQ_CREDIT
+};
+
+/* next header, 0x07-0x0E reserved */
+enum {
+ UET_PDS_NEXT_HDR_NONE = 0x00,
+ UET_PDS_NEXT_HDR_REQ_SMALL = 0x01,
+ UET_PDS_NEXT_HDR_REQ_MEDIUM = 0x02,
+ UET_PDS_NEXT_HDR_REQ_STD = 0x03,
+ UET_PDS_NEXT_HDR_RSP = 0x04,
+ UET_PDS_NEXT_HDR_RSP_DATA = 0x05,
+ UET_PDS_NEXT_HDR_RSP_DATA_SMALL = 0x06,
+ UET_PDS_NEXT_HDR_PDS = 0x0F,
+};
+
+/* fields(union): type_next_flags, type_ctl_flags */
+#define UET_PROLOGUE_FLAGS_BITS 7
+#define UET_PROLOGUE_FLAGS_MASK 0x7f
+#define UET_PROLOGUE_NEXT_BITS 4
+#define UET_PROLOGUE_NEXT_MASK 0x0f
+#define UET_PROLOGUE_NEXT_SHIFT UET_PROLOGUE_FLAGS_BITS
+#define UET_PROLOGUE_CTL_BITS UET_PROLOGUE_NEXT_BITS
+#define UET_PROLOGUE_CTL_SHIFT UET_PROLOGUE_NEXT_SHIFT
+#define UET_PROLOGUE_CTL_MASK UET_PROLOGUE_NEXT_MASK
+#define UET_PROLOGUE_TYPE_BITS 5
+#define UET_PROLOGUE_TYPE_MASK 0x1f
+#define UET_PROLOGUE_TYPE_SHIFT (UET_PROLOGUE_NEXT_SHIFT + UET_PROLOGUE_NEXT_BITS)
+struct uet_prologue_hdr {
+ union {
+ __be16 type_next_flags;
+ __be16 type_ctl_flags;
+ };
+} __attribute__ ((__packed__));
+
+static inline __u8 uet_prologue_flags(const struct uet_prologue_hdr *hdr)
+{
+ return __be16_to_cpu(hdr->type_next_flags) & UET_PROLOGUE_FLAGS_MASK;
+}
+
+static inline __u8 uet_prologue_next_hdr(const struct uet_prologue_hdr *hdr)
+{
+ return (__be16_to_cpu(hdr->type_next_flags) >> UET_PROLOGUE_NEXT_SHIFT) &
+ UET_PROLOGUE_NEXT_MASK;
+}
+
+static inline __u8 uet_prologue_ctl_type(const struct uet_prologue_hdr *hdr)
+{
+ return (__be16_to_cpu(hdr->type_ctl_flags) >> UET_PROLOGUE_CTL_SHIFT) &
+ UET_PROLOGUE_CTL_MASK;
+}
+
+static inline __u8 uet_prologue_type(const struct uet_prologue_hdr *hdr)
+{
+ return (__be16_to_cpu(hdr->type_next_flags) >> UET_PROLOGUE_TYPE_SHIFT) &
+ UET_PROLOGUE_TYPE_MASK;
+}
+
+/* rud/rod request flags */
+enum {
+ UET_PDS_REQ_FLAG_RSV2 = (1 << 0),
+ UET_PDS_REQ_FLAG_CC = (1 << 1),
+ UET_PDS_REQ_FLAG_SYN = (1 << 2),
+ UET_PDS_REQ_FLAG_AR = (1 << 3),
+ UET_PDS_REQ_FLAG_RETX = (1 << 4),
+ UET_PDS_REQ_FLAG_RSV = (1 << 5),
+ UET_PDS_REQ_FLAG_CRC = (1 << 6),
+};
+
+/* field: pdc_mode_psn_offset */
+#define UET_PDS_REQ_PSN_OFF_BITS 12
+#define UET_PDS_REQ_PSN_OFF_MASK 0xfff
+#define UET_PDS_REQ_MODE_BITS 4
+#define UET_PDS_REQ_MODE_MASK 0xf
+#define UET_PDS_REQ_MODE_SHIFT UET_PDS_REQ_PSN_OFF_BITS
+struct uet_pds_req_hdr {
+ struct uet_prologue_hdr prologue;
+ __be16 clear_psn_offset;
+ __be32 psn;
+ __be16 spdcid;
+ union {
+ __be16 pdc_mode_psn_offset;
+ __be16 dpdcid;
+ };
+} __attribute__ ((__packed__));
+
+static inline __u16 uet_pds_request_psn_offset(const struct uet_pds_req_hdr *req)
+{
+ return __be16_to_cpu(req->pdc_mode_psn_offset) & UET_PDS_REQ_PSN_OFF_MASK;
+}
+
+static inline __u8 uet_pds_request_pdc_mode(const struct uet_pds_req_hdr *req)
+{
+ return (__be16_to_cpu(req->pdc_mode_psn_offset) >> UET_PDS_REQ_MODE_SHIFT) &
+ UET_PDS_REQ_MODE_MASK;
+}
+
+/* rud/rod ack flags */
+enum {
+ UET_PDS_ACK_FLAG_RSVD = (1 << 0),
+ UET_PDS_ACK_FLAG_REQ1 = (1 << 1),
+ UET_PDS_ACK_FLAG_REQ2 = (1 << 2),
+ UET_PDS_ACK_FLAG_P = (1 << 3),
+ UET_PDS_ACK_FLAG_RETX = (1 << 4),
+ UET_PDS_ACK_FLAG_M = (1 << 5),
+ UET_PDS_ACK_FLAG_CRC = (1 << 6)
+};
+
+struct uet_pds_ack_hdr {
+ struct uet_prologue_hdr prologue;
+ __be16 ack_psn_offset;
+ __be32 cack_psn;
+ __be16 spdcid;
+ __be16 dpdcid;
+} __attribute__ ((__packed__));
+
+/* ses request op codes */
+enum {
+ UET_SES_REQ_OP_NOOP = 0x00,
+ UET_SES_REQ_OP_WRITE = 0x01,
+ UET_SES_REQ_OP_READ = 0x02,
+ UET_SES_REQ_OP_ATOMIC = 0x03,
+ UET_SES_REQ_OP_FETCHING_ATOMIC = 0x04,
+ UET_SES_REQ_OP_SEND = 0x05,
+ UET_SES_REQ_OP_RENDEZVOUS_SEND = 0x06,
+ UET_SES_REQ_OP_DGRAM_SEND = 0x07,
+ UET_SES_REQ_OP_DEFERRABLE_SEND = 0x08,
+ UET_SES_REQ_OP_TAGGED_SEND = 0x09,
+ UET_SES_REQ_OP_RENDEZVOUS_TSEND = 0x0A,
+ UET_SES_REQ_OP_DEFERRABLE_TSEND = 0x0B,
+ UET_SES_REQ_OP_DEFERRABLE_RTR = 0x0C,
+ UET_SES_REQ_OP_TSEND_ATOMIC = 0x0D,
+ UET_SES_REQ_OP_TSEND_FETCH_ATOMIC = 0x0E,
+ UET_SES_REQ_OP_MSG_ERROR = 0x0F,
+ UET_SES_REQ_OP_INC_PUSH = 0x10,
+};
+
+enum {
+ UET_SES_REQ_FLAG_SOM = (1 << 0),
+ UET_SES_REQ_FLAG_EOM = (1 << 1),
+ UET_SES_REQ_FLAG_HD = (1 << 2),
+ UET_SES_REQ_FLAG_RELATIVE = (1 << 3),
+ UET_SES_REQ_FLAG_IE = (1 << 4),
+ UET_SES_REQ_FLAG_DC = (1 << 5)
+};
+
+/* field: resv_opcode */
+#define UET_SES_REQ_OPCODE_MASK 0x3f
+/* field: flags */
+#define UET_SES_REQ_FLAGS_MASK 0x3f
+#define UET_SES_REQ_FLAGS_VERSION_MASK 0x3
+#define UET_SES_REQ_FLAGS_VERSION_SHIFT 6
+/* field: resv_idx */
+#define UET_SES_REQ_INDEX_MASK 0xfff
+/* field: idx_gen_job_id */
+#define UET_SES_REQ_JOB_ID_BITS 24
+#define UET_SES_REQ_JOB_ID_MASK 0xffffff
+#define UET_SES_REQ_INDEX_GEN_MASK 0xff
+#define UET_SES_REQ_INDEX_GEN_SHIFT UET_SES_REQ_JOB_ID_BITS
+/* field: resv_pid_on_fep */
+#define UET_SES_REQ_PID_ON_FEP_MASK 0xfff
+struct uet_ses_req_hdr {
+ __u8 resv_opcode;
+ __u8 flags;
+ __be16 msg_id;
+ __be32 idx_gen_job_id;
+ __be16 resv_pid_on_fep;
+ __be16 resv_idx;
+ __be64 buffer_offset;
+ __be32 initiator;
+ __be64 match_bits;
+ __be64 header_data;
+ __be32 request_len;
+} __attribute__ ((__packed__));
+
+static inline __u8 uet_ses_req_opcode(const struct uet_ses_req_hdr *sreq)
+{
+ return sreq->resv_opcode & UET_SES_REQ_OPCODE_MASK;
+}
+
+static inline __u8 uet_ses_req_flags(const struct uet_ses_req_hdr *sreq)
+{
+ return sreq->flags & UET_SES_REQ_FLAGS_MASK;
+}
+
+static inline __u8 uet_ses_req_version(const struct uet_ses_req_hdr *sreq)
+{
+ return (sreq->flags >> UET_SES_REQ_FLAGS_VERSION_SHIFT) &
+ UET_SES_REQ_FLAGS_VERSION_MASK;
+}
+
+static inline __u16 uet_ses_req_index(const struct uet_ses_req_hdr *sreq)
+{
+ return __be16_to_cpu(sreq->resv_idx) & UET_SES_REQ_INDEX_MASK;
+}
+
+static inline __u32 uet_ses_req_job_id(const struct uet_ses_req_hdr *sreq)
+{
+ return __be32_to_cpu(sreq->idx_gen_job_id) & UET_SES_REQ_JOB_ID_MASK;
+}
+
+static inline __u8 uet_ses_req_index_gen(const struct uet_ses_req_hdr *sreq)
+{
+ return (__be32_to_cpu(sreq->idx_gen_job_id) >> UET_SES_REQ_INDEX_GEN_SHIFT) &
+ UET_SES_REQ_INDEX_GEN_MASK;
+}
+
+static inline __u16 uet_ses_req_pid_on_fep(const struct uet_ses_req_hdr *sreq)
+{
+ return __be16_to_cpu(sreq->resv_pid_on_fep) & UET_SES_REQ_PID_ON_FEP_MASK;
+}
+
+/* return codes */
+enum {
+ UET_SES_RSP_RC_NULL = 0x00,
+ UET_SES_RSP_RC_OK = 0x01,
+ UET_SES_RSP_RC_BAD_GEN = 0x02,
+ UET_SES_RSP_RC_DISABLED = 0x03,
+ UET_SES_RSP_RC_DISABLED_GEN = 0x04,
+ UET_SES_RSP_RC_NO_MATCH = 0x05,
+ UET_SES_RSP_RC_UNSUPP_OP = 0x06,
+ UET_SES_RSP_RC_UNSUPP_SIZE = 0x07,
+ UET_SES_RSP_RC_AT_INVALID = 0x08,
+ UET_SES_RSP_RC_AT_PERM = 0x09,
+ UET_SES_RSP_RC_AT_ATS_ERROR = 0x0A,
+ UET_SES_RSP_RC_AT_NO_TRANS = 0x0B,
+ UET_SES_RSP_RC_AT_OUT_OF_RANGE = 0x0C,
+ UET_SES_RSP_RC_HOST_POISONED = 0x0D,
+ UET_SES_RSP_RC_HOST_UNSUCC_CMPL = 0x0E,
+ UET_SES_RSP_RC_AMO_UNSUPP_OP = 0x0F,
+ UET_SES_RSP_RC_AMO_UNSUPP_DT = 0x10,
+ UET_SES_RSP_RC_AMO_UNSUPP_SIZE = 0x11,
+ UET_SES_RSP_RC_AMO_UNALIGNED = 0x12,
+ UET_SES_RSP_RC_AMO_FP_NAN = 0x13,
+ UET_SES_RSP_RC_AMO_FP_UNDERFLOW = 0x14,
+ UET_SES_RSP_RC_AMO_FP_OVERFLOW = 0x15,
+ UET_SES_RSP_RC_AMO_FP_INEXACT = 0x16,
+ UET_SES_RSP_RC_PERM_VIOLATION = 0x17,
+ UET_SES_RSP_RC_OP_VIOLATION = 0x18,
+ UET_SES_RSP_RC_BAD_INDEX = 0x19,
+ UET_SES_RSP_RC_BAD_PID = 0x1A,
+ UET_SES_RSP_RC_BAD_JOB_ID = 0x1B,
+ UET_SES_RSP_RC_BAD_MKEY = 0x1C,
+ UET_SES_RSP_RC_BAD_ADDR = 0x1D,
+ UET_SES_RSP_RC_CANCELLED = 0x1E,
+ UET_SES_RSP_RC_UNDELIVERABLE = 0x1F,
+ UET_SES_RSP_RC_UNCOR = 0x20,
+ UET_SES_RSP_RC_UNCOR_TRNSNT = 0x21,
+ UET_SES_RSP_RC_TOO_LONG = 0x22,
+ UET_SES_RSP_RC_INITIATOR_ERR = 0x23,
+ UET_SES_RSP_RC_DROPPED = 0x24,
+};
+
+/* ses response list values */
+enum {
+ UET_SES_RSP_LIST_EXPECTED = 0x00,
+ UET_SES_RSP_LIST_OVERFLOW = 0x01
+};
+
+/* ses response op codes */
+enum {
+ UET_SES_RSP_OP_DEF_RESP = 0x00,
+ UET_SES_RSP_OP_RESPONSE = 0x01,
+ UET_SES_RSP_OP_RESP_W_DATA = 0x02
+};
+
+/* field: lst_opcode_ver_rc */
+#define UET_SES_RSP_RC_BITS 6
+#define UET_SES_RSP_RC_MASK 0x3f
+#define UET_SES_RSP_RC_SHIFT 0
+#define UET_SES_RSP_VERSION_BITS 2
+#define UET_SES_RSP_VERSION_MASK 0x3
+#define UET_SES_RSP_VERSION_SHIFT (UET_SES_RSP_RC_SHIFT + \
+ UET_SES_RSP_RC_BITS)
+#define UET_SES_RSP_OPCODE_BITS 6
+#define UET_SES_RSP_OPCODE_MASK 0x3f
+#define UET_SES_RSP_OPCODE_SHIFT (UET_SES_RSP_VERSION_SHIFT + \
+ UET_SES_RSP_VERSION_BITS)
+#define UET_SES_RSP_LIST_BITS 2
+#define UET_SES_RSP_LIST_MASK 0x3
+#define UET_SES_RSP_LIST_SHIFT (UET_SES_RSP_OPCODE_SHIFT + \
+ UET_SES_RSP_OPCODE_BITS)
+/* field: idx_gen_job_id */
+#define UET_SES_RSP_JOB_ID_BITS 24
+#define UET_SES_RSP_JOB_ID_MASK 0xffffff
+#define UET_SES_RSP_INDEX_GEN_MASK 0xff
+#define UET_SES_RSP_INDEX_GEN_SHIFT UET_SES_RSP_JOB_ID_BITS
+struct uet_ses_rsp_hdr {
+ __be16 lst_opcode_ver_rc;
+ __be16 msg_id;
+ __be32 idx_gen_job_id;
+ __be32 mod_len;
+} __attribute__ ((__packed__));
+
+static inline __u8 uet_ses_rsp_rc(const struct uet_ses_rsp_hdr *rsp)
+{
+ return (__be16_to_cpu(rsp->lst_opcode_ver_rc) >>
+ UET_SES_RSP_RC_SHIFT) & UET_SES_RSP_RC_MASK;
+}
+
+static inline __u8 uet_ses_rsp_list(const struct uet_ses_rsp_hdr *rsp)
+{
+ return (__be16_to_cpu(rsp->lst_opcode_ver_rc) >>
+ UET_SES_RSP_LIST_SHIFT) & UET_SES_RSP_LIST_MASK;
+}
+
+static inline __u8 uet_ses_rsp_version(const struct uet_ses_rsp_hdr *rsp)
+{
+ return (__be16_to_cpu(rsp->lst_opcode_ver_rc) >>
+ UET_SES_RSP_VERSION_SHIFT) & UET_SES_RSP_VERSION_MASK;
+}
+
+static inline __u8 uet_ses_rsp_opcode(const struct uet_ses_rsp_hdr *rsp)
+{
+ return (__be16_to_cpu(rsp->lst_opcode_ver_rc) >>
+ UET_SES_RSP_OPCODE_SHIFT) & UET_SES_RSP_OPCODE_MASK;
+}
+
+static inline __u32 uet_ses_rsp_job_id(const struct uet_ses_rsp_hdr *rsp)
+{
+ return __be32_to_cpu(rsp->idx_gen_job_id) & UET_SES_RSP_JOB_ID_MASK;
+}
+
+static inline __u8 uet_ses_rsp_index_gen(const struct uet_ses_rsp_hdr *rsp)
+{
+ return (__be32_to_cpu(rsp->idx_gen_job_id) >> UET_SES_RSP_INDEX_GEN_SHIFT) &
+ UET_SES_RSP_INDEX_GEN_MASK;
+}
+
enum {
UET_ADDR_F_VALID_FEP_CAP = (1 << 0),
UET_ADDR_F_VALID_ADDR = (1 << 1),
--
2.48.1
* [RFC PATCH 07/13] drivers: ultraeth: add request and ack receive support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (5 preceding siblings ...)
2025-03-06 23:01 ` [RFC PATCH 06/13] drivers: ultraeth: add initial PDS infrastructure Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 08/13] drivers: ultraeth: add request transmit support Nikolay Aleksandrov
` (7 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Add support for receiving request and ack packets. A PDC is
automatically created if a request with the SYN flag is received. If
the request passes all validations, we automatically return an ack.
Currently only the RUD (unordered) request type is expected. The
received and acked packet sequence numbers are tracked via bitmaps.
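For reference, the PSN window tracking added here boils down to the
following (condensed from the Rx path in this patch, error handling
omitted):

	/* each set bit in rx_bitmap marks a received PSN at offset
	 * req_psn - rx_base_psn within a window of UET_PDC_MPR PSNs
	 */
	psn_bit = req_psn - pdc->rx_base_psn;
	if (!psn_bit_valid(psn_bit) ||
	    test_and_set_bit(psn_bit, pdc->rx_bitmap))
		goto drop; /* out of window or duplicate */

	/* the window slides forward one PSN at a time, and only once
	 * its oldest outstanding PSN has been received
	 */
	if (test_bit(0, pdc->rx_bitmap)) {
		bitmap_shift_right(pdc->rx_bitmap, pdc->rx_bitmap, 1,
				   UET_PDC_MPR);
		pdc->rx_base_psn++;
	}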
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/uet_pdc.c | 285 +++++++++++++++++++++++++++++++++
drivers/ultraeth/uet_pds.c | 137 +++++++++++++++-
include/net/ultraeth/uet_pdc.h | 38 +++++
3 files changed, 458 insertions(+), 2 deletions(-)
diff --git a/drivers/ultraeth/uet_pdc.c b/drivers/ultraeth/uet_pdc.c
index 47cf4c3dee04..a0352a925329 100644
--- a/drivers/ultraeth/uet_pdc.c
+++ b/drivers/ultraeth/uet_pdc.c
@@ -6,6 +6,19 @@
#include <net/ultraeth/uet_context.h>
#include <net/ultraeth/uet_pdc.h>
+static void uet_pdc_xmit(struct uet_pdc *pdc, struct sk_buff *skb)
+{
+ skb->dev = pds_netdev(pdc->pds);
+
+ if (!dst_hold_safe(&pdc->metadata->dst)) {
+ kfree_skb(skb);
+ return;
+ }
+
+ skb_dst_set(skb, &pdc->metadata->dst);
+ dev_queue_xmit(skb);
+}
+
/* use the approach as nf nat, try a few rounds starting at random offset */
static bool uet_pdc_id_get(struct uet_pdc *pdc)
{
@@ -67,6 +80,12 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
pdc->state = state;
pdc->dpdcid = dpdcid;
pdc->pid_on_fep = pid_on_fep;
+ pdc->rx_bitmap = bitmap_zalloc(UET_PDC_MPR, GFP_ATOMIC);
+ if (!pdc->rx_bitmap)
+ goto err_rx_bitmap;
+ pdc->ack_bitmap = bitmap_zalloc(UET_PDC_MPR, GFP_ATOMIC);
+ if (!pdc->ack_bitmap)
+ goto err_ack_bitmap;
pdc->metadata = __ip_tun_set_dst(key->src_ip, key->dst_ip, tos, 0, dport,
md_flags, 0, 0);
if (!pdc->metadata)
@@ -103,6 +122,10 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
err_ep_insert:
dst_release(&pdc->metadata->dst);
err_tun_dst:
+ bitmap_free(pdc->ack_bitmap);
+err_ack_bitmap:
+ bitmap_free(pdc->rx_bitmap);
+err_rx_bitmap:
uet_pds_pdcid_remove(pdc);
err_id_get:
kfree(pdc);
@@ -113,6 +136,8 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
void uet_pdc_free(struct uet_pdc *pdc)
{
dst_release(&pdc->metadata->dst);
+ bitmap_free(pdc->ack_bitmap);
+ bitmap_free(pdc->rx_bitmap);
kfree(pdc);
}
@@ -122,3 +147,263 @@ void uet_pdc_destroy(struct uet_pdc *pdc)
uet_pds_pdcid_remove(pdc);
uet_pds_pdc_gc_queue(pdc);
}
+
+static void pdc_build_ack(struct uet_pdc *pdc, struct sk_buff *skb, u32 psn,
+ u8 ack_flags, bool exact_psn)
+{
+ struct uet_pds_ack_hdr *ack = skb_put(skb, sizeof(*ack));
+
+ uet_pdc_build_prologue(&ack->prologue, UET_PDS_TYPE_ACK,
+ UET_PDS_NEXT_HDR_RSP, ack_flags);
+ if (exact_psn) {
+ ack->ack_psn_offset = 0;
+ ack->cack_psn = cpu_to_be32(psn);
+ } else {
+ ack->ack_psn_offset = cpu_to_be16(psn - pdc->rx_base_psn);
+ ack->cack_psn = cpu_to_be32(pdc->rx_base_psn);
+ }
+ ack->spdcid = cpu_to_be16(pdc->spdcid);
+ ack->dpdcid = cpu_to_be16(pdc->dpdcid);
+}
+
+static void uet_pdc_build_ses_ack(struct uet_pdc *pdc, struct sk_buff *skb,
+ __u8 ses_rc, __be16 msg_id, u32 psn,
+ u8 ack_flags, bool exact_psn)
+{
+ struct uet_ses_rsp_hdr *ses_rsp;
+ __be16 packed;
+
+ pdc_build_ack(pdc, skb, psn, ack_flags, exact_psn);
+ ses_rsp = skb_put(skb, sizeof(*ses_rsp));
+ memset(ses_rsp, 0, sizeof(*ses_rsp));
+ packed = pds_ses_rsp_hdr_pack(UET_SES_RSP_OP_RESPONSE, 0,
+ UET_SES_RSP_LIST_EXPECTED, ses_rc);
+ ses_rsp->lst_opcode_ver_rc = packed;
+ ses_rsp->idx_gen_job_id = cpu_to_be32(pdc->key.job_id);
+ ses_rsp->msg_id = msg_id;
+}
+
+static int uet_pdc_send_ses_ack(struct uet_pdc *pdc, __u8 ses_rc, __be16 msg_id,
+ u32 psn, u8 ack_flags, bool exact_psn)
+{
+ struct sk_buff *skb;
+
+ skb = alloc_skb(sizeof(struct uet_ses_rsp_hdr) +
+ sizeof(struct uet_pds_ack_hdr), GFP_ATOMIC);
+ if (!skb)
+ return -ENOBUFS;
+
+ uet_pdc_build_ses_ack(pdc, skb, ses_rc, msg_id, psn, ack_flags,
+ exact_psn);
+ uet_pdc_xmit(pdc, skb);
+
+ return 0;
+}
+
+static void uet_pdc_mpr_advance_tx(struct uet_pdc *pdc, u32 bits)
+{
+ if (!test_bit(0, pdc->ack_bitmap))
+ return;
+
+ bitmap_shift_right(pdc->ack_bitmap, pdc->ack_bitmap, bits, UET_PDC_MPR);
+ pdc->tx_base_psn += bits;
+ netdev_dbg(pds_netdev(pdc->pds), "%s: advancing tx to %u\n", __func__,
+ pdc->tx_base_psn);
+}
+
+int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
+ __be32 remote_fep_addr)
+{
+ struct uet_ses_rsp_hdr *ses_rsp = pds_ack_ses_rsp_hdr(skb);
+ struct uet_pds_ack_hdr *ack = pds_ack_hdr(skb);
+ s16 ack_psn_offset = be16_to_cpu(ack->ack_psn_offset);
+ const char *drop_reason = "ack_psn not in MPR window";
+ u32 cack_psn = be32_to_cpu(ack->cack_psn);
+ u32 ack_psn = cack_psn + ack_psn_offset;
+ int ret = -EINVAL;
+ u32 psn_bit;
+
+ spin_lock(&pdc->lock);
+ netdev_dbg(pds_netdev(pdc->pds), "%s: tx_busy: %u pdc: [ tx_base_psn: %u"
+ " state: %u dpdcid: %u spdcid: %u ]\n"
+ "ses: [ msg id: %u cack_psn: %u spdcid: %u"
+ " dpdcid: %u ack_psn: %u ]\n",
+ __func__, pdc->tx_busy, pdc->tx_base_psn,
+ pdc->state, pdc->dpdcid, pdc->spdcid,
+ be16_to_cpu(ses_rsp->msg_id), be32_to_cpu(ack->cack_psn),
+ be16_to_cpu(ack->spdcid), be16_to_cpu(ack->dpdcid), ack_psn);
+
+ if (psn_mpr_pos(pdc->tx_base_psn, ack_psn) != UET_PDC_MPR_CUR)
+ goto err_dbg;
+
+ psn_bit = ack_psn - pdc->tx_base_psn;
+ if (!psn_bit_valid(psn_bit)) {
+ drop_reason = "ack_psn bit is invalid";
+ goto err_dbg;
+ }
+ if (test_and_set_bit(psn_bit, pdc->ack_bitmap)) {
+ drop_reason = "ack_psn bit already set in ack_bitmap";
+ goto err_dbg;
+ }
+
+ /* either using ROD mode or in SYN_SENT state */
+ if (pdc->tx_busy)
+ pdc->tx_busy = false;
+ /* we can advance only if the oldest pkt got acked */
+ if (!psn_bit)
+ uet_pdc_mpr_advance_tx(pdc, 1);
+
+ ret = 0;
+ switch (pdc->state) {
+ case UET_PDC_EP_STATE_SYN_SENT:
+ case UET_PDC_EP_STATE_NEW_ESTABLISHED:
+ pdc->dpdcid = be16_to_cpu(ack->spdcid);
+ pdc->state = UET_PDC_EP_STATE_ESTABLISHED;
+ fallthrough;
+ case UET_PDC_EP_STATE_ESTABLISHED:
+ ret = uet_job_fep_queue_skb(pds_context(pdc->pds),
+ uet_ses_rsp_job_id(ses_rsp), skb,
+ remote_fep_addr);
+ break;
+ case UET_PDC_EP_STATE_ACK_WAIT:
+ break;
+ case UET_PDC_EP_STATE_CLOSE_ACK_WAIT:
+ break;
+ }
+
+out:
+ spin_unlock(&pdc->lock);
+
+ return ret;
+err_dbg:
+ netdev_dbg(pds_netdev(pdc->pds), "%s: drop reason: [ %s ]\n"
+ "pdc: [ tx_base_psn: %u state: %u"
+ " dpdcid: %u spdcid: %u ]\n"
+ "ses: [ msg id: %u cack_psn: %u spdcid: %u"
+ " dpdcid: %u ack_psn: %u ]\n",
+ __func__, drop_reason, pdc->tx_base_psn,
+ pdc->state, pdc->dpdcid, pdc->spdcid,
+ be16_to_cpu(ses_rsp->msg_id), be32_to_cpu(ack->cack_psn),
+ be16_to_cpu(ack->spdcid), be16_to_cpu(ack->dpdcid), ack_psn);
+ goto out;
+}
+
+static void uet_pdc_mpr_advance_rx(struct uet_pdc *pdc)
+{
+ if (!test_bit(0, pdc->rx_bitmap))
+ return;
+
+ bitmap_shift_right(pdc->rx_bitmap, pdc->rx_bitmap, 1, UET_PDC_MPR);
+ pdc->rx_base_psn++;
+ netdev_dbg(pds_netdev(pdc->pds), "%s: advancing rx to %u\n",
+ __func__, pdc->rx_base_psn);
+}
+
+int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
+ __be32 remote_fep_addr, __u8 tos)
+{
+ struct uet_ses_req_hdr *ses_req = pds_req_ses_req_hdr(skb);
+ struct uet_pds_req_hdr *req = pds_req_hdr(skb);
+ u8 req_flags = uet_prologue_flags(&req->prologue), ack_flags = 0;
+ u32 req_psn = be32_to_cpu(req->psn);
+ const char *drop_reason = "tx_busy";
+ unsigned long psn_bit;
+ enum mpr_pos psn_pos;
+ int ret = -EINVAL;
+
+ spin_lock(&pdc->lock);
+ netdev_dbg(pds_netdev(pdc->pds), "%s: tx_busy: %u pdc: [ tx_base_psn: %u"
+ " state: %u dpdcid: %u spdcid: %u ]\n"
+ "req: [ psn: %u spdcid: %u dpdcid: %u prologue flags: 0x%x ]\n"
+ "ses_req: [ opcode: %u msg id: %u job id: %u "
+ "pid_on_fep: %u flags: 0x%x ]\n",
+ __func__, pdc->tx_busy, pdc->tx_base_psn,
+ pdc->state, pdc->dpdcid, pdc->spdcid,
+ req_psn, be16_to_cpu(req->spdcid), be16_to_cpu(req->dpdcid),
+ uet_prologue_flags(&req->prologue),
+ uet_ses_req_opcode(ses_req), be16_to_cpu(ses_req->msg_id),
+ uet_ses_req_job_id(ses_req), uet_ses_req_pid_on_fep(ses_req),
+ uet_ses_req_flags(ses_req));
+
+ if (unlikely(pdc->tx_busy))
+ goto err_dbg;
+
+ if (req_flags & UET_PDS_REQ_FLAG_RETX)
+ ack_flags |= UET_PDS_ACK_FLAG_RETX;
+ if (INET_ECN_is_ce(tos))
+ ack_flags |= UET_PDS_ACK_FLAG_M;
+ psn_pos = psn_mpr_pos(pdc->rx_base_psn, req_psn);
+ switch (psn_pos) {
+ case UET_PDC_MPR_FUTURE:
+ drop_reason = "req psn is in a future MPR window";
+ goto err_dbg;
+ case UET_PDC_MPR_PREV:
+ if ((int)(req_psn - pdc->rx_base_psn) < S16_MIN) {
+ drop_reason = "req psn is too far in the past";
+ goto err_dbg;
+ }
+ uet_pdc_send_ses_ack(pdc, UET_SES_RSP_RC_NULL, ses_req->msg_id,
+ req_psn, ack_flags, true);
+ netdev_dbg(pds_netdev(pdc->pds), "%s: received a request in previous MPR window (psn %u)\n"
+ "pdc: [ rx_base_psn: %u state: %u"
+ " dpdcid: %u spdcid: %u ]\n",
+ __func__, req_psn, pdc->rx_base_psn,
+ pdc->state, pdc->dpdcid, pdc->spdcid);
+ goto out;
+ case UET_PDC_MPR_CUR:
+ break;
+ }
+
+ psn_bit = req_psn - pdc->rx_base_psn;
+ if (!psn_bit_valid(psn_bit)) {
+ drop_reason = "req psn bit is invalid";
+ goto err_dbg;
+ }
+ if (test_and_set_bit(psn_bit, pdc->rx_bitmap)) {
+ drop_reason = "req psn bit is already set in rx_bitmap";
+ goto err_dbg;
+ }
+
+ ret = 0;
+ switch (pdc->state) {
+ case UET_PDC_EP_STATE_SYN_SENT:
+ /* error */
+ break;
+ case UET_PDC_EP_STATE_ESTABLISHED:
+ /* Rx request and do an upcall, potentially return an ack */
+ ret = uet_job_fep_queue_skb(pds_context(pdc->pds),
+ uet_ses_req_job_id(ses_req), skb,
+ remote_fep_addr);
+ /* TODO: handle errors in sending the error */
+ /* TODO: more specific RC codes */
+ break;
+ case UET_PDC_EP_STATE_ACK_WAIT:
+ break;
+ case UET_PDC_EP_STATE_CLOSE_ACK_WAIT:
+ break;
+ }
+
+ if (ret >= 0)
+ uet_pdc_send_ses_ack(pdc, UET_SES_RSP_RC_NULL, ses_req->msg_id,
+ req_psn, ack_flags, false);
+ /* TODO: NAK */
+
+ if (!psn_bit)
+ uet_pdc_mpr_advance_rx(pdc);
+
+out:
+ spin_unlock(&pdc->lock);
+
+ return ret;
+err_dbg:
+ netdev_dbg(pds_netdev(pdc->pds), "%s: drop reason: [ %s ]\n"
+ "pdc: [ rx_base_psn: %u state: %u"
+ " dpdcid: %u spdcid: %u ]\n"
+ "ses_req: [ msg id: %u ack_psn: %u spdcid: %u"
+ " dpdcid: %u ]\n",
+ __func__, drop_reason, pdc->rx_base_psn,
+ pdc->state, pdc->dpdcid, pdc->spdcid,
+ be16_to_cpu(ses_req->msg_id), be32_to_cpu(req->psn),
+ be16_to_cpu(req->spdcid), be16_to_cpu(req->dpdcid));
+ goto out;
+}
diff --git a/drivers/ultraeth/uet_pds.c b/drivers/ultraeth/uet_pds.c
index 4aec61eeb230..abc576e5b6e7 100644
--- a/drivers/ultraeth/uet_pds.c
+++ b/drivers/ultraeth/uet_pds.c
@@ -149,11 +149,144 @@ void uet_pds_clean_job(struct uet_pds *pds, u32 job_id)
rhashtable_walk_exit(&iter);
}
+static int uet_pds_rx_ack(struct uet_pds *pds, struct sk_buff *skb,
+ __be32 local_fep_addr, __be32 remote_fep_addr)
+{
+ struct uet_pds_ack_hdr *pds_ack = pds_ack_hdr(skb);
+ u16 pdcid = be16_to_cpu(pds_ack->dpdcid);
+ struct uet_pdc *pdc;
+
+ pdc = rhashtable_lookup_fast(&pds->pdcid_hash, &pdcid,
+ uet_pds_pdcid_rht_params);
+ if (!pdc)
+ return -ENOENT;
+
+ return uet_pdc_rx_ack(pdc, skb, remote_fep_addr);
+}
+
+static struct uet_pdc *uet_pds_new_pdc_rx(struct uet_pds *pds,
+ struct sk_buff *skb,
+ __be16 dport,
+ struct uet_pdc_key *key,
+ u8 mode, u8 state)
+{
+ struct uet_ses_req_hdr *ses_req = pds_req_ses_req_hdr(skb);
+ struct uet_pds_req_hdr *req = pds_req_hdr(skb);
+
+ return uet_pdc_create(pds, be32_to_cpu(req->psn), state,
+ be16_to_cpu(req->spdcid),
+ uet_ses_req_pid_on_fep(ses_req),
+ mode, 0, dport, key, true);
+}
+
+static int uet_pds_rx_req(struct uet_pds *pds, struct sk_buff *skb,
+ __be32 local_fep_addr, __be32 remote_fep_addr,
+ __be16 dport, __u8 tos)
+{
+ struct uet_ses_req_hdr *ses_req = pds_req_ses_req_hdr(skb);
+ struct uet_pds_req_hdr *pds_req = pds_req_hdr(skb);
+ u16 pdcid = be16_to_cpu(pds_req->dpdcid);
+ struct uet_pdc_key key = {};
+ struct uet_fep *fep;
+ struct uet_pdc *pdc;
+
+ key.src_ip = local_fep_addr;
+ key.dst_ip = remote_fep_addr;
+ key.job_id = uet_ses_req_job_id(ses_req);
+
+ pdc = rhashtable_lookup_fast(&pds->pdcid_hash, &pdcid,
+ uet_pds_pdcid_rht_params);
+ /* new flow */
+ if (unlikely(!pdc)) {
+ struct uet_prologue_hdr *prologue = pds_prologue_hdr(skb);
+ struct uet_context *ctx;
+ struct uet_job *job;
+
+ if (!(uet_prologue_flags(prologue) & UET_PDS_REQ_FLAG_SYN))
+ return -EINVAL;
+
+ ctx = container_of(pds, struct uet_context, pds);
+ job = uet_job_find(&ctx->job_reg, key.job_id);
+ if (!job)
+ return -ENOENT;
+ fep = rcu_dereference(job->fep);
+ if (!fep)
+ return -ECONNREFUSED;
+ if (fep->addr.in_address.ip != local_fep_addr)
+ return -ENOENT;
+
+ pdc = uet_pds_new_pdc_rx(pds, skb, dport, &key,
+ UET_PDC_MODE_RUD,
+ UET_PDC_EP_STATE_NEW_ESTABLISHED);
+ if (IS_ERR(pdc))
+ return PTR_ERR(pdc);
+ }
+
+ return uet_pdc_rx_req(pdc, skb, remote_fep_addr, tos);
+}
+
+static bool uet_pds_rx_valid_req_next_hdr(const struct uet_prologue_hdr *prologue)
+{
+ switch (uet_prologue_next_hdr(prologue)) {
+ case UET_PDS_NEXT_HDR_REQ_STD:
+ break;
+ default:
+ return false;
+ }
+
+ return true;
+}
+
+static bool uet_pds_rx_valid_ack_next_hdr(const struct uet_prologue_hdr *prologue)
+{
+ switch (uet_prologue_next_hdr(prologue)) {
+ case UET_PDS_NEXT_HDR_RSP:
+ case UET_PDS_NEXT_HDR_RSP_DATA:
+ case UET_PDS_NEXT_HDR_RSP_DATA_SMALL:
+ break;
+ default:
+ return false;
+ }
+
+ return true;
+}
+
int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
__be32 remote_fep_addr, __be16 dport, __u8 tos)
{
+ struct uet_prologue_hdr *prologue;
+ unsigned int offset = 0;
+ int ret = -EINVAL;
+
if (!pskb_may_pull(skb, sizeof(struct uet_prologue_hdr)))
- return -EINVAL;
+ return ret;
+
+ prologue = pds_prologue_hdr(skb);
+ switch (uet_prologue_type(prologue)) {
+ case UET_PDS_TYPE_ACK:
+ if (!uet_pds_rx_valid_ack_next_hdr(prologue))
+ break;
+ offset += sizeof(struct uet_pds_ack_hdr) +
+ sizeof(struct uet_ses_rsp_hdr);
+ if (!pskb_may_pull(skb, offset))
+ break;
+
+ __net_timestamp(skb);
+ ret = uet_pds_rx_ack(pds, skb, local_fep_addr, remote_fep_addr);
+ break;
+ case UET_PDS_TYPE_RUD_REQ:
+ if (!uet_pds_rx_valid_req_next_hdr(prologue))
+ break;
+ offset = sizeof(struct uet_pds_req_hdr) +
+ sizeof(struct uet_ses_req_hdr);
+ if (!pskb_may_pull(skb, offset))
+ break;
+ ret = uet_pds_rx_req(pds, skb, local_fep_addr, remote_fep_addr,
+ dport, tos);
+ break;
+ default:
+ break;
+ }
- return 0;
+ return ret;
}
diff --git a/include/net/ultraeth/uet_pdc.h b/include/net/ultraeth/uet_pdc.h
index 70f3c6aa03df..1a42647489fe 100644
--- a/include/net/ultraeth/uet_pdc.h
+++ b/include/net/ultraeth/uet_pdc.h
@@ -45,6 +45,12 @@ enum {
UET_PDC_MODE_UUD
};
+enum mpr_pos {
+ UET_PDC_MPR_PREV,
+ UET_PDC_MPR_CUR,
+ UET_PDC_MPR_FUTURE
+};
+
struct uet_pdc {
struct rhash_head pdcid_node;
struct rhash_head pdcep_node;
@@ -63,6 +69,9 @@ struct uet_pdc {
u8 mode;
bool is_initiator;
+ unsigned long *rx_bitmap;
+ unsigned long *ack_bitmap;
+
u32 rx_base_psn;
u32 tx_base_psn;
@@ -76,4 +85,33 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
const struct uet_pdc_key *key, bool is_inbound);
void uet_pdc_destroy(struct uet_pdc *pdc);
void uet_pdc_free(struct uet_pdc *pdc);
+int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
+ __be32 remote_fep_addr, __u8 tos);
+int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
+ __be32 remote_fep_addr);
+
+static inline void uet_pdc_build_prologue(struct uet_prologue_hdr *prologue,
+ u8 type, u8 next, u8 flags)
+{
+ prologue->type_next_flags = cpu_to_be16((type & UET_PROLOGUE_TYPE_MASK) <<
+ UET_PROLOGUE_TYPE_SHIFT |
+ (next & UET_PROLOGUE_NEXT_MASK) <<
+ UET_PROLOGUE_NEXT_SHIFT |
+ (flags & UET_PROLOGUE_FLAGS_MASK));
+}
+
+static inline enum mpr_pos psn_mpr_pos(u32 base_psn, u32 psn)
+{
+ if (base_psn > psn)
+ return UET_PDC_MPR_PREV;
+ else if (psn - base_psn < UET_PDC_MPR)
+ return UET_PDC_MPR_CUR;
+ else
+ return UET_PDC_MPR_FUTURE;
+}
+
+static inline bool psn_bit_valid(u32 bit)
+{
+ return bit < UET_PDC_MPR;
+}
#endif /* _UECON_PDC_H */
--
2.48.1
* [RFC PATCH 08/13] drivers: ultraeth: add request transmit support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (6 preceding siblings ...)
2025-03-06 23:01 ` [RFC PATCH 07/13] drivers: ultraeth: add request and ack receive support Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 09/13] drivers: ultraeth: add support for coalescing ack Nikolay Aleksandrov
` (6 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Add support for sending request packets. If a PDC doesn't exist yet, one
gets created and its requests carry the SYN flag until the first ACK is
received. Currently it operates in RUD (unordered) mode. The transmit
packet sequence numbers are tracked via a sliding bitmap window. Unacked
packets are kept in a retransmit queue and retransmitted upon a timeout.
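To illustrate the windowing scheme: each in-flight PSN maps to bit
(psn - tx_base_psn) of a 128-bit window, allocation takes the first zero
bit, and the window slides right as the oldest PSNs complete. A minimal
standalone sketch of the idea (userspace C; the names and the fixed
two-word bitmap are ours, not the patch's — the driver itself uses
bitmap_zalloc()/bitmap_shift_right() as in the diff below):

#include <stdint.h>

#define MPR 128	/* maximum PSN range, mirrors UET_PDC_MPR */

struct tx_window {
	uint64_t bits[MPR / 64];	/* bit i tracks PSN base_psn + i */
	uint32_t base_psn;		/* PSN corresponding to bit 0 */
};

/* take the first free slot in the window; -1 when all 128 are in flight */
static int64_t psn_get(struct tx_window *w)
{
	for (unsigned int i = 0; i < MPR; i++) {
		if (!(w->bits[i / 64] & (1ULL << (i % 64)))) {
			w->bits[i / 64] |= 1ULL << (i % 64);
			return (int64_t)w->base_psn + i;
		}
	}
	return -1;	/* window full, caller must back off */
}

/* slide the window right by one once the oldest PSN is acked */
static void window_advance(struct tx_window *w)
{
	w->bits[0] = (w->bits[0] >> 1) | (w->bits[1] << 63);
	w->bits[1] >>= 1;
	w->base_psn++;
}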
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/uet_pdc.c | 312 ++++++++++++++++++++++++++++++++-
drivers/ultraeth/uet_pds.c | 57 ++++++
include/net/ultraeth/uet_pdc.h | 27 +++
include/net/ultraeth/uet_pds.h | 2 +
4 files changed, 389 insertions(+), 9 deletions(-)
diff --git a/drivers/ultraeth/uet_pdc.c b/drivers/ultraeth/uet_pdc.c
index a0352a925329..dc79305cc3b5 100644
--- a/drivers/ultraeth/uet_pdc.c
+++ b/drivers/ultraeth/uet_pdc.c
@@ -19,6 +19,191 @@ static void uet_pdc_xmit(struct uet_pdc *pdc, struct sk_buff *skb)
dev_queue_xmit(skb);
}
+static void uet_pdc_mpr_advance_tx(struct uet_pdc *pdc, u32 bits)
+{
+ if (!test_bit(0, pdc->tx_bitmap) || !test_bit(0, pdc->ack_bitmap))
+ return;
+
+ bitmap_shift_right(pdc->tx_bitmap, pdc->tx_bitmap, bits, UET_PDC_MPR);
+ bitmap_shift_right(pdc->ack_bitmap, pdc->ack_bitmap, bits, UET_PDC_MPR);
+ pdc->tx_base_psn += bits;
+ netdev_dbg(pds_netdev(pdc->pds), "%s: advancing tx to %u\n", __func__,
+ pdc->tx_base_psn);
+}
+
+static void uet_pdc_rtx_skb(struct uet_pdc *pdc, struct sk_buff *skb, ktime_t ts)
+{
+ struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+ struct uet_prologue_hdr *prologue;
+
+ if (!nskb)
+ return;
+
+ prologue = (struct uet_prologue_hdr *)nskb->data;
+ if (!(uet_prologue_flags(prologue) & UET_PDS_REQ_FLAG_RETX))
+ uet_pdc_build_prologue(prologue,
+ uet_prologue_ctl_type(prologue),
+ uet_prologue_next_hdr(prologue),
+ uet_prologue_flags(prologue) |
+ UET_PDS_REQ_FLAG_RETX);
+
+ uet_pdc_xmit(pdc, nskb);
+ skb->tstamp = ts;
+ UET_SKB_CB(skb)->rtx_attempts++;
+}
+
+static void uet_pdc_rtx_timer_expired(struct timer_list *t)
+{
+ u64 smallest_diff = UET_PDC_RTX_DEFAULT_TIMEOUT_NSEC;
+ struct uet_pdc *pdc = from_timer(pdc, t, rtx_timer);
+ ktime_t now = ktime_get_real_ns();
+ struct sk_buff *skb, *skb_tmp;
+
+ spin_lock(&pdc->lock);
+ skb = skb_rb_first(&pdc->rtx_queue);
+ skb_rbtree_walk_from_safe(skb, skb_tmp) {
+ ktime_t expire = ktime_add(skb->tstamp,
+ UET_PDC_RTX_DEFAULT_TIMEOUT_NSEC);
+
+ if (ktime_before(now, expire)) {
+ u64 diff = ktime_to_ns(ktime_sub(expire, now));
+
+ if (diff < smallest_diff)
+ smallest_diff = diff;
+ continue;
+ }
+ if (UET_SKB_CB(skb)->rtx_attempts == UET_PDC_RTX_DEFAULT_MAX) {
+ /* XXX: close connection, count drops etc */
+ netdev_dbg(pds_netdev(pdc->pds), "%s: psn: %u too many rtx attempts: %u\n",
+ __func__, UET_SKB_CB(skb)->psn,
+ UET_SKB_CB(skb)->rtx_attempts);
+ /* if dropping the oldest packet move window */
+ if (UET_SKB_CB(skb)->psn == pdc->tx_base_psn)
+ uet_pdc_mpr_advance_tx(pdc, 1);
+ rb_erase(&skb->rbnode, &pdc->rtx_queue);
+ consume_skb(skb);
+ continue;
+ }
+
+ uet_pdc_rtx_skb(pdc, skb, now);
+ }
+
+ mod_timer(&pdc->rtx_timer, jiffies +
+ nsecs_to_jiffies(smallest_diff));
+ spin_unlock(&pdc->lock);
+}
+
+static void uet_pdc_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
+{
+ struct rb_node **p = &root->rb_node;
+ struct rb_node *parent = NULL;
+ struct sk_buff *skb1;
+
+ while (*p) {
+ parent = *p;
+ skb1 = rb_to_skb(parent);
+ if (before(UET_SKB_CB(skb)->psn, UET_SKB_CB(skb1)->psn))
+ p = &parent->rb_left;
+ else
+ p = &parent->rb_right;
+ }
+
+ rb_link_node(&skb->rbnode, parent, p);
+ rb_insert_color(&skb->rbnode, root);
+}
+
+static struct sk_buff *uet_pdc_rtx_find(struct uet_pdc *pdc, u32 psn)
+{
+ struct rb_node *parent, **p = &pdc->rtx_queue.rb_node;
+
+ while (*p) {
+ struct sk_buff *skb;
+
+ parent = *p;
+ skb = rb_to_skb(parent);
+ if (psn == UET_SKB_CB(skb)->psn)
+ return skb;
+
+ if (before(psn, UET_SKB_CB(skb)->psn))
+ p = &parent->rb_left;
+ else
+ p = &parent->rb_right;
+ }
+
+ return NULL;
+}
+
+static void uet_pdc_rtx_remove_skb(struct uet_pdc *pdc, struct sk_buff *skb)
+{
+ rb_erase(&skb->rbnode, &pdc->rtx_queue);
+ consume_skb(skb);
+}
+
+static void uet_pdc_ack_psn(struct uet_pdc *pdc, struct sk_buff *ack_skb,
+ u32 psn, bool ecn_marked)
+{
+ struct sk_buff *skb = skb_rb_first(&pdc->rtx_queue);
+ u32 first_psn = skb ? UET_SKB_CB(skb)->psn : 0;
+
+ /* if the oldest PSN got ACKed and it hasn't been retransmitted
+ * we can move the timer to the next one
+ */
+ if (skb && psn == first_psn) {
+ struct sk_buff *next = skb_rb_next(skb);
+
+ /* move timer only if first PSN wasn't retransmitted */
+ if (next && !UET_SKB_CB(skb)->rtx_attempts) {
+ ktime_t expire = ktime_add(next->tstamp,
+ UET_PDC_RTX_DEFAULT_TIMEOUT_NSEC);
+ ktime_t now = ktime_get_real_ns();
+
+ if (ktime_before(now, expire)) {
+ u64 diff = ktime_to_ns(ktime_sub(expire, now));
+ unsigned long diffj = nsecs_to_jiffies(diff);
+
+ mod_timer(&pdc->rtx_timer, jiffies + diffj);
+ }
+ }
+ } else {
+ skb = uet_pdc_rtx_find(pdc, psn);
+ }
+
+ if (!skb)
+ return;
+
+ uet_pdc_rtx_remove_skb(pdc, skb);
+}
+
+static void uet_pdc_rtx_purge(struct uet_pdc *pdc)
+{
+ struct rb_node *p = rb_first(&pdc->rtx_queue);
+
+ while (p) {
+ struct sk_buff *skb = rb_to_skb(p);
+
+ p = rb_next(p);
+ uet_pdc_rtx_remove_skb(pdc, skb);
+ }
+}
+
+static int uet_pdc_rtx_queue(struct uet_pdc *pdc, struct sk_buff *skb, u32 psn)
+{
+ struct sk_buff *rtx_skb = skb_clone(skb, GFP_ATOMIC);
+
+ if (unlikely(!rtx_skb))
+ return -ENOMEM;
+
+ UET_SKB_CB(rtx_skb)->psn = psn;
+ UET_SKB_CB(rtx_skb)->rtx_attempts = 0;
+ uet_pdc_rbtree_insert(&pdc->rtx_queue, rtx_skb);
+
+ if (!timer_pending(&pdc->rtx_timer))
+ mod_timer(&pdc->rtx_timer, jiffies +
+ UET_PDC_RTX_DEFAULT_TIMEOUT_JIFFIES);
+
+ return 0;
+}
+
/* use the approach as nf nat, try a few rounds starting at random offset */
static bool uet_pdc_id_get(struct uet_pdc *pdc)
{
@@ -69,7 +254,7 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
pdc->pds = pds;
pdc->mode = mode;
pdc->is_initiator = !is_inbound;
-
+ pdc->rtx_queue = RB_ROOT;
if (!uet_pdc_id_get(pdc))
goto err_id_get;
@@ -83,9 +268,13 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
pdc->rx_bitmap = bitmap_zalloc(UET_PDC_MPR, GFP_ATOMIC);
if (!pdc->rx_bitmap)
goto err_rx_bitmap;
+ pdc->tx_bitmap = bitmap_zalloc(UET_PDC_MPR, GFP_ATOMIC);
+ if (!pdc->tx_bitmap)
+ goto err_tx_bitmap;
pdc->ack_bitmap = bitmap_zalloc(UET_PDC_MPR, GFP_ATOMIC);
if (!pdc->ack_bitmap)
goto err_ack_bitmap;
+ timer_setup(&pdc->rtx_timer, uet_pdc_rtx_timer_expired, 0);
pdc->metadata = __ip_tun_set_dst(key->src_ip, key->dst_ip, tos, 0, dport,
md_flags, 0, 0);
if (!pdc->metadata)
@@ -124,6 +313,8 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
err_tun_dst:
bitmap_free(pdc->ack_bitmap);
err_ack_bitmap:
+ bitmap_free(pdc->tx_bitmap);
+err_tx_bitmap:
bitmap_free(pdc->rx_bitmap);
err_rx_bitmap:
uet_pds_pdcid_remove(pdc);
@@ -135,8 +326,11 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
void uet_pdc_free(struct uet_pdc *pdc)
{
+ timer_delete_sync(&pdc->rtx_timer);
+ uet_pdc_rtx_purge(pdc);
dst_release(&pdc->metadata->dst);
bitmap_free(pdc->ack_bitmap);
+ bitmap_free(pdc->tx_bitmap);
bitmap_free(pdc->rx_bitmap);
kfree(pdc);
}
@@ -148,6 +342,53 @@ void uet_pdc_destroy(struct uet_pdc *pdc)
uet_pds_pdc_gc_queue(pdc);
}
+static s64 uet_pdc_get_psn(struct uet_pdc *pdc)
+{
+ unsigned long fzb = find_first_zero_bit(pdc->tx_bitmap, UET_PDC_MPR);
+
+ if (unlikely(fzb == UET_PDC_MPR))
+ return -1;
+
+ set_bit(fzb, pdc->tx_bitmap);
+
+ return pdc->tx_base_psn + fzb;
+}
+
+static void uet_pdc_put_psn(struct uet_pdc *pdc, u32 psn)
+{
+ unsigned long psn_bit = psn - pdc->tx_base_psn;
+
+ clear_bit(psn_bit, pdc->tx_bitmap);
+}
+
+static int uet_pdc_build_req(struct uet_pdc *pdc,
+ struct sk_buff *skb, u8 type, u8 flags)
+{
+ struct uet_pds_req_hdr *req;
+ s64 psn;
+
+ req = skb_push(skb, sizeof(*req));
+ uet_pdc_build_prologue(&req->prologue, type,
+ UET_PDS_NEXT_HDR_REQ_STD, flags);
+ switch (pdc->state) {
+ case UET_PDC_EP_STATE_CLOSED:
+ pdc->psn_start = get_random_u32();
+ pdc->tx_base_psn = pdc->psn_start;
+ pdc->rx_base_psn = pdc->psn_start;
+ break;
+ }
+
+ psn = uet_pdc_get_psn(pdc);
+ if (unlikely(psn == -1))
+ return -ENOSPC;
+ UET_SKB_CB(skb)->psn = psn;
+ req->psn = cpu_to_be32(psn);
+ req->spdcid = cpu_to_be16(pdc->spdcid);
+ req->dpdcid = cpu_to_be16(pdc->dpdcid);
+
+ return 0;
+}
+
static void pdc_build_ack(struct uet_pdc *pdc, struct sk_buff *skb, u32 psn,
u8 ack_flags, bool exact_psn)
{
@@ -200,15 +441,65 @@ static int uet_pdc_send_ses_ack(struct uet_pdc *pdc, __u8 ses_rc, __be16 msg_id,
return 0;
}
-static void uet_pdc_mpr_advance_tx(struct uet_pdc *pdc, u32 bits)
+int uet_pdc_tx_req(struct uet_pdc *pdc, struct sk_buff *skb, u8 type)
{
- if (!test_bit(0, pdc->ack_bitmap))
- return;
+ struct uet_pds_req_hdr *req;
+ int ret = 0;
- bitmap_shift_right(pdc->ack_bitmap, pdc->ack_bitmap, bits, UET_PDC_MPR);
- pdc->tx_base_psn += bits;
- netdev_dbg(pds_netdev(pdc->pds), "%s: advancing tx to %u\n", __func__,
- pdc->tx_base_psn);
+ spin_lock_bh(&pdc->lock);
+ if (pdc->tx_busy) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ switch (pdc->state) {
+ case UET_PDC_EP_STATE_CLOSED:
+ ret = uet_pdc_build_req(pdc, skb, type, UET_PDS_REQ_FLAG_SYN);
+ if (ret)
+ goto out_unlock;
+ req = (struct uet_pds_req_hdr *)skb->data;
+ ret = uet_pdc_rtx_queue(pdc, skb, be32_to_cpu(req->psn));
+ if (ret) {
+ uet_pdc_put_psn(pdc, be32_to_cpu(req->psn));
+ goto out_unlock;
+ }
+ pdc->state = UET_PDC_EP_STATE_SYN_SENT;
+ pdc->tx_busy = true;
+ break;
+ case UET_PDC_EP_STATE_SYN_SENT:
+ break;
+ case UET_PDC_EP_STATE_ESTABLISHED:
+ ret = uet_pdc_build_req(pdc, skb, type, 0);
+ if (ret)
+ goto out_unlock;
+ req = (struct uet_pds_req_hdr *)skb->data;
+ ret = uet_pdc_rtx_queue(pdc, skb, be32_to_cpu(req->psn));
+ if (ret) {
+ uet_pdc_put_psn(pdc, be32_to_cpu(req->psn));
+ goto out_unlock;
+ }
+ break;
+ case UET_PDC_EP_STATE_QUIESCE:
+ break;
+ case UET_PDC_EP_STATE_ACK_WAIT:
+ break;
+ case UET_PDC_EP_STATE_CLOSE_ACK_WAIT:
+ break;
+ default:
+ WARN_ON(1);
+ }
+
+out_unlock:
+ netdev_dbg(pds_netdev(pdc->pds), "%s: tx_busy: %u pdc: [ tx_base_psn: %u"
+ " state: %u dpdcid: %u spdcid: %u ] proto 0x%x\n",
+ __func__, pdc->tx_busy, pdc->tx_base_psn, pdc->state,
+ pdc->dpdcid, pdc->spdcid, ntohs(skb->protocol));
+ spin_unlock_bh(&pdc->lock);
+
+ if (!ret)
+ uet_pdc_xmit(pdc, skb);
+
+ return ret;
}
int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
@@ -221,6 +512,7 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
u32 cack_psn = be32_to_cpu(ack->cack_psn);
u32 ack_psn = cack_psn + ack_psn_offset;
int ret = -EINVAL;
+ bool ecn_marked;
u32 psn_bit;
spin_lock(&pdc->lock);
@@ -237,7 +529,7 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
goto err_dbg;
psn_bit = ack_psn - pdc->tx_base_psn;
- if (!psn_bit_valid(psn_bit)) {
+ if (!psn_bit_valid(psn_bit) || !test_bit(psn_bit, pdc->tx_bitmap)) {
drop_reason = "ack_psn bit is invalid";
goto err_dbg;
}
@@ -252,6 +544,8 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
/* we can advance only if the oldest pkt got acked */
if (!psn_bit)
uet_pdc_mpr_advance_tx(pdc, 1);
+ ecn_marked = !!(uet_prologue_flags(&ack->prologue) & UET_PDS_ACK_FLAG_M);
+ uet_pdc_ack_psn(pdc, skb, ack_psn, ecn_marked);
ret = 0;
switch (pdc->state) {
diff --git a/drivers/ultraeth/uet_pds.c b/drivers/ultraeth/uet_pds.c
index abc576e5b6e7..7efb634de85f 100644
--- a/drivers/ultraeth/uet_pds.c
+++ b/drivers/ultraeth/uet_pds.c
@@ -290,3 +290,60 @@ int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
return ret;
}
+
+static struct uet_pdc *uet_pds_new_pdc_tx(struct uet_pds *pds,
+ struct sk_buff *skb,
+ __be16 dport,
+ struct uet_pdc_key *key,
+ u8 mode, u8 state)
+{
+ struct uet_ses_req_hdr *ses_req = (struct uet_ses_req_hdr *)skb->data;
+
+ return uet_pdc_create(pds, 0, state, 0,
+ uet_ses_req_pid_on_fep(ses_req),
+ mode, 0, dport, key, false);
+}
+
+int uet_pds_tx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
+ __be32 remote_fep_addr, __be16 dport, u32 job_id)
+{
+ struct uet_ses_req_hdr *ses_req = (struct uet_ses_req_hdr *)skb->data;
+ u32 req_job_id = uet_ses_req_job_id(ses_req);
+ struct uet_pdc_key key = {};
+ struct uet_pdc *pdc;
+
+ /* sending with wrong SES header job id? */
+ if (unlikely(job_id != req_job_id))
+ return -EINVAL;
+
+ key.src_ip = local_fep_addr;
+ key.dst_ip = remote_fep_addr;
+ key.job_id = job_id;
+
+ pdc = rhashtable_lookup_fast(&pds->pdcep_hash, &key,
+ uet_pds_pdcep_rht_params);
+ /* new flow */
+ if (unlikely(!pdc)) {
+ struct uet_context *ctx;
+ struct uet_job *job;
+ struct uet_fep *fep;
+
+ ctx = container_of(pds, struct uet_context, pds);
+ job = uet_job_find(&ctx->job_reg, key.job_id);
+ if (!job)
+ return -ENOENT;
+ fep = rcu_dereference(job->fep);
+ if (!fep)
+ return -ECONNREFUSED;
+
+ pdc = uet_pds_new_pdc_tx(pds, skb, dport, &key,
+ UET_PDC_MODE_RUD,
+ UET_PDC_EP_STATE_CLOSED);
+ if (IS_ERR(pdc))
+ return PTR_ERR(pdc);
+ }
+
+ __net_timestamp(skb);
+
+ return uet_pdc_tx_req(pdc, skb, UET_PDS_TYPE_RUD_REQ);
+}
diff --git a/include/net/ultraeth/uet_pdc.h b/include/net/ultraeth/uet_pdc.h
index 1a42647489fe..261afc57ffe1 100644
--- a/include/net/ultraeth/uet_pdc.h
+++ b/include/net/ultraeth/uet_pdc.h
@@ -13,6 +13,12 @@
#define UET_PDC_ID_MAX_ATTEMPTS 128
#define UET_PDC_MAX_ID U16_MAX
+#define UET_PDC_RTX_DEFAULT_TIMEOUT_SEC 30
+#define UET_PDC_RTX_DEFAULT_TIMEOUT_JIFFIES (UET_PDC_RTX_DEFAULT_TIMEOUT_SEC * \
+ HZ)
+#define UET_PDC_RTX_DEFAULT_TIMEOUT_NSEC (UET_PDC_RTX_DEFAULT_TIMEOUT_SEC * \
+ NSEC_PER_SEC)
+#define UET_PDC_RTX_DEFAULT_MAX 3
#define UET_PDC_MPR 128
#define UET_SKB_CB(skb) ((struct uet_skb_cb *)&((skb)->cb[0]))
@@ -20,6 +26,7 @@
struct uet_skb_cb {
u32 psn;
__be32 remote_fep_addr;
+ u8 rtx_attempts;
};
enum {
@@ -51,6 +58,13 @@ enum mpr_pos {
UET_PDC_MPR_FUTURE
};
+struct uet_pdc_pkt {
+ struct sk_buff *skb;
+ struct timer_list rtx_timer;
+ u32 psn;
+ int rtx;
+};
+
struct uet_pdc {
struct rhash_head pdcid_node;
struct rhash_head pdcep_node;
@@ -69,12 +83,19 @@ struct uet_pdc {
u8 mode;
bool is_initiator;
+ int rtx_max;
+ struct timer_list rtx_timer;
+ unsigned long rtx_timeout;
+
unsigned long *rx_bitmap;
+ unsigned long *tx_bitmap;
unsigned long *ack_bitmap;
u32 rx_base_psn;
u32 tx_base_psn;
+ struct rb_root rtx_queue;
+
struct hlist_node gc_node;
struct rcu_head rcu;
};
@@ -89,6 +110,7 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
__be32 remote_fep_addr, __u8 tos);
int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
__be32 remote_fep_addr);
+int uet_pdc_tx_req(struct uet_pdc *pdc, struct sk_buff *skb, u8 type);
static inline void uet_pdc_build_prologue(struct uet_prologue_hdr *prologue,
u8 type, u8 next, u8 flags)
@@ -114,4 +136,9 @@ static inline bool psn_bit_valid(u32 bit)
{
return bit < UET_PDC_MPR;
}
+
+static inline bool before(u32 seq1, u32 seq2)
+{
+ return (s32)(seq1 - seq2) < 0;
+}
#endif /* _UECON_PDC_H */
diff --git a/include/net/ultraeth/uet_pds.h b/include/net/ultraeth/uet_pds.h
index 43f5748a318a..78624370f18c 100644
--- a/include/net/ultraeth/uet_pds.h
+++ b/include/net/ultraeth/uet_pds.h
@@ -40,6 +40,8 @@ void uet_pds_clean_job(struct uet_pds *pds, u32 job_id);
int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
__be32 remote_fep_addr, __be16 dport, __u8 tos);
+int uet_pds_tx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
+ __be32 remote_fep_addr, __be16 dport, u32 job_id);
static inline struct uet_prologue_hdr *pds_prologue_hdr(const struct sk_buff *skb)
{
--
2.48.1
* [RFC PATCH 09/13] drivers: ultraeth: add support for coalescing ack
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (7 preceding siblings ...)
2025-03-06 23:01 ` [RFC PATCH 08/13] drivers: ultraeth: add request transmit support Nikolay Aleksandrov
@ 2025-03-06 23:01 ` Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 10/13] drivers: ultraeth: add sack support Nikolay Aleksandrov
` (5 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:01 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
This patch adds Rx support for ACK coalescing based on the PDS spec. It
is controlled by two per-FEP variables that can be set when the FEP
requests to be associated with a job (see the sketch below):
- ack_gen_trigger: number of accumulated bytes that triggers an ACK
- ack_gen_min_pkt_add: minimum number of bytes credited for each
received packet
The default values are ack_gen_trigger = 16KB and ack_gen_min_pkt_add = 1KB,
as per the spec.
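For illustration, the coalescing rule reduces to a byte counter: each
received request credits max(len, ack_gen_min_pkt_add) bytes and an ACK
goes out once the counter reaches ack_gen_trigger, or immediately for
the first ack and for AR/retransmitted requests. A standalone sketch of
that accounting (userspace C, modeled on uet_pdc_rx_req_handle_ack() in
this patch; names are ours):

#include <stdbool.h>
#include <stdint.h>

#define ACK_GEN_TRIGGER		(1 << 14)	/* 16KB default per spec */
#define ACK_GEN_MIN_PKT_ADD	(1 << 10)	/* 1KB default per spec */

static uint32_t ack_gen_count;

/* returns true when the received packet should generate an ACK */
static bool should_ack(uint32_t pkt_len, bool first_ack, bool ar_or_retx)
{
	ack_gen_count += pkt_len > ACK_GEN_MIN_PKT_ADD ?
			 pkt_len : ACK_GEN_MIN_PKT_ADD;
	if (first_ack || ar_or_retx || ack_gen_count >= ACK_GEN_TRIGGER) {
		ack_gen_count = 0;	/* coalescing window restarts */
		return true;
	}
	return false;
}

With the defaults above, at most 16 back-to-back small requests (1KB
credited each) elapse between generated ACKs.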
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/uet_pdc.c | 119 ++++++++++++++++++++++++---------
drivers/ultraeth/uet_pds.c | 18 +++--
include/net/ultraeth/uet_job.h | 2 +
include/net/ultraeth/uet_pdc.h | 7 +-
include/uapi/linux/ultraeth.h | 2 +
5 files changed, 111 insertions(+), 37 deletions(-)
diff --git a/drivers/ultraeth/uet_pdc.c b/drivers/ultraeth/uet_pdc.c
index dc79305cc3b5..55b893ac5479 100644
--- a/drivers/ultraeth/uet_pdc.c
+++ b/drivers/ultraeth/uet_pdc.c
@@ -19,9 +19,9 @@ static void uet_pdc_xmit(struct uet_pdc *pdc, struct sk_buff *skb)
dev_queue_xmit(skb);
}
-static void uet_pdc_mpr_advance_tx(struct uet_pdc *pdc, u32 bits)
+static void __uet_pdc_mpr_advance_tx(struct uet_pdc *pdc, u32 bits)
{
- if (!test_bit(0, pdc->tx_bitmap) || !test_bit(0, pdc->ack_bitmap))
+ if (WARN_ON_ONCE(bits >= UET_PDC_MPR))
return;
bitmap_shift_right(pdc->tx_bitmap, pdc->tx_bitmap, bits, UET_PDC_MPR);
@@ -31,6 +31,15 @@ static void uet_pdc_mpr_advance_tx(struct uet_pdc *pdc, u32 bits)
pdc->tx_base_psn);
}
+static void uet_pdc_mpr_advance_tx(struct uet_pdc *pdc, u32 cack_psn)
+{
+ /* cumulative ack, clear all prior and including cack_psn */
+ if (cack_psn > pdc->tx_base_psn)
+ __uet_pdc_mpr_advance_tx(pdc, cack_psn - pdc->tx_base_psn);
+ else if (test_bit(0, pdc->tx_bitmap) && test_bit(0, pdc->ack_bitmap))
+ __uet_pdc_mpr_advance_tx(pdc, 1);
+}
+
static void uet_pdc_rtx_skb(struct uet_pdc *pdc, struct sk_buff *skb, ktime_t ts)
{
struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
@@ -227,7 +236,8 @@ static bool uet_pdc_id_get(struct uet_pdc *pdc)
struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
u16 dpdcid, u16 pid_on_fep, u8 mode,
- u8 tos, __be16 dport,
+ u8 tos, __be16 dport, u32 ack_gen_trigger,
+ u32 ack_gen_min_pkt_add,
const struct uet_pdc_key *key, bool is_inbound)
{
struct uet_pdc *pdc, *pdc_ins = ERR_PTR(-ENOMEM);
@@ -254,6 +264,8 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
pdc->pds = pds;
pdc->mode = mode;
pdc->is_initiator = !is_inbound;
+ pdc->ack_gen_trigger = ack_gen_trigger;
+ pdc->ack_gen_min_pkt_add = ack_gen_min_pkt_add;
pdc->rtx_queue = RB_ROOT;
if (!uet_pdc_id_get(pdc))
goto err_id_get;
@@ -541,11 +553,25 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
/* either using ROD mode or in SYN_SENT state */
if (pdc->tx_busy)
pdc->tx_busy = false;
- /* we can advance only if the oldest pkt got acked */
- if (!psn_bit)
- uet_pdc_mpr_advance_tx(pdc, 1);
ecn_marked = !!(uet_prologue_flags(&ack->prologue) & UET_PDS_ACK_FLAG_M);
- uet_pdc_ack_psn(pdc, skb, ack_psn, ecn_marked);
+ /* we can advance only if the oldest pkt got acked or we got
+ * a cumulative ack clearing >= 1 older packets
+ */
+ if (!psn_bit || cack_psn > pdc->tx_base_psn) {
+ if (cack_psn >= pdc->tx_base_psn) {
+ u32 i;
+
+ for (i = 0; i <= cack_psn - pdc->tx_base_psn; i++)
+ uet_pdc_ack_psn(pdc, skb, cack_psn - i,
+ ecn_marked);
+ }
+
+ uet_pdc_mpr_advance_tx(pdc, cack_psn);
+ }
+
+ /* a separate per-PSN ack is needed only when ack_psn != cack_psn */
+ if (cack_psn != ack_psn)
+ uet_pdc_ack_psn(pdc, skb, ack_psn, ecn_marked);
ret = 0;
switch (pdc->state) {
@@ -584,13 +610,39 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
static void uet_pdc_mpr_advance_rx(struct uet_pdc *pdc)
{
- if (!test_bit(0, pdc->rx_bitmap))
+ unsigned long fzb = find_first_zero_bit(pdc->rx_bitmap, UET_PDC_MPR);
+ u32 old_psn = pdc->rx_base_psn;
+
+ if (fzb == 0)
return;
- bitmap_shift_right(pdc->rx_bitmap, pdc->rx_bitmap, 1, UET_PDC_MPR);
- pdc->rx_base_psn++;
- netdev_dbg(pds_netdev(pdc->pds), "%s: advancing rx to %u\n",
- __func__, pdc->rx_base_psn);
+ bitmap_shift_right(pdc->rx_bitmap, pdc->rx_bitmap, fzb, UET_PDC_MPR);
+ pdc->rx_base_psn += fzb;
+ netdev_dbg(pds_netdev(pdc->pds), "%s: advancing rx from %u to %u (%lu)\n",
+ __func__, old_psn, pdc->rx_base_psn, fzb);
+}
+
+static void uet_pdc_rx_req_handle_ack(struct uet_pdc *pdc, unsigned int len,
+ __be16 msg_id, u8 req_flags, u32 req_psn,
+ u8 ack_flags, bool first_ack)
+{
+ pdc->ack_gen_count += max(pdc->ack_gen_min_pkt_add, len);
+ if (first_ack ||
+ (req_flags & (UET_PDS_REQ_FLAG_AR | UET_PDS_REQ_FLAG_RETX)) ||
+ pdc->ack_gen_count >= pdc->ack_gen_trigger) {
+ /* first advance so if the current psn == rx_base_psn
+ * we will clear it with the cumulative ack
+ */
+ uet_pdc_mpr_advance_rx(pdc);
+ pdc->ack_gen_count = 0;
+ /* req_psn is inside the cumulative ack range, so
+ * it is covered by it
+ */
+ if (unlikely(req_psn < pdc->rx_base_psn))
+ req_psn = pdc->rx_base_psn;
+ uet_pdc_send_ses_ack(pdc, UET_SES_RSP_RC_NULL, msg_id, req_psn,
+ ack_flags, false);
+ }
}
int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
@@ -601,7 +653,9 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
u8 req_flags = uet_prologue_flags(&req->prologue), ack_flags = 0;
u32 req_psn = be32_to_cpu(req->psn);
const char *drop_reason = "tx_busy";
- unsigned long psn_bit;
+ __be16 msg_id = ses_req->msg_id;
+ unsigned int len = skb->len;
+ bool first_ack = false;
enum mpr_pos psn_pos;
int ret = -EINVAL;
@@ -648,22 +702,31 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
break;
}
- psn_bit = req_psn - pdc->rx_base_psn;
- if (!psn_bit_valid(psn_bit)) {
- drop_reason = "req psn bit is invalid";
- goto err_dbg;
- }
- if (test_and_set_bit(psn_bit, pdc->rx_bitmap)) {
- drop_reason = "req psn bit is already set in rx_bitmap";
- goto err_dbg;
- }
-
- ret = 0;
switch (pdc->state) {
case UET_PDC_EP_STATE_SYN_SENT:
/* error */
break;
+ case UET_PDC_EP_STATE_NEW_ESTABLISHED:
+ /* special state when a connection is new, we need to
+ * send first ack immediately
+ */
+ pdc->state = UET_PDC_EP_STATE_ESTABLISHED;
+ first_ack = true;
+ fallthrough;
case UET_PDC_EP_STATE_ESTABLISHED:
+ if (!first_ack) {
+ unsigned long psn_bit = req_psn - pdc->rx_base_psn - 1;
+
+ if (!psn_bit_valid(psn_bit)) {
+ drop_reason = "req psn bit is invalid";
+ goto err_dbg;
+ }
+ if (test_and_set_bit(psn_bit, pdc->rx_bitmap)) {
+ drop_reason = "req psn bit is already set in rx_bitmap";
+ goto err_dbg;
+ }
+ }
+
/* Rx request and do an upcall, potentially return an ack */
ret = uet_job_fep_queue_skb(pds_context(pdc->pds),
uet_ses_req_job_id(ses_req), skb,
@@ -678,12 +741,8 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
}
if (ret >= 0)
- uet_pdc_send_ses_ack(pdc, UET_SES_RSP_RC_NULL, ses_req->msg_id,
- req_psn, ack_flags, false);
- /* TODO: NAK */
-
- if (!psn_bit)
- uet_pdc_mpr_advance_rx(pdc);
+ uet_pdc_rx_req_handle_ack(pdc, len, msg_id, req_flags,
+ req_psn, ack_flags, first_ack);
out:
spin_unlock(&pdc->lock);
diff --git a/drivers/ultraeth/uet_pds.c b/drivers/ultraeth/uet_pds.c
index 7efb634de85f..52122998079d 100644
--- a/drivers/ultraeth/uet_pds.c
+++ b/drivers/ultraeth/uet_pds.c
@@ -166,7 +166,8 @@ static int uet_pds_rx_ack(struct uet_pds *pds, struct sk_buff *skb,
static struct uet_pdc *uet_pds_new_pdc_rx(struct uet_pds *pds,
struct sk_buff *skb,
- __be16 dport,
+ __be16 dport, u32 ack_gen_trigger,
+ u32 ack_gen_min_pkt_add,
struct uet_pdc_key *key,
u8 mode, u8 state)
{
@@ -176,7 +177,8 @@ static struct uet_pdc *uet_pds_new_pdc_rx(struct uet_pds *pds,
return uet_pdc_create(pds, be32_to_cpu(req->psn), state,
be16_to_cpu(req->spdcid),
uet_ses_req_pid_on_fep(ses_req),
- mode, 0, dport, key, true);
+ mode, 0, dport, ack_gen_trigger,
+ ack_gen_min_pkt_add, key, true);
}
static int uet_pds_rx_req(struct uet_pds *pds, struct sk_buff *skb,
@@ -215,7 +217,8 @@ static int uet_pds_rx_req(struct uet_pds *pds, struct sk_buff *skb,
if (fep->addr.in_address.ip != local_fep_addr)
return -ENOENT;
- pdc = uet_pds_new_pdc_rx(pds, skb, dport, &key,
+ pdc = uet_pds_new_pdc_rx(pds, skb, dport, fep->ack_gen_trigger,
+ fep->ack_gen_min_pkt_add, &key,
UET_PDC_MODE_RUD,
UET_PDC_EP_STATE_NEW_ESTABLISHED);
if (IS_ERR(pdc))
@@ -293,7 +296,8 @@ int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
static struct uet_pdc *uet_pds_new_pdc_tx(struct uet_pds *pds,
struct sk_buff *skb,
- __be16 dport,
+ __be16 dport, u32 ack_gen_trigger,
+ u32 ack_gen_min_pkt_add,
struct uet_pdc_key *key,
u8 mode, u8 state)
{
@@ -301,7 +305,8 @@ static struct uet_pdc *uet_pds_new_pdc_tx(struct uet_pds *pds,
return uet_pdc_create(pds, 0, state, 0,
uet_ses_req_pid_on_fep(ses_req),
- mode, 0, dport, key, false);
+ mode, 0, dport, ack_gen_trigger,
+ ack_gen_min_pkt_add, key, false);
}
int uet_pds_tx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
@@ -336,7 +341,8 @@ int uet_pds_tx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
if (!fep)
return -ECONNREFUSED;
- pdc = uet_pds_new_pdc_tx(pds, skb, dport, &key,
+ pdc = uet_pds_new_pdc_tx(pds, skb, dport, fep->ack_gen_trigger,
+ fep->ack_gen_min_pkt_add, &key,
UET_PDC_MODE_RUD,
UET_PDC_EP_STATE_CLOSED);
if (IS_ERR(pdc))
diff --git a/include/net/ultraeth/uet_job.h b/include/net/ultraeth/uet_job.h
index fac1f0752a78..555706a21e96 100644
--- a/include/net/ultraeth/uet_job.h
+++ b/include/net/ultraeth/uet_job.h
@@ -21,6 +21,8 @@ struct uet_fep {
struct uet_context *context;
struct sk_buff_head rxq;
struct fep_address addr;
+ u32 ack_gen_trigger;
+ u32 ack_gen_min_pkt_add;
u32 job_id;
};
diff --git a/include/net/ultraeth/uet_pdc.h b/include/net/ultraeth/uet_pdc.h
index 261afc57ffe1..8a87fc0bc869 100644
--- a/include/net/ultraeth/uet_pdc.h
+++ b/include/net/ultraeth/uet_pdc.h
@@ -94,6 +94,10 @@ struct uet_pdc {
u32 rx_base_psn;
u32 tx_base_psn;
+ u32 ack_gen_trigger;
+ u32 ack_gen_min_pkt_add;
+ u32 ack_gen_count;
+
struct rb_root rtx_queue;
struct hlist_node gc_node;
@@ -102,7 +106,8 @@ struct uet_pdc {
struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
u16 dpdcid, u16 pid_on_fep, u8 mode,
- u8 tos, __be16 dport,
+ u8 tos, __be16 dport, u32 ack_gen_trigger,
+ u32 ack_gen_min_pkt_add,
const struct uet_pdc_key *key, bool is_inbound);
void uet_pdc_destroy(struct uet_pdc *pdc);
void uet_pdc_free(struct uet_pdc *pdc);
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
index 6f3ee5ac8cf4..cc39bf970e08 100644
--- a/include/uapi/linux/ultraeth.h
+++ b/include/uapi/linux/ultraeth.h
@@ -8,6 +8,8 @@
#define UET_DEFAULT_PORT 5432
#define UET_SVC_MAX_LEN 64
+#define UET_DEFAULT_ACK_GEN_TRIGGER (1 << 14)
+#define UET_DEFAULT_ACK_GEN_MIN_PKT_ADD (1 << 10)
/* types used for prologue's type field */
enum {
--
2.48.1
* [RFC PATCH 10/13] drivers: ultraeth: add sack support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (8 preceding siblings ...)
2025-03-06 23:01 ` [RFC PATCH 09/13] drivers: ultraeth: add support for coalescing ack Nikolay Aleksandrov
@ 2025-03-06 23:02 ` Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 11/13] drivers: ultraeth: add nack support Nikolay Aleksandrov
` (4 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:02 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Add SACK support. We choose to send a SACK (with an extended header)
when we have to send an ACK but cannot advance the CACK. The logic is a
bit involved because the spec says we have to align the CACK and
SACK_BASE to 8, which can effectively move the CACK back, so we have to
fill in those bits as 1s in the SACK bitmap.
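To make the back-fill concrete, here is a standalone sketch (userspace
C, simplified from pdc_build_sack() in this patch; the helper name is
ours). E.g. with rx_base_psn = 21 and sack_base aligned down to 16, bits
0-4 are pre-set so the SACK doesn't appear to renege on PSNs 16-20,
which the CACK already covers:

#include <stdint.h>

/* build the initial 64-bit SACK bitmap; bit 0 is the 8-aligned base */
static uint64_t sack_prefill(uint32_t rx_base_psn, uint32_t sack_base)
{
	uint64_t bitmap = 0;

	sack_base &= ~7u;	/* spec: align SACK_BASE down to 8 */
	if (sack_base < rx_base_psn) {
		/* alignment moved back at most 7 PSNs; those are
		 * already cumulatively acked, so mark them received
		 */
		uint32_t shift = rx_base_psn - sack_base;

		bitmap = ~0ULL >> (64 - shift);
	}
	/* caller then ORs in bits for PSNs received out of order */
	return bitmap;
}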
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/uet_pdc.c | 100 ++++++++++++++++++++++++++++++---
drivers/ultraeth/uet_pds.c | 3 +
include/net/ultraeth/uet_pdc.h | 10 ++++
include/uapi/linux/ultraeth.h | 40 +++++++++++++
4 files changed, 146 insertions(+), 7 deletions(-)
diff --git a/drivers/ultraeth/uet_pdc.c b/drivers/ultraeth/uet_pdc.c
index 55b893ac5479..e9469edd9014 100644
--- a/drivers/ultraeth/uet_pdc.c
+++ b/drivers/ultraeth/uet_pdc.c
@@ -401,13 +401,55 @@ static int uet_pdc_build_req(struct uet_pdc *pdc,
return 0;
}
+static void pdc_build_sack(struct uet_pdc *pdc,
+ struct uet_pds_ack_ext_hdr *ack_ext)
+{
+ u32 sack_base = pdc->lowest_unack_psn, shift;
+ unsigned long bit, start_bit;
+ s16 sack_psn_offset;
+ u64 sack_bitmap;
+
+ if (sack_base + UET_PDC_SACK_BITS > pdc->max_rcv_psn)
+ sack_base = max(pdc->max_rcv_psn - UET_PDC_SACK_BITS,
+ pdc->rx_base_psn);
+ sack_base &= UET_PDC_SACK_MASK;
+ sack_psn_offset = (s16)(sack_base -
+ (pdc->rx_base_psn & UET_PDC_SACK_MASK));
+ if (sack_base == pdc->rx_base_psn) {
+ shift = 1;
+ sack_bitmap = 1;
+ bit = 0;
+ } else if (sack_base < pdc->rx_base_psn) {
+ shift = pdc->rx_base_psn - sack_base;
+ sack_bitmap = U64_MAX >> (64 - shift);
+ bit = 0;
+ } else {
+ shift = 0;
+ sack_bitmap = 0;
+ bit = sack_base - pdc->rx_base_psn;
+ }
+
+ start_bit = bit;
+ for_each_set_bit_from(bit, pdc->rx_bitmap, UET_PDC_MPR) {
+ u32 pos = shift + (bit - start_bit);
+ if (pos >= UET_PDC_SACK_BITS)
+ break;
+ sack_bitmap |= BIT_ULL(pos);
+ }
+
+ pdc->lowest_unack_psn += UET_PDC_SACK_BITS;
+ ack_ext->sack_psn_offset = cpu_to_be16(sack_psn_offset);
+ ack_ext->sack_bitmap = cpu_to_be64(sack_bitmap);
+}
+
static void pdc_build_ack(struct uet_pdc *pdc, struct sk_buff *skb, u32 psn,
u8 ack_flags, bool exact_psn)
{
+ u8 type = pdc_should_sack(pdc) ? UET_PDS_TYPE_ACK_CC : UET_PDS_TYPE_ACK;
struct uet_pds_ack_hdr *ack = skb_put(skb, sizeof(*ack));
- uet_pdc_build_prologue(&ack->prologue, UET_PDS_TYPE_ACK,
- UET_PDS_NEXT_HDR_RSP, ack_flags);
+ uet_pdc_build_prologue(&ack->prologue, type, UET_PDS_NEXT_HDR_RSP,
+ ack_flags);
if (exact_psn) {
ack->ack_psn_offset = 0;
ack->cack_psn = cpu_to_be32(psn);
@@ -417,6 +459,13 @@ static void pdc_build_ack(struct uet_pdc *pdc, struct sk_buff *skb, u32 psn,
}
ack->spdcid = cpu_to_be16(pdc->spdcid);
ack->dpdcid = cpu_to_be16(pdc->dpdcid);
+
+ if (pdc_should_sack(pdc)) {
+ struct uet_pds_ack_ext_hdr *ack_ext = skb_put(skb,
+ sizeof(*ack_ext));
+
+ pdc_build_sack(pdc, ack_ext);
+ }
}
static void uet_pdc_build_ses_ack(struct uet_pdc *pdc, struct sk_buff *skb,
@@ -439,10 +488,12 @@ static void uet_pdc_build_ses_ack(struct uet_pdc *pdc, struct sk_buff *skb,
static int uet_pdc_send_ses_ack(struct uet_pdc *pdc, __u8 ses_rc, __be16 msg_id,
u32 psn, u8 ack_flags, bool exact_psn)
{
+ unsigned int skb_size = sizeof(struct uet_ses_rsp_hdr) +
+ sizeof(struct uet_pds_ack_hdr);
struct sk_buff *skb;
- skb = alloc_skb(sizeof(struct uet_ses_rsp_hdr) +
- sizeof(struct uet_pds_ack_hdr), GFP_ATOMIC);
+ skb_size += pdc_should_sack(pdc) ? sizeof(struct uet_pds_ack_ext_hdr) : 0;
+ skb = alloc_skb(skb_size, GFP_ATOMIC);
if (!skb)
return -ENOBUFS;
@@ -514,6 +565,30 @@ int uet_pdc_tx_req(struct uet_pdc *pdc, struct sk_buff *skb, u8 type)
return ret;
}
+static void uet_pdc_rx_sack(struct uet_pdc *pdc, struct sk_buff *skb,
+ u32 cack_psn, struct uet_pds_ack_ext_hdr *ext_ack,
+ bool ecn_marked)
+{
+ DECLARE_BITMAP(sack_bitmap, UET_PDC_SACK_BITS);
+ u32 sack_base_psn = cack_psn +
+ (s16)be16_to_cpu(ext_ack->sack_psn_offset);
+ unsigned long bit;
+
+ /* convert from wire order so bit positions are endian-safe */
+ bitmap_from_u64(sack_bitmap, be64_to_cpu(ext_ack->sack_bitmap));
+ for_each_set_bit(bit, sack_bitmap, UET_PDC_SACK_BITS) {
+ /* skip bits that were already acked */
+ if (sack_base_psn + bit <= pdc->tx_base_psn) {
+ if (sack_base_psn + bit == pdc->tx_base_psn)
+ __uet_pdc_mpr_advance_tx(pdc, 1);
+ continue;
+ }
+ if (!psn_bit_valid((sack_base_psn + bit) - pdc->tx_base_psn))
+ break;
+ if (test_and_set_bit((sack_base_psn + bit) - pdc->tx_base_psn,
+ pdc->ack_bitmap))
+ continue;
+ uet_pdc_ack_psn(pdc, skb, sack_base_psn + bit, ecn_marked);
+ }
+}
+
int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
__be32 remote_fep_addr)
{
@@ -521,10 +596,11 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
struct uet_pds_ack_hdr *ack = pds_ack_hdr(skb);
s16 ack_psn_offset = be16_to_cpu(ack->ack_psn_offset);
const char *drop_reason = "ack_psn not in MPR window";
+ struct uet_pds_ack_ext_hdr *ext_ack = NULL;
u32 cack_psn = be32_to_cpu(ack->cack_psn);
u32 ack_psn = cack_psn + ack_psn_offset;
+ bool is_sack = false, ecn_marked;
int ret = -EINVAL;
- bool ecn_marked;
u32 psn_bit;
spin_lock(&pdc->lock);
@@ -545,9 +621,16 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
drop_reason = "ack_psn bit is invalid";
goto err_dbg;
}
+ if (uet_prologue_type(&ack->prologue) == UET_PDS_TYPE_ACK_CC) {
+ ext_ack = pds_ack_ext_hdr(skb);
+ is_sack = !!ext_ack->sack_bitmap;
+ }
if (test_and_set_bit(psn_bit, pdc->ack_bitmap)) {
- drop_reason = "ack_psn bit already set in ack_bitmap";
- goto err_dbg;
+ /* SACK packets can include already acked packets */
+ if (!is_sack) {
+ drop_reason = "ack_psn bit already set in ack_bitmap";
+ goto err_dbg;
+ }
}
/* either using ROD mode or in SYN_SENT state */
@@ -573,6 +656,9 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
if (cack_psn != ack_psn)
uet_pdc_ack_psn(pdc, skb, ack_psn, ecn_marked);
+ if (is_sack)
+ uet_pdc_rx_sack(pdc, skb, cack_psn, ext_ack, ecn_marked);
+
ret = 0;
switch (pdc->state) {
case UET_PDC_EP_STATE_SYN_SENT:
diff --git a/drivers/ultraeth/uet_pds.c b/drivers/ultraeth/uet_pds.c
index 52122998079d..436b63189800 100644
--- a/drivers/ultraeth/uet_pds.c
+++ b/drivers/ultraeth/uet_pds.c
@@ -266,6 +266,9 @@ int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
prologue = pds_prologue_hdr(skb);
switch (uet_prologue_type(prologue)) {
+ case UET_PDS_TYPE_ACK_CC:
+ offset += sizeof(struct uet_pds_ack_ext_hdr);
+ fallthrough;
case UET_PDS_TYPE_ACK:
if (!uet_pds_rx_valid_ack_next_hdr(prologue))
break;
diff --git a/include/net/ultraeth/uet_pdc.h b/include/net/ultraeth/uet_pdc.h
index 8a87fc0bc869..d6710f92fb16 100644
--- a/include/net/ultraeth/uet_pdc.h
+++ b/include/net/ultraeth/uet_pdc.h
@@ -20,6 +20,8 @@
NSEC_PER_SEC)
#define UET_PDC_RTX_DEFAULT_MAX 3
#define UET_PDC_MPR 128
+#define UET_PDC_SACK_BITS 64
+#define UET_PDC_SACK_MASK (U64_MAX << 3)
#define UET_SKB_CB(skb) ((struct uet_skb_cb *)&((skb)->cb[0]))
@@ -93,6 +95,8 @@ struct uet_pdc {
u32 rx_base_psn;
u32 tx_base_psn;
+ u32 lowest_unack_psn;
+ u32 max_rcv_psn;
u32 ack_gen_trigger;
u32 ack_gen_min_pkt_add;
@@ -146,4 +150,10 @@ static inline bool before(u32 seq1, u32 seq2)
{
return (s32)(seq1-seq2) < 0;
}
+
+static inline bool pdc_should_sack(const struct uet_pdc *pdc)
+{
+ return pdc->lowest_unack_psn > pdc->rx_base_psn &&
+ pdc->lowest_unack_psn < pdc->max_rcv_psn;
+}
#endif /* _UECON_PDC_H */
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
index cc39bf970e08..3b8e95d7ed7b 100644
--- a/include/uapi/linux/ultraeth.h
+++ b/include/uapi/linux/ultraeth.h
@@ -152,6 +152,46 @@ struct uet_pds_ack_hdr {
__be16 dpdcid;
} __attribute__ ((__packed__));
+/* ext ack CC flags */
+enum {
+ UET_PDS_ACK_EXT_CC_F_RSVD = (1 << 0)
+};
+
+/* field: cc_type_mpr_sack_off */
+#define UET_PDS_ACK_EXT_MPR_BITS 8
+#define UET_PDS_ACK_EXT_MPR_MASK 0xff
+#define UET_PDS_ACK_EXT_CC_FLAGS_BITS 4
+#define UET_PDS_ACK_EXT_CC_FLAGS_MASK 0xf
+#define UET_PDS_ACK_EXT_CC_FLAGS_SHIFT UET_PDS_ACK_EXT_MPR_BITS
+#define UET_PDS_ACK_EXT_CC_TYPE_BITS 4
+#define UET_PDS_ACK_EXT_CC_TYPE_MASK 0xf
+#define UET_PDS_ACK_EXT_CC_TYPE_SHIFT (UET_PDS_ACK_EXT_CC_FLAGS_SHIFT + \
+ UET_PDS_ACK_EXT_CC_FLAGS_BITS)
+/* header used for ACK_CC */
+struct uet_pds_ack_ext_hdr {
+ __be16 cc_type_flags_mpr;
+ __be16 sack_psn_offset;
+ __be64 sack_bitmap;
+ __be64 ack_cc_state;
+} __attribute__ ((__packed__));
+
+static inline __u8 uet_pds_ack_ext_mpr(const struct uet_pds_ack_ext_hdr *ack)
+{
+ return __be16_to_cpu(ack->cc_type_flags_mpr) & UET_PDS_ACK_EXT_MPR_MASK;
+}
+
+static inline __u8 uet_pds_ack_ext_cc_flags(const struct uet_pds_ack_ext_hdr *ack)
+{
+ return (__be16_to_cpu(ack->cc_type_flags_mpr) >> UET_PDS_ACK_EXT_CC_FLAGS_SHIFT) &
+ UET_PDS_ACK_EXT_CC_FLAGS_MASK;
+}
+
+static inline __u8 uet_pds_ack_ext_cc_type(const struct uet_pds_ack_ext_hdr *ack)
+{
+ return (__be16_to_cpu(ack->cc_type_flags_mpr) >> UET_PDS_ACK_EXT_CC_TYPE_SHIFT) &
+ UET_PDS_ACK_EXT_CC_TYPE_MASK;
+}
+
/* ses request op codes */
enum {
UET_SES_REQ_OP_NOOP = 0x00,
--
2.48.1
* [RFC PATCH 11/13] drivers: ultraeth: add nack support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (9 preceding siblings ...)
2025-03-06 23:02 ` [RFC PATCH 10/13] drivers: ultraeth: add sack support Nikolay Aleksandrov
@ 2025-03-06 23:02 ` Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 12/13] drivers: ultraeth: add initiator and target idle timeout support Nikolay Aleksandrov
` (3 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:02 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Add the NACK header format with codes, plus helpers that allow sending
NACKs; they construct the NACK packet dynamically and don't rely on a
pre-existing PDC. Send back NACK packets if an error occurs while
receiving a request.
The following events trigger NACKs:
- DPDCID not found and SYN not set (UET_PDS_NACK_INV_DPDCID)
- DPDCID not found and job/fep are invalid (UET_PDS_NACK_NO_RESOURCE)
- DPDCID not found and local FEP address mismatches
(UET_PDS_NACK_PDC_HDR_MISMATCH)
- DPDCID is found but mode doesn't match
(UET_PDS_NACK_PDC_MODE_MISMATCH)
- DPDCID is found but PSN is wrong (UET_PDS_NACK_PSN_OOR_WINDOW or
UET_PDS_NACK_INVALID_SYN if packet is with SYN)
Process received PDC_FATAL NACKs by destroying the PDC; the rest are
silently ignored (see the sketch below).
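A standalone sketch of the fatal-code classification performed by
uet_pdc_rx_nack() (userspace C; the code values match the uapi enum this
patch adds, the helper name is ours):

#include <stdbool.h>
#include <stdint.h>

enum {
	NACK_NO_RESOURCE	= 0x0A,
	NACK_INV_DPDCID		= 0x0E,
	NACK_PDC_HDR_MISMATCH	= 0x0F,
	NACK_CLOSING_IN_ERR	= 0x11,
	NACK_INVALID_SYN	= 0x15,
	NACK_PDC_MODE_MISMATCH	= 0x16,
};

/* a PDC_FATAL NACK destroys the PDC; everything else is ignored */
static bool nack_is_pdc_fatal(uint8_t code)
{
	switch (code) {
	case NACK_CLOSING_IN_ERR:
	case NACK_INV_DPDCID:
	case NACK_NO_RESOURCE:
	case NACK_PDC_HDR_MISMATCH:
	case NACK_INVALID_SYN:
	case NACK_PDC_MODE_MISMATCH:
		return true;
	default:
		return false;
	}
}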
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/uet_pdc.c | 79 ++++++++++++++++++++++++++--
drivers/ultraeth/uet_pds.c | 95 ++++++++++++++++++++++++++++++++--
include/net/ultraeth/uet_pdc.h | 3 ++
include/net/ultraeth/uet_pds.h | 10 ++++
include/uapi/linux/ultraeth.h | 55 ++++++++++++++++++++
5 files changed, 235 insertions(+), 7 deletions(-)
diff --git a/drivers/ultraeth/uet_pdc.c b/drivers/ultraeth/uet_pdc.c
index e9469edd9014..4f19bc68b570 100644
--- a/drivers/ultraeth/uet_pdc.c
+++ b/drivers/ultraeth/uet_pdc.c
@@ -6,6 +6,21 @@
#include <net/ultraeth/uet_context.h>
#include <net/ultraeth/uet_pdc.h>
+struct metadata_dst *uet_pdc_dst(const struct uet_pdc_key *key, __be16 dport,
+ u8 tos)
+{
+ IP_TUNNEL_DECLARE_FLAGS(md_flags) = { };
+ struct metadata_dst *mdst;
+
+ mdst = __ip_tun_set_dst(key->src_ip, key->dst_ip, tos, 0, dport,
+ md_flags, 0, 0);
+ if (!mdst)
+ return NULL;
+ mdst->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
+
+ return mdst;
+}
+
static void uet_pdc_xmit(struct uet_pdc *pdc, struct sk_buff *skb)
{
skb->dev = pds_netdev(pdc->pds);
@@ -241,7 +256,6 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
const struct uet_pdc_key *key, bool is_inbound)
{
struct uet_pdc *pdc, *pdc_ins = ERR_PTR(-ENOMEM);
- IP_TUNNEL_DECLARE_FLAGS(md_flags) = { };
int ret __maybe_unused;
switch (mode) {
@@ -287,8 +301,7 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
if (!pdc->ack_bitmap)
goto err_ack_bitmap;
timer_setup(&pdc->rtx_timer, uet_pdc_rtx_timer_expired, 0);
- pdc->metadata = __ip_tun_set_dst(key->src_ip, key->dst_ip, tos, 0, dport,
- md_flags, 0, 0);
+ pdc->metadata = uet_pdc_dst(key, dport, tos);
if (!pdc->metadata)
goto err_tun_dst;
@@ -731,6 +744,19 @@ static void uet_pdc_rx_req_handle_ack(struct uet_pdc *pdc, unsigned int len,
}
}
+static bool uet_pdc_req_validate_mode(const struct uet_pdc *pdc,
+ const struct uet_pds_req_hdr *req)
+{
+ switch (uet_prologue_type(&req->prologue)) {
+ case UET_PDS_TYPE_RUD_REQ:
+ return pdc->mode == UET_PDC_MODE_RUD;
+ case UET_PDS_TYPE_ROD_REQ:
+ return pdc->mode == UET_PDC_MODE_ROD;
+ }
+
+ return false;
+}
+
int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
__be32 remote_fep_addr, __u8 tos)
{
@@ -743,6 +769,7 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
unsigned int len = skb->len;
bool first_ack = false;
enum mpr_pos psn_pos;
+ __u8 nack_code = 0;
int ret = -EINVAL;
spin_lock(&pdc->lock);
@@ -761,6 +788,11 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
if (unlikely(pdc->tx_busy))
goto err_dbg;
+ if (!uet_pdc_req_validate_mode(pdc, req)) {
+ drop_reason = "pdc mode doesn't match request";
+ nack_code = UET_PDS_NACK_PDC_MODE_MISMATCH;
+ goto err_dbg;
+ }
if (req_flags & UET_PDS_REQ_FLAG_RETX)
ack_flags |= UET_PDS_ACK_FLAG_RETX;
@@ -770,10 +802,15 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
switch (psn_pos) {
case UET_PDC_MPR_FUTURE:
drop_reason = "req psn is in a future MPR window";
+ if (req_flags & UET_PDS_REQ_FLAG_SYN)
+ nack_code = UET_PDS_NACK_INVALID_SYN;
+ else
+ nack_code = UET_PDS_NACK_PSN_OOR_WINDOW;
goto err_dbg;
case UET_PDC_MPR_PREV:
if ((int)(req_psn - pdc->rx_base_psn) < S16_MIN) {
drop_reason = "req psn is too far in the past";
+ nack_code = UET_PDS_NACK_PSN_OOR_WINDOW;
goto err_dbg;
}
uet_pdc_send_ses_ack(pdc, UET_SES_RSP_RC_NULL, ses_req->msg_id,
@@ -805,6 +842,7 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
if (!psn_bit_valid(psn_bit)) {
drop_reason = "req psn bit is invalid";
+ nack_code = UET_PDS_NACK_PSN_OOR_WINDOW;
goto err_dbg;
}
if (test_and_set_bit(psn_bit, pdc->rx_bitmap)) {
@@ -844,5 +882,40 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
pdc->state, pdc->dpdcid, pdc->spdcid,
be16_to_cpu(ses_req->msg_id), be32_to_cpu(req->psn),
be16_to_cpu(req->spdcid), be16_to_cpu(req->dpdcid));
+
+ if (nack_code)
+ uet_pds_send_nack(pdc->pds, &pdc->key,
+ pdc->metadata->u.tun_info.key.tp_dst, 0,
+ cpu_to_be16(pdc->spdcid),
+ cpu_to_be16(pdc->dpdcid),
+ nack_code, req->psn,
+ pds_req_to_nack_flags(req_flags));
goto out;
}
+
+void uet_pdc_rx_nack(struct uet_pdc *pdc, struct sk_buff *skb)
+{
+ struct uet_pds_nack_hdr *nack = pds_nack_hdr(skb);
+ u32 nack_psn = be32_to_cpu(nack->nack_psn_pkt_id);
+
+ spin_lock(&pdc->lock);
+ netdev_dbg(pds_netdev(pdc->pds), "%s: NACK pdc: [ spdcid: %u dpdcid: %u rx_base_psn %u ] "
+ "nack header: [ nack_code: %u vendor_code: %u nack_psn: %u ]\n",
+ __func__, pdc->spdcid, pdc->dpdcid, pdc->rx_base_psn,
+ nack->nack_code, nack->vendor_code, nack_psn);
+ if (psn_mpr_pos(pdc->rx_base_psn, nack_psn) != UET_PDC_MPR_CUR)
+ goto out;
+ switch (nack->nack_code) {
+ /* PDC_FATAL codes */
+ case UET_PDS_NACK_CLOSING_IN_ERR:
+ case UET_PDS_NACK_INV_DPDCID:
+ case UET_PDS_NACK_NO_RESOURCE:
+ case UET_PDS_NACK_PDC_HDR_MISMATCH:
+ case UET_PDS_NACK_INVALID_SYN:
+ case UET_PDS_NACK_PDC_MODE_MISMATCH:
+ uet_pdc_destroy(pdc);
+ break;
+ }
+out:
+ spin_unlock(&pdc->lock);
+}
diff --git a/drivers/ultraeth/uet_pds.c b/drivers/ultraeth/uet_pds.c
index 436b63189800..c144b6df8327 100644
--- a/drivers/ultraeth/uet_pds.c
+++ b/drivers/ultraeth/uet_pds.c
@@ -149,6 +149,46 @@ void uet_pds_clean_job(struct uet_pds *pds, u32 job_id)
rhashtable_walk_exit(&iter);
}
+static void uet_pds_build_nack(struct sk_buff *skb, __be16 spdcid, __be16 dpdcid,
+ u8 nack_code, __be32 nack_psn, u8 flags)
+{
+ struct uet_pds_nack_hdr *nack = skb_put(skb, sizeof(*nack));
+
+ uet_pdc_build_prologue(&nack->prologue, UET_PDS_TYPE_NACK,
+ UET_PDS_NEXT_HDR_NONE, flags);
+ nack->nack_code = nack_code;
+ nack->vendor_code = 0;
+ nack->nack_psn_pkt_id = nack_psn;
+ nack->spdcid = spdcid;
+ nack->dpdcid = dpdcid;
+ nack->payload = 0;
+}
+
+void uet_pds_send_nack(struct uet_pds *pds, const struct uet_pdc_key *key,
+ __be16 dport, u8 tos, __be16 spdcid, __be16 dpdcid,
+ __u8 nack_code, __be32 nack_psn, __u8 flags)
+{
+ struct metadata_dst *mdst;
+ struct sk_buff *skb;
+
+ if (WARN_ON_ONCE(!key))
+ return;
+
+ skb = alloc_skb(sizeof(struct uet_pds_nack_hdr), GFP_ATOMIC);
+ if (!skb)
+ return;
+
+ skb->dev = pds_netdev(pds);
+ uet_pds_build_nack(skb, spdcid, dpdcid, nack_code, nack_psn, flags);
+ mdst = uet_pdc_dst(key, dport, tos);
+ if (!mdst) {
+ kfree_skb(skb);
+ return;
+ }
+ skb_dst_set(skb, &mdst->dst);
+ dev_queue_xmit(skb);
+}
+
static int uet_pds_rx_ack(struct uet_pds *pds, struct sk_buff *skb,
__be32 local_fep_addr, __be32 remote_fep_addr)
{
@@ -164,6 +204,20 @@ static int uet_pds_rx_ack(struct uet_pds *pds, struct sk_buff *skb,
return uet_pdc_rx_ack(pdc, skb, remote_fep_addr);
}
+static void uet_pds_rx_nack(struct uet_pds *pds, struct sk_buff *skb)
+{
+ struct uet_pds_nack_hdr *nack = pds_nack_hdr(skb);
+ u16 pdcid = be16_to_cpu(nack->dpdcid);
+ struct uet_pdc *pdc;
+
+ pdc = rhashtable_lookup_fast(&pds->pdcid_hash, &pdcid,
+ uet_pds_pdcid_rht_params);
+ if (!pdc)
+ return;
+
+ uet_pdc_rx_nack(pdc, skb);
+}
+
static struct uet_pdc *uet_pds_new_pdc_rx(struct uet_pds *pds,
struct sk_buff *skb,
__be16 dport, u32 ack_gen_trigger,
@@ -201,21 +255,45 @@ static int uet_pds_rx_req(struct uet_pds *pds, struct sk_buff *skb,
/* new flow */
if (unlikely(!pdc)) {
struct uet_prologue_hdr *prologue = pds_prologue_hdr(skb);
+ __u8 req_flags = uet_prologue_flags(prologue);
struct uet_context *ctx;
struct uet_job *job;
- if (!(uet_prologue_flags(prologue) & UET_PDS_REQ_FLAG_SYN))
+ if (!(req_flags & UET_PDS_REQ_FLAG_SYN)) {
+ uet_pds_send_nack(pds, &key, dport, 0, 0,
+ pds_req->spdcid,
+ UET_PDS_NACK_INV_DPDCID, pds_req->psn,
+ pds_req_to_nack_flags(req_flags));
return -EINVAL;
+ }
ctx = container_of(pds, struct uet_context, pds);
job = uet_job_find(&ctx->job_reg, key.job_id);
- if (!job)
+ if (!job) {
+ uet_pds_send_nack(pds, &key, dport, 0, 0,
+ pds_req->spdcid,
+ UET_PDS_NACK_NO_RESOURCE,
+ pds_req->psn,
+ pds_req_to_nack_flags(req_flags));
return -ENOENT;
+ }
fep = rcu_dereference(job->fep);
- if (!fep)
+ if (!fep) {
+ uet_pds_send_nack(pds, &key, dport, 0, 0,
+ pds_req->spdcid,
+ UET_PDS_NACK_NO_RESOURCE,
+ pds_req->psn,
+ pds_req_to_nack_flags(req_flags));
return -ECONNREFUSED;
- if (fep->addr.in_address.ip != local_fep_addr)
+ }
+ if (fep->addr.in_address.ip != local_fep_addr) {
+ uet_pds_send_nack(pds, &key, dport, 0, 0,
+ pds_req->spdcid,
+ UET_PDS_NACK_PDC_HDR_MISMATCH,
+ pds_req->psn,
+ pds_req_to_nack_flags(req_flags));
return -ENOENT;
+ }
pdc = uet_pds_new_pdc_rx(pds, skb, dport, fep->ack_gen_trigger,
fep->ack_gen_min_pkt_add, &key,
@@ -290,6 +368,15 @@ int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
ret = uet_pds_rx_req(pds, skb, local_fep_addr, remote_fep_addr,
dport, tos);
break;
+ case UET_PDS_TYPE_NACK:
+ if (uet_prologue_next_hdr(prologue) != UET_PDS_NEXT_HDR_NONE)
+ break;
+ offset += sizeof(struct uet_pds_nack_hdr);
+ if (!pskb_may_pull(skb, offset))
+ break;
+ ret = 0;
+ uet_pds_rx_nack(pds, skb);
+ break;
default:
break;
}
diff --git a/include/net/ultraeth/uet_pdc.h b/include/net/ultraeth/uet_pdc.h
index d6710f92fb16..60aecc15d0f1 100644
--- a/include/net/ultraeth/uet_pdc.h
+++ b/include/net/ultraeth/uet_pdc.h
@@ -120,6 +120,9 @@ int uet_pdc_rx_req(struct uet_pdc *pdc, struct sk_buff *skb,
int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
__be32 remote_fep_addr);
int uet_pdc_tx_req(struct uet_pdc *pdc, struct sk_buff *skb, u8 type);
+void uet_pdc_rx_nack(struct uet_pdc *pdc, struct sk_buff *skb);
+struct metadata_dst *uet_pdc_dst(const struct uet_pdc_key *key, __be16 dport,
+ u8 tos);
static inline void uet_pdc_build_prologue(struct uet_prologue_hdr *prologue,
u8 type, u8 next, u8 flags)
diff --git a/include/net/ultraeth/uet_pds.h b/include/net/ultraeth/uet_pds.h
index 78624370f18c..4e9794a4d3de 100644
--- a/include/net/ultraeth/uet_pds.h
+++ b/include/net/ultraeth/uet_pds.h
@@ -7,6 +7,7 @@
#include <linux/rhashtable.h>
#include <uapi/linux/ultraeth.h>
#include <linux/skbuff.h>
+#include <net/ultraeth/uet_pdc.h>
/**
* struct uet_pds - Packet Delivery Sublayer state structure
@@ -43,6 +44,10 @@ int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
int uet_pds_tx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
__be32 remote_fep_addr, __be16 dport, u32 job_id);
+void uet_pds_send_nack(struct uet_pds *pds, const struct uet_pdc_key *key,
+ __be16 dport, u8 tos, __be16 spdcid, __be16 dpdcid,
+ __u8 nack_code, __be32 nack_psn, __u8 flags);
+
static inline struct uet_prologue_hdr *pds_prologue_hdr(const struct sk_buff *skb)
{
return (struct uet_prologue_hdr *)skb_network_header(skb);
@@ -92,4 +97,9 @@ static inline __be16 pds_ses_rsp_hdr_pack(__u8 opcode, __u8 version, __u8 list,
(ses_rc & UET_SES_RSP_RC_MASK) <<
UET_SES_RSP_RC_SHIFT);
}
+
+static inline __u8 pds_req_to_nack_flags(__u8 req_flags)
+{
+ return req_flags & UET_PDS_REQ_FLAG_RETX ? UET_PDS_NACK_FLAG_RETX : 0;
+}
#endif /* _UECON_PDS_H */
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
index 3b8e95d7ed7b..53d2124bc285 100644
--- a/include/uapi/linux/ultraeth.h
+++ b/include/uapi/linux/ultraeth.h
@@ -192,6 +192,61 @@ static inline __u8 uet_pds_ack_ext_cc_type(const struct uet_pds_ack_ext_hdr *ack
UET_PDS_ACK_EXT_CC_TYPE_MASK;
}
+/* NACK codes */
+enum {
+ UET_PDS_NACK_TRIMMED = 0x01,
+ UET_PDS_NACK_TRIMMED_LASTHOP = 0x02,
+ UET_PDS_NACK_TRIMMED_ACK = 0x03,
+ UET_PDS_NACK_NO_PDC_AVAIL = 0x04,
+ UET_PDS_NACK_NO_CCC_AVAIL = 0x05,
+ UET_PDS_NACK_NO_BITMAP = 0x06,
+ UET_PDS_NACK_NO_PKT_BUFFER = 0x07,
+ UET_PDS_NACK_NO_GTD_DEL_AVAIL = 0x08,
+ UET_PDS_NACK_NO_SES_MSG_AVAIL = 0x09,
+ UET_PDS_NACK_NO_RESOURCE = 0x0A,
+ UET_PDS_NACK_PSN_OOR_WINDOW = 0x0B,
+ UET_PDS_NACK_FIRST_ROD_OOO = 0x0C,
+ UET_PDS_NACK_ROD_OOO = 0x0D,
+ UET_PDS_NACK_INV_DPDCID = 0x0E,
+ UET_PDS_NACK_PDC_HDR_MISMATCH = 0x0F,
+ UET_PDS_NACK_CLOSING = 0x10,
+ UET_PDS_NACK_CLOSING_IN_ERR = 0x11,
+ UET_PDS_NACK_PKT_NOT_RCVD = 0x12,
+ UET_PDS_NACK_GTD_RESP_UNAVAIL = 0x13,
+ UET_PDS_NACK_ACK_WITH_DATA = 0x14,
+ UET_PDS_NACK_INVALID_SYN = 0x15,
+ UET_PDS_NACK_PDC_MODE_MISMATCH = 0x16,
+ UET_PDS_NACK_NEW_START_PSN = 0x17,
+ UET_PDS_NACK_RCVD_SES_PROCG = 0x18,
+ UET_PDS_NACK_UNEXP_EVENT = 0x19,
+ UET_PDS_NACK_RCVR_INFER_LOSS = 0x1A,
+ /* 0x1B - 0xFC reserved for UET */
+ UET_PDS_NACK_EXP_NACK_NORMAL = 0xFD,
+ UET_PDS_NACK_T_EXP_NACK_ERR = 0xFE,
+ UET_PDS_NACK_EXP_NACK_FATAL = 0xFF
+};
+
+/* NACK flags */
+enum {
+ UET_PDS_NACK_FLAG_RSV21 = (1 << 0),
+ UET_PDS_NACK_FLAG_RSV22 = (1 << 1),
+ UET_PDS_NACK_FLAG_RSV23 = (1 << 2),
+ UET_PDS_NACK_FLAG_NT = (1 << 3),
+ UET_PDS_NACK_FLAG_RETX = (1 << 4),
+ UET_PDS_NACK_FLAG_M = (1 << 5),
+ UET_PDS_NACK_FLAG_RSV = (1 << 6)
+};
+
+struct uet_pds_nack_hdr {
+ struct uet_prologue_hdr prologue;
+ __u8 nack_code;
+ __u8 vendor_code;
+ __be32 nack_psn_pkt_id;
+ __be16 spdcid;
+ __be16 dpdcid;
+ __be32 payload;
+} __attribute__ ((__packed__));
+
/* ses request op codes */
enum {
UET_SES_REQ_OP_NOOP = 0x00,
--
2.48.1
* [RFC PATCH 12/13] drivers: ultraeth: add initiator and target idle timeout support
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (10 preceding siblings ...)
2025-03-06 23:02 ` [RFC PATCH 11/13] drivers: ultraeth: add nack support Nikolay Aleksandrov
@ 2025-03-06 23:02 ` Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 13/13] HACK: drivers: ultraeth: add char device Nikolay Aleksandrov
` (2 subsequent siblings)
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:02 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
Add a control packet header structure and a helper that builds a control
packet and transmits it. Currently only the CLOSE types are supported;
they are used to implement initiator and target idle timeout support via
the close state machine. Upon initiator timeout we move to either the
ACK_WAIT state (if there are pending acks) or the CLOSE_ACK_WAIT state;
in the latter case we also send a control message of CLOSE type. Upon
target timeout we issue a REQ_CLOSE control message, and if it isn't
answered within the timeout period we send a NACK with CLOSING_IN_ERR
and close the PDC.
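A standalone sketch of those timeout transitions (userspace C, mirroring
uet_pdc_close() in this patch; state and helper names are ours):

#include <stdbool.h>

enum pdc_state {
	ESTABLISHED,
	ACK_WAIT,	/* drain pending unacked packets first */
	CLOSE_ACK_WAIT,	/* initiator: CLOSE sent, waiting for its ack */
	CLOSE_WAIT,	/* target: REQ_CLOSE sent, waiting for CLOSE */
};

/* next state on idle timeout; *send_close / *send_req_close tell the
 * caller which control message to transmit
 */
static enum pdc_state idle_timeout(bool is_initiator, bool rtx_queue_empty,
				   bool *send_close, bool *send_req_close)
{
	*send_close = false;
	*send_req_close = false;
	if (!rtx_queue_empty)
		return ACK_WAIT;
	if (is_initiator) {
		*send_close = true;	/* retransmitted until acked */
		return CLOSE_ACK_WAIT;
	}
	*send_req_close = true;		/* unanswered -> NACK CLOSING_IN_ERR */
	return CLOSE_WAIT;
}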
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
Signed-off-by: Alex Badea <alex.badea@keysight.com>
---
drivers/ultraeth/uet_pdc.c | 241 ++++++++++++++++++++++++++++++---
drivers/ultraeth/uet_pds.c | 40 +++++-
include/net/ultraeth/uet_pdc.h | 12 +-
include/net/ultraeth/uet_pds.h | 5 +
include/uapi/linux/ultraeth.h | 19 +++
5 files changed, 293 insertions(+), 24 deletions(-)
diff --git a/drivers/ultraeth/uet_pdc.c b/drivers/ultraeth/uet_pdc.c
index 4f19bc68b570..5967095867dc 100644
--- a/drivers/ultraeth/uet_pdc.c
+++ b/drivers/ultraeth/uet_pdc.c
@@ -21,6 +21,14 @@ struct metadata_dst *uet_pdc_dst(const struct uet_pdc_key *key, __be16 dport,
return mdst;
}
+void uet_pdc_rx_refresh(struct uet_pdc *pdc)
+{
+ unsigned long rx_jiffies = jiffies;
+
+ if (rx_jiffies != READ_ONCE(pdc->rx_last_jiffies))
+ WRITE_ONCE(pdc->rx_last_jiffies, rx_jiffies);
+}
+
static void uet_pdc_xmit(struct uet_pdc *pdc, struct sk_buff *skb)
{
skb->dev = pds_netdev(pdc->pds);
@@ -97,10 +105,19 @@ static void uet_pdc_rtx_timer_expired(struct timer_list *t)
continue;
}
if (UET_SKB_CB(skb)->rtx_attempts == UET_PDC_RTX_DEFAULT_MAX) {
+ struct uet_prologue_hdr *prologue;
+
/* XXX: close connection, count drops etc */
- netdev_dbg(pds_netdev(pdc->pds), "%s: psn: %u too many rtx attempts: %u\n",
+ prologue = (struct uet_prologue_hdr *)skb->data;
+ netdev_dbg(pds_netdev(pdc->pds), "%s: psn: %u type: %u too many rtx attempts: %u\n",
__func__, UET_SKB_CB(skb)->psn,
+ uet_prologue_type(prologue),
UET_SKB_CB(skb)->rtx_attempts);
+ if (uet_prologue_type(prologue) == UET_PDS_TYPE_CTRL_MSG &&
+ uet_prologue_ctl_type(prologue) == UET_CTL_TYPE_CLOSE) {
+ uet_pdc_destroy(pdc);
+ goto out_unlock;
+ }
/* if dropping the oldest packet move window */
if (UET_SKB_CB(skb)->psn == pdc->tx_base_psn)
uet_pdc_mpr_advance_tx(pdc, 1);
@@ -114,6 +131,7 @@ static void uet_pdc_rtx_timer_expired(struct timer_list *t)
mod_timer(&pdc->rtx_timer, jiffies +
nsecs_to_jiffies(smallest_diff));
+out_unlock:
spin_unlock(&pdc->lock);
}
@@ -228,6 +246,154 @@ static int uet_pdc_rtx_queue(struct uet_pdc *pdc, struct sk_buff *skb, u32 psn)
return 0;
}
+static s64 uet_pdc_get_psn(struct uet_pdc *pdc)
+{
+ unsigned long fzb = find_first_zero_bit(pdc->tx_bitmap, UET_PDC_MPR);
+
+ if (unlikely(fzb == UET_PDC_MPR))
+ return -1;
+
+ set_bit(fzb, pdc->tx_bitmap);
+
+ return pdc->tx_base_psn + fzb;
+}
+
+static void uet_pdc_put_psn(struct uet_pdc *pdc, u32 psn)
+{
+ unsigned long psn_bit = psn - pdc->tx_base_psn;
+
+ clear_bit(psn_bit, pdc->tx_bitmap);
+}
+
+static int uet_pdc_tx_ctl(struct uet_pdc *pdc, u8 ctl_type, u8 flags,
+ __be32 psn, __be32 payload)
+{
+ struct uet_pds_ctl_hdr *ctl;
+ struct sk_buff *skb;
+ int ret;
+
+ /* both CLOSE types need to be retransmitted and need a new PSN */
+ switch (ctl_type) {
+ case UET_CTL_TYPE_CLOSE:
+ case UET_CTL_TYPE_REQ_CLOSE:
+ /* payload & psn must be 0 */
+ if (payload || psn)
+ return -EINVAL;
+ /* AR must be set */
+ flags |= UET_PDS_CTL_FLAG_AR;
+ break;
+ default:
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
+ skb = alloc_skb(sizeof(struct uet_pds_ctl_hdr), GFP_ATOMIC);
+ if (!skb)
+ return -ENOBUFS;
+ ctl = skb_put(skb, sizeof(*ctl));
+ uet_pdc_build_prologue(&ctl->prologue, UET_PDS_TYPE_CTRL_MSG,
+ ctl_type, flags);
+ if (!psn) {
+ s64 psn_new = uet_pdc_get_psn(pdc);
+
+ if (psn_new == -1) {
+ kfree_skb(skb);
+ return -ENOSPC;
+ }
+ psn = cpu_to_be32(psn_new);
+ }
+ ctl->psn = psn;
+ ctl->spdcid = cpu_to_be16(pdc->spdcid);
+ ctl->dpdcid_pdc_info_offset = cpu_to_be16(pdc->dpdcid);
+ ctl->payload = payload;
+
+ ret = uet_pdc_rtx_queue(pdc, skb, be32_to_cpu(psn));
+ if (ret) {
+ uet_pdc_put_psn(pdc, be32_to_cpu(psn));
+ kfree_skb(skb);
+ return ret;
+ }
+ uet_pdc_xmit(pdc, skb);
+
+ return 0;
+}
+
+static void uet_pdc_close(struct uet_pdc *pdc)
+{
+ u8 state;
+ int ret;
+
+ /* we have already transmitted the close control packet */
+ if (pdc->state > UET_PDC_EP_STATE_ACK_WAIT)
+ return;
+
+ if (!RB_EMPTY_ROOT(&pdc->rtx_queue)) {
+ if (pdc->state == UET_PDC_EP_STATE_ACK_WAIT)
+ return;
+ state = UET_PDC_EP_STATE_ACK_WAIT;
+ } else {
+ u8 ctl_type, ctl_flags = 0;
+
+ if (pdc->is_initiator) {
+ ctl_type = UET_CTL_TYPE_CLOSE;
+ state = UET_PDC_EP_STATE_CLOSE_ACK_WAIT;
+ ctl_flags = UET_PDS_CTL_FLAG_AR;
+ } else {
+ ctl_type = UET_CTL_TYPE_REQ_CLOSE;
+ state = UET_PDC_EP_STATE_CLOSE_WAIT;
+ }
+ ret = uet_pdc_tx_ctl(pdc, ctl_type, ctl_flags, 0, 0);
+ if (ret)
+ return;
+ }
+
+ pdc->state = state;
+}
+
+static void uet_pdc_timeout_timer_expired(struct timer_list *t)
+{
+ struct uet_pdc *pdc = from_timer(pdc, t, timeout_timer);
+ unsigned long now = jiffies, last_rx;
+ bool rearm_timer = true;
+
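+ /* nothing to do if we saw rx activity within the idle timeout
+ * window (or the rx timestamp raced ahead of the timer)
+ */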
+ last_rx = READ_ONCE(pdc->rx_last_jiffies);
+ if (time_after_eq(last_rx, now) ||
+ time_after_eq(last_rx + UET_PDC_IDLE_TIMEOUT_JIFFIES, now))
+ goto rearm_timeout;
+ spin_lock(&pdc->lock);
+ switch (pdc->state) {
+ case UET_PDC_EP_STATE_ACK_WAIT:
+ uet_pdc_close(pdc);
+ fallthrough;
+ case UET_PDC_EP_STATE_CLOSE_WAIT:
+ case UET_PDC_EP_STATE_CLOSE_ACK_WAIT:
+ /* we waited too long for the last acks */
+ if (time_before_eq(last_rx + (UET_PDC_IDLE_TIMEOUT_JIFFIES * 2),
+ now)) {
+ if (!pdc->is_initiator)
+ uet_pds_send_nack(pdc->pds, &pdc->key,
+ pdc->metadata->u.tun_info.key.tp_dst,
+ 0,
+ cpu_to_be16(pdc->spdcid),
+ cpu_to_be16(pdc->dpdcid),
+ UET_PDS_NACK_CLOSING_IN_ERR,
+ cpu_to_be32(pdc->rx_base_psn + 1),
+ 0);
+ uet_pdc_destroy(pdc);
+ rearm_timer = false;
+ }
+ break;
+ default:
+ uet_pdc_close(pdc);
+ break;
+ }
+ spin_unlock(&pdc->lock);
+rearm_timeout:
+ if (rearm_timer)
+ mod_timer(&pdc->timeout_timer,
+ now + UET_PDC_IDLE_TIMEOUT_JIFFIES);
+}
+
/* use the approach as nf nat, try a few rounds starting at random offset */
static bool uet_pdc_id_get(struct uet_pdc *pdc)
{
@@ -301,6 +467,7 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
if (!pdc->ack_bitmap)
goto err_ack_bitmap;
timer_setup(&pdc->rtx_timer, uet_pdc_rtx_timer_expired, 0);
+ timer_setup(&pdc->timeout_timer, uet_pdc_timeout_timer_expired, 0);
pdc->metadata = uet_pdc_dst(key, dport, tos);
if (!pdc->metadata)
goto err_tun_dst;
@@ -331,6 +498,9 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
}
out:
+ mod_timer(&pdc->timeout_timer,
+ jiffies + UET_PDC_IDLE_TIMEOUT_JIFFIES);
+
return pdc_ins;
err_ep_insert:
@@ -351,6 +521,7 @@ struct uet_pdc *uet_pdc_create(struct uet_pds *pds, u32 rx_base_psn, u8 state,
void uet_pdc_free(struct uet_pdc *pdc)
{
+ timer_delete_sync(&pdc->timeout_timer);
timer_delete_sync(&pdc->rtx_timer);
uet_pdc_rtx_purge(pdc);
dst_release(&pdc->metadata->dst);
@@ -367,25 +538,6 @@ void uet_pdc_destroy(struct uet_pdc *pdc)
uet_pds_pdc_gc_queue(pdc);
}
-static s64 uet_pdc_get_psn(struct uet_pdc *pdc)
-{
- unsigned long fzb = find_first_zero_bit(pdc->tx_bitmap, UET_PDC_MPR);
-
- if (unlikely(fzb == UET_PDC_MPR))
- return -1;
-
- set_bit(fzb, pdc->tx_bitmap);
-
- return pdc->tx_base_psn + fzb;
-}
-
-static void uet_pdc_put_psn(struct uet_pdc *pdc, u32 psn)
-{
- unsigned long psn_bit = psn - pdc->tx_base_psn;
-
- clear_bit(psn_bit, pdc->tx_bitmap);
-}
-
static int uet_pdc_build_req(struct uet_pdc *pdc,
struct sk_buff *skb, u8 type, u8 flags)
{
@@ -685,8 +837,17 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
remote_fep_addr);
break;
case UET_PDC_EP_STATE_ACK_WAIT:
+ ret = uet_job_fep_queue_skb(pds_context(pdc->pds),
+ uet_ses_rsp_job_id(ses_rsp), skb,
+ remote_fep_addr);
+ if (!RB_EMPTY_ROOT(&pdc->rtx_queue) || ret < 0)
+ break;
+ uet_pdc_close(pdc);
+ ret = 1;
break;
case UET_PDC_EP_STATE_CLOSE_ACK_WAIT:
+ uet_pdc_destroy(pdc);
+ ret = 0;
break;
}
@@ -919,3 +1080,43 @@ void uet_pdc_rx_nack(struct uet_pdc *pdc, struct sk_buff *skb)
out:
spin_unlock(&pdc->lock);
}
+
+int uet_pdc_rx_ctl(struct uet_pdc *pdc, struct sk_buff *skb,
+ __be32 remote_fep_addr)
+{
+ struct uet_pds_ctl_hdr *ctl = pds_ctl_hdr(skb);
+ u32 ctl_psn = be32_to_cpu(ctl->psn);
+ int ret = -EINVAL;
+
+ spin_lock(&pdc->lock);
+ netdev_dbg(pds_netdev(pdc->pds), "%s: CTRL pdc: [ spdcid: %u dpdcid: %u rx_base_psn %u ] "
+ "ctrl header: [ ctl_type: %u psn: %u ]\n",
+ __func__, pdc->spdcid, pdc->dpdcid, pdc->rx_base_psn,
+ uet_prologue_ctl_type(&ctl->prologue), ctl_psn);
+ if (psn_mpr_pos(pdc->rx_base_psn, ctl_psn) != UET_PDC_MPR_CUR)
+ goto out;
+ switch (uet_prologue_ctl_type(&ctl->prologue)) {
+ case UET_CTL_TYPE_CLOSE:
+ /* only the initiator can send CLOSE */
+ if (pdc->is_initiator)
+ break;
+ ret = 0;
+ uet_pdc_send_ses_ack(pdc, UET_SES_RSP_RC_NULL, 0,
+ be32_to_cpu(ctl->psn),
+ 0, true);
+ uet_pdc_destroy(pdc);
+ break;
+ case UET_CTL_TYPE_REQ_CLOSE:
+ /* only the target can send REQ_CLOSE */
+ if (!pdc->is_initiator)
+ break;
+ uet_pdc_close(pdc);
+ break;
+ default:
+ break;
+ }
+out:
+ spin_unlock(&pdc->lock);
+
+ return ret;
+}
diff --git a/drivers/ultraeth/uet_pds.c b/drivers/ultraeth/uet_pds.c
index c144b6df8327..9ab0a088b308 100644
--- a/drivers/ultraeth/uet_pds.c
+++ b/drivers/ultraeth/uet_pds.c
@@ -195,13 +195,18 @@ static int uet_pds_rx_ack(struct uet_pds *pds, struct sk_buff *skb,
struct uet_pds_req_hdr *pds_req = pds_req_hdr(skb);
u16 pdcid = be16_to_cpu(pds_req->dpdcid);
struct uet_pdc *pdc;
+ int ret;
pdc = rhashtable_lookup_fast(&pds->pdcid_hash, &pdcid,
uet_pds_pdcid_rht_params);
if (!pdc)
return -ENOENT;
- return uet_pdc_rx_ack(pdc, skb, remote_fep_addr);
+ ret = uet_pdc_rx_ack(pdc, skb, remote_fep_addr);
+ if (ret >= 0)
+ uet_pdc_rx_refresh(pdc);
+
+ return ret;
}
static void uet_pds_rx_nack(struct uet_pds *pds, struct sk_buff *skb)
@@ -218,6 +223,26 @@ static void uet_pds_rx_nack(struct uet_pds *pds, struct sk_buff *skb)
uet_pdc_rx_nack(pdc, skb);
}
+static int uet_pds_rx_ctl(struct uet_pds *pds, struct sk_buff *skb,
+ __be32 remote_fep_addr)
+{
+ struct uet_pds_ctl_hdr *ctl = pds_ctl_hdr(skb);
+ u16 pdcid = be16_to_cpu(ctl->dpdcid_pdc_info_offset);
+ struct uet_pdc *pdc;
+ int ret;
+
+ pdc = rhashtable_lookup_fast(&pds->pdcid_hash, &pdcid,
+ uet_pds_pdcid_rht_params);
+ if (!pdc)
+ return -ENOENT;
+
+ ret = uet_pdc_rx_ctl(pdc, skb, remote_fep_addr);
+ if (ret >= 0)
+ uet_pdc_rx_refresh(pdc);
+
+ return ret;
+}
+
static struct uet_pdc *uet_pds_new_pdc_rx(struct uet_pds *pds,
struct sk_buff *skb,
__be16 dport, u32 ack_gen_trigger,
@@ -245,6 +270,7 @@ static int uet_pds_rx_req(struct uet_pds *pds, struct sk_buff *skb,
struct uet_pdc_key key = {};
struct uet_fep *fep;
struct uet_pdc *pdc;
+ int ret;
key.src_ip = local_fep_addr;
key.dst_ip = remote_fep_addr;
@@ -303,7 +329,11 @@ static int uet_pds_rx_req(struct uet_pds *pds, struct sk_buff *skb,
return PTR_ERR(pdc);
}
- return uet_pdc_rx_req(pdc, skb, remote_fep_addr, tos);
+ ret = uet_pdc_rx_req(pdc, skb, remote_fep_addr, tos);
+ if (ret >= 0)
+ uet_pdc_rx_refresh(pdc);
+
+ return ret;
}
static bool uet_pds_rx_valid_req_next_hdr(const struct uet_prologue_hdr *prologue)
@@ -368,6 +398,12 @@ int uet_pds_rx(struct uet_pds *pds, struct sk_buff *skb, __be32 local_fep_addr,
ret = uet_pds_rx_req(pds, skb, local_fep_addr, remote_fep_addr,
dport, tos);
break;
+ case UET_PDS_TYPE_CTRL_MSG:
+ offset += sizeof(struct uet_pds_ctl_hdr);
+ if (!pskb_may_pull(skb, offset))
+ break;
+ ret = uet_pds_rx_ctl(pds, skb, remote_fep_addr);
+ break;
case UET_PDS_TYPE_NACK:
if (uet_prologue_next_hdr(prologue) != UET_PDS_NEXT_HDR_NONE)
break;
diff --git a/include/net/ultraeth/uet_pdc.h b/include/net/ultraeth/uet_pdc.h
index 60aecc15d0f1..02d2d5716c48 100644
--- a/include/net/ultraeth/uet_pdc.h
+++ b/include/net/ultraeth/uet_pdc.h
@@ -22,6 +22,8 @@
#define UET_PDC_MPR 128
#define UET_PDC_SACK_BITS 64
#define UET_PDC_SACK_MASK (U64_MAX << 3)
+#define UET_PDC_IDLE_TIMEOUT_SEC 60
+#define UET_PDC_IDLE_TIMEOUT_JIFFIES (UET_PDC_IDLE_TIMEOUT_SEC * HZ)
#define UET_SKB_CB(skb) ((struct uet_skb_cb *)&((skb)->cb[0]))
@@ -38,7 +40,8 @@ enum {
UET_PDC_EP_STATE_ESTABLISHED,
UET_PDC_EP_STATE_QUIESCE,
UET_PDC_EP_STATE_ACK_WAIT,
- UET_PDC_EP_STATE_CLOSE_ACK_WAIT
+ UET_PDC_EP_STATE_CLOSE_ACK_WAIT,
+ UET_PDC_EP_STATE_CLOSE_WAIT
};
struct uet_pdc_key {
@@ -88,7 +91,7 @@ struct uet_pdc {
int rtx_max;
struct timer_list rtx_timer;
unsigned long rtx_timeout;
-
+ unsigned long rx_last_jiffies;
unsigned long *rx_bitmap;
unsigned long *tx_bitmap;
unsigned long *ack_bitmap;
@@ -102,6 +105,8 @@ struct uet_pdc {
u32 ack_gen_min_pkt_add;
u32 ack_gen_count;
+ struct timer_list timeout_timer;
+
struct rb_root rtx_queue;
struct hlist_node gc_node;
@@ -121,8 +126,11 @@ int uet_pdc_rx_ack(struct uet_pdc *pdc, struct sk_buff *skb,
__be32 remote_fep_addr);
int uet_pdc_tx_req(struct uet_pdc *pdc, struct sk_buff *skb, u8 type);
void uet_pdc_rx_nack(struct uet_pdc *pdc, struct sk_buff *skb);
+int uet_pdc_rx_ctl(struct uet_pdc *pdc, struct sk_buff *skb,
+ __be32 remote_fep_addr);
struct metadata_dst *uet_pdc_dst(const struct uet_pdc_key *key, __be16 dport,
u8 tos);
+void uet_pdc_rx_refresh(struct uet_pdc *pdc);
static inline void uet_pdc_build_prologue(struct uet_prologue_hdr *prologue,
u8 type, u8 next, u8 flags)
diff --git a/include/net/ultraeth/uet_pds.h b/include/net/ultraeth/uet_pds.h
index 4e9794a4d3de..fc2414cc2de8 100644
--- a/include/net/ultraeth/uet_pds.h
+++ b/include/net/ultraeth/uet_pds.h
@@ -73,6 +73,11 @@ static inline struct uet_pds_ack_ext_hdr *pds_ack_ext_hdr(const struct sk_buff *
return (struct uet_pds_ack_ext_hdr *)(pds_ack_hdr(skb) + 1);
}
+static inline struct uet_pds_ctl_hdr *pds_ctl_hdr(const struct sk_buff *skb)
+{
+ return (struct uet_pds_ctl_hdr *)skb_network_header(skb);
+}
+
static inline struct uet_ses_rsp_hdr *pds_ack_ses_rsp_hdr(const struct sk_buff *skb)
{
/* TODO: ack_ext_hdr, CC_STATE, etc. */
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
index 53d2124bc285..c1d5457073e1 100644
--- a/include/uapi/linux/ultraeth.h
+++ b/include/uapi/linux/ultraeth.h
@@ -247,6 +247,25 @@ struct uet_pds_nack_hdr {
__be32 payload;
} __attribute__ ((__packed__));
+/* control packet flags */
+enum {
+ UET_PDS_CTL_FLAG_RSV21 = (1 << 0),
+ UET_PDS_CTL_FLAG_RSV22 = (1 << 1),
+ UET_PDS_CTL_FLAG_SYN = (1 << 2),
+ UET_PDS_CTL_FLAG_AR = (1 << 3),
+ UET_PDS_CTL_FLAG_RETX = (1 << 4),
+ UET_PDS_CTL_FLAG_RSV11 = (1 << 5),
+ UET_PDS_CTL_FLAG_RSV12 = (1 << 6),
+};
+
+struct uet_pds_ctl_hdr {
+ struct uet_prologue_hdr prologue;
+ __be32 psn;
+ __be16 spdcid;
+ __be16 dpdcid_pdc_info_offset;
+ __be32 payload;
+} __attribute__ ((__packed__));
+
/* ses request op codes */
enum {
UET_SES_REQ_OP_NOOP = 0x00,
--
2.48.1
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [RFC PATCH 13/13] HACK: drivers: ultraeth: add char device
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (11 preceding siblings ...)
2025-03-06 23:02 ` [RFC PATCH 12/13] drivers: ultraeth: add initiator and target idle timeout support Nikolay Aleksandrov
@ 2025-03-06 23:02 ` Nikolay Aleksandrov
2025-03-08 18:46 ` [RFC PATCH 00/13] Ultra Ethernet driver introduction Leon Romanovsky
2025-03-19 16:48 ` Jason Gunthorpe
14 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-06 23:02 UTC (permalink / raw)
To: netdev
Cc: shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt, roland,
nikolay, winston.liu, dan.mihailescu, kheib, parth.v.parikh,
davem, ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
From: Alex Badea <alex.badea@keysight.com>
Add a character device so we can send and receive packets from
user-space. It also implements a private ioctl for associating with a job.
This patch is just a quick hack to allow using the Ultra Ethernet
software device from user-space until proper user<->kernel APIs are
defined.
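For illustration, a minimal user-space sketch of driving the device (the
device node name, addresses, port and payload below are made-up examples;
error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <arpa/inet.h>
#include <linux/ultraeth.h>

int main(void)
{
	struct uet_job_addr_req areq = {};
	struct {
		struct uet_pds_meta meta;
		char payload[64];
	} __attribute__((packed)) msg = {};
	int fd = open("/dev/ultraeth0", O_RDWR);

	if (fd < 0)
		return 1;

	/* associate this fep with a job/service via the private ioctl */
	areq.address.ip = inet_addr("192.0.2.1");
	strcpy(areq.service_name, "svc0");
	if (ioctl(fd, UET_ADDR_REQ, &areq) < 0)
		return 1;

	/* tx: a struct uet_pds_meta header followed by the payload */
	msg.meta.addr = inet_addr("192.0.2.2");
	msg.meta.port = htons(7777);
	memcpy(msg.payload, "ping", 4);
	write(fd, &msg, sizeof(msg.meta) + 4);

	/* rx: the kernel hands back the same meta header plus payload */
	read(fd, &msg, sizeof(msg));
	close(fd);

	return 0;
}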
Signed-off-by: Alex Badea <alex.badea@keysight.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@enfabrica.net>
---
Documentation/netlink/specs/ultraeth.yaml | 9 +
drivers/ultraeth/Makefile | 2 +-
drivers/ultraeth/uet_chardev.c | 264 ++++++++++++++++++++++
drivers/ultraeth/uet_context.c | 31 ++-
include/net/ultraeth/uet_chardev.h | 11 +
include/net/ultraeth/uet_context.h | 3 +
include/uapi/linux/ultraeth.h | 21 ++
include/uapi/linux/ultraeth_nl.h | 3 +
8 files changed, 342 insertions(+), 2 deletions(-)
create mode 100644 drivers/ultraeth/uet_chardev.c
create mode 100644 include/net/ultraeth/uet_chardev.h
diff --git a/Documentation/netlink/specs/ultraeth.yaml b/Documentation/netlink/specs/ultraeth.yaml
index 847f748efa52..3dc10e52131e 100644
--- a/Documentation/netlink/specs/ultraeth.yaml
+++ b/Documentation/netlink/specs/ultraeth.yaml
@@ -22,6 +22,15 @@ attribute-sets:
-
name: netdev-name
type: string
+ -
+ name: chardev-name
+ type: string
+ -
+ name: chardev-major
+ type: s32
+ -
+ name: chardev-minor
+ type: s32
-
name: contexts
attributes:
diff --git a/drivers/ultraeth/Makefile b/drivers/ultraeth/Makefile
index f2d6a8569dbf..bee8e7aa00bb 100644
--- a/drivers/ultraeth/Makefile
+++ b/drivers/ultraeth/Makefile
@@ -1,4 +1,4 @@
obj-$(CONFIG_ULTRAETH) += ultraeth.o
ultraeth-objs := uet_main.o uet_context.o uet_netlink.o uet_job.o \
- uecon.o uet_pdc.o uet_pds.o
+ uecon.o uet_pdc.o uet_pds.o uet_chardev.o
diff --git a/drivers/ultraeth/uet_chardev.c b/drivers/ultraeth/uet_chardev.c
new file mode 100644
index 000000000000..f02f2c1e1afd
--- /dev/null
+++ b/drivers/ultraeth/uet_chardev.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/netdevice.h>
+
+#include <rdma/ib_umem.h>
+
+#include <uapi/linux/ultraeth.h>
+#include <net/ultraeth/uet_context.h>
+#include <net/ultraeth/uet_chardev.h>
+
+#define MAX_PDS_HDRLEN 64 /* -ish? */
+
+static int uet_char_open(struct inode *inode, struct file *file)
+{
+ struct uet_context *ctx;
+ struct uet_fep *fep;
+ int rv;
+
+ ctx = uet_context_get_by_minor(iminor(inode));
+ if (!ctx)
+ return -ENOENT;
+
+ fep = kzalloc(sizeof(*fep), GFP_KERNEL);
+ if (!fep) {
+ rv = -ENOMEM;
+ goto err_alloc;
+ }
+
+ fep->context = ctx;
+ fep->ack_gen_min_pkt_add = UET_DEFAULT_ACK_GEN_MIN_PKT_ADD;
+ fep->ack_gen_trigger = UET_DEFAULT_ACK_GEN_TRIGGER;
+ skb_queue_head_init(&fep->rxq);
+ file->private_data = fep;
+ rv = nonseekable_open(inode, file);
+ if (rv < 0)
+ goto err_open;
+
+ return rv;
+
+err_open:
+ kfree(fep);
+err_alloc:
+ uet_context_put(ctx);
+
+ return rv;
+}
+
+static int uet_char_release(struct inode *inode, struct file *file)
+{
+ struct uet_fep *fep = file->private_data;
+
+ uet_job_reg_disassociate(&fep->context->job_reg, fep->job_id);
+ skb_queue_purge(&fep->rxq);
+ uet_context_put(fep->context);
+ kfree(fep);
+
+ return 0;
+}
+
+static long uet_char_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct uet_fep *fep = file->private_data;
+ void __user *p = (void __user *)arg;
+ int ret = 0;
+
+ switch (cmd) {
+ case UET_ADDR_REQ: {
+ struct uet_job_addr_req areq;
+
+ if (copy_from_user(&areq, p, sizeof(areq)))
+ return -EFAULT;
+ // XXX: validate address
+
+ areq.service_name[UET_SVC_MAX_LEN - 1] = '\0';
+ memcpy(&fep->addr.in_address, &areq.address,
+ sizeof(fep->addr.in_address));
+
+ ret = uet_job_reg_associate(&fep->context->job_reg, fep,
+ areq.service_name);
+ if (!ret) {
+ if (areq.ack_gen_trigger > 0)
+ fep->ack_gen_trigger = areq.ack_gen_trigger;
+ if (areq.ack_gen_min_pkt_add > 0)
+ fep->ack_gen_min_pkt_add = areq.ack_gen_min_pkt_add;
+ }
+ break;
+ }
+ default:
+ return -EOPNOTSUPP;
+ }
+
+ return ret;
+}
+
+static ssize_t uet_char_read(struct file *file, char __user *ubuf,
+ size_t usize, loff_t *off)
+{
+ struct uet_fep *fep = file->private_data;
+ struct uet_prologue_hdr *prologue;
+ struct uet_pds_meta meta = {};
+ struct sk_buff *skb = NULL;
+ int ret = -ENOTCONN;
+ int hdrlen = 0;
+ size_t userlen;
+
+ pr_debug("%s file=%p fep=%p size=%zu\n", __func__, file, fep, usize);
+
+ ret = -EAGAIN;
+ skb = skb_dequeue(&fep->rxq);
+ if (!skb)
+ goto out_err;
+
+ ret = skb_linearize(skb);
+ if (ret)
+ goto out_err;
+
+ prologue = pds_prologue_hdr(skb);
+ meta.next_hdr = uet_prologue_next_hdr(prologue);
+ meta.addr = UET_SKB_CB(skb)->remote_fep_addr;
+ switch (meta.next_hdr) {
+ case UET_PDS_NEXT_HDR_RSP_DATA:
+ case UET_PDS_NEXT_HDR_RSP_DATA_SMALL:
+ /* TODO */
+ ret = -EOPNOTSUPP;
+ goto out_err;
+ case UET_PDS_NEXT_HDR_RSP:
+ hdrlen = sizeof(struct uet_pds_ack_hdr);
+ break;
+ default:
+ hdrlen = sizeof(struct uet_pds_req_hdr);
+ break;
+ }
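+ /* the reader gets a struct uet_pds_meta followed by the packet
+ * payload, with the PDS headers stripped
+ */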
+ userlen = sizeof(meta) + skb->len - hdrlen;
+ if (userlen > usize) {
+ ret = -EMSGSIZE;
+ goto out_err;
+ }
+
+ if (copy_to_user(ubuf, &meta, sizeof(meta))) {
+ ret = -EFAULT;
+ goto out_err;
+ }
+ if (copy_to_user(ubuf + sizeof(meta), skb->data + hdrlen, skb->len - hdrlen)) {
+ ret = -EFAULT;
+ goto out_err;
+ }
+
+ consume_skb(skb);
+ ret = userlen;
+
+ return ret;
+
+out_err:
+ kfree_skb(skb);
+
+ return ret;
+}
+
+static ssize_t uet_char_write(struct file *file, const char __user *ubuf,
+ size_t usize, loff_t *off)
+{
+ struct uet_fep *fep = file->private_data;
+ struct sk_buff *skb = NULL;
+ struct uet_pds_meta *meta;
+ struct uet_job *job;
+ __be32 daddr, saddr;
+ int ret = -ENODEV;
+ __be16 dport;
+ void *buf;
+
+ pr_debug("%s file=%p fep=%p size=%zu\n", __func__, file, fep, usize);
+
+ /* copy the user data in before taking the rcu read lock, since
+ * copy_from_user() may fault and sleep
+ */
+ ret = -ENOMEM;
+ skb = alloc_skb(MAX_HEADER + MAX_PDS_HDRLEN + usize, GFP_KERNEL);
+ if (!skb)
+ return ret;
+ skb_reserve(skb, MAX_HEADER + MAX_PDS_HDRLEN);
+ buf = skb_put(skb, usize);
+ ret = -EFAULT;
+ if (copy_from_user(buf, ubuf, usize)) {
+ kfree_skb(skb);
+ return ret;
+ }
+
+ rcu_read_lock();
+ ret = -ENODEV;
+ job = uet_job_find(&fep->context->job_reg, fep->job_id);
+ if (!job)
+ goto out_err;
+
+ print_hex_dump_bytes("pds tx ", DUMP_PREFIX_OFFSET, skb->data, skb->len);
+
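+ /* the buffer must start with a struct uet_pds_meta, payload follows */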
+ meta = skb_pull_data(skb, sizeof(*meta));
+ if (!meta) {
+ ret = -EINVAL;
+ goto out_err;
+ }
+ /* TODO: IPv6 */
+ /* TODO: per-packet daddr */
+ saddr = fep->addr.in_address.ip;
+ daddr = meta->addr;
+ dport = meta->port;
+
+ switch (meta->next_hdr) {
+ case UET_PDS_NEXT_HDR_RSP_DATA:
+ case UET_PDS_NEXT_HDR_RSP_DATA_SMALL:
+ ret = -EOPNOTSUPP; /* TODO */
+ goto out_err;
+ case UET_PDS_NEXT_HDR_RSP:
+ ret = 0; /* FIXME: ACK PSN would be wrong */
+ break;
+ default:
+ ret = uet_pds_tx(&fep->context->pds, skb, saddr, daddr, dport,
+ job->id);
+ break;
+ }
+
+ if (ret < 0)
+ goto out_err;
+ rcu_read_unlock();
+
+ return usize;
+
+out_err:
+ rcu_read_unlock();
+ kfree_skb(skb);
+
+ return ret;
+}
+
+static const struct file_operations uet_char_ops = {
+ .owner = THIS_MODULE,
+ .open = uet_char_open,
+ .release = uet_char_release,
+ .read = uet_char_read,
+ .write = uet_char_write,
+ .unlocked_ioctl = uet_char_ioctl,
+};
+
+#define UET_CHAR_MAX_NAME 20
+
+int uet_char_init(struct miscdevice *cdev, int id)
+{
+ int ret = -ENOMEM;
+
+ cdev->minor = MISC_DYNAMIC_MINOR;
+ cdev->name = kzalloc(UET_CHAR_MAX_NAME, GFP_KERNEL);
+ if (!cdev->name)
+ return ret;
+ snprintf((char *)cdev->name, UET_CHAR_MAX_NAME, "ultraeth%d", id);
+ cdev->fops = &uet_char_ops;
+
+ ret = misc_register(cdev);
+ if (ret)
+ kfree(cdev->name);
+
+ return ret;
+}
+
+void uet_char_uninit(struct miscdevice *cdev)
+{
+ misc_deregister(cdev);
+ kfree(cdev->name);
+}
diff --git a/drivers/ultraeth/uet_context.c b/drivers/ultraeth/uet_context.c
index 6bdd72344e01..7bddc810503b 100644
--- a/drivers/ultraeth/uet_context.c
+++ b/drivers/ultraeth/uet_context.c
@@ -2,6 +2,7 @@
#include <net/ultraeth/uet_context.h>
#include <net/ultraeth/uecon.h>
+#include <net/ultraeth/uet_chardev.h>
#include "uet_netlink.h"
#define MAX_CONTEXT_ID 256
@@ -78,6 +79,24 @@ struct uet_context *uet_context_get_by_id(int id)
return ctx;
}
+struct uet_context *uet_context_get_by_minor(int minor)
+{
+ struct uet_context *ctx;
+
+ mutex_lock(&uet_context_lock);
+ list_for_each_entry(ctx, &uet_context_list, list) {
+ if (ctx->cdev.minor == minor) {
+ refcount_inc(&ctx->refcnt);
+ goto out;
+ }
+ }
+ ctx = NULL;
+out:
+ mutex_unlock(&uet_context_lock);
+
+ return ctx;
+}
+
void uet_context_put(struct uet_context *ctx)
{
if (refcount_dec_and_test(&ctx->refcnt))
@@ -111,6 +130,10 @@ int uet_context_create(int id)
if (err)
goto ctx_pds_err;
+ err = uet_char_init(&ctx->cdev, ctx->id);
+ if (err)
+ goto ctx_char_err;
+
err = uecon_netdev_init(ctx);
if (err)
goto ctx_netdev_err;
@@ -120,6 +143,8 @@ int uet_context_create(int id)
return 0;
ctx_netdev_err:
+ uet_char_uninit(&ctx->cdev);
+ctx_char_err:
uet_pds_uninit(&ctx->pds);
ctx_pds_err:
uet_jobs_uninit(&ctx->job_reg);
@@ -135,6 +160,7 @@ static void __uet_context_destroy(struct uet_context *ctx)
{
uet_context_unlink(ctx);
uecon_netdev_uninit(ctx);
+ uet_char_uninit(&ctx->cdev);
uet_pds_uninit(&ctx->pds);
uet_jobs_uninit(&ctx->job_reg);
uet_context_put_id(ctx);
@@ -183,7 +209,10 @@ static int __nl_ctx_fill_one(struct sk_buff *skb,
if (nla_put_s32(skb, ULTRAETH_A_CONTEXT_ID, ctx->id) ||
nla_put_s32(skb, ULTRAETH_A_CONTEXT_NETDEV_IFINDEX, ctx->netdev->ifindex) ||
- nla_put_string(skb, ULTRAETH_A_CONTEXT_NETDEV_NAME, ctx->netdev->name))
+ nla_put_string(skb, ULTRAETH_A_CONTEXT_NETDEV_NAME, ctx->netdev->name) ||
+ nla_put_string(skb, ULTRAETH_A_CONTEXT_CHARDEV_NAME, ctx->cdev.name) ||
+ nla_put_s32(skb, ULTRAETH_A_CONTEXT_CHARDEV_MAJOR, MISC_MAJOR) ||
+ nla_put_s32(skb, ULTRAETH_A_CONTEXT_CHARDEV_MINOR, ctx->cdev.minor))
goto out_err;
genlmsg_end(skb, hdr);
diff --git a/include/net/ultraeth/uet_chardev.h b/include/net/ultraeth/uet_chardev.h
new file mode 100644
index 000000000000..963b3e247630
--- /dev/null
+++ b/include/net/ultraeth/uet_chardev.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+
+#ifndef _UECON_CHARDEV_H
+#define _UECON_CHARDEV_H
+
+#include <linux/miscdevice.h>
+
+int uet_char_init(struct miscdevice *cdev, int id);
+void uet_char_uninit(struct miscdevice *cdev);
+
+#endif /* _UECON_CHARDEV_H */
diff --git a/include/net/ultraeth/uet_context.h b/include/net/ultraeth/uet_context.h
index 76077df3bce6..06a5c7f252ac 100644
--- a/include/net/ultraeth/uet_context.h
+++ b/include/net/ultraeth/uet_context.h
@@ -11,6 +11,7 @@
#include <linux/wait.h>
#include <net/ultraeth/uet_job.h>
#include <net/ultraeth/uet_pds.h>
+#include <linux/miscdevice.h>
struct uet_context {
int id;
@@ -21,9 +22,11 @@ struct uet_context {
struct net_device *netdev;
struct uet_job_registry job_reg;
struct uet_pds pds;
+ struct miscdevice cdev;
};
struct uet_context *uet_context_get_by_id(int id);
+struct uet_context *uet_context_get_by_minor(int minor);
void uet_context_put(struct uet_context *ses_pl);
int uet_context_create(int id);
diff --git a/include/uapi/linux/ultraeth.h b/include/uapi/linux/ultraeth.h
index c1d5457073e1..2843bb710f1e 100644
--- a/include/uapi/linux/ultraeth.h
+++ b/include/uapi/linux/ultraeth.h
@@ -512,4 +512,25 @@ struct fep_address {
__u16 padding;
__u8 version;
};
+
+/* char device hacks */
+#define UET_IOCTL_MAGIC 'u'
+#define UET_ADDR_REQ _IO(UET_IOCTL_MAGIC, 1)
+
+struct uet_job_addr_req {
+ struct fep_in_address address;
+ char service_name[UET_SVC_MAX_LEN];
+ __u32 ack_gen_trigger;
+ __u32 ack_gen_min_pkt_add;
+ __u8 flags;
+};
+
+struct uet_pds_meta {
+ __u8 next_hdr:4;
+ __u8 reserved1:4;
+ __u8 reserved2;
+ __be16 port;
+ /* XXX: fep_address */
+ __be32 addr;
+} __attribute__((packed));
#endif /* _UAPI_LINUX_ULTRAETH_H */
diff --git a/include/uapi/linux/ultraeth_nl.h b/include/uapi/linux/ultraeth_nl.h
index 515044022906..884fa165adb6 100644
--- a/include/uapi/linux/ultraeth_nl.h
+++ b/include/uapi/linux/ultraeth_nl.h
@@ -13,6 +13,9 @@ enum {
ULTRAETH_A_CONTEXT_ID = 1,
ULTRAETH_A_CONTEXT_NETDEV_IFINDEX,
ULTRAETH_A_CONTEXT_NETDEV_NAME,
+ ULTRAETH_A_CONTEXT_CHARDEV_NAME,
+ ULTRAETH_A_CONTEXT_CHARDEV_MAJOR,
+ ULTRAETH_A_CONTEXT_CHARDEV_MINOR,
__ULTRAETH_A_CONTEXT_MAX,
ULTRAETH_A_CONTEXT_MAX = (__ULTRAETH_A_CONTEXT_MAX - 1)
--
2.48.1
^ permalink raw reply related [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (12 preceding siblings ...)
2025-03-06 23:02 ` [RFC PATCH 13/13] HACK: drivers: ultraeth: add char device Nikolay Aleksandrov
@ 2025-03-08 18:46 ` Leon Romanovsky
2025-03-09 3:21 ` Parav Pandit
2025-03-12 9:40 ` Nikolay Aleksandrov
2025-03-19 16:48 ` Jason Gunthorpe
14 siblings, 2 replies; 76+ messages in thread
From: Leon Romanovsky @ 2025-03-08 18:46 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni, Jason Gunthorpe
On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> Hi all,
<...>
> Ultra Ethernet is a new RDMA transport.
Awesome, and now please explain why a new subsystem is needed when
drivers/infiniband already supports at least 5 different RDMA
transports (OmniPath, iWARP, Infiniband, RoCE v1 and RoCE v2).
Maybe after this discussion it will be very clear that a new subsystem
is needed, but at least it needs to be stated clearly.
And please CC RDMA maintainers to any Ultra Ethernet related discussions
as it is more RDMA than Ethernet.
Thanks
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-08 18:46 ` [RFC PATCH 00/13] Ultra Ethernet driver introduction Leon Romanovsky
@ 2025-03-09 3:21 ` Parav Pandit
2025-03-11 14:20 ` Bernard Metzler
2025-03-12 9:40 ` Nikolay Aleksandrov
1 sibling, 1 reply; 76+ messages in thread
From: Parav Pandit @ 2025-03-09 3:21 UTC (permalink / raw)
To: Leon Romanovsky, Nikolay Aleksandrov
Cc: netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, bmt@zurich.ibm.com,
roland@enfabrica.net, winston.liu@keysight.com,
dan.mihailescu@keysight.com, kheib@redhat.com,
parth.v.parikh@keysight.com, davem@redhat.com, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, pabeni@redhat.com,
Jason Gunthorpe
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Sunday, March 9, 2025 12:17 AM
>
> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > Hi all,
>
> <...>
>
> > Ultra Ethernet is a new RDMA transport.
>
> Awesome, and now please explain why a new subsystem is needed when
> drivers/infiniband already supports at least 5 different RDMA transports
> (OmniPath, iWARP, Infiniband, RoCE v1 and RoCE v2).
>
6th transport is drivers/infiniband/hw/efa (srd).
> Maybe after this discussion it will be very clear that a new subsystem is needed,
> but at least it needs to be stated clearly.
>
> And please CC RDMA maintainers to any Ultra Ethernet related discussions as it
> is more RDMA than Ethernet.
>
> Thanks
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-09 3:21 ` Parav Pandit
@ 2025-03-11 14:20 ` Bernard Metzler
2025-03-11 14:55 ` Leon Romanovsky
2025-03-11 17:11 ` Sean Hefty
0 siblings, 2 replies; 76+ messages in thread
From: Bernard Metzler @ 2025-03-11 14:20 UTC (permalink / raw)
To: Parav Pandit, Leon Romanovsky, Nikolay Aleksandrov
Cc: netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni,
Jason Gunthorpe
> -----Original Message-----
> From: Parav Pandit <parav@nvidia.com>
> Sent: Sunday, March 9, 2025 4:22 AM
> To: Leon Romanovsky <leon@kernel.org>; Nikolay Aleksandrov
> <nikolay@enfabrica.net>
> Cc: netdev@vger.kernel.org; shrijeet@enfabrica.net;
> alex.badea@keysight.com; eric.davis@broadcom.com; rip.sohan@amd.com;
> dsahern@kernel.org; Bernard Metzler <BMT@zurich.ibm.com>;
> roland@enfabrica.net; winston.liu@keysight.com;
> dan.mihailescu@keysight.com; Kamal Heib <kheib@redhat.com>;
> parth.v.parikh@keysight.com; Dave Miller <davem@redhat.com>;
> ian.ziemba@hpe.com; andrew.tauferner@cornelisnetworks.com;
> welch@hpe.com; rakhahari.bhunia@keysight.com;
> kingshuk.mandal@keysight.com; linux-rdma@vger.kernel.org;
> kuba@kernel.org; Paolo Abeni <pabeni@redhat.com>; Jason Gunthorpe
> <jgg@nvidia.com>
> Subject: [EXTERNAL] RE: [RFC PATCH 00/13] Ultra Ethernet driver
> introduction
>
>
>
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Sunday, March 9, 2025 12:17 AM
> >
> > On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > Hi all,
> >
> > <...>
> >
> > > Ultra Ethernet is a new RDMA transport.
> >
> > Awesome, and now please explain why a new subsystem is needed when
> > drivers/infiniband already supports at least 5 different RDMA
> transports
> > (OmniPath, iWARP, Infiniband, RoCE v1 and RoCE v2).
> >
> 6th transport is drivers/infiniband/hw/efa (srd).
>
> > Maybe after this discussion it will be very clear that a new
> > subsystem is needed, but at least it needs to be stated clearly.
I am not sure if a new subsystem is what this RFC calls
for, but rather a discussion about the proper integration of
a new RDMA transport into the Linux kernel.
Ultra Ethernet Transport is probably not just another transport
up for easy integration into the current RDMA subsystem.
First of all, its design does not follow the well-known RDMA
verbs model inherited from InfiniBand, which has largely shaped
the current structure of the RDMA subsystem. While having send,
receive and completion queues (and completion counters) to steer
message exchange, there is no concept of a queue pair. Endpoints
can span multiple queues, can have multiple peer addresses.
Communication resources sharing is controlled in a different way
than within protection domains. Connections are ephemeral,
created and released by the provider as needed. There are more
differences. In a nutshell, the UET communication model is
trimmed for extreme scalability. Its API semantics follow
libfabrics, not RDMA verbs.
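To make the contrast concrete, endpoint setup in the libfabric model
looks roughly like this (a generic sketch, not UET-specific; attribute
setup abbreviated and error handling omitted):

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* no queue pair: one connectionless FI_EP_RDM endpoint, peers live in
 * an address vector and the provider sets up and tears down
 * connections as it sees fit
 */
static struct fid_ep *rdm_ep_setup(struct fi_info **info_out)
{
	struct fi_av_attr av_attr = { .type = FI_AV_TABLE };
	struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };
	struct fi_info *hints = fi_allocinfo(), *info;
	struct fid_fabric *fabric;
	struct fid_domain *domain;
	struct fid_av *av;
	struct fid_cq *cq;
	struct fid_ep *ep;

	hints->ep_attr->type = FI_EP_RDM;
	hints->caps = FI_MSG | FI_RMA;
	if (fi_getinfo(FI_VERSION(1, 20), NULL, NULL, 0, hints, &info))
		return NULL;

	fi_fabric(info->fabric_attr, &fabric, NULL);
	fi_domain(fabric, info, &domain, NULL);
	fi_endpoint(domain, info, &ep, NULL);
	fi_av_open(domain, &av_attr, &av, NULL);
	fi_cq_open(domain, &cq_attr, &cq, NULL);
	fi_ep_bind(ep, &av->fid, 0);
	fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
	fi_enable(ep);

	*info_out = info;
	return ep;
}

Peers are then added with fi_av_insert() and addressed per operation,
rather than being bound to a connected queue pair.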
I think Nik gave us a first still incomplete look at the UET
protocol engine to help us understand some of the specifics.
It's just the lower part (packet delivery). The implementation
of the upper part (resource management, communication semantics,
job management) may largely depend on the environment we all
choose.
IMO, integrating UET with the current RDMA subsystem would ask
for its extension to allow exposing all of UET's intended
functionality, probably starting with a more generic RDMA
device model than current ib_device.
The different API semantics of UET may further call
for either extending verbs to cover it as well, or exposing a
new non-verbs API (libfabrics), or both.
Thanks,
Bernard.
> >
> > And please CC RDMA maintainers to any Ultra Ethernet related
> > discussions as it is more RDMA than Ethernet.
> >
> > Thanks
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-11 14:20 ` Bernard Metzler
@ 2025-03-11 14:55 ` Leon Romanovsky
2025-03-11 17:11 ` Sean Hefty
1 sibling, 0 replies; 76+ messages in thread
From: Leon Romanovsky @ 2025-03-11 14:55 UTC (permalink / raw)
To: Bernard Metzler
Cc: Parav Pandit, Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
roland@enfabrica.net, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni,
Jason Gunthorpe
On Tue, Mar 11, 2025 at 02:20:07PM +0000, Bernard Metzler wrote:
>
>
> > -----Original Message-----
> > From: Parav Pandit <parav@nvidia.com>
> > Sent: Sunday, March 9, 2025 4:22 AM
> > To: Leon Romanovsky <leon@kernel.org>; Nikolay Aleksandrov
> > <nikolay@enfabrica.net>
> > Cc: netdev@vger.kernel.org; shrijeet@enfabrica.net;
> > alex.badea@keysight.com; eric.davis@broadcom.com; rip.sohan@amd.com;
> > dsahern@kernel.org; Bernard Metzler <BMT@zurich.ibm.com>;
> > roland@enfabrica.net; winston.liu@keysight.com;
> > dan.mihailescu@keysight.com; Kamal Heib <kheib@redhat.com>;
> > parth.v.parikh@keysight.com; Dave Miller <davem@redhat.com>;
> > ian.ziemba@hpe.com; andrew.tauferner@cornelisnetworks.com;
> > welch@hpe.com; rakhahari.bhunia@keysight.com;
> > kingshuk.mandal@keysight.com; linux-rdma@vger.kernel.org;
> > kuba@kernel.org; Paolo Abeni <pabeni@redhat.com>; Jason Gunthorpe
> > <jgg@nvidia.com>
> > Subject: [EXTERNAL] RE: [RFC PATCH 00/13] Ultra Ethernet driver
> > introduction
> >
> >
> >
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Sunday, March 9, 2025 12:17 AM
> > >
> > > On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > > Hi all,
> > >
> > > <...>
> > >
> > > > Ultra Ethernet is a new RDMA transport.
> > >
> > > Awesome, and now please explain why a new subsystem is needed when
> > > drivers/infiniband already supports at least 5 different RDMA
> > transports
> > > (OmniPath, iWARP, Infiniband, RoCE v1 and RoCE v2).
> > >
> > 6th transport is drivers/infiniband/hw/efa (srd).
> >
> > > Maybe after this discussion it will be very clear that a new
> > > subsystem is needed, but at least it needs to be stated clearly.
>
> I am not sure if a new subsystem is what this RFC calls
> for, but rather a discussion about the proper integration of
> a new RDMA transport into the Linux kernel.
<...>
> The different API semantics of UET may further call
> for either extending verbs to cover it as well, or exposing a
> new non-verbs API (libfabrics), or both.
So you should start from there (UAPI) by presenting the device model and
how the verbs API needs to be extended, so it will be possible to evaluate
how to fit that model into the existing Linux kernel codebase.
The RDMA subsystem provides multiple types of QPs and operational models; some of them
do indeed follow the IB style, but not all of them (SRD, DC, etc.).
Thanks
>
> Thanks,
> Bernard.
>
>
> > >
> > > And please CC RDMA maintainers to any Ultra Ethernet related
> > > discussions as it is more RDMA than Ethernet.
> > >
> > > Thanks
>
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-11 14:20 ` Bernard Metzler
2025-03-11 14:55 ` Leon Romanovsky
@ 2025-03-11 17:11 ` Sean Hefty
2025-03-12 9:20 ` Nikolay Aleksandrov
1 sibling, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-11 17:11 UTC (permalink / raw)
To: Bernard Metzler, Parav Pandit, Leon Romanovsky,
Nikolay Aleksandrov
Cc: netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni,
Jason Gunthorpe
> I am not sure if a new subsystem is what this RFC calls for, but rather a
> discussion about the proper integration of a new RDMA transport into the
> Linux kernel.
>
> Ultra Ethernet Transport is probably not just another transport up for easy
> integration into the current RDMA subsystem.
> First of all, its design does not follow the well-known RDMA verbs model
> inherited from InfiniBand, which has largely shaped the current structure of
> the RDMA subsystem. While having send, receive and completion queues (and
> completion counters) to steer message exchange, there is no concept of a
> queue pair. Endpoints can span multiple queues, can have multiple peer
> addresses.
> Communication resources sharing is controlled in a different way than within
> protection domains. Connections are ephemeral, created and released by the
> provider as needed. There are more differences. In a nutshell, the UET
> communication model is trimmed for extreme scalability. Its API semantics
> follow libfabrics, not RDMA verbs.
>
> I think Nik gave us a first still incomplete look at the UET protocol engine to
> help us understand some of the specifics.
> It's just the lower part (packet delivery). The implementation of the upper part
> (resource management, communication semantics, job management) may
> largely depend on the environment we all choose.
>
> IMO, integrating UET with the current RDMA subsystem would ask for its
> extension to allow exposing all of UET's intended functionality, probably
> starting with a more generic RDMA device model than current ib_device.
>
> The different API semantics of UET may further call for either extending verbs
> to cover it as well, or exposing a new non-verbs API (libfabrics), or both.
Reading through the submissions, what I found lacking is a description of some higher-level plan. I don't easily see how to relate this series to NICs that may implement UET in HW.
Should the PDS be viewed as a partial implementation of a SW UET 'device', similar to soft RoCE or iWarp? If so, having a description of a proposed device model seems like a necessary first step.
If, instead, the PDS should be viewed more along the lines of a partial RDS-like path, then that changes the uapi.
Or, am I not viewing this series as intended at all?
It is almost guaranteed that there will be NICs which will support both RoCE and UET, and it's not farfetched to think that an app may use both simultaneously. IMO, a common device model is ideal, assuming exposing a device model is the intent.
I agree that different transport models should not be forced together unnaturally, but I think that's solvable. In the end, the application developer is exposed to libfabric naming anyway. Besides, even a repurposed RDMA name is still better than the naming used within OpenMPI. :)
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-11 17:11 ` Sean Hefty
@ 2025-03-12 9:20 ` Nikolay Aleksandrov
0 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-12 9:20 UTC (permalink / raw)
To: Sean Hefty, Bernard Metzler, Parav Pandit, Leon Romanovsky
Cc: netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni,
Jason Gunthorpe
On 3/11/25 7:11 PM, Sean Hefty wrote:
>> I am not sure if a new subsystem is what this RFC calls for, but rather a
>> discussion about the proper integration of a new RDMA transport into the
>> Linux kernel.
>>
>> Ultra Ethernet Transport is probably not just another transport up for easy
>> integration into the current RDMA subsystem.
>> First of all, its design does not follow the well-known RDMA verbs model
>> inherited from InfiniBand, which has largely shaped the current structure of
>> the RDMA subsystem. While having send, receive and completion queues (and
>> completion counters) to steer message exchange, there is no concept of a
>> queue pair. Endpoints can span multiple queues, can have multiple peer
>> addresses.
>> Communication resources sharing is controlled in a different way than within
>> protection domains. Connections are ephemeral, created and released by the
>> provider as needed. There are more differences. In a nutshell, the UET
>> communication model is trimmed for extreme scalability. Its API semantics
>> follow libfabrics, not RDMA verbs.
>>
>> I think Nik gave us a first still incomplete look at the UET protocol engine to
>> help us understand some of the specifics.
>> It's just the lower part (packet delivery). The implementation of the upper part
>> (resource management, communication semantics, job management) may
>> largely depend on the environment we all choose.
>>
>> IMO, integrating UET with the current RDMA subsystem would ask for its
>> extension to allow exposing all of UET's intended functionality, probably
>> starting with a more generic RDMA device model than current ib_device.
>>
>> The different API semantics of UET may further call for either extending verbs
>> to cover it as well, or exposing a new non-verbs API (libfabrics), or both.
>
> Reading through the submissions, what I found lacking is a description of some higher-level plan. I don't easily see how to relate this series to NICs that may implement UET in HW.
>
> Should the PDS be viewed as a partial implementation of a SW UET 'device', similar to soft RoCE or iWarp? If so, having a description of a proposed device model seems like a necessary first step.
>
Hi Sean,
To quote the cover letter:
"...As there
isn't any UET hardware available yet, we introduce a software
device model which implements the lowest sublayer of the spec - PDS..."
and
"The plan is to have that split into core Ultra Ethernet module
(ultraeth.ko) which is responsible for managing the UET contexts, jobs
and all other common/generic UET configuration, and the software UET
device model (uecon.ko) which implements the UET protocols for
communication in software (e.g. the PDS will be a part of uecon) and is
represented by a UDP tunnel network device."
So as I said, it is in a very early stage, but we plan to split this
into core UET code and a uecon software device model that implements
the UEC specs.
> If, instead, the PDS should be viewed more along the lines of a partial RDS-like path, then that changes the uapi.
>
> Or, am I not viewing this series as intended at all?
>
> It is almost guaranteed that there will be NICs which will support both RoCE and UET, and it's not farfetched to think that an app may use both simultaneously. IMO, a common device model is ideal, assuming exposing a device model is the intent.
>
That is the goal and we're working on a UET kernel device API as I've
noted in the cover letter.
> I agree that different transport models should not be forced together unnaturally, but I think that's solvable. In the end, the application developer is exposed to libfabric naming anyway. Besides, even a repurposed RDMA name is still better than the naming used within OpenMPI. :)
>
> - Sean
Cheers,
Nik
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-08 18:46 ` [RFC PATCH 00/13] Ultra Ethernet driver introduction Leon Romanovsky
2025-03-09 3:21 ` Parav Pandit
@ 2025-03-12 9:40 ` Nikolay Aleksandrov
2025-03-12 11:29 ` Leon Romanovsky
1 sibling, 1 reply; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-12 9:40 UTC (permalink / raw)
To: Leon Romanovsky
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni, Jason Gunthorpe
On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
>> Hi all,
>
> <...>
>
>> Ultra Ethernet is a new RDMA transport.
>
> Awesome, and now please explain why a new subsystem is needed when
> drivers/infiniband already supports at least 5 different RDMA
> transports (OmniPath, iWARP, Infiniband, RoCE v1 and RoCE v2).
>
As Bernard commented, we're not trying to add a new subsystem, but
start a discussion on where UEC should live because it has multiple
objects and semantics that don't map well to the current
infrastructure. For example from this set - managing contexts, jobs and
fabric endpoints. Also we have the ephemeral PDC connections
that come and go as needed. There are more such objects coming with more
state, configuration and lifecycle management. That is why we added a
separate netlink family to cleanly manage them without trying to fit
a square peg in a round hole so to speak. In the next version I'll make
sure to expand much more on this topic. By the way I believe Sean is
working on the verbs mapping for parts of UEC, he can probably also
share more details.
We definitely want to re-use as much as possible from the current
infrastructure, noone is trying to reinvent the wheel.
> Maybe after this discussion it will be very clear that a new subsystem
> is needed, but at least it needs to be stated clearly.
>
> And please CC RDMA maintainers to any Ultra Ethernet related discussions
> as it is more RDMA than Ethernet.
>
Of course it's RDMA, that's stated in the first few sentences, I made a
mistake with the "To", but I did add linux-rdma@ to the recipient list.
I'll make sure to also add the rdma maintainers personally for the next
version and change the "to".
> Thanks
Cheers,
Nik
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-12 9:40 ` Nikolay Aleksandrov
@ 2025-03-12 11:29 ` Leon Romanovsky
2025-03-12 14:20 ` Nikolay Aleksandrov
0 siblings, 1 reply; 76+ messages in thread
From: Leon Romanovsky @ 2025-03-12 11:29 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni, Jason Gunthorpe
On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> >> Hi all,
> >
> > <...>
> >
> >> Ultra Ethernet is a new RDMA transport.
> >
> > Awesome, and now please explain why a new subsystem is needed when
> > drivers/infiniband already supports at least 5 different RDMA
> > transports (OmniPath, iWARP, Infiniband, RoCE v1 and RoCE v2).
> >
>
> As Bernard commented, we're not trying to add a new subsystem,
So why did you create a new drivers/ultraeth/ folder?
> but start a discussion on where UEC should live because it has multiple
> objects and semantics that don't map well to the current
> infrastructure. For example from this set - managing contexts, jobs and
> fabric endpoints.
These are just different names, which libfabric uses to avoid the
traditional verbs naming. There is nothing in the stack which prevents
a QP from having the same properties as "fabric endpoints" have.
> Also we have the ephemeral PDC connections
> that come and go as needed. There are more such objects coming with more
> state, configuration and lifecycle management. That is why we added a
> separate netlink family to cleanly manage them without trying to fit
> a square peg in a round hole so to speak.
Yeah, I saw that you are planning to use netlink to manage objects,
which is very questionable. It is slow, unreliable, requires sockets,
needs more parsing logic, etc.
To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
fits better for object configurations.
Thanks
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-12 11:29 ` Leon Romanovsky
@ 2025-03-12 14:20 ` Nikolay Aleksandrov
2025-03-12 15:10 ` Leon Romanovsky
0 siblings, 1 reply; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-12 14:20 UTC (permalink / raw)
To: Leon Romanovsky
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni, Jason Gunthorpe
On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
>> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
>>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
[snip]
>> Also we have the ephemeral PDC connections that come and go as
>> needed. There are more such objects coming with more
>> state, configuration and lifecycle management. That is why we added a
>> separate netlink family to cleanly manage them without trying to fit
>> a square peg in a round hole so to speak.
>
> Yeah, I saw that you are planning to use netlink to manage objects,
> which is very questionable. It is slow, unreliable, requires sockets,
> needs more parsing logic, etc.
>
> To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> fits better for object configurations.
>
> Thanks
We'd definitely like to keep using netlink for control path object
management. Also please note we're talking about a genetlink family. It is
fast and reliable enough for us, very easily extensible,
has a nice precise object definition with policies to enforce various
limitations, has extensive tooling (e.g. ynl), communication can be
monitored in realtime for debugging (e.g. nlmon), has a nice human
readable error reporting, gives the ability to easily dump large object
groups with filters applied, YAML family definitions and so on.
Having sockets or parsing is not an issue.
Cheers,
Nik
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-12 14:20 ` Nikolay Aleksandrov
@ 2025-03-12 15:10 ` Leon Romanovsky
2025-03-12 16:00 ` Nikolay Aleksandrov
` (3 more replies)
0 siblings, 4 replies; 76+ messages in thread
From: Leon Romanovsky @ 2025-03-12 15:10 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni, Jason Gunthorpe
On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> [snip]
> >> Also we have the ephemeral PDC connections that come and go as
> >> needed. There are more such objects coming with more
> >> state, configuration and lifecycle management. That is why we added a
> >> separate netlink family to cleanly manage them without trying to fit
> >> a square peg in a round hole so to speak.
> >
> > Yeah, I saw that you are planning to use netlink to manage objects,
> > which is very questionable. It is slow, unreliable, requires sockets,
> > needs more parsing logic, etc.
> >
> > To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> > fits better for object configurations.
> >
> > Thanks
>
> We'd definitely like to keep using netlink for control path object
> management. Also please note we're talking about a genetlink family. It is
> fast and reliable enough for us, very easily extensible,
> has a nice precise object definition with policies to enforce various
> limitations, has extensive tooling (e.g. ynl), communication can be
> monitored in realtime for debugging (e.g. nlmon), has a nice human
> readable error reporting, gives the ability to easily dump large object
> groups with filters applied, YAML family definitions and so on.
> Having sockets or parsing is not an issue.
Of course it is an issue, as netlink relies on netlink sockets, which
means that you constantly move your configuration data instead of
following the pattern standard across the whole Linux kernel of
allocating configuration structs in user-space and just providing a
pointer to them through an ioctl call.
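
Schematically, the difference looks like this (hypothetical command
names and config struct, purely for illustration; error handling
omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <netlink/genl/genl.h>
#include <linux/ultraeth_nl.h>

/* ioctl-style: one struct, one copy_from_user() on the kernel side */
struct uet_ctx_cfg { __s32 id; __s32 ifindex; };
#define UET_CMD_CTX_CREATE _IOW('u', 2, struct uet_ctx_cfg)

static int ctx_create_ioctl(int fd)
{
	struct uet_ctx_cfg cfg = { .id = 1, .ifindex = 3 };

	return ioctl(fd, UET_CMD_CTX_CREATE, &cfg);
}

/* netlink-style: every field becomes an attribute that has to be
 * packed here and parsed and validated again on the kernel side
 */
static int ctx_create_genl(struct nl_sock *sk, int family, uint8_t cmd)
{
	struct nl_msg *msg = nlmsg_alloc();

	genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0, cmd, 1);
	nla_put_s32(msg, ULTRAETH_A_CONTEXT_ID, 1);
	nla_put_s32(msg, ULTRAETH_A_CONTEXT_NETDEV_IFINDEX, 3);

	return nl_send_auto(sk, msg);
}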
However, this discussion is premature; as an intro it is worth reading
this cover letter on how object management is done in the RDMA
subsystem.
https://lore.kernel.org/linux-rdma/1501765627-104860-1-git-send-email-matanb@mellanox.com/
Thanks
>
> Cheers,
> Nik
>
>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-12 15:10 ` Leon Romanovsky
@ 2025-03-12 16:00 ` Nikolay Aleksandrov
2025-03-14 14:53 ` Bernard Metzler
` (2 subsequent siblings)
3 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-12 16:00 UTC (permalink / raw)
To: Leon Romanovsky
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni, Jason Gunthorpe
On 3/12/25 5:10 PM, Leon Romanovsky wrote:
> On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
>> On 3/12/25 1:29 PM, Leon Romanovsky wrote:
>>> On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
>>>> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
>>>>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
>> [snip]
>>>> Also we have the ephemeral PDC connections that come and go as
>>>> needed. There are more such objects coming with more
>>>> state, configuration and lifecycle management. That is why we added a
>>>> separate netlink family to cleanly manage them without trying to fit
>>>> a square peg in a round hole so to speak.
>>>
>>> Yeah, I saw that you are planning to use netlink to manage objects,
>>> which is very questionable. It is slow, unreliable, requires sockets,
>>> needs more parsing logic etc.
>>>
>>> To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
>>> fits better for object configurations.
>>>
>>> Thanks
>>
>> We'd definitely like to keep using netlink for control path object
>> management. Also please note we're talking about genetlink family. It is
>> fast and reliable enough for us, very easily extensible,
>> has a nice precise object definition with policies to enforce various
>> limitations, has extensive tooling (e.g. ynl), communication can be
>> monitored in realtime for debugging (e.g. nlmon), has a nice human
>> readable error reporting, gives the ability to easily dump large object
>> groups with filters applied, YAML family definitions and so on.
>> Having sockets or parsing are not issues.
>
> Of course it is an issue, as netlink relies on Netlink sockets, which
> means that you constantly move your configuration data instead of
> following the pattern standard across the whole Linux kernel: allocating
> configuration structs in user-space and just providing a pointer to them
> through an ioctl call.
>
I should've been more specific - it is not an issue for UEC and the way
our driver's netlink API is designed. We fully understand the pros and
cons of our approach.
> However, this discussion is premature, and as an intro it is worth
> reading this cover letter on how object management is done in the RDMA
> subsystem.
>
> https://lore.kernel.org/linux-rdma/1501765627-104860-1-git-send-email-matanb@mellanox.com/
>
Sure, I know how uverbs work, but thanks for the pointer!
> Thanks
Cheers,
Nik
>>
>> Cheers,
>> Nik
>>
>>
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-12 15:10 ` Leon Romanovsky
2025-03-12 16:00 ` Nikolay Aleksandrov
@ 2025-03-14 14:53 ` Bernard Metzler
2025-03-17 12:52 ` Leon Romanovsky
2025-03-19 13:52 ` Jason Gunthorpe
2025-03-14 20:51 ` Stanislav Fomichev
2025-03-15 20:49 ` Netlink vs ioctl WAS(Re: " Jamal Hadi Salim
3 siblings, 2 replies; 76+ messages in thread
From: Bernard Metzler @ 2025-03-14 14:53 UTC (permalink / raw)
To: Leon Romanovsky, Nikolay Aleksandrov
Cc: netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni,
Jason Gunthorpe
> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Wednesday, March 12, 2025 4:11 PM
> To: Nikolay Aleksandrov <nikolay@enfabrica.net>
> Cc: netdev@vger.kernel.org; shrijeet@enfabrica.net;
> alex.badea@keysight.com; eric.davis@broadcom.com; rip.sohan@amd.com;
> dsahern@kernel.org; Bernard Metzler <BMT@zurich.ibm.com>;
> roland@enfabrica.net; winston.liu@keysight.com;
> dan.mihailescu@keysight.com; Kamal Heib <kheib@redhat.com>;
> parth.v.parikh@keysight.com; Dave Miller <davem@redhat.com>;
> ian.ziemba@hpe.com; andrew.tauferner@cornelisnetworks.com; welch@hpe.com;
> rakhahari.bhunia@keysight.com; kingshuk.mandal@keysight.com; linux-
> rdma@vger.kernel.org; kuba@kernel.org; Paolo Abeni <pabeni@redhat.com>;
> Jason Gunthorpe <jgg@nvidia.com>
> Subject: [EXTERNAL] Re: [RFC PATCH 00/13] Ultra Ethernet driver
> introduction
>
> On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > [snip]
> > >> Also we have the ephemeral PDC connections that come and go as
> > >> needed. There are more such objects coming with more
> > >> state, configuration and lifecycle management. That is why we added a
> > >> separate netlink family to cleanly manage them without trying to fit
> > >> a square peg in a round hole so to speak.
> > >
> > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > which is very questionable. It is slow, unreliable, requires sockets,
> > > needs more parsing logic etc.
> > >
> > > To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> > > fits better for object configurations.
> > >
> > > Thanks
> >
> > We'd definitely like to keep using netlink for control path object
> > management. Also please note we're talking about genetlink family. It is
> > fast and reliable enough for us, very easily extensible,
> > has a nice precise object definition with policies to enforce various
> > limitations, has extensive tooling (e.g. ynl), communication can be
> > monitored in realtime for debugging (e.g. nlmon), has a nice human
> > readable error reporting, gives the ability to easily dump large object
> > groups with filters applied, YAML family definitions and so on.
> > Having sockets or parsing are not issues.
>
> Of course it is an issue, as netlink relies on Netlink sockets, which
> means that you constantly move your configuration data instead of
> following the pattern standard across the whole Linux kernel: allocating
> configuration structs in user-space and just providing a pointer to them
> through an ioctl call.
>
> However, this discussion is premature, and as an intro it is worth
> reading this cover letter on how object management is done in the RDMA
> subsystem.
>
> https://lore.kernel.org/linux-rdma/1501765627-104860-1-git-send-email-matanb@mellanox.com/
>
Nice old stuff. Often history teaches us something. 😉
I assume the correct way forward is to first clarify the
structure of all user-visible objects that need to be
created/controlled/destroyed, and to route them through
this interface. Some will require extensions to given objects,
some may be new, some will be as-is. rdma_netlink will probably
be the right interface to look at for job control.
Best,
Bernard.
> Thanks
>
> >
> > Cheers,
> > Nik
> >
> >
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-12 15:10 ` Leon Romanovsky
2025-03-12 16:00 ` Nikolay Aleksandrov
2025-03-14 14:53 ` Bernard Metzler
@ 2025-03-14 20:51 ` Stanislav Fomichev
2025-03-17 12:30 ` Leon Romanovsky
2025-03-15 20:49 ` Netlink vs ioctl WAS(Re: " Jamal Hadi Salim
3 siblings, 1 reply; 76+ messages in thread
From: Stanislav Fomichev @ 2025-03-14 20:51 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Nikolay Aleksandrov, netdev, shrijeet, alex.badea, eric.davis,
rip.sohan, dsahern, bmt, roland, winston.liu, dan.mihailescu,
kheib, parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, kuba, pabeni,
Jason Gunthorpe
On 03/12, Leon Romanovsky wrote:
> On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > [snip]
> > >> Also we have the ephemeral PDC connections that come and go as
> > >> needed. There are more such objects coming with more
> > >> state, configuration and lifecycle management. That is why we added a
> > >> separate netlink family to cleanly manage them without trying to fit
> > >> a square peg in a round hole so to speak.
> > >
> > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > which is very questionable. It is slow, unreliable, requires sockets,
> > > needs more parsing logic etc.
> > >
> > > To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> > > fits better for object configurations.
> > >
> > > Thanks
> >
> > We'd definitely like to keep using netlink for control path object
> > management. Also please note we're talking about genetlink family. It is
> > fast and reliable enough for us, very easily extensible,
> > has a nice precise object definition with policies to enforce various
> > limitations, has extensive tooling (e.g. ynl), communication can be
> > monitored in realtime for debugging (e.g. nlmon), has a nice human
> > readable error reporting, gives the ability to easily dump large object
> > groups with filters applied, YAML family definitions and so on.
> > Having sockets or parsing are not issues.
>
> Of course it is an issue, as netlink relies on Netlink sockets, which
> means that you constantly move your configuration data instead of
> following the pattern standard across the whole Linux kernel: allocating
> configuration structs in user-space and just providing a pointer to them
> through an ioctl call.
And you still call copy_from_user on that user-space pointer. So how
is it an improvement over netlink? Netlink is just a flexible TLV
format; if you don't like read/write calls, we could add a
netlink_ioctl with a pointer to a netlink message...
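To make that concrete, a hedged sketch of the two kernel-side entry
points (the struct, attribute and handler names are hypothetical); both
end up copying the very same user data:

#include <linux/fs.h>
#include <linux/uaccess.h>
#include <net/genetlink.h>

struct foo_qp_cfg {          /* hypothetical fixed-layout config */
        u32 depth;
        u64 ring_addr;
};

enum {
        FOO_ATTR_UNSPEC,
        FOO_ATTR_QP_DEPTH,   /* hypothetical attribute */
};

/* ioctl path: one copy_from_user() of a fixed struct */
static long foo_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
        struct foo_qp_cfg cfg;

        if (copy_from_user(&cfg, (void __user *)arg, sizeof(cfg)))
                return -EFAULT;
        /* ... validate and apply cfg ... */
        return 0;
}

/* genetlink path: the message was already copied in on sendmsg() and the
 * attributes validated against the family policy before the doit runs */
static int foo_nl_qp_new_doit(struct sk_buff *skb, struct genl_info *info)
{
        u32 depth;

        if (!info->attrs[FOO_ATTR_QP_DEPTH])
                return -EINVAL;
        depth = nla_get_u32(info->attrs[FOO_ATTR_QP_DEPTH]);
        /* ... validate and apply depth ... */
        return 0;
}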
^ permalink raw reply [flat|nested] 76+ messages in thread
* Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-12 15:10 ` Leon Romanovsky
` (2 preceding siblings ...)
2025-03-14 20:51 ` Stanislav Fomichev
@ 2025-03-15 20:49 ` Jamal Hadi Salim
2025-03-17 12:57 ` Leon Romanovsky
2025-03-18 22:49 ` Jason Gunthorpe
3 siblings, 2 replies; 76+ messages in thread
From: Jamal Hadi Salim @ 2025-03-15 20:49 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Nikolay Aleksandrov, Linux Kernel Network Developers,
Shrijeet Mukherjee, alex.badea, eric.davis, rip.sohan,
David Ahern, bmt, roland, Winston Liu, dan.mihailescu, kheib,
parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, Jakub Kicinski,
Paolo Abeni, Jason Gunthorpe
On Wed, Mar 12, 2025 at 11:11 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > [snip]
> > >> Also we have the ephemeral PDC connections that come and go as
> > >> needed. There are more such objects coming with more
> > >> state, configuration and lifecycle management. That is why we added a
> > >> separate netlink family to cleanly manage them without trying to fit
> > >> a square peg in a round hole so to speak.
> > >
> > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > which is very questionable. It is slow, unreliable, requires sockets,
> > > needs more parsing logic etc.
To chime in on the above re: netlink vs ioctl,
[this is going to be a long message - over-caffeinated and stuck on a trip....]
On "slow" - mostly netlink can be deemed "slow" for the following
reasons: 1) locks - which over the last year have been greatly reduced;
2) crossing user/kernel - which I believe is fixable with some mmap
scheme (although past attempts at doing this have been unsuccessful);
3) async vs ioctl sync (more below).
On "unreliable": This is typically a result of some request response
(or a subscribed to event) whose execution has failed to allocate
memory in the kernel or overrun some buffers towards user space;
however, any such failures are signalled to user space and can be
recovered from.
ioctl is synchronous, which gives it the "reliability" and "speed".
IIRC, if a memory failure were to happen on an ioctl it would block
until it is successful? vs netlink, which is async and will signal
user space if data is lost or can't be fully delivered. For example, if
a user issued a dump of a very large amount of data from the kernel and
that data wasn't fully delivered, perhaps because of memory pressure,
user space will be notified via socket errors and can use that info to
recover.
Extensibility: ioctls take binary structs, which makes them much harder
to extend but adds to that "speed". Once you pick your struct, you are
stuck with it - as opposed to netlink, which uses formally defined TLVs
that make it highly extensible. Yes, extensibility requires more
parsing, as you stated above. Note: if you have one-offs you could just
hardcode an ioctl-like data structure into a TLV and use blocking
netlink sockets, and that should get you pretty close to ioctl "speed".
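A hedged sketch of that one-off trick (the struct and attribute name are
made up): a fixed binary struct rides inside a single attribute, giving
an ioctl-like payload with netlink framing:

#include <net/netlink.h>

struct foo_hw_cfg {          /* hypothetical fixed-layout config blob */
        __u32 depth;
        __u64 ring_addr;
};

enum {
        FOO_ATTR_UNSPEC,
        FOO_ATTR_HW_CFG,     /* hypothetical attribute carrying the blob */
};

static int foo_put_hw_cfg(struct sk_buff *skb, const struct foo_hw_cfg *cfg)
{
        /* one TLV, no per-field attributes */
        return nla_put(skb, FOO_ATTR_HW_CFG, sizeof(*cfg), cfg);
}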
To build more on reliability: if you really cared, there are
mechanisms which can be used to build a fully reliable channel of
communication with the kernel, since netlink is in fact a wire protocol
(which alas has been broken for a while, because you can't really use
it as a wire protocol across machines); see for example:
https://datatracker.ietf.org/doc/html/rfc3549#section-2.3.2.1
And if you don't really care about reliability you can just shoot
messages into the kernel and turn off the ACK flag (and then issue
requests when you feel you need to check on configuration).
Debuggability: extended ACKs (heavily used by networking) provide
excellent operational information to user space, with fine-grained
details on errors (the famous EINVAL can tell you exactly what the
EINVAL means, for example).
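For example, a minimal sketch of a handler attaching that detail via
extack (the attribute and handler names are hypothetical; NL_SET_ERR_MSG
is the real mechanism):

#include <net/genetlink.h>

enum {
        FOO_ATTR_UNSPEC,
        FOO_ATTR_JOB_ID,     /* hypothetical attribute */
};

static int foo_nl_job_new_doit(struct sk_buff *skb, struct genl_info *info)
{
        if (!info->attrs[FOO_ATTR_JOB_ID]) {
                /* user space gets the string back, not just a bare EINVAL */
                NL_SET_ERR_MSG(info->extack, "job id attribute is required");
                return -EINVAL;
        }
        /* ... */
        return 0;
}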
netlink has a multicast publish-subscribe mechanism. Multicast being
one-to-many means a multi-user interface (an important detail for both
scaling and independent debugging), meaning you can have multiple
processes subscribing to events that the kernel publishes. You don't
have to resort to polling the kernel for details of dynamic changes
(for example, "a new entry has been added to table foo", etc.).
As a matter of fact, the original design used to allow user space to
advertise to both the kernel and other user space apps (and unicast
worked to/from kernel/user and user/user). I haven't looked at that
recently, so it could be broken.
Note: while these events are also subject to message loss, the netlink
robustness described earlier is usable here as well (via socket
errors).
For example, if the kernel attempted to send an event which had the
misfortune of not making it, the user will be notified and can recover
by requesting a related table dump, etc., to see what changed.
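A self-contained sketch of the subscriber side, using the long-standing
rtnetlink link-event group as the example:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void)
{
        struct sockaddr_nl sa = {
                .nl_family = AF_NETLINK,
                .nl_groups = RTMGRP_LINK, /* link add/del/change events */
        };
        char buf[8192];
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                return 1;
        for (;;) {
                ssize_t len = recv(fd, buf, sizeof(buf), 0);

                if (len < 0)
                        break; /* e.g. ENOBUFS: events lost, re-dump to resync */
                printf("received %zd bytes of link events\n", len);
        }
        close(fd);
        return 0;
}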
- And as Nik mentioned: the new YAML-model-to-generated-code approach
that is now common in generic netlink greatly reduces developer effort.
Although in my opinion we really need this stuff integrated into tools
like iproute2.
I am pretty sure I left out some important details (maybe I can write
a small doc when I am in better shape).
cheers,
jamal
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-14 20:51 ` Stanislav Fomichev
@ 2025-03-17 12:30 ` Leon Romanovsky
2025-03-19 19:12 ` Stanislav Fomichev
0 siblings, 1 reply; 76+ messages in thread
From: Leon Romanovsky @ 2025-03-17 12:30 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Nikolay Aleksandrov, netdev, shrijeet, alex.badea, eric.davis,
rip.sohan, dsahern, bmt, roland, winston.liu, dan.mihailescu,
kheib, parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, kuba, pabeni,
Jason Gunthorpe
On Fri, Mar 14, 2025 at 01:51:33PM -0700, Stanislav Fomichev wrote:
> On 03/12, Leon Romanovsky wrote:
> > On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > [snip]
> > > >> Also we have the ephemeral PDC connections that come and go as
> > > >> needed. There are more such objects coming with more
> > > >> state, configuration and lifecycle management. That is why we added a
> > > >> separate netlink family to cleanly manage them without trying to fit
> > > >> a square peg in a round hole so to speak.
> > > >
> > > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > > which is very questionable. It is slow, unreliable, requires sockets,
> > > > needs more parsing logic etc.
> > > >
> > > > To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> > > > fits better for object configurations.
> > > >
> > > > Thanks
> > >
> > > We'd definitely like to keep using netlink for control path object
> > > management. Also please note we're talking about genetlink family. It is
> > > fast and reliable enough for us, very easily extensible,
> > > has a nice precise object definition with policies to enforce various
> > > limitations, has extensive tooling (e.g. ynl), communication can be
> > > monitored in realtime for debugging (e.g. nlmon), has a nice human
> > > readable error reporting, gives the ability to easily dump large object
> > > groups with filters applied, YAML family definitions and so on.
> > > Having sockets or parsing are not issues.
> >
> > Of course it is an issue, as netlink relies on Netlink sockets, which
> > means that you constantly move your configuration data instead of
> > following the pattern standard across the whole Linux kernel: allocating
> > configuration structs in user-space and just providing a pointer to them
> > through an ioctl call.
>
> And you still call copy_from_user on that user-space pointer. So how
> is it an improvement over netlink? Netlink is just a flexible TLV
> format; if you don't like read/write calls, we could add a
> netlink_ioctl with a pointer to a netlink message...
You need to build that netlink message, which you do by multiple copies
in user space.
I understand your desire to see netdev patterns everywhere and agree
with the position that netlink is a perfect choice for dynamic configurations.
However, I hold the position that it is not a good fit for configuring
strictly dependent hardware objects.
You already have a TLV-based API in drivers/infiniband; there is no need
to invent a new one.
Thanks
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-14 14:53 ` Bernard Metzler
@ 2025-03-17 12:52 ` Leon Romanovsky
2025-03-19 13:52 ` Jason Gunthorpe
1 sibling, 0 replies; 76+ messages in thread
From: Leon Romanovsky @ 2025-03-17 12:52 UTC (permalink / raw)
To: Bernard Metzler
Cc: Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
roland@enfabrica.net, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni,
Jason Gunthorpe
On Fri, Mar 14, 2025 at 02:53:40PM +0000, Bernard Metzler wrote:
>
>
> > -----Original Message-----
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Wednesday, March 12, 2025 4:11 PM
> > To: Nikolay Aleksandrov <nikolay@enfabrica.net>
> > Cc: netdev@vger.kernel.org; shrijeet@enfabrica.net;
> > alex.badea@keysight.com; eric.davis@broadcom.com; rip.sohan@amd.com;
> > dsahern@kernel.org; Bernard Metzler <BMT@zurich.ibm.com>;
> > roland@enfabrica.net; winston.liu@keysight.com;
> > dan.mihailescu@keysight.com; Kamal Heib <kheib@redhat.com>;
> > parth.v.parikh@keysight.com; Dave Miller <davem@redhat.com>;
> > ian.ziemba@hpe.com; andrew.tauferner@cornelisnetworks.com; welch@hpe.com;
> > rakhahari.bhunia@keysight.com; kingshuk.mandal@keysight.com; linux-
> > rdma@vger.kernel.org; kuba@kernel.org; Paolo Abeni <pabeni@redhat.com>;
> > Jason Gunthorpe <jgg@nvidia.com>
> > Subject: [EXTERNAL] Re: [RFC PATCH 00/13] Ultra Ethernet driver
> > introduction
> >
> > On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > [snip]
> > > >> Also we have the ephemeral PDC connections that come and go as
> > > >> needed. There are more such objects coming with more
> > > >> state, configuration and lifecycle management. That is why we added a
> > > >> separate netlink family to cleanly manage them without trying to fit
> > > >> a square peg in a round hole so to speak.
> > > >
> > > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > > which is very questionable. It is slow, unreliable, requires sockets,
> > > > needs more parsing logic etc.
> > > >
> > > > To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> > > > fits better for object configurations.
> > > >
> > > > Thanks
> > >
> > > We'd definitely like to keep using netlink for control path object
> > > management. Also please note we're talking about genetlink family. It is
> > > fast and reliable enough for us, very easily extensible,
> > > has a nice precise object definition with policies to enforce various
> > > limitations, has extensive tooling (e.g. ynl), communication can be
> > > monitored in realtime for debugging (e.g. nlmon), has a nice human
> > > readable error reporting, gives the ability to easily dump large object
> > > groups with filters applied, YAML family definitions and so on.
> > > Having sockets or parsing are not issues.
> >
> > Of course it is an issue, as netlink relies on Netlink sockets, which
> > means that you constantly move your configuration data instead of
> > following the pattern standard across the whole Linux kernel: allocating
> > configuration structs in user-space and just providing a pointer to them
> > through an ioctl call.
> >
> > However, this discussion is premature, and as an intro it is worth
> > reading this cover letter on how object management is done in the RDMA
> > subsystem.
> >
> > https://lore.kernel.org/linux-rdma/1501765627-104860-1-git-send-email-matanb@mellanox.com/
> >
> Nice old stuff. Often history teaches us something. 😉
Maybe, and this is what this submission is missing: an explanation of
what was learnt, what needs to be changed, and why it is impossible to
fix the existing code so that everything needs to be started from
scratch.
>
> I assume the correct way forward is to first clarify the
> structure of all user-visible objects that need to be
> created/controlled/destroyed, and to route them through
> this interface.
Yes, the actual object structure, the objects' relations and how they
are exposed is the interesting part.
Thanks
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-15 20:49 ` Netlink vs ioctl WAS(Re: " Jamal Hadi Salim
@ 2025-03-17 12:57 ` Leon Romanovsky
2025-03-18 22:49 ` Jason Gunthorpe
1 sibling, 0 replies; 76+ messages in thread
From: Leon Romanovsky @ 2025-03-17 12:57 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Nikolay Aleksandrov, Linux Kernel Network Developers,
Shrijeet Mukherjee, alex.badea, eric.davis, rip.sohan,
David Ahern, bmt, roland, Winston Liu, dan.mihailescu, kheib,
parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, Jakub Kicinski,
Paolo Abeni, Jason Gunthorpe
On Sat, Mar 15, 2025 at 04:49:20PM -0400, Jamal Hadi Salim wrote:
> On Wed, Mar 12, 2025 at 11:11 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > [snip]
> > > >> Also we have the ephemeral PDC connections that come and go as
> > > >> needed. There are more such objects coming with more
> > > >> state, configuration and lifecycle management. That is why we added a
> > > >> separate netlink family to cleanly manage them without trying to fit
> > > >> a square peg in a round hole so to speak.
> > > >
> > > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > > which is very questionable. It is slow, unreliable, requires sockets,
> > > > needs more parsing logic etc.
>
> To chime in on the above re: netlink vs ioctl,
> [this is going to be a long message - over-caffeinated and stuck on a trip....]
>
> On "slow" - mostly netlink can be deemed "slow" for the following
> reasons: 1) locks - which over the last year have been greatly reduced;
> 2) crossing user/kernel - which I believe is fixable with some mmap
> scheme (although past attempts at doing this have been unsuccessful);
> 3) async vs ioctl sync (more below).
>
> On "unreliable": This is typically a result of some request response
> (or a subscribed to event) whose execution has failed to allocate
> memory in the kernel or overrun some buffers towards user space;
> however, any such failures are signalled to user space and can be
> recovered from.
>
> ioctl is synchronous, which gives it the "reliability" and "speed".
> IIRC, if a memory failure were to happen on an ioctl it would block
> until it is successful? vs netlink, which is async and will signal
> user space if data is lost or can't be fully delivered. For example, if
> a user issued a dump of a very large amount of data from the kernel and
> that data wasn't fully delivered, perhaps because of memory pressure,
> user space will be notified via socket errors and can use that info to
> recover.
>
> Extensibility: ioctls take binary structs, which makes them much harder
> to extend but adds to that "speed". Once you pick your struct, you are
> stuck with it - as opposed to netlink, which uses formally defined TLVs
> that make it highly extensible. Yes, extensibility requires more
> parsing, as you stated above. Note: if you have one-offs you could just
> hardcode an ioctl-like data structure into a TLV and use blocking
> netlink sockets, and that should get you pretty close to ioctl "speed".
>
> To build more on reliability: if you really cared, there are
> mechanisms which can be used to build a fully reliable channel of
> communication with the kernel, since netlink is in fact a wire protocol
> (which alas has been broken for a while, because you can't really use
> it as a wire protocol across machines); see for example:
> https://datatracker.ietf.org/doc/html/rfc3549#section-2.3.2.1
> And if you don't really care about reliability you can just shoot
> messages into the kernel and turn off the ACK flag (and then issue
> requests when you feel you need to check on configuration).
>
> Debuggability: extended ACKs (heavily used by networking) provide
> excellent operational information to user space, with fine-grained
> details on errors (the famous EINVAL can tell you exactly what the
> EINVAL means, for example).
>
> netlink has a multicast publish-subscribe mechanism. Multicast being
> one-to-many means a multi-user interface (an important detail for both
> scaling and independent debugging), meaning you can have multiple
> processes subscribing to events that the kernel publishes. You don't
> have to resort to polling the kernel for details of dynamic changes
> (for example, "a new entry has been added to table foo", etc.).
> As a matter of fact, the original design used to allow user space to
> advertise to both the kernel and other user space apps (and unicast
> worked to/from kernel/user and user/user). I haven't looked at that
> recently, so it could be broken.
> Note: while these events are also subject to message loss, the netlink
> robustness described earlier is usable here as well (via socket
> errors).
> For example, if the kernel attempted to send an event which had the
> misfortune of not making it, the user will be notified and can recover
> by requesting a related table dump, etc., to see what changed.
>
> - And as Nik mentioned: the new YAML-model-to-generated-code approach
> that is now common in generic netlink greatly reduces developer effort.
> Although in my opinion we really need this stuff integrated into tools
> like iproute2.
>
> I am pretty sure I left out some important details (maybe I can write
> a small doc when I am in better shape).
Thanks for such a detailed answer. I'm not against netlink; I'm against
using netlink to configure complex HW objects.
Thanks
>
> cheers,
> jamal
>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-15 20:49 ` Netlink vs ioctl WAS(Re: " Jamal Hadi Salim
2025-03-17 12:57 ` Leon Romanovsky
@ 2025-03-18 22:49 ` Jason Gunthorpe
2025-03-19 18:21 ` Jamal Hadi Salim
1 sibling, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 22:49 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Leon Romanovsky, Nikolay Aleksandrov,
Linux Kernel Network Developers, Shrijeet Mukherjee, alex.badea,
eric.davis, rip.sohan, David Ahern, bmt, roland, Winston Liu,
dan.mihailescu, kheib, parth.v.parikh, davem, ian.ziemba,
andrew.tauferner, welch, rakhahari.bhunia, kingshuk.mandal,
linux-rdma, Jakub Kicinski, Paolo Abeni
On Sat, Mar 15, 2025 at 04:49:20PM -0400, Jamal Hadi Salim wrote:
> On "unreliable": This is typically a result of some request response
> (or a subscribed to event) whose execution has failed to allocate
> memory in the kernel or overrun some buffers towards user space;
> however, any such failures are signalled to user space and can be
> recovered from.
No, they can't be recovered from in all cases. Randomly failing system
calls because of memory pressure are a horrible foundation on which to
build something like RDMA. It is not acceptable that something
like a destroy system call would just randomly fail because the kernel
is OOMing. There is no recovery from this beyond leaking memory - the
opposite of what you want in an OOM situation.
> ioctl is synchronous, which gives it the "reliability" and "speed".
> IIRC, if a memory failure were to happen on an ioctl it would block
> until it is successful?
It would fail back to userspace and unwind whatever it did.
The unwinding is tricky, and RDMA's infrastructure has a lot of support
to make it easier for driver writers to get this right in all the
different error cases.
Overall, system calls here should either succeed or fail and be the
same as a NOP. No failure should actually do something and then create
some resource leak because userspace didn't know about it.
> Extensibility: ioctls take binary structs, which makes them much harder
> to extend but adds to that "speed". Once you pick your struct, you are
> stuck with it - as opposed to netlink, which uses formally defined TLVs
> that make it highly extensible.
RDMA uses TLVs now too. It has one of the largest uAPI surfaces in the
kernel, TLVs were introduced for the same reason netlink uses them.
RDMA also has special infrastructure to split up the TLV space between
core code and HW driver code which is a key feature and necessary part
of how you'd build a user/kernel split driver.
> - And as Nik mentioned: the new YAML-model-to-generated-code approach
> that is now common in generic netlink greatly reduces developer effort.
> Although in my opinion we really need this stuff integrated into tools
> like iproute2.
RDMA also has a DSL-like scheme for defining schema, and centralized
parsing and validation. IMHO its capability falls somewhere between
the old netlink policy stuff and the new YAML stuff.
But just focusing on schema and TLVs really undersells all the
specialized infrastructure that exists for managing objects, security,
HW pass through and other infrastructure things unique to RDMA.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-14 14:53 ` Bernard Metzler
2025-03-17 12:52 ` Leon Romanovsky
@ 2025-03-19 13:52 ` Jason Gunthorpe
2025-03-19 14:02 ` Nikolay Aleksandrov
1 sibling, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 13:52 UTC (permalink / raw)
To: Bernard Metzler
Cc: Leon Romanovsky, Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
roland@enfabrica.net, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Fri, Mar 14, 2025 at 02:53:40PM +0000, Bernard Metzler wrote:
> I assume the correct way forward is to first clarify the
> structure of all user-visible objects that need to be
> created/controlled/destroyed, and to route them through
> this interface. Some will require extensions to given objects,
> some may be new, some will be as-is. rdma_netlink will probably
> be the right interface to look at for job control.
As I understand the job ID model, you will need to have some privileged
entity create a "job ID file descriptor" that can be passed around
to unprivileged processes to grant them access to the job ID. This is
necessary since the Job ID becomes part of the packet headers and we
must secure userspace to prevent hijacking or spoofing of these values
on the wire.
Netlink has a major downside that you can't use filesystem ACL
permissions to control access, so building a low privilege daemon just
to do job id management seems to me to be more difficult.
As an example, I would imagine having a job management char device
with a filesystem ACL that only allows something like SLURM's
privileged orchestrator to talk to it. SLURM wouldn't have something
like CAP_NET_ADMIN. SLURM would set up the job ID and pass the "Job ID
FD" to the actual MPI workload processes to grant them permission to
use those network headers.
Nobody else in the system can create Job IDs besides SLURM, and in a
multi-user environment one user cannot reach into the other and hijack
their job ID because the FD does not leak outside the MPI process
tree.
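The FD passing itself would presumably use the standard SCM_RIGHTS
mechanism; a minimal sketch (the job-FD semantics are hypothetical, the
API is not):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send an already-open "job ID fd" to a workload process over a unix socket */
static int send_job_fd(int unix_sock, int job_fd)
{
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr align;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS; /* kernel installs a dup of job_fd in the peer */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &job_fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}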
This RFC doesn't describe the intended security model, but I'm very
surprised to see ultraeth_nl_job_new_doit() not do any capability
checks, or any security whatsoever around access to the job.
It should be obvious that this would be a fairly trivial add-on to
RDMA: a new char dev, some rdma netlink to report and inspect the
global job list, and a little driver helper to associate job FDs with
uverbs objects and retrieve the job ID. In fact we just did something
similar with the UCAP system.
Further, jobs are not a concept unique to UE; I am seeing other RDMA
scenarios talk about jobs now, perhaps inspired by UE. There is zero
reason to make jobs a UE-specific concept and not a general RDMA
concept.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-19 13:52 ` Jason Gunthorpe
@ 2025-03-19 14:02 ` Nikolay Aleksandrov
0 siblings, 0 replies; 76+ messages in thread
From: Nikolay Aleksandrov @ 2025-03-19 14:02 UTC (permalink / raw)
To: Jason Gunthorpe, Bernard Metzler
Cc: Leon Romanovsky, netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On 3/19/25 15:52, Jason Gunthorpe wrote:
> On Fri, Mar 14, 2025 at 02:53:40PM +0000, Bernard Metzler wrote:
>
>> I assume the correct way forward is to first clarify the
>> structure of all user-visible objects that need to be
>> created/controlled/destroyed, and to route them through
>> this interface. Some will require extensions to given objects,
>> some may be new, some will be as-is. rdma_netlink will probably
>> be the right interface to look at for job control.
>
> As I understand the job ID model you will need to have some privileged
> entity to create a "job ID file descriptor" that can be passed around
> to unprivileged processes to grant them access to the job ID. This is
> necessary since the Job ID becomes part of the packet headers and we
> must secure userspace to prevent a hijack or spoof these values on the
> wire.
>
> Netlink has a major downside that you can't use filesystem ACL
> permissions to control access, so building a low privilege daemon just
> to do job id management seems to me to be more difficult.
>
> As an example, I would imagine having a job management char device
> with a filesystem ACL that only allows something like SLRUM's
> privileged orchestrator to talk to it. SLURM wouldn't have something
> like CAP_NET_ADMIN. SLURM would setup the job ID and pass the "Job ID
> FD" to the actual MPI workload processes to grant them permission to
> use those network headers.
>
> Nobody else in the system can create Job ID's besides SLURM, and in a
> multi-user environment one user cannot reach into the other and hijack
> their job ID because the FD does not leak outside the MPI process
> tree.
>
> This RFC doesn't describe the intended security model, but I'm very
> surprised to see ultraeth_nl_job_new_doit() not do any capability
> checks, or any security what so ever around access to the job.
>
It doesn't need to do any capability checking because that is defined in the
YAML model: there you can see flags: [ admin-perm ], so in the genl ops code
that is automatically generated we get .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO
for these ops, which in turn means the genetlink code will check whether the
caller has CAP_NET_ADMIN. An unprivileged process can request to associate with
multiple jobs, and it's the privileged process that has to configure and control
them. In this version we have only configuration. Once the specs become publicly
available we will be able to share more information about how it's expected to
work.
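For reference, a sketch of roughly what such a generated op entry looks
like (ULTRAETH_NL_CMD_JOB_NEW is an assumed command name; the doit is the
one named above):

#include <net/genetlink.h>

static int ultraeth_nl_job_new_doit(struct sk_buff *skb, struct genl_info *info);

static const struct genl_split_ops ultraeth_nl_ops[] = {
        {
                .cmd   = ULTRAETH_NL_CMD_JOB_NEW, /* assumed command name */
                .doit  = ultraeth_nl_job_new_doit,
                /* generated from "flags: [ admin-perm ]" in the YAML model;
                 * the genetlink core rejects callers without CAP_NET_ADMIN
                 * before the doit ever runs */
                .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
        },
};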
Cheers,
Nik
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
` (13 preceding siblings ...)
2025-03-08 18:46 ` [RFC PATCH 00/13] Ultra Ethernet driver introduction Leon Romanovsky
@ 2025-03-19 16:48 ` Jason Gunthorpe
2025-03-20 11:13 ` Yunsheng Lin
2025-03-24 20:22 ` Roland Dreier
14 siblings, 2 replies; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 16:48 UTC (permalink / raw)
To: Nikolay Aleksandrov
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni
On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> Hi all,
> This patch-set introduces minimal Ultra Ethernet driver infrastructure and
> the lowest Ultra Ethernet sublayer - the Packet Delivery Sublayer (PDS),
> which underpins the entire communication model of the Ultra Ethernet
> Transport[1] (UET). Ultra Ethernet is a new RDMA transport designed for
> efficient AI and HPC communication.
I was away while this discussion happened so I've gone through and
read the threads, looked at the patches and I don't think I've changed
my view since I talked to Enfabrica privately on this topic almost a
year ago.
I do not agree with creating a new subsystem (or whatever you are
calling drivers/ultraeth) for a single RDMA protocol and see nothing
new here to change my mind. I would likely NAK the direction I see in
this RFC, as I have done with other past attempts to build RDMA HW
interfaces outside of the RDMA subsystem.
Since none of that past discussion seems to have been acknowledged or
rebutted in this series I will repeat the main points:
1) I'm aware of something like 5-7 new protocols that are competing
for the same market as Ultra Ethernet. We can't give everyone and
their dog a new subsystem (or whatever) and all the maintainability
negatives that come with that. As a matter of maintainability we
need to see consolidation here, not fragmentation!
Yes, UE is a consortium driven standard, which is unique and a big
positive, but I don't believe anyone can say for certain what
direction the industry is going to go in. Many consortium standards
have failed to get adoption in the past even with a large number of
member companies.
Nor can we know what concepts in UE are going to be copied into
other competing RDMA transports. See my other remarks on job key
for an example. Prematurely siloing stuff in drivers/ultraeth is
very much the wrong technical direction for maintainability.
That said, I think UE should be in the kernel and have a fair
chance to compete for market share. Just in a maintainable and
appropriate way while the industry evolves.
2) Due to the above, I'm pretty confident we will see RDMA NICs
supporting a lot of different protocols. In fact they already do.
From a kernel maintainability perspective we really want one RDMA
driver leveraging as much common infrastructure between the
protocols as possible. We do not want to see a single HW driver
further split up needlessly to other subsystems, that would be a
big maintainability downside.
To put a clear point on this, mlx5 has been gaining new protocols
and fitting into the existing driver model for a number of years
now. In fact there is speculation that UE could be implemented in
mlx5 RDMA with minimal kernel changes. There would be no reason to
try to mess up the driver to also interact with this stuff in
drivers/ultraeth as seems to be proposed here.
I think other HW will be similar. UE isn't so radically different
that every HW path will need to diverge from classical RDMA. Nor is
it so dissimilar to other competing proposals. We don't want
artificial differences; we want to create things that can be re-used
when appropriate.
Leon's response to Bart is correct: we already have similar
examples of almost everything UE does. Bart is also correct that
verbs would be a PITA, but RDMA userspace moved beyond verbs'
limitations years ago. A lot of mlx5 stuff is not using verbs
today, for instance. EFA and other examples use extensive stuff
beyond verbs.
3) Building a user/kernel split HW driver model is very hard. RDMA has
spent 20 years learning how to do this and making a lot of mistakes
along the way. I think we are in a good place now as a lot of new
functionality has been rolled out with very little stress in the
past few years. I see no reason to believe UE would not follow that
same pattern.
Frankly, I see no evidence in this RFC of any of that learning.
Probably because it doesn't actually show any HW or even seem to
contemplate what HW would even look like. There isn't even a call
to pin_user_pages() in this RFC (see the sketch at the end of this
message). You can't call yourself *RDMA* if you are not doing direct
access to userspace memory!
So, this RFC is woefully incomplete. I think you greatly underestimate
how much work you are looking at to duplicate and re-invent the
existing RDMA infrastructure. Frankly, I'm not even sure why you
sent this RFC when it doesn't show enough to even evaluate.
4) For example, I get the feeling this RFC is repeating the original
cardinal sin of RDMA by biasing the UAPI design toward a single
philosophy.
I.e. you said:
> I should've been more specific - it is not an issue for UEC and the way
> our driver's netlink API is designed. We fully understand the pros and
> cons of our approach.
Which is exactly the kind of narrow thinking that creates long-term
trouble in uAPI design. Do your choices actually work for *ALL*
future HW designs and other drivers, not just "our driver's
netlink"? I think not.
Given the UE spec doesn't even have something pretending to be a
kernel/user interface standard I think we will see an extreme
variety of HW implementations here.
The proven modern RDMA approach to uAPI design is the right way to
solve this problem. It is shown to work. It already implements
multi-protocol RDMA and has alot of drivers demonstrating it now.
5) RDMA actually has pretty good infrastructure. It has a lot of
complex infrastructure features; for example, see the long threads I
recently wrote on how its hot plug architecture works.
Even "basic" things like mmaping a doorbell page have thousands of
lines of support infrastructure to make the drivers work well and
support enterprise level HA features.
You get to have these features if you write an RDMA
driver. Otherwise you have to clone them all.
From what I can tell in this RFC the implementations of basic
things like the object model are worse than what we have in RDMA
already. Things like a device model don't even exist. Let alone
advanced stuff like hot plug, namespaces, cgroups, DMA operations
and all the stuff needed for HW bindings.
It has a *long* way to go to even reach feature parity in terms of
what the core RDMA device model and object model provide a HW
driver, let alone complex things like uverbs :\
This whole RFC reeks of NIH: it is more fun to go off and do
something greenfield than do the maintenance work to evolve an
existing code base.
6) I offered many things, including not having to use libibverbs,
adding someone to maintain the UE-specific portions, and helping to
architect the solution within RDMA. So it is not like there is some
blocker that is forcing a drivers/ultraeth, or that someone has
even said no to any proposal made.
For instance I spent a lot of time with the Habana Labs guys to work
out how to fit their almost-RDMA stuff into RDMA. It required some
careful thinking to accommodate their limited HW, but in the end it
did manage to fit in fine.
They also started as you did here with some weird thing. In the end
we all agreed that RDMA HW support belongs in the RDMA subsystem,
using normal RDMA APIs. We are trying not to proliferate these
things.
I feel like this is repeating the drivers/accel vs DRM debate from a
few years ago. All the points DaveA made apply here just as well,
arguably even more so as RDMA has even more robust shared
infrastructure that should be used instead of re-invented. At least
Habana had a reason for accel - they wanted to skip some DRM
rules. This RFC doesn't even have that.
Thus, I don't expect you will get support for something like this to
be merged; you should change directions.
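As the sketch promised in point 3, here is (roughly) the long-term
pinning that RDMA memory registration paths are built around; the helper
name is made up, while pin_user_pages_fast() and FOLL_LONGTERM are the
real kernel API (recent-kernel signature, approximately what
ib_umem_get() does for drivers):

#include <linux/mm.h>
#include <linux/slab.h>

static int uet_pin_user_buf(unsigned long start, int nr_pages,
                            struct page ***pagesp)
{
        struct page **pages;
        int npinned;

        pages = kvcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* FOLL_LONGTERM: the pages stay pinned for the lifetime of the
         * MR, so HW can DMA to user memory without faulting */
        npinned = pin_user_pages_fast(start, nr_pages,
                                      FOLL_WRITE | FOLL_LONGTERM, pages);
        if (npinned != nr_pages) {
                if (npinned > 0)
                        unpin_user_pages(pages, npinned);
                kvfree(pages);
                return npinned < 0 ? npinned : -EFAULT;
        }
        *pagesp = pages;
        return 0;
}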
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-18 22:49 ` Jason Gunthorpe
@ 2025-03-19 18:21 ` Jamal Hadi Salim
2025-03-19 19:19 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Jamal Hadi Salim @ 2025-03-19 18:21 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Nikolay Aleksandrov,
Linux Kernel Network Developers, Shrijeet Mukherjee, alex.badea,
eric.davis, rip.sohan, David Ahern, bmt, roland, Winston Liu,
dan.mihailescu, kheib, parth.v.parikh, davem, ian.ziemba,
andrew.tauferner, welch, rakhahari.bhunia, kingshuk.mandal,
linux-rdma, Jakub Kicinski, Paolo Abeni
On Tue, Mar 18, 2025 at 6:49 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Sat, Mar 15, 2025 at 04:49:20PM -0400, Jamal Hadi Salim wrote:
>
> > On "unreliable": This is typically a result of some request response
> > (or a subscribed to event) whose execution has failed to allocate
> > memory in the kernel or overrun some buffers towards user space;
> > however, any such failures are signalled to user space and can be
> > recovered from.
>
> No, they can't be recovered from in all cases.
> Randomly failing system
> calls because of memory pressure are a horrible foundation on which to
> build something like RDMA. It is not acceptable that something
> like a destroy system call would just randomly fail because the kernel
> is OOMing. There is no recovery from this beyond leaking memory - the
> opposite of what you want in an OOM situation.
>
Curious how you guarantee that a "destroy" will not fail under OOM. Do
you have pre-allocated memory?
Note: basic request-response netlink messaging like a destroy, which
merely returns you a success/fail indication, _should not fail_ once
that message hits the kernel consumer (e.g. the tc subsystem). A
request to destroy or create something in the kernel, for example,
would be a fit.
Things that may fail because of memory pressure are requests that
solicit data (typically lots of data) from the kernel; for example, if
you dump a large kernel table that won't fit in one netlink message, it
will be sent to you/user in multiple messages, and somewhere after the
first chunk gets sent your way we may hit an OOM issue. For these
sorts of message types, user space will be signalled so it can
recover. "Recover" could be to issue another message to continue where
we left off....
> > ioctl is synchronous, which gives it the "reliability" and "speed".
> > IIRC, if a memory failure were to happen on an ioctl it would block
> > until it is successful?
>
> It would fail back to userspace and unwind whatever it did.
>
Very similar with netlink.
> The unwinding is tricky, and RDMA's infrastructure has a lot of support
> to make it easier for driver writers to get this right in all the
> different error cases.
>
> Overall, system calls here should either succeed or fail and be the
> same as a NOP. No failure should actually do something and then create
> some resource leak because userspace didn't know about it.
>
Yes, this is how netlink works as well. If a failure to delete an
object occurs then every transient state gets restored. This is always
the case for simple requests (a delete/create/update). For requests
that batch multiple objects there are cases where there is no
unwinding. For example, you could send a request to create a bunch of
objects in the kernel and halfway through the kernel fails for
whatever reason and has to bail out.
Most of the subsystems I have seen as such return a "success", even
though they only succeeded on the first half. Some return a success
with a count of how many objects were created.
It is feasible on a per-subsystem level to set flags which would
instruct the kernel of which mode to use, etc.
> > Extensibility: ioctls take binary structs, which makes them much harder
> > to extend but adds to that "speed". Once you pick your struct, you are
> > stuck with it - as opposed to netlink, which uses formally defined TLVs
> > that make it highly extensible.
>
> RDMA uses TLVs now too. It has one of the largest uAPI surfaces in the
> kernel, TLVs were introduced for the same reason netlink uses them.
>
Makes sense. So ioctls with TLVs ;->
I am suspecting you don't have concepts of TLVs inside TLVs for
hierarchies within objects.
> RDMA also has special infrastructure to split up the TLV space between
> core code and HW driver code which is a key feature and necessary part
> of how you'd build a user/kernel split driver.
>
The T namespace is split between core code and driver code?
I can see that as being useful for debugging maybe? What else?
> > > - And as Nik mentioned: the new YAML-model-to-generated-code approach
> > > that is now common in generic netlink greatly reduces developer effort.
> > > Although in my opinion we really need this stuff integrated into tools
> > > like iproute2.
>
> RDMA also has a DSL-like scheme for defining schema, and centralized
> parsing and validation. IMHO its capability falls somewhere between
> the old netlink policy stuff and the new YAML stuff.
>
I meant the ability to start with a data model and generate code as
being useful.
Where can I find the RDMA DSL?
> But just focusing on schema and TLVs really undersells all the
> specialized infrastructure that exists for managing objects, security,
> HW pass through and other infrastructure things unique to RDMA.
>
I don't know enough about RDMA infra to comment, but IIUC you are
saying that it is the control infrastructure (that sits in
userspace?), doing all those things you mention, that is more
important.
IMO, when you start building complex systems that's always the case
(the "mechanism vs. policy" principle).
cheers,
jamal
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-17 12:30 ` Leon Romanovsky
@ 2025-03-19 19:12 ` Stanislav Fomichev
0 siblings, 0 replies; 76+ messages in thread
From: Stanislav Fomichev @ 2025-03-19 19:12 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Nikolay Aleksandrov, netdev, shrijeet, alex.badea, eric.davis,
rip.sohan, dsahern, bmt, roland, winston.liu, dan.mihailescu,
kheib, parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, kuba, pabeni,
Jason Gunthorpe
On 03/17, Leon Romanovsky wrote:
> On Fri, Mar 14, 2025 at 01:51:33PM -0700, Stanislav Fomichev wrote:
> > On 03/12, Leon Romanovsky wrote:
> > > On Wed, Mar 12, 2025 at 04:20:08PM +0200, Nikolay Aleksandrov wrote:
> > > > On 3/12/25 1:29 PM, Leon Romanovsky wrote:
> > > > > On Wed, Mar 12, 2025 at 11:40:05AM +0200, Nikolay Aleksandrov wrote:
> > > > >> On 3/8/25 8:46 PM, Leon Romanovsky wrote:
> > > > >>> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
> > > > [snip]
> > > > >> Also we have the ephemeral PDC connections that come and go as
> > > > >> needed. There are more such objects coming with more
> > > > >> state, configuration and lifecycle management. That is why we added a
> > > > >> separate netlink family to cleanly manage them without trying to fit
> > > > >> a square peg in a round hole so to speak.
> > > > >
> > > > > Yeah, I saw that you are planning to use netlink to manage objects,
> > > > > which is very questionable. It is slow, unreliable, requires sockets,
> > > > > needs more parsing logic etc.
> > > > >
> > > > > To avoid all this overhead, RDMA uses netlink-like ioctl calls, which
> > > > > fits better for object configurations.
> > > > >
> > > > > Thanks
> > > >
> > > > We'd definitely like to keep using netlink for control path object
> > > > management. Also please note we're talking about genetlink family. It is
> > > > fast and reliable enough for us, very easily extensible,
> > > > has a nice precise object definition with policies to enforce various
> > > > limitations, has extensive tooling (e.g. ynl), communication can be
> > > > monitored in realtime for debugging (e.g. nlmon), has a nice human
> > > > readable error reporting, gives the ability to easily dump large object
> > > > groups with filters applied, YAML family definitions and so on.
> > > > Having sockets or parsing are not issues.
> > >
> > > Of course it is an issue, as netlink relies on Netlink sockets, which
> > > means that you constantly move your configuration data instead of
> > > following the pattern standard across the whole Linux kernel: allocating
> > > configuration structs in user-space and just providing a pointer to them
> > > through an ioctl call.
> >
> > And you still call copy_from_user on that user-space pointer. So how
> > is it an improvement over netlink? Netlink is just a flexible TLV
> > format; if you don't like read/write calls, we could add a
> > netlink_ioctl with a pointer to a netlink message...
>
> You need to build that netlink message, which you do by multiple copies
> in user space.
>
> I understand your desire to see netdev patterns everywhere and agree
> with the position that netlink is a perfect choice for dynamic configurations.
> However, I hold the position that it is not a good fit for configuring
> strictly dependent hardware objects.
>
> You already have a TLV-based API in drivers/infiniband; there is no need
> to invent a new one.
Let's revisit this discussion later depending on where the ultra eth
stuff lands. If it gets folded into the ibv subsystem, keeping the same
ibv conventions makes sense. If not, I'm not sure I understand your
"multiple copies in user space" argument.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-19 18:21 ` Jamal Hadi Salim
@ 2025-03-19 19:19 ` Jason Gunthorpe
2025-03-25 14:12 ` Jamal Hadi Salim
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-19 19:19 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Leon Romanovsky, Nikolay Aleksandrov,
Linux Kernel Network Developers, Shrijeet Mukherjee, alex.badea,
eric.davis, rip.sohan, David Ahern, bmt, roland, Winston Liu,
dan.mihailescu, kheib, parth.v.parikh, davem, ian.ziemba,
andrew.tauferner, welch, rakhahari.bhunia, kingshuk.mandal,
linux-rdma, Jakub Kicinski, Paolo Abeni
On Wed, Mar 19, 2025 at 02:21:23PM -0400, Jamal Hadi Salim wrote:
> Curious how you guarantee that a "destroy" will not fail under OOM. Do
> you have pre-allocated memory?
It just never allocates memory? Why would a simple system call like a
destruction allocate any memory?
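As a rough illustration of that pattern (hypothetical names, not the
actual RDMA destroy path): the only copy is a fixed-size struct onto
the kernel stack, and the destroy itself only releases memory.

struct foo_destroy_cmd { __u32 handle; };       /* hypothetical uAPI */

static long foo_destroy(struct foo_ctx *ctx, void __user *arg)
{
        struct foo_destroy_cmd cmd;     /* lands on the kernel stack */
        struct foo_obj *obj;

        if (copy_from_user(&cmd, arg, sizeof(cmd)))
                return -EFAULT;         /* no kernel allocation so far */

        obj = foo_lookup(ctx, cmd.handle);
        if (!obj)
                return -ENOENT;

        foo_obj_free(obj);              /* destruction only frees memory */
        return 0;
}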
> > Overall, system calls here should either succeed or fail and be the
> > same as a NOP. No failure that actually did something and then creates
> > some resource leak or something because userspace didn't know about
> > it.
>
> Yes, this is how netlink works as well. If a failure to delete an
> object occurs then every transient state gets restored. This is always
> the case for simple requests (a delete/create/update). For requests
> that batch multiple objects there are cases where there is no
> unwinding.
I'm not sure that is completely true; if userspace messes up the
netlink read() side of the API and copy_to_user() fails then you can
get these inconsistencies. In the RDMA model even those edge cases are
properly unwound, just like a normal system call would be.
> Makes sense. So ioctls with TLVs ;->
> I am suspecting you don't have concepts of TLVs inside TLVs for
> hierarchies within objects.
No, it has not been needed yet, or at least the cases that have come
up have been happy to use arrays of structs for the nesting. The
method calls themselves don't tend to have that kind of challenging
structure for their arguments.
> > RDMA also has special infrastructure to split up the TLV space between
> > core code and HW driver code which is a key feature and necessary part
> > of how you'd build a user/kernel split driver.
>
> The T namespace is split between core code and driver code?
> I can see that as being useful for debugging maybe? What else?
RDMA is all about having a user/kernel driver co-design. This means a
driver has code in a userspace library and code in the kernel that
work together to implement the functionality. The userspace library
should be thought of as an extension of the kernel driver into
userspace.
So, there is a lot of traffic between the two driver components that is
just private and unique to the driver. This is what the driver
namespace is used for.
For instance, there is a common method call to create a queue. The
queue has a number of core parameters, like depth and address; then it
calls the driver, and there are a bunch of device-specific parameters
too, like, say, the queue entry format.
Every driver gets to define its own parameters best suited to its own
device and its own user/kernel split.
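A hedged sketch of that ID-space split, using the in-tree
UVERBS_ID_NS_SHIFT mechanism but a hypothetical "foodrv" driver (the
in-tree mlx5 uapi headers define their driver attributes the same way):

#include <rdma/uverbs_ioctl.h>

/* core attributes for a common create-queue method live in the default
 * namespace; the driver's device-specific attributes start at the
 * driver namespace boundary and never collide with core IDs */
enum foodrv_create_queue_attrs {
        FOODRV_ATTR_QUEUE_ENTRY_FORMAT = (1U << UVERBS_ID_NS_SHIFT),
        FOODRV_ATTR_QUEUE_DOORBELL_MODE,
};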
Building a split user/kernel driver is complicated and uAPI is one of
the biggest challenges :\
> > > > - And as Nik mentioned: The new (YAML) model-to-generated-code approach
> > > > that is now common in generic netlink highly reduces developer effort.
> > > Although in my opinion we really need this stuff integrated into tools
> > > like iproute2..
> >
> > > RDMA also has a DSL-like scheme for defining schema, and centralized
> > > parsing and validation. IMHO its capability falls someplace between
> > the old netlink policy stuff and the new YAML stuff.
> >
>
> I meant the ability to start with a data model and generate code as
> being useful.
> Where can i find the RDMA DSL?
It is done with the C preprocessor instead of an external YAML
file. Look at drivers/infiniband/core/uverbs_std_types_mr.c at the
end. It describes a data model, but it is elaborated at runtime into
an efficient parse tree, not by using a code generator.
The schema is more classical object oriented RPC type scheme where you
define objects, methods and then method parameters. The objects have
an entire kernel side infrastructure to manage their lifecycle and the
attributes have validation and parsing done prior to reaching the C
function implementing the method.
I always thought it was netlink inspired, but more suited to building
a uAPI out of. Like you get actual system call names (e.g.
UVERBS_METHOD_REG_DMABUF_MR) that have actual C functions implementing
them. There is special help to implement object allocation and
destruction functions, and freedom to have as many methods per object
as make sense.
> I don't know enough about RDMA infra to comment but IIUC, you are
> saying that it is the control infrastructure (that sits in
> userspace?), that does all those things you mention, that is more
> important.
There is an entire object model in the kernel and it is linked into
the schema.
For instance in the above example we have a schema for an object
method like this:
DECLARE_UVERBS_NAMED_METHOD(
        UVERBS_METHOD_REG_DMABUF_MR,
        UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
                        UVERBS_OBJECT_MR,
                        UVERBS_ACCESS_NEW,
                        UA_MANDATORY),
        UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
                        UVERBS_OBJECT_PD,
                        UVERBS_ACCESS_READ,
                        UA_MANDATORY),
That says it accepts two object handles MR and PD as input to the
method call.
The core code keeps track of all these object handles, validates that the
ID number given by userspace is referring to the correct object, of the
correct type, in the correct state, locks things against concurrent
destruction, and then gives a trivial way for the C method
implementation to pick up the object pointer:
        struct ib_pd *pd =
                uverbs_attr_get_obj(attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);
Which can't fail because everything was already checked before we get
here. This is all designed to greatly simplify and make robust the
method implementations that are often in driver code.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-19 16:48 ` Jason Gunthorpe
@ 2025-03-20 11:13 ` Yunsheng Lin
2025-03-20 14:32 ` Jason Gunthorpe
2025-03-24 20:22 ` Roland Dreier
1 sibling, 1 reply; 76+ messages in thread
From: Yunsheng Lin @ 2025-03-20 11:13 UTC (permalink / raw)
To: Jason Gunthorpe, Nikolay Aleksandrov
Cc: netdev, shrijeet, alex.badea, eric.davis, rip.sohan, dsahern, bmt,
roland, winston.liu, dan.mihailescu, kheib, parth.v.parikh, davem,
ian.ziemba, andrew.tauferner, welch, rakhahari.bhunia,
kingshuk.mandal, linux-rdma, kuba, pabeni, huchunzhi,
jerry.lilijun, zhangkun09, wang.chihyung
On 2025/3/20 0:48, Jason Gunthorpe wrote:
> On Fri, Mar 07, 2025 at 01:01:50AM +0200, Nikolay Aleksandrov wrote:
>> Hi all,
>> This patch-set introduces minimal Ultra Ethernet driver infrastructure and
>> the lowest Ultra Ethernet sublayer - the Packet Delivery Sublayer (PDS),
>> which underpins the entire communication model of the Ultra Ethernet
>> Transport[1] (UET). Ultra Ethernet is a new RDMA transport designed for
>> efficient AI and HPC communication.
>
> I was away while this discussion happened so I've gone through and
> read the threads, looked at the patches and I don't think I've changed
> my view since I talked to Enfabrica privately on this topic almost a
> year ago.
>
> I do not agree with creating a new subsystem (or whatever you are
> calling drivers/ultraeth) for a single RDMA protocol and see nothing
> new here to change my mind. I would likely NAK the direction I see in
> this RFC, as I have NAKed other past attempts to build RDMA HW interfaces
> outside of the RDMA subsystem.
>
> Since none of that past discussion seems to have been acknowledged or
> rebutted in this series I will repeat the main points:
>
> 1) I'm aware of something like 5-7 new protocols that are competing
> for the same market as Ultra Ethernet. We can't give everyone and
> their dog a new subsystem (or whatever) and all the maintainability
> negatives that come with that. As a matter of maintainability we
> need to see consolidation here, not fragmentation!
>
> Yes, UE is a consortium-driven standard, which is unique and a big
> positive, but I don't believe anyone can say for certain what
> direction the industry is going to go in. Many consortium standards
> have failed to get adoption in the past even with a large number of
> member companies.
>
> Nor can we know what concepts in UE are going to be copied into
> other competing RDMA transports. See my other remarks on job key
> for an example. Prematurely siloing stuff in drivers/ultraeth is
> very much the wrong technical direction for maintainability.
>
> That said, I think UE should be in the kernel and have a fair
> chance to compete for market share. Just in a maintainable and
> appropriate way while the industry evolves.
>
> 2) Due to the above, I'm pretty confident we will see RDMA NICs
> supporting a lot of different protocols. In fact they already do.
>
> From a kernel maintainability perspective we really want one RDMA
> driver leveraging as much common infrastructure between the
> protocols as possible. We do not want to see a single HW driver
> further split up needlessly across other subsystems; that would be a
> big maintainability downside.
>
> To put a clear point on this, mlx5 has been gaining new protocols
> and fitting into the existing driver model for a number of years
> now. In fact there is speculation that UE could be implemented in
> mlx5 RDMA with minimal kernel changes. There would be no reason to
> try to mess up the driver to also interact with this stuff in
> drivers/ultraeth as seems to be proposed here.
>
> I think other HW will be similar. UE isn't so radically different
> that every HW path will need to diverge from classical RDMA. Nor is
> it so dissimilar to other competing proposals. We don't want
> artificial differences; we want to create things that can be re-used
> when appropriate.
>
> Leon's response to Bart is correct, we already have similar
> examples of almost everything UE does. Bart is also correct that
> verbs would be a PITA, but RDMA userspace has moved beyond verbs
> limitations years ago now. A lot of mlx5 stuff is not using verbs
> today, for instance. EFA and other examples use extensive stuff
> beyond verbs.
Regarding reusing the existing rdma subsystem for a new protocol:
Currently EFA seems to be layering an RDM layer on top of the SRD
transport layer, see [1]; the RDM layer is implemented in software in
libfabric while SRD seems to be implemented in hardware, which
provides a 'Scalable Reliable Datagram' service through the QP type
EFA_QP_DRIVER_TYPE_SRD.
I am not sure if layers like SRD and RDM are clean layering from a
protocol design perspective.
But if the hardware implements both the SRD and RDM layers, then
there might be two types of objects that need managing: an SRD object
might be shared between different applications, and an RDM object
needs to be created based on an SRD object.
As the existing rdma subsystem doesn't seem to support the above
use case yet, and as we are discussing a possible new subsystem or
updating the existing subsystem to support a new protocol here, it would
be good to discuss whether it is possible to support the above case or
whether another new subsystem is needed for that use case too.
1. https://github.com/ofiwg/libfabric/blob/main/prov/efa/docs/efa_rdm_protocol_v4.md
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-20 11:13 ` Yunsheng Lin
@ 2025-03-20 14:32 ` Jason Gunthorpe
2025-03-20 20:05 ` Sean Hefty
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 14:32 UTC (permalink / raw)
To: Yunsheng Lin
Cc: Nikolay Aleksandrov, netdev, shrijeet, alex.badea, eric.davis,
rip.sohan, dsahern, bmt, roland, winston.liu, dan.mihailescu,
kheib, parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, kuba, pabeni,
huchunzhi, jerry.lilijun, zhangkun09, wang.chihyung
On Thu, Mar 20, 2025 at 07:13:01PM +0800, Yunsheng Lin wrote:
> As the existing rdma subsystem doesn't seem to support the above
> use case yet
Why would you say that? If EFA needs SRD and RDM objects in RDMA they
can create them, it is not a big issue. To my knowledge they haven't
asked for them.
mlx5 has all kinds of wacky objects these days; it is not an issue to
allow HW to innovate however it likes within RDMA, and we are not
limited to purely IBTA-defined verbs-like objects any longer.
mlx5 already has RDM-like objects <shrug>
> As the existing rdma subsystem doesn't seem to support the above
> use case yet and as we are discussing a possible new subsystem or
If you want something concrete then ask for it and we can discuss how
to fit it in.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-20 14:32 ` Jason Gunthorpe
@ 2025-03-20 20:05 ` Sean Hefty
2025-03-20 20:12 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-20 20:05 UTC (permalink / raw)
To: Jason Gunthorpe, Yunsheng Lin
Cc: Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
bmt@zurich.ibm.com, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com,
kheib@redhat.com, parth.v.parikh@keysight.com, davem@redhat.com,
ian.ziemba@hpe.com, andrew.tauferner@cornelisnetworks.com,
welch@hpe.com, rakhahari.bhunia@keysight.com,
kingshuk.mandal@keysight.com, linux-rdma@vger.kernel.org,
kuba@kernel.org, pabeni@redhat.com, huchunzhi,
jerry.lilijun@huawei.com, zhangkun09@huawei.com,
wang.chihyung@huawei.com
> > As the existing rdma subsystem doesn't seem to support the above use
> > case yet
>
> Why would you say that? If EFA needs SRD and RDM objects in RDMA they
> can create them, it is not a big issue. To my knowledge they haven't asked for
> them.
When looking at how to integrate UET support into verbs, there were changes relevant to this discussion that I found were needed.
1. Allow an RDMA device to indicate that it supports multiple transports, separated per port.
2. Specify the QP type separate from the protocol.
3. Define a reliable, unconnected QP type.
Lin might be referring to 2 (assuming 3 is resolved).
These are straightforward to address. I don't think we'd end up with a protocol object (e.g. SRD); rather, it would just be an attribute of 3 (e.g. an RDM QP).
EFA defined a custom QP type with a single protocol, so they didn't try to standardize this. However, it could fit into the above model.
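To make items 2 and 3 concrete, the split might look roughly like the
sketch below; these enums are hypothetical and do not exist in today's
uAPI:

enum foo_qp_type {
        FOO_QPT_RC,             /* reliable, connected */
        FOO_QPT_UD,             /* unreliable datagram */
        FOO_QPT_RUC,            /* 3: reliable, unconnected */
};

enum foo_transport_protocol {   /* 2: chosen independently of QP type */
        FOO_PROTO_IB,
        FOO_PROTO_ROCE_V2,
        FOO_PROTO_SRD,
        FOO_PROTO_UET,
};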
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-20 20:05 ` Sean Hefty
@ 2025-03-20 20:12 ` Jason Gunthorpe
2025-03-21 2:02 ` Yunsheng Lin
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 20:12 UTC (permalink / raw)
To: Sean Hefty
Cc: Yunsheng Lin, Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
bmt@zurich.ibm.com, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com,
kheib@redhat.com, parth.v.parikh@keysight.com, davem@redhat.com,
ian.ziemba@hpe.com, andrew.tauferner@cornelisnetworks.com,
welch@hpe.com, rakhahari.bhunia@keysight.com,
kingshuk.mandal@keysight.com, linux-rdma@vger.kernel.org,
kuba@kernel.org, pabeni@redhat.com, huchunzhi,
jerry.lilijun@huawei.com, zhangkun09@huawei.com,
wang.chihyung@huawei.com
On Thu, Mar 20, 2025 at 08:05:05PM +0000, Sean Hefty wrote:
> > > As the existing rdma subsystem doesn't seem to support the above use
> > > case yet
> >
> > Why would you say that? If EFA needs SRD and RDM objects in RDMA they
> > can create them, it is not a big issue. To my knowledge they haven't asked for
> > them.
>
> When looking at how to integrate UET support into verbs, there were
> changes relevant to this discussion that I found needed.
>
> 1. Allow an RDMA device to indicate that it supports multiple transports, separated per port.
> 2. Specify the QP type separate from the protocol.
> 3. Define a reliable, unconnected QP type.
>
> Lin might be referring to 2 (assuming 3 is resolved).
That's at the verbs level though; at the kernel uAPI level we already have
various ways to do all three.
What you say makes sense to me for verbs.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-20 20:12 ` Jason Gunthorpe
@ 2025-03-21 2:02 ` Yunsheng Lin
2025-03-21 12:01 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Yunsheng Lin @ 2025-03-21 2:02 UTC (permalink / raw)
To: Jason Gunthorpe, Sean Hefty
Cc: Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
bmt@zurich.ibm.com, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com,
kheib@redhat.com, parth.v.parikh@keysight.com, davem@redhat.com,
ian.ziemba@hpe.com, andrew.tauferner@cornelisnetworks.com,
welch@hpe.com, rakhahari.bhunia@keysight.com,
kingshuk.mandal@keysight.com, linux-rdma@vger.kernel.org,
kuba@kernel.org, pabeni@redhat.com, huchunzhi,
jerry.lilijun@huawei.com, zhangkun09@huawei.com,
wang.chihyung@huawei.com
On 2025/3/21 4:12, Jason Gunthorpe wrote:
> On Thu, Mar 20, 2025 at 08:05:05PM +0000, Sean Hefty wrote:
>>>> As the existing rdma subsystem doesn't seem to support the above use
>>>> case yet
>>>
>>> Why would you say that? If EFA needs SRD and RDM objects in RDMA they
>>> can create them, it is not a big issue. To my knowledge they haven't asked for
>>> them.
>>
>> When looking at how to integrate UET support into verbs, there were
>> changes relevant to this discussion that I found needed.
>>
>> 1. Allow an RDMA device to indicate that it supports multiple transports, separated per port.
>> 2. Specify the QP type separate from the protocol.
>> 3. Define a reliable, unconnected QP type.
>>
>> Lin might be referring to 2 (assuming 3 is resolved).
Yes, that was mainly my concern too. If supporting a different protocol
in the existing rdma subsystem means using a driver-specific QP type and
doing most of the handling in the driver, then that does not seem
reasonable from a new-protocol perspective, since multiple vendors
implementing the same new protocol might end up with duplicated code in
their drivers.
>
> That's at the verbs level though; at the kernel uAPI level we already have
> various ways to do all three.
>
> What you say makes sense to me for verbs.
I suppose verbs corresponds to what is defined in the IB spec, and
verbs is not easily updated without updating the IB spec?
>
> Jason
>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-21 2:02 ` Yunsheng Lin
@ 2025-03-21 12:01 ` Jason Gunthorpe
0 siblings, 0 replies; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-21 12:01 UTC (permalink / raw)
To: Yunsheng Lin
Cc: Sean Hefty, Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
bmt@zurich.ibm.com, roland@enfabrica.net,
winston.liu@keysight.com, dan.mihailescu@keysight.com,
kheib@redhat.com, parth.v.parikh@keysight.com, davem@redhat.com,
ian.ziemba@hpe.com, andrew.tauferner@cornelisnetworks.com,
welch@hpe.com, rakhahari.bhunia@keysight.com,
kingshuk.mandal@keysight.com, linux-rdma@vger.kernel.org,
kuba@kernel.org, pabeni@redhat.com, huchunzhi,
jerry.lilijun@huawei.com, zhangkun09@huawei.com,
wang.chihyung@huawei.com
On Fri, Mar 21, 2025 at 10:02:09AM +0800, Yunsheng Lin wrote:
> > That's at the verbs level though; at the kernel uAPI level we already have
> > various ways to do all three.
> >
> > What you say makes sense to me for verbs.
>
> I suppose verbs corresponds to what is defined in the IB spec, and
> verbs is not easily updated without updating the IB spec?
Not anymore; verbs refers to libibverbs, which is largely constrained
by its stable ABI contract as a shared library :\
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-19 16:48 ` Jason Gunthorpe
2025-03-20 11:13 ` Yunsheng Lin
@ 2025-03-24 20:22 ` Roland Dreier
2025-03-24 21:28 ` Sean Hefty
2025-03-26 15:16 ` Jason Gunthorpe
1 sibling, 2 replies; 76+ messages in thread
From: Roland Dreier @ 2025-03-24 20:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nikolay Aleksandrov, netdev, shrijeet, alex.badea, eric.davis,
rip.sohan, dsahern, bmt, winston.liu, dan.mihailescu, kheib,
parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, kuba, pabeni
Hi Jason,
I think we were not clear on the overall discussion and so we are much
closer to agreement than you might think, see below.
> I was away while this discussion happened so I've gone through and
> read the threads, looked at the patches and I don't think I've changed
> my view since I talked to Enfabrica privately on this topic almost a
> year ago.
First, I want to clarify that this patchset is collaborative development
within the overall Ultra Ethernet Consortium. That's not to take away
from the large effort that Nik from Enfabrica put into this, but simply
to give a little more context.
> I do not agree with creating a new subsystem (or whatever you are
> calling drivers/ultraeth) for a single RDMA protocol and see nothing
> new here to change my mind. I would likely NAK the direction I see in
> this RFC, as I have NAKed other past attempts to build RDMA HW interfaces
> outside of the RDMA subsystem.
UEC is definitely not trying to create anything new beyond adding
support for Ultra Ethernet. By far the bulk of this patchset is adding
a software model of the specific Ultra Ethernet transport's protocol /
packet handling, and that code is currently in drivers/ultraeth. I
don't feel that pathnames are particularly important, and we could
move the code to something like drivers/infiniband/ultraeth, but that
seems a bit silly. But certainly we are open to suggestions.
> So, this RFC is woefully incomplete. I think you greatly underestimate
> how much work you are looking at to duplicate and re-invent the
> existing RDMA infrastructure. Frankly, I'm not even sure why you
> sent this RFC when it doesn't show enough to even evaluate.
Total agreement that this RFC is incomplete! We are trying to get
something out early, exactly to get the discussion started and agree
on the best way to add kernel support for UE.
To be clear - we are not trying to reinvent or bypass uverbs, and
there is complete agreement within UEC that we should reuse the uverbs
infrastructure so that we get the advantages of solid, mature
mechanisms for memory pinning, resource tracking / cleanup ordering,
etc.
With that said, Ultra Ethernet devices likely will not have interfaces
that map well onto QPs, MRs, etc. so we will be sending patches to
drivers/infiniband/uverbs* that generalize things to allow "struct
ib_device" objects that do not implement "required verbs."
> I.e. you said:
>
> > I should've been more specific - it is not an issue for UEC and the way
> > our driver's netlink API is designed. We fully understand the pros and
> > cons of our approach.
>
> Which is exactly the kind of narrow thinking that creates long term
> trouble in uAPI design. Do your choices actually work for *ALL*
> future HW designs and others drivers not just "our drivers
> netlink"? I think not.
>
> Given UE spec doesn't even have something pretending to be a
> kernel/user interface standard I think we will see an extreme
> variety of HW implementations here.
I think the netlink API and job handling overall is the area where the
most discussion is probably required. UE is somewhat novel in
elevating the concept of a "job" to a standard object with specific
properties that determine the values in packet headers. But I'm open
to making "job" a top-level RDMA object... I guess the idea would be
to define an interface for creating a new type of "job FD" with a
standard ABI for setting properties?
- Roland
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-24 20:22 ` Roland Dreier
@ 2025-03-24 21:28 ` Sean Hefty
2025-03-25 13:22 ` Bernard Metzler
2025-03-26 15:16 ` Jason Gunthorpe
1 sibling, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-24 21:28 UTC (permalink / raw)
To: Roland Dreier, Jason Gunthorpe
Cc: Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
bmt@zurich.ibm.com, winston.liu@keysight.com,
dan.mihailescu@keysight.com, kheib@redhat.com,
parth.v.parikh@keysight.com, davem@redhat.com, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, pabeni@redhat.com
> I think the netlink API and job handling overall is the area where the most
> discussion is probably required. UE is somewhat novel in elevating the concept
> of a "job" to a standard object with specific properties that determine the
> values in packet headers. But I'm open to making "job" a top-level RDMA
> object... I guess the idea would be to define an interface for creating a new
> type of "job FD" with a standard ABI for setting properties?
I view a job as scoped by a network address, versus a system global object. So, I was looking at a per device scope, though I guess a per port (similar to a pkey) is also possible. My reasoning was that a device _may_ need to allocate some per job resource. Per device job objects could be configured to have the same 'job address', for an indirect association.
I considered using an fd to share a job object between processes; however, sharing was restricted per device.
I believe a job may be associated with other security objects (e.g. encryption) which may also need per device allocations and state tracking.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-24 21:28 ` Sean Hefty
@ 2025-03-25 13:22 ` Bernard Metzler
2025-03-25 17:02 ` Sean Hefty
0 siblings, 1 reply; 76+ messages in thread
From: Bernard Metzler @ 2025-03-25 13:22 UTC (permalink / raw)
To: Sean Hefty, Roland Dreier, Jason Gunthorpe
Cc: Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
winston.liu@keysight.com, dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > I think the netlink API and job handling overall is the area where the
> most
> > discussion is probably required. UE is somewhat novel in elevating the
> concept
> > of a "job" to a standard object with specific properties that determine
> the
> > values in packet headers. But I'm open to making "job" a top-level RDMA
> > object... I guess the idea would be to define an interface for creating a
> new
> > type of "job FD" with a standard ABI for setting properties?
>
> I view a job as scoped by a network address, versus a system global object.
> So, I was looking at a per device scope, though I guess a per port (similar
> to a pkey) is also possible. My reasoning was that a device _may_ need to
> allocate some per job resource. Per device job objects could be configured
> to have the same 'job address', for an indirect association.
>
If I understand UEC's job semantics correctly, then the local scope
of a job may span multiple local ports from multiple local devices.
It would of course translate into device specific reservations.
> I considered using an fd to share a job object between processes; however,
> sharing was restricted per device.
>
> I believe a job may be associated with other security objects (e.g.
> encryption) which may also need per device allocations and state tracking.
>
> - Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-19 19:19 ` Jason Gunthorpe
@ 2025-03-25 14:12 ` Jamal Hadi Salim
2025-03-26 15:50 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Jamal Hadi Salim @ 2025-03-25 14:12 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Nikolay Aleksandrov,
Linux Kernel Network Developers, Shrijeet Mukherjee, alex.badea,
eric.davis, rip.sohan, David Ahern, bmt, roland, Winston Liu,
dan.mihailescu, kheib, parth.v.parikh, davem, ian.ziemba,
andrew.tauferner, welch, rakhahari.bhunia, kingshuk.mandal,
linux-rdma, Jakub Kicinski, Paolo Abeni
On Wed, Mar 19, 2025 at 3:19 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Mar 19, 2025 at 02:21:23PM -0400, Jamal Hadi Salim wrote:
>
> > Curious how you guarantee that a "destroy" will not fail under OOM. Do
> > you have pre-allocated memory?
>
> It just never allocates memory? Why would a simple system call like a
> destruction allocate any memory?
You need to at least construct the message parameterization in user
space which would require some memory, no? And then copy_from_user
would still need memory to copy to?
I am probably missing something basic.
> > > Overall, system calls here should either succeed or fail and be the
> > > same as a NOP. No failure that actually did something and then creates
> > > some resource leak or something because userspace didn't know about
> > > it.
> >
> > Yes, this is how netlink works as well. If a failure to delete an
> > object occurs then every transient state gets restored. This is always
> > the case for simple requests (a delete/create/update). For requests
> > that batch multiple objects there are cases where there is no
> > unwinding.
>
> I'm not sure that is completely true; if userspace messes up the
> netlink read() side of the API and copy_to_user() fails then you can
> get these inconsistencies. In the RDMA model even those edge cases are
> properly unwound, just like a normal system call would be.
>
For a read() to fail at, say, copy_to_user(), your app or
system must be in really bad shape.
A contingency plan could be to replay the message from the app/control
plane and hope you get an "object doesn't exist" kind of message for a
failed destroy msg.
Or, IMO, restart the app or system and try to recover/cleanup from
scratch to build a known-good state.
IOW, while unwinding is more honorable, unless it comes cheap it
may not be worth it.
Regardless: How would RDMA unwind in such a case?
> > Makes sense. So ioctls with TLVs ;->
> > I am suspecting you don't have concepts of TLVs inside TLVs for
> > hierarchies within objects.
>
> No, it has not been needed yet, or at least the cases that have come
> up have been happy to use arrays of structs for the nesting. The
> method calls themselves don't tend to have that kind of challenging
> structure for their arguments.
>
ok.
Not sure if this applies to you: netlink good practice is to ensure
any structs exchanged are 32-bit aligned and, in cases where they are
not, to add explicit pads.
It was fun back in the day on tc when everything worked on x86, then
failures galore on esoteric architectures (I remember trying to run on a
switch which had a PPC CPU with 8-byte alignment). I am searching my
brain cells for what the failures were but getting ENOENT; I think it
was more the way TLV alignment was structured, although it could have
been that offsets of different fields ended up in the wrong place, etc.
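The "explicit pads" practice looks like this (a hypothetical struct):

struct foo_nl_params {
        __u16 queue_id;
        __u8  mode;
        __u8  pad;      /* explicit pad keeps 'depth' 32-bit aligned */
        __u32 depth;    /* sizeof == 8 on every arch, no implicit padding */
};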
> > > RDMA also has special infrastructure to split up the TLV space between
> > > core code and HW driver code which is a key feature and necessary part
> > > of how you'd build a user/kernel split driver.
> >
> > The T namespace is split between core code and driver code?
> > I can see that as being useful for debugging maybe? What else?
>
> RDMA is all about having a user/kernel driver co-design. This means a
> driver has code in a userspace library and code in the kernel that
> work together to implement the functionality. The userspace library
> should be thought of as an extension of the kernel driver into
> userspace.
>
> So, there is a lot of traffic between the two driver components that is
> just private and unique to the driver. This is what the driver
> namespace is used for.
>
> For instance, there is a common method call to create a queue. The
> queue has a number of core parameters, like depth and address; then it
> calls the driver, and there are a bunch of device-specific parameters
> too, like, say, the queue entry format.
>
> Every driver gets to define its own parameters best suited to its own
> device and its own user/kernel split.
>
I think I got it; to reword what you said:
When you say "driver" you mean "control/provisioning plane" activity
between a userspace control app and kernel objects which likely extend
to hardware (as opposed to datapath send/receive kinds of activity).
That you have a set of common, agreed-to attributes and then each
vendor would add their own (separate namespace) attributes?
The control app issuing a request would first invoke some common
interface which would populate the applicable common TLVs for that
request then call into a vendor interface to populate vendor specific
attributes.
And in the kernel, some common code would process the common
attributes then pass on the vendor specific data to a vendor driver.
If my reading is right, some comments:
1) You can achieve this fine with netlink. My view of the model is you
would have a T (call it VendorData, which is defined within the
common namespace) that puts the vendor-specific TLVs within a
hierarchy, i.e.
when constructing or parsing the VendorData you invoke vendor-specific
extensions (see the nesting sketch after this list).
2) Hopefully the vendor extensions are in the minority. Otherwise the
complexity of someone writing an app to control multiple vendors would
be challenging over time as different vendors add more attributes. I
can't imagine a commonly used utility like iproute2/tc being invoked
with "when using broadcom use foo=x bar=y" but when using
intel "goo=x-1 and gah=y-2".
3) A pro/con to #2, depending on which lens you use: it could be
"innovation" or "vendor lock-in" - depends on the community. I.e., on the
one hand a vendor could add features faster and is not bottlenecked by
endless mailing list discussions, but OTOH said vendor may not be in
any hurry to move such features to the common path (because it gives
them an advantage).
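A minimal sketch of the VendorData nesting from point 1, with
hypothetical attribute names:

static int foo_fill_vendor(struct sk_buff *skb, u32 fmt)
{
        /* VendorData is a common attribute; everything nested inside
         * it lives in the vendor's own namespace */
        struct nlattr *vendor = nla_nest_start(skb, FOO_ATTR_VENDOR_DATA);

        if (!vendor)
                return -EMSGSIZE;
        if (nla_put_u32(skb, FOO_VENDOR_ATTR_QUEUE_FORMAT, fmt)) {
                nla_nest_cancel(skb, vendor);
                return -EMSGSIZE;
        }
        nla_nest_end(skb, vendor);
        return 0;
}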
> Building a split user/kernel driver is complicated and uAPI is one of
> the biggest challenges :\
>
> > > > - And as Nik mentioned: The new (YAML) model-to-generated-code approach
> > > > that is now common in generic netlink highly reduces developer effort.
> > > > Although in my opinion we really need this stuff integrated into tools
> > > > like iproute2..
> > >
> > > RDMA also has a DSL-like scheme for defining schema, and centralized
> > > parsing and validation. IMHO its capability falls someplace between
> > > the old netlink policy stuff and the new YAML stuff.
> > >
> >
> > I meant the ability to start with a data model and generate code as
> > being useful.
> > Where can I find the RDMA DSL?
>
> It is done with the C preprocessor instead of an external YAML
> file. Look at drivers/infiniband/core/uverbs_std_types_mr.c at the
> end. It describes a data model, but it is elaborated at runtime into
> an efficient parse tree, not by using a code generator.
>
> The schema is more classical object oriented RPC type scheme where you
> define objects, methods and then method parameters. The objects have
> an entire kernel side infrastructure to manage their lifecycle and the
> attributes have validation and parsing done prior to reaching the C
> function implementing the method.
>
> I always thought it was netlink inspired, but more suited to building
> a uAPI out of. Like you get actual system call names (e.g.
> UVERBS_METHOD_REG_DMABUF_MR) that have actual C functions implementing
> them. There is special help to implement object allocation and
> destruction functions, and freedom to have as many methods per object
> as make sense.
>
I took a quick look at what you pointed to. It's RPC-ish (just like
_most_ netlink use is) - so similar roots. IOW, you end up with
methods like create_myfoo() and create_mybar().
Two things:
1) I am not a fan of the RPC approach because it requires higher
developer effort when adding new features. Based on my experience, I
am a fan of CRUD (Create, Read, Update, Delete) - and with netlink I also
get for free the subscribe/publish parts; to be specific, _all you
need_ are CRUDPS methods, i.e. 6 methods tops (which never change). You
can craft any objects to conform to those interfaces. For example, you
can have create(myfoo) not being syntactically different from
create(mybar). This simplifies the data model immensely (and allows
for better automation). Unfortunately the gRPCs and Thrifts out there
have permeated RPC semantics everywhere (Thrift being slightly better
IMO).
2) Using C as the modelling language sounds like a good first start to
someone who knows C well, but tbh those macros hurt my eyes for a bit
(and I am someone who loves macro witchcraft). The big advantage IMO of
using YAML or JSON is mostly the available tooling, an example being
that it is polyglot. I am not sure if that is a requirement in RDMA.
> > I don't know enough about RDMA infra to comment but IIUC, you are
> > saying that it is the control infrastructure (that sits in
> > userspace?), that does all those things you mention, that is more
> > important.
>
> There is an entire object model in the kernel and it is linked into
> the schema.
>
> For instance in the above example we have a schema for an object
> method like this:
>
> DECLARE_UVERBS_NAMED_METHOD(
>         UVERBS_METHOD_REG_DMABUF_MR,
>         UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
>                         UVERBS_OBJECT_MR,
>                         UVERBS_ACCESS_NEW,
>                         UA_MANDATORY),
>         UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
>                         UVERBS_OBJECT_PD,
>                         UVERBS_ACCESS_READ,
>                         UA_MANDATORY),
>
> That says it accepts two object handles MR and PD as input to the
> method call.
>
> The core code keeps track of all these object handles, validates that the
> ID number given by userspace is referring to the correct object, of the
> correct type, in the correct state, locks things against concurrent
> destruction, and then gives a trivial way for the C method
> implementation to pick up the object pointer:
>
>         struct ib_pd *pd =
>                 uverbs_attr_get_obj(attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);
>
> Which can't fail because everything was already checked before we get
> here. This is all designed to greatly simplify and make robust the
> method implementations that are often in driver code.
>
Again, I could be missing something, but the semantics seem to be the
same as netlink.
BTW, do you do fuzz testing with this? In the old days the whole
netlink infra assumed a philosophy of "we give you the gun, if you
want to shoot yourself in the small toe then go ahead".
IOW, there was no assumption of people doing stupid things - and
stupid things would only harm them. Now we have hostile actors,
syzkaller and bounty hunters creating all kinds of UAFs and trying to
trick the kernel into some funky state just because....
cheers,
jamal
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-25 13:22 ` Bernard Metzler
@ 2025-03-25 17:02 ` Sean Hefty
2025-03-26 14:45 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-25 17:02 UTC (permalink / raw)
To: Bernard Metzler, Roland Dreier, Jason Gunthorpe
Cc: Nikolay Aleksandrov, netdev@vger.kernel.org,
shrijeet@enfabrica.net, alex.badea@keysight.com,
eric.davis@broadcom.com, rip.sohan@amd.com, dsahern@kernel.org,
winston.liu@keysight.com, dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > I view a job as scoped by a network address, versus a system global object.
> > So, I was looking at a per device scope, though I guess a per port
> > (similar to a pkey) is also possible. My reasoning was that a device
> > _may_ need to allocate some per job resource. Per device job objects
> > could be configured to have the same 'job address', for an indirect association.
> >
>
> If I understand UEC's job semantics correctly, then the local scope of a job may
> span multiple local ports from multiple local devices.
> It would of course translate into device specific reservations.
Agreed. I should have said job id/address has a network address scope. For example, job 3 at 10.0.0.1 _may_ be a different logical job than job 3 at 10.0.0.2. Or they could also belong to the same logical job. Or the same logical job may use different job id values for different network addresses.
A device-centric model is more aligned with the RDMA stack. IMO, higher-level SW would then be responsible for configuring and managing the logical job. For example, maybe it needs to assign and configure non-RDMA resources as well. For that reason, I would push the logical job management outside the kernel subsystem.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-25 17:02 ` Sean Hefty
@ 2025-03-26 14:45 ` Jason Gunthorpe
2025-03-26 15:29 ` Sean Hefty
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-26 14:45 UTC (permalink / raw)
To: Sean Hefty
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Tue, Mar 25, 2025 at 05:02:37PM +0000, Sean Hefty wrote:
> > > I view a job as scoped by a network address, versus a system global object.
> > > So, I was looking at a per device scope, though I guess a per port
> > > (similar to a pkey) is also possible. My reasoning was that a device
> > > _may_ need to allocate some per job resource. Per device job objects
> > > could be configured to have the same 'job address', for an indirect association.
> > >
> >
> > If I understand UEC's job semantics correctly, then the local scope of a job may
> > span multiple local ports from multiple local devices.
> > It would of course translate into device specific reservations.
>
> Agreed. I should have said job id/address has a network address
> scope. For example, job 3 at 10.0.0.1 _may_ be a different logical
> job than job 3 at 10.0.0.2. Or they could also belong to the same
> logical job. Or the same logical job may use different job id
> values for different network addresses.
>
> A device-centric model is more aligned with the RDMA stack. IMO,
> higher-level SW would then be responsible for configuring and
> managing the logical job. For example, maybe it needs to assign and
> configure non-RDMA resources as well. For that reason, I would push
> the logical job management outside the kernel subsystem.
Like I said already, I think Job needs to be a first-class RDMA object
that is used by all transports that have job semantics.
I expect variation here; UEC made its choices for how the job headers
are stacked on the wire and I foresee that other protocols will make
different choices.
Jobs may have other data like addresses and encryption keys to define
what packets are part of the job on the network.
So the specific scope of the job may change based on the protocol.
The act of creating a job is really creating a global security object
with some protocol-specific properties, and it must come with a sane
security model to both restrict creation and restrict consuming the
job security object. I favour FD passing for the latter and file
system ACLs for the former.
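The FD passing itself would just be the standard SCM_RIGHTS mechanism;
a minimal userspace sketch (the job FD and the connected unix socket
are assumptions, the cmsg plumbing is standard):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_job_fd(int unix_sock, int job_fd)
{
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
        };
        struct cmsghdr *cmsg;

        memset(cbuf, 0, sizeof(cbuf));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;   /* kernel installs a dup of the FD */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &job_fd, sizeof(int));
        return sendmsg(unix_sock, &msg, 0);
}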
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-24 20:22 ` Roland Dreier
2025-03-24 21:28 ` Sean Hefty
@ 2025-03-26 15:16 ` Jason Gunthorpe
1 sibling, 0 replies; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-26 15:16 UTC (permalink / raw)
To: Roland Dreier
Cc: Nikolay Aleksandrov, netdev, shrijeet, alex.badea, eric.davis,
rip.sohan, dsahern, bmt, winston.liu, dan.mihailescu, kheib,
parth.v.parikh, davem, ian.ziemba, andrew.tauferner, welch,
rakhahari.bhunia, kingshuk.mandal, linux-rdma, kuba, pabeni
On Mon, Mar 24, 2025 at 01:22:13PM -0700, Roland Dreier wrote:
> Hi Jason,
>
> I think we were not clear on the overall discussion and so we are much
> closer to agreement than you might think, see below.
Certainly this was my impression after we talked, so it is hard to
understand what this series is supposed to be. It doesn't seem to
advance things towards using the RDMA subsystem.
> First, I want to clarify that this patchset is collaborative development
> within the overall Ultra Ethernet Consortium.
Consortiums don't get to vote their way into Linux patch acceptance.
> UEC is definitely not trying to create anything new beyond adding
> support for Ultra Ethernet. By far the bulk of this patchset is adding
> a software model of the specific Ultra Ethernet transport's protocol /
> packet handling, and that code is currently in drivers/ultraeth. I
> don't feel that pathnames are particularly important, and we could
> move the code to something like drivers/infiniband/ultraeth, but that
> seems a bit silly. But certainly we are open to suggestions.
It is not the directory name that is at issue, it is completely
ignoring all the existing infrastructure and doing something entirely
new and entirely isolated.
I expect you will have uec named files under drivers/infiniband, and
they should be integrated within existing architecture, including
extensions.
It is a shame we won't rename the directory, or rename a lot of
other stuff, but I gather that is the community preference.
> To be clear - we are not trying to reinvent or bypass uverbs, and
> there is complete agreement within UEC that we should reuse the uverbs
> infrastructure so that we get the advantages of solid, mature
> mechanisms for memory pinning, resource tracking / cleanup ordering,
> etc.
My expectation is that the software interface path for an RDMA
transport exists mainly for testing and development purposes. It should
be deliberately designed to mimic a HW driver and exercise the same
interfaces, even if that is more work or inconvenient.
> With that said, Ultra Ethernet devices likely will not have interfaces
> that map well onto QPs, MRs, etc. so we will be sending patches to
> drivers/infiniband/uverbs* that generalize things to allow "struct
> ib_device" objects that do not implement "required verbs."
That's fine, we can look at all those things as you go along.
Just be mindful that given UEC's lack of a HW standard it will be hard
to judge if things are HW specific or UEC general. Explain in the
patches why you think many HW devices will be using new general common
objects.
Job ID is a good example that is obviously required by spec to be
common.
My advice, and the same advice I gave to Habana, is to ignore the
spelling of things like PD, QP and MR and focus on the fundamental
purpose they represent. UEC has a "QP" in that all HW devices will
have some kind of queue structure to userspace. UEC has "PD" in that
it must have some kind of HW security boundary to keep one uverbs
context from touching another's resources (it may be that job is how
UEC spells PD), and so on.
Use driver-specific calls when appropriate.
kernel-uet will be a different conversation, and I suspect kernel-uet
will be very feature-limited, focusing just on something like storage.
> I think the netlink API and job handling overall is the area where the
> most discussion is probably required. UE is somewhat novel in
Yes, that is new, but also an idea that is being copied.
> elevating the concept of a "job" to a standard object with specific
> properties that determine the values in packet headers. But I'm open
> to making "job" a top-level RDMA object...
I think this is right
> I guess the idea would be
> to define an interface for creating a new type of "job FD" with a
> standard ABI for setting properties?
I suspect so? /dev/infiniband/job perhaps, where opening the FD creates
a job container, then some ioctls realize it into a per-protocol job
description with per-protocol additional properties?
Present a job FD to a uverbs FD to join the job's security context
Another variation might be an entire jobfs but I would probably start
with job FD first and only do a jobfs later if people demand it..
I think CAP_NET_ADMIN is a bad security model for jobs.
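A userspace sketch of that flow; JOB_IOCTL_REALIZE and struct job_props
are hypothetical placeholders for illustration, not an existing uAPI:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* hypothetical uAPI - nothing like this exists yet */
struct job_props {
        unsigned int protocol;  /* e.g. 1 == UET */
        unsigned int job_id;
};
#define JOB_IOCTL_REALIZE _IOW('j', 1, struct job_props)

static int open_uet_job(unsigned int job_id)
{
        struct job_props props = { .protocol = 1, .job_id = job_id };
        int job_fd = open("/dev/infiniband/job", O_RDWR); /* new container */

        if (job_fd < 0)
                return -1;
        /* realize the container into a per-protocol job description */
        if (ioctl(job_fd, JOB_IOCTL_REALIZE, &props) < 0) {
                close(job_fd);
                return -1;
        }
        /* job_fd can now be presented to a uverbs FD, or passed to
         * another process, to join the job's security context */
        return job_fd;
}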
Regards,
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-26 14:45 ` Jason Gunthorpe
@ 2025-03-26 15:29 ` Sean Hefty
2025-03-26 15:53 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-26 15:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > > If I understand UEC's job semantics correctly, then the local scope
> > > of a job may span multiple local ports from multiple local devices.
> > > It would of course translate into device specific reservations.
> >
> > Agreed. I should have said job id/address has a network address
> > scope. For example, job 3 at 10.0.0.1 _may_ be a different logical
> > job than job 3 at 10.0.0.2. Or they could also belong to the same
> > logical job. Or the same logical job may use different job id values
> > for different network addresses.
> >
> > A device-centric model is more aligned with the RDMA stack. IMO,
> > higher-level SW would then be responsible for configuring and managing
> > the logical job. For example, maybe it needs to assign and configure
> > non-RDMA resources as well. For that reason, I would push the logical
> > job management outside the kernel subsystem.
>
> Like I said already, I think Job needs to be a first-class RDMA object that is used
> by all transports that have job semantics.
How do you handle or expose device-specific resource allocations or restrictions, which may be needed? Should a kernel 'RDMA job manager' abstract device-level resources?
Consider a situation where an MR or MW should only be accessible by a specific job. When the MR is created, the device-specific job resource may be needed. Should drivers need to query the job manager to map some global object to a device-specific resource?
Other than this difference, I agree with the other points.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-25 14:12 ` Jamal Hadi Salim
@ 2025-03-26 15:50 ` Jason Gunthorpe
2025-04-08 14:16 ` Jamal Hadi Salim
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-26 15:50 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Leon Romanovsky, Nikolay Aleksandrov,
Linux Kernel Network Developers, Shrijeet Mukherjee, alex.badea,
eric.davis, rip.sohan, David Ahern, bmt, roland, Winston Liu,
dan.mihailescu, kheib, parth.v.parikh, davem, ian.ziemba,
andrew.tauferner, welch, rakhahari.bhunia, kingshuk.mandal,
linux-rdma, Jakub Kicinski, Paolo Abeni
On Tue, Mar 25, 2025 at 10:12:49AM -0400, Jamal Hadi Salim wrote:
> You need to at least construct the message parameterization in user
> space which would require some memory, no? And then copy_from_user
> would still need memory to copy to?
> I am probably missing something basic.
It is usually all stack memory on the userspace side, and no kernel
memory allocation. For example, there is no mandatory SKB in uverbs.
> For a read() to fail at say copy_to_user() feels like your app or
> system must be in really bad shape.
Yes, but still the semantic we want is that if a creation ioctl
returns 0 (success) then the object exists and if it returns any error
code then the creation was a NOP.
> A contingency plan could be to replay the message from the app/control
> plane and hope you get an "object doesnt exist" kind of message for a
> failed destroy msg.
Nope, it's racy; it must be multi-thread safe. Another thread could
have created and re-used the object ID.
> IOW, while unwinding is more honorable, unless it comes for cheap it
> may not be worth it.
It was cheap
> Regardless: How would RDMA unwind in such a case?
The object infrastructure takes care of this with a three-step object
creation protocol and some helpers.
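Roughly, the shape is the following (illustrative helper names, not the
actual uverbs ones):

static int example_create(struct ctx *ctx, struct attrs *attrs)
{
        struct obj *obj;
        int ret;

        obj = handle_reserve(ctx);              /* 1: ID reserved, not live */
        if (IS_ERR(obj))
                return PTR_ERR(obj);

        ret = driver_create(obj, attrs);        /* 2: driver/HW setup */
        if (ret) {
                handle_abort(obj);              /* unwind: syscall is a NOP */
                return ret;
        }

        handle_commit(obj);                     /* 3: object becomes visible */
        return 0;
}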
> Not sure if this applies to you: Netlink good practise is to ensure
> any structs exchanged are 32b aligned and in cases they are not mostly
> adding explicit pads.
The alignment is less important as an ABI requirement since
copy_from_user will fix the alignment when it copies arrays into kernel
memory that is properly aligned as required. netlink has this
issue because it bulk-copies everything into an skb and uses pointers
into that copy. The approach here only copies small stuff in advance;
larger stuff is not copied until memory is allocated to hold it.
> When you say "driver" you mean "control/provisioning plane" activity
> between a userspace control app and kernel objects which likely
> extend
No, I literally mean driver.
The user of this HW will not do something like socket() as a standard
system call abstracted by the kernel. Instead it makes a library call,
ib_create_qp(), which goes into a library with the userspace driver
components. The abstraction is now done in userspace. The library
figures out what HW the kernel has and loads a userspace driver
component with a driver_create_qp() op that does more processing and
eventually calls the kernel.
It is "control path" in the sense that it is slow path creating
objects for data transfer, but the purpose of most of the actions is
actually setting up for data plane operations.
> That you have a set of common, agreed-to attributes and then each
> vendor would add their own (separate namespace) attributes?
Yes
> The control app issuing a request would first invoke some common
> interface which would populate the applicable common TLVs for that
> request then call into a vendor interface to populate vendor specific
> attributes.
Yes
> And in the kernel, some common code would process the common
> attributes then pass on the vendor specific data to a vendor driver.
Yes
> If my reading is right, some comments:
> 1) You can achieve this fine with netlink. My view of the model is you
> would have a T (call it VendorData, which is is defined within the
> common namespace) that puts the vendor specific TLVs within a
> hierarchy.
Yes, that was a direction that was suggested here too. But when we got
to micro optimizing the ioctl ABI format it became clear there was
significant advantage to keeping things one level and not trying to do
some kind of nesting. This also gives a nice simple in-kernel API for
working with method arguments, it is always the same. We don't have
different APIs depending on driver/common callers.
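The flat format is essentially one array of fixed-size attribute slots
per method call, something like this sketch (not the exact uapi
layout):

#include <linux/types.h>

/* Sketch of a flat, single-level attribute array; not the exact
 * ib_uverbs uapi layout. Driver and common attributes sit side by
 * side in the same array, distinguished only by attr_id ranges, so
 * the kernel parses every method call the same way. */
struct method_attr {
	__u16 attr_id;		/* common and driver ids in one space */
	__u16 len;		/* length of inline or pointed-to data */
	__u32 flags;
	__aligned_u64 data;	/* small values inline, else user pointer */
};

struct method_hdr {
	__u16 object_id;
	__u16 method_id;
	__u16 num_attrs;
	/* followed by struct method_attr attrs[num_attrs] */
};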
> 2) Hopefully the vendor extensions are in the minority. Otherwise the
> complexity of someone writing an app to control multiple vendors would
> be challenging over time as different vendors add more attributes.
Nope, it is about 50/50, and it is not a challenge because the
methodology is that everyone uses the *same* userspace driver code. It is
too complicated for people to reasonably try to rewrite.
> I can't imagine a commonly used utility like iproute2/tc being
> invoked with "when using broadcom then use foo=x bar=y" but
> when using intel use "goo=x-1 and gah=y-2".
Right, it doesn't make sense for a tool like iproute, but we aren't
building anything remotely like iproute.
> 3) A Pro/con to #2 depending on which lens you use: it could be
> "innovation" or "vendor lockin" - depends on the community i.e on the
> one hand a vendor could add features faster and is not bottlenecked by
> endless mailing list discussions but otoh, said vendor may not be in
> any hurry to move such features to the common path (because it gives
> them an advantage).
There is no community advantage to the common kernel path.
The users all use the library, the only thing that matters is how
accessible the vendor has made their unique ideas to the library
users.
For instance, if the user is running an MPI application and the vendor
makes standard open source MPI 5% faster with some unique HW
innovation, should anyone actually care about the "common path" deep,
deep below MPI?
> 1) I am not a fan of the RPC approach because it has a higher
> developer effort when adding new features. Based on my experience, I
> am a fan of CRUD(Create Read Update Delete)
It suits some things better than others. I don't think "update" is
semantically the right language for most of what is happening here,
and "read" is almost never done. Like socket(), FDs and their API
surface aren't a good fit for CRUD.
> - and with netlink I also
> get for free the subscribe/publish parts; to be specific _all you
publish/subscribe doesn't make sense in this context. We don't do it.
> 2) Using C as the modelling sounds like a good first start to someone
> who knows C well but tbh, those macros hurt my eyes for a bit (and i
> am someone who loves macro witchcraft). The big advantage IMO of using
> yaml or json is mostly the available tooling, example being polyglot.
> I am not sure if that is a requirement in RDMA.
I agree with this. When it was first made I suggested a code
generator instead, but at that time code generators in the kernel did
not seem to be a well-accepted idea. I'm glad to see that improving.
> Again, I could be missing something but the semantics seem to be the
> same as netlink.
AFAIK netlink doesn't have the same notion of objects, or of having
the validation obtain references and locks on referenced objects, at all.
> BTW, do you do fuzz testing with this?
syzkaller runs on rdma, but I don't recall how much coverage syzkaller
gets on these forms. We fixed a huge number of syzkaller bugs at
least.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-26 15:29 ` Sean Hefty
@ 2025-03-26 15:53 ` Jason Gunthorpe
2025-03-26 17:39 ` Sean Hefty
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-26 15:53 UTC (permalink / raw)
To: Sean Hefty
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Wed, Mar 26, 2025 at 03:29:01PM +0000, Sean Hefty wrote:
> > > > If I understand UEC's job semantics correctly, then the local scope
> > > > of a job may span multiple local ports from multiple local devices.
> > > > It would of course translate into device specific reservations.
> > >
> > > Agreed. I should have said job id/address has a network address
> > > scope. For example, job 3 at 10.0.0.1 _may_ be a different logical
> > > job than job 3 at 10.0.0.2. Or they could also belong to the same
> > > logical job. Or the same logical job may use different job id values
> > > for different network addresses.
> > >
> > > A device-centric model is more aligned with the RDMA stack. IMO,
> > > higher-level SW would then be responsible for configuring and managing
> > > the logical job. For example, maybe it needs to assign and configure
> > > non-RDMA resources as well. For that reason, I would push the logical
> > > job management outside the kernel subsystem.
> >
> > Like I said already, I think Job needs to be a first class RDMA object that is used
> > by all transports that have job semantics.
>
> How do you handle or expose device specific resource allocations or
> restrictions, which may be needed? Should a kernel 'RDMA job
> manager' abstract device level resources?
>
> Consider a situation where a MR or MW should only be accessible by a
> specific job. When the MR is created, the device specific job
> resource may be needed. Should drivers need to query the job
> manager to map some global object to a device specific resource?
I imagine for cases like that the job would be linked to the PD and
then MR -> PD -> Job.
The kernel side would create any HW object for the job when the PD is
created for a specific HW device.
The PD security semantic for the MR would be a little bit different in
that the PD is more like a shared PD.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-26 15:53 ` Jason Gunthorpe
@ 2025-03-26 17:39 ` Sean Hefty
2025-03-27 13:26 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-26 17:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > > Like I said already, I think Job needs to be a first class RDMA
> > > object that is used by all transports that have job semantics.
> >
> > How do you handle or expose device specific resource allocations or
> > restrictions, which may be needed? Should a kernel 'RDMA job manager'
> > abstract device level resources?
> >
> > Consider a situation where a MR or MW should only be accessible by a
> > specific job. When the MR is created, the device specific job
> > resource may be needed. Should drivers need to query the job manager
> > to map some global object to a device specific resource?
>
> I imagine for cases like that the job would be linked to the PD and then MR ->
> PD -> Job.
>
> The kernel side would create any HW object for the job when the PD is created
> for a specific HW device.
>
> The PD security semantic for the MR would be a little bit different in that the
> PD is more like a shared PD.
The PD is a problem, as it's not a transport function. It's a hardware implementation component; one which may NOT exist for a UEC NIC. (I know there are NICs which do not implement PDs and have secure RDMA transfers.) I have a proposal to rework/redefine PDs to support a more general model, which I think will work for NICs that need a PD and ones that don't. It can support MR -> PD -> Job, but I considered the PD -> job relationship as 1 to many. I can't immediately think of a reason why a 1:1 'job-based PD' wouldn't work in theory.
It's challenging in that a UET endpoint (QP) may communicate with multiple jobs, and an MR may be accessible by a single job, all jobs, or only a few.
Basically, the RDMA PD model forces a HW implementation. Some, but not all, NICs will implement this. But in general, there's not a clean {PD, QP, MR, job} relationship.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-26 17:39 ` Sean Hefty
@ 2025-03-27 13:26 ` Jason Gunthorpe
2025-03-28 12:20 ` Yunsheng Lin
2025-03-31 19:29 ` Sean Hefty
0 siblings, 2 replies; 76+ messages in thread
From: Jason Gunthorpe @ 2025-03-27 13:26 UTC (permalink / raw)
To: Sean Hefty
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Wed, Mar 26, 2025 at 05:39:52PM +0000, Sean Hefty wrote:
> The PD is a problem, as it's not a transport function. It's a
> hardware implementation component; one which may NOT exist for a UEC
> NIC. (I know there are NICs which do not implement PDs and have
> secure RDMA transfers.)
The PD is just a concept representing security; there are lots of ways
to implement it. So long as it achieves isolation you would label it a
PD, and the PD flows through all the objects that participate in the
isolation.
The basic essential requirement is that registered userspace memory
cannot be accessed by things outside the definition of the pd/shared pd.
This is really important; I'm quite concerned that any RDMA protocol
should come with some solid definition of the PD, mapped to the
underlying technology, that matches Linux's inter-process security needs.
For instance Habana defined a PD as a singleton object and the first
process to get it had exclusive use of the HW. This is because their
HW could not do any inter-process security.
> I have a proposal to rework/redefine PDs to
> support a more general model,
It would certainly be good to have some text explaining some of the
mappings to different technologies.
> which I think will work for NICs that
> need a PD and ones that don't. It can support MR -> PD -> Job, but
> I considered the PD -> job relationship as 1 to many.
Yes, and the 1:1 is degenerate.
> Sure, it's challenging in that a UET endpoint (QP) may communicate
> with multiple jobs, and a MR may be accessible by a single job, all
> jobs, or only a few.
I would suggest that the PD is a superset of all jobs and the objects
(endpoint, mr, etc) get to choose a subset of the PD's jobs during
allocation?
Or you keep job/pd as 1:1 and allow specifying multiple PDs during
object allocation.
But to be clear, this is largely verbs modeling stuff - however there
is a certain practicality to trying to fit this multi-job ability into
a PD because it allows reusing a lot of existing uAPI kernel code.
Especially if people are going to take existing RDMA HW and tweak it
to some level of UET (i.e. support only a single job) and still require a
HW level PD under the covers.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-27 13:26 ` Jason Gunthorpe
@ 2025-03-28 12:20 ` Yunsheng Lin
2025-03-31 19:49 ` Sean Hefty
2025-03-31 19:29 ` Sean Hefty
1 sibling, 1 reply; 76+ messages in thread
From: Yunsheng Lin @ 2025-03-28 12:20 UTC (permalink / raw)
To: Jason Gunthorpe, Sean Hefty
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On 2025/3/27 21:26, Jason Gunthorpe wrote:
...
>
>> which I think will work for NICs that
>> need a PD and ones that don't. It can support MR -> PD -> Job, but
>> I considered the PD -> job relationship as 1 to many.
>
> Yes, and the 1:1 is degenerate.
>
>> Sure, It's challenging in that a UET endpoint (QP) may communicate
>> with multiple jobs, and a MR may be accessible by a single job, all
>> jobs, or only a few.
>
> I would suggest that the PD is a superset of all jobs and the objects
> (endpoint, mr, etc) get to choose a subset of the PD's jobs during
> allocation?
>
> Or you keep job/pd as 1:1 and allow specifying multiple PDs during
> object allocation.
>
> But to be clear, this is largely verbs modeling stuff - however there
> is a certain practicality to trying to fit this multi-job ability into
> a PD because it allows reusing a lot of existing uAPI kernel code.
>
> Especially if people are going to take existing RDMA HW and tweak it
> to some level of UET (i.e. support only a single job) and still require a
> HW level PD under the covers.
Through reading this patchset, it seems the semantics of 'job' for UEC
are about how to identify a PDC (Packet Delivery Context) instance, which
is specified by src fep_address/pdc_id and dst fep_address/pdc_id, as
there can be more than one PDC instance between two nodes. So is the
'job' really about grouping processes from the same 'job' to use the
same PDC instance and preventing processes from different 'jobs' from
using the same PDC instance?
And the interesting part about the PDC seems to be:
1. It is created dynamically when a request packet is sent on the initiator
side or received on the target side, and destroyed dynamically through a
timeout mechanism.
2. It seems a PDC instance can be shared between different processes from the
same 'job'.
And SRD seems to be using an AH (address handle) to specify an SRD connection
between two nodes, and there is only one SRD context instance between two
nodes, so different processes may have their own AHs pointing to the same SRD
context instance between the same two nodes?
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-27 13:26 ` Jason Gunthorpe
2025-03-28 12:20 ` Yunsheng Lin
@ 2025-03-31 19:29 ` Sean Hefty
2025-04-01 13:04 ` Jason Gunthorpe
1 sibling, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-31 19:29 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > I have a proposal to rework/redefine PDs to support a more general
> > model,
>
> It would certainly be good to have some text explaining some of the mappings
> to different technologies.
>
> > which I think will work for NICs that
> > need a PD and ones that don't. It can support MR -> PD -> Job, but I
> > considered the PD -> job relationship as 1 to many.
>
> Yes, and the 1:1 is degenerate.
>
> > Sure, It's challenging in that a UET endpoint (QP) may communicate
> > with multiple jobs, and a MR may be accessible by a single job, all
> > jobs, or only a few.
>
> I would suggest that the PD is a superset of all jobs and the objects (endpoint,
> mr, etc) get to choose a subset of the PD's jobs during allocation?
>
> Or you keep job/pd as 1:1 and allow specifying multiple PDs during object
> allocation.
>
> But to be clear, this is largely verbs modeling stuff - however there is a certain
> practicality to trying to fit this multi-job ability into a PD because it allow
> reusing alot of existing uAPI kernel code.
>
> Especially if people are going to take existing RDMA HW and tweak it to some
> level of UET (ie support only single job) and still require a HW level PD under
> the covers.
Yes, I'm trying to ensure that the existing RDMA model continues to work but also support NICs/transports which implement the equivalent security model at the QP (endpoint) level, reusing the PD for both.
Specifically, I want to *allow* separating the different functions that a single PD provides into separate PDs. The functions are page mapping (registration), local (lkey) access, and remote (rkey) access. The RDMA model limits a QP to a single PD for all three. To support job-based transports, I propose allowing a QP to use one PD for local access (the PD specified at QP creation) and multiple PDs for remote access. Each PD used for remote access would correspond to a different job.
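As a rough sketch of what that could look like (hypothetical names;
nothing like this exists in verbs today):

/* Hypothetical sketch: one PD governs local (lkey) access, fixed at
 * QP creation, while remote (rkey) access can be granted per job by
 * attaching additional PDs. Not an existing verbs API; fragment only. */
struct uet_qp *qp = uet_create_qp(local_pd, &attr);

/* rkeys arriving on this QP may resolve against either job's PD */
uet_qp_attach_remote_pd(qp, job_a_pd);
uet_qp_attach_remote_pd(qp, job_b_pd);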
Note: a NIC may limit a QP to being used with a single job and require the local and remote PD be the same (i.e. 1 pd per qp). So, the RDMA model still fits.
As an optimization, registration can be a separate function, so that the same page mapping can be re-used across different jobs as they start and end. This requires some ability to import a MR from one PD into another. This is probably just an optimization and not required for a job model.
I was still envisioning a job manager allocating device-specific resources for a job and sharing those with the local processes. I.e. it shares a set of fds, with each fd associated with a device, which restricts the job to those devices. A job may also have device-specific resource limits or allocations (limits on the number of MRs, specific endpoint addresses, etc.). A global job object could work, but a subsequent user-to-device flow will need to access and translate the global object. Either way, there are uABI requirement(s).
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-28 12:20 ` Yunsheng Lin
@ 2025-03-31 19:49 ` Sean Hefty
2025-04-01 9:19 ` Yunsheng Lin
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-03-31 19:49 UTC (permalink / raw)
To: Yunsheng Lin, Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> Through reading this patchset, it seems the semantics of 'job' for UEC is about
> how to identify a PDC(Packet Delivery Context) instance, which is specified by
> src fep_address/pdc_id and dst fep_address/pdc_id as there seems to be more
> than one PDC instance between two nodes, so the 'job' is really about
> grouping processes from the same 'job' to use the same PDC instance and
> preventing processes from different 'job' from using the same PDC instance?
UEC targets HPC and AI workloads, so the concept of a job in this discussion represents a parallel application. I.e. a group of processes across multiple nodes communicating.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-31 19:49 ` Sean Hefty
@ 2025-04-01 9:19 ` Yunsheng Lin
0 siblings, 0 replies; 76+ messages in thread
From: Yunsheng Lin @ 2025-04-01 9:19 UTC (permalink / raw)
To: Sean Hefty, Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On 2025/4/1 3:49, Sean Hefty wrote:
>> Through reading this patchset, it seems the semantics of 'job' for UEC is about
>> how to identify a PDC(Packet Delivery Context) instance, which is specified by
>> src fep_address/pdc_id and dst fep_address/pdc_id as there seems to be more
>> than one PDC instance between two nodes, so the 'job' is really about
>> grouping processes from the same 'job' to use the same PDC instance and
>> preventing processes from different 'job' from using the same PDC instance?
>
> UEC targets HPC and AI workloads, so the concept of a job in this discussion represents a parallel application. I.e. a group of processes across multiple nodes communicating.
Ok, I guess this patchset only implements a portion of the 'job' semantics
in UEC; the page mapping, local access, and remote access function grouping
and separation you mentioned in the other thread do not seem to be
implemented yet.
>
> - Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-31 19:29 ` Sean Hefty
@ 2025-04-01 13:04 ` Jason Gunthorpe
2025-04-01 16:57 ` Sean Hefty
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-04-01 13:04 UTC (permalink / raw)
To: Sean Hefty
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Mon, Mar 31, 2025 at 07:29:32PM +0000, Sean Hefty wrote:
> Specifically, I want to *allow* separating the different functions
> that a single PD provides into separate PDs. The functions being
> page mapping (registration), local (lkey) access, and remote (rkey)
> access.
That seems like quite a stretch for the PD.. Especially from a verbs
perspective we do expect a single PD and that it is the entire security
context.
I think you face a philosophical choice of either a bigger PD that
encompasses multiple jobs, or a PD that isn't a security context and
then things like job handle lists in other APIs..
> As an optimization, registration can be a separate function, so that
> the same page mapping can be re-used across different jobs as they
> start and end. This requires some ability to import a MR from one
> PD into another. This is probably just an optimization and not
> required for a job model.
Dunno, it depends on what the spec says about the labels. Is there an
expectation that the rkey equivalent is identical across all jobs, or
is there an expectation that every job has a unique rkey for the same
memory?
I still wouldn't do something like import (which implies sharing the
underlying page list), having a single MR object with multiple rkeys
will make an easier implementation.
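I.e. something along these lines (illustrative sketch only):

#define MAX_JOBS 16

/* Illustrative sketch, kernel-style types: one MR object, one pinned
 * page list, and a distinct rkey per job. No import/sharing of the
 * page list between separate MR objects is needed. */
struct mr {
	struct page_list *pages;	/* registered once */
	struct rkey_entry {
		u32 rkey;		/* per-job remote key */
		u32 job_id;
		u32 access;
	} keys[MAX_JOBS];
	unsigned int nkeys;
};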
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-01 13:04 ` Jason Gunthorpe
@ 2025-04-01 16:57 ` Sean Hefty
2025-04-01 19:39 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-04-01 16:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > Specifically, I want to *allow* separating the different functions
> > that a single PD provides into separate PDs. The functions being page
> > mapping (registration), local (lkey) access, and remote (rkey) access.
>
> That seems like quite a stretch for the PD.. Especially from a verbs perspective
> we do expect single PD and that is the entire security context.
From the viewpoint of a transport, the target QPN and incoming rkey must align on some backing security object (let's call that the PD). As a model, I view this as there needs to exist some {QPN, rkey, PD ID} tuple with appropriate memory access permissions.
The change here is to expand that tuple to include a job id: {QPN, rkey, job ID, PD ID}.
Conceptually, one could view the rkey + job ID as a larger, virtual rkey. (Or maybe job ID + PD ID is a bigger, virtual PD ID... Or job ID + QPN ...)
> I think you face a philosophical choice of either a bigger PD that encompasses
> multiple jobs, or a PD that isn't a security context and then things like job
> handle lists in other APIs..
>
> > As an optimization, registration can be a separate function, so that
> > the same page mapping can be re-used across different jobs as they
> > start and end. This requires some ability to import a MR from one PD
> > into another. This is probably just an optimization and not required
> > for a job model.
>
> Donno, it depends what the spec says about the labels. Is there an expectation
> that the rkey equivalent is identical across all jobs, or is there an expectation
> that every job has a unique rkey for the same memory?
>
> I still wouldn't do something like import (which implies sharing the underlying
> page list), having a single MR object with multiple rkeys will make an easier
> implementation.
I don't know that I can talk about the UEC spec, but the libfabric memory registration APIs (UEC has openly mentioned adopting libfabric) are closer to a single MR object with multiple keys. Different jobs could have different rkeys.
Libfabric defines a 'base MR' and allows 'sub-MRs' to be created from that base. So, there are separate MR objects for tracking purposes. A sub-MR has its own access rights, job association, and rkey.
Libfabric doesn't have PDs, but this model is closer to the bigger PD that encompasses multiple jobs. A job is assigned to the MR at MR creation.
A possible RDMA model could be:
PD <-- QP
 ^--- MR (not affiliated with a job)
 ^--- job thingy <-- MR (restricted to job)
A device likely needs some capability to indicate whether it can limit MR access by {QPN, rkey, job ID, PD ID}.
I can envision a job manager creating, sharing, and possibly controlling the PD-related resources.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-01 16:57 ` Sean Hefty
@ 2025-04-01 19:39 ` Jason Gunthorpe
2025-04-03 1:30 ` Sean Hefty
2025-04-04 16:03 ` Ziemba, Ian
0 siblings, 2 replies; 76+ messages in thread
From: Jason Gunthorpe @ 2025-04-01 19:39 UTC (permalink / raw)
To: Sean Hefty
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Tue, Apr 01, 2025 at 04:57:52PM +0000, Sean Hefty wrote:
> > > Specifically, I want to *allow* separating the different functions
> > > that a single PD provides into separate PDs. The functions being page
> > > mapping (registration), local (lkey) access, and remote (rkey) access.
> >
> > That seems like quite a stretch for the PD.. Especially from a verbs perspective
> > we do expect single PD and that is the entire security context.
>
> From the viewpoint of a transport, the target QPN and incoming rkey
> must align on some backing security object (let's call that the PD).
> As a model, I view this as there needs to exist some {QPN, rkey, PD
> ID} tuple with appropriate memory access permissions.
Yes, but I'd say the IBTA modeling has the network headers select a PD
and then the PD limits what objects that packet can reach.
I still think that is a good starting point, and if there are more
fine-grained limitations then I'd say each object has an additional ACL
list about what subset of the network headers (already within the PD)
it can accept.
> The change here is to expand that tuple to include a job id: {QPN, rkey, job ID, PD ID}.
IB does QPN -> PD; if you do Job -> PD I think that would make
sense. If the QP is providing additional restrictions, that would be a
job for the ACLs..
> I don't know that I can talk about the UEC spec,
Right, it is too early to talk about UEC and Linux until people can
freely talk about what it actually needs.
> I can envision a job manager creating, sharing, and possibly
> controlling the PD-related resources.
Really? Beyond Job, what would make sense? Address handles?
I'm wondering if address handles are really part of the job..
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-01 19:39 ` Jason Gunthorpe
@ 2025-04-03 1:30 ` Sean Hefty
2025-04-04 16:03 ` Ziemba, Ian
1 sibling, 0 replies; 76+ messages in thread
From: Sean Hefty @ 2025-04-03 1:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller, ian.ziemba@hpe.com,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> On Tue, Apr 01, 2025 at 04:57:52PM +0000, Sean Hefty wrote:
> > > > Specifically, I want to *allow* separating the different functions
> > > > that a single PD provides into separate PDs. The functions being
> > > > page mapping (registration), local (lkey) access, and remote (rkey) access.
> > >
> > > That seems like quite a stretch for the PD.. Especially from a verbs
> > > perspective we do expect single PD and that is the entire security context.
> >
> > From the viewpoint of a transport, the target QPN and incoming rkey
> > must align on some backing security object (let's call that the PD).
> > As a model, I view this as there needs to exist some {QPN, rkey, PD
> > ID} tuple with appropriate memory access permissions.
>
> Yes, but I'd say the IBTA modeling has the network headers select a PD and
> then the PD limits what objects that packet can reach.
My claim is PD selection is made using the QPN and rkey.
For the model, QPN -> PD and rkey -> PD are deterministic, and both must select the same PD. The SW model reflects this. I think UEC can fit this model.
> I still think that is a good starting point, and if there is more fine grained
> limitations then I'd say each object has an additional ACL list about what
> subset of the network headers (already within the PD) that it can accept.
I'm unclear on the contents of the ACL, but I'm also unsure it's needed.
> > The change here is to expand that tuple to include a job id: {QPN, rkey, job
> ID, PD ID}.
>
> IB does QPN -> PD, if you do Job -> PD I think that would make sense. If the
> QP is providing additional restriction that would be a job for ACLs..
I believe job id -> PD works. To enable job-secure MRs, I believe rkey -> job ID is needed and not properly captured by an ACL or QP setting. Rkey -> job ID is optional, but deterministic.
To keep the model simple, all QPs under the same PD would belong to the same set of jobs. If this makes sense to you, I can discuss in the UEC to see if all of the above works.
To summarize, the SW object model looks like:
create_pd(&pd)
create_qp(pd, &qp)
create_job(pd, &job)
reg_mr(pd, [job], access, &mr)
share_mr(mr, [job], access, &new_mr) -- get a new rkey
I include share_mr() for completeness, but I don't think it needs to be an explicit part of a common uABI.
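Spelled out as C-like pseudo code (hypothetical API, error handling
omitted):

/* Hypothetical rendering of the model above; not a real uABI. */
void example(void)
{
	struct pd *pd;
	struct qp *qp;
	struct job *job;
	struct mr *mr, *job_mr;

	create_pd(&pd);
	create_qp(pd, &qp);
	create_job(pd, &job);

	/* MR usable by any job under this PD */
	reg_mr(pd, NULL, ACCESS_REMOTE_READ, &mr);

	/* job-scoped view of the same registration, with its own rkey */
	share_mr(mr, job, ACCESS_REMOTE_READ, &job_mr);
}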
A gap in this model relative to libfabric is supporting NICs which associate MRs directly with QPs. Similar to share_mr(), I believe it can be handled directly by the NIC vendor and should not impact a common uABI.
> > I don't know that I can talk about the UEC spec,
>
> Right, it is too early to talk about UEC and Linux until people can freely talk
> about what it actually needs.
>
> > I can envision a job manager creating, sharing, and possibly
> > controlling the PD-related resources.
>
> Really? Beyond Job, what would make sense? Addressing Handles?
>
> I wondering if addressing handles are really part of the job..
My thoughts were around allocating and configuring the QPs. Given the above model, this could include the job setup. (I can also envision a process not having the ability to create or modify QPs or jobs.)
Addressing is probably a separate discussion, which I'm happy to defer until later. IMO, it makes sense to share addressing among processes to reduce the memory footprint. This is best handled by the job manager populating some address table. UEC has also publicly mentioned 'group keying' for security, which suggests attributes applied to a collection of addresses related by job. So, I think there will be uABI concerns here.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-01 19:39 ` Jason Gunthorpe
2025-04-03 1:30 ` Sean Hefty
@ 2025-04-04 16:03 ` Ziemba, Ian
2025-04-05 1:07 ` Sean Hefty
1 sibling, 1 reply; 76+ messages in thread
From: Ziemba, Ian @ 2025-04-04 16:03 UTC (permalink / raw)
To: Jason Gunthorpe, Sean Hefty
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On 4/1/2025 2:39 PM, Jason Gunthorpe wrote:
>> I don't know that I can talk about the UEC spec,
>
> Right, it is too early to talk about UEC and Linux until people can
> freely talk about what it actually needs.
While the UE specs are not yet out, concepts can be discussed. I
recognize this may be annoying without specs.
The following is my understanding of the UE job model.
A job ID identifies "who am I." There are two different operational
modes:
1. Relative Addressing
The endpoint address is only valid within the scope of the job ID.
Endpoints can only transmit and receive on the associated job ID.
Parallel applications will use this to restrict communication to
only processes within the job.
Processes must be granted access to the job ID. In addition,
multiple processes may share a job ID. Some mechanism is required to
restrict what job IDs an endpoint can be configured against. Having
a device-level RDMA job object and a path to associate the job
object with an endpoint seems reasonable.
2. Absolute Addressing
The target endpoint address is outside the scope of the job ID. This
behavior allows an endpoint to receive on all job IDs and transmit
on only authorized job IDs. This mode enables server endpoints to
support multiple clients with different job IDs.
Since this mode impacts the job IDs transmitted in packets,
processes must be granted access. A device-level RDMA job object
seems reasonable for this as well.
An optional mechanism to restrict a receive buffer and MR to a
specific job ID is needed. This enables a server endpoint to have
per client job ID resources. Job ID verification is unnecessary
since the job ID associated with a receive buffer or MR does not
impact the packet job ID.
UE is going to need some object to restrict registered user-space
memory. Having the PD as the object supporting local memory
registration isolation seems ok. The UE object relationship could look
like job <- 1 --- 0..* -> endpoint <- 0..* --- 1 -> PD.
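As a sketch, a device-level job object and endpoint binding could look
like this (illustrative names, not from the UE spec):

/* Illustrative sketch only; names are not from the UE spec. */
struct ue_job {
	u32 job_id;		/* "who am I" carried in packets */
	bool relative;		/* relative vs absolute addressing */
};

/* Granting a job ID is privileged; once bound, the endpoint may only
 * transmit (and, in relative mode, receive) with that job ID. */
int ue_ep_bind_job(struct ue_ep *ep, struct ue_job *job);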
Thanks,
Ian
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-04 16:03 ` Ziemba, Ian
@ 2025-04-05 1:07 ` Sean Hefty
2025-04-07 19:32 ` Ziemba, Ian
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-04-05 1:07 UTC (permalink / raw)
To: Ziemba, Ian, Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> On 4/1/2025 2:39 PM, Jason Gunthorpe wrote:
> >> I don't know that I can talk about the UEC spec,
> >
> > Right, it is too early to talk about UEC and Linux until people can
> > freely talk about what it actually needs.
>
> While the UE specs are not yet out, concepts can be discussed. I recognize this
> may be annoying without specs.
>
> The following is my understanding of the UE job model.
>
> A job ID identifies "who am I." There are two different operational
> modes:
>
> 1. Relative Addressing
>
> The endpoint address is only valid within the scope of the job ID.
> Endpoints can only transmit and receive on the associated job ID.
> Parallel applications will use this to restrict communication to
> only processes within the job.
>
> Processes must be granted access to the job ID. In addition,
> multiple processes may share a job ID. Some mechanism is required to
> restrict what job IDs an endpoint can be configured against. Having
> a device-level RDMA job object and a path to associate the job
> object with an endpoint seems reasonable.
>
> 2. Absolute Addressing
>
> The target endpoint address is outside the scope of the job ID. This
> behavior allows an endpoint to receive on all job IDs and transmit
> on only authorized job IDs. This mode enables server endpoints to
> support multiple clients with different job IDs.
>
> Since this mode impacts the job IDs transmitted in packets,
> processes must be granted access. A device-level RDMA job object
> seems reasonable for this as well.
>
> An optional mechanism to restrict a receive buffer and MR to a
> specific job ID is needed. This enables a server endpoint to have
> per client job ID resources. Job ID verification is unnecessary
> since the job ID associated with a receive buffer or MR does not
> impact the packet job ID.
>
> UE is going to need some object to restrict registered user-space memory.
> Having the PD as the object supporting local memory registration isolation
> seems ok. The UE object relationship could look like job <- 1 --- 0..* ->
> endpoint <- 0..* --- 1 -> PD.
There's also the MR relationship:
Job <- 1 --- 0..n -> MR <- 0..n --- 1 -> PD
There's discussion on defining this relationship:
Job <- 0..n --- 1 -> PD
I can't think of a technical reason why that's needed. Without it, the Job basically acts in the same role as a second PD, which feels off. Plus, some NIC implementations may need such a restriction.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-05 1:07 ` Sean Hefty
@ 2025-04-07 19:32 ` Ziemba, Ian
2025-04-08 4:40 ` Sean Hefty
2025-04-16 23:58 ` Sean Hefty
0 siblings, 2 replies; 76+ messages in thread
From: Ziemba, Ian @ 2025-04-07 19:32 UTC (permalink / raw)
To: Sean Hefty, Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> There's also the MR relationship:
>
> Job <- 1 --- 0..n -> MR <- 0..n --- 1 -> PD
Current UE memory registration is centered around the libfabric
FI_MR_ENDPOINT mode, which states that memory regions are associated
with an endpoint instead of the libfabric domain. For remotely
accessible UE MRs, this translates to the following.
- Relative Addressing: {job ID, endpoint, RKEY} identifies the MR.
- Absolute Addressing: {endpoint, RKEY} identifies the MR with optional
MR job-ID access control check.
In addition, UE memory registration supports user-defined RKEYs. This
gives programming model implementations the option of using well-known
per-process endpoint RKEYs. For relative addressing, this could result
in the same RKEY value existing multiple times at the {job ID} level,
but only once at the {job ID, endpoint} level.
The UE remote MR relationship would look like:
Job <- 1 --- 0..n -> EP <- 0..n ---------------------- 1 -> PD
^                                                           ^
|--- 1 --- 0..n -> MR <- 0..n --- 1 ------------------------|
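One consequence is that remote MR lookup is scoped per endpoint rather
than per device; as a rough sketch (illustrative types, not from the
spec):

/* Illustrative sketch: with user-defined RKEYs the MR table hangs off
 * the endpoint, since the same RKEY value may repeat across EPs
 * within a job. */
struct ue_mr *ue_mr_lookup(struct ue_ep *ep, u32 job_id, u32 rkey)
{
	if (ep->relative && ep->job->job_id != job_id)
		return NULL;			/* wrong job for this EP */
	return xa_load(&ep->mr_table, rkey);	/* per-EP table */
}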
> There's discussion on defining this relationship:
>
> Job <- 0..n --- 1 -> PD
>
> I can't think of a technical reason why that's needed.
From my UE perspective, I agree. UE needs to share job IDs across
processes while still having inter-process isolation for things like
local memory registrations.
Thanks,
Ian
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-07 19:32 ` Ziemba, Ian
@ 2025-04-08 4:40 ` Sean Hefty
2025-04-16 23:58 ` Sean Hefty
1 sibling, 0 replies; 76+ messages in thread
From: Sean Hefty @ 2025-04-08 4:40 UTC (permalink / raw)
To: Ziemba, Ian, Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > There's also the MR relationship:
> >
> > Job <- 1 --- 0..n -> MR <- 0..n --- 1 -> PD
>
> Current UE memory registration is centered around the libfabric
> FI_MR_ENDPOINT mode which states memory regions are associated with an
> endpoint instead of libfabric domain. For remotely accessible UE MRs, this
> translates to the following.
>
> - Relative Addressing: {job ID, endpoint, RKEY} identifies the MR.
>
> - Absolute Addressing: {endpoint, RKEY} identifies the MR with optional
> MR job-ID access control check.
>
> In addition, UE memory registration supports user-defined RKEYs. This gives
> programming model implementations the option of using well-known per-process
> endpoint RKEYs. For relative addressing, this could result in the same
> RKEY value existing multiple times at the {job ID} level, but only once at the
> {job ID, endpoint} level.
>
> The UE remote MR relationship would look like:
>
> Job <- 1 --- 0..n -> EP <- 0..n ---------------------- 1 -> PD
> ^                                                           ^
> |--- 1 --- 0..n -> MR <- 0..n --- 1 ------------------------|
IMO, job support should be transport agnostic and include connection-oriented RDMA transports.
I might have been too concise. Transport header fields identify QP, MR, and job objects. I think in terms of these minimal mappings:
QPN -> QP
rkey -> MR
(also lkey -> MR)
Job ID -> job
It's an oversimplification. Obviously, QPN=1 uses other fields. Maybe Job ID is actually 'pkey'. Each vendor owns mapping transport fields to SW visible objects.
Libfabric supports both vendor and user selected rkeys. I don't think this matters to the model. If rkeys are device unique, rkey -> MR is trivial. If rkeys are not, that's a feature-complexity trade-off. Similarly, a vendor may support 1 job per QP or many. Regardless, I still view the security model as finding a valid tuple:
{PD, QP, MR, Job}
The MR and/or Job may be N/A for any given transfer or configuration.
> > There's discussion on defining this relationship:
> >
> > Job <- 0..n --- 1 -> PD
> >
> > I can't think of a technical reason why that's needed.
>
> From my UE perspective, I agree. UE needs to share job IDs across processes
> while still having inter-process isolation for things like local memory
> registrations.
I distinguish a job instance running on the cluster from an instance of some job object. Two object instances, identified by some other scope, could result in the same job ID. (E.g. each process has its own object.)
I lean towards defining a device-specific job object, assuming HW resources may be required. Job attributes include the selected transport and address/ID.
Is this sufficient scope? If a job maps to HW resources and is used with transport processing, does it need the process isolation that comes with pairing it to a PD? Or is it enough that the job is paired with the QP and, optionally, the MR? Is the creation of a job object always a privileged operation, or is only the act of assigning the job ID privileged?
Thinking through implementation options, since the job is specified per transfer, it seems easier to validate that the 'jkey' written with a WR shares the same PD as the QP, similar to how an lkey check might work. One alternative is a jkey list off the QP. Other options could offload libfabric address vectors. It's possible vendors may implement this differently, with differences showing up in which MRs are reachable.
- Sean
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-03-26 15:50 ` Jason Gunthorpe
@ 2025-04-08 14:16 ` Jamal Hadi Salim
2025-04-09 16:10 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Jamal Hadi Salim @ 2025-04-08 14:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Nikolay Aleksandrov,
Linux Kernel Network Developers, Shrijeet Mukherjee, alex.badea,
eric.davis, rip.sohan, David Ahern, bmt, roland, Winston Liu,
dan.mihailescu, kheib, parth.v.parikh, davem, ian.ziemba,
andrew.tauferner, welch, rakhahari.bhunia, kingshuk.mandal,
linux-rdma, Jakub Kicinski, Paolo Abeni
Sorry, I was too distracted elsewhere..
On Wed, Mar 26, 2025 at 11:50 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Mar 25, 2025 at 10:12:49AM -0400, Jamal Hadi Salim wrote:
>
[Trimmed for brevity..]
> > For a read() to fail at say copy_to_user() feels like your app or
> > system must be in really bad shape.
>
> Yes, but still the semantic we want is that if a creation ioctl
> returns 0 (success) then the object exists and if it returns any error
> code then the creation was a NOP.
>
> > A contingency plan could be to replay the message from the app/control
> > plane and hope you get an "object doesn't exist" kind of message for a
> > failed destroy msg.
>
> Nope, that's racy; it must be safe under multi-threading. Another thread could
> have created and re-used the object ID.
>
> > IOW, while unwinding is more honorable, unless it comes for cheap it
> > may not be worth it.
>
> It was cheap
>
> > Regardless: How would RDMA unwind in such a case?
>
> The object infrastructure takes care of this with a three step object
> creation protocol and some helpers.
>
[..]
> > When you say "driver" you mean "control/provisioning plane" activity
> > between a userspace control app and kernel objects which likely
> > extend
>
> No, I literally mean driver.
>
> The user of this HW will not do something like socket() as standard
> system call abstracted by the kernel. Instead it makes a library call
> ib_create_qp() which goes into a library with the userspace driver
> components. The abstraction is now done in userspace. The library
> figures out what HW the kernel has and loads a userspace driver
> component with a driver_create_qp() op that does more processing and
> eventually calls the kernel.
>
> It is "control path" in the sense that it is slow path creating
> objects for data transfer, but the purpose of most of the actions is
> actually setting up for data plane operations.
>
Ok, if I read correctly thus far - it seems you have some (3 phase)
transactional approach?
An earlier phase with this user driver interaction guarantees the
needed resources are available, which subsequent phases then use..
> > If my reading is right, some comments:
> > 1) You can achieve this fine with netlink. My view of the model is you
> > would have a T (call it VendorData, which is is defined within the
> > common namespace) that puts the vendor specific TLVs within a
> > hierarchy.
>
> Yes, that was a direction that was suggested here too. But when we got
> to micro-optimizing the ioctl ABI format it became clear there was
> significant advantage to keeping things one level and not trying to do
> some kind of nesting. This also gives a nice simple in-kernel API for
> working with method arguments, it is always the same. We don't have
> different APIs depending on driver/common callers.
>
Agreed, a flat namespace is a win as long as the modelling doesn't have
to be squished into a round-peg-in-square-hole abstraction.
> > 2) Hopefully the vendor extensions are in the minority. Otherwise the
> > complexity of someone writing an app to control multiple vendors would
> > be challenging over time as different vendors add more attributes.
>
> Nope, it is about 50/50, and it is not a challenge because the
> methodology is that everyone uses the *same* userspace driver code. It is
> too complicated for people to reasonably try to rewrite.
>
> > I can't imagine a commonly used utility like iproute2/tc being
> > invoked with "when using broadcom then use foo=x bar=y" but
> > when using intel use "goo=x-1 and gah=y-2".
>
> Right, it doesn't make sense for a tool like iproute, but we aren't
> building anything remotely like iproute.
>
My point was on the API. I don't know enough, so pardon my ignorance. My
basic assumption is there is common cross-vendor tooling and that
deployments may have to be multi-vendor. If that assumption is wrong
then my concern is not valid.
If my assumption is correct, whatever provisioning app is involved
needs to keep track of the multiple vendor interfaces - which means
the code will have to understand different semantics across vendors.
> > 3) A Pro/con to #2 depending on which lens you use: it could be
> > "innovation" or "vendor lockin" - depends on the community i.e on the
> > one hand a vendor could add features faster and is not bottlenecked by
> > endless mailing list discussions but otoh, said vendor may not be in
> > any hurry to move such features to the common path (because it gives
> > them an advantage).
>
> There is no community advantage to the common kernel path.
>
> The users all use the library, the only thing that matters is how
> accessible the vendor has made their unique ideas to the library
> users.
>
> For instance, if the user is running an MPI application and the vendor
> makes standard open source MPI 5% faster with some unique HW
> innovation, should anyone actually care about the "common path" deep,
> deep below MPI?
>
I would say they shouldn't care because the customer gets to benefit.
But on the flip side, again, that is counting on the goodwill of the
vendor.
cheers,
jamal
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-08 14:16 ` Jamal Hadi Salim
@ 2025-04-09 16:10 ` Jason Gunthorpe
0 siblings, 0 replies; 76+ messages in thread
From: Jason Gunthorpe @ 2025-04-09 16:10 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Leon Romanovsky, Nikolay Aleksandrov,
Linux Kernel Network Developers, Shrijeet Mukherjee, alex.badea,
eric.davis, rip.sohan, David Ahern, bmt, roland, Winston Liu,
dan.mihailescu, kheib, parth.v.parikh, davem, ian.ziemba,
andrew.tauferner, welch, rakhahari.bhunia, kingshuk.mandal,
linux-rdma, Jakub Kicinski, Paolo Abeni
On Tue, Apr 08, 2025 at 10:16:45AM -0400, Jamal Hadi Salim wrote:
> > > I can't imagine a commonly used utility like iproute2/tc being
> > > invoked with "when using broadcom then use foo=x bar=y" but
> > > when using intel use "goo=x-1 and gah=y-2".
> >
> > Right, it doesn't make sense for a tool like iproute, but we aren't
> > building anything remotely like iproute.
> >
>
> My point was on the API. I don't know enough, so pardon my ignorance. My
> basic assumption is there is common cross-vendor tooling and that
> deployments may have to be multi-vendor. If that assumption is wrong
> then my concern is not valid.
> If my assumption is correct, whatever provisioning app is involved
> needs to keep track of the multiple vendor interfaces - which means
> the code will have to understand different semantics across vendors.
It is like DRM and other places. There is only one userspace
implementation, coded into a library that all actual implementations
use.
For example, one of the ioctls is 'alloc pd'. All user applications
will link to libibverbs.so and invoke ibv_alloc_pd().
ibv_alloc_pd() under the covers has detected what kind of kernel
driver is present and will load an appropriate helper library, let's
say libmlx5.so. So it calls mlx5_alloc_pd(), which knows how to talk to
the kernel mlx5 side.
It is the responsibility of libibverbs.so/libmlx5.so to present a
standardized library call interface that is largely prescribed by the
IBTA specification.
For DRM this is similar to how libmesa/etc present a standardized
Vulkan/OpenGL library call interface but the kernel ioctls are all
very device specific.
There is no use case, or interest, in making it easy for anyone to
invoke the ioctls without using the single userspace library.
DRM/RDMA are all building things like this in pursuit of maximum
performance. We cannot afford to put an abstraction layer in the
kernel. Instead it is abstracted in userspace code with a
userspace/kernel split driver architecture.
> > For instance, if the user is running an MPI application and the vendor
> > makes standard open source MPI 5% faster with some unique HW
> > innovation, should anyone actually care about the "common path" deep,
> > deep below MPI?
>
> I would say they shouldn't care because the customer gets to benefit.
> But on the flip side, again, that is counting on the goodwill of the
> vendor.
In this space we have sophisticated large customers; it is not goodwill.
The open source stuff appears because the customers demand it, and so
long as that is true I feel pretty comfortable with things.
Jason
^ permalink raw reply [flat|nested] 76+ messages in thread
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-07 19:32 ` Ziemba, Ian
2025-04-08 4:40 ` Sean Hefty
@ 2025-04-16 23:58 ` Sean Hefty
2025-04-17 1:23 ` Jason Gunthorpe
1 sibling, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-04-16 23:58 UTC (permalink / raw)
To: Ziemba, Ian, Jason Gunthorpe
Cc: Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> > There's discussion on defining this relationship:
> >
> > Job <- 0..n --- 1 -> PD
> >
> > I can't think of a technical reason why that's needed.
>
> From my UE perspective, I agree. UE needs to share job IDs across processes
> while still having inter-process isolation for things like local memory
> registrations.
We seem stuck on this. Here's a specific proposal that I'm considering:
1. Define a device level 'security key'. The skey encapsulates encryption attributes.
The skey may be shared between processes.
2. Define a device level 'job', or maybe more generic 'communication domain'*.
A job object is associated with a transport protocol and these optional attributes:
address, job id (required for UET), and security key.
The job object may be shared between processes.
3. Define a PD level 'job key'. The job key references a single job object.
Multiple job keys may be created under a single PD, if each references a separate job.
4. Support creating MRs that reference job keys.
We can share job IDs across processes with process-level isolation of MRs. The security model can be viewed as meeting these checks:
Endpoint ID (QPN) -> endpoint (QP) -> PD
job ID -> job key -> PD
rkey -> MR -> PD or rkey -> MR -> job key -> PD
lkey -> MR -> PD or lkey -> MR -> job key -> PD (?)
(Other fields carried in the headers are needed to make these
mappings, but the concept is the same.) Access is allowed if the PDs
and job keys (if applicable) match.
The endpoint can only send to jobs associated with the same PD, e.g. a
jkey is specified in the WR. The endpoint can be configured to receive
from any job or only from those jobs associated with the same PD,
e.g. on receives, enforce the second check or not. I am unsure of the
lkey -> job key check.
If a NIC or endpoint only supports a single job, the job key is conceptually identical to the PD. (An endpoint can only receive from the assigned job).
* The job may also be used to store and share peer addresses between
processes. That is, it acts like a libfabric address vector restricted
to a single authorization key or no key. (Conversely, a libfabric AV
maps to multiple job objects, separated by auth_key.) To reflect a
more generic use, I would consider calling it a 'comm domain' rather
than a job.
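To make the proposed relationships concrete, a rough pseudo-C sketch
of the objects and the rkey check above (all names invented; this is
not proposed API):

/* Illustrative data model for the proposal above; names invented. */
#include <stdbool.h>
#include <stdint.h>

struct skey {                   /* 1. device level, shareable */
        uint8_t enc_attrs[32];  /* encryption attributes */
};

struct job {                    /* 2. device level 'job'/'comm domain' */
        uint32_t transport;
        uint32_t job_id;        /* required for UET */
        struct skey *skey;      /* optional */
};

struct pd;

struct job_key {                /* 3. PD level, references one job */
        struct pd *pd;
        struct job *job;
};

struct mr {                     /* 4. MRs may reference a job key */
        struct pd *pd;
        struct job_key *jkey;   /* optional */
};

/* rkey -> MR -> PD, or rkey -> MR -> job key -> PD: access is allowed
 * only if the PDs match and, when the MR has a job key, the job ID
 * carried in the packet matches the MR's job. */
static bool rkey_access_ok(const struct mr *mr, const struct pd *pd,
                           uint32_t wire_job_id)
{
        if (mr->pd != pd)
                return false;
        if (mr->jkey && mr->jkey->job->job_id != wire_job_id)
                return false;
        return true;
}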
- Sean
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-16 23:58 ` Sean Hefty
@ 2025-04-17 1:23 ` Jason Gunthorpe
2025-04-17 2:59 ` Sean Hefty
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-04-17 1:23 UTC (permalink / raw)
To: Sean Hefty
Cc: Ziemba, Ian, Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Wed, Apr 16, 2025 at 11:58:45PM +0000, Sean Hefty wrote:
> > > There's discussion on defining this relationship:
> > >
> > > Job <- 0..n --- 1 -> PD
> > >
> > > I can't think of a technical reason why that's needed.
> >
> > From my UE perspective, I agree. UE needs to share job IDs across processes
> > while still having inter-process isolation for things like local memory
> > registrations.
>
> We seem stuck on this. Here's a specific proposal that I'm considering:
I still think it is hard to have this discussion without information
flowing from UET..
I think the "Relative Addressing" Ian described is just a PD pointing
to a single job and all MRs within the PD linked to a single job. Is
there more than that?
"Absolute Addressing" seems confusing from a OS perspective. You can
receive packets on any Job ID but the OS prevents you from sending on
unauthorized Job IDs. Implying authorization happens dynamically. So
if you Rx a packet, how does an unpriv process go about getting OS
permission to use the Rx'd Job ID as a Tx? How does it NAK the Rx that
it isn't permitted? Why would you want to create an entire special
security mechanism just to partition MRs in this funny mode?
How does receive buffer job key partitioning work? Will UET HW-match
receive buffers to specific packets?
> 1. Define a device level 'security key'. The skey encapsulates encryption attributes.
> The skey may be shared between processes.
> 2. Define a device level 'job', or maybe more generic 'communication domain'*.
> A job object is associated with a transport protocol and these optional attributes:
> address, job id (required for UET), and security key.
> The job object may be shared between processes.
> 3. Define a PD level 'job key'. The job key references a single job object.
> Multiple job keys may be created under a single PD, if each references a separate job.
> 4. Support creating MRs that reference job keys.
This seems reasonable as a starting framework to me. I have wondered
if the 'security key' is really addressing information though. Sharing
IPs/MACs/encryption/etc across all job users seems appealing for MPI
type workloads.
But is one job key under an MR sufficient, or does UET expect this to
be a list of job keys?
Jason
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-17 1:23 ` Jason Gunthorpe
@ 2025-04-17 2:59 ` Sean Hefty
2025-04-17 13:31 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-04-17 2:59 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ziemba, Ian, Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> On Wed, Apr 16, 2025 at 11:58:45PM +0000, Sean Hefty wrote:
> > > > There's discussion on defining this relationship:
> > > >
> > > > Job <- 0..n --- 1 -> PD
> > > >
> > > > I can't think of a technical reason why that's needed.
> > >
> > > From my UE perspective, I agree. UE needs to share job IDs across
> > > processes while still having inter-process isolation for things like
> > > local memory registrations.
> >
> > We seem stuck on this. Here's a specific proposal that I'm considering:
>
> I still think it is hard to have this discussion without information flowing from
> UET..
>
> I think the "Relative Addressing" Ian described is just a PD pointing to a single
> job and all MRs within the PD linked to a single job. Is there more than that?
Relative / absolute addressing is in regard to the endpoint address. I.e. the equivalent of the QPN.
With relative addressing, the QPN is relative to the job ID. So QPN=5 for job=2 and QPN=5 for job=3 may or may not be the same HW resource. A HW QP may still belong to multiple jobs, if supported by the vendor.
> "Absolute Addressing" seems confusing from a OS perspective. You can
> receive packets on any Job ID but the OS prevents you from sending on
> unauthorized Job IDs. Implying authorization happens dynamically. So if you
> Rx a packet, how does an unpriv process go about getting OS permission to
> use the Rx'd Job ID as a Tx? How does it NAK the Rx that it isn't permitted?
> Why would you want to create an entire special security mechanism just to
> partition MRs in this funny mode?
Absolute addressing means the QPN is basically relative to the IP address. So, the HW resource can be located without using the job ID. Job IDs are carried in the transport, so every send must indicate what that value should be.
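In rough pseudo-C, the lookup difference between the two modes could
look like this (tables and names invented for illustration):

/* Illustrative only: resolving a QPN under the two addressing modes. */
#include <stddef.h>
#include <stdint.h>

struct hw_qp;

/* Absolute: one device-wide table; the job ID is not needed to locate
 * the HW resource, it is only checked afterwards. */
static struct hw_qp *lookup_absolute(struct hw_qp **dev_qpn_table,
                                     uint32_t qpn)
{
        return dev_qpn_table[qpn];
}

/* Relative: the job ID selects a per-job table first, so QPN=5 in
 * job=2 and QPN=5 in job=3 may resolve to different HW QPs. */
struct job_ctx {
        struct hw_qp **qpn_table;
};

static struct hw_qp *lookup_relative(struct job_ctx *(*find_job)(uint32_t),
                                     uint32_t job_id, uint32_t qpn)
{
        struct job_ctx *job = find_job(job_id);

        return job ? job->qpn_table[qpn] : NULL;
}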
As an example, assigning MRs to jobs allows the server to set up RMA buffers with access restricted to that job.
I have no idea how the receiver plans to enable sending back a response.
> How does receive buffer job key partitioning work? Will UET HW-match
> receive buffers to specific packets?
Not directly. Libfabric has two features useful to consider here.
The simplest is tag matching: different jobs could use different tag
bits, and MR partitioning can enforce that one job doesn't try to jump
into another job's tag space.
The second feature is called scalable endpoints. A scalable endpoint
has multiple receive queues, which are directly addressable by the
peer. Different jobs could target different receive queues.
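For reference, a minimal sketch of the scalable endpoint shape in
libfabric (CQ binding, attribute setup and error handling all elided;
the per-job mapping here is my own illustration):

/* Minimal sketch: one scalable endpoint whose RX contexts are handed
 * out per job. Binding completion queues, enabling the endpoints and
 * all error handling are elided. */
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

static struct fid_ep *rx_for_job(struct fid_domain *domain,
                                 struct fi_info *info, int job_index)
{
        static struct fid_ep *sep;
        struct fid_ep *rx;

        if (!sep)
                fi_scalable_ep(domain, info, &sep, NULL);

        /* RX contexts are directly addressable by the peer, so each
         * job can be pointed at its own receive queue. */
        fi_rx_context(sep, job_index, NULL, &rx, NULL);
        return rx;
}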
> > 1. Define a device level 'security key'. The skey encapsulates encryption attributes.
> > The skey may be shared between processes.
> > 2. Define a device level 'job', or maybe more generic 'communication domain'*.
> > A job object is associated with a transport protocol and these optional attributes:
> > address, job id (required for UET), and security key.
> > The job object may be shared between processes.
> > 3. Define a PD level 'job key'. The job key references a single job object.
> > Multiple job keys may be created under a single PD, if each references a separate job.
> > 4. Support creating MRs that reference job keys.
>
> This seems reasonable as a starting framework to me. I have wondered if the
> 'security key' is really addressing information though. Sharing
> IPs/MACs/encryption/etc across all job users seems appealing for MPI type
> workloads.
I've gone back and forth between separating and combining the
'security key' and job objects. Today I opted for separate, more
focused objects. Tomorrow, who knows? Job is where addressing
information goes. Since the security key is passed as an attribute to
the job, an MPI/AI job can share encryption/IPs/etc. across processes.
(Btw, I prefer the term 'comm domain' over job for this top-level
object, but I don't know if that makes things more or less confusing
for others. Job starts taking on different meanings.)
A separate security key made more sense to me when I considered
applying it to an RC QP. Additionally, an MPI/AI job may require
multiple job objects, one for each IP address. (Imagine a system
connected to separate networks, such that the job ID value cannot be
global.) A single security key can be used with all job instances.
> But is one job key under an MR sufficient, or does UET expect this to be a list of
> job keys?
One, I believe. Libfabric allows an MR to attach to a single job.
However, it does support derivative MRs, which could have different
properties but share page mappings.
- Sean
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-17 2:59 ` Sean Hefty
@ 2025-04-17 13:31 ` Jason Gunthorpe
2025-04-18 16:50 ` Sean Hefty
0 siblings, 1 reply; 76+ messages in thread
From: Jason Gunthorpe @ 2025-04-17 13:31 UTC (permalink / raw)
To: Sean Hefty
Cc: Ziemba, Ian, Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Thu, Apr 17, 2025 at 02:59:58AM +0000, Sean Hefty wrote:
> > I think the "Relative Addressing" Ian described is just a PD pointing to a single
> > job and all MRs within the PD linked to a single job. Is there more than that?
>
> Relative / absolute addressing is in regard to the endpoint address.
> I.e. the equivalent of the QPN.
>
> With relative addressing, the QPN is relative to the job ID. So
> QPN=5 for job=2 and QPN=5 for job=3 may or may not be the same HW
> resource. A HW QP may still belong to multiple jobs, if supported
> by the vendor.
Yes, but I think the key distinction is that everything is relative
to, or contained within, the job key, so we only have one job key and
every single object touched by a packet must be within that job. That
is the same security model as a PD if the PD has 1 job.
> As an example, assigning MRs to jobs allows the server to set up RMA
> buffers with access restricted to that job.
>
> I have no idea how the receiver plans to enable sending back a response.
Or get access to the new job id, which seems like a more important
question for the OS. I think I understand that there must be some
privileged entity that grants fine-grained access to jobs, but I have
not seen any detail on how that would actually work inside the OS to
cover all these cases.
Does this all-listening process have to do some kind of DBUS operation
to request access to a job and get back a job FD? Something else? Does
anyone have a plan in mind?
MPI seems to have a more obvious design where the launcher could be
privileged and pass a job FD to its children. The global MPI scheduler
could allocate the network-global job ids. Unpriv processes never
request a job on the fly.
> The second feature is called scalable
> endpoints. A scalable endpoint has multiple receive queues, which
> are directly addressable by the peer. Different jobs could target
> different receive queues.
That's just a new queue with different addressing rules. If the new
queue is created inside a new PD from its endpoint, are we OK then?
> I've gone back and forth between separating and combining the
> 'security key' and job objects. Today I opted for separate, more
> focused objects. Tomorrow, who knows? Job is where addressing
> information goes.
I don't know about combining, but it seems like security key and
addressing are sub-objects of the top-level job? Is there any reason
to share a security key with two jobs???
> A separate security key made more sense to me when I considered
> applying it to an RC QP. Additionally, an MPI/AI job may require
> multiple job objects, one for each IP address. (Imagine a system
> connected to separate networks, such that the job ID value cannot be
> global). A single security key can be used with all job instances.
I haven't heard any definition of how the job id is actually matched.
If you are talking about permitting on-the-wire job ids that alias and
map to different OS level job security domains then the HW must be
doing a full (src IP, dst IP, job key) -> Job Context search on every
packet to disambiguate?
That seems like something a latency-focused HPC NIC would not want to do.
If you are not doing full searching based on all allowed src IPs then
you can't have separate networks with separate job id spaces either.
But even then, managing the number space seems very hard. If an MPI
scheduler is assigning on-the-wire job ids from src/dst IP pairs within
its cluster then nothing else can assign job IDs from that pool or it
will conflict.
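To illustrate the per-packet disambiguation cost in question, a
sketch (everything here invented):

/* Illustrative: if on-the-wire job IDs may alias across networks, the
 * HW needs a full-tuple search per packet instead of a direct index. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct job_match_key {
        uint32_t src_ip;        /* IPv4 only, for brevity */
        uint32_t dst_ip;
        uint32_t wire_job_id;
};

struct job_ctx {
        struct job_match_key key;
};

/* A globally unique job ID would be job_table[wire_job_id]; aliasing
 * IDs force a (src IP, dst IP, job ID) search on every packet. */
static struct job_ctx *job_lookup(struct job_ctx *table, size_t n,
                                  const struct job_match_key *key)
{
        for (size_t i = 0; i < n; i++)
                if (!memcmp(&table[i].key, key, sizeof(*key)))
                        return &table[i];
        return NULL;
}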
Jason
* RE: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-17 13:31 ` Jason Gunthorpe
@ 2025-04-18 16:50 ` Sean Hefty
2025-04-22 15:44 ` Jason Gunthorpe
0 siblings, 1 reply; 76+ messages in thread
From: Sean Hefty @ 2025-04-18 16:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Ziemba, Ian, Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
> On Thu, Apr 17, 2025 at 02:59:58AM +0000, Sean Hefty wrote:
> > > I think the "Relative Addressing" Ian described is just a PD
> > > pointing to a single job and all MRs within the PD linked to a
> > > single job. Is there more than that?
> >
> > Relative / absolute addressing is in regard to the endpoint address.
> > I.e. the equivalent of the QPN.
> >
> > With relative addressing, the QPN is relative to the job ID. So
> > QPN=5 for job=2 and QPN=5 for job=3 may or may not be the same HW
> > resource. A HW QP may still belong to multiple jobs, if supported by
> > the vendor.
>
> Yes, but I think the key distinction is that everything is relative
> to, or contained within, the job key, so we only have one job key and
> every single object touched by a packet must be within that job. That
> is the same security model as a PD if the PD has 1 job.
Relative addressing does not constrain the QP to a single job. QPN=5
job=2 and QPN=4 job=3 may be the same HW QP. There's a per-job
table/hash/tree used to map QPNs to HW queues. A multi-port NIC may
need separate per-job tables per port.
(Let's ignore how the QP addressing gets assigned...)
> > As an example, assigning MRs to jobs allows the server to set up RMA
> > buffers with access restricted to that job.
> >
> > I have no idea how the receiver plans to enable sending back a response.
>
> Or get access to the new job id, which seems like a more important question
> for the OS. I think I understand that there must be some privileged entity that
> grants fine-grained access to jobs, but I have not seen any detail on how that
> would actually work inside the OS to cover all these cases.
>
> Does this all-listening process have to do some kind of DBUS operation to
> request access to a job and get back a job FD? Something else? Does anyone
> have a plan in mind?
>
> MPI seems to have a more obvious design where the launcher could be
> privileged and pass a job FD to its children. The global MPI scheduler could
> allocate the network-global job ids. Unpriv processes never request a job on
> the fly.
My guess is storage is allocated and configured prior to launching the
compute nodes, using the mechanism being defined. Once the compute
portion of the job completes, the storage portion of the job is
removed. I have not heard of a specific plan in this area, however.
> > The second feature is called scalable
> > endpoints. A scalable endpoint has multiple receive queues, which are
> > directly addressable by the peer. Different jobs could target
> > different receive queues.
>
> That's just a new queue with different addressing rules. If the new queue is
> created inside a new PD from its endpoint, are we OK then?
I.. think so.
> > I've gone back and forth between separating and combining the
> > 'security key' and job objects. Today I opted for separate, more
> > focused objects. Tomorrow, who knows? Job is where addressing
> > information goes.
>
> I don't know about combining, but it seems like security key and addressing
> are sub-objects of the top-level job? Is there any reason to share a security
> key with two jobs???
I doubt sharing a security key between HPC jobs is needed. I think of
the set of addresses as being a component of the top-level job.
Individual addresses are sub-objects, if that's what you mean.
I was thinking of the security key as an independent object, passed as
an attribute when creating the top-level job. The separation is so
that a job isn't needed to apply encryption to some RDMA QP in the
future. It seems possible to define the security key as a component of
the top-level job (and give job a new name), rather than an
independent object.
> > A separate security key made more sense to me when I considered
> > applying it to an RC QP. Additionally, an MPI/AI job may require
> > multiple job objects, one for each IP address. (Imagine a system
> > connected to separate networks, such that the job ID value cannot be
> > global). A single security key can be used with all job instances.
>
> I haven't heard any definition of how the job id is actually matched.
I define a job key. The job key provides a secure way to select the
job ID carried in the transport. A job key references a PD and is
specified as part of any transfer.
A job key may be provided when creating an MR. If so, the job *ID* is
stored with the MR. The PD of the job key and MR must be the same.
With absolute addressing, the QPN finds the QP through some
table/hash/lookup. An rkey locates an MR. If the MR has a valid job ID
associated with it, it's compared with the job ID from the transport.
If those match, the transfer is valid. This check is in addition to
verifying the QP and MR belong to the same PD.
With relative addressing, the job ID selects some table/hash, which
identifies the QP. Job matching is a natural part of mapping the QPN
to the QP. Job-related checks against target MRs are the same as
above.
There are other ways these checks may be implemented, including
tighter restrictions on which MRs a QP may access. But at least the
above checks should hold.
Generalizing the above to remove UET addressing: a QP may either
receive from any job or only from those jobs it is associated with. A
QP may belong to multiple jobs. And an MR may be restricted to access
by a single job. Vendors may optimize their implementations around
which features to support, e.g. limit a QP to 1 job, no per-job MRs,
etc.
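In rough pseudo-C, the target-side checks described above (absolute
addressing shown; all structure names invented):

/* Illustrative target-side validation of an inbound RMA request. */
#include <stdbool.h>
#include <stdint.h>

struct pd;

struct qp {
        struct pd *pd;
};

struct mr {
        struct pd *pd;
        bool has_job;
        uint32_t job_id;        /* stored from the job key at reg time */
};

static bool target_access_ok(const struct qp *qp, const struct mr *mr,
                             uint32_t wire_job_id)
{
        /* The QP and MR must belong to the same PD. */
        if (qp->pd != mr->pd)
                return false;
        /* If the MR was created with a job key, the job ID carried in
         * the transport must match the stored job ID. */
        if (mr->has_job && mr->job_id != wire_job_id)
                return false;
        return true;
}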
- Sean
* Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
2025-04-18 16:50 ` Sean Hefty
@ 2025-04-22 15:44 ` Jason Gunthorpe
0 siblings, 0 replies; 76+ messages in thread
From: Jason Gunthorpe @ 2025-04-22 15:44 UTC (permalink / raw)
To: Sean Hefty
Cc: Ziemba, Ian, Bernard Metzler, Roland Dreier, Nikolay Aleksandrov,
netdev@vger.kernel.org, shrijeet@enfabrica.net,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, dsahern@kernel.org, winston.liu@keysight.com,
dan.mihailescu@keysight.com, Kamal Heib,
parth.v.parikh@keysight.com, Dave Miller,
andrew.tauferner@cornelisnetworks.com, welch@hpe.com,
rakhahari.bhunia@keysight.com, kingshuk.mandal@keysight.com,
linux-rdma@vger.kernel.org, kuba@kernel.org, Paolo Abeni
On Fri, Apr 18, 2025 at 04:50:24PM +0000, Sean Hefty wrote:
> > On Thu, Apr 17, 2025 at 02:59:58AM +0000, Sean Hefty wrote:
> > > > I think the "Relative Addressing" Ian described is just a PD
> > > > pointing to a single job and all MRs within the PD linked to a
> > > > single job. Is there more than that?
> > >
> > > Relative / absolute addressing is in regard to the endpoint address.
> > > I.e. the equivalent of the QPN.
> > >
> > > With relative addressing, the QPN is relative to the job ID. So
> > > QPN=5 for job=2 and QPN=5 for job=3 may or may not be the same HW
> > > resource. A HW QP may still belong to multiple jobs, if supported by
> > > the vendor.
> >
> > Yes, but I think the key distinction is that everything is relative
> > to, or contained within, the job key, so we only have one job key
> > and every single object touched by a packet must be within that
> > job. That is the same security model as a PD if the PD has 1 job.
>
> Relative addressing does not constrain the QP to a single job.
> QPN=5 job=2 and QPN=4 job=3 may be the same HW QP. There's a
> per-job table/hash/tree used to map QPNs to HW queues. A multi-port
> NIC may need separate per-job tables per port.
I would say QPN=5 and QPN=4 are the objects, and they are constrained.
If there are other objects outside the PD/Job (like some kind of
shared queue) then that is a different thing.
It is why I asked if we can have the "new queue" inside different
PDs. Forget about language: there is an on-the-wire label that
identifies the QPN, and that QPN must be 1:1 with the job. That can be
a direct software object, even if it does not come with any queues,
but delivers to some other queue-holding object that is outside the
PD.
> My guess is storage is allocated and configured prior to launching
> the compute nodes using the mechanism being defined. Once the
> compute portion of the job completes, the storage portion of the job
> is removed. I have not heard of a specific plan in this area,
> however.
That seems too vague for an OS implementation.. We have to define how
"configured" works, and how the various components, for instance
kernel storage components, get permission to use the required job
keys.
> I was thinking of security key as an independent object, passed as
> an attribute when creating the top-level job. The separation is so
> a job isn't needed to apply encryption to some RDMA QP in the
> future. It seems possible to define security key as a component of
> the top-level job (and give job a new name), rather than an
> independent object.
I would probably duplicate the keys, both as part of a job and as part
of an address handle if that is the worry.
The schema doesn't need to be fully normalized; that can be harmful
when we are talking about different security contexts. A job
encryption key is some global cross-process object and an AH is a
per-process, per-uverbs-context object. They should not be the same.
> > > A separate security key made more sense to me when I considered
> > > applying it to an RC QP. Additionally, an MPI/AI job may require
> > > multiple job objects, one for each IP address. (Imagine a system
> > > connected to separate networks, such that the job ID value cannot be
> > > global). A single security key can be used with all job instances.
> >
> > I haven't heard any definition of how the job id is actually matched.
>
> With absolute addressing, the QPN finds the QP through some
> table/hash/lookup.
I meant how the job ID is matched starting from the head of the
Ethernet packet.
You cannot have "separate networks with non-global job IDs" without
more strictly defining how the job is determined, by including things
like IP address pairs and possibly more.
If the job number in the packet is port-global, or VLAN-global, or
something, then it is global and we don't need to worry about
"separate networks" because that isn't possible.
Jason
Thread overview: 76+ messages
2025-03-06 23:01 [RFC PATCH 00/13] Ultra Ethernet driver introduction Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 01/13] drivers: ultraeth: add initial skeleton and kconfig option Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 02/13] drivers: ultraeth: add context support Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 03/13] drivers: ultraeth: add new genl family Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 04/13] drivers: ultraeth: add job support Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 05/13] drivers: ultraeth: add tunnel udp device support Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 06/13] drivers: ultraeth: add initial PDS infrastructure Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 07/13] drivers: ultraeth: add request and ack receive support Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 08/13] drivers: ultraeth: add request transmit support Nikolay Aleksandrov
2025-03-06 23:01 ` [RFC PATCH 09/13] drivers: ultraeth: add support for coalescing ack Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 10/13] drivers: ultraeth: add sack support Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 11/13] drivers: ultraeth: add nack support Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 12/13] drivers: ultraeth: add initiator and target idle timeout support Nikolay Aleksandrov
2025-03-06 23:02 ` [RFC PATCH 13/13] HACK: drivers: ultraeth: add char device Nikolay Aleksandrov
2025-03-08 18:46 ` [RFC PATCH 00/13] Ultra Ethernet driver introduction Leon Romanovsky
2025-03-09 3:21 ` Parav Pandit
2025-03-11 14:20 ` Bernard Metzler
2025-03-11 14:55 ` Leon Romanovsky
2025-03-11 17:11 ` Sean Hefty
2025-03-12 9:20 ` Nikolay Aleksandrov
2025-03-12 9:40 ` Nikolay Aleksandrov
2025-03-12 11:29 ` Leon Romanovsky
2025-03-12 14:20 ` Nikolay Aleksandrov
2025-03-12 15:10 ` Leon Romanovsky
2025-03-12 16:00 ` Nikolay Aleksandrov
2025-03-14 14:53 ` Bernard Metzler
2025-03-17 12:52 ` Leon Romanovsky
2025-03-19 13:52 ` Jason Gunthorpe
2025-03-19 14:02 ` Nikolay Aleksandrov
2025-03-14 20:51 ` Stanislav Fomichev
2025-03-17 12:30 ` Leon Romanovsky
2025-03-19 19:12 ` Stanislav Fomichev
2025-03-15 20:49 ` Netlink vs ioctl WAS(Re: " Jamal Hadi Salim
2025-03-17 12:57 ` Leon Romanovsky
2025-03-18 22:49 ` Jason Gunthorpe
2025-03-19 18:21 ` Jamal Hadi Salim
2025-03-19 19:19 ` Jason Gunthorpe
2025-03-25 14:12 ` Jamal Hadi Salim
2025-03-26 15:50 ` Jason Gunthorpe
2025-04-08 14:16 ` Jamal Hadi Salim
2025-04-09 16:10 ` Jason Gunthorpe
2025-03-19 16:48 ` Jason Gunthorpe
2025-03-20 11:13 ` Yunsheng Lin
2025-03-20 14:32 ` Jason Gunthorpe
2025-03-20 20:05 ` Sean Hefty
2025-03-20 20:12 ` Jason Gunthorpe
2025-03-21 2:02 ` Yunsheng Lin
2025-03-21 12:01 ` Jason Gunthorpe
2025-03-24 20:22 ` Roland Dreier
2025-03-24 21:28 ` Sean Hefty
2025-03-25 13:22 ` Bernard Metzler
2025-03-25 17:02 ` Sean Hefty
2025-03-26 14:45 ` Jason Gunthorpe
2025-03-26 15:29 ` Sean Hefty
2025-03-26 15:53 ` Jason Gunthorpe
2025-03-26 17:39 ` Sean Hefty
2025-03-27 13:26 ` Jason Gunthorpe
2025-03-28 12:20 ` Yunsheng Lin
2025-03-31 19:49 ` Sean Hefty
2025-04-01 9:19 ` Yunsheng Lin
2025-03-31 19:29 ` Sean Hefty
2025-04-01 13:04 ` Jason Gunthorpe
2025-04-01 16:57 ` Sean Hefty
2025-04-01 19:39 ` Jason Gunthorpe
2025-04-03 1:30 ` Sean Hefty
2025-04-04 16:03 ` Ziemba, Ian
2025-04-05 1:07 ` Sean Hefty
2025-04-07 19:32 ` Ziemba, Ian
2025-04-08 4:40 ` Sean Hefty
2025-04-16 23:58 ` Sean Hefty
2025-04-17 1:23 ` Jason Gunthorpe
2025-04-17 2:59 ` Sean Hefty
2025-04-17 13:31 ` Jason Gunthorpe
2025-04-18 16:50 ` Sean Hefty
2025-04-22 15:44 ` Jason Gunthorpe
2025-03-26 15:16 ` Jason Gunthorpe