* [PATCH net-next v4.1] Expose RDS features via sysfs.
@ 2025-06-23 15:51 Konrad Rzeszutek Wilk
2025-06-23 15:51 ` [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF) Konrad Rzeszutek Wilk
0 siblings, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2025-06-23 15:51 UTC (permalink / raw)
To: allison.henderson, netdev, linux-rdma, rds-devel, tj, andrew
Cc: hannes, mkoutny, cgroups
Hi folks,
Changelog:
+ Since v1:
- Changed it to use sysfs and expose a 'features' file with the data.
+ Since v2:
- Use a 'features' directory similar to ext4, btrfs and expose each feature.
+ Since v3:
- Fix ret returning 0 if kobject_create_and_add fails
- Follow the syntax of Documentation/ABI/stable/sysfs-module.
+ Since v4:
- Changed period to coma in ABI documentation.
- Added Allison's Reviewed-by
This patch addresses an issue where we have applications compiled against
against older (or newer) kernels. RDS does not have any ioctls to query
for what version of ABIs it has or what features it has. Hence this solution
that proposes to put this ABI information in sysfs space.
Documentation/ABI/stable/sysfs-transport-rds | 43 +++++++++++++++++++
net/rds/af_rds.c | 63 +++++++++++++++++++++++++++-
2 files changed, 105 insertions(+), 1 deletion(-)
Konrad Rzeszutek Wilk (1):
rds: Expose feature parameters via sysfs (and ELF)
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF)
2025-06-23 15:51 [PATCH net-next v4.1] Expose RDS features via sysfs Konrad Rzeszutek Wilk
@ 2025-06-23 15:51 ` Konrad Rzeszutek Wilk
2025-06-25 23:30 ` Jakub Kicinski
0 siblings, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2025-06-23 15:51 UTC (permalink / raw)
To: allison.henderson, netdev, linux-rdma, rds-devel, tj, andrew
Cc: hannes, mkoutny, cgroups, Konrad Rzeszutek Wilk
We would like to have a programmatic way for applications
to query which of the features defined in include/uapi/linux/rds.h
are actually implemented by the kernel.
The problem is that applications can be built against newer
kernel (or older) and they may have the feature implemented or not.
The lack of a certain feature would signify that the kernel
does not support it. The presence of it signifies the existence
of it.
This would provide the application to query the sysfs and figure
out what is supported on a running system.
This patch would expose this extra sysfs file:
/sys/kernel/rds/features/ioctl_get_tos
/sys/kernel/rds/features/ioctl_set_tos
/sys/kernel/rds/features/socket_cancel_sent_to
/sys/kernel/rds/features/socket_cong_monitor
/sys/kernel/rds/features/socket_free_mr
/sys/kernel/rds/features/socket_get_mr
/sys/kernel/rds/features/socket_get_mr_for_dest
/sys/kernel/rds/features/socket_recverr
/sys/kernel/rds/features/socket_so_rxpath_latency
/sys/kernel/rds/features/socket_so_transport
With the value of 'supported' in them. In the future this value
could change to say 'deprecated' or have other values (for example
different versions) or can be runtime changed.
The choice to use sysfs and this particular way is modeled on the
filesystems usage exposing their features.
Alternative solution such as exposing one file ('features') with
each feature enumerated (which cgroup does) is a bit limited in
that it does not provide means to provide extra content in the future
for each feature. For example if one of the features had three
modes and one wanted to set a particular one at runtime - that
does not exist in cgroup (albeit it can be implemented but it would
be quite hectic to have just one single attribute).
Another solution of using an ioctl to expose a bitmask has the
disadvantage of being less flexible in the future and while it can
have a bit of supported/unsupported, it is not clear how one would
change modes or expose versions. It is most certainly feasible
but it can get seriously complex fast.
As such this mechanism offers the basic support we require
now and offers the flexibility for the future.
Lastly, we also utilize the ELF note macro to expose these via
so that applications that have not yet initialized RDS transport
can inspect the kernel module to see if they have the appropiate
support and choose an alternative protocol if they wish so.
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v3: Add the missing documentation
Remove the CONFIG_SYSFS #ifdef machinations
Redo it with each feature as a seperate file in 'features'
directory.
v4: Follow the syntax in Documentation/ABI/stable/sysfs-module
Tack on the ret value if we fail kobject_create_and_add
v4.1: Change period to coma in the sysfs-transport-rds
Added Allison's Reviewed-by
---
Documentation/ABI/stable/sysfs-transport-rds | 43 +++++++++++++
net/rds/af_rds.c | 63 +++++++++++++++++++-
2 files changed, 105 insertions(+), 1 deletion(-)
create mode 100644 Documentation/ABI/stable/sysfs-transport-rds
diff --git a/Documentation/ABI/stable/sysfs-transport-rds b/Documentation/ABI/stable/sysfs-transport-rds
new file mode 100644
index 000000000000..cd7559b9517c
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-transport-rds
@@ -0,0 +1,43 @@
+What: /sys/kernel/rds/features/*
+Date: June 2025
+KernelVersion: 6.17
+Contact: rds-devel@oss.oracle.com
+Description: This directory contains the features that this kernel
+ has been built with and supports. They correspond
+ to the include/uapi/linux/rds.h features.
+
+ The intent is for applications compiled against rds.h
+ to be able to query and find out what features the
+ driver supports. The current expected value is 'supported'.
+
+What: /sys/kernel/rds/features/ioctl_[get,set]_tos
+Description: Allows the user to set on the socket a type of
+ service(tos) value associated forever.
+
+What: /sys/kernel/rds/features/socket_cancel_sent_to
+Description: Allows to cancel all pending messages to a given destination.
+
+What: /sys/kernel/rds/features/socket_cong_monitor
+Description: RDS provides explicit congestion monitoring for a socket using
+ a 64-bit mask. Each bit in the mask corresponds to a group of ports.
+
+ When a congestion update arrives, RDS checks the set of ports
+ that became uncongested against the bit mask.
+
+ If they overlap, a control messages is enqueued on the socket,
+ and the application is woken up.
+
+What: /sys/kernel/rds/features/socket_[get_mr,get_mr_for_dest,free_mr]
+Description: RDS allows a process to register or release memory ranges for
+ RDMA.
+
+What: /sys/kernel/rds/features/socket_recverr
+Description: RDS will send RDMA notification messages to the application for
+ any RDMA operation that fails. By default this is off.
+
+What: /sys/kernel/rds/features/socket_so_rxpath_latency
+Description: Receive path latency in various stages of receive path.
+
+What: /sys/kernel/rds/features/socket_so_transport
+Description: Attach the socket to the underlaying transport (TCP, RDMA
+ or loop) before invoking bind on the socket.
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 8435a20968ef..f920cc6f0b2b 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -32,7 +32,9 @@
*/
#include <linux/module.h>
#include <linux/errno.h>
+#include <linux/elfnote.h>
#include <linux/kernel.h>
+#include <linux/kobject.h>
#include <linux/gfp.h>
#include <linux/in.h>
#include <linux/ipv6.h>
@@ -871,6 +873,53 @@ static void rds6_sock_info(struct socket *sock, unsigned int len,
}
#endif
+static ssize_t supported_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ return sysfs_emit(buf, "supported\n");
+}
+
+#define SYSFS_ATTR(_name) \
+ELFNOTE64("rds." #_name, 0, 1); \
+static struct kobj_attribute rds_attr_##_name = { \
+ .attr = {.name = __stringify(_name), .mode = 0444 }, \
+ .show = supported_show, \
+}
+
+SYSFS_ATTR(ioctl_set_tos);
+SYSFS_ATTR(ioctl_get_tos);
+SYSFS_ATTR(socket_cancel_sent_to);
+SYSFS_ATTR(socket_cong_monitor);
+SYSFS_ATTR(socket_get_mr);
+SYSFS_ATTR(socket_get_mr_for_dest);
+SYSFS_ATTR(socket_free_mr);
+SYSFS_ATTR(socket_recverr);
+SYSFS_ATTR(socket_so_rxpath_latency);
+SYSFS_ATTR(socket_so_transport);
+
+#define ATTR_LIST(_name) &rds_attr_##_name.attr
+
+static struct attribute *rds_feat_attrs[] = {
+ ATTR_LIST(ioctl_set_tos),
+ ATTR_LIST(ioctl_get_tos),
+ ATTR_LIST(socket_cancel_sent_to),
+ ATTR_LIST(socket_cong_monitor),
+ ATTR_LIST(socket_get_mr),
+ ATTR_LIST(socket_get_mr_for_dest),
+ ATTR_LIST(socket_free_mr),
+ ATTR_LIST(socket_recverr),
+ ATTR_LIST(socket_so_rxpath_latency),
+ ATTR_LIST(socket_so_transport),
+ NULL,
+};
+
+static const struct attribute_group rds_feat_group = {
+ .attrs = rds_feat_attrs,
+ .name = "features",
+};
+
+static struct kobject *rds_sysfs_kobj;
+
static void rds_exit(void)
{
sock_unregister(rds_family_ops.family);
@@ -882,6 +931,8 @@ static void rds_exit(void)
rds_stats_exit();
rds_page_exit();
rds_bind_lock_destroy();
+ sysfs_remove_group(rds_sysfs_kobj, &rds_feat_group);
+ kobject_put(rds_sysfs_kobj);
rds_info_deregister_func(RDS_INFO_SOCKETS, rds_sock_info);
rds_info_deregister_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
#if IS_ENABLED(CONFIG_IPV6)
@@ -923,6 +974,15 @@ static int __init rds_init(void)
if (ret)
goto out_proto;
+ rds_sysfs_kobj = kobject_create_and_add("rds", kernel_kobj);
+ if (!rds_sysfs_kobj) {
+ ret = -ENOMEM;
+ goto out_proto;
+ }
+ ret = sysfs_create_group(rds_sysfs_kobj, &rds_feat_group);
+ if (ret)
+ goto out_kobject;
+
rds_info_register_func(RDS_INFO_SOCKETS, rds_sock_info);
rds_info_register_func(RDS_INFO_RECV_MESSAGES, rds_sock_inc_info);
#if IS_ENABLED(CONFIG_IPV6)
@@ -931,7 +991,8 @@ static int __init rds_init(void)
#endif
goto out;
-
+out_kobject:
+ kobject_put(rds_sysfs_kobj);
out_proto:
proto_unregister(&rds_proto);
out_stats:
--
2.43.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF)
2025-06-23 15:51 ` [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF) Konrad Rzeszutek Wilk
@ 2025-06-25 23:30 ` Jakub Kicinski
2025-06-26 8:45 ` Paolo Abeni
2025-06-28 0:44 ` Konrad Rzeszutek Wilk
0 siblings, 2 replies; 6+ messages in thread
From: Jakub Kicinski @ 2025-06-25 23:30 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: allison.henderson, netdev, linux-rdma, rds-devel, tj, andrew,
hannes, mkoutny, cgroups, linux-kernel
On Mon, 23 Jun 2025 11:51:36 -0400 Konrad Rzeszutek Wilk wrote:
> With the value of 'supported' in them. In the future this value
> could change to say 'deprecated' or have other values (for example
> different versions) or can be runtime changed.
I'm curious how this theoretical 'deprecated' value would work
in context of uAPI which can never regress..
> The choice to use sysfs and this particular way is modeled on the
> filesystems usage exposing their features.
>
> Alternative solution such as exposing one file ('features') with
> each feature enumerated (which cgroup does) is a bit limited in
> that it does not provide means to provide extra content in the future
> for each feature. For example if one of the features had three
> modes and one wanted to set a particular one at runtime - that
> does not exist in cgroup (albeit it can be implemented but it would
> be quite hectic to have just one single attribute).
>
> Another solution of using an ioctl to expose a bitmask has the
> disadvantage of being less flexible in the future and while it can
> have a bit of supported/unsupported, it is not clear how one would
> change modes or expose versions. It is most certainly feasible
> but it can get seriously complex fast.
>
> As such this mechanism offers the basic support we require
> now and offers the flexibility for the future.
>
> Lastly, we also utilize the ELF note macro to expose these via
> so that applications that have not yet initialized RDS transport
> can inspect the kernel module to see if they have the appropiate
> support and choose an alternative protocol if they wish so.
Looks like this paragraph had a line starting with #, presumably
talking about the ELF note and it got eaten by git? Please fix.
FWIW to me this series has a strong whiff of "we have an OOT module
which has much more functionality and want to support a degraded /
upstream-only mode in the user space stack". I'm probably over-
-interpreting, and you could argue this will help you make real
use of the upstream RDS. I OTOH would argue that it's a technical
solution to a non-technical problem of not giving upstreaming
sufficient priority; I'd prefer to see code flowing upstream _first_
and then worry about compatibility.
$ git log --oneline --since='6 months ago' -- net/rds/
433dce0692a0 rds: Correct spelling
6e307a873d30 rds: Correct endian annotation of port and addr assignments
5bccdc51f90c replace strncpy with strscpy_pad
c50d295c37f2 rds: Use nested-BH locking for rds_page_remainder
0af5928f358c rds: Acquire per-CPU pointer within BH disabled section
aaaaa6639cf5 rds: Disable only bottom halves in rds_page_remainder_alloc()
357660d7596b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
5c70eb5c593d net: better track kernel sockets lifetime
c451715d78e3 net/rds: Replace deprecated strncpy() with strscpy_pad()
7f5611cbc487 rds: sysctl: rds_tcp_{rcv,snd}buf: avoid using current->nsproxy
$
IOW applying this patch is a bit of a leap of faith that RDS
upstreaming will restart. I don't have anything against the patch
per se, but neither do I have much faith in this. So if v5 is taking
a long time to get applied it will be because its waiting for DaveM or
Paolo to take it.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF)
2025-06-25 23:30 ` Jakub Kicinski
@ 2025-06-26 8:45 ` Paolo Abeni
2025-06-28 0:44 ` Konrad Rzeszutek Wilk
1 sibling, 0 replies; 6+ messages in thread
From: Paolo Abeni @ 2025-06-26 8:45 UTC (permalink / raw)
To: Jakub Kicinski, Konrad Rzeszutek Wilk
Cc: allison.henderson, netdev, linux-rdma, rds-devel, tj, andrew,
hannes, mkoutny, cgroups, linux-kernel
On 6/26/25 1:30 AM, Jakub Kicinski wrote:
> IOW applying this patch is a bit of a leap of faith that RDS
> upstreaming will restart. I don't have anything against the patch
> per se, but neither do I have much faith in this. So if v5 is taking
> a long time to get applied it will be because its waiting for DaveM or
> Paolo to take it.
I agree with the above. I think that to accept this patch we need it to
be part of a series actually introducing new features and/or deprecating
existing one. And likely deprecating new features without introducing
new ones will make little sense.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF)
2025-06-25 23:30 ` Jakub Kicinski
2025-06-26 8:45 ` Paolo Abeni
@ 2025-06-28 0:44 ` Konrad Rzeszutek Wilk
2025-06-28 8:23 ` Andrew Lunn
1 sibling, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2025-06-28 0:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: allison.henderson, netdev, linux-rdma, rds-devel, tj, andrew,
hannes, mkoutny, cgroups, linux-kernel
On Wed, Jun 25, 2025 at 04:30:09PM -0700, Jakub Kicinski wrote:
> On Mon, 23 Jun 2025 11:51:36 -0400 Konrad Rzeszutek Wilk wrote:
> > With the value of 'supported' in them. In the future this value
> > could change to say 'deprecated' or have other values (for example
> > different versions) or can be runtime changed.
>
> I'm curious how this theoretical 'deprecated' value would work
> in context of uAPI which can never regress..
Kind of sad considering there are some APIs that should really
be fixed. Perhaps more of 'it-is-busted-use-this-other-API'?
>
> > The choice to use sysfs and this particular way is modeled on the
> > filesystems usage exposing their features.
> >
> > Alternative solution such as exposing one file ('features') with
> > each feature enumerated (which cgroup does) is a bit limited in
> > that it does not provide means to provide extra content in the future
> > for each feature. For example if one of the features had three
> > modes and one wanted to set a particular one at runtime - that
> > does not exist in cgroup (albeit it can be implemented but it would
> > be quite hectic to have just one single attribute).
> >
> > Another solution of using an ioctl to expose a bitmask has the
> > disadvantage of being less flexible in the future and while it can
> > have a bit of supported/unsupported, it is not clear how one would
> > change modes or expose versions. It is most certainly feasible
> > but it can get seriously complex fast.
> >
> > As such this mechanism offers the basic support we require
> > now and offers the flexibility for the future.
> >
> > Lastly, we also utilize the ELF note macro to expose these via
.. <missing>
> > so that applications that have not yet initialized RDS transport
> > can inspect the kernel module to see if they have the appropiate
> > support and choose an alternative protocol if they wish so.
>
> Looks like this paragraph had a line starting with #, presumably
> talking about the ELF note and it got eaten by git? Please fix.
Yup
>
>
> FWIW to me this series has a strong whiff of "we have an OOT module
> which has much more functionality and want to support a degraded /
> upstream-only mode in the user space stack". I'm probably over-
> -interpreting, and you could argue this will help you make real
> use of the upstream RDS. I OTOH would argue that it's a technical
> solution to a non-technical problem of not giving upstreaming
> sufficient priority; I'd prefer to see code flowing upstream _first_
> and then worry about compatibility.
The goal here was to lay the groundwork for another patch series that
Allison had in her backlog which was to introduce the reset functionality.
Let me work with Allison on adding this to that patch series.
>
> $ git log --oneline --since='6 months ago' -- net/rds/
> 433dce0692a0 rds: Correct spelling
> 6e307a873d30 rds: Correct endian annotation of port and addr assignments
> 5bccdc51f90c replace strncpy with strscpy_pad
> c50d295c37f2 rds: Use nested-BH locking for rds_page_remainder
> 0af5928f358c rds: Acquire per-CPU pointer within BH disabled section
> aaaaa6639cf5 rds: Disable only bottom halves in rds_page_remainder_alloc()
> 357660d7596b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> 5c70eb5c593d net: better track kernel sockets lifetime
> c451715d78e3 net/rds: Replace deprecated strncpy() with strscpy_pad()
> 7f5611cbc487 rds: sysctl: rds_tcp_{rcv,snd}buf: avoid using current->nsproxy
> $
>
> IOW applying this patch is a bit of a leap of faith that RDS
> upstreaming will restart. I don't have anything against the patch
It has to. We have to make the RDS TCP be bug-free as there are
customers demanding that.
> per se, but neither do I have much faith in this. So if v5 is taking
> a long time to get applied it will be because its waiting for DaveM or
> Paolo to take it.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF)
2025-06-28 0:44 ` Konrad Rzeszutek Wilk
@ 2025-06-28 8:23 ` Andrew Lunn
0 siblings, 0 replies; 6+ messages in thread
From: Andrew Lunn @ 2025-06-28 8:23 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Jakub Kicinski, allison.henderson, netdev, linux-rdma, rds-devel,
tj, hannes, mkoutny, cgroups, linux-kernel
On Fri, Jun 27, 2025 at 08:44:49PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 25, 2025 at 04:30:09PM -0700, Jakub Kicinski wrote:
> > On Mon, 23 Jun 2025 11:51:36 -0400 Konrad Rzeszutek Wilk wrote:
> > > With the value of 'supported' in them. In the future this value
> > > could change to say 'deprecated' or have other values (for example
> > > different versions) or can be runtime changed.
> >
> > I'm curious how this theoretical 'deprecated' value would work
> > in context of uAPI which can never regress..
>
> Kind of sad considering there are some APIs that should really
> be fixed. Perhaps more of 'it-is-busted-use-this-other-API'?
I expect any attempt to actually use 'deprecated' is going to draw a
lot of close review and push back. ABI is ABI, even if it is broken.
> > $ git log --oneline --since='6 months ago' -- net/rds/
> > 433dce0692a0 rds: Correct spelling
> > 6e307a873d30 rds: Correct endian annotation of port and addr assignments
> > 5bccdc51f90c replace strncpy with strscpy_pad
> > c50d295c37f2 rds: Use nested-BH locking for rds_page_remainder
> > 0af5928f358c rds: Acquire per-CPU pointer within BH disabled section
> > aaaaa6639cf5 rds: Disable only bottom halves in rds_page_remainder_alloc()
> > 357660d7596b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> > 5c70eb5c593d net: better track kernel sockets lifetime
> > c451715d78e3 net/rds: Replace deprecated strncpy() with strscpy_pad()
> > 7f5611cbc487 rds: sysctl: rds_tcp_{rcv,snd}buf: avoid using current->nsproxy
> > $
> >
> > IOW applying this patch is a bit of a leap of faith that RDS
> > upstreaming will restart. I don't have anything against the patch
>
> It has to. We have to make the RDS TCP be bug-free as there are
> customers demanding that.
Maybe we should hold off on this patch until there is a real need for
it? Make it part of a patch series which adds new functionality to the
ABI. It is only when the ABI changes does it make any sense to have
this API.
Andrew
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-06-28 8:23 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-23 15:51 [PATCH net-next v4.1] Expose RDS features via sysfs Konrad Rzeszutek Wilk
2025-06-23 15:51 ` [PATCH net-next v4.1] rds: Expose feature parameters via sysfs (and ELF) Konrad Rzeszutek Wilk
2025-06-25 23:30 ` Jakub Kicinski
2025-06-26 8:45 ` Paolo Abeni
2025-06-28 0:44 ` Konrad Rzeszutek Wilk
2025-06-28 8:23 ` Andrew Lunn
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).