From: Jason Gunthorpe <jgg@nvidia.com>
To: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Leon Romanovsky <leon@kernel.org>,
Nikolay Aleksandrov <nikolay@enfabrica.net>,
Linux Kernel Network Developers <netdev@vger.kernel.org>,
Shrijeet Mukherjee <shrijeet@enfabrica.net>,
alex.badea@keysight.com, eric.davis@broadcom.com,
rip.sohan@amd.com, David Ahern <dsahern@kernel.org>,
bmt@zurich.ibm.com, roland@enfabrica.net,
Winston Liu <winston.liu@keysight.com>,
dan.mihailescu@keysight.com, kheib@redhat.com,
parth.v.parikh@keysight.com, davem@redhat.com,
ian.ziemba@hpe.com, andrew.tauferner@cornelisnetworks.com,
welch@hpe.com, rakhahari.bhunia@keysight.com,
kingshuk.mandal@keysight.com, linux-rdma@vger.kernel.org,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Subject: Re: Netlink vs ioctl WAS(Re: [RFC PATCH 00/13] Ultra Ethernet driver introduction
Date: Wed, 19 Mar 2025 16:19:46 -0300
Message-ID: <20250319191946.GP9311@nvidia.com>
In-Reply-To: <CAM0EoMkVz8HfEUg33hptE91nddSrao7=6BzkUS-3YDyHQeOhVw@mail.gmail.com>
On Wed, Mar 19, 2025 at 02:21:23PM -0400, Jamal Hadi Salim wrote:
> Curious how you guarantee that a "destroy" will not fail under OOM. Do
> you have pre-allocated memory?
It just never allocates memory? Why would a simple system call like a
destruction allocate any memory?
> > Overall systems calls here should either succeed or fail and be the
> > same as a NOP. No failure that actually did something and then creates
> > some resource leak or something because userspace didn't know about
> > it.
>
> Yes, this is how netlink works as well. If a failure to delete an
> object occurs then every transient state gets restored. This is always
> the case for simple requests(a delete/create/update). For requests
> that batch multiple objects there are cases where there is no
> unwinding.
I'm not sure that is completely true. For example, if userspace messes
up the netlink read() side of the API and copy_to_user() fails, then you
can get these inconsistencies. In the RDMA model even those edge cases
are properly unwound, just as they would be in a normal system call.
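A minimal userspace sketch of that all-or-nothing behaviour, not taken
from the RDMA code: struct table, fake_copy_out(), create_obj() and
destroy_obj() are hypothetical names, and fake_copy_out() only stands in
for copy_to_user(). The point is that a failed create leaves the table
exactly as it was, and destroy has no allocation on its path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_OBJS 4

/* Hypothetical per-process object table. */
struct table {
	bool used[MAX_OBJS];
};

/* Stand-in for copy_to_user(): fails when the destination is NULL. */
static int fake_copy_out(int *dst, int value)
{
	if (!dst)
		return -EFAULT;
	*dst = value;
	return 0;
}

/* Create an object and report its ID; on any failure the table is unchanged. */
static int create_obj(struct table *t, int *user_id)
{
	int id, ret;

	assert(t != NULL);
	for (id = 0; id < MAX_OBJS; id++)
		if (!t->used[id])
			break;
	if (id == MAX_OBJS)
		return -ENOSPC;

	t->used[id] = true;		/* tentative allocation */
	ret = fake_copy_out(user_id, id);
	if (ret) {
		t->used[id] = false;	/* unwind: same as a NOP */
		return ret;
	}
	return 0;
}

/* Destroy only clears state; nothing here can fail for lack of memory. */
static int destroy_obj(struct table *t, int id)
{
	assert(t != NULL);
	if (id < 0 || id >= MAX_OBJS || !t->used[id])
		return -ENOENT;
	t->used[id] = false;
	return 0;
}
```

Calling create_obj() with a bad output pointer returns -EFAULT and leaves
every slot free, so userspace never ends up owning an object it was not
told about.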
> Makes sense. So ioctls with TLVs ;->
> I am suspecting you don't have concepts of TLVs inside TLVs for
> hierarchies within objects.
No, it has not been needed yet, or at least the cases that have come
up have been happy to use arrays of structs for the nesting. The
method calls themselves don't tend to have that kind of challenging
structure for their arguments.
> > RDMA also has special infrastructure to split up the TLV space between
> > core code and HW driver code which is a key feature and necessary part
> > of how you'd build a user/kernel split driver.
>
> The T namespace is split between core code and driver code?
> I can see that as being useful for debugging maybe? What else?
RDMA is all about having a user/kernel driver co-design. This means a
driver has code in a userspace library and code in the kernel that
work together to implement the functionality. The userspace library
should be thought of as an extension of the kernel driver into
userspace.
So, there is a lot of traffic between the two driver components that is
just private and unique to the driver. This is what the driver
namespace is used for.
For instance, there is a common method call to create a queue. The
queue has a number of core parameters like depth and address. Then it
calls the driver, and there are a bunch of device-specific parameters
too, like, say, the queue entry format.
Every driver gets to define its own parameters best suited to its own
device and its own user/kernel split.
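A rough sketch of how such a split can look, assuming one bit of the
attribute ID marks the driver-private namespace. Everything here
(ATTR_DRIVER_NS, the attribute names, struct queue_params,
create_queue()) is hypothetical and only illustrates the idea; it is not
the actual uverbs layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: IDs with this bit set belong to the driver, not the core. */
#define ATTR_DRIVER_NS (1u << 12)

enum core_attr_id {
	CORE_ATTR_QUEUE_DEPTH = 0,
	CORE_ATTR_QUEUE_ADDR = 1,
};

enum drv_attr_id {
	DRV_ATTR_ENTRY_FORMAT = ATTR_DRIVER_NS | 0,
};

struct attr {
	uint16_t id;
	uint64_t value;
};

struct queue_params {
	uint64_t depth;
	uint64_t addr;
	uint64_t entry_format;	/* device specific */
};

static bool attr_is_driver(uint16_t id)
{
	return (id & ATTR_DRIVER_NS) != 0;
}

/* Core code only understands the core namespace. */
static int core_parse(const struct attr *a, struct queue_params *q)
{
	switch (a->id) {
	case CORE_ATTR_QUEUE_DEPTH: q->depth = a->value; return 0;
	case CORE_ATTR_QUEUE_ADDR:  q->addr = a->value;  return 0;
	default: return -EINVAL;
	}
}

/* Each driver defines and parses its own private attributes. */
static int driver_parse(const struct attr *a, struct queue_params *q)
{
	switch (a->id) {
	case DRV_ATTR_ENTRY_FORMAT: q->entry_format = a->value; return 0;
	default: return -EINVAL;
	}
}

/* Route each attribute by namespace; commit only if all of them parse. */
static int create_queue(const struct attr *attrs, size_t n,
			struct queue_params *q)
{
	struct queue_params tmp = { 0 };
	size_t i;
	int ret;

	assert(q != NULL);
	for (i = 0; i < n; i++) {
		ret = attr_is_driver(attrs[i].id) ?
			driver_parse(&attrs[i], &tmp) :
			core_parse(&attrs[i], &tmp);
		if (ret)
			return ret;
	}
	*q = tmp;
	return 0;
}
```

The core never has to know what DRV_ATTR_ENTRY_FORMAT means; it only
needs the namespace bit to hand the attribute to the right parser.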
Building a split user/kernel driver is complicated and uAPI is one of
the biggest challenges :\
> > > - And as Nik mentioned: The new (yaml)model-to-generatedcode approach
> > > that is now common in generic netlink highly reduces developer effort.
> > > Although in my opinion we really need this stuff integrated into tools
> > > like iproute2..
> >
> > RDMA also has a DSL like scheme for defining schema, and centralized
> > parsing and validation. IMHO it's capability falls someplace between
> > the old netlink policy stuff and the new YAML stuff.
> >
>
> I meant the ability to start with a data model and generate code as
> being useful.
> Where can i find the RDMA DSL?
It is done with the C preprocessor instead of an external YAML
file. Look at drivers/infiniband/core/uverbs_std_types_mr.c at the
end. It describes a data model, but it is elaborated at runtime into
an efficient parse tree, not by using a code generator.
The schema is a more classical object-oriented RPC-type scheme where you
define objects, methods and then method parameters. The objects have
an entire kernel side infrastructure to manage their lifecycle and the
attributes have validation and parsing done prior to reaching the C
function implementing the method.
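To illustrate the general shape, here is a small self-contained sketch
of a preprocessor-declared schema with validation done before the
handler runs. The macros and types (ATTR_PTR_IN, DECLARE_METHOD, struct
method_spec, dispatch()) are invented for this example and are not the
real UVERBS macros. It also uses a simple linear lookup, where the real
code elaborates the schema into a parse tree at runtime.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ATTRS 8

struct attr_in { uint16_t id; size_t len; const void *data; };
struct attr_spec { uint16_t id; size_t len; bool mandatory; };
struct method_spec {
	int (*handler)(const struct attr_in *const *bundle);
	const struct attr_spec *specs;
	size_t nspecs;
};

/* Hypothetical schema macros: the data model lives in C initializers. */
#define ATTR_PTR_IN(_id, _type, _mandatory) \
	{ .id = (_id), .len = sizeof(_type), .mandatory = (_mandatory) }

#define DECLARE_METHOD(_name, _handler, ...)				\
	static const struct attr_spec _name##_specs[] = { __VA_ARGS__ };	\
	static const struct method_spec _name = {			\
		.handler = (_handler),					\
		.specs = _name##_specs,					\
		.nspecs = sizeof(_name##_specs) / sizeof(_name##_specs[0]), \
	}

/* Validate every attribute against the schema, then call the handler. */
static int dispatch(const struct method_spec *m,
		    const struct attr_in *in, size_t n)
{
	const struct attr_in *bundle[MAX_ATTRS] = { 0 };
	size_t i, s;

	if (m->nspecs > MAX_ATTRS)
		return -EINVAL;
	for (i = 0; i < n; i++) {
		for (s = 0; s < m->nspecs; s++)
			if (m->specs[s].id == in[i].id)
				break;
		if (s == m->nspecs || in[i].len != m->specs[s].len)
			return -EINVAL;	/* unknown ID or wrong size */
		bundle[s] = &in[i];
	}
	for (s = 0; s < m->nspecs; s++)
		if (m->specs[s].mandatory && !bundle[s])
			return -EINVAL;	/* required attribute missing */
	return m->handler(bundle);
}

enum { ATTR_LENGTH, ATTR_FLAGS };
static uint64_t last_length;

/* The handler sees only checked input; bundle[] is indexed by spec order. */
static int reg_mr(const struct attr_in *const *bundle)
{
	assert(bundle[0] != NULL);	/* mandatory, so already guaranteed */
	memcpy(&last_length, bundle[0]->data, sizeof(last_length));
	return 0;
}

DECLARE_METHOD(method_reg_mr, reg_mr,
	       ATTR_PTR_IN(ATTR_LENGTH, uint64_t, true),
	       ATTR_PTR_IN(ATTR_FLAGS, uint32_t, false));
```

A call that omits ATTR_LENGTH, or passes it with the wrong size, is
rejected in dispatch() and reg_mr() is never entered.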
I always thought it was netlink inspired, but more suited to building
a uAPI out of. Like you get actual system call names (e.g.
UVERBS_METHOD_REG_DMABUF_MR) that have actual C functions implementing
them. There is special help to implement object allocation and
destruction functions, and freedom to have as many methods per object
as make sense.
> I dont know enough about RDMA infra to comment but iiuc, you are
> saying that it is the control infrastructure (that sits in
> userspace?), that does all those things you mention, that is more
> important.
There is an entire object model in the kernel and it is linked into
the schema.
For instance in the above example we have a schema for an object
method like this:
  DECLARE_UVERBS_NAMED_METHOD(
      UVERBS_METHOD_REG_DMABUF_MR,
      UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_HANDLE,
                      UVERBS_OBJECT_MR,
                      UVERBS_ACCESS_NEW,
                      UA_MANDATORY),
      UVERBS_ATTR_IDR(UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE,
                      UVERBS_OBJECT_PD,
                      UVERBS_ACCESS_READ,
                      UA_MANDATORY),
That says it accepts two object handles MR and PD as input to the
method call.
The core code keeps track of all these object handles and validates
that the ID number given by userspace refers to the correct object, of
the correct type, in the correct state. It locks things against
concurrent destruction, and then gives a trivial way for the C method
implementation to pick up the object pointer:
  struct ib_pd *pd =
      uverbs_attr_get_obj(attrs, UVERBS_ATTR_REG_DMABUF_MR_PD_HANDLE);
Which can't fail because everything was already checked before we get
here. This is all designed to greatly simplify and make robust the
method implementations that are often in driver code.
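As a simplified illustration of that handle checking (hypothetical
names throughout, and a plain counter standing in for the real locking),
a handle table can refuse wrong IDs, wrong types and dead objects up
front, and block destruction while a method holds the object:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum obj_type { OBJ_PD, OBJ_MR };

/* Hypothetical per-handle bookkeeping kept by the core. */
struct uobject {
	enum obj_type type;
	bool live;
	unsigned int readers;	/* stand-in for a read lock */
	void *object;
};

#define NUM_HANDLES 8
static struct uobject handle_table[NUM_HANDLES];

/* Resolve a userspace ID; NULL unless it is live and of the wanted type. */
static struct uobject *handle_get_read(uint32_t id, enum obj_type type)
{
	struct uobject *u;

	if (id >= NUM_HANDLES)
		return NULL;
	u = &handle_table[id];
	if (!u->live || u->type != type)
		return NULL;
	u->readers++;
	return u;
}

/* Drop the reference taken by handle_get_read(). */
static void handle_put_read(struct uobject *u)
{
	assert(u->readers > 0);
	u->readers--;
}

/* Destruction is refused while any method still holds the object. */
static int handle_destroy(uint32_t id)
{
	if (id >= NUM_HANDLES || !handle_table[id].live)
		return -ENOENT;
	if (handle_table[id].readers)
		return -EBUSY;
	handle_table[id].live = false;
	return 0;
}
```

With the get done by the core before dispatch and the put done after,
the method body itself only ever sees a valid pointer of the right type.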
Jason