From: Dust Li <dust.li@linux.alibaba.com>
To: Andrew Lunn <andrew@lunn.ch>, Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Alexandra Winter <wintera@linux.ibm.com>,
Julian Ruess <julianr@linux.ibm.com>,
Wenjia Zhang <wenjia@linux.ibm.com>,
Jan Karcher <jaka@linux.ibm.com>,
Gerd Bayer <gbayer@linux.ibm.com>,
Halil Pasic <pasic@linux.ibm.com>,
"D. Wythe" <alibuda@linux.alibaba.com>,
Tony Lu <tonylu@linux.alibaba.com>,
Wen Gu <guwen@linux.alibaba.com>,
Peter Oberparleiter <oberpar@linux.ibm.com>,
David Miller <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
Thorsten Winkler <twinkler@linux.ibm.com>,
netdev@vger.kernel.org, linux-s390@vger.kernel.org,
Heiko Carstens <hca@linux.ibm.com>,
Vasily Gorbik <gor@linux.ibm.com>,
Alexander Gordeev <agordeev@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Sven Schnelle <svens@linux.ibm.com>,
Simon Horman <horms@kernel.org>
Subject: Re: [RFC net-next 0/7] Provide an ism layer
Date: Mon, 20 Jan 2025 14:21:12 +0800
Message-ID: <20250120062112.GL89233@linux.alibaba.com>
In-Reply-To: <85d94131-6c2b-41bd-ad93-c0e7c24801db@lunn.ch>
On 2025-01-17 21:29:09, Andrew Lunn wrote:
>On Fri, Jan 17, 2025 at 05:57:10PM +0100, Niklas Schnelle wrote:
>> On Fri, 2025-01-17 at 17:33 +0100, Andrew Lunn wrote:
>> > > Conceptually kind of, but the existing s390-specific ISM device is a
>> > > bit special. Let me start with some background. On s390, aka
>> > > mainframes, OSs including Linux run in so-called logical partitions
>> > > (LPARs), which are machine-hypervisor VMs that use partitioned,
>> > > non-paging memory. The fact that memory is partitioned is important
>> > > because it means LPARs cannot share physical memory by mapping it.
>> > >
>> > > Now at a high level an ISM device allows communication between two such
>> > > Linux LPARs on the same machine. The device is discovered as a PCI
>> > > device and allows Linux to take a buffer called a DMB, map it in the
>> > > IOMMU, and generate a token specific to another LPAR that also sees an
>> > > ISM device sharing the same virtual channel identifier (VCHID). This
>> > > token can then be transferred out of band (e.g. as part of an extended
>> > > TCP handshake in SMC-D) to that other system. With the token the other
>> > > system can use its ISM device to securely (authenticated by the token,
>> > > LPAR identity, and the IOMMU mapping) write into the original system's
>> > > DMB at throughput and latency similar to doing a memcpy() via a
>> > > syscall.
>> > >
>> > > On the implementation level the ISM device is actually a piece of
>> > > firmware and the write to a remote DMB is a special case of our PCI
>> > > Store Block instruction (no real MMIO on s390, instead there are
>> > > special instructions). Sadly there are a few more quirks but in
>> > > principle you can think of it as redirecting writes to a part of the
>> > > ISM PCI device's BAR to the DMB in the peer system, if that makes sense.
>> > > There's of course also a mechanism to cause an interrupt on the
>> > > receiver as the write completes.
>> >
>> > So the s390 details are interesting, but as you say, it is
>> > special. Ideally, all the special should be hidden away inside the
>> > driver.
>>
>> Yes, and it will be. There are some exceptions, e.g. for vfio-pci
>> pass-through, but that's not unusual and is why the concept of a
>> vfio-pci extension module already exists.
>>
>> >
>> > So please take a step back. What is the abstract model?
>>
>> I think my high level description may be a good start. The abstract
>> model is the ability to share a memory buffer (DMB) for writing by a
>> communication partner, authenticated by a DMB token. Plus features
>> such as triggering an interrupt on write or by explicit request. Then
>> Alibaba added optional support for what they call attaching the
>> buffer, which means it becomes truly shared between the peers but
>> which IBM's ISM can't support. Plus a few more optional pieces such as
>> VLANs and PNETIDs (don't ask). The idea for the new layer, then, is to
>> define this interface with operations and documentation.
>>
>> >
>> > Can the abstract model be mapped onto CXL? Could it be used with GPU
>> > vRAM? A SoC with real shared memory between a pool of CPUs?
>> >
>> > Andrew
>>
>> I'd think that yes, one could implement such a mechanism on top of CXL
>> as well as on a SoC. Or even with no special hardware between a host
>> and a DPU (e.g. via the PCIe endpoint framework). Basically anything
>> that can do DMA and raise IRQs between two OS instances.
>
>Is DMA part of the abstract model? That would suggest a true shared
>memory system is excluded, since that would not require DMA.
>
>Maybe take a look at subsystems like USB, I2C.
>
>usb_submit_urb(struct urb *urb, gfp_t mem_flags)
>
>A URB is a data structure with a block of memory associated with it,
>which contains the details to pass to the USB device.
>
>i2c_transfer(struct i2c_adapter *adap, struct i2c_msg *msgs, int num)
>
>*msgs points to num messages which get transferred to/from the I2C
>device.
>
>Could the high level API look like this? No DMA, no IRQ, no concept of
>a somewhat shared memory. Just an API which asks for a message to be
>sent to the other end? struct urb has some USB concepts in it, struct
>i2c_msg has some I2C concepts in it. A struct ism_msg would follow the
>same pattern, but does it need to care about the DMA, the IRQ, or the
>memory which is semi-shared?

I don't have a clear picture of what the API should look like yet, but I
believe it's possible to avoid DMA and IRQ. In fact, the current data
transfer API, ops->move_data() in include/linux/ism.h, already abstracts
away the DMA and IRQ details.
One thing we cannot hide, however, is whether the operation is
zero-copy or copy. This distinction is important because the sender may
reuse its buffer at different points in time: immediately after the call
in copy mode, but only once the transfer completes in zero-copy mode.
Best regards,
Dust