Linux CIFS filesystem development
* cifs RDMA restrictions
@ 2025-04-01 16:55 David Howells
  2025-04-01 18:02 ` Tom Talpey
  0 siblings, 1 reply; 9+ messages in thread
From: David Howells @ 2025-04-01 16:55 UTC (permalink / raw)
  To: Tom Talpey
  Cc: dhowells, Steve French, Stefan Metzmacher, Paulo Alcantara,
	Matthew Wilcox, linux-cifs

Hi Tom,

I'm looking at cleaning up the "struct page" instances in smbdirect.c as much
as possible.  Can you tell me what are the restrictions on the size of a
segment attached to an SGE?  For instance, by:

	static bool smb_set_sge(struct smb_extract_to_rdma *rdma,
				struct page *lowest_page, size_t off, size_t len)
	{
		struct ib_sge *sge = &rdma->sge[rdma->nr_sge];
		u64 addr;

		addr = ib_dma_map_page(rdma->device, lowest_page,
				       off, len, rdma->direction);
		if (ib_dma_mapping_error(rdma->device, addr))
			return false;

		sge->addr   = addr;
		sge->length = len;
		sge->lkey   = rdma->local_dma_lkey;
		rdma->nr_sge++;
		return true;
	}

Thanks,
David


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: cifs RDMA restrictions
  2025-04-01 16:55 cifs RDMA restrictions David Howells
@ 2025-04-01 18:02 ` Tom Talpey
  2025-04-01 19:34   ` David Howells
  0 siblings, 1 reply; 9+ messages in thread
From: Tom Talpey @ 2025-04-01 18:02 UTC (permalink / raw)
  To: David Howells
  Cc: Steve French, Stefan Metzmacher, Paulo Alcantara, Matthew Wilcox,
	linux-cifs

On 4/1/2025 12:55 PM, David Howells wrote:
> Hi Tom,
> 
> I'm looking at cleaning up the "struct page" instances in smbdirect.c as much
> as possible.  Can you tell me what are the restrictions on the size of a
> segment attached to an SGE?  For instance, by:

I started typing a long answer but I ended up not sure of the question.

A single sge is just a hunk of memory that is registered with the RDMA
provider. The routine you're quoting is generating the dma_addr that
the provider (adapter) will use to access the data so it's passing what
the ib_dma stuff wants.

The sge *list* is more interesting. The first sge can have a non-zero
offset, but anything that follows it can't. Such a "hole" would require
a subsequent registration, yielding handle(s) that might need multiple
RDMA operations. But it's the caller of the routine you quoted that has
already figured this out.
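
The rule above can be modelled as a standalone check, purely for
illustration (this is not kernel code; `struct frag`, `PG` and
`single_registration_ok()` are made-up names, and a 4096-byte page size
is assumed):

```c
#include <stdbool.h>
#include <stddef.h>

#define PG 4096 /* assumed page size for the illustration */

/* A fragment of a registration's page list: its starting offset within
 * its first page, and its length in bytes. */
struct frag {
	size_t off;
	size_t len;
};

/*
 * Return true if the fragment list can be covered by a single
 * registration: the first fragment may start at any offset, but every
 * later fragment must start page-aligned, and every fragment except the
 * last must end on a page boundary -- otherwise there is a "hole" that
 * forces a further registration.
 */
static bool single_registration_ok(const struct frag *f, int n)
{
	for (int i = 0; i < n; i++) {
		if (i > 0 && f[i].off != 0)
			return false;	/* later fragments must be aligned */
		if (i < n - 1 && (f[i].off + f[i].len) % PG != 0)
			return false;	/* interior fragments must end on a
					 * page boundary */
	}
	return true;
}
```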

I'm going to be testing at SambaXP next week, by the way. If you have
some code, let me know.

Tom.


> 	static bool smb_set_sge(struct smb_extract_to_rdma *rdma,
> 				struct page *lowest_page, size_t off, size_t len)
> 	{
> 		struct ib_sge *sge = &rdma->sge[rdma->nr_sge];
> 		u64 addr;
> 
> 		addr = ib_dma_map_page(rdma->device, lowest_page,
> 				       off, len, rdma->direction);
> 		if (ib_dma_mapping_error(rdma->device, addr))
> 			return false;
> 
> 		sge->addr   = addr;
> 		sge->length = len;
> 		sge->lkey   = rdma->local_dma_lkey;
> 		rdma->nr_sge++;
> 		return true;
> 	}
> 
> Thanks,
> David
> 
> 



* Re: cifs RDMA restrictions
  2025-04-01 18:02 ` Tom Talpey
@ 2025-04-01 19:34   ` David Howells
  2025-04-02 17:02     ` Tom Talpey
  0 siblings, 1 reply; 9+ messages in thread
From: David Howells @ 2025-04-01 19:34 UTC (permalink / raw)
  To: Tom Talpey
  Cc: dhowells, Steve French, Stefan Metzmacher, Paulo Alcantara,
	Matthew Wilcox, linux-cifs

Tom Talpey <tom@talpey.com> wrote:

> A single sge is just a hunk of memory that is registered with the RDMA
> provider. The routine you're quoting is generating the dma_addr that
> the provider (adapter) will use to access the data so it's passing what
> the ib_dma stuff wants.

The issue is that, currently, we pass individual page fragments to the RDMA
layer; none of them will cross a page boundary, and each will be at most
PAGE_SIZE in size.

However, in order to better support large folios and large bvec descriptors, I
could in theory set a segment much larger than PAGE_SIZE.

But, will the device handle that?  And can the DMA API map a single buffer
that big with the IOMMU?  And does it need aligning to a larger pow-of-2
granule?

David



* Re: cifs RDMA restrictions
  2025-04-01 19:34   ` David Howells
@ 2025-04-02 17:02     ` Tom Talpey
  2025-04-02 17:09       ` David Howells
  0 siblings, 1 reply; 9+ messages in thread
From: Tom Talpey @ 2025-04-02 17:02 UTC (permalink / raw)
  To: David Howells
  Cc: Steve French, Stefan Metzmacher, Paulo Alcantara, Matthew Wilcox,
	linux-cifs

On 4/1/2025 3:34 PM, David Howells wrote:
> Tom Talpey <tom@talpey.com> wrote:
> 
>> A single sge is just a hunk of memory that is registered with the RDMA
>> provider. The routine you're quoting is generating the dma_addr that
>> the provider (adapter) will use to access the data so it's passing what
>> the ib_dma stuff wants.
> 
> The issue is that currently, we pass individual page fragments to the RDMA
> layer, and none of them will cross a page boundary and they will be at most
> PAGE_SIZE in size.

Oh, that's suboptimal; maybe I'm thinking of the server code doing that.
It means the maximum size might be reduced if the list exceeds the
adapter's max sge count. Not really a huge deal though, it just means
a pipeline of extra RDMA operations, which flow efficiently anyway.
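
That pipeline is just ceiling division of the fragment count by the
adapter's sge limit. As a hedged, standalone illustration (the function
name is made up):

```c
/* Number of RDMA operations needed to post nr_frags fragments when the
 * adapter accepts at most max_sge scatter/gather elements per op. */
static int rdma_ops_needed(int nr_frags, int max_sge)
{
	return (nr_frags + max_sge - 1) / max_sge; /* ceiling division */
}
```

For example, a 40-fragment list on a 16-sge adapter needs 3 operations;
coalescing those fragments into a few large extents brings it back to 1.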

> However, in order to better support large folios and large bvec descriptors, I
> could in theory set a segment much larger than PAGE_SIZE.

That would be desirable. Fewer sge's.

> But, will the device handle that?  And can the DMA API map a single buffer
> that big with the IOMMU?  And does it need aligning to a larger pow-of-2
> granule?

Yes, maybe and no.
1 - The device just expects a base and length, so size is not an issue.
2 - The IOMMU is a platform question, if it's in use and the mapping
     allows, it should work.
3 - No specific alignment required.

With the obvious caveat that this answer presumes no bugs in the adapter
code.

Tom.

> 
> David
> 
> 



* Re: cifs RDMA restrictions
  2025-04-02 17:02     ` Tom Talpey
@ 2025-04-02 17:09       ` David Howells
       [not found]         ` <CAH2r5ms2J06tJr4VEVDgmcj_1uqOnhYzbC1ybrMWDm=f8wVDoA@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: David Howells @ 2025-04-02 17:09 UTC (permalink / raw)
  To: Tom Talpey
  Cc: dhowells, Steve French, Stefan Metzmacher, Paulo Alcantara,
	Matthew Wilcox, linux-cifs

Tom Talpey <tom@talpey.com> wrote:

> > But, will the device handle that?  And can the DMA API map a single buffer
> > that big with the IOMMU?  And does it need aligning to a larger pow-of-2
> > granule?
> 
> Yes, maybe and no.
> 1 - The device just expects a base and length, so size is not an issue.
> 2 - The IOMMU is a platform question, if it's in use and the mapping
>     allows, it should work.
> 3 - No specific alignment required.

Is there a way to query the RDMA layer as to its limits?

David



* Re: cifs RDMA restrictions
       [not found]         ` <CAH2r5ms2J06tJr4VEVDgmcj_1uqOnhYzbC1ybrMWDm=f8wVDoA@mail.gmail.com>
@ 2025-04-02 18:04           ` Tom Talpey
  2025-04-02 19:26             ` David Howells
  0 siblings, 1 reply; 9+ messages in thread
From: Tom Talpey @ 2025-04-02 18:04 UTC (permalink / raw)
  To: Steve French, David Howells
  Cc: Steve French, Stefan Metzmacher, Paulo Alcantara, Matthew Wilcox,
	CIFS

On 4/2/2025 1:13 PM, Steve French wrote:
>  > Is there a way to query the RDMA layer as to its limits?
> 
> Would the smb311 query interfaces fsctl response have enough info on the 
> adapter properties?

These are in the adapter attributes, queryable via the verbs.
The value is stored in client/smbdirect.c "max_frmr_depth", and
is initialized in smbd_ia_open() line 654-ish:

	info->max_frmr_depth = min_t(int,
		smbd_max_frmr_depth,
		info->id->device->attrs.max_fast_reg_page_list_len);

Note that while it's called the "page list", it's really just the
sge list, because each element can have more-or-less arbitrary
length. Typically 16 elements is table stakes for an adapter;
many are capable of much higher, though with certain performance
trivia.
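
The min_t() in the snippet above is just a clamp of the module parameter
against the device attribute; as a plain-C model (the function name here
is made up):

```c
/* Model of the min_t() clamp in smbd_ia_open(): the effective FRMR
 * depth is the smaller of the smbd_max_frmr_depth module parameter and
 * the device's advertised max_fast_reg_page_list_len attribute. */
static int effective_frmr_depth(int smbd_max_frmr_depth,
				int max_fast_reg_page_list_len)
{
	return smbd_max_frmr_depth < max_fast_reg_page_list_len ?
	       smbd_max_frmr_depth : max_fast_reg_page_list_len;
}
```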

Steve, this is a local-only value and not related at all to the
query interfaces fsctl, which is returning the server's adapter
list.

Tom.


> Thanks,
> 
> Steve
> 
> On Wed, Apr 2, 2025, 12:10 PM David Howells <dhowells@redhat.com> wrote:
> 
>     Tom Talpey <tom@talpey.com> wrote:
> 
>      > > But, will the device handle that?  And can the DMA API map a
>      > > single buffer that big with the IOMMU?  And does it need
>      > > aligning to a larger pow-of-2 granule?
>      >
>      > Yes, maybe and no.
>      > 1 - The device just expects a base and length, so size is not an
>      >     issue.
>      > 2 - The IOMMU is a platform question, if it's in use and the mapping
>      >     allows, it should work.
>      > 3 - No specific alignment required.
> 
>     Is there a way to query the RDMA layer as to its limits?
> 
>     David
> 
> 



* Re: cifs RDMA restrictions
  2025-04-02 18:04           ` Tom Talpey
@ 2025-04-02 19:26             ` David Howells
  2025-04-02 19:36               ` Tom Talpey
  0 siblings, 1 reply; 9+ messages in thread
From: David Howells @ 2025-04-02 19:26 UTC (permalink / raw)
  To: Tom Talpey
  Cc: dhowells, Steve French, Steve French, Stefan Metzmacher,
	Paulo Alcantara, Matthew Wilcox, CIFS

Hi Tom,

Btw, in smbd_recv_buf(), am I reading it right that it doesn't actually pass
buf off to the RDMA layer, but rather just copies into it from a response
popped off of the reassembly queue?

If that's the case, it should be possible to collapse smbd_recv_buf() and
smbd_recv_page() into smbd_recv() and just replace the memcpy() with
copy_to_iter().

David



* Re: cifs RDMA restrictions
  2025-04-02 19:26             ` David Howells
@ 2025-04-02 19:36               ` Tom Talpey
  2025-04-02 19:54                 ` [RFC PATCH] cifs: Collapse smbd_recv_*() into smbd_recv() and just use copy_to_iter() David Howells
  0 siblings, 1 reply; 9+ messages in thread
From: Tom Talpey @ 2025-04-02 19:36 UTC (permalink / raw)
  To: David Howells
  Cc: Steve French, Steve French, Stefan Metzmacher, Paulo Alcantara,
	Matthew Wilcox, CIFS

On 4/2/2025 3:26 PM, David Howells wrote:
> Hi Tom,
> 
> Btw, in smbd_recv_buf(), am I reading it right that it doesn't actually pass
> buf off to the RDMA layer, but rather just copies into it from a response
> popped off of the reassembly queue?
> 
> If that's the case, it should be possible to collapse smbd_recv_buf() and
> smbd_recv_page() into smbd_recv() and just replace the memcpy() with
> copy_to_iter().

I guess so. That code has been there all along and I haven't
looked into changing it. The comment at the top certainly says
it's something to consider.

The SMB Direct reassembly stuff is the non-RDMA path, where a
message payload is too large for a single ~1KB datagram, but
was not registered for direct placement. This would mainly be
fsctl's and small reads. I wouldn't get too worked up about
highly optimizing those.

Tom.


* [RFC PATCH] cifs: Collapse smbd_recv_*() into smbd_recv() and just use copy_to_iter()
  2025-04-02 19:36               ` Tom Talpey
@ 2025-04-02 19:54                 ` David Howells
  0 siblings, 0 replies; 9+ messages in thread
From: David Howells @ 2025-04-02 19:54 UTC (permalink / raw)
  To: Tom Talpey
  Cc: dhowells, Steve French, Steve French, Stefan Metzmacher,
	Paulo Alcantara, Matthew Wilcox, CIFS

Tom Talpey <tom@talpey.com> wrote:

> I guess so. That code has been there all long and I haven't
> looked into changing it. The comment at the top certainly says
> it's something to consider.
> 
> The SMB Direct reassembly stuff is the non-RDMA path, where a
> message payload is too large for a single ~1KB datagram, but
> was not registered for direct placement. This would mainly be
> fsctl's and small reads. I wouldn't get too worked up about
> highly optimizing those.

The idea is to get rid of struct page entirely from the kernel.

I've attached an untested patch that should clean up smbd_recv() for your
perusal.

David
---
commit dec3911780bf7849d637cd31e697b9b9ba2f67da
Author: David Howells <dhowells@redhat.com>
Date:   Wed Apr 2 20:27:26 2025 +0100

    cifs: Collapse smbd_recv_*() into smbd_recv() and just use copy_to_iter()
    
    Collapse smbd_recv_buf() and smbd_recv_page() into smbd_recv() and just use
    copy_to_iter() instead of memcpy().
    
    Signed-off-by: David Howells <dhowells@redhat.com>
    cc: Steve French <stfrench@microsoft.com>
    cc: Tom Talpey <tom@talpey.com>
    cc: Stefan Metzmacher <metze@samba.org>
    cc: Paulo Alcantara (Red Hat) <pc@manguebit.com>
    cc: Matthew Wilcox <willy@infradead.org>
    cc: linux-cifs@vger.kernel.org
    cc: netfs@lists.linux.dev
    cc: linux-fsdevel@vger.kernel.org

diff --git a/fs/smb/client/smbdirect.c b/fs/smb/client/smbdirect.c
index b0b7254661e9..4d3c3b50af99 100644
--- a/fs/smb/client/smbdirect.c
+++ b/fs/smb/client/smbdirect.c
@@ -1718,35 +1718,40 @@ struct smbd_connection *smbd_get_connection(
 	return ret;
 }
 
+
 /*
- * Receive data from receive reassembly queue
+ * Receive data from the transport's receive reassembly queue
  * All the incoming data packets are placed in reassembly queue
- * buf: the buffer to read data into
+ * iter: the buffer to read data into
  * size: the length of data to read
  * return value: actual data read
- * Note: this implementation copies the data from reassebmly queue to receive
+ *
+ * Note: this implementation copies the data from reassembly queue to receive
  * buffers used by upper layer. This is not the optimal code path. A better way
  * to do it is to not have upper layer allocate its receive buffers but rather
  * borrow the buffer from reassembly queue, and return it after data is
  * consumed. But this will require more changes to upper layer code, and also
  * need to consider packet boundaries while they still being reassembled.
  */
-static int smbd_recv_buf(struct smbd_connection *info, char *buf,
-		unsigned int size)
+int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
 {
 	struct smbd_response *response;
 	struct smbd_data_transfer *data_transfer;
+	size_t size = msg->msg_iter.count;
 	int to_copy, to_read, data_read, offset;
 	u32 data_length, remaining_data_length, data_offset;
 	int rc;
 
+	if (WARN_ON_ONCE(iov_iter_rw(&msg->msg_iter) == WRITE))
+		return -EINVAL; /* It's a bug in upper layer to get there */
+
 again:
 	/*
 	 * No need to hold the reassembly queue lock all the time as we are
 	 * the only one reading from the front of the queue. The transport
 	 * may add more entries to the back of the queue at the same time
 	 */
-	log_read(INFO, "size=%d info->reassembly_data_length=%d\n", size,
+	log_read(INFO, "size=%zd info->reassembly_data_length=%d\n", size,
 		info->reassembly_data_length);
 	if (info->reassembly_data_length >= size) {
 		int queue_length;
@@ -1782,9 +1787,12 @@ static int smbd_recv_buf(struct smbd_connection *info, char *buf,
 			 * transport layer is added
 			 */
 			if (response->first_segment && size == 4) {
-				unsigned int rfc1002_len =
+				unsigned int len =
 					data_length + remaining_data_length;
-				*((__be32 *)buf) = cpu_to_be32(rfc1002_len);
+				__be32 rfc1002_len = cpu_to_be32(len);
+				if (copy_to_iter(&rfc1002_len, sizeof(rfc1002_len),
+						 &msg->msg_iter) != sizeof(rfc1002_len))
+					return -EFAULT;
 				data_read = 4;
 				response->first_segment = false;
 				log_read(INFO, "returning rfc1002 length %d\n",
@@ -1793,10 +1801,9 @@ static int smbd_recv_buf(struct smbd_connection *info, char *buf,
 			}
 
 			to_copy = min_t(int, data_length - offset, to_read);
-			memcpy(
-				buf + data_read,
-				(char *)data_transfer + data_offset + offset,
-				to_copy);
+			if (copy_to_iter((char *)data_transfer + data_offset + offset,
+					 to_copy, &msg->msg_iter) != to_copy)
+				return -EFAULT;
 
 			/* move on to the next buffer? */
 			if (to_copy == data_length - offset) {
@@ -1841,6 +1848,8 @@ static int smbd_recv_buf(struct smbd_connection *info, char *buf,
 			 data_read, info->reassembly_data_length,
 			 info->first_entry_offset);
 read_rfc1002_done:
+		/* SMBDirect will read it all or nothing */
+		msg->msg_iter.count = 0;
 		return data_read;
 	}
 
@@ -1861,89 +1870,6 @@ static int smbd_recv_buf(struct smbd_connection *info, char *buf,
 	goto again;
 }
 
-/*
- * Receive a page from receive reassembly queue
- * page: the page to read data into
- * to_read: the length of data to read
- * return value: actual data read
- */
-static int smbd_recv_page(struct smbd_connection *info,
-		struct page *page, unsigned int page_offset,
-		unsigned int to_read)
-{
-	int ret;
-	char *to_address;
-	void *page_address;
-
-	/* make sure we have the page ready for read */
-	ret = wait_event_interruptible(
-		info->wait_reassembly_queue,
-		info->reassembly_data_length >= to_read ||
-			info->transport_status != SMBD_CONNECTED);
-	if (ret)
-		return ret;
-
-	/* now we can read from reassembly queue and not sleep */
-	page_address = kmap_atomic(page);
-	to_address = (char *) page_address + page_offset;
-
-	log_read(INFO, "reading from page=%p address=%p to_read=%d\n",
-		page, to_address, to_read);
-
-	ret = smbd_recv_buf(info, to_address, to_read);
-	kunmap_atomic(page_address);
-
-	return ret;
-}
-
-/*
- * Receive data from transport
- * msg: a msghdr point to the buffer, can be ITER_KVEC or ITER_BVEC
- * return: total bytes read, or 0. SMB Direct will not do partial read.
- */
-int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
-{
-	char *buf;
-	struct page *page;
-	unsigned int to_read, page_offset;
-	int rc;
-
-	if (iov_iter_rw(&msg->msg_iter) == WRITE) {
-		/* It's a bug in upper layer to get there */
-		cifs_dbg(VFS, "Invalid msg iter dir %u\n",
-			 iov_iter_rw(&msg->msg_iter));
-		rc = -EINVAL;
-		goto out;
-	}
-
-	switch (iov_iter_type(&msg->msg_iter)) {
-	case ITER_KVEC:
-		buf = msg->msg_iter.kvec->iov_base;
-		to_read = msg->msg_iter.kvec->iov_len;
-		rc = smbd_recv_buf(info, buf, to_read);
-		break;
-
-	case ITER_BVEC:
-		page = msg->msg_iter.bvec->bv_page;
-		page_offset = msg->msg_iter.bvec->bv_offset;
-		to_read = msg->msg_iter.bvec->bv_len;
-		rc = smbd_recv_page(info, page, page_offset, to_read);
-		break;
-
-	default:
-		/* It's a bug in upper layer to get there */
-		cifs_dbg(VFS, "Invalid msg type %d\n",
-			 iov_iter_type(&msg->msg_iter));
-		rc = -EINVAL;
-	}
-
-out:
-	/* SMBDirect will read it all or nothing */
-	if (rc > 0)
-		msg->msg_iter.count = 0;
-	return rc;
-}
-
 /*
  * Send data to transport
  * Each rqst is transported as a SMBDirect payload


