XFS NVMe RDMA?

public inbox for linux-xfs@vger.kernel.org

* XFS NVMe RDMA?
From: Dan Greenfield @ 2021-10-20 11:51 UTC
  To: linux-xfs

Dear XFS Experts,

   as you may or may not know, XFS on NVMe was used as part of the #1 entry to the IO500 benchmarks, as announced at ISC21. That entry swept away the other competition (albeit on large custom hardware), including systems using Intel’s DAOS with Optane/PMem, WekaIO, Lustre, GekkoFS and others.

There’s no publication associated with it, but there’s a video presenting how they did it:
https://www.youtube.com/watch?v=BJpkpA6hsDc

In it they describe how they used XFS for storing data chunks, and RocksDB for storing metadata. I’m trying to dig deeper on how they could have used XFS, and in particular how they could have used RDMA to access XFS data. The XFS DAX mode as far as I’m aware requires PMem rather than NVMe?

Do you have any ideas how they could have been able to utilise RDMA so that node A can directly access data chunks stored on XFS on node B? Is the only approach to mmap the chunk on node B and then RDMA it to/from node A?

Kind regards,
   Dan

* Re: XFS NVMe RDMA?
From: Christoph Hellwig @ 2021-10-20 16:33 UTC
  To: Dan Greenfield; +Cc: linux-xfs

On Wed, Oct 20, 2021 at 12:51:05PM +0100, Dan Greenfield wrote:
> Do you have any ideas how they could have been able to utilise RDMA so that node A can directly access data chunks stored on XFS on node B? Is the only approach to mmap the chunk on node B and then RDMA it to/from node A?

I'm not going to watch a video, but with the pNFS code other nodes can
access data on an XFS node directly using any SCSI transport.
For RDMA that would be SRP or iSCSI/iSER.

Note that I also have an unfinished draft to support NVMe, which has
an RDMA transport as well, and someone else could trivially reimplement
that as well.
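
A rough sketch of the idea behind the pNFS block/SCSI layout approach mentioned above, assuming a much simplified extent format: the server hands the client a list of extents that map file byte ranges to byte ranges on a shared block device, and the client then does its I/O against the device directly. The names `Extent` and `resolve` are hypothetical and are not kernel or NFS API; real layouts also carry extent state and device IDs, which are left out here.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    file_offset: int    # byte offset within the file
    length: int         # extent length in bytes
    device_offset: int  # byte offset on the shared block device


def resolve(extents, offset, length):
    """Translate a file byte range into (device_offset, length) pieces.

    `extents` must be sorted by file_offset and must not overlap.
    Raises ValueError if any part of the range is not covered.
    """
    starts = [e.file_offset for e in extents]
    pieces = []
    while length > 0:
        # Find the last extent starting at or before `offset`.
        i = bisect_right(starts, offset) - 1
        if i < 0 or offset >= extents[i].file_offset + extents[i].length:
            raise ValueError(f"offset {offset} not covered by layout")
        e = extents[i]
        skip = offset - e.file_offset
        n = min(length, e.length - skip)
        pieces.append((e.device_offset + skip, n))
        offset += n
        length -= n
    return pieces


# Hypothetical layout: two extents of one file on the shared device.
layout = [Extent(0, 4096, 1_000_000), Extent(4096, 8192, 5_000_000)]
pieces = resolve(layout, 2048, 4096)
```

Here a 4096-byte read starting at file offset 2048 crosses the extent boundary, so `resolve` returns two device ranges that the client would read over its SCSI (or, in principle, NVMe) transport without going through the server's data path.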

* Re: XFS NVMe RDMA?
From: Christoph Hellwig @ 2021-10-20 16:35 UTC
  To: Dan Greenfield; +Cc: linux-xfs

On Wed, Oct 20, 2021 at 09:33:43AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 20, 2021 at 12:51:05PM +0100, Dan Greenfield wrote:
> > Do you have any ideas how they could have been able to utilise RDMA so that node A can directly access data chunks stored on XFS on node B? Is the only approach to mmap the chunk on node B and then RDMA it to/from node A?
> 
> I'm not going to watch a video, but with the pNFS code other nodes can
> access data on an XFS node directly using any SCSI transport.
> For RDMA that would be SRP or iSCSI/iSER.
> 
> Note that I also have an unfinished draft to support NVMe, which has
> an RDMA transport as well, and someone else could trivially reimplement
> that as well.

Oh, and just FYI here are my slides on the pNFS support:

https://events.static.linuxfound.org/sites/events/files/slides/pnfs.pdf

* Re: XFS NVMe RDMA?
From: Dan Greenfield @ 2021-10-21  9:58 UTC
  To: Christoph Hellwig; +Cc: linux-xfs

Thank you Christoph! That’s very cool and makes sense.

Quick question: would I be correct in assuming that if I mmap()ed each 8-64MB chunk file from XFS and then did RDMA from the mmap region, the data would first be copied from NVMe into DRAM (does this bypass the CPU?) and *then* be copied across RDMA, rather than being copied directly from NVMe by RDMA? Or does O_DIRECT properly allow a bypass straight to NVMe for RDMA?

As for what this #1 entry is doing, each of the 512 nodes has its own separate XFS file system as well as its own separate RocksDB, both backed by NVMe. They are doing filesystem ops almost entirely in user mode (no kernel, no FUSE) by intercepting application binaries and rewriting syscall instructions into jumps into their user-mode library code, and using message passing to drive RDMA transfers between application memory and the remote node’s NVMe. I don’t believe they’ve modified XFS, nor are they using pNFS. I don’t know whether there’s any mechanism other than mmap() and then RDMA on that region?

- Dan


* Re: XFS NVMe RDMA?
From: Dan Greenfield @ 2021-10-27  8:23 UTC
  To: Christoph Hellwig; +Cc: linux-xfs

Thank you again Christoph for your earlier answer.

Quick question: would I be correct in assuming that if I mmap()ed each 8-64MB chunk file from XFS and then did RDMA from the mmap region, the data would first be copied from NVMe into DRAM (does this bypass the CPU?) and *then* be copied across RDMA, rather than being copied directly from NVMe by RDMA? Or does O_DIRECT properly allow a bypass straight to NVMe for RDMA?

As for what this #1 entry is doing, each of the 512 nodes has its own separate XFS file system as well as its own separate RocksDB, both backed by NVMe. They are doing filesystem ops almost entirely in user mode (no kernel, no FUSE) by intercepting application binaries and rewriting syscall instructions into jumps into their user-mode library code, and using message passing to drive RDMA transfers between application memory and the remote node’s NVMe. I don’t believe they’ve modified XFS, nor are they using pNFS. I don’t know whether there’s any mechanism other than mmap() and then RDMA on that region?

- Dan


* Re: XFS NVMe RDMA?
From: Christoph Hellwig @ 2021-11-12  6:40 UTC
  To: Dan Greenfield; +Cc: Christoph Hellwig, linux-xfs

Hi Dan,

sorry for the late reply.  This sat in my outbox for a while almost
fully written.

On Wed, Oct 27, 2021 at 09:23:42AM +0100, Dan Greenfield wrote:
> Quick question: would I be correct in assuming that if I mmap()ed each 8-64MB chunk file from XFS and then did RDMA from the mmap region, the data would first be copied from NVMe into DRAM (does this bypass the CPU?) and *then* be copied across RDMA, rather than being copied directly from NVMe by RDMA? Or does O_DIRECT properly allow a bypass straight to NVMe for RDMA?

The answer is: it depends.  mmap on a non-DAX file system always copies
into DRAM.  mmap on a DAX file system (that is, using pmem) can map
the "storage" directly into memory, in which case some RDMA setups can
DMA without a copy.

But with a plain old SSD there is no path available to userspace to
transfer without copying to DRAM.  If OTOH you use the in-kernel NVMe over
Fabrics target, it can transfer directly from the SSDs in some
circumstances.  Currently that does require using the SSD directly
without a file system, but with a little more work it could also work
using a file system.
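
To make the two userspace paths discussed above concrete, here is a minimal sketch, assuming Linux and a chunk file on an ordinary (non-DAX) file system. `read_chunk_mmap` maps the file, so the data is served from the page cache; `read_chunk_direct` uses O_DIRECT to skip the page cache, but the data still lands in a DRAM buffer first. In a real setup that buffer would be the one registered with the RDMA NIC. The function names and the 4096-byte alignment are assumptions for illustration, and the fallback to buffered I/O is only there because some file systems reject O_DIRECT.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment for O_DIRECT buffers, lengths and offsets


def read_chunk_mmap(path):
    """Map a chunk file read-only; pages are faulted in via the page cache."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:]


def read_chunk_direct(path, length, offset=0):
    """Read a chunk with O_DIRECT into a page-aligned DRAM buffer.

    `offset` must be a multiple of BLOCK. The request length is rounded
    up to a multiple of BLOCK, as O_DIRECT generally requires.
    """
    padded = -(-length // BLOCK) * BLOCK
    buf = mmap.mmap(-1, padded)  # anonymous mapping, page-aligned
    try:
        try:
            fd = os.open(path, os.O_RDONLY | getattr(os, "O_DIRECT", 0))
            try:
                n = os.preadv(fd, [buf], offset)
            finally:
                os.close(fd)
        except OSError:
            # File system rejected O_DIRECT: fall back to buffered reads.
            fd = os.open(path, os.O_RDONLY)
            try:
                n = os.preadv(fd, [buf], offset)
            finally:
                os.close(fd)
        return buf[:min(n, length)]
    finally:
        buf.close()
```

Either way the bytes pass through host memory before a NIC can send them, which matches the point above that a plain SSD offers userspace no copy-free path.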
