From: Christoph Hellwig <hch@infradead.org>
To: Sergey Bashirov <sergeybashirov@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Chuck Lever <chuck.lever@oracle.com>,
"J. Bruce Fields" <bfields@fieldses.org>,
Konstantin Evtushenko <koevtushenko@yandex.com>,
linux-nfs@vger.kernel.org
Subject: Re: [PATCH] nfsd: Implement large extent array support in pNFS
Date: Tue, 10 Jun 2025 23:55:09 -0700
Message-ID: <aEkoTdJttLesPv6M@infradead.org>
In-Reply-To: <75iqhi3to6gohuo2o4h3cewslcjzsfyrl7l7x2x3qyiaaecjci@uwoeqjubvqft>
On Tue, Jun 10, 2025 at 06:24:03PM +0300, Sergey Bashirov wrote:
> On Mon, Jun 09, 2025 at 10:39:06PM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 10, 2025 at 03:36:49AM +0300, Sergey Bashirov wrote:
> > > Together with Konstantin we spent a lot of time enabling the pNFS block
> > > volume setup. We have an SDS that can attach virtual block devices via
> > > vhost-user-blk to virtual machines, and we researched ways to create a
> > > parallel or distributed file system on top of this SDS. From this point
> > > of view, the pNFS block volume layout architecture looks quite suitable.
> > > So, we created several VMs, configured pNFS, and started testing. In fact,
> > > during our extensive testing, we encountered a variety of issues including
> > > deadlocks, livelocks, and corrupted files, which we eventually fixed.
> > > Now we have a working setup and we would like to clean up the code and
> > > contribute it.
> >
> > Can you share your reproducer scripts for client and server?
>
> I will try. First of all, you need two VMs connected to the same network.
> The hardest part is connecting a shared block device to both VMs
> with RW access.
I know the basic setup :)
> On the client side, you need to have the same /dev/vda device available,
> but not mounted. Additionally, you need the blkmapd service running.
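For reference, the client side of the setup described above might look roughly like this; the device, export, and service names are assumptions that vary by distro and configuration:

```shell
# Server side: the export needs the "pnfs" option in /etc/exports, e.g.
#   /export  *(rw,sync,no_subtree_check,pnfs)
#
# Client side: the shared device (/dev/vda here) must be visible but
# NOT mounted, and blkmapd must be running so the kernel can resolve
# block layout volume signatures.
systemctl start nfs-blkmap          # rpc.blkmapd; unit name may differ
mount -t nfs -o vers=4.1 server:/export /mnt/pnfs
```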
blkmapd is only needed for the block layout, which should generally be
avoided as it can't be used reliably because working fencing is
almost impossible.
> This should create 2.5k extents:
> fio --name=test --filename=/mnt/pnfs/test.raw --size=10M \
> --rw=randwrite --ioengine=libaio --direct=1 --bs=4k \
> --iodepth=128 --fallocate=none
Thanks! We should find a way to wire up the test coverage
somewhere, e.g. xfstests.
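Before wiring this into xfstests, it helps to verify that the reproducer really creates thousands of extents. On the server, filefrag can count them (the path is the one from the fio command above):

```shell
# filefrag prints a summary line like:
#   /mnt/pnfs/test.raw: 2500 extents found
filefrag /mnt/pnfs/test.raw
# Extract just the number for a scripted check:
filefrag /mnt/pnfs/test.raw | awk '{print $2}'
```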
> Troubleshooting: if any error occurs, the kernel falls back to NFSv3.
That should really be NFSv4.
> The client code also has problems with the block extent array. Currently
> the client tries to pack all the block extents it needs to commit into
> one RPC. And if there are too many of them, you will see
> "RPC: fragment too large" error on the server side. That's why
> we set rsize and wsize to 1M for now.
We'll really need to fix the client to split when going over the maximum
compound size.
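Until the client learns to split the commit, the workaround mentioned above is to cap transfer sizes at mount time. The 1 MiB values below are the ones from the report, not a recommendation:

```shell
# Cap rsize/wsize at 1 MiB so the encoded extent list stays below
# the server's maximum compound size.
mount -t nfs -o vers=4.1,rsize=1048576,wsize=1048576 \
    server:/export /mnt/pnfs
```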
> Another problem is that when the
> extent array does not fit into a single memory page, the client code
> discards the first page of encoded extents while reallocating a larger
> buffer to continue layout commit encoding. So even with this patch you
> may still notice that some files are not written correctly. But at least
> the server shouldn't send the badxdr error on a well-formed layout commit.
Eww, we'll need to fix that as well. Would be good to have a reproducer
for that case as well.
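The reallocation bug described above boils down to a classic pattern: when the encode buffer grows, the already-encoded prefix must be carried over. A minimal user-space sketch (not the actual kernel xdr code; all names here are illustrative) of the correct behavior:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch only. The reported client bug is equivalent to
 * skipping the memcpy below, which loses the first buffer's worth of
 * already-encoded extents. */
struct enc_buf {
	unsigned char *data;
	size_t len;	/* bytes encoded so far */
	size_t cap;	/* current capacity */
};

static int enc_buf_grow(struct enc_buf *b, size_t need)
{
	size_t newcap = b->cap ? b->cap : 4096;
	unsigned char *p;

	while (newcap - b->len < need)
		newcap *= 2;
	p = malloc(newcap);
	if (!p)
		return -1;
	/* Preserve what was already encoded before continuing. */
	if (b->len)
		memcpy(p, b->data, b->len);
	free(b->data);
	b->data = p;
	b->cap = newcap;
	return 0;
}
```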
> > Btw, also as a little warning: the current pNFS code means any client
> > can corrupt the XFS metadata. If you want to actually use the code
> > in production you'll probably want to figure out a way to either use
> > the RT device for exposed data (should be easy, but the RT allocator
> > sucks..), or find a way to otherwise restrict clients from overwriting
> > metadata.
>
> Thanks for the advice! Yes, we have had issues with XFS corruption
> especially when multiple clients were writing to the same file in
> parallel. Spent some time debugging layout recalls and client fencing
> to figure out what happened.
Normal operation should not cause that, what did you see there?
I mean a malicious client targeting metadata outside its layout.
Thread overview: 11+ messages
2025-06-04 13:07 [PATCH] nfsd: Implement large extent array support in pNFS Sergey Bashirov
2025-06-04 14:10 ` Chuck Lever
2025-06-04 14:54 ` Christoph Hellwig
2025-06-10 0:36 ` Sergey Bashirov
2025-06-10 5:39 ` Christoph Hellwig
2025-06-10 15:24 ` Sergey Bashirov
2025-06-11 6:55 ` Christoph Hellwig [this message]
2025-06-11 12:19 ` Sergey Bashirov
2025-06-12 6:33 ` Christoph Hellwig
2025-06-12 8:13 ` Sergey Bashirov
2025-06-11 13:53 ` Chuck Lever