From: Dai Ngo <dai.ngo@oracle.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>,
Dave Chinner <dgc@kernel.org>,
cem@kernel.org, linux-xfs@vger.kernel.org,
linux-nfs@vger.kernel.org
Subject: Re: [PATCH 1/1] xfs: fix overlapping extents returned for pNFS LAYOUTGET
Date: Tue, 19 May 2026 10:34:20 -0700 [thread overview]
Message-ID: <0e67fc75-382d-4698-ba93-39464e112e30@oracle.com> (raw)
In-Reply-To: <20260519145949.GH9555@frogsfrogsfrogs>
On 5/19/26 7:59 AM, Darrick J. Wong wrote:
> On Tue, May 19, 2026 at 06:44:18AM -0700, Dai Ngo wrote:
>> On 5/18/26 11:30 PM, Christoph Hellwig wrote:
>>> On Mon, May 18, 2026 at 12:55:17PM -0700, Dai Ngo wrote:
>>>> As shown, the file map changes. Entry# 7 and 8 before the NFSD calls
>>>> merged into entry#7 after the calls. So there must be some activities
>>>> that cause the map to change. I don't know whether the activities were
>>>> triggered by NFS or something in XFS or the block device layer.
>>> Hmm. We only insert layouts (and search for conflicts) after calling
>>> ->proc_layoutget. So this might be racy against unwritten extent
>>> conversion, or other writers, which is a bit of an issue. I guess
>>> we need to fix nfsd4_layoutget to insert an in-progress layout before
>>> calling into ->layouget.
>> I can't quite see the race condition regarding layoutget here. Could you
>> please explain?
> /me has no idea about nfs layouts but has thoughts anyway :)
>
> 1) xfs_fs_map_blocks only takes i_rwsem and XFS_ILOCK; it doesn't take
> the mmap invalidation lock. Does that mean that pagefaults could wander
> in and mess with the layout?
>
> (I think the race that Christoph is referring to here is that any other
> thread can wander in and change the mapping after a ->map_blocks call
> returns, because nfs isn't holding any kind of lock on the inode)
I think this is what happened.
Assuming the current map of the file as follow:
0000-4095: IOMAP_UNWRITTEN - block allocated
4096-8191: IOMAP_HOLE - no block allocated
8192-12287: IOMAP_UNWRITTEN - block allocated
NOTE:
. the bmapi_flags used by xfs_bmapi_read is set to XFS_BMAPI_ENTIRE.
. the nimaps count is set to 1. xfs_bmapi_read can return only 1 extent.
1st call from nfsd to xfs_fs_map_blocks(), specifying offset 0, length 12288 and
write is true.
Extent Returned:
- offset: 0
- length: 4096 bytes (one filesystem block)
- type: IOMAP_UNWRITTEN (existing preallocated extent)
- addr: physical block number of the first unwritten extent.
2nd call from nfsd to xfs_fs_map_blocks(), specifying offset 4096, length 8192
and write is true.
Returned Extent:
- offset: 4096
- length: 4096 bytes (one filesystem block)
- type: IOMAP_UNWRITTEN (freshly allocated unwritten extent)
- addr: physical block number assigned when filling the hole at 4096
3rd call from nfsd to xfs_fs_map_blocks(), specifying offset 8192, length
4096 and write is true.
Returned Extent
- offset: 0
- length: 12288 bytes (three contiguous blocks)
- type: IOMAP_UNWRITTEN
- addr: physical address of the merged unwritten allocation
After the second call, filling the hole merged the three unwritten segments
into a single extent (br_startoff = 0, br_blockcount = 3). When the third call
looks up the range starting at 8192, xfs_bmapi_read() finds that single extent.
Because XFS_BMAPI_ENTIRE is set, xfs_bmapi_trim_map() bypasses trimming and
returns the full extent to xfs_bmbt_to_iomap().
-Dai
>
> 2) Now that NFS apparently can serve up multiple mappings at once,
> should ->map_blocks pass in an array element count so that we can do
> multiple iomaps in a single lock cycle?
>
> 3) Do the reflink and realtime inode checks need to be re-assessed after
> grabbing the ilock since they can change?
>
> --D
>
>>>> However, based on this data I think it's better to change the bmapi_flags
>>>> from XFS_BMAPI_ENTIRE to '0' to address the overlap issue.
>>> We absolutely should be doing that as I said from my first reply.
>>> Still trying to understand what is going on here at a higher level,
>>> though.
>> I'll resubmit the patch series with both fixes combined: the uninitialized
>> imap handling in the error path and the replacement of XFS_BMAPI_ENTIRE
>> with 0.
>>
>> Thanks,
>> -Dai
>>
>>
next prev parent reply other threads:[~2026-05-19 17:34 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-12 17:21 [PATCH 1/1] xfs: fix overlapping extents returned for pNFS LAYOUTGET Dai Ngo
2026-05-12 17:34 ` Darrick J. Wong
2026-05-12 19:21 ` Dai Ngo
2026-05-13 7:01 ` Christoph Hellwig
2026-05-13 15:50 ` Dai Ngo
2026-05-13 17:28 ` Dai Ngo
2026-05-14 0:25 ` Darrick J. Wong
2026-05-14 17:19 ` Dai Ngo
2026-05-14 17:49 ` Darrick J. Wong
2026-05-15 21:39 ` Dave Chinner
2026-05-16 2:14 ` Dai Ngo
2026-05-18 5:09 ` Christoph Hellwig
2026-05-18 19:55 ` Dai Ngo
2026-05-19 6:30 ` Christoph Hellwig
2026-05-19 13:44 ` Dai Ngo
2026-05-19 14:59 ` Darrick J. Wong
2026-05-19 17:34 ` Dai Ngo [this message]
2026-05-20 8:24 ` Christoph Hellwig
2026-05-20 15:09 ` Dai Ngo
2026-05-20 16:48 ` Darrick J. Wong
2026-05-20 17:32 ` Dai Ngo
2026-05-20 22:08 ` Sergey Bashirov
2026-05-15 11:50 ` Christoph Hellwig
2026-05-15 11:49 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0e67fc75-382d-4698-ba93-39464e112e30@oracle.com \
--to=dai.ngo@oracle.com \
--cc=cem@kernel.org \
--cc=dgc@kernel.org \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.