From: John Garry <john.g.garry@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
hch@lst.de, viro@zeniv.linux.org.uk, brauner@kernel.org,
dchinner@redhat.com, jack@suse.cz, chandan.babu@oracle.com,
martin.petersen@oracle.com, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
tytso@mit.edu, jbongio@google.com, ojaswin@linux.ibm.com
Subject: Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
Date: Thu, 15 Feb 2024 09:53:47 +0000 [thread overview]
Message-ID: <c0d1d513-991b-4893-a132-66ee02c0c880@oracle.com> (raw)
In-Reply-To: <Zc1GwE/7QJisKZCX@dread.disaster.area>
>>
>> Yes, I was just looking at adding a mkfs option to format for atomic writes,
>> which would check physical information about the volume and whether it suits
>> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.
>
> FWIW, atomic writes need to be implemented in XFS in a way that
> isn't specific to the rtdev. There is no reason they cannot be
> applied to the data device (via superblock max atomic write size
> field or extent size hints and the align flag) so
> please don't get hung up on rtextsize as the only thing that atomic
> write alignment might apply to.
Sure
>
>>> Yes, mkfs allows the user to override the hardware configsi it
>>> probes, but it also warns when the override is doing something
>>> sub-optimal (like aligning all AG headers to the same disk in a
>>> stripe).
>>>
>>> IOWs, mkfs should be pulling this atomic write info from the
>>> hardware and configuring the filesysetm around that information.
>>> That's the target we should be aiming the kernel implementation at
>>> and optimising for - a filesystem that is correctly configured
>>> according to published hardware capability.
>>
>> Right
>>
>> So, for example, if the atomic writes option is set and the rtextsize set by
>> the user is so much larger than what HW can support in terms of atomic
>> writes, then we should let the user know about this.
>
> Well, this is part of the problem I mention above: you're focussing
> entirely on the rtdev setup and not the general "atomic writes
> require BMBT extent alignment constraints".
I'm really just saying what I was considering based on this series only.
>
> So, maybe, yes, we might want to warn that the rtextsize is much
> bigger than the atomic write size of that device, but now there's
> something else we need to take into account: the rtdev could have a
> different atomic write size comxpapred to the data device.
>
> What now?
>
> IOWs, focussing on the rtdev misses key considerations for making
> the functionality generic, and we most definitely don't want to have
> to rev the on disk format a second time to support atomic writes
> for the data device. Hence we likely need two variables for atomic
> write sizes in the superblock - one for the rtdev, and one for the
> data device. And this then feeds through to Darrick's multi-rtdev
> stuff - each rtdev will need to have an attribute that tracks this
> information.
ok
>>>
>>> What the patchset does is try to extend and infer things from
>>> existing allocation alignment constraints, but then not apply those
>>> constraints to critical extent state operations (pure BMBT
>>> modifications) that atomic writes also need constrained to work
>>> correctly and efficiently.
>>
>> Right. Previously I also did mention that we could explicitly request the
>> atomic write size per-inode, but a drawback is that this would require an
>> on-disk format change.
>
> We're already having to change the on-disk format for this (inode
> flag, superblock feature bit), so we really should be trying to make
> this generic and properly featured so that it can be used for
> anything that requires physical alignment of file data extents, not
> just atomic writes...
ok
...
>> Another motivation for this flag is that we can explicitly enable some
>> software-based atomic write support for an inode when the backing device
>> does not have HW support.
>
> That's orthogonal to the aligment issue. If the BMBT extents are
> always aligned in a way that is compatible with atomic writes, we
> don't need and aomtic writes flag to tell the filesystem it should
> do an atomic write.
Any instruction to do an atomic write should be encoded in the userspace
write operation. Or maybe the file open operation in future - I still
get questions about O_ATOMIC.
> That comes from userspace via the IOCB_ATOMIC
> flag.
>
> It is the IOCB_ATOMIC that triggers the software fallback when the
> hardware doesn't support atomic writes, not an inode flag.
To me, any such fallback seems like something which we should be
explicitly enabling.
> All the
> filesystem has to do is guarantee all extent manipulations are
> correctly aligned and IOCB_ATOMIC can always be executed regardless
> of whether it is hardware or software that does it.
>
>
>> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
>> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
>> similar for other OSes, and for those other OSes it may also mean some other
>> special alignment feature enabled. We want a consistent user experience.
>
> I don't care about other OSes - they don't implement the
> FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
> compatibility for the user API.
Other FSes need to support FS_IOC_FSSETXATTR for atomic writes. Or at
least should support it....
>
> Fundamentally, atomic writes are *not a property of the filesystem
> on-disk format*. They require extent tracking constraints (i.e.
> alignment), and that's the property of the filesystem on-disk format
> that we need to manipulate here.
>
> Users are not going to care if the setup ioctl for atomic writes
> is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
> already know they have to align their IO properly for atomic writes,
> so it's not like this is something they will be completely
> unfamiliar with.
>
> Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
> = max_atomic_write_size as a user interface to set up the inode for
> atomic writes is pretty concise and easy to use. It also isn't
> specific to atomic writes, and so this fucntionality can be used for
> anything that needs aligned extent manipulation for performant
> functionality.
This is where there seems to be a difference between how you would like
atomic writes to be enabled and how Christoph would, judging by this:
https://lore.kernel.org/linux-fsdevel/20240110091929.GA31003@lst.de/
I think that if the FS and the userspace util can between them figure
out what to do, then that is ok. This is something like what I proposed
previously:
xfs_io -c "atomic-writes 64K" mnt/file
where the userspace util would use for the FS_IOC_FSSETXATTR ioctl:
FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/
fsx.fsx_extsize
There is a question on the purpose of the FS_XFLAG_ATOMIC_WRITES flag,
but, to me, it does seem useful for future feature support.
We could just have FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_EXTSIZE w/
fsx.fsx_extsize, and the kernel auto-enables FS_XFLAG_ALIGN_EXTENTS, but
the other way seems better
>
>>> to behave in a particular way - forced alignment - not that atomic
>>> writes are being used in the filesystem....
>>>
>>> At this point, the filesystem can do all the extent modification
>>> alignment stuff that atomic writes require without caring if the
>>> block device supports atomic writes or even if the
>>> application is using atomic writes.
>>>
>>> This means we can test the BMBT functionality in fstests without
>>> requiring hardware (or emulation) that supports atomic writes - all
>>> we need to do is set the forced align flag, an extent size hint and
>>> go run fsx on it...
>>>
>>
>> The current idea was that the forcealign feature would be required for the
>> second phase for atomic write support - non-rtvol support. Darrick did send
>> that series out separately over the New Year's break.
>
> This is the wrong approach: this needs to be integrated into the
> same patchset so we can review the changes for atomic writes as a
> whole, not as two separate, independent on-disk format changes. The
> on-disk format change that atomic writes need is aligned BMBT extent
> manipulations, regardless of whether we are targeting the rtdev or
> the datadev, and trying to separate them is leading you down
> entirely the wrong path.
>
ok, fine.
BTW, going back to the original discussion on the extent zeroing and
your idea to do this in the iomap dio path, my impression is that we
require changes like this:
- in iomap_dio_bio_iter(), we need to zero out the extent according to
extsize (and not just FS block size)
- xfs_dio_write_end_io() -> xfs_iomap_write_unwritten() also needs to
consider this extent being written, and not assume a FS block
- per-inode locking similar to what is done in
xfs_file_dio_write_unaligned() - I need to check that further, though,
as I am not yet sure on how we decide to use this exclusive lock or not.
Does this sound sane?
Thanks,
John
next prev parent reply other threads:[~2024-02-15 9:54 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
2024-01-24 14:26 ` [PATCH 1/6] fs: iomap: Atomic write support John Garry
2024-02-02 17:25 ` Darrick J. Wong
2024-02-05 11:29 ` John Garry
2024-02-13 6:55 ` Christoph Hellwig
2024-02-13 8:20 ` John Garry
2024-02-15 11:08 ` John Garry
2024-02-13 18:08 ` Darrick J. Wong
2024-02-05 15:20 ` Pankaj Raghav (Samsung)
2024-02-05 15:41 ` John Garry
2024-01-24 14:26 ` [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
2024-02-02 17:57 ` Darrick J. Wong
2024-02-05 12:58 ` John Garry
2024-02-13 6:56 ` Christoph Hellwig
2024-02-13 17:08 ` Darrick J. Wong
2024-01-24 14:26 ` [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol John Garry
2024-02-02 17:52 ` Darrick J. Wong
2024-02-03 7:40 ` Ojaswin Mujoo
2024-02-05 12:51 ` John Garry
2024-02-13 17:22 ` Darrick J. Wong
2024-02-14 12:19 ` John Garry
2024-01-24 14:26 ` [PATCH 4/6] fs: xfs: Support atomic write for statx John Garry
2024-02-02 18:05 ` Darrick J. Wong
2024-02-05 13:10 ` John Garry
2024-02-13 17:37 ` Darrick J. Wong
2024-02-14 12:26 ` John Garry
2024-02-09 7:00 ` Ojaswin Mujoo
2024-02-09 17:30 ` John Garry
2024-02-12 11:48 ` Ojaswin Mujoo
2024-02-12 12:05 ` Ojaswin Mujoo
2024-01-24 14:26 ` [PATCH RFC 5/6] fs: xfs: iomap atomic write support John Garry
2024-02-02 18:47 ` Darrick J. Wong
2024-02-05 13:36 ` John Garry
2024-02-06 1:15 ` Dave Chinner
2024-02-06 9:53 ` John Garry
2024-02-07 0:06 ` Dave Chinner
2024-02-07 14:13 ` John Garry
2024-02-09 1:40 ` Dave Chinner
2024-02-09 12:47 ` John Garry
2024-02-13 23:41 ` Dave Chinner
2024-02-14 11:06 ` John Garry
2024-02-14 23:03 ` Dave Chinner
2024-02-15 9:53 ` John Garry [this message]
2024-02-13 17:50 ` Darrick J. Wong
2024-02-14 12:13 ` John Garry
2024-01-24 14:26 ` [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set John Garry
2024-02-02 18:06 ` Darrick J. Wong
2024-02-05 10:26 ` John Garry
2024-02-13 17:59 ` Darrick J. Wong
2024-02-14 12:36 ` John Garry
2024-02-21 17:00 ` Darrick J. Wong
2024-02-21 17:38 ` John Garry
2024-02-24 4:18 ` Darrick J. Wong
2024-02-09 7:14 ` [PATCH 0/6] block atomic writes for XFS Ojaswin Mujoo
2024-02-09 9:22 ` John Garry
2024-02-12 12:06 ` Ojaswin Mujoo
2024-02-13 7:22 ` Christoph Hellwig
2024-02-13 17:55 ` Darrick J. Wong
2024-02-14 7:45 ` Christoph Hellwig
2024-02-21 16:56 ` Darrick J. Wong
2024-02-23 6:57 ` Christoph Hellwig
2024-02-13 23:50 ` Dave Chinner
2024-02-14 7:38 ` Christoph Hellwig
2024-02-13 7:45 ` Ritesh Harjani
2024-02-13 8:41 ` John Garry
2024-02-13 9:10 ` Ritesh Harjani
2024-02-13 22:49 ` Dave Chinner
2024-02-14 10:10 ` John Garry
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c0d1d513-991b-4893-a132-66ee02c0c880@oracle.com \
--to=john.g.garry@oracle.com \
--cc=brauner@kernel.org \
--cc=chandan.babu@oracle.com \
--cc=david@fromorbit.com \
--cc=dchinner@redhat.com \
--cc=djwong@kernel.org \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=jbongio@google.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=ojaswin@linux.ibm.com \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).