From: John Garry <john.g.garry@oracle.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: hch@lst.de, viro@zeniv.linux.org.uk, brauner@kernel.org,
dchinner@redhat.com, jack@suse.cz, chandan.babu@oracle.com,
martin.petersen@oracle.com, linux-kernel@vger.kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
tytso@mit.edu, jbongio@google.com, ojaswin@linux.ibm.com
Subject: Re: [PATCH 1/6] fs: iomap: Atomic write support
Date: Mon, 5 Feb 2024 11:29:57 +0000
Message-ID: <2f91a71e-413b-47b6-8bc9-a60c86ed6f6b@oracle.com>
In-Reply-To: <20240202172513.GZ6226@frogsfrogsfrogs>
On 02/02/2024 17:25, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:40PM +0000, John Garry wrote:
>> Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
>> bio is being created and all the rules there need to be followed.
>>
>> It is the task of the FS iomap iter callbacks to ensure that the mapping
>> created adheres to those rules, like size is power-of-2, is at a
>> naturally-aligned offset, etc. However, checking for a single iovec, i.e.
>> iter type is ubuf, is done in __iomap_dio_rw().
>>
>> A write should only produce a single bio, so error when it doesn't.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> fs/iomap/direct-io.c | 21 ++++++++++++++++++++-
>> fs/iomap/trace.h | 3 ++-
>> include/linux/iomap.h | 1 +
>> 3 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index bcd3f8cf5ea4..25736d01b857 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -275,10 +275,12 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>> static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> struct iomap_dio *dio)
>> {
>> + bool atomic_write = iter->flags & IOMAP_ATOMIC;
>> const struct iomap *iomap = &iter->iomap;
>> struct inode *inode = iter->inode;
>> unsigned int fs_block_size = i_blocksize(inode), pad;
>> loff_t length = iomap_length(iter);
>> + const size_t iter_len = iter->len;
>> loff_t pos = iter->pos;
>> blk_opf_t bio_opf;
>> struct bio *bio;
>> @@ -381,6 +383,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> GFP_KERNEL);
>> bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>> bio->bi_ioprio = dio->iocb->ki_ioprio;
>> + if (atomic_write)
>> + bio->bi_opf |= REQ_ATOMIC;
>
> This really ought to be in iomap_dio_bio_opflags. Unless you can't pass
> REQ_ATOMIC to bio_alloc*, in which case there ought to be a comment
> about why.
I think moving it to iomap_dio_bio_opflags() should be ok.
>
> Also, what's the meaning of REQ_OP_READ | REQ_ATOMIC?
REQ_ATOMIC will be ignored for REQ_OP_READ. I'm following the same
policy as something like RWF_SYNC for a read.
However, if FMODE_CAN_ATOMIC_WRITE is unset, then REQ_ATOMIC will be
rejected for both REQ_OP_READ and REQ_OP_WRITE.
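To spell out that policy, here is a rough userspace model of the
decision. It is only a sketch: every model_* name is made up for
illustration, and it does not say which errno the rejection uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: a userspace model, not the kernel code. */
enum model_op { MODEL_OP_READ, MODEL_OP_WRITE };

enum model_result {
	MODEL_RES_PLAIN,	/* I/O proceeds, atomic flag not applied */
	MODEL_RES_ATOMIC,	/* I/O proceeds as an atomic write */
	MODEL_RES_REJECTED,	/* request fails before any bio is built */
};

static enum model_result model_atomic_policy(enum model_op op,
					     bool atomic_requested,
					     bool can_atomic_write)
{
	assert(op == MODEL_OP_READ || op == MODEL_OP_WRITE);

	if (!atomic_requested)
		return MODEL_RES_PLAIN;
	/* Without FMODE_CAN_ATOMIC_WRITE, reject reads and writes alike. */
	if (!can_atomic_write)
		return MODEL_RES_REJECTED;
	/* For a read the flag is ignored, as with RWF_SYNC. */
	if (op == MODEL_OP_READ)
		return MODEL_RES_PLAIN;
	return MODEL_RES_ATOMIC;
}
```

So model_atomic_policy(MODEL_OP_READ, true, true) gives a plain read,
while the same call with can_atomic_write == false is rejected.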
> Does that
> actually work? I don't know what that means, and "block: Add REQ_ATOMIC
> flag" says that's not a valid combination. I'll complain about this
> more below.
Please note that I do mention that this flag is only meaningful for
pwritev2(), like RWF_SYNC, here:
https://lore.kernel.org/linux-api/20240124112731.28579-3-john.g.garry@oracle.com/
>
>> +
>> bio->bi_private = dio;
>> bio->bi_end_io = iomap_dio_bio_end_io;
>>
>> @@ -397,6 +402,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>> }
>>
>> n = bio->bi_iter.bi_size;
>> + if (atomic_write && n != iter_len) {
>
> s/iter_len/orig_len/ ?
ok, I can change the name if you prefer
>
>> + /* This bio should have covered the complete length */
>> + ret = -EINVAL;
>> + bio_put(bio);
>> + goto out;
>> + }
>> if (dio->flags & IOMAP_DIO_WRITE) {
>> task_io_account_write(n);
>> } else {
>> @@ -554,12 +565,17 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>> struct blk_plug plug;
>> struct iomap_dio *dio;
>> loff_t ret = 0;
>> + bool is_read = iov_iter_rw(iter) == READ;
>> + bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
>
> Hrmm. So if the caller passes in an IOCB_ATOMIC iocb with a READ iter,
> we'll silently drop IOCB_ATOMIC and do the read anyway? That seems like
> a nonsense combination, but is that ok for some reason?
Please see above
>
>> trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
>>
>> if (!iomi.len)
>> return NULL;
>>
>> + if (atomic_write && !iter_is_ubuf(iter))
>> + return ERR_PTR(-EINVAL);
>
> Does !iter_is_ubuf actually happen?
Sure, if someone uses iovcnt > 1 for pwritev2().
Please see __import_iovec(): only when iovcnt == 1 do we create an iter
with iter_type == ITER_UBUF; for iovcnt > 1 we get iter_type == ITER_IOVEC.
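As a rough illustration of how iovcnt maps onto the check in
__iomap_dio_rw(), here is a small userspace model. The model_* names
are hypothetical stand-ins and not the kernel's iov_iter API, and the
model assumes iovcnt >= 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's ITER_UBUF / ITER_IOVEC. */
enum model_iter_type { MODEL_ITER_UBUF, MODEL_ITER_IOVEC };

/* One user iovec becomes a ubuf iter, more than one an iovec iter. */
static enum model_iter_type model_iter_type_for(size_t iovcnt)
{
	assert(iovcnt >= 1);	/* assumption of this model */
	return iovcnt == 1 ? MODEL_ITER_UBUF : MODEL_ITER_IOVEC;
}

/* Mirrors "if (atomic_write && !iter_is_ubuf(iter)) return -EINVAL". */
static bool model_atomic_iter_ok(size_t iovcnt)
{
	return model_iter_type_for(iovcnt) == MODEL_ITER_UBUF;
}
```

With this model, model_atomic_iter_ok(1) is true and
model_atomic_iter_ok(2) is false, which is the -EINVAL case above.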
> Why don't we support any of the
> other ITER_ types? Is it because hardware doesn't want vectored
> buffers?
It's related to how we can determine atomic_write_unit_max for the bdev.
We want to give a definitive max write value which we can guarantee to
always fit in a BIO, but not mandate any extra special iovec
length/alignment rules.
Without any iovec length or alignment rules (apart from the direct IO
rule that each iovec must be aligned to the bdev logical block size in
both address and length), if a user provides many iovecs, then we may
only be able to fit bdev LBS of data (typically 512B) in each BIO
vector, and thus we need to give a pessimistically low
atomic_write_unit_max value.
If we say that iovcnt max == 1, then we know that we can fit PAGE size
of data in each BIO vector (ignoring first/last vectors), and this will
give a reasonably large atomic_write_unit_max value.
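To put rough numbers on that, here is a back-of-envelope sketch. It
assumes 256 vectors per BIO, a 512B LBS and a 4KiB page, and rounds down
to a power of 2 because atomic writes must be a power-of-2 size. The
function names are made up for illustration, and the real limit also
depends on the queue limits of the device.

```c
#include <assert.h>
#include <stddef.h>

/* Largest power of 2 that is <= x (x must be non-zero). */
static size_t model_rounddown_pow2(size_t x)
{
	size_t p = 1;

	assert(x > 0);
	while (p <= x / 2)
		p *= 2;
	return p;
}

/* Many iovecs: each BIO vector may only carry one logical block. */
static size_t model_unit_max_many_iovecs(size_t nr_vecs, size_t lbs)
{
	return model_rounddown_pow2(nr_vecs * lbs);
}

/* One iovec: a full page per vector, ignoring first/last vectors. */
static size_t model_unit_max_single_iovec(size_t nr_vecs, size_t page_size)
{
	assert(nr_vecs > 2);
	return model_rounddown_pow2((nr_vecs - 2) * page_size);
}
```

Under those assumptions, model_unit_max_many_iovecs(256, 512) is
131072 (128KiB), while model_unit_max_single_iovec(256, 4096) is
524288 (512KiB).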
Note that we do now provide this iovcnt max value via statx, but always
return 1 for now. This was agreed with Christoph, please see:
https://lore.kernel.org/linux-nvme/20240117150200.GA30112@lst.de/
>
> I really wish there was more commenting on /why/ we do things here:
>
> if (iocb->ki_flags & IOCB_ATOMIC) {
> /* atomic reads do not make sense */
> if (iov_iter_rw(iter) == READ)
> return ERR_PTR(-EINVAL);
>
> /*
> * block layer doesn't want to handle handle vectors of
> * buffers when performing an atomic write i guess?
> */
> if (!iter_is_ubuf(iter))
> return ERR_PTR(-EINVAL);
>
> iomi.flags |= IOMAP_ATOMIC;
> }
ok, I can make this clearer.
Note: It would be nice if we could check this in
xfs_iomap_write_direct() or a common VFS helper (which
xfs_iomap_write_direct() calls), but iter is not available there.
I could just check iter_is_ubuf() on its own in the vfs rw path, but I
would like to keep the checks as close together as possible.
Thanks,
John