linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: John Garry <john.g.garry@oracle.com>
Cc: axboe@kernel.dk, brauner@kernel.org, viro@zeniv.linux.org.uk,
	jack@suse.cz, chandan.babu@oracle.com, dchinner@redhat.com,
	hch@lst.de, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, hare@suse.de,
	martin.petersen@oracle.com, catherine.hoang@oracle.com,
	kbusch@kernel.org
Subject: Re: [PATCH v5 3/7] fs: iomap: Atomic write support
Date: Fri, 30 Aug 2024 16:56:21 -0700	[thread overview]
Message-ID: <20240830235621.GS6216@frogsfrogsfrogs> (raw)
In-Reply-To: <112ec3a6-48b3-4596-9c20-e23288581ffd@oracle.com>

On Fri, Aug 30, 2024 at 04:48:36PM +0100, John Garry wrote:
> On 22/08/2024 21:30, Darrick J. Wong wrote:
> > > Then, the iomap->type/flag is either IOMAP_UNWRITTEN/IOMAP_F_DIRTY or
> > > IOMAP_MAPPED/IOMAP_F_DIRTY per iter. So the type is not consistent. However
> > > we will set IOMAP_DIO_UNWRITTEN in dio->flags, so call xfs_dio_write_endio()
> > > -> xfs_iomap_write_unwritten() for the complete FSB range.
> > > 
> > > Do you see a problem with this?
> 
> Sorry again for the slow response.
> 
> > > 
> > > Please see this also for some more background:
> > > https://urldefense.com/v3/__https://lore.kernel.org/linux-
> > > xfs/20240726171358.GA27612@lst.de/__;!!ACWV5N9M2RV99hQ! P5jeP96F8wAtRAblbm8NvRo8nlpil03vA26UMMX8qrYa4IzKecAAk7x1l1M45bBshC3Czxn1CkDXypNSAg$
> > Yes -- if you have a mix of written and unwritten blocks for the same
> > chunk of physical space:
> > 
> > 0      7
> > WUWUWUWU
> > 
> > the directio ioend function will start four separate transactions to
> > convert blocks 1, 3, 5, and 7 to written status.  If the system crashes
> > midway through, they will see this afterwards:
> > 
> > WWWWW0W0
> > 
> > IOWs, although the*disk write* was completed successfully, the mapping
> > updates were torn, and the user program sees a torn write.
> > > The most performant/painful way to fix this would be to make the whole
> > ioend completion a logged operation so that we could commit to updating
> > all the unwritten mappings and restart it after a crash.
> 
> could we make it logged for those special cases which we are interested in
> only?

Yes, though this is the long route -- you get to define a new ondisk log
item, build all the incore structures to process them, and then define a
new high level operation that uses the state encoded in that new log
item to run all the ioend completion transactions within that framework.
Also you get to add a new log incompat feature bit for this.

Perhaps we should analyze the cost of writing and QA'ing all that vs.
the amount of time saved in the handling of this corner case using one
of the less exciting options.

> > The least performant of course is to write zeroes at allocation time,
> > like we do for fsdax.
> 
> That idea was already proposed:
> https://lore.kernel.org/linux-xfs/ZcGIPlNCkL6EDx3Z@dread.disaster.area/

Yes, I'm aware.

> > A possible middle ground would be to detect IOMAP_ATOMIC in the
> > ->iomap_begin method, notice that there are mixed mappings under the
> > proposed untorn IO, and pre-convert the unwritten blocks by writing
> > zeroes to disk and updating the mappings
> 
> Won't that have the same issue as using XFS_BMAPI_ZERO, above i.e. zeroing
> during allocation?

Only if you set the forcealign size to > 1fsb and fail to write new
file data in forcealign units, even for non-untorn writes.  If all
writes to the file are aligned to the forcealign size then there's only
one extent conversion to be done, and that cannot be torn.

> > before handing the one single
> > mapping back to iomap_dio_rw to stage the untorn writes bio.  At least
> > you'd only be suffering that penalty for the (probable) corner case of
> > someone creating mixed mappings.
> 
> BTW, one issue I have with the sub-extent(or -alloc unit) zeroing from v4
> series is how the unwritten conversion has changed, like:
> 
> xfs_iomap_write_unwritten()
> {
> 	unsigned int rounding;
> 
> 	/* when converting anything unwritten, we must be spanning an alloc unit,
> so round up/down */
> 	if (rounding > 1) {
> 		offset_fsb = rounddown(rounding);
> 		count_fsb = roundup(rounding);
> 	}
> 
> 	...
> 	do {
> 		xfs_bmapi_write();
> 		...
> 		xfs_trans_commit();
> 	} while ();
> }
> 
> I'm not too happy with it and it seems a bit of a bodge, as I would rather
> we report the complete size written (user data and zeroes); then
> xfs_iomap_write_unwritten() would do proper individual block conversion.
> However, we do something similar for zeroing for sub-FSB writes. I am not
> sure if that is the same thing really, as we only round up to FSB size.
> Opinion?

xfs_iomap_write_unwritten is in the ioend path; that's not what I was
talking about.

I'm talking about a separate change to the xfs_direct_write_iomap_begin
function that would detect the case where the bmapi_read returns an
@imap that doesn't span the whole forcealign region, then repeatedly
calls bmapi_write(BMAPI_ZERO | BMAPI_CONVERT) on any unwritten mappings
within that file range until the original bmapi_read would return a
single written mapping.

--D

> 
> Thanks,
> John
> 
> 
> 

  reply	other threads:[~2024-08-30 23:56 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-17  9:47 [PATCH v5 0/7] block atomic writes for xfs John Garry
2024-08-17  9:47 ` [PATCH v5 1/7] block/fs: Pass an iocb to generic_atomic_write_valid() John Garry
2024-08-20 17:28   ` Darrick J. Wong
2024-08-17  9:47 ` [PATCH v5 2/7] fs: Export generic_atomic_write_valid() John Garry
2024-08-20 17:28   ` Darrick J. Wong
2024-08-17  9:47 ` [PATCH v5 3/7] fs: iomap: Atomic write support John Garry
2024-08-21 16:58   ` Darrick J. Wong
2024-08-22 15:29     ` John Garry
2024-08-22 20:30       ` Darrick J. Wong
2024-08-30 15:48         ` John Garry
2024-08-30 23:56           ` Darrick J. Wong [this message]
2024-09-03 12:43             ` John Garry
2024-08-17  9:47 ` [PATCH v5 4/7] xfs: Support FS_XFLAG_ATOMICWRITES for forcealign John Garry
2024-08-21 17:07   ` Darrick J. Wong
2024-08-22 17:45     ` John Garry
2024-08-22 20:38       ` Darrick J. Wong
2024-08-23  8:39         ` John Garry
2024-08-23 16:03           ` Darrick J. Wong
2024-08-17  9:47 ` [PATCH v5 5/7] xfs: Support atomic write for statx John Garry
2024-08-21 17:09   ` Darrick J. Wong
2024-08-17  9:47 ` [PATCH v5 6/7] xfs: Validate atomic writes John Garry
2024-08-21 17:10   ` Darrick J. Wong
2024-08-17  9:48 ` [PATCH v5 7/7] xfs: Support setting FMODE_CAN_ATOMIC_WRITE John Garry
2024-08-21 17:11   ` Darrick J. Wong
2024-08-22 18:04     ` John Garry
2024-08-22 20:44       ` Darrick J. Wong
2024-08-23 10:41         ` John Garry
2024-08-23 15:52           ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240830235621.GS6216@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=brauner@kernel.org \
    --cc=catherine.hoang@oracle.com \
    --cc=chandan.babu@oracle.com \
    --cc=dchinner@redhat.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=jack@suse.cz \
    --cc=john.g.garry@oracle.com \
    --cc=kbusch@kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).