From: Jan Kara <jack@suse.cz>
To: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@infradead.org>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap
Date: Thu, 30 Nov 2023 17:01:56 +0100 [thread overview]
Message-ID: <20231130160156.xlorirxomdgpko5o@quack3> (raw)
In-Reply-To: <878r6fi6jy.fsf@doe.com>
On Thu 30-11-23 21:20:41, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
>
> > On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >>
> >> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> >> >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> >>
> >> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> >> >
> >> >> >> Christoph Hellwig <hch@infradead.org> writes:
> >> >> >>
> >> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >> >> >>>> care as much about ext2 writeback performance but it should not be that
> >> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >> >> >>>> you give that a try (as a followup "performance improvement" patch).
> >> >>
> >> >> ok. So I am re-thinknig over this on why will a filesystem like ext2
> >> >> would require sequence counter check. We don't have collapse range
> >> >> or COW sort of operations, it is only the truncate which can race,
> >> >> but that should be taken care by folio_lock. And even if the partial
> >> >> truncate happens on a folio, since the logical to physical block mapping
> >> >> never changes, it should not matter if the writeback wrote data to a
> >> >> cached entry, right?
> >> >
> >> > Yes, so this is what I think I've already mentioned. As long as we map just
> >> > the block at the current offset (or a block under currently locked folio),
> >> > we are fine and we don't need any kind of sequence counter. But as soon as
> >> > we start caching any kind of mapping in iomap_writepage_ctx we need a way
> >> > to protect from races with truncate. So something like the sequence counter.
> >> >
> >>
> >> Why do we need to protect from the race with truncate, is my question here.
> >> So, IMO, truncate will truncate the folio cache first before releasing the FS
> >> blocks. Truncation of the folio cache and the writeback path are
> >> protected using folio_lock()
> >> Truncate will clear the dirty flag of the folio before
> >> releasing the folio_lock() right, so writeback will not even proceed for
> >> folios which are not marked dirty (even if we have a cached wpc entry for
> >> which folio is released from folio cache).
> >>
> >> Now coming to the stale cached wpc entry for which truncate is doing a
> >> partial truncation. Say, truncate ended up calling
> >> truncate_inode_partial_folio(). Now for such folio (it remains dirty
> >> after partial truncation) (for which there is a stale cached wpc entry),
> >> when writeback writes to the underlying stale block, there is no harm
> >> with that right?
> >>
> >> Also this will "only" happen for folio which was partially truncated.
> >> So why do we need to have sequence counter for protecting against this
> >> race is my question.
> >
> > That's a very good question and it took me a while to formulate my "this
> > sounds problematic" feeling into a particular example :) We can still have
> > a race like:
> >
> > write_cache_pages()
> > cache extent covering 0..1MB range
> > write page at offset 0k
> > truncate(file, 4k)
> > drops all relevant pages
> > frees fs blocks
> > pwrite(file, 4k, 4k)
> > creates dirty page in the page cache
> > writes page at offset 4k to a stale block
>
> :) Nice!
> Truncate followed by an append write could cause this race with writeback
> happening in parallel. And this might cause data corruption.
>
> Thanks again for clearly explaining the race :)
>
> So I think just having a seq. counter (no other locks required), should
> be ok to prevent this race. Since mostly what we are interested in is
> whether the stale cached block mapping got changed for the folio which
> writeback is going to continue writing to.
>
> i.e. the race only happens when 2 of these operation happen while we
> have a cached block mapping for a folio -
> - a new physical block got allocated for the same logical offset to
> which the previous folio belongs to
> - the folio got re-instatiated in the page cache as dirty with the new
> physical block mapping at the same logical offset of the file.
>
> Now when the writeback continues, it will iterate over the same
> dirty folio in question, lock it, check for ->map_blocks which will
> notice a changed seq counter and do ->get_blocks again and then
> we do submit_bio() to the right physical block.
>
> So, it should be ok, if we simply update the seq counter everytime a
> block is allocated or freed for a given inode... because when we
> check the seq counter, we know the folio is locked and the other
> operation of re-instating a new physical block mapping for a given folio
> can also only happen while under a folio lock.
Yes.
> OR, one other approach to this can be...
>
> IIUC, for a new block mapping for the same folio, the folio can be made to
> get invalidated once in between right? So should this be handled in folio
> cache somehow to know if for a given folio the underlying mapping might
> have changed for which iomap should again query ->map_blocks() ?
> ... can that also help against unnecessary re-validations against cached
> block mappings?
This would be difficult because the page cache handling code does not know
someone has cached block mapping somewhere...
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2023-11-30 16:01 UTC|newest]
Thread overview: 39+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <cover.1700506526.git.ritesh.list@gmail.com>
2023-11-20 19:05 ` [RFC 1/3] ext2: Fix ki_pos update for DIO buffered-io fallback case Ritesh Harjani (IBM)
2023-11-21 4:39 ` Christoph Hellwig
2023-11-21 5:36 ` Ritesh Harjani
2023-11-22 6:51 ` Christoph Hellwig
2023-11-20 19:05 ` [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap Ritesh Harjani (IBM)
2023-11-20 20:00 ` Matthew Wilcox
2023-11-21 4:44 ` Christoph Hellwig
2023-11-21 5:56 ` Ritesh Harjani
2023-11-21 6:08 ` Christoph Hellwig
2023-11-21 6:15 ` Ritesh Harjani
2023-11-22 12:29 ` Jan Kara
2023-11-22 13:11 ` Christoph Hellwig
2023-11-22 20:26 ` Ritesh Harjani
2023-11-30 3:24 ` Ritesh Harjani
2023-11-30 4:22 ` Matthew Wilcox
2023-11-30 4:37 ` Ritesh Harjani
2023-11-30 4:30 ` Christoph Hellwig
2023-11-30 5:27 ` Ritesh Harjani
2023-11-30 8:22 ` Zhang Yi
2023-11-30 7:45 ` Ritesh Harjani
2023-11-30 10:18 ` Jan Kara
2023-11-30 10:59 ` Ritesh Harjani
2023-11-30 14:08 ` Jan Kara
2023-11-30 15:50 ` Ritesh Harjani
2023-11-30 16:01 ` Jan Kara [this message]
2023-11-30 16:03 ` Matthew Wilcox
2023-12-01 23:09 ` Dave Chinner
2023-12-05 15:22 ` Ritesh Harjani
2023-12-07 8:58 ` Jan Kara
2023-11-22 22:26 ` Dave Chinner
2023-11-23 4:09 ` Darrick J. Wong
2023-11-23 7:09 ` Christoph Hellwig
2023-11-29 5:37 ` Darrick J. Wong
2023-11-29 6:32 ` Christoph Hellwig
2023-11-29 9:19 ` Dave Chinner
2023-11-23 7:02 ` Christoph Hellwig
2023-11-22 20:25 ` Ritesh Harjani
2023-11-20 19:05 ` [RFC 3/3] ext2: Enable large folio support Ritesh Harjani (IBM)
2023-11-20 20:00 ` Matthew Wilcox
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231130160156.xlorirxomdgpko5o@quack3 \
--to=jack@suse.cz \
--cc=hch@infradead.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=ritesh.list@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).