From: Jan Kara <jack@suse.cz>
To: Ritesh Harjani <ritesh.list@gmail.com>
Cc: Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@infradead.org>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC 2/3] ext2: Convert ext2 regular file buffered I/O to use iomap
Date: Thu, 30 Nov 2023 17:01:56 +0100
Message-ID: <20231130160156.xlorirxomdgpko5o@quack3>
In-Reply-To: <878r6fi6jy.fsf@doe.com>

On Thu 30-11-23 21:20:41, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >> 
> >> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> >> >> Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> >> 
> >> >> > Ritesh Harjani (IBM) <ritesh.list@gmail.com> writes:
> >> >> >
> >> >> >> Christoph Hellwig <hch@infradead.org> writes:
> >> >> >>
> >> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >> >> >>>> care as much about ext2 writeback performance but it should not be that
> >> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >> >> >>>> you give that a try (as a followup "performance improvement" patch)?
> >> >> 
> >> >> ok. So I am re-thinking why a filesystem like ext2 would require a
> >> >> sequence counter check. We don't have collapse range or COW style
> >> >> operations; it is only truncate which can race, but that should be
> >> >> taken care of by the folio_lock. And even if a partial truncate
> >> >> happens on a folio, since the logical to physical block mapping
> >> >> never changes, it should not matter if writeback wrote data using a
> >> >> cached mapping entry, right?
> >> >
> >> > Yes, this is what I think I've already mentioned. As long as we map just
> >> > the block at the current offset (or a block under the currently locked
> >> > folio), we are fine and we don't need any kind of sequence counter. But
> >> > as soon as we start caching any kind of mapping in iomap_writepage_ctx,
> >> > we need a way to protect against races with truncate - hence something
> >> > like the sequence counter.
> >> >
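Side note, to make the caching concrete: iomap_writepage_ctx embeds the
cached mapping, and XFS wraps it in a container to remember which
generation of the mapping it cached (sketched from memory, so take the
exact field layout with a grain of salt):

	struct iomap_writepage_ctx {
		struct iomap		iomap;	/* cached block mapping */
		struct iomap_ioend	*ioend;
		const struct iomap_writeback_ops *ops;
	};

	/* XFS, fs/xfs/xfs_aops.c */
	struct xfs_writepage_ctx {
		struct iomap_writepage_ctx ctx;
		unsigned int		data_seq;
		unsigned int		cow_seq;
	};

It is this cached wpc->iomap that can go stale under us.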
> >> 
> >> Why do we need to protect against the race with truncate is my question
> >> here. IMO, truncate will truncate the folio cache first, before releasing
> >> the FS blocks, and truncation of the folio cache and the writeback path
> >> are serialized via the folio_lock(). Truncate will clear the dirty flag
> >> of the folio before releasing the folio_lock(), right? So writeback will
> >> not even proceed for folios which are not marked dirty (even if we have a
> >> cached wpc entry for a folio that was released from the folio cache).
> >> 
> >> Now coming to a stale cached wpc entry for which truncate is doing a
> >> partial truncation - say truncate ended up calling
> >> truncate_inode_partial_folio(). For such a folio (it remains dirty after
> >> the partial truncation, and there is a stale cached wpc entry for it),
> >> when writeback writes to the underlying stale block, there is no harm in
> >> that, right?
> >> 
> >> Also, this will "only" happen for a folio which was partially truncated.
> >> So why we need a sequence counter to protect against this race is my
> >> question.
> >
> > That's a very good question and it took me a while to formulate my "this
> > sounds problematic" feeling into a particular example :) We can still have
> > a race like:
> >
> > write_cache_pages()
> >   cache extent covering 0..1MB range
> >   write page at offset 0k
> > 					truncate(file, 4k)
> > 					  drops all relevant pages
> > 					  frees fs blocks
> > 					pwrite(file, 4k, 4k)
> > 					  creates dirty page in the page cache
> >   writes page at offset 4k to a stale block
> 
> :) Nice! 
> Truncate followed by an append write could cause this race with writeback
> happening in parallel. And this might cause data corruption.
> 
> Thanks again for clearly explaining the race :) 
> 
> So I think just having a seq counter (no other locks required) should be
> enough to prevent this race, since what we are mostly interested in is
> whether the cached block mapping became stale for the folio which
> writeback is going to continue writing to.
> 
> i.e. the race only happens when these 2 operations happen while we
> have a cached block mapping for a folio -
> - a new physical block got allocated for the same logical offset that
> the previous folio belonged to
> - the folio got re-instantiated in the page cache as dirty, with the new
> physical block mapping at the same logical offset of the file.
> 
> Now when the writeback continues, it will iterate over the same dirty
> folio in question, lock it, and call ->map_blocks, which will notice the
> changed seq counter, do ->get_blocks again, and then we do submit_bio()
> to the right physical block.
> 
> So it should be ok if we simply update the seq counter every time a
> block is allocated or freed for a given inode... because when we check
> the seq counter, we know the folio is locked, and the other operation of
> instantiating a new physical block mapping for a given folio can also
> only happen under the folio lock.

Yes.
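
For concreteness, the whole scheme could look something like the sketch
below. Take it with a grain of salt - the i_blkseq field, the ext2_wpc
container and the helper names are made up, this is not actual ext2
code; XFS does the equivalent in its ->map_blocks implementation (see
xfs_imap_valid()):

	/* Made-up container, modelled after xfs_writepage_ctx. */
	struct ext2_wpc {
		struct iomap_writepage_ctx ctx;
		unsigned int blkseq;	/* mapping generation we cached */
	};
	#define EXT2_WPC(w)	container_of((w), struct ext2_wpc, ctx)

	/* Bump whenever a block is allocated or freed for the inode. */
	static void ext2_inc_blkseq(struct inode *inode)
	{
		WRITE_ONCE(EXT2_I(inode)->i_blkseq,
			   READ_ONCE(EXT2_I(inode)->i_blkseq) + 1);
	}

	/* ->map_blocks, called with the folio locked. */
	static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
					 struct inode *inode, loff_t offset)
	{
		struct ext2_wpc *ewpc = EXT2_WPC(wpc);

		/* Cached mapping covers 'offset' and is still valid? */
		if (offset >= wpc->iomap.offset &&
		    offset < wpc->iomap.offset + wpc->iomap.length &&
		    ewpc->blkseq == READ_ONCE(EXT2_I(inode)->i_blkseq))
			return 0;

		/*
		 * Sample the counter before the block lookup so that a
		 * mapping change racing with the lookup makes the check
		 * above fail on the next folio and we remap instead of
		 * writing to a stale block.
		 */
		ewpc->blkseq = READ_ONCE(EXT2_I(inode)->i_blkseq);
		/* made-up helper: look up blocks, fill wpc->iomap */
		return ext2_map_blocks_for_wb(wpc, inode, offset);
	}

No extra lock is needed for the check itself: as you say, it runs under
the folio lock, and re-instantiating the folio with a new mapping also
happens under the folio lock.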

> OR, one other approach to this could be...
> 
> IIUC, to get a new block mapping for the same folio, the folio must have
> been invalidated once in between, right? So could this be handled in the
> folio cache somehow, so that we know the underlying mapping of a given
> folio might have changed and iomap should query ->map_blocks() again?
> ... could that also help avoid unnecessary re-validations of cached
> block mappings?

This would be difficult because the page cache handling code does not know
that someone has cached a block mapping somewhere...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
