From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Chi Zhiling <chizhiling@163.com>,
Amir Goldstein <amir73il@gmail.com>,
Dave Chinner <david@fromorbit.com>,
cem@kernel.org, linux-xfs@vger.kernel.org,
linux-kernel@vger.kernel.org, Chi Zhiling <chizhiling@kylinos.cn>,
John Garry <john.g.garry@oracle.com>
Subject: Re: [PATCH] xfs: Remove i_rwsem lock in buffered read
Date: Mon, 13 Jan 2025 08:40:51 -0500 [thread overview]
Message-ID: <Z4UX4zyc8n8lGM16@bfoster> (raw)
In-Reply-To: <20250113024401.GU1306365@frogsfrogsfrogs>
On Sun, Jan 12, 2025 at 06:44:01PM -0800, Darrick J. Wong wrote:
> On Sun, Jan 12, 2025 at 06:05:37PM +0800, Chi Zhiling wrote:
> > On 2025/1/11 01:07, Amir Goldstein wrote:
> > > On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Wed, Jan 08, 2025 at 09:35:47AM -0800, Darrick J. Wong wrote:
> > > > > On Wed, Jan 08, 2025 at 03:43:04PM +0800, Chi Zhiling wrote:
> > > > > > On 2025/1/7 20:13, Amir Goldstein wrote:
> > > > > > > Dave's answer to this question was that there are some legacy applications
> > > > > > > (database applications IIRC) on production systems that do rely on the fact
> > > > > > > that xfs provides this semantics and on the prerequisite that they run on xfs.
> > > > > > >
> > > > > > > However, it was noted that:
> > > > > > > 1. Those application do not require atomicity for any size of IO, they
> > > > > > > typically work in I/O size that is larger than block size (e.g. 16K or 64K)
> > > > > > > and they only require no torn writes for this I/O size
> > > > > > > 2. Large folios and iomap can usually provide this semantics via folio lock,
> > > > > > > but application has currently no way of knowing if the semantics are
> > > > > > > provided or not
> > > > > >
> > > > > > To be honest, it would be best if the folio lock could provide such
> > > > > > semantics, as it would not cause any potential problems for the
> > > > > > application, and we have hope to achieve concurrent writes.
> > > > > >
> > > > > > However, I am not sure if this is easy to implement and will not cause
> > > > > > other problems.
> > > > >
> > > > > Assuming we're not abandoning POSIX "Thread Interactions with Regular
> > > > > File Operations", you can't use the folio lock for coordination, for
> > > > > several reasons:
> > > > >
> > > > > a) Apps can't directly control the size of the folio in the page cache
> > > > >
> > > > > b) The folio size can (theoretically) change underneath the program at
> > > > > any time (reclaim can take your large folio and the next read gets a
> > > > > smaller folio)
> > > > >
> > > > > c) If your write crosses folios, you've just crossed a synchronization
> > > > > boundary and all bets are off, though all the other filesystems behave
> > > > > this way and there seem not to be complaints
> > > > >
> > > > > d) If you try to "guarantee" folio granularity by messing with min/max
> > > > > folio size, you run the risk of ENOMEM if the base pages get fragmented
> > > > >
> > > > > I think that's why Dave suggested range locks as the correct solution to
> > > > > this; though it is a pity that so far nobody has come up with a
> > > > > performant implementation.
> > > >
> > > > Yes, that's a fair summary of the situation.
> > > >
> > > > That said, I just had a left-field idea for a quasi-range lock
> > > > that may allow random writes to run concurrently and atomically
> > > > with reads.
> > > >
> > > > Essentially, we add an unsigned long to the inode, and use it as a
> > > > lock bitmap. That gives up to 64 "lock segments" for the buffered
> > > > write. We may also need a "segment size" variable....
> > > >
> > > > The existing i_rwsem gets taken shared unless it is an extending
> > > > write.
> > > >
> > > > For a non-extending write, we then do an offset->segment translation
> > > > and lock that bit in the bit mask. If it's already locked, we wait
> > > > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
> > > >
> > > > The segments are evenly sized - say a minimum of 64kB each, but when
> > > > EOF is extended or truncated (which is done with the i_rwsem held
> > > > exclusive) the segment size is rescaled. As nothing can hold bit
> > > > locks while the i_rwsem is held exclusive, this will not race with
> > > > anything.
> > > >
> > > > If we are doing an extending write, we take the i_rwsem shared
> > > > first, then check if the extension will rescale the locks. If lock
> > > > rescaling is needed, we have to take the i_rwsem exclusive to do the
> > > > EOF extension. Otherwise, the bit lock that covers EOF will
> > > > serialise file extensions so it can be done under a shared i_rwsem
> > > > safely.
> > > >
> > > > This will allow buffered writes to remain atomic w.r.t. each other,
> > > > and potentially allow buffered reads to wait on writes to the same
> > > > segment and so potentially provide buffered read vs buffered write
> > > > atomicity as well.
> > > >
> > > > If we need more concurrency than an unsigned long worth of bits for
> > > > buffered writes, then maybe we can enlarge the bitmap further.
> > > >
> > > > I suspect this can be extended to direct IO in a similar way to
> > > > buffered reads, and that then opens up the possibility of truncate
> > > > and fallocate() being able to use the bitmap for range exclusion,
> > > > too.
> > > >
> > > > The overhead is likely minimal - setting and clearing bits in a
> > > > bitmap, as opposed to tracking ranges in a tree structure....
> > > >
> > > > Thoughts?
> > >
> > > I think that's a very neat idea, but it will not address the reference
> > > benchmark.
> > > The reference benchmark I started the original report with which is similar
> > > to my understanding to the benchmark that Chi is running simulates the
> > > workload of a database writing with buffered IO.
> > >
> > > That means a very large file and small IO size ~64K.
> > > Leaving the probability of intersecting writes in the same segment quite high.
> > >
> > > Can we do this opportunistically based on available large folios?
> > > If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
> > > if it is not, use IOLOCK_EXCL?
> > >
> > > for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
> > > naturally align to IO size and above?
> > >
> >
> > Great, I think we're getting close to aligning our thoughts.
> >
> > IMO, we shouldn't use a shared lock for write operations that are
> > larger than page size.
> >
> > I believe the current issue is that when acquiring the i_rwsem lock,
> > we have no way of knowing the size of a large folio [1] (as Darrick
> > mentioned earlier), so we can't determine if only one large folio will
> > be written.
> >
> > There's only one certainty: if the IO size fits within one page size,
> > it will definitely fit within one large folio.
> >
> > So for now, we can only use IOLOCK_SHARED if we verify that the IO fits
> > within page size.
>
> For filesystems that /do/ support large folios (xfs), I suppose you
> could have it tell iomap that it only took i_rwsem in shared mode; and
> then the iomap buffered write implementation could proceed if it got a
> folio covering the entire write range, or return some magic code that
> means "take i_rwsem in exclusive mode and try again".
>
Sorry if this is out of left field as I haven't followed the discussion
closely, but I presumed one of the reasons Darrick and Christoph raised
the idea of using the folio batch thing I'm playing around with on zero
range for buffered writes would be to acquire and lock all targeted
folios up front. If so, would that help with what you're trying to
achieve here? (If not, nothing to see here, move along.. ;).
Brian
> Though you're correct that we should always take IOLOCK_EXCL if the
> write size is larger than whatever the max folio size is for that file.
>
> --D
>
> > [1]: Maybe we can find a way to obtain the size of a folio from the page
> > cache, but it might come with some performance costs.
> >
> >
> > Thanks,
> > Chi Zhiling
> >
> >
>
next prev parent reply other threads:[~2025-01-13 13:38 UTC|newest]
Thread overview: 50+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-12-26 6:16 [PATCH] xfs: Remove i_rwsem lock in buffered read Chi Zhiling
2024-12-26 21:50 ` Dave Chinner
2024-12-28 7:37 ` Chi Zhiling
2024-12-28 22:17 ` Dave Chinner
2024-12-30 2:42 ` Chi Zhiling
2025-01-07 12:13 ` Amir Goldstein
2025-01-07 17:12 ` Christoph Hellwig
2025-01-08 7:43 ` Chi Zhiling
2025-01-08 11:33 ` Amir Goldstein
2025-01-08 11:45 ` Amir Goldstein
2025-01-08 12:15 ` John Garry
2025-01-09 10:07 ` Amir Goldstein
2025-01-09 12:40 ` John Garry
2025-01-09 8:37 ` Chi Zhiling
2025-01-09 10:25 ` Amir Goldstein
2025-01-09 12:10 ` Chi Zhiling
2025-01-09 12:25 ` John Garry
2025-01-08 17:35 ` Darrick J. Wong
2025-01-09 23:28 ` Dave Chinner
2025-01-10 1:31 ` Chi Zhiling
2025-01-10 17:07 ` Amir Goldstein
2025-01-12 10:05 ` Chi Zhiling
2025-01-13 2:44 ` Darrick J. Wong
2025-01-13 5:59 ` Chi Zhiling
2025-01-13 13:40 ` Brian Foster [this message]
2025-01-13 16:19 ` Darrick J. Wong
2025-01-15 5:55 ` Christoph Hellwig
2025-01-15 21:41 ` Dave Chinner
2025-01-16 4:36 ` Christoph Hellwig
2025-01-17 22:20 ` Dave Chinner
2025-01-16 14:23 ` Brian Foster
2025-01-17 13:27 ` Amir Goldstein
2025-01-17 22:19 ` Dave Chinner
2025-01-18 13:03 ` Amir Goldstein
2025-01-20 5:11 ` Dave Chinner
2025-01-22 6:08 ` Christoph Hellwig
2025-01-22 23:35 ` Dave Chinner
2025-01-17 16:12 ` Chi Zhiling
2025-01-24 7:57 ` Chi Zhiling
2025-01-27 20:49 ` Dave Chinner
2025-01-28 5:15 ` Christoph Hellwig
2025-01-28 21:23 ` David Laight
2025-01-29 0:59 ` Dave Chinner
2025-01-29 5:20 ` Christoph Hellwig
2025-02-10 1:44 ` Chi Zhiling
2025-01-14 0:09 ` Dave Chinner
2025-01-25 8:43 ` Jinliang Zheng
2025-01-25 14:14 ` Amir Goldstein
2025-06-20 14:03 ` Jinliang Zheng
-- strict thread matches above, loose matches on Subject: below --
2019-03-25 0:10 [QUESTION] Long read latencies on mixed rw buffered IO Dave Chinner
2025-06-20 13:46 ` [PATCH] xfs: Remove i_rwsem lock in buffered read Jinliang Zheng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z4UX4zyc8n8lGM16@bfoster \
--to=bfoster@redhat.com \
--cc=amir73il@gmail.com \
--cc=cem@kernel.org \
--cc=chizhiling@163.com \
--cc=chizhiling@kylinos.cn \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=john.g.garry@oracle.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).