From: Amir Goldstein <amir73il@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>, Jan Kara <jack@suse.cz>,
Christoph Hellwig <hch@lst.de>,
"Darrick J . Wong" <darrick.wong@oracle.com>,
Matthew Wilcox <willy@infradead.org>,
linux-xfs <linux-xfs@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [POC][PATCH] xfs: reduce ilock contention on buffered randrw workload
Date: Wed, 21 Sep 2022 14:20:07 +0300 [thread overview]
Message-ID: <CAOQ4uxj4uCRkqN+L1RYmY9Ey=eu3xp249Y6BYU-JQFjgGrOdQg@mail.gmail.com> (raw)
In-Reply-To: <CAOQ4uxgNBWpdpoXrqwhtGkMCLr3aAdv2=_oEYtafWK1WrP4-hw@mail.gmail.com>
On Tue, Sep 20, 2022 at 6:08 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
[...]
> > As I just discussed on #xfs with Darrick, there are other options
> > we can pursue here.
> >
> > The first question we need to ask ourselves is this: what are we
> > protecting against with exclusive buffered write behaviour?
> >
> > The answer is that we know there are custom enterprise database
> > applications out there that assume that 8-16kB buffered writes are
> > atomic. I wish I could say these are legacy applications these days,
> > but they aren't - they are still in production use, and the
> > applications built on those custom database engines are still under
> > active development and use.
> >
> > AFAIK, the 8kB atomic write behaviour is historical and came from
> > applications originally designed for Solaris and hardware that
> > had an 8kB page size. Hence buffered 8kB writes were assumed to be
> > the largest atomic write size that concurrent reads would not see
> > write tearing. These applications are now run on x86-64 boxes with
> > 4kB page size, but they still assume that 8kB writes are atomic and
> > can't tear.
> >
>
> Interesting. I did not know which applications needed that behavior.
> The customer benchmark that started the complaint uses 8k buffered
> IO size (as do the benchmarks that I posted in this thread), so as
> far as I am concerned, fixing small buffered IO will solve the problem.
>
> > So, really, these days the atomic write behaviour of XFS is catering
> > for these small random read/write IO applications, not to provide
> > atomic writes for bulk data moving applications writing 2GB of data
> > per write() syscall. Hence we can fairly safely say that we really
> > only need "exclusive" buffered write locking for relatively small
> > multipage IOs, not huge IOs.
> >
> > We can do single page shared buffered writes immediately - we
> > guarantee that while the folio is locked, a buffered read cannot
> > access the data until the folio is unlocked. So that could be the
> > first step to relaxing the exclusive locking requirement for
> > buffered writes.
> >
> > Next we need to consider that we now have large folio support in the
> > page cache, which means we can treat contiguous file ranges larger
> > than a single page as a single atomic unit if they are covered by a
> > multi-page folio. As such, if we have a single multi-page folio that
> > spans the entire write() range already in cache, we can run that
> > write atomically under a shared IO lock the same as we can do with
> > single page folios.
> >
> > However, what happens if the folio is smaller than the range we need
> > to write? Well, in that case, we have to abort the shared lock write
> > and upgrade to an exclusive lock before trying again.
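The decision described above - try the write under the shared lock, and fall back to the exclusive lock only when no single cached folio spans the whole write range - could be sketched in userspace C roughly as follows. To be clear, `inode_sketch`, `folio_covers()` and the lock-mode enum are invented stand-ins for the real kernel objects, not actual XFS code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Made-up stand-in for the kernel objects discussed above: one cached
 * folio covering [folio_start, folio_start + folio_len) of the file. */
struct inode_sketch {
	size_t folio_start;
	size_t folio_len;
};

enum iolock_mode { IOLOCK_SHARED, IOLOCK_EXCL };

/* Does a single cached folio span the entire write() range? */
static bool folio_covers(const struct inode_sketch *ip, size_t pos, size_t len)
{
	return pos >= ip->folio_start &&
	       pos + len <= ip->folio_start + ip->folio_len;
}

/*
 * Optimistically start under the shared iolock; if the locked folio
 * turns out to be too small to make the copy-in atomic, the caller
 * must drop the shared lock and redo the write under the exclusive
 * lock.
 */
enum iolock_mode pick_iolock_mode(const struct inode_sketch *ip,
				  size_t pos, size_t len)
{
	if (folio_covers(ip, pos, len))
		return IOLOCK_SHARED;	/* folio lock keeps readers out */
	return IOLOCK_EXCL;		/* abort shared, upgrade, retry */
}
```

The important property is that the check can only be made after the folio lookup, which is why the fallback is structured as "abort and retry" rather than an in-place lock upgrade.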
> >
Please correct me if I am wrong, but with current upstream, the only
way that multi-page folios are created for a 4K block / 4K page setup is
during readahead.
We *could* allocate multi-page folios on write to an allocated block range
that maps inside a single extent, but there is no code for that today.
It seems that without this code, any 8K buffered write to a region of the
page cache not pre-populated by readahead would take the exclusive iolock
until that single-page folio cache entry is evicted.
Am I reading the iomap code correctly?
> > Of course, we can only determine if the write can go ahead once we
> > have the folio locked. That means we need a new non-blocking write
> > condition to be handled by the iomap code. We already have several
> > of them because of IOCB_NOWAIT semantics that io_uring requires for
> > buffered writes, so we are already well down the path of needing to
> > support fully non-blocking writes through iomap.
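For reference, the userspace-visible side of those NOWAIT semantics is the RWF_NOWAIT flag to pwritev2(). Support for it on buffered writes varies by kernel and filesystem, so this illustrative helper (names invented here) treats EAGAIN/EOPNOTSUPP as expected outcomes rather than hard errors:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Attempt one non-blocking buffered write. Returns:
 *   1  - the write completed without blocking
 *   0  - the kernel said it would block (EAGAIN), or RWF_NOWAIT is
 *        not supported for buffered writes here (EOPNOTSUPP/EINVAL)
 *  -1  - a real error
 */
int try_nowait_write(int fd, const char *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret = pwritev2(fd, &iov, 1, off, RWF_NOWAIT);

	if (ret == (ssize_t)len)
		return 1;
	if (ret < 0 && (errno == EAGAIN || errno == EOPNOTSUPP ||
			errno == EINVAL))
		return 0;
	return -1;
}
```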
> >
> > Further, the recent concurrent write data corruption that we
> > uncovered requires a new hook in the iomap write path to allow
> > writes to be aborted for remapping because the cached iomap has
> > become stale. This validity check can only be done once the folio
> > has been locked - if the cached iomap is stale once we have the folio
> > locked, then we have to back out and remap the write range and
> > re-run the write.
> >
> > IOWs, we are going to have to add write retries to the iomap write
> > path for data integrity purposes. These checks must be done only
> > after the folio has been locked, so we really end up getting the
> > "can't do atomic write" retry infrastructure for free with the data
> > corruption fixes...
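The remap-and-retry scheme described above can be modelled with a simple sequence counter: the filesystem bumps a counter whenever the extent map changes, and a cached mapping records the value it was sampled from. This is only a toy model of the idea with invented names, not the actual iomap interface:

```c
#include <stdbool.h>

struct fs_sketch { unsigned long extent_seq; };
struct iomap_sketch { unsigned long sampled_seq; };

/* "Remap": sample the current sequence number with the mapping. */
static void get_mapping(struct fs_sketch *fs, struct iomap_sketch *map)
{
	map->sampled_seq = fs->extent_seq;
}

/* Only meaningful once the folio is locked. */
static bool iomap_still_valid(const struct fs_sketch *fs,
			      const struct iomap_sketch *map)
{
	return map->sampled_seq == fs->extent_seq;
}

/*
 * Returns how many mapping attempts the write needed. race_once
 * simulates a concurrent remap landing between the mapping lookup
 * and the folio lock.
 */
int write_with_revalidation(struct fs_sketch *fs, bool race_once)
{
	struct iomap_sketch map;
	int attempts = 0;

	for (;;) {
		get_mapping(fs, &map);
		attempts++;
		if (race_once) {	/* racing extent map change */
			fs->extent_seq++;
			race_once = false;
		}
		/* ... folio is locked here; now the check is safe ... */
		if (iomap_still_valid(fs, &map))
			break;	/* safe to copy data into the folio */
		/* stale iomap: back out, remap the range, retry */
	}
	return attempts;
}
```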
> >
> > With this in place, it becomes trivial to support atomic writes with
> > shared locking all the way up to PMD sizes (or whatever the maximum
> > multipage folio size the arch supports is) with a minimal amount of
> > extra code.
> >
> > At this point, we have a buffered write path that tries to do shared
> > locking first, and only falls back to exclusive locking if the page
> > cache doesn't contain a folio large enough to soak up the entire
> > write.
> >
> > In future, Darrick suggested we might be able to do a "trygetlock a
> > bunch of folios" operation that locks a range of folios within the
> > current iomap in one go, and then we write into all of them in a
> > batch before unlocking them all. This would give us multi-folio
> > atomic writes with shared locking - this is much more complex, and
> > it's unclear that multi-folio write batching will gain us anything
> > over the single folio check described above...
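Presumably the "trygetlock a bunch of folios" operation would need all-or-nothing semantics: on the first folio in the range that cannot be locked, every folio already taken has to be released before falling back. A toy model of that invariant (bool flags standing in for folio locks; not real kernel code):

```c
#include <stdbool.h>

/*
 * Toy model: locked[i] means "folio i is already locked by someone
 * else" on entry, and "folio i is held by us" on success. Either
 * every folio in the range is acquired, or none are.
 */
bool try_lock_folio_range(bool locked[], int n)
{
	for (int i = 0; i < n; i++) {
		if (locked[i]) {
			/* contended: roll back what we took and bail */
			while (i-- > 0)
				locked[i] = false;
			return false;
		}
		locked[i] = true;
	}
	return true;	/* batched write can proceed; unlock all after */
}
```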
> >
> > Finally, for anything that is concurrently reading and writing lots
> > of data in chunks larger than PMD sizes, the application should
> > really be using DIO with AIO or io_uring. So falling back to
> > exclusive locking for such large single buffered write IOs doesn't
> > seem like a huge issue right now....
> >
> > Thoughts?
>
> That sounds like a great plan.
> I especially liked the "get it for free" part ;)
> Is there already WIP for the data integrity issue fix?
>
OK. I see your patch set.
> If there is anything I can do to assist, run the benchmark or anything
> please let me know.
>
> In the meanwhile, I will run the benchmark with XFS_IOLOCK_SHARED
> on the write() path.
>
As expected, results without exclusive iolock look very good [*].
Thanks,
Amir.
[*] I ran the following fio workload on e2-standard-8 GCE machine:
[global]
filename=/mnt/xfs/testfile.fio
norandommap
randrepeat=0
size=5G
bs=8K
ioengine=psync
numjobs=8
group_reporting=1
direct=0
fallocate=1
end_fsync=0
runtime=60
[xfs-read]
readwrite=randread
[xfs-write]
readwrite=randwrite
========================= v6.0-rc4 (BAD) =========
Run #1:
READ: bw=7053KiB/s (7223kB/s)
WRITE: bw=155MiB/s (163MB/s)
Run #2:
READ: bw=4672KiB/s (4784kB/s)
WRITE: bw=355MiB/s (372MB/s)
Run #3:
READ: bw=5887KiB/s (6028kB/s)
WRITE: bw=137MiB/s (144MB/s)
========================= v6.0-rc4 (read no iolock like ext4 - GOOD) =========
READ: bw=742MiB/s (778MB/s)
WRITE: bw=345MiB/s (361MB/s)
========================= v6.0-rc4 (write shared iolock - BETTER) =========
Run #1:
READ: bw=762MiB/s (799MB/s)
WRITE: bw=926MiB/s (971MB/s)
Run #2:
READ: bw=170MiB/s (178MB/s)
WRITE: bw=982MiB/s (1029MB/s)
Run #3:
READ: bw=755MiB/s (792MB/s)
WRITE: bw=933MiB/s (978MB/s)