From: Chris Mason <clm@meta.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>,
lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
linux-mm <linux-mm@kvack.org>,
Daniel Gomez <da.gomez@samsung.com>,
Pankaj Raghav <p.raghav@samsung.com>,
Jens Axboe <axboe@kernel.dk>, Dave Chinner <david@fromorbit.com>,
Christoph Hellwig <hch@lst.de>, Chris Mason <clm@fb.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Sat, 24 Feb 2024 17:57:43 -0500
Message-ID: <bb2e87d7-a706-4dc8-9c09-9257b69ebd5c@meta.com>
In-Reply-To: <CAHk-=wintzU7i5NCVAUY_es6_eo8Zpt=mD0PAyhFd0aCu65WfA@mail.gmail.com>
On 2/24/24 2:11 PM, Linus Torvalds wrote:
> On Sat, 24 Feb 2024 at 10:20, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> If somebody really cares about this kind of load, and cannot use
>> O_DIRECT for some reason ("I actually do want caches 99% of the
>> time"), I suspect the solution is to have some slightly gentler way to
>> say "instead of the throttling logic, I want you to start my writeouts
>> much more synchronously".
>>
>> IOW, we could have a writer flag that still uses the page cache, but
>> that instead of that
>>
>> balance_dirty_pages_ratelimited(mapping);
>
> I was *sure* we had had some work in this area, and yup, there's a
> series from 2019 by Konstantin Khlebnikov to implement write-behind.
>
> Some digging in the lore archives found this
>
> https://lore.kernel.org/lkml/156896493723.4334.13340481207144634918.stgit@buzz/
>
> but I don't remember what then happened to it. It clearly never went
> anywhere, although I think something _like_ that is quite possibly the
> right thing to do (and I was fairly positive about the patch at the
> time).
>
> I have this feeling that there's been other attempts of write-behind
> in this area, but that thread was the only one I found from my quick
> search.
>
> I'm not saying Konstantin's patch is the thing to do, and I suspect we
> might want to actually have some way for people to say at open-time
> that "I want write-behind", but it looks like at least a starting
> point.
>
> But it is possible that this work never went anywhere exactly because
> this is such a rare case. That kind of "write so much that you want to
> do something special" is often such a special thing that using
> O_DIRECT is generally the trivial solution.
For teams that really want more control over dirty pages with the existing
APIs, I've suggested calling sync_file_range() periodically. It seems to
work pretty well, and they can adjust the sizes and frequency as needed.
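
Something like the sketch below is what I have in mind. It's illustrative
only: the 64MB window, the helper name, and the (lack of) error handling
are mine, not something any particular team is shipping.

/* Sketch only: drive writeback from the writer with sync_file_range(2)
 * so dirty pages never build up to the throttling thresholds.  The
 * 64MB window is arbitrary; tune it to the device. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define WINDOW	(64UL << 20)

static void write_with_writeback(int fd, const char *buf, size_t len)
{
	off_t off = 0, flushed = 0;

	while ((size_t)off < len) {
		ssize_t n = write(fd, buf + off, len - off);

		if (n < 0)
			exit(1);
		off += n;

		if (off - flushed >= (off_t)WINDOW) {
			/* kick off async writeback on the window we just dirtied */
			sync_file_range(fd, flushed, off - flushed,
					SYNC_FILE_RANGE_WRITE);
			/* wait on the window before it, so at most ~two
			 * windows are dirty or in flight at any time */
			if (flushed >= (off_t)WINDOW)
				sync_file_range(fd, flushed - WINDOW, WINDOW,
						SYNC_FILE_RANGE_WAIT_BEFORE |
						SYNC_FILE_RANGE_WRITE |
						SYNC_FILE_RANGE_WAIT_AFTER);
			flushed = off;
		}
	}
}

The point is that the writer only ever has a couple of windows dirty or in
flight, so balance_dirty_pages() never kicks in and writeback stays spread
out instead of arriving in one giant burst.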
Managing clean pages has been a problem with workloads that really care
about p99 allocation latency. We've had issues where kswapd saturates a
core throwing away all the clean pages from either streaming readers or
writers.
To reproduce on 6.8-rc5, I did buffered IO onto a 6 drive raid0 via MD.
Max possible tput seems to be 8GB/s writes, and the box has 256GB of ram
across two sockets. For buffered IO onto md0, we're hitting about
1.2GB/s, and have a core saturated by a kworker doing writepages.
From time to time, our random crud that maintains the system will need a
lot of memory and kswapd will saturate a core, but this tends to resolve
itself after 10-20 seconds. Our ultra sensitive workloads would
complain, but they manage the page cache more explicitly to avoid these
situations.
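
"More explicitly" here means roughly the pattern below: once writeback of
a range has finished (e.g. via the sync_file_range() calls above), tell the
kernel the clean pages can go, so kswapd never has to reclaim them under
pressure. Again just a sketch, and the helper name is made up:

#include <fcntl.h>

/* Sketch: after a range has been written back, drop its now-clean pages
 * so they never show up as reclaim work for kswapd.  Pages that are
 * still dirty are not freed, which is why this pairs with
 * sync_file_range()/fsync() on the range first. */
static void drop_clean_range(int fd, off_t offset, off_t len)
{
	posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}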
The raid0 is fast enough that we never hit the synchronous dirty page
limit. fio is just 100% CPU bound, and when kswapd saturates a core,
it's just freeing clean pages.
With filesystems in use, kswapd and the writepages kworkers are better
behaved, which just makes me think writepages on block devices has seen
less optimization, not really a huge surprise. Filesystems can push the
full 8GB/s tput either buffered or O_DIRECT.
With streaming writes to a small number of large files, total free
memory might get down to 1.5GB on the 256GB machine, with most of the
rest being clean page cache.
If I instead write to millions of 1MB files, free memory refuses to go
below 12GB, and kswapd doesn't misbehave at all. We're still pushing
7GB/s writes.
Not a lot of conclusions, other than it's not that hard to use clean
page cache to make the system slower than some workloads are willing to
tolerate.
Ignoring wildly slow devices, the dirty limits seem to work well enough
on both big and small systems that I haven't needed to investigate
issues there as often.
Going back to Luis's original email, I'd echo Willy's suggestion for
profiles. Unless we're saturating memory bandwidth, buffered should be
able to get much closer to O_DIRECT, just at a much higher overall cost.
-chris