linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@infradead.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@kernel.org>,
	Kent Overstreet <kent.overstreet@linux.dev>,
	Luis Chamberlain <mcgrof@kernel.org>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm <linux-mm@kvack.org>,
	Daniel Gomez <da.gomez@samsung.com>,
	Pankaj Raghav <p.raghav@samsung.com>,
	Jens Axboe <axboe@kernel.dk>, Dave Chinner <david@fromorbit.com>,
	Christoph Hellwig <hch@lst.de>, Chris Mason <clm@fb.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
Date: Mon, 26 Feb 2024 21:07:51 +0000	[thread overview]
Message-ID: <Zdz9p_Kn0puI1KEL@casper.infradead.org> (raw)
In-Reply-To: <CAHk-=wiEVcqTU1oQPSjaJvxj5NReg3GzkBO8zpL1tXFG1UVyvg@mail.gmail.com>

On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> Willy - tangential side note: I looked closer at the issue that you
> reported (indirectly) with the small reads during heavy write
> activity.
> 
> Our _reading_ side is very optimized and has none of the write-side
> oddities that I can see, and we just have
> 
>   filemap_read ->
>     filemap_get_pages ->
>         filemap_get_read_batch ->
>           folio_try_get_rcu()
> 
> and there is no page locking or other locking involved (assuming the
> page is cached and marked uptodate etc, of course).
> 
> So afaik, it really is just that *one* atomic access (and the matching
> page ref decrement afterwards).

Yep, that was what the customer reported on their ancient kernel, and
we at least didn't make that worse ...

> We could easily do all of this without getting any ref to the page at
> all if we did the page cache release with RCU (and the user copy with
> "copy_to_user_atomic()").  Honestly, anything else looks like a
> complete disaster. For tiny reads, a temporary buffer sounds ok, but
> really *only* for tiny reads where we could have that buffer on the
> stack.
> 
> Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> for to that degree?
> 
> In contrast, the RCU-delaying of the page cache might be a good idea
> in general. We've had other situations where that would have been
> nice. The main worry would be low-memory situations, I suspect.
> 
> The "tiny read" optimization smells like a benchmark thing to me. Even
> with the cacheline possibly bouncing, the system call overhead for
> tiny reads (particularly with all the mitigations) should be orders of
> magnitude higher than two atomic accesses.

Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
I suspect that this customer would simply disable them; afaik the machine
is an appliance and one interacts with it purely by sending transactions
to it (it's not even an SQL system, much less a "run arbitrary javascript"
kind of system).  But that makes it even more special case, inapplicable
to the majority of workloads and closer to smelling like a benchmark.

I've thought about and rejected RCU delaying of the page cache in the
past.  With the majority of memory in anon memory & file memory, it just
feels too risky to have so much memory waiting to be reused.  We could
also improve gup-fast if we could rely on RCU freeing of anon memory.
Not sure what workloads might benefit from that, though.

It'd be cute if we could restrict free memory to be only reallocatable by
the process that had previously allocated it until an RCU grace period
had passed.  That way you could still snoop, but only on yourself which
wouldn't be all that exciting.  Doubt it'd be worth the effort of setting
up per-mm freelists, although it would allow us to skip zeroing the page
under some circumstances ...


  reply	other threads:[~2024-02-26 21:07 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain
2024-02-24  4:12 ` Matthew Wilcox
2024-02-24 17:31   ` Linus Torvalds
2024-02-24 18:13     ` Matthew Wilcox
2024-02-24 18:24       ` Linus Torvalds
2024-02-24 18:20     ` Linus Torvalds
2024-02-24 19:11       ` Linus Torvalds
2024-02-24 21:42         ` Theodore Ts'o
2024-02-24 22:57         ` Chris Mason
2024-02-24 23:40           ` Linus Torvalds
2024-05-10 23:57           ` Luis Chamberlain
2024-02-25  5:18     ` Kent Overstreet
2024-02-25  6:04       ` Kent Overstreet
2024-02-25 13:10       ` Matthew Wilcox
2024-02-25 17:03         ` Linus Torvalds
2024-02-25 21:14           ` Matthew Wilcox
2024-02-25 23:45             ` Linus Torvalds
2024-02-26  1:02               ` Kent Overstreet
2024-02-26  1:32                 ` Linus Torvalds
2024-02-26  1:58                   ` Kent Overstreet
2024-02-26  2:06                     ` Kent Overstreet
2024-02-26  2:34                     ` Linus Torvalds
2024-02-26  2:50                   ` Al Viro
2024-02-26 17:17                     ` Linus Torvalds
2024-02-26 21:07                       ` Matthew Wilcox [this message]
2024-02-26 21:17                         ` Kent Overstreet
2024-02-26 21:19                           ` Kent Overstreet
2024-02-26 21:55                             ` Paul E. McKenney
2024-02-26 23:29                               ` Kent Overstreet
2024-02-27  0:05                                 ` Paul E. McKenney
2024-02-27  0:29                                   ` Kent Overstreet
2024-02-27  0:55                                     ` Paul E. McKenney
2024-02-27  1:08                                       ` Kent Overstreet
2024-02-27  5:17                                         ` Paul E. McKenney
2024-02-27  6:21                                           ` Kent Overstreet
2024-02-27 15:32                                             ` Paul E. McKenney
2024-02-27 15:52                                               ` Kent Overstreet
2024-02-27 16:06                                                 ` Paul E. McKenney
2024-02-27 15:54                                               ` Matthew Wilcox
2024-02-27 16:21                                                 ` Paul E. McKenney
2024-02-27 16:34                                                   ` Kent Overstreet
2024-02-27 17:58                                                     ` Paul E. McKenney
2024-02-28 23:55                                                       ` Kent Overstreet
2024-02-29 19:42                                                         ` Paul E. McKenney
2024-02-29 20:51                                                           ` Kent Overstreet
2024-03-05  2:19                                                             ` Paul E. McKenney
2024-02-27  0:43                                 ` Dave Chinner
2024-02-26 22:46                       ` Linus Torvalds
2024-02-26 23:48                         ` Linus Torvalds
2024-02-27  7:21                           ` Kent Overstreet
2024-02-27 15:39                             ` Matthew Wilcox
2024-02-27 15:54                               ` Kent Overstreet
2024-02-27 16:34                             ` Linus Torvalds
2024-02-27 16:47                               ` Kent Overstreet
2024-02-27 17:07                                 ` Linus Torvalds
2024-02-27 17:20                                   ` Kent Overstreet
2024-02-27 18:02                                     ` Linus Torvalds
2024-05-14 11:52                         ` Luis Chamberlain
2024-05-14 16:04                           ` Linus Torvalds
2024-11-15 19:43                           ` Linus Torvalds
2024-11-15 20:42                             ` Matthew Wilcox
2024-11-15 21:52                               ` Linus Torvalds
2024-02-25 21:29           ` Kent Overstreet
2024-02-25 17:32         ` Kent Overstreet
2024-02-24 17:55   ` Luis Chamberlain
2024-02-25  5:24 ` Kent Overstreet
2024-02-26 12:22 ` Dave Chinner
2024-02-27 10:07 ` Kent Overstreet
2024-02-27 14:08   ` Luis Chamberlain
2024-02-27 14:57     ` Kent Overstreet
2024-02-27 22:13   ` Dave Chinner
2024-02-27 22:21     ` Kent Overstreet
2024-02-27 22:42       ` Dave Chinner
2024-02-28  7:48         ` [Lsf-pc] " Amir Goldstein
2024-02-28 14:01           ` Chris Mason
2024-02-29  0:25           ` Dave Chinner
2024-02-29  0:57             ` Kent Overstreet
2024-03-04  0:46               ` Dave Chinner
2024-02-27 22:46       ` Linus Torvalds
2024-02-27 23:00         ` Linus Torvalds
2024-02-28  2:22         ` Kent Overstreet
2024-02-28  3:00           ` Matthew Wilcox
2024-02-28  4:22             ` Matthew Wilcox
2024-02-28 17:34               ` Kent Overstreet
2024-02-28 18:04                 ` Matthew Wilcox
2024-02-28 18:18         ` Kent Overstreet
2024-02-28 19:09           ` Linus Torvalds
2024-02-28 19:29             ` Kent Overstreet
2024-02-28 20:17               ` Linus Torvalds
2024-02-28 23:21                 ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zdz9p_Kn0puI1KEL@casper.infradead.org \
    --to=willy@infradead.org \
    --cc=axboe@kernel.dk \
    --cc=clm@fb.com \
    --cc=da.gomez@samsung.com \
    --cc=david@fromorbit.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@lst.de \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=mcgrof@kernel.org \
    --cc=p.raghav@samsung.com \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).