* [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
@ 2024-02-23 23:59 Luis Chamberlain
2024-02-24 4:12 ` Matthew Wilcox
` (3 more replies)
0 siblings, 4 replies; 90+ messages in thread
From: Luis Chamberlain @ 2024-02-23 23:59 UTC (permalink / raw)
To: lsf-pc
Cc: linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner,
Matthew Wilcox, Linus Torvalds, mcgrof
Part of the testing we have done with LBS was to run performance
tests on XFS to ensure things are not regressing. Building Linux is a
decent test, and we did some random cloud instance tests with that and
presented the results at Plumbers, but it doesn't really cut it if we
want to push things to the limit. What are the limits to buffered IO
and how do we test that? Who keeps track of it?
The obvious recurring tension is that for really high performance,
folks just recommend using direct IO. But if you are stress testing
changes to a filesystem and want to push buffered IO to its limits, it
makes sense to stick to buffered IO; otherwise, how else do we test it?
It is also good to know the limits of buffered IO because some
workloads cannot use direct IO. For instance, PostgreSQL doesn't have
direct IO support, and even as late as the end of last year we learned
that adding direct IO to PostgreSQL would be difficult. Chris Mason has
also noted that direct IO can force writes during reads (?)... Anyway,
testing the limits of buffered IO to ensure you are not creating
regressions when doing some page cache surgery seems useful and
sensible. The good news is we have not found regressions with LBS, but
all the testing begs the question: what are the limits of buffered IO
anyway, and how does it scale? Do we know, do we care? Do we keep track
of it? How does it compare to direct IO for some workloads? How big is
the delta? How do we best test that? How do we automate all that? Do we
want to automatically test this to avoid regressions?
The obvious issue with some workloads for buffered IO is the possible
penalty if you are not really re-using folios added to the page cache.
Jens Axboe reported a while ago issues with workloads doing random
reads over a data set 10x the size of RAM, and proposed RWF_UNCACHED as
a way to help [0] [1]. As Chinner put it, this seemed more like direct
IO with kernel pages and a memcpy(), and it would require implementing
for writes the same serialization we already do for direct IO. There at
least seems to be agreement that if we're going to provide an
enhancement or alternative, we should strive not to make the same
mistakes we've made with direct IO. The rationale for some workloads to
use buffered IO is that it helps reduce some tail latencies, so that's
something to live up to.
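For reference, [1] exposed this as a new preadv2()/pwritev2() flag; a
minimal sketch of what a caller would look like. The flag was never
merged, so the value below is a placeholder taken from the RFC series,
not uapi:

    #define _GNU_SOURCE
    #include <sys/uio.h>

    #ifndef RWF_UNCACHED
    #define RWF_UNCACHED 0x40   /* placeholder: from the RFC, not uapi */
    #endif

    /* Buffered write that asks the kernel to drop the cached folios
     * once they have been written back, per the proposal in [1]. */
    static ssize_t write_uncached(int fd, const void *buf, size_t len,
                                  off_t off)
    {
            struct iovec iov = {
                    .iov_base = (void *)buf,
                    .iov_len  = len,
            };

            return pwritev2(fd, &iov, 1, off, RWF_UNCACHED);
    }

The point of the design is that data still flows through the page
cache, but reclaim never has to chew through folios the workload will
not re-use.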
On that same thread Christoph also mentioned the possibility of a direct
IO variant which can leverage the cache. Is that something we want to
move forward with?
Chris Mason also listed a few other desirables if we do:
- Allowing concurrent writes (xfs DIO does this now)
- Optionally doing zero copy if alignment is good (btrfs DIO does this now)
- Optionally tossing pages at the end (requires a separate syscall now)
- Supporting aio via io_uring
I recently ran a different type of simple test, focused on sequential
writes to fill capacity, with the write workload essentially matching
the size of RAM. Technically, at the max size I tested, the writes were
just *slightly* over the RAM; that's a minor technicality, given I did
other tests with similar sizes which showed similar results... This
test should be easy to reproduce if you have more than enough RAM to
spare. In this case the system has 1 TiB of RAM, and uses pmem to avoid
drive variance / GC / other drive shenanigans.
So the pmem grub setup:

memmap=500G!4G memmap=3G!504G

(memmap=<size>!<offset> carves a region of RAM out as emulated pmem:
here 500 GiB starting at the 4 GiB offset, and 3 GiB at 504 GiB.)
As noted earlier, DIO / DAX is surely best for pmem (and I actually see
a difference between just DIO and DAX, but that's a digression), but
when one wishes to test buffered IO on purpose it makes sense to do
this. Yes, we can test tmpfs too... but I believe that topic will be
brought up at LSFMM separately. The delta between DIO and buffered IO
on XFS is astronomical:
~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
Vs
~ 7,000 MiB/s with buffered IO
Let's recap: the system had 1 TiB RAM, about half of it dedicated to
pmem. It used two pmem partitions, one 500 GiB for XFS and another
3 GiB for an XFS external journal. The system has DDR5, which caps at
35.76 GiB/s. The proprietary Intel tool to test RAM crashes on me...
and leaves a kernel splat that seems to indicate the tool is doing
something we don't allow anymore, so I couldn't get an aggregate RAM
baseline. The kernel used was 6.6.0-rc5 + our LBS changes on top.
The buffered IO fio incantation:

  fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring \
      -direct=0 --group_reporting=1 --alloc-size=1048576 \
      --filesize=1GiB --readwrite=write --fallocate=none \
      --numjobs=$(nproc) --create_on_open=1 --directory=/mnt
This scales up to nproc (128) threads, each thread writing 10 files of
1 GiB each. That's 128 threads * 10 files * 1 GiB, or about 1.25 TiB of
intended writes against a ~500 GiB filesystem, so it easily fills the
disk until we're out of space. Each thread ends up writing only about
~300 - 400 MiB to each of its 10 files, as we run out of space fast.
-direct=1 for direct IO and -direct=0 for buffered IO.
Perhaps this is a silly topic to bring up, I don't know, but the max
write throughput wall seemed strikingly low to me given the amount of
free memory on the system.
Reducing the data written to just 128 GiB makes the throughput climb
to about ~ 13 GiB/s for buffered IO... but doubling that to 256 GiB we
get only 9,276 MiB/s. So even when we have twice as much RAM available
as data written, we hit a throughput wall which is *still*
astronomically different from direct IO with this setup.
Should we talk about these things at LSFMM? Are folks interested?
[0] https://lwn.net/Articles/806980/
[1] https://patchwork.kernel.org/project/linux-fsdevel/cover/20191211152943.2933-1-axboe@kernel.dk/
Luis
^ permalink raw reply [flat|nested] 90+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain
@ 2024-02-24  4:12 ` Matthew Wilcox
  2024-02-24 17:31   ` Linus Torvalds
  2024-02-24 17:55   ` Luis Chamberlain
  2024-02-25  5:24 ` Kent Overstreet
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 90+ messages in thread
From: Matthew Wilcox @ 2024-02-24  4:12 UTC (permalink / raw)
To: Luis Chamberlain
Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
    Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
    Johannes Weiner, Linus Torvalds

On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> Part of the testing we have done with LBS was to run performance
> tests on XFS to ensure things are not regressing. Building Linux is a
> decent test, and we did some random cloud instance tests with that and
> presented the results at Plumbers, but it doesn't really cut it if we
> want to push things to the limit. What are the limits to buffered IO
> and how do we test that? Who keeps track of it?

TLDR: Why does the pagecache suck?

> ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> Vs
> ~ 7,000 MiB/s with buffered IO

Profile? My guess is that you're bottlenecked on the xa_lock between
memory reclaim removing folios from the page cache and the various
threads adding folios to the page cache. If each thread has its own
file, that would help. If the threads do their own reclaim that would
help the page cache ... but then they'd contend on the node's lru lock
instead, so just trading one pain for another.

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-24 4:12 ` Matthew Wilcox @ 2024-02-24 17:31 ` Linus Torvalds 2024-02-24 18:13 ` Matthew Wilcox ` (2 more replies) 2024-02-24 17:55 ` Luis Chamberlain 1 sibling, 3 replies; 90+ messages in thread From: Linus Torvalds @ 2024-02-24 17:31 UTC (permalink / raw) To: Matthew Wilcox Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > > What are the limits to buffered IO > > and how do we test that? Who keeps track of it? > > TLDR: Why does the pagecache suck? What? No. Our page cache is so good that the question is literally "what are the limits of it", and "how we would measure them". That's not a sign of suckage. When you have to have completely unrealistic loads that nobody would actually care about in reality just to get a number for the limit, it's not a sign of problems. Or rather, the "problem" is the person looking at a stupid load, and going "we need to improve this because I can write a benchmark for this". Here's a clue: a hardware discussion forum I visit was arguing about memory latencies, and talking about how their measured overhead of DRAM latency was literally 85% on the CPU side, not the DRAM side. Guess what? It's because the CPU in question had quite a bit of L3, and it was spread out, and the CPU doesn't even start the memory access before it has checked caches. And here's a big honking clue: only a complete nincompoop and mentally deficient rodent would look at that and say "caches suck". > > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64 > > Vs > > ~ 7,000 MiB/s with buffered IO > > Profile? My guess is that you're bottlenecked on the xa_lock between > memory reclaim removing folios from the page cache and the various > threads adding folios to the page cache. I doubt it's the locking. In fact, for writeout in particular it's probably not even the page cache at all. For writeout, we have a very traditional problem: we care about a million times more about latency than we care about throughput, because nobody ever actually cares all that much about performance of huge writes. Ask yourself when you have last *really* sat there waiting for writes, unless it's some dog-slow USB device that writes at 100kB/s? The main situation where people care about cached write performance (ignoring silly benchmarks) tends to be when you create files, and the directory entry ordering means that the bottleneck is a number of small writes and their *ordering* and their latency. And then the issue is basically never the page cache, but the filesystem ordering of the metadata writes against each other and against the page writeout. Why? Because on all but a *miniscule* percentage of loads, all the actual data writes are quite gracefully taken by the page cache completely asynchronously, and nobody ever cares about the writeout latencies. Now, the benchmark that Luis highlighted is a completely different class of historical problems that has been around forever, namely the "fill up lots of memory with dirty data". And there - because the problem is easy to trigger but nobody tends to care deeply about throughput because they care much much *MUCH* more about latency, we have a rather stupid big hammer approach. It's called "vm_dirty_bytes". 
Well, that's the knob (not the only one). The actual logic around it
is then quite the morass of turning that into the
dirty_throttle_control, and the per-bdi dirty limits that try to take
the throughput of the backing device into account etc etc.

And then all those heuristics are used to actually LITERALLY PAUSE the
writer. We literally have this code:

        __set_current_state(TASK_KILLABLE);
        bdi->last_bdp_sleep = jiffies;
        io_schedule_timeout(pause);

in balance_dirty_pages(), which is all about saying "I'm putting you
to sleep, because I judge you to have dirtied so much memory that
you're making things worse for others".

And a lot of *that* is then because we haven't wanted everybody to
rush in and start their own synchronous writeback, but instead want
all writeback to be done by somebody else. So now we move from
mm/page-writeback.c to fs/fs-writeback.c, and all the work-queues to
do dirty writeout.

Notice how the io_schedule_timeout() above doesn't even get woken up
by IO completing. Nope. The "you have written too much" logic
literally pauses the writer, and doesn't even want to wake it up when
there is no more dirty data.

So "you went over the dirty limits" is a penalty box, and all of
this comes from "you are doing something that is abnormal and that
disturbs other people, so you get an unconditional penalty". Yes, the
timeout is then obviously tied to how much of a problem the dirtying
is (based on that whole "how fast is the device") but it's purely a
heuristic.

And (one) important part here is "nobody sane does that". So
benchmarking this is a bit crazy. The code is literally meant for bad
actors, and what you are benchmarking is the kernel telling you "don't
do that then".

And absolutely *NONE* of this all has anything to do with the page
cache. NADA.

And yes, there's literally thousands of lines of code all explicitly
designed for this "slow down writers" and make it be at least somewhat
graceful and gradual. That's pretty much all mm/page-writeback.c does
(yes, that file *also* does have the "start/end folio writeback"
functions, but they are only a small part of it, even if that's
obviously the origin of the file - the writeback throttling logic has
just grown a lot more).

Linus

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-24 17:31 ` Linus Torvalds @ 2024-02-24 18:13 ` Matthew Wilcox 2024-02-24 18:24 ` Linus Torvalds 2024-02-24 18:20 ` Linus Torvalds 2024-02-25 5:18 ` Kent Overstreet 2 siblings, 1 reply; 90+ messages in thread From: Matthew Wilcox @ 2024-02-24 18:13 UTC (permalink / raw) To: Linus Torvalds Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote: > On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > > > What are the limits to buffered IO > > > and how do we test that? Who keeps track of it? > > > > TLDR: Why does the pagecache suck? > > What? No. > > Our page cache is so good that the question is literally "what are the > limits of it", and "how we would measure them". > > That's not a sign of suckage. For a lot of things our pagecache is amazing. And in other ways it absolutely sucks. The trick will be fixing those less-used glass jaws without damaging the common cases. > When you have to have completely unrealistic loads that nobody would > actually care about in reality just to get a number for the limit, > it's not a sign of problems. No, but sometimes the unrealistic loads are, alas, a good proxy for problems that customers hit. For example, I have one where the customer does an overnight backup with a shitty backup program that doesn't use O_DIRECT and ends up evicting the actual working set from the page cache. They start work the next morning with terrible performance because everything they care about has been swapped out. The "fix" is to limit the pagecache to one NUMA node. I suspect if this customer could be persuaded to run a more recent kernel that this problem has been solved, so I'm not sure there's a call to action from this particular case. Anyway, if there's a way to fix an unrealistic load that doesn't affect realistic loads, sometimes we fix a customer problem too. > Guess what? It's because the CPU in question had quite a bit of L3, > and it was spread out, and the CPU doesn't even start the memory > access before it has checked caches. > > And here's a big honking clue: only a complete nincompoop and mentally > deficient rodent would look at that and say "caches suck". although the problem might be that the CPU has a terrible cross-chiplet interconnect ;-) > > > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64 > > > Vs > > > ~ 7,000 MiB/s with buffered IO > > > > Profile? My guess is that you're bottlenecked on the xa_lock between > > memory reclaim removing folios from the page cache and the various > > threads adding folios to the page cache. > > I doubt it's the locking. It might not be! But there are read-only workloads that do bottleneck on the xa_lock. > For writeout, we have a very traditional problem: we care about a > million times more about latency than we care about throughput, > because nobody ever actually cares all that much about performance of > huge writes. > > Ask yourself when you have last *really* sat there waiting for writes, > unless it's some dog-slow USB device that writes at 100kB/s? 
You picked a bad day to send this email ;-)

$ sudo dd if=Downloads/debian-testing-amd64-netinst.iso of=/dev/sda
[sudo] password for willy:
1366016+0 records in
1366016+0 records out
699400192 bytes (699 MB, 667 MiB) copied, 296.219 s, 2.4 MB/s

ok, that was a cheap-arse USB stick, but then I had to wait for the
installer to write 800GB of random data to /dev/nvme0n1p3 as it set up
the crypto drive. The Debian installer didn't time that for me, but it
was enough time to vacuum the couch.

> Now, the benchmark that Luis highlighted is a completely different
> class of historical problems that has been around forever, namely the
> "fill up lots of memory with dirty data".
>
> And there - because the problem is easy to trigger but nobody tends to
> care deeply about throughput because they care much much *MUCH* more
> about latency, we have a rather stupid big hammer approach.
>
> It's called "vm_dirty_bytes".
>
> Well, that's the knob (not the only one). The actual logic around it
> is then quite the morass of turning that into the
> dirty_throttle_control, and the per-bdi dirty limits that try to take
> the throughput of the backing device into account etc etc.
>
> And then all those heuristics are used to actually LITERALLY PAUSE the
> writer. We literally have this code:
>
>         __set_current_state(TASK_KILLABLE);
>         bdi->last_bdp_sleep = jiffies;
>         io_schedule_timeout(pause);
>
> in balance_dirty_pages(), which is all about saying "I'm putting you
> to sleep, because I judge you to have dirtied so much memory that
> you're making things worse for others".
>
> And a lot of *that* is then because we haven't wanted everybody to
> rush in and start their own synchronous writeback, but instead want
> all writeback to be done by somebody else. So now we move from
> mm/page-writeback.c to fs/fs-writeback.c, and all the work-queues to
> do dirty writeout.
>
> Notice how the io_schedule_timeout() above doesn't even get woken up
> by IO completing. Nope. The "you have written too much" logic
> literally pauses the writer, and doesn't even want to wake it up when
> there is no more dirty data.
>
> So "you went over the dirty limits" is a penalty box, and all of
> this comes from "you are doing something that is abnormal and that
> disturbs other people, so you get an unconditional penalty". Yes, the
> timeout is then obviously tied to how much of a problem the dirtying
> is (based on that whole "how fast is the device") but it's purely a
> heuristic.
>
> And (one) important part here is "nobody sane does that". So
> benchmarking this is a bit crazy. The code is literally meant for bad
> actors, and what you are benchmarking is the kernel telling you "don't
> do that then".

This was a really good writeup, thanks.

I think this might need some tuning (if it is what's going on). When
the underlying device can do 86GB/s and we're only getting 7GB/s, we
could afford to let this writer do a bit more.

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-24 18:13 ` Matthew Wilcox @ 2024-02-24 18:24 ` Linus Torvalds 0 siblings, 0 replies; 90+ messages in thread From: Linus Torvalds @ 2024-02-24 18:24 UTC (permalink / raw) To: Matthew Wilcox Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sat, 24 Feb 2024 at 10:13, Matthew Wilcox <willy@infradead.org> wrote: > > > Ask yourself when you have last *really* sat there waiting for writes, > > unless it's some dog-slow USB device that writes at 100kB/s? > > You picked a bad day to send this email ;-) > > $ sudo dd if=Downloads/debian-testing-amd64-netinst.iso of=/dev/sda What? No. I literally picked the example you then did. And you in fact were being clueless, and the kernel damn well should tell you that you are being an ass, and should slow your writes down so that you don't hurt others. Which is what it did. You should have damn well used oflag=direct, and if you didn't do that, then the onus is all on YOU. The kernel really isn't there to fix your stupid errors for you. You do something stupid, the kernel will give you rope and happily say "do you want more?" Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-24 17:31 ` Linus Torvalds 2024-02-24 18:13 ` Matthew Wilcox @ 2024-02-24 18:20 ` Linus Torvalds 2024-02-24 19:11 ` Linus Torvalds 2024-02-25 5:18 ` Kent Overstreet 2 siblings, 1 reply; 90+ messages in thread From: Linus Torvalds @ 2024-02-24 18:20 UTC (permalink / raw) To: Matthew Wilcox Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sat, 24 Feb 2024 at 09:31, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And (one) important part here is "nobody sane does that". So > benchmarking this is a bit crazy. The code is literally meant for bad > actors, and what you are benchmarking is the kernel telling you "don't > do that then". Side note: one reason why the big hammer approach of "don't do that" has worked so well is that the few loads that *do* want to do this and have a valid reason to write large amounts of data in one go are generally trivially translated to O_DIRECT. For example, if you actually do things like write disk images etc, O_DIRECT is lovely and easy - even trivial - to use. You don't even have to write code for it, you can (and people do) just use 'dd' with 'oflag=direct'. So even trivial shell scripting has access to the "don't do that then" flag. In other words, I really think that Luis' benchmark triggers that kernel "you are doing something actively wrong and stupid" logic. It's not the kernel trying to optimize writeback. It's the kernel trying to protect others from stupid loads. Now, I'm also not saying that you should benchmark this with our "vm_dirty_bytes" logic disabled. That may indeed help performance on that benchmark, but you'll just hit other problem spots instead. Once you fill up lots of memory, other problems become really big and nasty, so you would then need *other* fixes for those issues. If somebody really cares about this kind of load, and cannot use O_DIRECT for some reason ("I actually do want caches 99% of the time"), I suspect the solution is to have some slightly gentler way to say "instead of the throttling logic, I want you to start my writeouts much more synchronously". IOW, we could have a writer flag that still uses the page cache, but that instead of that balance_dirty_pages_ratelimited(mapping); in generic_perform_write(), it would actually synchronously *start* the write, that might work a whole lot better for any load that still wants to do big streaming writes, but wants to also keep the page cache component. Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
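As a purely hypothetical sketch of the idea Linus describes above (the
IOCB_WRITEBEHIND flag is made up for illustration; this is not a real
patch), the tail of generic_perform_write() could branch like this:

    /* Hypothetical: a write-behind writer kicks off (asynchronous)
     * writeback for the range it just dirtied, instead of only being
     * rate-limited after the fact. IOCB_WRITEBEHIND is made up. */
    static void maybe_write_behind(struct kiocb *iocb,
                                   struct address_space *mapping,
                                   loff_t pos, ssize_t written)
    {
            if (iocb->ki_flags & IOCB_WRITEBEHIND)
                    /* Start writeback on what we just wrote; the
                     * caller does not wait for it to complete. */
                    filemap_fdatawrite_range(mapping, pos,
                                             pos + written - 1);
            else
                    balance_dirty_pages_ratelimited(mapping);
    }

The data still lands in the page cache, but dirty pages never pile up
far enough to trip the throttling penalty box.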
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 18:20 ` Linus Torvalds
@ 2024-02-24 19:11   ` Linus Torvalds
  2024-02-24 21:42     ` Theodore Ts'o
  2024-02-24 22:57     ` Chris Mason
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-24 19:11 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
    Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
    Chris Mason, Johannes Weiner

On Sat, 24 Feb 2024 at 10:20, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If somebody really cares about this kind of load, and cannot use
> O_DIRECT for some reason ("I actually do want caches 99% of the
> time"), I suspect the solution is to have some slightly gentler way to
> say "instead of the throttling logic, I want you to start my writeouts
> much more synchronously".
>
> IOW, we could have a writer flag that still uses the page cache, but
> that instead of that
>
>     balance_dirty_pages_ratelimited(mapping);

I was *sure* we had had some work in this area, and yup, there's a
series from 2019 by Konstantin Khlebnikov to implement write-behind.

Some digging in the lore archives found this

  https://lore.kernel.org/lkml/156896493723.4334.13340481207144634918.stgit@buzz/

but I don't remember what then happened to it. It clearly never went
anywhere, although I think something _like_ that is quite possibly the
right thing to do (and I was fairly positive about the patch at the
time).

I have this feeling that there's been other attempts of write-behind
in this area, but that thread was the only one I found from my quick
search.

I'm not saying Konstantin's patch is the thing to do, and I suspect we
might want to actually have some way for people to say at open-time
that "I want write-behind", but it looks like at least a starting
point.

But it is possible that this work never went anywhere exactly because
this is such a rare case. That kind of "write so much that you want to
do something special" is often such a special thing that using
O_DIRECT is generally the trivial solution.

Linus

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 19:11 ` Linus Torvalds
@ 2024-02-24 21:42   ` Theodore Ts'o
  0 siblings, 0 replies; 90+ messages in thread
From: Theodore Ts'o @ 2024-02-24 21:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
    linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
    Christoph Hellwig, Chris Mason, Johannes Weiner

On Sat, Feb 24, 2024 at 11:11:28AM -0800, Linus Torvalds wrote:
> But it is possible that this work never went anywhere exactly because
> this is such a rare case. That kind of "write so much that you want to
> do something special" is often such a special thing that using
> O_DIRECT is generally the trivial solution.

Well, actually there's a relatively common workload where we do this
exact same thing --- and that's when we run mkfs.ext[234] / mke2fs. We
issue a huge number of buffered writes (at least, if the device
doesn't support a zeroing discard operation) to zero out the inode
table.

We rely on the mm subsystem putting mke2fs "into the penalty box", or
else some process (usually mke2fs) will get OOM-killed. I don't
consider it a "penalty" --- in fact, when write throttling doesn't
work, I've complained that it's an mm bug. (Sometimes this has broken
when the mke2fs process runs out of physical memory, and sometimes it
has broken when the mke2fs runs into the memory cgroup limit; it's one
of those things that seems to break every 3-5 years.)

But still, it's something which *must* work, because it's really not
reasonable for userspace to know what is a reasonable rate at which to
self-throttle buffered writes --- it's something the kernel should do
for the userspace process.

Because this is something that has broken more than once, we have two
workarounds in mke2fs; one is that we can call fsync(2) every N block
group's worth of inode tables, which is kind of a hack, and the other
is that we can use Direct I/O. But using DIO has a worse user
experience (well, unless the alternative is mke2fs getting OOM-killed;
admittedly that's worse) than just using buffered I/O, since we
generally don't need to synchronously wait for the write requests to
complete. Neither is enabled by default, because in my view, this is
something the mm should just get right, darn it.

In any case, I definitely don't consider write throttling to be a
performance "problem" --- it's actually a far worse problem when the
throttling doesn't happen, because it generally means someone is
getting OOM-killed.

					- Ted

^ permalink raw reply [flat|nested] 90+ messages in thread
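The fsync-every-N-block-groups workaround Ted mentions boils down to a
loop like the sketch below; the 1 MiB chunk and the 128-chunk fsync
interval are made up for illustration, not what mke2fs actually uses:

    #include <unistd.h>

    /* Sketch: zero a region with buffered writes, fsync'ing every
     * 128 MiB so dirty pages can't pile up unboundedly. Chunk size
     * and fsync interval are illustrative only. */
    static int zero_region(int fd, off_t start, off_t len)
    {
            static char zeros[1 << 20];         /* 1 MiB of zeroes */
            off_t off = start, end = start + len;
            int chunks = 0;

            while (off < end) {
                    size_t n = (end - off) < (off_t)sizeof(zeros) ?
                               (size_t)(end - off) : sizeof(zeros);

                    if (pwrite(fd, zeros, n, off) != (ssize_t)n)
                            return -1;
                    off += n;
                    if (++chunks % 128 == 0 && fsync(fd))
                            return -1;
            }
            return fsync(fd);
    }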
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 19:11 ` Linus Torvalds
  2024-02-24 21:42   ` Theodore Ts'o
@ 2024-02-24 22:57   ` Chris Mason
  2024-02-24 23:40     ` Linus Torvalds
  2024-05-10 23:57     ` Luis Chamberlain
  1 sibling, 2 replies; 90+ messages in thread
From: Chris Mason @ 2024-02-24 22:57 UTC (permalink / raw)
To: Linus Torvalds, Matthew Wilcox
Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez,
    Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig,
    Chris Mason, Johannes Weiner

On 2/24/24 2:11 PM, Linus Torvalds wrote:
> On Sat, 24 Feb 2024 at 10:20, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> If somebody really cares about this kind of load, and cannot use
>> O_DIRECT for some reason ("I actually do want caches 99% of the
>> time"), I suspect the solution is to have some slightly gentler way to
>> say "instead of the throttling logic, I want you to start my writeouts
>> much more synchronously".
>>
>> IOW, we could have a writer flag that still uses the page cache, but
>> that instead of that
>>
>>     balance_dirty_pages_ratelimited(mapping);
>
> I was *sure* we had had some work in this area, and yup, there's a
> series from 2019 by Konstantin Khlebnikov to implement write-behind.
>
> Some digging in the lore archives found this
>
>   https://lore.kernel.org/lkml/156896493723.4334.13340481207144634918.stgit@buzz/
>
> but I don't remember what then happened to it. It clearly never went
> anywhere, although I think something _like_ that is quite possibly the
> right thing to do (and I was fairly positive about the patch at the
> time).
>
> I have this feeling that there's been other attempts of write-behind
> in this area, but that thread was the only one I found from my quick
> search.
>
> I'm not saying Konstantin's patch is the thing to do, and I suspect we
> might want to actually have some way for people to say at open-time
> that "I want write-behind", but it looks like at least a starting
> point.
>
> But it is possible that this work never went anywhere exactly because
> this is such a rare case. That kind of "write so much that you want to
> do something special" is often such a special thing that using
> O_DIRECT is generally the trivial solution.

For teams that really want more control over dirty pages with existing
APIs, I've suggested using sync_file_range periodically. It seems to
work pretty well, and they can adjust the sizes and frequency as
needed.

Managing clean pages has been a problem with workloads that really
care about p99 allocation latency. We've had issues where kswapd
saturates a core throwing away all the clean pages from either
streaming readers or writers.

To reproduce on 6.8-rc5, I did buffered IO onto a 6 drive raid0 via
MD. Max possible tput seems to be 8GB/s writes, and the box has 256GB
of ram across two sockets.

For buffered IO onto md0, we're hitting about 1.2GB/s, and have a core
saturated by a kworker doing writepages. From time to time, our random
crud that maintains the system will need a lot of memory and kswapd
will saturate a core, but this tends to resolve itself after 10-20
seconds. Our ultra sensitive workloads would complain, but they manage
the page cache more explicitly to avoid these situations.

The raid0 is fast enough that we never hit the synchronous dirty page
limit. fio is just 100% CPU bound, and when kswapd saturates a core,
it's just freeing clean pages.

With filesystems in use, kswapd and the writepages kworkers are better
behaved, which just makes me think writepages on blockdevices have
seen less optimization, not really a huge surprise. Filesystems can
push the full 8GB/s tput either buffered or O_DIRECT.

With streaming writes to a small number of large files, total free
memory might get down to 1.5GB on the 256GB machine, with most of the
rest being clean page cache. If I instead write to millions of 1MB
files, free memory refuses to go below 12GB, and kswapd doesn't
misbehave at all. We're still pushing 7GB/s writes.

Not a lot of conclusions, other than it's not that hard to use clean
page cache to make the system slower than some workloads are willing
to tolerate. Ignoring wildly slow devices, the dirty limits seem to
work well enough on both big and small systems that I haven't needed
to investigate issues there as often.

Going back to Luis's original email, I'd echo Willy's suggestion for
profiles. Unless we're saturating memory bandwidth, buffered should be
able to get much closer to O_DIRECT, just at a much higher overall
cost.

-chris

^ permalink raw reply [flat|nested] 90+ messages in thread
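The periodic sync_file_range() pattern Chris refers to usually looks
something like the sketch below; the 8 MiB window is illustrative, and
the fadvise step is the separate "tossing pages" syscall mentioned
earlier in the thread, not something Chris specified:

    #define _GNU_SOURCE
    #include <fcntl.h>

    #define WINDOW (8 << 20)    /* illustrative write-window size */

    /* Call after each WINDOW-sized chunk of buffered writes: kick
     * async writeback on the chunk just written, then wait on and
     * drop the chunk before it. */
    static void writeback_window(int fd, off_t pos)
    {
            sync_file_range(fd, pos, WINDOW, SYNC_FILE_RANGE_WRITE);
            if (pos >= WINDOW) {
                    sync_file_range(fd, pos - WINDOW, WINDOW,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WRITE |
                                    SYNC_FILE_RANGE_WAIT_AFTER);
                    posix_fadvise(fd, pos - WINDOW, WINDOW,
                                  POSIX_FADV_DONTNEED);
            }
    }

This keeps the application's dirty footprint bounded to roughly two
windows, so it never trips balance_dirty_pages() throttling, and the
fadvise keeps clean pages from piling up for kswapd to chew through.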
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 22:57 ` Chris Mason
@ 2024-02-24 23:40   ` Linus Torvalds
  2024-05-10 23:57   ` Luis Chamberlain
  1 sibling, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-24 23:40 UTC (permalink / raw)
To: Chris Mason
Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
    linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
    Christoph Hellwig, Chris Mason, Johannes Weiner

On Sat, 24 Feb 2024 at 14:58, Chris Mason <clm@meta.com> wrote:
>
> For teams that really want more control over dirty pages with existing
> APIs, I've suggested using sync_file_range periodically. It seems to
> work pretty well, and they can adjust the sizes and frequency as
> needed.

Yes. I've written code like that myself.

That said, that is also fairly close to what the write-behind patches
I pointed at did.

One issue (and maybe that was what killed that write-behind patch) is
that there are *other* benchmarks that are actually slightly more
realistic that do things like "untar a tar-file, do something with it,
and then 'rm -rf' it all again".

And *those* benchmarks behave best when the IO is never ever actually
done at all.

And unlike the "write a terabyte with random IO", those benchmarks
actually approximate a few somewhat real loads (I'm not claiming they
are good, but the "create files, do something, then remove them"
pattern at least _exists_ in real life).

For things like block device write for a 'mkfs' run, the whole "this
file may be deleted soon, so let's not even start the write in the
first place" behavior doesn't exist, of course. Starting writeback
much more aggressively for those is probably not a bad idea.

> From time to time, our random crud that maintains the system will need a
> lot of memory and kswapd will saturate a core, but this tends to resolve
> itself after 10-20 seconds. Our ultra sensitive workloads would
> complain, but they manage the page cache more explicitly to avoid these
> situations.

You can see these things with slow USB devices with much more obvious
results. Including long spikes of total inactivity if some system
piece ends up doing a "sync" for some reason.

It happens. It's very annoying. My gut feel is that it happens a lot
less these days than it used to, but I suspect that's at least partly
because I don't see the slow USB devices very much any more.

> Ignoring wildly slow devices, the dirty limits seem to work well enough
> on both big and small systems that I haven't needed to investigate
> issues there as often.

One particular problem point used to be backing devices with wildly
different IO throughput, because I think the speed heuristics don't
necessarily always work all that well at least initially.

And things like that may partly explain your "filesystems work better
than block devices". It doesn't necessarily have to be about
filesystems vs block devices per se, and be instead about things like
"on a filesystem, the bdi throughput numbers have had time to
stabilize".

In contrast, a benchmark that uses some other random device that
doesn't look like a regular disk (whether it's really slow like a bad
USB device, or really fast like pmem), you might see more issues. And
I wouldn't be in the least surprised if that is part of the situation
Luis sees.

Linus

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 22:57 ` Chris Mason
  2024-02-24 23:40   ` Linus Torvalds
@ 2024-05-10 23:57   ` Luis Chamberlain
  1 sibling, 0 replies; 90+ messages in thread
From: Luis Chamberlain @ 2024-05-10 23:57 UTC (permalink / raw)
To: Chris Mason, Dave Chinner, David Bueso, Kent Overstreet,
    Paul E. McKenney
Cc: Linus Torvalds, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm,
    Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
    Chris Mason, Johannes Weiner

On Sat, Feb 24, 2024 at 05:57:43PM -0500, Chris Mason wrote:
> Going back to Luis's original email, I'd echo Willy's suggestion for
> profiles. Unless we're saturating memory bandwidth, buffered should be
> able to get much closer to O_DIRECT, just at a much higher overall cost.

I finally had some time to look beyond just "what locks" could be the
main culprit; David Bueso helped me review this, thanks! Based on all
the discussions on this insanely long thread, I do believe the issue
was the single threaded writeback cache flushing that Chinner noted.
Lifting /proc/sys/vm/dirty_ratio from 20 to 90 keeps the profile perky
and nice, with most top penalties seen just in userspace, as seen in
exhibit a) below. But as soon as we start throttling, we hit the
profile in exhibit b) below.

a) without the throttling:

Samples: 1M of event 'cycles:P', Event count (approx.): 1061541571785
  Children      Self  Command          Shared Object      Symbol
+   17.05%    16.85%  fio              fio                [.] get_io_u
+    3.04%     0.01%  fio              [kernel.vmlinux]   [k] entry_SYSCALL_64
+    3.03%     0.02%  fio              [kernel.vmlinux]   [k] do_syscall_64
+    1.39%     0.04%  fio              [kernel.vmlinux]   [k] __do_sys_io_uring_enter
+    1.33%     0.00%  fio              libc.so.6          [.] __GI___libc_open
+    1.33%     0.00%  fio              [kernel.vmlinux]   [k] __x64_sys_openat
+    1.33%     0.00%  fio              [kernel.vmlinux]   [k] do_sys_openat2
+    1.33%     0.00%  fio              [unknown]          [k] 0x312d6e65742f2f6d
+    1.33%     0.00%  fio              [kernel.vmlinux]   [k] do_filp_open
+    1.33%     0.00%  fio              [kernel.vmlinux]   [k] path_openat
+    1.29%     0.00%  fio              [kernel.vmlinux]   [k] down_write
+    1.29%     0.00%  fio              [kernel.vmlinux]   [k] rwsem_down_write_slowpath
+    1.26%     1.25%  fio              [kernel.vmlinux]   [k] osq_lock
+    1.14%     0.00%  fio              fio                [.] 0x000055bbb94449fa
+    1.14%     1.14%  fio              fio                [.] 0x000000000002a9f5
+    0.98%     0.00%  fio              [unknown]          [k] 0x000055bbd6310520
+    0.93%     0.00%  fio              fio                [.] 0x000055bbb94b197b
+    0.89%     0.00%  perf             libc.so.6          [.] __GI___libc_write
+    0.89%     0.00%  perf             [kernel.vmlinux]   [k] entry_SYSCALL_64
+    0.88%     0.00%  perf             [kernel.vmlinux]   [k] do_syscall_64
+    0.86%     0.00%  perf             [kernel.vmlinux]   [k] ksys_write
+    0.85%     0.01%  perf             [kernel.vmlinux]   [k] vfs_write
+    0.83%     0.00%  perf             [ext4]             [k] ext4_buffered_write_iter
+    0.81%     0.01%  perf             [kernel.vmlinux]   [k] generic_perform_write
+    0.77%     0.02%  fio              [kernel.vmlinux]   [k] io_submit_sqes
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] ret_from_fork_asm
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] ret_from_fork
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] kthread
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] worker_thread
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] process_one_work
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] wb_workfn
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] wb_writeback
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] __writeback_inodes_wb
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] writeback_sb_inodes
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] __writeback_single_inode
+    0.76%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] do_writepages
+    0.76%     0.00%  kworker/u513:26  [xfs]              [k] xfs_vm_writepages
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] submit_bio_noacct_nocheck
+    0.75%     0.00%  kworker/u513:26  [kernel.vmlinux]   [k] iomap_submit_ioend

So we see *more* penalty because of perf's own buffered IO writes of
the perf data than any writeback from XFS.

b) when we hit throttling:

Samples: 1M of event 'cycles:P', Event count (approx.): 816903693659
  Children      Self  Command          Shared Object      Symbol
+   14.24%    14.06%  fio              fio                [.] get_io_u
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] ret_from_fork_asm
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] ret_from_fork
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] kthread
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] worker_thread
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] process_one_work
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] wb_workfn
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] wb_writeback
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] __writeback_inodes_wb
+    4.88%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] writeback_sb_inodes
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] __writeback_single_inode
+    4.87%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] do_writepages
+    4.87%     0.00%  kworker/u513:3-  [xfs]              [k] xfs_vm_writepages
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] iomap_submit_ioend
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] submit_bio_noacct_nocheck
+    4.82%     0.00%  kworker/u513:3-  [kernel.vmlinux]   [k] __submit_bio
+    4.82%     0.04%  kworker/u513:3-  [nd_pmem]          [k] pmem_submit_bio
+    4.78%     0.05%  kworker/u513:3-  [nd_pmem]          [k] pmem_do_write

Although my focus was on measuring the limits of the page cache, this
thread also had a *slew* of ideas on how to improve the status quo,
pathological or not. We have to accept some workloads are clearly
pathological, but that's the point in coming up with limits and
testing the page cache. But since there were a slew of unexpected
ideas spread out over this entire thread about general improvements,
even for general use cases, I've collected all of them and put them up
as notes for review for this topic at LSFMM.

Thanks all for the feedback!

  Luis

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24 17:31 ` Linus Torvalds
  2024-02-24 18:13   ` Matthew Wilcox
  2024-02-24 18:20   ` Linus Torvalds
@ 2024-02-25  5:18   ` Kent Overstreet
  2024-02-25  6:04     ` Kent Overstreet
  2024-02-25 13:10     ` Matthew Wilcox
  2 siblings, 2 replies; 90+ messages in thread
From: Kent Overstreet @ 2024-02-25  5:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
    linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
    Christoph Hellwig, Chris Mason, Johannes Weiner

On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > > What are the limits to buffered IO
> > > and how do we test that? Who keeps track of it?
> >
> > TLDR: Why does the pagecache suck?
>
> What? No.
>
> Our page cache is so good that the question is literally "what are the
> limits of it", and "how we would measure them".
>
> That's not a sign of suckage.
>
> When you have to have completely unrealistic loads that nobody would
> actually care about in reality just to get a number for the limit,
> it's not a sign of problems.
>
> Or rather, the "problem" is the person looking at a stupid load, and
> going "we need to improve this because I can write a benchmark for
> this".
>
> Here's a clue: a hardware discussion forum I visit was arguing about
> memory latencies, and talking about how their measured overhead of
> DRAM latency was literally 85% on the CPU side, not the DRAM side.
>
> Guess what? It's because the CPU in question had quite a bit of L3,
> and it was spread out, and the CPU doesn't even start the memory
> access before it has checked caches.
>
> And here's a big honking clue: only a complete nincompoop and mentally
> deficient rodent would look at that and say "caches suck".
>
> > > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> > > Vs
> > > ~ 7,000 MiB/s with buffered IO
> >
> > Profile? My guess is that you're bottlenecked on the xa_lock between
> > memory reclaim removing folios from the page cache and the various
> > threads adding folios to the page cache.
>
> I doubt it's the locking.
>
> In fact, for writeout in particular it's probably not even the page
> cache at all.
>
> For writeout, we have a very traditional problem: we care about a
> million times more about latency than we care about throughput,
> because nobody ever actually cares all that much about performance of
> huge writes.

Before large folios, we had people very much bottlenecked by 4k page
overhead on sequential IO; my customer/sponsor was one of them.

Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
profiles and looked at the filemap.c code it wasn't hard to see why;
we'd walk a radix tree, do an atomic op (get the page), then do a 4k
usercopy... hence the work I did to break up
generic_file_buffered_read() and vectorize it, which was a huge
improvement.

It's definitely less of a factor post large folios and when we're
talking about workloads that don't fit in cache, but I always wanted
to do a generic version of the vectorized write path that btrfs and
bcachefs have.

^ permalink raw reply [flat|nested] 90+ messages in thread
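A rough sketch of the batched lookup-then-copy shape Kent is
describing, using the folio_batch helpers; this is the idea only, not
the actual generic_file_buffered_read() rework, and it elides
readahead, uptodate checks, and holes in the file:

    #include <linux/pagemap.h>
    #include <linux/pagevec.h>
    #include <linux/uio.h>

    /* One batched page-cache walk (one tree traversal, one ref per
     * folio), then plain copies with no locks in the inner loop. */
    static ssize_t batched_copy(struct address_space *mapping,
                                struct iov_iter *iter, loff_t pos)
    {
            struct folio_batch fbatch;
            pgoff_t index = pos >> PAGE_SHIFT;
            ssize_t copied = 0;
            unsigned int i;

            folio_batch_init(&fbatch);
            filemap_get_folios(mapping, &index, (pgoff_t)-1, &fbatch);

            for (i = 0; i < folio_batch_count(&fbatch); i++) {
                    struct folio *folio = fbatch.folios[i];
                    size_t off = pos & (folio_size(folio) - 1);
                    size_t n = folio_size(folio) - off;

                    n = copy_folio_to_iter(folio, off, n, iter);
                    if (!n)
                            break;          /* usercopy faulted */
                    pos += n;
                    copied += n;
                    if (!iov_iter_count(iter))
                            break;
            }
            folio_batch_release(&fbatch);
            return copied;
    }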
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25  5:18 ` Kent Overstreet
@ 2024-02-25  6:04   ` Kent Overstreet
  0 siblings, 0 replies; 90+ messages in thread
From: Kent Overstreet @ 2024-02-25  6:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
    linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
    Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote:
> On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> Before large folios, we had people very much bottlenecked by 4k page
> overhead on sequential IO; my customer/sponsor was one of them.
>
> Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
> profiles and looked at the filemap.c code it wasn't hard to see why;
> we'd walk a radix tree, do an atomic op (get the page), then do a 4k
> usercopy... hence the work I did to break up
> generic_file_buffered_read() and vectorize it, which was a huge
> improvement.
>
> It's definitely less of a factor post large folios and when we're
> talking about workloads that don't fit in cache, but I always wanted
> to do a generic version of the vectorized write path that btrfs and
> bcachefs have.

to expound further, our buffered io performance really is crap vs.
direct in lots of real world scenarios, and what was going on in
generic_file_buffered_read() was just one instance of a larger theme -
walking data structures, taking locks/atomics/barriers, then doing
work on the page/folio with cacheline bounces, in a loop - lots of
places where batching/vectorizing would help a lot but it tends to be
insufficient.

i had patches that went further than the generic_file_buffered_read()
rework to vectorize add_to_page_cache_lru(), and that was another
significant improvement. the pagecache lru operations were another hot
spot...

willy and I at one point were spitballing getting rid of the linked
list for a dequeue, more for getting rid of the list_head in struct
page/folio and replacing it with a single size_t index, but it'd open
up more vectorizing possibilities

i give willy crap about the .readahead interface... the way we make
the filesystem code walk the xarray to get the folios instead of just
passing it a vector is stupid

folio_batch is stupid, it shouldn't be fixed size. there's no reason
for that to be a small fixed size array on the stack, the slub
fastpath has no atomic ops and doesn't disable preemption or
interrupts - it's _fast_. just use a darray and vectorize the whole
operation

but that wouldn't be the big gains, bigger would be hunting down all
the places that aren't vectorized and should be. i haven't reviewed
the recent .writepages work christoph et al. are doing; if that's
properly vectorized now, that'll help

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-25 5:18 ` Kent Overstreet 2024-02-25 6:04 ` Kent Overstreet @ 2024-02-25 13:10 ` Matthew Wilcox 2024-02-25 17:03 ` Linus Torvalds 2024-02-25 17:32 ` Kent Overstreet 1 sibling, 2 replies; 90+ messages in thread From: Matthew Wilcox @ 2024-02-25 13:10 UTC (permalink / raw) To: Kent Overstreet Cc: Linus Torvalds, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote: > Before large folios, we had people very much bottlenecked by 4k page > overhead on sequential IO; my customer/sponsor was one of them. > > Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the > profiles and looked at the filemap.c code it wasn't hard to see why; > we'd walk a radix tree, do an atomic op (get the page), then do a 4k > usercopy... hence the work I did to break up > generic_file_buffered_read() and vectorize it, which was a huge > improvement. There's also the small random 64 byte read case that we haven't optimised for yet. That also bottlenecks on the page refcount atomic op. The proposed solution to that was double-copy; look up the page without bumping its refcount, copy to a buffer, look up the page again to be sure it's still there, copy from the buffer to userspace. Except that can go wrong under really unlikely circumstances. Look up the page, page gets freed, page gets reallocated to slab, we copy sensitive data from it, page gets freed again, page gets reallocated to the same spot in the file (!), lookup says "yup the same page is there". We'd need a seqcount or something to be sure the page hasn't moved. ^ permalink raw reply [flat|nested] 90+ messages in thread
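A sketch of the lookup-copy-revalidate read Willy describes, with the
caveat he notes made explicit; the simple recheck below is exactly the
part that is insufficient without something like a seqcount, and it
assumes lowmem mappings and an in-bounds offset/length:

    #include <linux/pagemap.h>
    #include <linux/rcupdate.h>
    #include <linux/string.h>

    /* Copy from the page cache without bumping the folio refcount,
     * then re-check the slot. Vulnerable to the free/realloc/re-add
     * ABA race described above; illustration only. */
    static bool read_no_ref(struct address_space *mapping, pgoff_t index,
                            size_t offset, size_t len, char *buf)
    {
            struct folio *folio;
            bool ok = false;

            rcu_read_lock();
            folio = xa_load(&mapping->i_pages, index);
            if (folio && !xa_is_value(folio)) {
                    memcpy(buf, (char *)folio_address(folio) + offset, len);
                    /* Same folio still at this index? Not enough to
                     * rule out the reallocation race. */
                    ok = (xa_load(&mapping->i_pages, index) == folio);
            }
            rcu_read_unlock();
            return ok;
    }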
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 13:10 ` Matthew Wilcox
@ 2024-02-25 17:03   ` Linus Torvalds
  2024-02-25 21:14     ` Matthew Wilcox
  2024-02-25 21:29     ` Kent Overstreet
  1 sibling, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-25 17:03 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Kent Overstreet, Luis Chamberlain, lsf-pc, linux-fsdevel,
    linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
    Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@infradead.org> wrote:
>
> There's also the small random 64 byte read case that we haven't optimised
> for yet. That also bottlenecks on the page refcount atomic op.
>
> The proposed solution to that was double-copy; look up the page without
> bumping its refcount, copy to a buffer, look up the page again to be
> sure it's still there, copy from the buffer to userspace.

Please stop the cray-cray.

Yes, cache dirtying is expensive. But you don't actually have
cacheline ping-pong, because you don't have lots of different CPU's
hammering the same page cache page in any normal circumstances. So the
really expensive stuff just doesn't exist.

I think you've been staring at profiles too much. In instruction-level
profiles, the atomic ops stand out a lot. But that's at least partly
artificial - they are a serialization point on x86, so things get
accounted to them. So they tend to be the collection point for
everything around them in an OoO CPU.

Yes, atomics are bad. But double buffering is worse, and only looks
good if you have some artificial benchmark that does some single-byte
hot-cache read in a loop.

In fact, I get the strong feeling that the complaints come from people
who have looked at bad microbenchmarks a bit too much. People who have
artificially removed the *real* costs by putting their data on a
ramdisk, and then run a microbenchmark on this artificial setup.

So you have a make-believe benchmark on a make-believe platform, and
you may have started out with the best of intentions ("what are the
limits"), but at some point you took a wrong turn, and turned that
"what are the limits of performance" question into an
instruction-level profile and tried to mis-optimize the limits,
instead of realizing that that is NOT THE POINT of a "what are the
limits" question.

The point of doing limit analysis is not to optimize the limit. It's
to see how close you are to that limit in real loads.

And I pretty much guarantee that you aren't close to those limits on
any real loads.

Before filesystem people start doing crazy things like double
buffering to do RCU reading of the page cache, you need to look
yourself in the mirror.

For example, the fact that Kent complains about the page cache and
talks about large folios is completely ludicrous. I've seen the
benchmarks of real loads. Kent - you're not close to any limits, you
are often a factor of two to five off other filesystems. We're not
talking "a few percent", and we're not talking "the atomics are
hurting".

So people: wake up and smell the coffee. Don't optimize based off
profiles of micro-benchmarks on made up platforms. That's for seeing
where the limits are.

And YOU ARE NOT EVEN CLOSE.

            Linus

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-25 17:03 ` Linus Torvalds @ 2024-02-25 21:14 ` Matthew Wilcox 2024-02-25 23:45 ` Linus Torvalds 2024-02-25 21:29 ` Kent Overstreet 1 sibling, 1 reply; 90+ messages in thread From: Matthew Wilcox @ 2024-02-25 21:14 UTC (permalink / raw) To: Linus Torvalds Cc: Kent Overstreet, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote: > I think you've been staring at profiles too much. In instruction-level > profiles, the atomic ops stand out a lot. But that's at least partly > artificial - they are a serialization point on x86, so things get > accounted to them. So they tend to be the collection point for > everything around them in an OoO CPU. > > Yes, atomics are bad. But double buffering is worse, and only looks > good if you have some artificial benchmark that does some single-byte > hot-cache read in a loop. Not artificial; this was a real customer with a real workload. I don't know how much about it I can discuss publically, but my memory of it was a system writing a log with 64 byte entries, millions of entries per second. Occasionally the system would have to go back and look at an entry in the last few seconds worth of data (so it would still be in the page cache). This customer was quite savvy, so they actually implemented and tested the lookup-copy-lookup-again algorithm in their custom kernel, and saw a speedup from it. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-25 21:14 ` Matthew Wilcox @ 2024-02-25 23:45 ` Linus Torvalds 2024-02-26 1:02 ` Kent Overstreet 0 siblings, 1 reply; 90+ messages in thread From: Linus Torvalds @ 2024-02-25 23:45 UTC (permalink / raw) To: Matthew Wilcox Cc: Kent Overstreet, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sun, 25 Feb 2024 at 13:14, Matthew Wilcox <willy@infradead.org> wrote: > > Not artificial; this was a real customer with a real workload. I don't > know how much about it I can discuss publically, but my memory of it was a > system writing a log with 64 byte entries, millions of entries per second. > Occasionally the system would have to go back and look at an entry in the > last few seconds worth of data (so it would still be in the page cache). Honestly, that should never hit any kind of contention on the page cache. Unless they did something else odd, that load should be entirely serialized by the POSIX "atomic write" requirements and the "inode_lock(inode)" that writes take. So it would end up literally being just one cache miss - and if you do things across CPU's and have cachelines moving around, that inode lock would be the bigger offender in that it is the one that would see any contention. Now, *that* is locking that I despise, much more than the page cache lock. It serializes unrelated writes to different areas, and the direct-IO people instead said "we don't care about POSIX" and did concurrent writes without it. That said, I do wonder if we could take advantage of the fact that we have the inode lock, and just make page eviction take that lock too (possibly in shared form). At that point, you really could just say "no need to increment the reference count, because we can do writes knowing that the mapping pages are stable". Not pretty, but we could possibly at least take advantage of the horrid other ugliness of the inode locking and POSIX rules that nobody really wants. Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-25 23:45										` Linus Torvalds
@ 2024-02-26  1:02										` Kent Overstreet
  2024-02-26  1:32										` Linus Torvalds
  0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-26 1:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 03:45:47PM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 13:14, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Not artificial; this was a real customer with a real workload.  I don't
> > know how much about it I can discuss publicly, but my memory of it was a
> > system writing a log with 64 byte entries, millions of entries per second.
> > Occasionally the system would have to go back and look at an entry in the
> > last few seconds' worth of data (so it would still be in the page cache).
>
> Honestly, that should never hit any kind of contention on the page cache.
>
> Unless they did something else odd, that load should be entirely
> serialized by the POSIX "atomic write" requirements and the
> "inode_lock(inode)" that writes take.
>
> So it would end up literally being just one cache miss - and if you do
> things across CPUs and have cachelines moving around, that inode lock
> would be the bigger offender in that it is the one that would see any
> contention.
>
> Now, *that* is locking that I despise, much more than the page cache
> lock. It serializes unrelated writes to different areas, and the
> direct-IO people instead said "we don't care about POSIX" and did
> concurrent writes without it.

We could satisfy the POSIX atomic writes rule by just having a properly
vectorized buffered write path, no need for the inode lock - it really
should just be extending writes that have to hit the inode lock, same as
O_DIRECT.

(whenever people bring up range locks, I keep trying to tell them - we
already have that in the form of the folio lock, if you'd just use it
properly...)

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:02										` Kent Overstreet
@ 2024-02-26  1:32										` Linus Torvalds
  2024-02-26  1:58										` Kent Overstreet
  2024-02-26  2:50										` Al Viro
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-26 1:32 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 17:03, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> We could satisfy the POSIX atomic writes rule by just having a properly
> vectorized buffered write path, no need for the inode lock - it really
> should just be extending writes that have to hit the inode lock, same as
> O_DIRECT.
>
> (whenever people bring up range locks, I keep trying to tell them - we
> already have that in the form of the folio lock, if you'd just use it
> properly...)

Sadly, that is *technically* not proper.

IOW, I actually agree with you that the folio lock is sufficient, and
several filesystems do too.

BUT.

Technically, the POSIX requirement is that writes are "all or nothing"
visible, and so ext4, for example, will have the whole write operation
inside the inode_lock.

Note that this is not some kind of locking requirement coming from some
ext4 locking rule: it only does this for the buffered path, the DIO
path happily says "if I'm not extending the size of the inode, I'll
just take the lock for shared access".

So this is literally only about buffered writes, and the atomicity
within a folio is already dealt with by the folio lock (ext4 uses
"generic_perform_write()" to do the actual writing part, which does
*not* care about that lock).

IOW, the whole inode lock is pointless for any real uses, but exists
because _technically_ POSIX does not allow reads to see partial writes
(ie if you see *one* page change, you have to see all of them change,
so per-folio locking is not sufficient - you technically need to lock
the whole operation against readers too, not just writers).

Of course, in real life absolutely nobody cares, and you can see
partial writes in many other ways, so this is really a completely
"because of paper standards" serialization.

Several other filesystems do *not* do that serialization, and as
mentioned, the DIO paths also just said "hey, this isn't a normal
write, so we'll ignore POSIX because technically we can".

But to take a different example, ext2 just calls
generic_file_write_iter() *without* taking the inode lock, and does
locking one page at a time. As far as I know, nobody ever really
complained.

(It's not just ext2. It's all the old filesystems: anything that uses
generic_file_write_iter() without doing inode_lock/unlock around it,
which is actually most of them).

And I find that silly locking rule a bit offensive, because it
literally serializes accesses that shouldn't be serialized. IOW, it
*should* be a "you do stupid things, you get what you deserve".
Instead, it's a "everybody pays the price for giving stupid things a
pointless semantic guarantee".

Afaik, it has a purely historical reason for it - all the original
UNIX read/write calls used a per-inode lock, so they were all
serialized whether you liked it or not.

Oh well. It's a pet peeve of mine.

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread
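For reference, the buffered-write serialization under discussion is
roughly this shape - a simplified rendering of generic_file_write_iter()
with error handling trimmed, not the exact upstream code:

ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock(inode);		/* serializes ALL buffered writers */
	ret = generic_write_checks(iocb, from);
	if (ret > 0)
		ret = __generic_file_write_iter(iocb, from);
	inode_unlock(inode);

	if (ret > 0)
		ret = generic_write_sync(iocb, ret);
	return ret;
}

__generic_file_write_iter() is where file_remove_privs() and the
per-folio work live, which is the detail Al picks up on a few messages
below.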
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:32										` Linus Torvalds
@ 2024-02-26  1:58										` Kent Overstreet
  2024-02-26  2:06										` Kent Overstreet
  2024-02-26  2:34										` Linus Torvalds
  2024-02-26  2:50										` Al Viro
  1 sibling, 2 replies; 90+ messages in thread
From: Kent Overstreet @ 2024-02-26 1:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 05:32:14PM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 17:03, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > We could satisfy the POSIX atomic writes rule by just having a properly
> > vectorized buffered write path, no need for the inode lock - it really
> > should just be extending writes that have to hit the inode lock, same as
> > O_DIRECT.
> >
> > (whenever people bring up range locks, I keep trying to tell them - we
> > already have that in the form of the folio lock, if you'd just use it
> > properly...)
>
> Sadly, that is *technically* not proper.
>
> IOW, I actually agree with you that the folio lock is sufficient, and
> several filesystems do too.
>
> BUT.
>
> Technically, the POSIX requirement is that writes are "all or nothing"
> visible, and so ext4, for example, will have the whole write operation
> inside the inode_lock.

...

> (It's not just ext2. It's all the old filesystems: anything that uses
> generic_file_write_iter() without doing inode_lock/unlock around it,
> which is actually most of them).

According to my reading just now, ext4 and btrfs (as well as bcachefs)
also don't take the inode lock in the read path - xfs is the only one
that does.

Perhaps we should just lift it to the VFS and make it controllable as a
mount/open option; as nice a property as it is in theory, I can't see
myself wanting to make everyone pay for it if ext4 and btrfs aren't
doing it and no one's screaming.

I think write vs. write consistency is the more interesting case; the
question there is whether falling back to the inode lock, when we can't
lock all the folios simultaneously, actually works.

Consider thread A doing a 1 MB write, and it ends up in the path where
it locks the inode and it's allowed to write one folio at a time.  Then
you have thread B doing some form of overlapping write, but without the
inode lock, and with all the folios locked simultaneously.

I think everything works; we need the end result to be consistent with
some total ordering of all the writes.  IOW, thread B's write (if fully
within thread A's write) should be fully overwritten or not at all, and
that clearly is the case.  But there may be situations involving more
than two threads where things get weirder.

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:58										` Kent Overstreet
@ 2024-02-26  2:06										` Kent Overstreet
  0 siblings, 0 replies; 90+ messages in thread
From: Kent Overstreet @ 2024-02-26 2:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner,
	Suren Baghdasaryan

On Sun, Feb 25, 2024 at 08:58:28PM -0500, Kent Overstreet wrote:
> I think everything works; we need the end result to be consistent with
> some total ordering of all the writes.  IOW, thread B's write (if fully
> within thread A's write) should be fully overwritten or not at all, and
> that clearly is the case.  But there may be situations involving more
> than two threads where things get weirder.

Not sure if there's anything interesting or deep here, but these
situations where sometimes we've got a global lock and sometimes we've
got little locks do seem to come up - it reminded me of Suren's work on
the mmap sem recently, and I think I've seen it in other places.

Feels like it should at least have a name...

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:58										` Kent Overstreet
  2024-02-26  2:06										` Kent Overstreet
@ 2024-02-26  2:34										` Linus Torvalds
  1 sibling, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-26 2:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel,
	linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner,
	Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 17:58, Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> According to my reading just now, ext4 and btrfs (as well as bcachefs)
> also don't take the inode lock in the read path - xfs is the only one
> that does.

Yeah, I should have remembered that detail - DaveC has pointed out at
some point how other filesystems don't actually honor the whole "all
or nothing visible to read".

And I was actually wrong about the common cases like ext2 - they use
generic_file_write_iter(), which does take that inode lock, and I was
confused with generic_perform_write() (which does not).

It was always the read side that didn't care, as you point out. It's
been some time since I looked at that.

But as mentioned, nobody has actually ever shown any real interest in
caring about the lack of POSIX technicality.

> I think write vs. write consistency is the more interesting case; the
> question there is whether falling back to the inode lock, when we can't
> lock all the folios simultaneously, actually works.

I really don't think the write-write consistency is all that
interesting either, and it really does hurt. If you're some toy
database that would love to use buffered writes on just a DB file,
that "no concurrent writes" can hurt a lot.

So then people say "use DIO", but that has its own issues...

There is one obvious special case, and I think it's the primary one
why we end up having that inode_lock: O_APPEND or any other write
extending the size of the file. THAT one obviously has to work right,
and that's the case when multiple writers actually do want to get
write-write consistency, and where it makes total sense to serialize
them all.

That's the one case that even DIO cares about.

In the other cases, it's hard to argue that "one or the other wins the
whole range" is seriously hugely better than "one or the other wins at
some granularity". What the hell are you doing overlapping write
ranges for if you have a "one or the other" mentality?

Of course, maybe in practice it would be fine to do the "lock all the
folios, with the fallback being the inode lock" - and we could even
start with "all" being a pretty small number (perhaps starting with
"one" ;^)

                Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread
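The "lock all the folios, fall back to the inode lock" idea might look
something like the sketch below.  All names are hypothetical; folios
are taken in ascending index order so concurrent lockers can't
deadlock; large folios, reference-count cleanup on the success path,
and the open question of whether the two modes really exclude each
other (Kent's multi-thread worry above) are all glossed over.  Note
that recent kernels return ERR_PTR() from __filemap_get_folio() on
failure:

/* Sketch: per-range folio locking with an inode-lock fallback. */
static int lock_write_range(struct address_space *mapping,
			    pgoff_t first, pgoff_t last,
			    struct folio **folios, bool *fallback)
{
	pgoff_t index;
	int n = 0;

	*fallback = false;
	for (index = first; index <= last; index++, n++) {
		struct folio *folio = __filemap_get_folio(mapping, index,
				FGP_LOCK | FGP_CREAT | FGP_NOWAIT,
				mapping_gfp_mask(mapping));
		if (IS_ERR(folio)) {
			/* couldn't get them all: unwind, caller takes inode_lock() */
			while (n--) {
				folio_unlock(folios[n]);
				folio_put(folios[n]);
			}
			*fallback = true;
			return 0;
		}
		folios[n] = folio;
	}
	return n;
}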
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  1:32										` Linus Torvalds
  2024-02-26  1:58										` Kent Overstreet
@ 2024-02-26  2:50										` Al Viro
  2024-02-26 17:17										` Linus Torvalds
  1 sibling, 1 reply; 90+ messages in thread
From: Al Viro @ 2024-02-26 2:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, Feb 25, 2024 at 05:32:14PM -0800, Linus Torvalds wrote:
> But to take a different example, ext2 just calls
> generic_file_write_iter() *without* taking the inode lock, and does
> locking one page at a time. As far as I know, nobody ever really
> complained.

Uh?  __generic_file_write_iter() doesn't take inode lock, but
generic_file_write_iter() does.  O_APPEND handling aside, there's also
file_remove_privs() in there and it does need that lock.

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26  2:50										` Al Viro
@ 2024-02-26 17:17										` Linus Torvalds
  2024-02-26 21:07										` Matthew Wilcox
  2024-02-26 22:46										` Linus Torvalds
  0 siblings, 2 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-26 17:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Sun, 25 Feb 2024 at 18:49, Al Viro <viro@kernel.org> wrote:
>
> Uh?  __generic_file_write_iter() doesn't take inode lock, but
> generic_file_write_iter() does.

Yes, as I replied to Kent later, I mis-remembered the details (together
with a too-quick grep). It's the reading side that just doesn't care,
and which causes us to not actually give the POSIX guarantees anyway.

> O_APPEND handling aside, there's
> also file_remove_privs() in there and it does need that lock.

Fair enough, but that's such a rare special case that it really doesn't
merit the lock on every write.

What at least the ext4 DIO code does is something like "look at the
write position and the inode size, and decide optimistically whether to
take the lock shared or not". Then it re-checks it after potentially
taking it for a read (in case the inode size has changed), and might
decide to go for a write lock after all.

And I think we could fairly trivially do something similar on the
regular write side. Yes, it needs to check SUID/SGID bits in addition
to the inode size change, I guess (I don't think the ext4 dio code
does, but my quick grepping might once again be incomplete).

Anyway, DaveC is obviously also right that for the "actually need to do
writeback" case, our writeback is currently intentionally throttled,
which is why the benchmark by Luis shows that "almost two orders of
magnitude" slowdown with buffered writeback.

That's probably mainly an effect of having a backing store with no
delays, but still..

However, the reason I dislike our write-side locking is that it
actually serializes even the entirely cached case. Now, writing
concurrently to the same inode is obviously strange, but it's a common
thing for databases. And while all the *serious* ones use DIO, I think
the buffered case really should do better.

Willy - tangential side note: I looked closer at the issue that you
reported (indirectly) with the small reads during heavy write activity.

Our _reading_ side is very optimized and has none of the write-side
oddities that I can see, and we just have

  filemap_read ->
    filemap_get_pages ->
      filemap_get_read_batch ->
        folio_try_get_rcu()

and there is no page locking or other locking involved (assuming the
page is cached and marked uptodate etc, of course).

So afaik, it really is just that *one* atomic access (and the matching
page ref decrement afterwards).

We could easily do all of this without getting any ref to the page at
all if we did the page cache release with RCU (and the user copy with
"copy_to_user_atomic()"). Honestly, anything else looks like a complete
disaster. For tiny reads, a temporary buffer sounds ok, but really
*only* for tiny reads where we could have that buffer on the stack.

Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
for to that degree?

In contrast, the RCU-delaying of the page cache might be a good idea in
general. We've had other situations where that would have been nice.
The main worry would be low-memory situations, I suspect.

The "tiny read" optimization smells like a benchmark thing to me. Even
with the cacheline possibly bouncing, the system call overhead for tiny
reads (particularly with all the mitigations) should be orders of
magnitude higher than two atomic accesses.

               Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 17:17										` Linus Torvalds
@ 2024-02-26 21:07										` Matthew Wilcox
  2024-02-26 21:17										` Kent Overstreet
  1 sibling, 1 reply; 90+ messages in thread
From: Matthew Wilcox @ 2024-02-26 21:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Kent Overstreet, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> Willy - tangential side note: I looked closer at the issue that you
> reported (indirectly) with the small reads during heavy write
> activity.
>
> Our _reading_ side is very optimized and has none of the write-side
> oddities that I can see, and we just have
>
>   filemap_read ->
>     filemap_get_pages ->
>       filemap_get_read_batch ->
>         folio_try_get_rcu()
>
> and there is no page locking or other locking involved (assuming the
> page is cached and marked uptodate etc, of course).
>
> So afaik, it really is just that *one* atomic access (and the matching
> page ref decrement afterwards).

Yep, that was what the customer reported on their ancient kernel, and
we at least didn't make that worse ...

> We could easily do all of this without getting any ref to the page at
> all if we did the page cache release with RCU (and the user copy with
> "copy_to_user_atomic()"). Honestly, anything else looks like a
> complete disaster. For tiny reads, a temporary buffer sounds ok, but
> really *only* for tiny reads where we could have that buffer on the
> stack.
>
> Are tiny reads (handwaving: 100 bytes or less) really worth optimizing
> for to that degree?
>
> In contrast, the RCU-delaying of the page cache might be a good idea
> in general. We've had other situations where that would have been
> nice. The main worry would be low-memory situations, I suspect.
>
> The "tiny read" optimization smells like a benchmark thing to me. Even
> with the cacheline possibly bouncing, the system call overhead for
> tiny reads (particularly with all the mitigations) should be orders of
> magnitude higher than two atomic accesses.

Ah, good point about the $%^&^*^ mitigations.  This was pre mitigations.
I suspect that this customer would simply disable them; afaik the machine
is an appliance and one interacts with it purely by sending transactions
to it (it's not even an SQL system, much less a "run arbitrary javascript"
kind of system).  But that makes it even more special case, inapplicable
to the majority of workloads and closer to smelling like a benchmark.

I've thought about and rejected RCU delaying of the page cache in the
past.  With the majority of memory in anon memory & file memory, it just
feels too risky to have so much memory waiting to be reused.  We could
also improve gup-fast if we could rely on RCU freeing of anon memory.
Not sure what workloads might benefit from that, though.

It'd be cute if we could restrict free memory to be only reallocatable
by the process that had previously allocated it until an RCU grace
period had passed.  That way you could still snoop, but only on
yourself, which wouldn't be all that exciting.  Doubt it'd be worth the
effort of setting up per-mm freelists, although it would allow us to
skip zeroing the page under some circumstances ...

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:07										` Matthew Wilcox
@ 2024-02-26 21:17										` Kent Overstreet
  2024-02-26 21:19										` Kent Overstreet
  0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-26 21:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, Feb 26, 2024 at 09:07:51PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 26, 2024 at 09:17:33AM -0800, Linus Torvalds wrote:
> [ ... ]
> > In contrast, the RCU-delaying of the page cache might be a good idea
> > in general. We've had other situations where that would have been
> > nice. The main worry would be low-memory situations, I suspect.
> [ ... ]
>
> I've thought about and rejected RCU delaying of the page cache in the
> past.  With the majority of memory in anon memory & file memory, it just
> feels too risky to have so much memory waiting to be reused.  We could
> also improve gup-fast if we could rely on RCU freeing of anon memory.
> Not sure what workloads might benefit from that, though.

RCU allocating and freeing of memory can already be fairly significant
depending on workload, and I'd expect that to grow - we really just need
a way for reclaim to kick RCU when needed (and probably add a percpu
counter for "amount of memory stranded until the next RCU grace
period").

^ permalink raw reply	[flat|nested] 90+ messages in thread
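A sketch of the counter being proposed: wrap the free in a small
tracking struct so the byte count only drops once the grace period has
actually elapsed.  All names here are hypothetical, counter
initialization is elsewhere, and the extra tracking allocation is
precisely the kind of overhead questioned later in the thread:

/* initialized elsewhere with percpu_counter_init() */
static struct percpu_counter rcu_stranded_bytes;

struct tracked_free {
	struct rcu_head rcu;
	void *ptr;
	size_t size;
};

static void tracked_free_cb(struct rcu_head *rcu)
{
	struct tracked_free *tf = container_of(rcu, struct tracked_free, rcu);

	percpu_counter_sub(&rcu_stranded_bytes, tf->size);
	kfree(tf->ptr);
	kfree(tf);
}

/* hypothetical kfree_rcu() variant that accounts stranded bytes */
static void kfree_rcu_accounted(void *ptr)
{
	struct tracked_free *tf = kmalloc(sizeof(*tf), GFP_NOWAIT);

	if (!tf) {
		synchronize_rcu();	/* degraded fallback; sketch only */
		kfree(ptr);
		return;
	}
	tf->ptr = ptr;
	tf->size = ksize(ptr);
	percpu_counter_add(&rcu_stranded_bytes, tf->size);
	call_rcu(&tf->rcu, tracked_free_cb);
}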
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:17										` Kent Overstreet
@ 2024-02-26 21:19										` Kent Overstreet
  2024-02-26 21:55										` Paul E. McKenney
  0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-26 21:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc,
	linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
	Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner,
	Paul E. McKenney

+cc Paul

On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> [ ... ]
> RCU allocating and freeing of memory can already be fairly significant
> depending on workload, and I'd expect that to grow - we really just need
> a way for reclaim to kick RCU when needed (and probably add a percpu
> counter for "amount of memory stranded until the next RCU grace
> period").

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:19										` Kent Overstreet
@ 2024-02-26 21:55										` Paul E. McKenney
  2024-02-26 23:29										` Kent Overstreet
  0 siblings, 1 reply; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-26 21:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote:
> +cc Paul
>
> On Mon, Feb 26, 2024 at 04:17:19PM -0500, Kent Overstreet wrote:
> [ ... ]
> > RCU allocating and freeing of memory can already be fairly significant
> > depending on workload, and I'd expect that to grow - we really just need
> > a way for reclaim to kick RCU when needed (and probably add a percpu
> > counter for "amount of memory stranded until the next RCU grace
> > period").

There are some APIs for that, though they are sharp-edged and mainly
intended for rcutorture, and there are some hooks for a CI Kconfig
option called RCU_STRICT_GRACE_PERIOD that could be organized into
something useful.

Of course, if there is a long-running RCU reader, there is nothing
RCU can do.  By definition, it must wait on all pre-existing readers,
no exceptions.

But my guess is that you instead are thinking of memory-exhaustion
emergencies where you would like RCU to burn more CPU than usual to
reduce grace-period latency; there are definitely things that can be
done.

I am sure that there are more questions that I should ask, but the one
that comes immediately to mind is "Is this API call an occasional thing,
or does RCU need to tolerate many CPUs hammering it frequently?"
Either answer is fine, I just need to know.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 21:55										` Paul E. McKenney
@ 2024-02-26 23:29										` Kent Overstreet
  2024-02-27  0:05										` Paul E. McKenney
  2024-02-27  0:43										` Dave Chinner
  0 siblings, 2 replies; 90+ messages in thread
From: Kent Overstreet @ 2024-02-26 23:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote:
> [ ... ]
> There are some APIs for that, though they are sharp-edged and mainly
> intended for rcutorture, and there are some hooks for a CI Kconfig
> option called RCU_STRICT_GRACE_PERIOD that could be organized into
> something useful.
>
> Of course, if there is a long-running RCU reader, there is nothing
> RCU can do.  By definition, it must wait on all pre-existing readers,
> no exceptions.
>
> But my guess is that you instead are thinking of memory-exhaustion
> emergencies where you would like RCU to burn more CPU than usual to
> reduce grace-period latency; there are definitely things that can be
> done.
>
> I am sure that there are more questions that I should ask, but the one
> that comes immediately to mind is "Is this API call an occasional thing,
> or does RCU need to tolerate many CPUs hammering it frequently?"
> Either answer is fine, I just need to know.  ;-)

Well, we won't want it getting hammered on continuously - we should be
able to tune reclaim so that doesn't happen.

I think getting numbers on the amount of memory stranded waiting for RCU
is probably first order of business - minor tweak to kfree_rcu() et al.
for that; there are APIs they can query to maintain that counter.

Then, we can add a heuristic threshold somewhere, something like

	if (rcu_stranded * multiplier > reclaimable_memory)
		kick_rcu()

^ permalink raw reply	[flat|nested] 90+ messages in thread
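Wired up to the expedite API alluded to above, that kick might take
roughly this shape from a sleepable reclaim context.  A sketch only:
rcu_expedite_gp() and rcu_unexpedite_gp() are real kernel APIs but, as
noted, sharp-edged, and the multiplier is a made-up tunable:

#define RCU_STRANDED_MULT	2	/* made-up tunable */

static void maybe_kick_rcu(unsigned long reclaimable_pages)
{
	s64 stranded = percpu_counter_read_positive(&rcu_stranded_bytes);

	if (stranded * RCU_STRANDED_MULT >
	    (s64)reclaimable_pages << PAGE_SHIFT) {
		/* temporarily run grace periods expedited */
		rcu_expedite_gp();
		synchronize_rcu();	/* wait one (now fast) grace period */
		rcu_unexpedite_gp();
	}
}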
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-26 23:29										` Kent Overstreet
@ 2024-02-27  0:05										` Paul E. McKenney
  2024-02-27  0:29										` Kent Overstreet
  0 siblings, 1 reply; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-27 0:05 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> [ ... ]
> Well, we won't want it getting hammered on continuously - we should be
> able to tune reclaim so that doesn't happen.
>
> I think getting numbers on the amount of memory stranded waiting for RCU
> is probably first order of business - minor tweak to kfree_rcu() et al.
> for that; there are APIs they can query to maintain that counter.

We can easily tell you the number of blocks of memory waiting to be
freed.  But RCU does not know their size.  Yes, we could ferret this out
on each call to kfree_rcu(), but that might not be great for
performance.  We could traverse the lists at runtime, but such traversal
must be done with interrupts disabled, which is also not great.

> Then, we can add a heuristic threshold somewhere, something like
>
>	if (rcu_stranded * multiplier > reclaimable_memory)
>		kick_rcu()

If it is a heuristic anyway, it sounds best to base the heuristic on
the number of objects rather than their aggregate size.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:05										` Paul E. McKenney
@ 2024-02-27  0:29										` Kent Overstreet
  2024-02-27  0:55										` Paul E. McKenney
  0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-27 0:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote:
> [ ... ]
> We can easily tell you the number of blocks of memory waiting to be
> freed.  But RCU does not know their size.  Yes, we could ferret this out
> on each call to kfree_rcu(), but that might not be great for
> performance.  We could traverse the lists at runtime, but such traversal
> must be done with interrupts disabled, which is also not great.
>
> > Then, we can add a heuristic threshold somewhere, something like
> >
> >	if (rcu_stranded * multiplier > reclaimable_memory)
> >		kick_rcu()
>
> If it is a heuristic anyway, it sounds best to base the heuristic on
> the number of objects rather than their aggregate size.

I don't think that'll really work given that object size can vary from
< 100 bytes all the way up to 2MB hugepages.  The shrinker API works
that way and I positively hate it; it's really helpful for introspection
and debuggability later to give good human understandable units to this
stuff.

And __ksize() is pretty cheap, and I think there might be room in struct
slab to stick the object size there instead of getting it from the slab
cache - and folio_size() is cheaper still.

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:29										` Kent Overstreet
@ 2024-02-27  0:55										` Paul E. McKenney
  2024-02-27  1:08										` Kent Overstreet
  0 siblings, 1 reply; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-27 0:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 07:29:04PM -0500, Kent Overstreet wrote:
> On Mon, Feb 26, 2024 at 04:05:37PM -0800, Paul E. McKenney wrote:
> [ ... ]
> > If it is a heuristic anyway, it sounds best to base the heuristic on
> > the number of objects rather than their aggregate size.
>
> I don't think that'll really work given that object size can vary from
> < 100 bytes all the way up to 2MB hugepages.  The shrinker API works
> that way and I positively hate it; it's really helpful for introspection
> and debuggability later to give good human understandable units to this
> stuff.

You might well be right, but let's please try it before adding overhead
to kfree_rcu() and friends.  I bet it will prove to be good and
sufficient.

> And __ksize() is pretty cheap, and I think there might be room in struct
> slab to stick the object size there instead of getting it from the slab
> cache - and folio_size() is cheaper still.

On __ksize():

 * This should only be used internally to query the true size of allocations.
 * It is not meant to be a way to discover the usable size of an allocation
 * after the fact. Instead, use kmalloc_size_roundup().

Except that kmalloc_size_roundup() doesn't look like it is meant for
this use case.  On __ksize() being used only internally, I would not be
at all averse to kfree_rcu() and friends moving to mm.

The idea is for kfree_rcu() to invoke __ksize() when given slab memory
and folio_size() when given vmalloc() memory?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  0:55										` Paul E. McKenney
@ 2024-02-27  1:08										` Kent Overstreet
  2024-02-27  5:17										` Paul E. McKenney
  0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-27 1:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 04:55:29PM -0800, Paul E. McKenney wrote:
> [ ... ]
> You might well be right, but let's please try it before adding overhead
> to kfree_rcu() and friends.  I bet it will prove to be good and
> sufficient.
>
> On __ksize():
>
>  * This should only be used internally to query the true size of allocations.
>  * It is not meant to be a way to discover the usable size of an allocation
>  * after the fact. Instead, use kmalloc_size_roundup().
>
> Except that kmalloc_size_roundup() doesn't look like it is meant for
> this use case.  On __ksize() being used only internally, I would not be
> at all averse to kfree_rcu() and friends moving to mm.

__ksize() is the right helper to use for this; ksize() is "how much
usable memory", __ksize() is "how much does this occupy".

> The idea is for kfree_rcu() to invoke __ksize() when given slab memory
> and folio_size() when given vmalloc() memory?

__ksize() for slab memory, but folio_size() would be for page
allocations - actually, I think compound_order() is more appropriate
here, but that's willy's area.  IOW, for free_pages_rcu(), which AFAIK
we don't have yet but it looks like we're going to need.

I'm scanning through vmalloc.c and I don't think we have a helper yet to
query the allocation size - I can write one tomorrow, giving my brain a
rest today :)

^ permalink raw reply	[flat|nested] 90+ messages in thread
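Pulling the three cases above together, the size query might dispatch
like this.  A sketch: vmalloc_size() is the not-yet-written helper just
mentioned, and __ksize() is mm-internal, so this assumes kfree_rcu()
moves into mm as offered earlier in the thread:

/* Sketch: how many bytes does this allocation strand until the GP? */
static size_t rcu_obj_size(const void *ptr)
{
	struct folio *folio;

	if (is_vmalloc_addr(ptr))
		return vmalloc_size(ptr);	/* hypothetical helper */

	folio = virt_to_folio(ptr);
	if (folio_test_slab(folio))
		return __ksize(ptr);		/* slab object footprint */

	return folio_size(folio);		/* page/compound allocation */
}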
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-27  1:08										` Kent Overstreet
@ 2024-02-27  5:17										` Paul E. McKenney
  2024-02-27  6:21										` Kent Overstreet
  0 siblings, 1 reply; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-27 5:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain,
	lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
	Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
	Johannes Weiner

On Mon, Feb 26, 2024 at 08:08:17PM -0500, Kent Overstreet wrote:
> [ ... ]
> __ksize() for slab memory, but folio_size() would be for page
> allocations - actually, I think compound_order() is more appropriate
> here, but that's willy's area.  IOW, for free_pages_rcu(), which AFAIK
> we don't have yet but it looks like we're going to need.
>
> I'm scanning through vmalloc.c and I don't think we have a helper yet to
> query the allocation size - I can write one tomorrow, giving my brain a
> rest today :)

Again, let's give the straight count of blocks a try first.  I do see
that you feel that the added overhead is negligible, but zero added
overhead is even better.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 5:17 ` Paul E. McKenney
@ 2024-02-27 6:21 ` Kent Overstreet
2024-02-27 15:32 ` Paul E. McKenney
0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-27 6:21 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Mon, Feb 26, 2024 at 09:17:41PM -0800, Paul E. McKenney wrote:

[ ... ]

> Again, let's give the straight count of blocks a try first. I do see
> that you feel that the added overhead is negligible, but zero added
> overhead is even better.

How are you going to write a heuristic that works correctly both when the system is cycling through nothing but 2M hugepages, and nothing but 128 byte whatevers?

^ permalink raw reply [flat|nested] 90+ messages in thread
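Kent's objection is easy to quantify: the same object count corresponds to wildly different amounts of stranded memory at the two ends of the range he gives.

    10,000 objects x 128 bytes  ~=  1.3 MB
    10,000 objects x 2 MB       ~=  20 GB

A threshold keyed only to object count therefore fires either far too eagerly for small objects or far too late for hugepages; no single multiplier covers both.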
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 6:21 ` Kent Overstreet
@ 2024-02-27 15:32 ` Paul E. McKenney
2024-02-27 15:52 ` Kent Overstreet
2024-02-27 15:54 ` Matthew Wilcox
0 siblings, 2 replies; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-27 15:32 UTC (permalink / raw)
To: Kent Overstreet
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 01:21:05AM -0500, Kent Overstreet wrote:

[ ... ]

> How are you going to write a heuristic that works correctly both when
> the system is cycling through nothing but 2M hugepages, and nothing but
> 128 byte whatevers?

I could simply use the same general approach that I use within RCU itself, which currently has absolutely no idea how much memory (if any) each callback will free. Especially given that some callbacks free groups of memory blocks, while others free nothing. ;-)

Alternatively, we could gather statistics on the amount of memory freed by each callback and use that as an estimate.

But we should instead step back and ask exactly what we are trying to accomplish here, which just might be what Dave Chinner was getting at.

At a ridiculously high level, reclaim is looking for memory to free. Some read-only memory can often be dropped immediately on the grounds that its data can be read back in if needed. Other memory can only be dropped after being written out, which involves a delay. There are of course many other complications, but this will do for a start.

So, where does RCU fit in?

RCU fits in between the two. With memory awaiting RCU, there is no need to write anything out, but there is a delay. As such, memory waiting for an RCU grace period is similar to memory that is to be reclaimed after its I/O completes.

One complication, and a complication that we are considering exploiting, is that, unlike reclaimable memory waiting for I/O, we could often (but not always) have some control over how quickly RCU's grace periods complete. And we already do this programmatically by using the choice between synchronize_rcu() and synchronize_rcu_expedited(). The question is whether we should expedite normal RCU grace periods during reclaim, and if so, under what conditions.

You identified one potential condition, namely the amount of memory waiting to be reclaimed. One complication with this approach is that RCU has no idea how much memory each callback represents, and for call_rcu(), there is no way for it to find out. For kfree_rcu(), there are ways, but as you know, I am questioning whether those ways are reasonable from a performance perspective. But even if they are, we would be accepting more error from the memory waiting via call_rcu() than we would be accepting if we just counted blocks instead of bytes for kfree_rcu().

Let me reiterate that: The estimation error that you are objecting to for kfree_rcu() is completely and utterly unavoidable for call_rcu(). RCU callback functions do whatever their authors want, and we won't be analyzing their code to estimate bytes freed without some serious advances in code-analysis technology. Hence my perhaps otherwise inexplicable insistence on starting with block counts rather than byte counts.

Another complication surrounding estimating memory to be freed is that this memory can be in any of the following states:

1. Not yet associated with a specific grace period. Its CPU (or its rcuog kthread, as the case may be) must interact with the RCU core and assign a grace period. There are a few costly RCU_STRICT_GRACE_PERIOD tricks that can help here, usually involving IPIing a bunch of CPUs or awakening a bunch of rcuog kthreads.

2. Associated with a grace period that has not yet started. This grace period must of course be started, which usually cannot happen until the previous grace period completes. Which leads us to...

3. Associated with the current grace period. This is where the rcutorture forcing of quiescent states comes in.

4. Waiting to be invoked. This happens from either RCU_SOFTIRQ context or from rcuoc kthread context, and is of course impeded by any of the aforementioned measures to speed things up. Perhaps we could crank up the priority of the relevant ksoftirqd or rcuog/rcuoc kthreads, though this likely has some serious side effects. Besides, as far as I know, we don't mess with other process priorities for the benefit of reclaim, so why would we start with RCU?

Of these, #3 and #4 are normally the slowest, with #3 often being the slowest under light call_rcu()/kfree_rcu() load (or in the presence of slow RCU readers) and #4 often being the slowest under callback-flooding conditions. Now reclaim might cause callback flooding, but I don't recall seeing this. At least not to the same extent as userspace-induced callback flooding, for example, during "rm -rf" of a big filesystem tree with lots of small files on a fast device. So we likely focus on #3.

So we can speed RCU up, and we could use some statistics on the quantity of whatever is waiting for RCU. But is that the right approach?

Keep in mind that reclaim via writeback involves I/O delays. Now these delays are often much shorter than they used to be, courtesy of SSDs. But queueing is still a thing, as are limitations on write bandwidth, so those delays are still consequential. My experience is that reclaim spans seconds rather than milliseconds, let alone microseconds, and RCU grace periods are in the tens to hundreds of milliseconds. Or am I yet again showing my age?

So perhaps we should instead make RCU take "ongoing reclaim" into account when prodding reluctant grace periods. For example, RCU normally scans for idle CPUs every 3 milliseconds. Maybe it should speed that up to once per millisecond when reclaim is ongoing. And likewise for RCU's other grace-period-prodding heuristics.

In addition, RCU uses the number of callbacks queued on a particular CPU to instigate grace-period prodding. Maybe these heuristics should also depend on whether or not a reclaim is in progress.

Is there an easy, fast, and reliable way for RCU to determine whether or not a reclaim is ongoing? It might be both easier and more effective for RCU to simply unconditionally react to the existence of a reclaim than for the reclaim process to try to figure out when to prod RCU.

Thanx, Paul

^ permalink raw reply [flat|nested] 90+ messages in thread
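Paul's closing question suggests the simplest possible interface: reclaim unconditionally tells RCU it is running, and RCU adjusts its grace-period prodding. A sketch of that shape - rcu_reclaim_active and both hook functions are hypothetical, and the 3 ms/1 ms figures are just the scan intervals mentioned above:

    /* Hypothetical: raised by reclaim, consulted by RCU's grace-period kthread. */
    static atomic_t rcu_reclaim_active = ATOMIC_INIT(0);

    void rcu_note_reclaim_begin(void) { atomic_inc(&rcu_reclaim_active); }
    void rcu_note_reclaim_end(void)   { atomic_dec(&rcu_reclaim_active); }

    /* When choosing the force-quiescent-state scan interval: */
    static unsigned long rcu_fqs_interval(void)
    {
            /* Scan for idle CPUs every ~3 ms normally, ~1 ms during reclaim. */
            return msecs_to_jiffies(atomic_read(&rcu_reclaim_active) ? 1 : 3);
    }

The same flag could feed the per-CPU callback-count prodding heuristics mentioned above.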
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 15:32 ` Paul E. McKenney
@ 2024-02-27 15:52 ` Kent Overstreet
2024-02-27 16:06 ` Paul E. McKenney
2024-02-27 15:54 ` Matthew Wilcox
1 sibling, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-27 15:52 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:

[ ... ]

> You identified one potential condition, namely the amount of memory
> waiting to be reclaimed. One complication with this approach is that RCU
> has no idea how much memory each callback represents, and for call_rcu(),
> there is no way for it to find out. For kfree_rcu(), there are ways,
> but as you know, I am questioning whether those ways are reasonable from
> a performance perspective. But even if they are, we would be accepting
> more error from the memory waiting via call_rcu() than we would be
> accepting if we just counted blocks instead of bytes for kfree_rcu().

You're _way_ overcomplicating this.

The relevant thing to consider is the relative cost of __ksize() and kfree_rcu(). __ksize() is already pretty cheap, and with SLAB gone and space available in struct slab we can get it down to a single load.

> Let me reiterate that: The estimation error that you are objecting to
> for kfree_rcu() is completely and utterly unavoidable for call_rcu().

Hardly; callsites manually freeing memory after an RCU grace period can do the accounting themselves, if they're hot enough to matter - and most aren't.

And with memory allocation profiling coming, which also tracks the number of allocations, we'll also have an easy way to spot those.

^ permalink raw reply [flat|nested] 90+ messages in thread
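The manual accounting Kent has in mind for hot call_rcu() callsites might look like this; struct foo and foo_stranded_bytes are illustrative only, while call_rcu() and container_of() are the stock kernel facilities:

    /* Illustrative call_rcu() user doing its own stranded-memory accounting. */
    static atomic_long_t foo_stranded_bytes;

    struct foo {
            struct rcu_head rh;
            /* ... payload ... */
    };

    static void foo_free_rcu(struct rcu_head *rh)
    {
            struct foo *f = container_of(rh, struct foo, rh);

            atomic_long_sub(sizeof(*f), &foo_stranded_bytes);
            kfree(f);
    }

    static void foo_release(struct foo *f)
    {
            atomic_long_add(sizeof(*f), &foo_stranded_bytes);
            call_rcu(&f->rh, foo_free_rcu);
    }

Only the callsites hot enough for this to matter would need to bother; the rest could be caught later via allocation profiling, as Kent notes.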
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 15:52 ` Kent Overstreet
@ 2024-02-27 16:06 ` Paul E. McKenney
0 siblings, 0 replies; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-27 16:06 UTC (permalink / raw)
To: Kent Overstreet
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 10:52:51AM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:

[ ... ]

> You're _way_ overcomplicating this.

Sorry, but no. Please read the remainder of my prior email carefully.

Thanx, Paul

> The relevant thing to consider is the relative cost of __ksize() and
> kfree_rcu(). __ksize() is already pretty cheap, and with SLAB gone and
> space available in struct slab we can get it down to a single load.
>
> > Let me reiterate that: The estimation error that you are objecting to
> > for kfree_rcu() is completely and utterly unavoidable for call_rcu().
>
> Hardly; callsites manually freeing memory after an RCU grace period can
> do the accounting themselves, if they're hot enough to matter - and most
> aren't.
>
> And with memory allocation profiling coming, which also tracks the
> number of allocations, we'll also have an easy way to spot those.

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 15:32 ` Paul E. McKenney
2024-02-27 15:52 ` Kent Overstreet
@ 2024-02-27 15:54 ` Matthew Wilcox
2024-02-27 16:21 ` Paul E. McKenney
1 sibling, 1 reply; 90+ messages in thread
From: Matthew Wilcox @ 2024-02-27 15:54 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Kent Overstreet, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> At a ridiculously high level, reclaim is looking for memory to free.
> Some read-only memory can often be dropped immediately on the grounds
> that its data can be read back in if needed. Other memory can only be
> dropped after being written out, which involves a delay. There are of
> course many other complications, but this will do for a start.

Hi Paul,

I appreciate the necessity of describing what's going on at a very high level, but there's a wrinkle that I'm not sure you're aware of which may substantially change your argument.

For anonymous memory, we do indeed wait until reclaim to start writing it to swap. That may or may not be the right approach given how anonymous memory is used (and could be the topic of an interesting discussion at LSFMM).

For file-backed memory, we do not write back memory in reclaim. If it has got to the point of calling ->writepage in vmscan, things have gone horribly wrong to the point where calling ->writepage will make things worse. This is why we're currently removing ->writepage from every filesystem (only ->writepages will remain). Instead, the page cache is written back much earlier, once we get to balance_dirty_pages(). That lets us write pages in filesystem-friendly ways instead of in MM LRU order.

^ permalink raw reply [flat|nested] 90+ messages in thread
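For readers following along: the split Matthew describes means buffered writers are throttled at write() time, not at reclaim time. A simplified, illustrative sketch of the shape of that loop - the steps in comments stand in for the real ->write_begin/->write_end machinery, so this is not verbatim kernel code:

    /* Roughly the shape of the buffered write loop; error handling omitted. */
    while (bytes_remaining) {
            /* ->write_begin(): find/create and lock the folio */
            /* copy user data into the folio */
            /* ->write_end(): mark the folio dirty and unlock it */

            /*
             * Throttle the writer and kick off writeback long before
             * memory pressure: this is where dirty limits are enforced.
             */
            balance_dirty_pages_ratelimited(mapping);
    }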
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 15:54 ` Matthew Wilcox
@ 2024-02-27 16:21 ` Paul E. McKenney
2024-02-27 16:34 ` Kent Overstreet
0 siblings, 1 reply; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-27 16:21 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Kent Overstreet, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 03:54:23PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 27, 2024 at 07:32:32AM -0800, Paul E. McKenney wrote:
> > At a ridiculously high level, reclaim is looking for memory to free.
> > Some read-only memory can often be dropped immediately on the grounds
> > that its data can be read back in if needed. Other memory can only be
> > dropped after being written out, which involves a delay. There are of
> > course many other complications, but this will do for a start.
>
> Hi Paul,
>
> I appreciate the necessity of describing what's going on at a very high
> level, but there's a wrinkle that I'm not sure you're aware of which
> may substantially change your argument.
>
> For anonymous memory, we do indeed wait until reclaim to start writing it
> to swap. That may or may not be the right approach given how anonymous
> memory is used (and could be the topic of an interesting discussion
> at LSFMM).
>
> For file-backed memory, we do not write back memory in reclaim. If it
> has got to the point of calling ->writepage in vmscan, things have gone
> horribly wrong to the point where calling ->writepage will make things
> worse. This is why we're currently removing ->writepage from every
> filesystem (only ->writepages will remain). Instead, the page cache
> is written back much earlier, once we get to balance_dirty_pages().
> That lets us write pages in filesystem-friendly ways instead of in MM
> LRU order.

Thank you for the additional details.

But please allow me to further summarize the point of my prior email that seems to be getting lost:

1. RCU already does significant work prodding grace periods.

2. There is no reasonable way to provide estimates of the memory sent to RCU via call_rcu(), and in many cases the bulk of the waiting memory will be call_rcu() memory.

Therefore, if we cannot come up with a heuristic that does not need to know the bytes of memory waiting, we are stuck anyway.

So perhaps the proper heuristic for RCU speeding things up is simply "Hey RCU, we are in reclaim!". Or do you have hard data showing that this is insufficient?

Thanx, Paul

^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 16:21 ` Paul E. McKenney
@ 2024-02-27 16:34 ` Kent Overstreet
2024-02-27 17:58 ` Paul E. McKenney
0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-27 16:34 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 08:21:29AM -0800, Paul E. McKenney wrote:

[ ... ]

> But please allow me to further summarize the point of my prior email
> that seems to be getting lost:
>
> 1. RCU already does significant work prodding grace periods.
>
> 2. There is no reasonable way to provide estimates of the
> memory sent to RCU via call_rcu(), and in many cases
> the bulk of the waiting memory will be call_rcu() memory.
>
> Therefore, if we cannot come up with a heuristic that does not need to
> know the bytes of memory waiting, we are stuck anyway.

That is a completely asinine argument.

> So perhaps the proper heuristic for RCU speeding things up is simply
> "Hey RCU, we are in reclaim!".

Because that's the wrong heuristic. There are important workloads for which we're _always_ in reclaim, but as long as RCU grace periods are happening at some steady rate, the amount of memory stranded will be bounded and there's no reason to expedite grace periods.

If we start RCU freeing all pagecache folios we're going to be cycling memory through RCU freeing at the rate of gigabytes per second, tens of gigabytes per second on high end systems.

Do you put hard limits on how long we can go before an RCU grace period that will limit the amount of memory stranded to something acceptable? Yes or no?

^ permalink raw reply [flat|nested] 90+ messages in thread
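Kent's "bounded" claim is just Little's law: in steady state, the memory stranded behind RCU is roughly the eviction rate multiplied by the grace-period latency. Plugging in his own figure and the typical idle-system grace-period duration Paul gives later in the thread (illustrative numbers only):

    stranded ~= eviction_rate x grace_period_latency
             ~= 10 GB/s x 30 ms ~= 300 MB in flight

Tolerable on a large machine, painful on a small one - which is why the argument turns on whether grace periods keep completing at a steady rate under load, and on what feedback exists when they do not.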
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 16:34 ` Kent Overstreet
@ 2024-02-27 17:58 ` Paul E. McKenney
2024-02-28 23:55 ` Kent Overstreet
0 siblings, 1 reply; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-27 17:58 UTC (permalink / raw)
To: Kent Overstreet
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 11:34:06AM -0500, Kent Overstreet wrote:

[ ... ]

> That is a completely asinine argument.

Huh. Anything else you need to get off your chest?

On the off-chance it is unclear, I do disagree with your assessment.

> > So perhaps the proper heuristic for RCU speeding things up is simply
> > "Hey RCU, we are in reclaim!".
>
> Because that's the wrong heuristic. There are important workloads for
> which we're _always_ in reclaim, but as long as RCU grace periods are
> happening at some steady rate, the amount of memory stranded will be
> bounded and there's no reason to expedite grace periods.

RCU is in fact designed to handle heavy load, and contains a number of mechanisms to operate more efficiently at higher load than at lower load. It also contains mechanisms to expedite grace periods under heavy load. Some of which I already described in earlier emails on this thread.

> If we start RCU freeing all pagecache folios we're going to be cycling
> memory through RCU freeing at the rate of gigabytes per second, tens of
> gigabytes per second on high end systems.

The load on RCU would be measured in terms of requests (kfree_rcu() and friends) per unit time per CPU, not in terms of gigabytes per unit time. Of course, the amount of memory per unit time might be an issue for whatever allocators you are using, and as Matthew has often pointed out, the deferred reclamation incurs additional cache misses.

And rcutorture really does do forward-progress testing on vanilla RCU that features tight in-kernel loops doing call_rcu() without bounds on memory in flight, and RCU routinely survives this in a VM that is given only 512MB of memory. In fact, any failure to survive this is considered a bug to be fixed.

So I suspect RCU is quite capable of handling this load. But how many additional kfree_rcu() calls are you anticipating per unit time per CPU? For example, could we simply measure the rate at which pagecache folios are currently being freed under heavy load? Given that information, we could just try it out.

> Do you put hard limits on how long we can go before an RCU grace period
> that will limit the amount of memory stranded to something acceptable?
> Yes or no?

C'mon, Kent, you can do better than this.

Given what you have described, I believe that RCU would have no problem with it. However, additional information would be welcome. As always, I could well be missing something.

And yes, kernel developers can break RCU (with infinite loops in RCU readers being a popular approach) and systems administrators can break RCU, for example, by confining the processing of offloaded callbacks to a single CPU on a 128-CPU system. Don't do that.

Thanx, Paul

^ permalink raw reply [flat|nested] 90+ messages in thread
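The rcutorture arrangement Paul describes hangs off the stock OOM notifier chain. A sketch of the registration pattern - the callback body here is illustrative, while register_oom_notifier()/unregister_oom_notifier() are the real hooks:

    #include <linux/oom.h>

    /* Illustrative: flag a forward-progress failure if OOM fires mid-flood. */
    static int flood_oom_notify(struct notifier_block *nb,
                                unsigned long unused, void *freed)
    {
            WARN(1, "OOM during RCU callback flood");
            return NOTIFY_OK;
    }

    static struct notifier_block flood_oom_nb = {
            .notifier_call = flood_oom_notify,
    };

    /* register_oom_notifier(&flood_oom_nb) before the flood,
       unregister_oom_notifier(&flood_oom_nb) afterward. */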
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 17:58 ` Paul E. McKenney
@ 2024-02-28 23:55 ` Kent Overstreet
2024-02-29 19:42 ` Paul E. McKenney
0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-28 23:55 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 09:58:48AM -0800, Paul E. McKenney wrote:

[ ... ]

> RCU is in fact designed to handle heavy load, and contains a number of
> mechanisms to operate more efficiently at higher load than at lower load.
> It also contains mechanisms to expedite grace periods under heavy load.
> Some of which I already described in earlier emails on this thread.

Yeah, the synchronize_rcu_expedited() sounds like exactly what we need here, when memory reclaim notices too much memory is stranded.

> > If we start RCU freeing all pagecache folios we're going to be cycling
> > memory through RCU freeing at the rate of gigabytes per second, tens of
> > gigabytes per second on high end systems.
>
> The load on RCU would be measured in terms of requests (kfree_rcu() and
> friends) per unit time per CPU, not in terms of gigabytes per unit time.
> Of course, the amount of memory per unit time might be an issue for
> whatever allocators you are using, and as Matthew has often pointed out,
> the deferred reclamation incurs additional cache misses.

So what I'm saying is that in the absence of something noticing excessive memory being stranded and asking for an expedited grace period, the only bound on the amount of memory being stranded will be how often RCU grace periods expire in the absence of anyone asking for them - that was my question to you. I'm not at all knowledgeable on RCU internals, but I gather it varies with things like dynticks and whether or not userspace is calling into the kernel?

"Gigabytes per second" - if userspace is doing big sequential streaming reads that don't fit into cache, we'll be evicting pagecache as quickly as we can read it in, so we should only be limited by your SSD bandwidth.

> And rcutorture really does do forward-progress testing on vanilla RCU
> that features tight in-kernel loops doing call_rcu() without bounds on
> memory in flight, and RCU routinely survives this in a VM that is given
> only 512MB of memory. In fact, any failure to survive this is considered
> a bug to be fixed.

Are you saying there's already feedback between memory reclaim and RCU?

> So I suspect RCU is quite capable of handling this load. But how many
> additional kfree_rcu() calls are you anticipating per unit time per CPU?
> For example, could we simply measure the rate at which pagecache folios
> are currently being freed under heavy load? Given that information,
> we could just try it out.

It's not the load I'm concerned about, or the number of call_rcu() calls; I have no doubt that RCU will cope with that just fine.

But I do think that we need an additional feedback mechanism here. When we're doing big streaming sequential buffered IO, and lots of memory is cycled in and out of the pagecache, we have a couple things we want to avoid:

The main thing is that we don't want the amount of memory stranded waiting for RCU to grow unbounded and shove everything out of the caches; if you're currently just concerned about _deadlock_, that is likely insufficient here - we're also concerned about maintaining good steady performance under load.

We don't want memory reclaim to be trying harder and harder when the correct thing to do is a synchronize_rcu_expedited().

We _also_ don't want to be hammering on RCU asking for expedited grace periods unnecessarily when the number of pending callbacks is high, but they're all for unrelated stuff - expedited RCU grace periods aren't free either!

Does that help clarify things?

^ permalink raw reply [flat|nested] 90+ messages in thread
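Put together, the feedback Kent is asking for could be as small as one check in reclaim. Everything below other than synchronize_rcu_expedited() is hypothetical - the counter and the limit do not exist today - and Paul's caveats about the cost of expediting apply:

    /* Hypothetical reclaim-side check; the counter and limit do not exist. */
    if (rcu_stranded_bytes_total() > rcu_stranded_limit)
            synchronize_rcu_expedited();    /* IPIs every CPU; not free */

Once the expedited grace period completes, the callbacks queued behind it run and the stranded memory actually becomes reclaimable; Kent's last point above is the reason for the threshold rather than expediting unconditionally.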
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 23:55 ` Kent Overstreet
@ 2024-02-29 19:42 ` Paul E. McKenney
2024-02-29 20:51 ` Kent Overstreet
0 siblings, 1 reply; 90+ messages in thread
From: Paul E. McKenney @ 2024-02-29 19:42 UTC (permalink / raw)
To: Kent Overstreet
Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner

On Wed, Feb 28, 2024 at 06:55:12PM -0500, Kent Overstreet wrote:

[ ... ]

> Yeah, the synchronize_rcu_expedited() sounds like exactly what we need
> here, when memory reclaim notices too much memory is stranded.

I was talking about things that normal grace periods do to speed things up when needed, but it might well be the case that this use case needs to use synchronize_rcu_expedited(). But this function does have some serious side effects, including lots of IPIs and (in preemptible kernels) priority boosting of any task that has ever been preempted within its current RCU read-side critical section. Plus synchronize_rcu_expedited() cannot help if the problem is overly long readers.

But one might imagine a solution that used kfree_rcu() and rcu_barrier() to detect any RCU slowness, which would trigger use of synchronize_rcu_expedited(). Except that this is harder than it might immediately appear. So first, let's find out what the situation really requires.

> So what I'm saying is that in the absence of something noticing
> excessive memory being stranded and asking for an expedited grace
> period, the only bound on the amount of memory being stranded will be
> how often RCU grace periods expire in the absence of anyone asking for
> them - that was my question to you. I'm not at all knowledgeable on RCU
> internals, but I gather it varies with things like dynticks and whether
> or not userspace is calling into the kernel?

It does vary based on quite a few things. Taking as an example your dynticks point, a nohz_full CPU does not restart the tick upon entering the kernel. Which is what you want for short system calls because enabling and disabling the tick can be expensive. But this causes RCU trouble if a nohz_full CPU executes a long-running system call, so RCU will force the tick on should such a CPU refuse to respond to RCU quickly enough.

But the typical idle-system normal grace-period duration is a couple of tens of milliseconds. Except of course that if there are one-second-long readers, you will see one-second-long grace periods.

> "Gigabytes per second" - if userspace is doing big sequential streaming
> reads that don't fit into cache, we'll be evicting pagecache as quickly
> as we can read it in, so we should only be limited by your SSD
> bandwidth.

Understood.

But why not simply replace the current freeing with kfree_rcu() or similar and see what really happens? More further down...

> Are you saying there's already feedback between memory reclaim and RCU?

For the callback flooding in the torture test, yes. There is a register_oom_notifier() function, and if that function is called during the callback flooding, that is a test failure.

There used to be a shrinker in the RCU implementation itself, but this proved unhelpful and was therefore removed. As in RCU worked fine without it given the workloads of that time. Who knows? Maybe it needs to be added back in.

> It's not the load I'm concerned about, or the number of call_rcu()
> calls; I have no doubt that RCU will cope with that just fine.
>
> But I do think that we need an additional feedback mechanism here. [ ... ]
>
> Does that help clarify things?

Yes, but...

The key point is that no one knows enough about this problem to dive at all deeply into solution space. From what you have described thus far, the solution might involve one or more of the following options:

o We do nothing, because RCU just handles it. (Hey, I can dream, can't I?)

o We do a few modifications to RCU's force-quiescent-state code. This needs to be revisited for lazy preemption anyway, so this is a good time to take a look.

o We do a few modifications to RCU's force-quiescent-state code, but with the addition of communication of some sort from reclaim. This appears to be your current focus.

o It might be that reclaim needs to use SRCU instead of RCU. This would insulate reclaim from other subsystems' overly long readers, but at the expense of smp_mb() overhead when entering and exiting a reader. Also, unlike RCU, SRCU has no reasonable way of encouraging delayed readers (for example, delayed due to preemption).

o It might be that reclaim needs its own specialized flavor of RCU, as did tracing (Tasks RCU) and BPF (Tasks Trace RCU). Plus there is Steve Rostedt's additional flavor for tracing, which is now Tasks RCU Rude. I doubt that any of these three are what you want, but let's get some hard data so that we can find out.

o It might be that reclaim needs some synchronization mechanism other than RCU.

We really need some hard data on how this thing affects RCU. Again, could we do the kfree()->kfree_rcu() change (and a similar change, depending on what memory allocator is used) and then run some tests to see how RCU reacts?

If RCU does not react all that well, then we will also need information on what the RCU readers look like, for example, must they block, must they be preemptible, and about how long do they run for? Best if they never block, need not be preemptible, and run for very short periods of time, but they are what they are. Perhaps Matthew Wilcox has some thoughts on this.

Given that information, we can make at least a semi-informed decision as to which (if any) of the six possibilities listed above is most appropriate.

Does that help to communicate my position on this topic?

Thanx, Paul

^ permalink raw reply [flat|nested] 90+ messages in thread
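The experiment Paul proposes is mechanical for slab-backed objects: embed an rcu_head in the structure and defer the free. For pagecache folios there is no drop-in equivalent yet (the free_pages_rcu() Kent mentioned earlier would have to be invented), but for a slab object the before/after looks like this - struct widget is illustrative:

    /* Illustrative structure; the rcu_head is the only addition needed. */
    struct widget {
            struct rcu_head rcu;
            /* ... payload ... */
    };

    /* Before: immediate free. */
    kfree(w);

    /* After: free deferred until a grace period elapses; no callback
     * function to write - kfree_rcu() takes the rcu_head field name. */
    kfree_rcu(w, rcu);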
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-29 19:42 ` Paul E. McKenney @ 2024-02-29 20:51 ` Kent Overstreet 2024-03-05 2:19 ` Paul E. McKenney 0 siblings, 1 reply; 90+ messages in thread From: Kent Overstreet @ 2024-02-29 20:51 UTC (permalink / raw) To: Paul E. McKenney Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Thu, Feb 29, 2024 at 11:42:53AM -0800, Paul E. McKenney wrote: > On Wed, Feb 28, 2024 at 06:55:12PM -0500, Kent Overstreet wrote: > > So what I'm saying is that in the absensce of something noticing > > excessive memory being stranded and asking for an expedited grace > > period, the only bounds on the amount of memory being stranded will be > > how often RCU grace periods expire in the absence of anyone asking for > > them - that was my question to you. I'm not at all knowledgable on RCU > > internals but I gather it varies with things like dynticks and whether > > or not userspace is calling into the kernel? > > It does vary based on quite a few things. Taking as an example your > dynticks point, a nohz_full CPU does not restart the tick upon entering > the kernel. Which is what you want for short system calls because > enabling and disabling the tick can be expensive. But this causes > RCU trouble if a nohz_full CPU executes a long-running system call, > so RCU will force the tick on should such a CPU refused to respond to > RCU quickly enough. Makes sense - so synchronize_rcu_expedited() has to check that and issue an IPI? > But the typical idle-system normal grace-period duration is a couple of > tens of milliseconds. Except of course that if there are one-second-long > readers, you will see one-second-long grace periods. > > > "gigabytes per second" - if userspace is doing big sequential streaming > > reads that don't fit into cache, we'll be evicting pagecache as quickly > > as we can read it in, so we should only be limited by your SSD > > bandwidth. > > Understood. > > But why not simply replace the current freeing with kfree_rcu() or > similar and see what really happens? More further down... Well, mainly because I've been doing filesystem and storage work for 15 years and these kinds of problems are my bread and butter; in filesystem land (especially bcachefs, but really any modern journalling filesystem), there are a _lot_ of background reclaim tasks not unlike this. So I'm speaking from experience here; the very first thing we're going to want when we do this for page cache pages and start analyzing system behaviour and performance is some basic numbers as to what's going on, and here it's practically no trouble at all to do it right. > > > And rcutorture really does do forward-progress testing on vanilla RCU > > > that features tight in-kernel loops doing call_rcu() without bounds on > > > memory in flight, and RCU routinely survives this in a VM that is given > > > only 512MB of memory. In fact, any failure to survive this is considered > > > a bug to be fixed. > > > > Are you saying there's already feedback between memory reclaim and RCU? > > For the callback flooding in the torture test, yes. There is a > register_oom_notifier() function, and if that function is called during > the callback flooding, that is a test failure. Based on number of outstanding callbacks? 
> There used to be a shrinker in the RCU implementation itself, but this > proved unhelpful and was therefore removed. As in RCU worked fine without > it given the workloads of that time. Who knows? Maybe it needs to be > added back in. Why was it unhelpful? Were you having problems with the approach, or did it just not seem to be useful? > The key point is that no one knows enough about this problem to dive at > all deeply into solution space. From what you have described thus far, > the solution might involve one or more of the following options: > > o We do nothing, because RCU just handles it. (Hey, I can dream, > can't I?) > > o We do a few modifications to RCU's force-quiescent state code. > This needs to be revisited for lazy preemption anyway, so this > is a good time to take a look. > > o We do a few modifications to RCU's force-quiescent state code, > but with the addition of communication of some sort from reclaim. > This appears to be your current focus. Not exactly (but I'd like to hear what you have in mind); I'm envisioning this driven by reclaim itself. > o It might be that reclaim needs to use SRCU instead of RCU. > This would insulate reclaim from other subsystems' overly long > readers, but at the expense of smp_mb() overhead when entering > and exiting a reader. Also, unlike RCU, SRCU has no reasonable > way of encouraging delayed readers (for example, delayed due > to preemption). Yeah, I don't think SRCU is quite what we want here. > o It might be that reclaim needs its own specialized flavor of RCU, > as did tracing (Tasks RCU) and BPF (Tasks Trace RCU). Plus there > is Steve Rostedt's additional flavor for tracing, which is now > Tasks RCU Rude. I doubt that any of these three are what you > want, but let's get some hard data so that we can find out. > > o It might be that reclaim needs some synchronization mechanism > other than RCU. > > We really need some hard data on how this thing affects RCU. Again, > could we do the kfree()->kfree_rcu() change (and similar change, > depending on what memory allocator is used) and then run some tests to > see how RCU reacts? Well, like I mentioned previously - the "hard data" you say we need is exactly what I was saying we need to collect as a first order of business - we need to know the amount of memory stranded by RCU, and we need to be able to watch that number as we run through different buffered IO workloads. Sure, we could do this as a special case for just pagecache page freeing; but it looks like zero extra work to just do it right and do it for kfree_rcu() in general; and I've already talked about how minimal the performance cost is. And more broadly, we have real gaps when it comes to OOM debugging and being able to report on who's responsible for what memory and how much, and that's something I want to close. I've got a shrinker debugging patch that I really need to get back to and shove in, and really in that area I need to go further and add byte counters to shrinker reporting. IOW, this isn't just an RCU topic, this is a memory reclaim topic, and lots of us filesystem developers have... things to say on the subject of memory reclaim. Being able to get proper information so that we can debug stuff has been a major point of contention; I hope you can try to appreciate where we're coming from. > If RCU does not react all that well, then we will also need information on > what the RCU readers look like, for example, must they block, must they > be preemptible, and about how long do they run for? 
Best if they never > block, need not be preemptible, and run for very short periods of time, > but they are what they are. Perhaps Matthew Wilcox has some thoughts > on this. We'll be using this in filemap_read() (and I've worked on that code more than most); we're already using rcu_read_lock() there for the xarray lookup, which is partly why RCU (the regular sort) is a natural fit for protecting the data itself. This will move the copy_to_user_nofault() under rcu_read_lock(), so we're not blocking or faulting or preempting or anything like that. Just doing a (potentially big) copy to userspace. ^ permalink raw reply [flat|nested] 90+ messages in thread
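To make the filemap_read() plan concrete: the shape Kent is describing here, and that Linus prototypes later in this thread, is roughly the sketch below. filemap_read_rcu() is a hypothetical helper, not a posted patch, and it deliberately omits the post-copy re-check that the folio is still in the mapping; that missing check is precisely the ABA hazard debated further down. It also assumes !CONFIG_HIGHMEM so folio_address() is usable directly:

    static ssize_t filemap_read_rcu(struct address_space *mapping, loff_t pos,
                                    void __user *buf, size_t len)
    {
            XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT);
            struct folio *folio;
            ssize_t ret = -EAGAIN;

            rcu_read_lock();
            folio = xas_load(&xas);
            if (folio && !xa_is_value(folio) && folio_test_uptodate(folio)) {
                    size_t offset = pos & (folio_size(folio) - 1);

                    /* Must not sleep or fault under rcu_read_lock() */
                    if (!copy_to_user_nofault(buf, folio_address(folio) + offset, len))
                            ret = len;
            }
            rcu_read_unlock();

            return ret;     /* -EAGAIN: retry via the refcounted slow path */
    }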
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-29 20:51 ` Kent Overstreet @ 2024-03-05 2:19 ` Paul E. McKenney 0 siblings, 0 replies; 90+ messages in thread From: Paul E. McKenney @ 2024-03-05 2:19 UTC (permalink / raw) To: Kent Overstreet Cc: Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Thu, Feb 29, 2024 at 03:51:08PM -0500, Kent Overstreet wrote: > On Thu, Feb 29, 2024 at 11:42:53AM -0800, Paul E. McKenney wrote: > > On Wed, Feb 28, 2024 at 06:55:12PM -0500, Kent Overstreet wrote: > > > So what I'm saying is that in the absence of something noticing > > > excessive memory being stranded and asking for an expedited grace > > > period, the only bounds on the amount of memory being stranded will be > > > how often RCU grace periods expire in the absence of anyone asking for > > > them - that was my question to you. I'm not at all knowledgeable on RCU > > > internals but I gather it varies with things like dynticks and whether > > > or not userspace is calling into the kernel? > > > > It does vary based on quite a few things. Taking as an example your > > dynticks point, a nohz_full CPU does not restart the tick upon entering > > the kernel. Which is what you want for short system calls because > > enabling and disabling the tick can be expensive. But this causes > > RCU trouble if a nohz_full CPU executes a long-running system call, > > so RCU will force the tick on should such a CPU refuse to respond to > > RCU quickly enough. > > Makes sense - so synchronize_rcu_expedited() has to check that and issue > an IPI? No and yes, in that synchronize_rcu_expedited() sends an IPI to every CPU that is not either deep in the idle loop, in nohz_full userspace, or offline. A nohz_full CPU that is in a long system call will thus get an IPI, which means that synchronize_rcu_expedited() need not rely on the scheduling-clock interrupt. In contrast, synchronize_rcu() (non-expedited) might well need to re-enable the tick on this CPU in this case. > > But the typical idle-system normal grace-period duration is a couple of > > tens of milliseconds. Except of course that if there are one-second-long > > readers, you will see one-second-long grace periods. > > > > > "gigabytes per second" - if userspace is doing big sequential streaming > > > reads that don't fit into cache, we'll be evicting pagecache as quickly > > > as we can read it in, so we should only be limited by your SSD > > > bandwidth. > > > > Understood. > > > > But why not simply replace the current freeing with kfree_rcu() or > > similar and see what really happens? More further down... > > Well, mainly because I've been doing filesystem and storage work for 15 > years and these kinds of problems are my bread and butter; in filesystem > land (especially bcachefs, but really any modern journalling > filesystem), there are a _lot_ of background reclaim tasks not unlike > this. > > So I'm speaking from experience here; the very first thing we're going > to want when we do this for page cache pages and start analyzing system > behaviour and performance is some basic numbers as to what's going on, > and here it's practically no trouble at all to do it right. I am not questioning the depth of your filesystem and storage experience. However, the definition of "right" can vary widely. 
In this case, "right" might mean that we absolutely must do yet another special-purpose RCU. I hope not, but if this is in fact the case, I need to learn that before spending huge amounts of time attempting to tweak vanilla RCU into doing something that it is incapable of doing. > > > > And rcutorture really does do forward-progress testing on vanilla RCU > > > > that features tight in-kernel loops doing call_rcu() without bounds on > > > > memory in flight, and RCU routinely survives this in a VM that is given > > > > only 512MB of memory. In fact, any failure to survive this is considered > > > > a bug to be fixed. > > > > > > Are you saying there's already feedback between memory reclaim and RCU? > > > > For the callback flooding in the torture test, yes. There is a > > register_oom_notifier() function, and if that function is called during > > the callback flooding, that is a test failure. > > Based on number of outstanding callbacks? I am not sure what you are asking. The OOM killer doesn't care about the number of callbacks. The callback flooding for vanilla RCU's torture testing does not limit the number of callbacks. Instead, it continues allocating memory and passing it to call_rcu() until either the OOM notification arrives (which is a bug in RCU) or until a reasonable number of callbacks have completed their grace periods. It does cheat in that the RCU callback function hands the memory directly to rcutorture instead of taking a trip through the memory allocator. It does this to focus the torturing on RCU instead of letting RCU hide behind the memory allocator. > > There used to be a shrinker in the RCU implementation itself, but this > > proved unhelpful and was therefore removed. As in RCU worked fine without > > it given the workloads of that time. Who knows? Maybe it needs to be > > added back in. > > Why was it unhelpful? Were you having problems with the approach, or > just didn't seem to be useful? If I recall correctly, the RCU shrinker was counterproductive because the grace periods completed quickly enough on their own, so there was no point in burning the additional CPU. But times change, so maybe it is time to put a shrinker back in. > > The key point is that no one knows enough about this problem to dive at > > all deeply into solution space. From what you have described thus far, > > the solution might involve one or more of the following options: > > > > o We do nothing, because RCU just handles it. (Hey, I can dream, > > can't I?) > > > > o We do a few modifications to RCU's force-quiescent state code. > > This needs to be revisited for lazy preemption anyway, so this > > is a good time to take a look. > > > > o We do a few modifications to RCU's force-quiescent state code, > > but with the addition of communication of some sort from reclaim. > > This appears to be your current focus. > > Not exactly (but I'd like to hear what you have in mind); I'm > envisioning this driven by reclaim itself. I am wide open. So much so that I consider your thought of driving RCU from reclaim as a special case of communication of some sort from reclaim. Again, I would like to see actual data. This is not the simple part of RCU. In contrast, some years back when I immediately agreed to add polling interfaces to SRCU, it was a simple addition to RCU that I had long known would eventually be needed. > > o It might be that reclaim needs to use SRCU instead of RCU. 
> > This would insulate reclaim from other subsystems' overly long > > readers, but at the expense of smp_mb() overhead when entering > > and exiting a reader. Also, unlike RCU, SRCU has no reasonable > > way of encouraging delayed readers (for example, delayed due > > to preemption). > > Yeah, I don't think SRCU is quite what we want here. Given that you have actually used SRCU, I trust your judgment more than I would trust most other people's. But long and painful experience has caused me to trust the judgment of real hardware running real software quite a bit more than I trust that of any person, my own judgment included. > > o It might be that reclaim needs its own specialized flavor of RCU, > > as did tracing (Tasks RCU) and BPF (Tasks Trace RCU). Plus there > > is Steve Rostedt's additional flavor for tracing, which is now > > Tasks RCU Rude. I doubt that any of these three are what you > > want, but let's get some hard data so that we can find out. > > > > o It might be that reclaim needs some synchronization mechanism > > other than RCU. You are free to ignore these last two for the time being, but they might well prove necessary. So I cannot ignore them. > > We really need some hard data on how this thing affects RCU. Again, > > could we do the kfree()->kfree_rcu() change (and similar change, > > depending on what memory allocator is used) and then run some tests to > > see how RCU reacts? > > Well, like I mentioned previously - the "hard data" you say we need is > exactly what I was saying we need to collect as a first order of > business - we need to know the amount of memory stranded by RCU, and we > need to be able to watch that number as we run through different > buffered IO workloads. Good! Please do it! Change those kfree() calls to kfree_rcu() and let's see whether or not the system OOMs. > Sure, we could do this as a special case for just pagecache page > freeing; but it looks like zero extra work to just do it right and do it > for kfree_rcu() in general; and I've already talked about how minimal > the performance cost is. Kent, kfree_rcu() is itself a special case. There is call_rcu(), synchronize_rcu(), synchronize_rcu_expedited(), as well as all of the polling interfaces, none of which have any reasonable way to know the corresponding size of data. And a lot of the RCU traffic typically goes through call_rcu() and synchronize_rcu(). > And more broadly, we have real gaps when it comes to OOM debugging and > being able to report on who's responsible for what memory and how much, > and that's something I want to close. I've got a shrinker debugging > patch that I really need to get back to and shove in, and really in that > area I need to go further and add byte counters to shrinker reporting. > > IOW, this isn't just an RCU topic, this is a memory reclaim topic, and > lots of us filesystem developers have... things to say on the subject of > memory reclaim. Being able to get proper information so that we can > debug stuff has been a major point of contention; I hope you can > try to appreciate where we're coming from. I am trying, Kent. I really am. I am absolutely not succeeding. Without the kfree()->kfree_rcu() testing, how do you even know that vanilla RCU is in fact the right solution? The answer is that you absolutely do not know one way or the other. Please keep in mind that we have had several cases where vanilla RCU was not the right answer, which is why SRCU, Tasks RCU, Tasks Trace RCU, and Tasks Rude RCU exist. 
Plus there has been code that uses workqueues as a type of RCU. And if you are going to do something like free half of memory in one shot and be unable to re-use that memory until after a normal RCU grace period elapses, we probably need something quite special. Maybe things never get that extreme, but you have been emphasizing tens of gigabytes, and that would likely require something special on even a 64GB system. Or maybe you would only do tens of gigabytes on a system with terabytes of memory, in which case I am not quite so worried. Which brings us to how do you even know that RCU needs to change at all? Again, the answer is that you absolutely do not know one way or the other. > > If RCU does not react all that well, then we will also need information on > > what the RCU readers look like, for example, must they block, must they > > be preemptible, and about how long do they run for? Best if they never > > block, need not be preemptible, and run for very short periods of time, > > but they are what they are. Perhaps Matthew Wilcox has some thoughts > > on this. > > We'll be using this in filemap_read() (and I've worked on that code more > than most); we're already using rcu_read_lock() there for the xarray > lookup, which is partly why RCU (the regular sort) is a natural fit for > protecting the data itself. > > This will move the copy_to_user_nofault() under rcu_read_lock(), so > we're not blocking or faulting or preempting or anything like that. Just > doing a (potentially big) copy to userspace. Thank you for the heads up. Are these also things that you believe that RCU cannot handle without modifications? If so, it would be good to test these as well. Thanx, Paul ^ permalink raw reply [flat|nested] 90+ messages in thread
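As background for the register_oom_notifier() tripwire Paul describes above: it is a stock notifier-chain hook, and rcutorture's actual usage lives in kernel/rcu/. A rough sketch with hypothetical names:

    #include <linux/oom.h>
    #include <linux/notifier.h>

    static int flood_oom_notify(struct notifier_block *nb,
                                unsigned long unused, void *freed)
    {
            /* If this fires while the call_rcu() flood is running, RCU
             * fell behind badly enough to matter: test failure. */
            pr_warn("OOM during callback flood\n");
            return NOTIFY_OK;
    }

    static struct notifier_block flood_oom_nb = {
            .notifier_call = flood_oom_notify,
    };

    /* register_oom_notifier(&flood_oom_nb) before the flood starts,
     * unregister_oom_notifier(&flood_oom_nb) once it completes. */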
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-26 23:29 ` Kent Overstreet 2024-02-27 0:05 ` Paul E. McKenney @ 2024-02-27 0:43 ` Dave Chinner 1 sibling, 0 replies; 90+ messages in thread From: Dave Chinner @ 2024-02-27 0:43 UTC (permalink / raw) To: Kent Overstreet Cc: Paul E. McKenney, Matthew Wilcox, Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner On Mon, Feb 26, 2024 at 06:29:43PM -0500, Kent Overstreet wrote: > On Mon, Feb 26, 2024 at 01:55:10PM -0800, Paul E. McKenney wrote: > > On Mon, Feb 26, 2024 at 04:19:14PM -0500, Kent Overstreet wrote: > > > > RCU allocating and freeing of memory can already be fairly significant > > > > depending on workload, and I'd expect that to grow - we really just need > > > > a way for reclaim to kick RCU when needed (and probably add a percpu > > > > counter for "amount of memory stranded until the next RCU grace > > > > period"). > > > > There are some APIs for that, though they are sharp-edged and mainly > > intended for rcutorture, and there are some hooks for a CI Kconfig > > option called RCU_STRICT_GRACE_PERIOD that could be organized into > > something useful. > > > > Of course, if there is a long-running RCU reader, there is nothing > > RCU can do. By definition, it must wait on all pre-existing readers, > > no exceptions. > > > > But my guess is that you instead are thinking of memory-exhaustion > > emergencies where you would like RCU to burn more CPU than usual to > > reduce grace-period latency, there are definitely things that can be done. > > > > I am sure that there are more questions that I should ask, but the one > > that comes immediately to mind is "Is this API call an occasional thing, > > or does RCU need to tolerate many CPUs hammering it frequently?" > > Either answer is fine, I just need to know. ;-) > > Well, we won't want it getting hammered on continuously - we should be > able to tune reclaim so that doesn't happen. If we are under sustained memory pressure, there will be a relatively steady state of "stranded memory" - every RCU grace period will be stranding and freeing roughly the same amount of memory because reclaim progress across all caches won't change significantly from grace period to grace period. I really haven't seen "stranded memory" from reclaimable slab caches (like inodes and dentries) ever causing issues with allocation or causing OOM kills. Hence I'm not sure that there is any real need for expediting the freeing of RCU memory in the general case - it's probably only when we get near OOM (i.e. reclaim priority is approaching 0) that expediting RCU-freed memory may make any difference to allocation success... > I think getting numbers on the amount of memory stranded waiting for RCU > is probably first order of business - minor tweak to kfree_rcu() et al. > for that; there are APIs they can query to maintain that counter. Yes, please. Get some numbers that show there is an actual problem here that needs solving. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 90+ messages in thread
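The counter Kent and Dave are converging on, "amount of memory stranded until the next RCU grace period", could be little more than a percpu_counter bumped when an object is queued and dropped from the grace-period callback. A rough sketch, with all names hypothetical and ignoring the batching inside the real kfree_rcu() path, which would complicate where the hooks actually go:

    #include <linux/percpu_counter.h>

    static struct percpu_counter rcu_stranded_bytes;
    /* percpu_counter_init(&rcu_stranded_bytes, 0, GFP_KERNEL) at boot */

    static void account_rcu_strand(size_t bytes)    /* at kfree_rcu() time */
    {
            percpu_counter_add(&rcu_stranded_bytes, bytes);
    }

    static void account_rcu_free(size_t bytes)      /* from the RCU callback */
    {
            percpu_counter_sub(&rcu_stranded_bytes, bytes);
    }

    /* Reported via e.g. /proc/meminfo using
     * percpu_counter_sum_positive(&rcu_stranded_bytes). */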
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-26 17:17 ` Linus Torvalds 2024-02-26 21:07 ` Matthew Wilcox @ 2024-02-26 22:46 ` Linus Torvalds 2024-02-26 23:48 ` Linus Torvalds 2024-05-14 11:52 ` Luis Chamberlain 1 sibling, 2 replies; 90+ messages in thread From: Linus Torvalds @ 2024-02-26 22:46 UTC (permalink / raw) To: Al Viro Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner [-- Attachment #1: Type: text/plain, Size: 1147 bytes --] On Mon, 26 Feb 2024 at 09:17, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Are tiny reads (handwaving: 100 bytes or less) really worth optimizing > for to that degree? So this ended up just bothering me, and dammit, we already *have* that stack buffer, except it's that 'folio batch' buffer. Which is already 128 bytes of stack space (on 64-bit). So, just hypothetically, let's say that *before* we start using that folio batch buffer for folios, we use it as a dummy buffer for small reads. So we'd make that 'fbatch' thing be a union with a temporary byte buffer. That hypothetical patch might look something like this TOTALLY UNTESTED CRAP. Anybody interested in seeing if something like this might actually work? I do want to emphasize the "something like this". This pile of random thoughts ends up compiling for me, and I _tried_ to think of all the cases, but there might be obvious thinkos, and there might be things I just didn't think about at all. I really haven't tested this AT ALL. I'm much too scared. But I don't actually hate how the code looks nearly as much as I *thought* I'd hate it. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 5033 bytes --] mm/filemap.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 107 insertions(+), 10 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 750e779c23db..5f66fd1bc82d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2543,6 +2543,85 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio) return (pos1 >> shift == pos2 >> shift); } +/* + * I can't be bothered to care about HIGHMEM for the fast read case + */ +#ifdef CONFIG_HIGHMEM +#define filemap_fast_read(mapping, pos, buffer, size) 0 +#else + +/* + * Called under RCU with size limited to the file size and one + */ +static unsigned long filemap_folio_copy_rcu(struct address_space *mapping, loff_t pos, char *buffer, size_t size) +{ + XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT); + struct folio *folio; + size_t offset; + + xas_reset(&xas); + folio = xas_load(&xas); + if (xas_retry(&xas, folio)) + return 0; + + if (!folio || xa_is_value(folio)) + return 0; + + if (!folio_test_uptodate(folio)) + return 0; + + /* No fast-case if we are supposed to start readahead */ + if (folio_test_readahead(folio)) + return 0; + /* .. or mark it accessed */ + if (!folio_test_referenced(folio)) + return 0; + + /* Do the data copy */ + offset = pos & (folio_size(folio) - 1); + memcpy(buffer, folio_address(folio) + offset, size); + + /* We should probably do some silly memory barrier here */ + if (unlikely(folio != xas_reload(&xas))) + return 0; + + return size; +} + +/* + * Iff we can complete the read completely in one atomic go under RCU, + * do so here. Otherwise return 0 (no partial reads, please - this is + * purely for the trivial fast case). 
+ */ +static unsigned long filemap_fast_read(struct address_space *mapping, loff_t pos, char *buffer, size_t size) +{ + struct inode *inode; + loff_t file_size; + unsigned long pgoff; + + /* Don't even try for page-crossers */ + pgoff = pos & ~PAGE_MASK; + if (pgoff + size > PAGE_SIZE) + return 0; + + /* Limit it to the file size */ + inode = mapping->host; + file_size = i_size_read(inode); + if (unlikely(pos >= file_size)) + return 0; + file_size -= pos; + if (file_size < size) + size = file_size; + + /* Let's see if we can just do the read under RCU */ + rcu_read_lock(); + size = filemap_folio_copy_rcu(mapping, pos, buffer, size); + rcu_read_unlock(); + + return size; +} +#endif /* !HIGHMEM */ + /** * filemap_read - Read data from the page cache. * @iocb: The iocb to read. @@ -2563,7 +2642,10 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, struct file_ra_state *ra = &filp->f_ra; struct address_space *mapping = filp->f_mapping; struct inode *inode = mapping->host; - struct folio_batch fbatch; + union { + struct folio_batch fbatch; + __DECLARE_FLEX_ARRAY(char, buffer); + } area; int i, error = 0; bool writably_mapped; loff_t isize, end_offset; @@ -2575,7 +2657,22 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, return 0; iov_iter_truncate(iter, inode->i_sb->s_maxbytes); - folio_batch_init(&fbatch); + + if (iov_iter_count(iter) < sizeof(area)) { + unsigned long count = iov_iter_count(iter); + + count = filemap_fast_read(mapping, iocb->ki_pos, area.buffer, count); + if (count) { + size_t copied = copy_to_iter(area.buffer, count, iter); + if (unlikely(!copied)) + return already_read ? already_read : -EFAULT; + ra->prev_pos = iocb->ki_pos += copied; + file_accessed(filp); + return copied + already_read; + } + } + + folio_batch_init(&area.fbatch); do { cond_resched(); @@ -2591,7 +2688,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, if (unlikely(iocb->ki_pos >= i_size_read(inode))) break; - error = filemap_get_pages(iocb, iter->count, &fbatch, false); + error = filemap_get_pages(iocb, iter->count, &area.fbatch, false); if (error < 0) break; @@ -2628,11 +2725,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, * mark it as accessed the first time. */ if (!pos_same_folio(iocb->ki_pos, last_pos - 1, - fbatch.folios[0])) - folio_mark_accessed(fbatch.folios[0]); + area.fbatch.folios[0])) + folio_mark_accessed(area.fbatch.folios[0]); - for (i = 0; i < folio_batch_count(&fbatch); i++) { - struct folio *folio = fbatch.folios[i]; + for (i = 0; i < folio_batch_count(&area.fbatch); i++) { + struct folio *folio = area.fbatch.folios[i]; size_t fsize = folio_size(folio); size_t offset = iocb->ki_pos & (fsize - 1); size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos, @@ -2663,9 +2760,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, } } put_folios: - for (i = 0; i < folio_batch_count(&fbatch); i++) - folio_put(fbatch.folios[i]); - folio_batch_init(&fbatch); + for (i = 0; i < folio_batch_count(&area.fbatch); i++) + folio_put(area.fbatch.folios[i]); + folio_batch_init(&area.fbatch); } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error); file_accessed(filp); ^ permalink raw reply related [flat|nested] 90+ messages in thread
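One detail of the diff above worth spelling out, since it explains the size cutoff discussed repeatedly below: the fast path has no buffer of its own, it reuses the on-stack folio batch as scratch space, so only reads that fit in sizeof(struct folio_batch), 128 bytes on 64-bit, can take it:

    union {
            struct folio_batch fbatch;              /* slow path: batch of folio pointers */
            __DECLARE_FLEX_ARRAY(char, buffer);     /* fast path: bounce buffer */
    } area;

The double copy, folio to stack to userspace, is the price of not taking a folio reference; the bet is that for reads this small it is cheaper than the refcount atomics.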
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-26 22:46 ` Linus Torvalds @ 2024-02-26 23:48 ` Linus Torvalds 2024-02-27 7:21 ` Kent Overstreet 2024-05-14 11:52 ` Luis Chamberlain 1 sibling, 1 reply; 90+ messages in thread From: Linus Torvalds @ 2024-02-26 23:48 UTC (permalink / raw) To: Al Viro Cc: Kent Overstreet, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Mon, 26 Feb 2024 at 14:46, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I really haven't tested this AT ALL. I'm much too scared. "Courage is not the absence of fear, but acting in spite of it" - Paddington Bear / Michael Scott It seems to actually boot here. That said, from a quick test with lots of threads all hammering on the same page - I'm still not entirely convinced it makes a difference. Sure, the kernel profile changes, but filemap_get_read_batch() wasn't very high up in the profile to begin with. I didn't do any actual performance testing, I just did a 64-byte pread at offset 0 in a loop in 64 threads on my 32c/64t machine. The cache ping-pong would be a lot more noticeable on some multi-socket machine, of course, but I do get the feeling that this is all optimizing for such an edge-case of an edge-case that it's all a bit questionable. But that patch does largely seem to work. Famous last words. It really needs a lot more sanity checking, and that comment about probably needing a memory barrier is still valid. And even then there's the question about replacing the same folio in the same spot in the xarray. I'm not convinced it is worth worrying about in any reality we care about, but it's _technically_ all a bit wrong. So I'm throwing that patch over the fence to somebody that cares. I _do_ now claim it at least kind of works. Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-26 23:48 ` Linus Torvalds @ 2024-02-27 7:21 ` Kent Overstreet 2024-02-27 15:39 ` Matthew Wilcox 2024-02-27 16:34 ` Linus Torvalds 0 siblings, 2 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-27 7:21 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Mon, Feb 26, 2024 at 03:48:35PM -0800, Linus Torvalds wrote: > On Mon, 26 Feb 2024 at 14:46, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I really haven't tested this AT ALL. I'm much too scared. > > "Courage is not the absence of fear, but acting in spite of it" > - Paddington Bear / Michal Scott > > It seems to actually boot here. > > That said, from a quick test with lots of threads all hammering on the > same page - I'm still not entirely convinced it makes a difference. > Sure, the kernel profile changes, but filemap_get_read_batch() wasn't > very high up in the profile to begin with. > > I didn't do any actual performance testing, I just did a 64-byte pread > at offset 0 in a loop in 64 threads on my 32c/64t machine. Only rough testing, but this is looking like around a 25% performance increase doing 4k random reads on a 1G file with fio, 8 jobs, on my Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're only spending half our cpu time in the kernel, so the buffered read path is actually getting 40% or 50% faster. So I'd say that's substantial. RCU freeing of pagecache pages would be even better - I think that'd let us completely get rid of the barrier & xarray recheck, and we wouldn't have to do it as a silly special case. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 7:21 ` Kent Overstreet @ 2024-02-27 15:39 ` Matthew Wilcox 2024-02-27 15:54 ` Kent Overstreet 2024-02-27 16:34 ` Linus Torvalds 1 sibling, 1 reply; 90+ messages in thread From: Matthew Wilcox @ 2024-02-27 15:39 UTC (permalink / raw) To: Kent Overstreet Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Tue, Feb 27, 2024 at 02:21:59AM -0500, Kent Overstreet wrote: > On Mon, Feb 26, 2024 at 03:48:35PM -0800, Linus Torvalds wrote: > > On Mon, 26 Feb 2024 at 14:46, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > I really haven't tested this AT ALL. I'm much too scared. > > > > "Courage is not the absence of fear, but acting in spite of it" > > - Paddington Bear / Michael Scott > > > > It seems to actually boot here. > > > > That said, from a quick test with lots of threads all hammering on the > > same page - I'm still not entirely convinced it makes a difference. > > Sure, the kernel profile changes, but filemap_get_read_batch() wasn't > > very high up in the profile to begin with. > > > > I didn't do any actual performance testing, I just did a 64-byte pread > > at offset 0 in a loop in 64 threads on my 32c/64t machine. > > Only rough testing, but this is looking like around a 25% performance > increase doing 4k random reads on a 1G file with fio, 8 jobs, on my > Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're > only spending half our cpu time in the kernel, so the buffered read path > is actually getting 40% or 50% faster. Linus' patch only kicks in for 128 bytes or smaller. So what are you measuring? ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 15:39 ` Matthew Wilcox @ 2024-02-27 15:54 ` Kent Overstreet 0 siblings, 0 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-27 15:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Al Viro, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Tue, Feb 27, 2024 at 03:39:35PM +0000, Matthew Wilcox wrote: > On Tue, Feb 27, 2024 at 02:21:59AM -0500, Kent Overstreet wrote: > > On Mon, Feb 26, 2024 at 03:48:35PM -0800, Linus Torvalds wrote: > > > On Mon, 26 Feb 2024 at 14:46, Linus Torvalds > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > I really haven't tested this AT ALL. I'm much too scared. > > > > > > "Courage is not the absence of fear, but acting in spite of it" > > > - Paddington Bear / Michael Scott > > > > > > It seems to actually boot here. > > > > > > That said, from a quick test with lots of threads all hammering on the > > > same page - I'm still not entirely convinced it makes a difference. > > > Sure, the kernel profile changes, but filemap_get_read_batch() wasn't > > > very high up in the profile to begin with. > > > > > > I didn't do any actual performance testing, I just did a 64-byte pread > > > at offset 0 in a loop in 64 threads on my 32c/64t machine. > > > > Only rough testing, but this is looking like around a 25% performance > > increase doing 4k random reads on a 1G file with fio, 8 jobs, on my > > Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're > > only spending half our cpu time in the kernel, so the buffered read path > > is actually getting 40% or 50% faster. > > Linus' patch only kicks in for 128 bytes or smaller. So what are you > measuring? 64 byte reads ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 7:21 ` Kent Overstreet 2024-02-27 15:39 ` Matthew Wilcox @ 2024-02-27 16:34 ` Linus Torvalds 2024-02-27 16:47 ` Kent Overstreet 1 sibling, 1 reply; 90+ messages in thread From: Linus Torvalds @ 2024-02-27 16:34 UTC (permalink / raw) To: Kent Overstreet Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Mon, 26 Feb 2024 at 23:22, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > Only rough testing, but this is looking like around a 25% performance > increase doing 4k random reads on a 1G file with fio, 8 jobs, on my > Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're > only spending half our cpu time in the kernel, so the buffered read path > is actually getting 40% or 50% faster. > > So I'd say that's substantial. No, you're doing something wrong. The new fastread logic only triggers for reads <= 128 bytes, so you must have done some other major change (like built a kernel without the mitigations, and compared it to one with mitigations - that would easily be 25% depending on hardware). Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 16:34 ` Linus Torvalds @ 2024-02-27 16:47 ` Kent Overstreet 2024-02-27 17:07 ` Linus Torvalds 0 siblings, 1 reply; 90+ messages in thread From: Kent Overstreet @ 2024-02-27 16:47 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Tue, Feb 27, 2024 at 08:34:39AM -0800, Linus Torvalds wrote: > On Mon, 26 Feb 2024 at 23:22, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > Only rough testing, but this is looking like around a 25% performance > > increase doing 4k random reads on a 1G file with fio, 8 jobs, on my > > Ryzen 5950x - 16.7M -> 21.4M iops, very roughly. fio's a pig and we're > > only spending half our cpu time in the kernel, so the buffered read path > > is actually getting 40% or 50% faster. > > > > So I'd say that's substantial. > > No, you're doing something wrong. The new fastread logic only > triggers for reads <= 128 bytes, so you must have done some other > major change (like built a kernel without the mitigations, and > compared it to one with mitigations - that would easily be 25% > depending on hardware). Like I replied to willy - the "4k" was a typo from force of habit, I was doing 64 byte random reads. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 16:47 ` Kent Overstreet @ 2024-02-27 17:07 ` Linus Torvalds 2024-02-27 17:20 ` Kent Overstreet 0 siblings, 1 reply; 90+ messages in thread From: Linus Torvalds @ 2024-02-27 17:07 UTC (permalink / raw) To: Kent Overstreet Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Tue, 27 Feb 2024 at 08:47, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > Like I replied to willy - the "4k" was a typo from force of habit, I was > doing 64 byte random reads. Ok, can you send the exact fio command? I'll do some testing here too. Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 17:07 ` Linus Torvalds @ 2024-02-27 17:20 ` Kent Overstreet 2024-02-27 18:02 ` Linus Torvalds 0 siblings, 1 reply; 90+ messages in thread From: Kent Overstreet @ 2024-02-27 17:20 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Tue, Feb 27, 2024 at 09:07:01AM -0800, Linus Torvalds wrote: > On Tue, 27 Feb 2024 at 08:47, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > > > Like I replied to willy - the "4k" was a typo from force of habit, I was > > doing 64 byte random reads. > > Ok, can you send the exact fio command? I'll do some testing here too. > > Linus fio \ --group_reporting \ --gtod_reduce=1 \ --norandommap \ --runtime=60s \ --exitall_on_error=1 \ --filename=/root/foo \ --name=read \ --rw=randread \ --loops=100 \ --numjobs=8 \ --filesize=1G \ --bs=64b \ --io_size=10G ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 17:20 ` Kent Overstreet @ 2024-02-27 18:02 ` Linus Torvalds 0 siblings, 0 replies; 90+ messages in thread From: Linus Torvalds @ 2024-02-27 18:02 UTC (permalink / raw) To: Kent Overstreet Cc: Al Viro, Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Tue, 27 Feb 2024 at 09:20, Kent Overstreet <kent.overstreet@linux.dev> wrote: > > fio \ > --group_reporting \ [..] Ok, I have to say that my kernel profile looks very nice with that patch. It all looks almost perfect, and the main big spike in the profile is the inevitable "clac/stac" pair around the user copy because my old threadripper just ends up not doing that very well. Of course, I disable all the mitigations for my profiling runs, because not doing that is just too depressing and any real kernel issues that we can do something about get hidden by the noise introduced by the mitigations. So my profiles are entirely artificial in that sense (I tell myself that it's what future CPUs will look like and that it's ok. Then I cry myself to sleep). I'm not going to push that patch of mine - I still suspect that doing this all under RCU with page faults optimistically disabled would be a much nicer model and work for bigger areas - but I don't hate the patch either, and if somebody else wants to push it, I'm ok with it. I can't actually think of any real load that does a lot of small reads. Willy's odd customer notwithstanding. Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-26 22:46 ` Linus Torvalds 2024-02-26 23:48 ` Linus Torvalds @ 2024-05-14 11:52 ` Luis Chamberlain 2024-05-14 16:04 ` Linus Torvalds 2024-11-15 19:43 ` Linus Torvalds 1 sibling, 2 replies; 90+ messages in thread From: Luis Chamberlain @ 2024-05-14 11:52 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, Kent Overstreet, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Mon, Feb 26, 2024 at 02:46:56PM -0800, Linus Torvalds wrote: > I really haven't tested this AT ALL. I'm much too scared. But I don't > actually hate how the code looks nearly as much as I *thought* I'd > hate it. Thanks for this, obviously those interested in this will have to test this and fix the below issues. I've tested for regressions just against xfs on 4k reflink profile and detected only two failures, generic/095 fails with a failure rate of about 1/2 or so: * generic/095 * generic/741 For details refer to the test result archive [0]. To reproduce with kdevops: make defconfig-xfs_reflink_4k KDEVOPS_HOSTS_PREFIX=hacks B4_MESSAGE_ID=20240514084221.3664475-1-mcgrof@kernel.org make -j $(nproc) make bringup make fstests make linux make fstests-baseline TESTS=generic/095 COUNT=10 [0] https://github.com/linux-kdevops/kdevops-results-archive/commit/27186b2782af4af559524539a2f6058d12361e10 Luis ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-05-14 11:52 ` Luis Chamberlain @ 2024-05-14 16:04 ` Linus Torvalds 2024-11-15 19:43 ` Linus Torvalds 1 sibling, 0 replies; 90+ messages in thread From: Linus Torvalds @ 2024-05-14 16:04 UTC (permalink / raw) To: Luis Chamberlain Cc: Al Viro, Kent Overstreet, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Tue, 14 May 2024 at 04:52, Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Mon, Feb 26, 2024 at 02:46:56PM -0800, Linus Torvalds wrote: > > I really haven't tested this AT ALL. I'm much too scared. But I don't > > actually hate how the code looks nearly as much as I *thought* I'd > > hate it. > > Thanks for this, obviously those interested in this will have to test > this and fix the below issues. I've tested for regressions just against > xfs on 4k reflink profile and detected only two failures, generic/095 > fails with a failure rate of about 1/2 or so: > > * generic/095 > * generic/741 Funky. I do *not* see how those can fail due to the change, but that's the point of testing. Somebody who knows those two tests better, and figures out what the difference is would have to get involved. One different thing that my fast-read case does is that it does *NOT* do the crazy dcache coherency thing that the "full" case does, ie the writably_mapped = mapping_writably_mapped(mapping); ... /* * If users can be writing to this folio using arbitrary * virtual addresses, take care of potential aliasing * before reading the folio on the kernel side. */ if (writably_mapped) flush_dcache_folio(folio); but that shouldn't matter on any sane architecture. Sadly, even arm64 counts as "insane" here, because it does the I$ sync using flush_dcache_folio(). I can't tell what architecture the testing was done on, but I assume it was x86, and I assume the above detail is _not_ the cause. Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-05-14 11:52 ` Luis Chamberlain 2024-05-14 16:04 ` Linus Torvalds @ 2024-11-15 19:43 ` Linus Torvalds 2024-11-15 20:42 ` Matthew Wilcox 1 sibling, 1 reply; 90+ messages in thread From: Linus Torvalds @ 2024-11-15 19:43 UTC (permalink / raw) To: Luis Chamberlain Cc: Al Viro, Matthew Wilcox, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner [-- Attachment #1: Type: text/plain, Size: 1719 bytes --] I'm coming back to this ancient discussion, because I've actually been running that odd small-reads patch on my machine since February, because it's been in my random pile of "local test patches". I had to rebase the patch because it conflicted with a recent fix, and as part of rebasing it I looked at it again. And I did find that the "do small reads entirely under RCU" didn't test for one condition that filemap_get_read_batch() checked for: the xa_is_sibling() check. So... On Tue, 14 May 2024 at 04:52, Luis Chamberlain <mcgrof@kernel.org> wrote: > > On Mon, Feb 26, 2024 at 02:46:56PM -0800, Linus Torvalds wrote: > > I really haven't tested this AT ALL. I'm much too scared. But I don't > > actually hate how the code looks nearly as much as I *thought* I'd > > hate it. > > Thanks for this, obviously those interested in this will have to test > this and fix the below issues. I've tested for regressions just against > xfs on 4k reflink profile and detected only two failures, generic/095 > fails with a failure rate of about 1/2 or so: > > * generic/095 > * generic/741 Would you mind seeing if that ancient patch with the xa_is_sibling() check added, and rebased to be on top of current -git maybe works for you now? I still haven't "really" tested this in any real way, except for running all my normal desktop loads on it. Which isn't really saying much. I compile kernels, and read email. That's pretty much all I do. So no fstests or anything like that. But it *has* worked in that capacity for 8+ months now. So it's not entirely broken, but the fact that it failed fstests for you clearly means that *something* was subtly but horribly broken in the original version. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 5550 bytes --] From 2dafd15ceb00a41b9f6b37c1d84662039ba2d140 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Mon, 26 Feb 2024 15:18:44 -0800 Subject: [PATCH] mm/filemap: do small reads in RCU-mode read without refcounts Hackety hack hack concept from report by Willy. Mommy, I'm scared. 
Link: https://lore.kernel.org/all/Zduto30LUEqIHg4h@casper.infradead.org/ Not-yet-signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- mm/filemap.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 107 insertions(+), 10 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 56fa431c52af..91dd09c43af5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2594,6 +2594,85 @@ static inline bool pos_same_folio(loff_t pos1, loff_t pos2, struct folio *folio) return (pos1 >> shift == pos2 >> shift); } +/* + * I can't be bothered to care about HIGHMEM for the fast read case + */ +#ifdef CONFIG_HIGHMEM +#define filemap_fast_read(mapping, pos, buffer, size) 0 +#else + +/* + * Called under RCU with size limited to the file size and one page + */ +static inline unsigned long filemap_folio_copy_rcu(struct address_space *mapping, loff_t pos, char *buffer, size_t size) +{ + XA_STATE(xas, &mapping->i_pages, pos >> PAGE_SHIFT); + struct folio *folio; + size_t offset; + + xas_reset(&xas); + folio = xas_load(&xas); + if (xas_retry(&xas, folio)) + return 0; + + if (!folio || xa_is_value(folio) || xa_is_sibling(folio)) + return 0; + + if (!folio_test_uptodate(folio)) + return 0; + + /* No fast-case if we are supposed to start readahead */ + if (folio_test_readahead(folio)) + return 0; + /* .. or mark it accessed */ + if (!folio_test_referenced(folio)) + return 0; + + /* Do the data copy */ + offset = pos & (folio_size(folio) - 1); + memcpy(buffer, folio_address(folio) + offset, size); + + /* We should probably do some silly memory barrier here */ + if (unlikely(folio != xas_reload(&xas))) + return 0; + + return size; +} + +/* + * Iff we can complete the read completely in one atomic go under RCU, + * do so here. Otherwise return 0 (no partial reads, please - this is + * purely for the trivial fast case). + */ +static inline unsigned long filemap_fast_read(struct address_space *mapping, loff_t pos, char *buffer, size_t size) +{ + struct inode *inode; + loff_t file_size; + unsigned long pgoff; + + /* Don't even try for page-crossers */ + pgoff = pos & ~PAGE_MASK; + if (pgoff + size > PAGE_SIZE) + return 0; + + /* Limit it to the file size */ + inode = mapping->host; + file_size = i_size_read(inode); + if (unlikely(pos >= file_size)) + return 0; + file_size -= pos; + if (file_size < size) + size = file_size; + + /* Let's see if we can just do the read under RCU */ + rcu_read_lock(); + size = filemap_folio_copy_rcu(mapping, pos, buffer, size); + rcu_read_unlock(); + + return size; +} +#endif /* !HIGHMEM */ + /** * filemap_read - Read data from the page cache. * @iocb: The iocb to read. @@ -2614,7 +2693,10 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, struct file_ra_state *ra = &filp->f_ra; struct address_space *mapping = filp->f_mapping; struct inode *inode = mapping->host; - struct folio_batch fbatch; + union { + struct folio_batch fbatch; + __DECLARE_FLEX_ARRAY(char, buffer); + } area; int i, error = 0; bool writably_mapped; loff_t isize, end_offset; @@ -2626,7 +2708,22 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, return 0; iov_iter_truncate(iter, inode->i_sb->s_maxbytes - iocb->ki_pos); - folio_batch_init(&fbatch); + + if (iov_iter_count(iter) <= sizeof(area)) { + unsigned long count = iov_iter_count(iter); + + count = filemap_fast_read(mapping, iocb->ki_pos, area.buffer, count); + if (count) { + size_t copied = copy_to_iter(area.buffer, count, iter); + if (unlikely(!copied)) + return already_read ? 
already_read : -EFAULT; + ra->prev_pos = iocb->ki_pos += copied; + file_accessed(filp); + return copied + already_read; + } + } + + folio_batch_init(&area.fbatch); do { cond_resched(); @@ -2642,7 +2739,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, if (unlikely(iocb->ki_pos >= i_size_read(inode))) break; - error = filemap_get_pages(iocb, iter->count, &fbatch, false); + error = filemap_get_pages(iocb, iter->count, &area.fbatch, false); if (error < 0) break; @@ -2670,11 +2767,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, * mark it as accessed the first time. */ if (!pos_same_folio(iocb->ki_pos, last_pos - 1, - fbatch.folios[0])) - folio_mark_accessed(fbatch.folios[0]); + area.fbatch.folios[0])) + folio_mark_accessed(area.fbatch.folios[0]); - for (i = 0; i < folio_batch_count(&fbatch); i++) { - struct folio *folio = fbatch.folios[i]; + for (i = 0; i < folio_batch_count(&area.fbatch); i++) { + struct folio *folio = area.fbatch.folios[i]; size_t fsize = folio_size(folio); size_t offset = iocb->ki_pos & (fsize - 1); size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos, @@ -2705,9 +2802,9 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, } } put_folios: - for (i = 0; i < folio_batch_count(&fbatch); i++) - folio_put(fbatch.folios[i]); - folio_batch_init(&fbatch); + for (i = 0; i < folio_batch_count(&area.fbatch); i++) + folio_put(area.fbatch.folios[i]); + folio_batch_init(&area.fbatch); } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error); file_accessed(filp); ^ permalink raw reply related [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-11-15 19:43 ` Linus Torvalds @ 2024-11-15 20:42 ` Matthew Wilcox 2024-11-15 21:52 ` Linus Torvalds 0 siblings, 1 reply; 90+ messages in thread From: Matthew Wilcox @ 2024-11-15 20:42 UTC (permalink / raw) To: Linus Torvalds Cc: Luis Chamberlain, Al Viro, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Fri, Nov 15, 2024 at 11:43:10AM -0800, Linus Torvalds wrote: > And I did find that the "do small reads entirely under RCU" didn't > test for one condition that filemap_get_read_batch() checked for: the > xa_is_sibling() check. I don't think you need it: xas_load: entry = xas_descend(xas, node); xas_descend: while (xa_is_sibling(entry)) { entry = xa_entry(xas->xa, node, offset); so I don't think xa_is_sibling() can ever be true here. xas_next() can return a sibling entry, so it does need the extra check. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-11-15 20:42 ` Matthew Wilcox @ 2024-11-15 21:52 ` Linus Torvalds 0 siblings, 0 replies; 90+ messages in thread From: Linus Torvalds @ 2024-11-15 21:52 UTC (permalink / raw) To: Matthew Wilcox Cc: Luis Chamberlain, Al Viro, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Fri, 15 Nov 2024 at 12:42, Matthew Wilcox <willy@infradead.org> wrote: > > I don't think you need it: D'oh! I think I was even aware of that originally, but now that I rebased it I just did a mindless "what's the difference between the fast-path and the slow path" thing. The other case I noticed - but ignored - is this part of the slow case: /* * i_size must be checked after we know the pages are Uptodate. * * Checking i_size after the check allows us to calculate * the correct value for "nr", which means the zero-filled * part of the page is not copied back to userspace (unless * another truncate extends the file - this is desired though). */ isize = i_size_read(inode); if (unlikely(iocb->ki_pos >= isize)) goto put_folios; ... /* * If users can be writing to this folio using arbitrary * virtual addresses, take care of potential aliasing * before reading the folio on the kernel side. */ if (writably_mapped) flush_dcache_folio(folio); but I thought that the truncate case shouldn't matter - because a truncate can still happen *during* the copy - and the second case is not relevant on cache-coherent architectures. In fact, now I looked at generic/095 some more, and as far as I can see, that fio script doesn't do any small reads at all! Which makes me even more confused about how it was affected by this change? Funky funky. But maybe that file size check. Because clearly there *is* some difference here, even if it has worked for me and it all looks ObviouslyCorrect(tm). Oh well. Maybe in another 9 months I'll figure it out. I'll keep the patch in my pile of random patches. Linus ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-25 17:03 ` Linus Torvalds 2024-02-25 21:14 ` Matthew Wilcox @ 2024-02-25 21:29 ` Kent Overstreet 1 sibling, 0 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-25 21:29 UTC (permalink / raw) To: Linus Torvalds Cc: Matthew Wilcox, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sun, Feb 25, 2024 at 09:03:32AM -0800, Linus Torvalds wrote: > On Sun, 25 Feb 2024 at 05:10, Matthew Wilcox <willy@infradead.org> wrote: > > > > There's also the small random 64 byte read case that we haven't optimised > > for yet. That also bottlenecks on the page refcount atomic op. > > > > The proposed solution to that was double-copy; look up the page without > > bumping its refcount, copy to a buffer, look up the page again to be > > sure it's still there, copy from the buffer to userspace. > > Please stop the cray-cray. > > Yes, cache dirtying is expensive. But you don't actually have > cacheline ping-pong, because you don't have lots of different CPUs > hammering the same page cache page in any normal circumstances. So the > really expensive stuff just doesn't exist. Not ping-pong, you're just blowing the cachelines you want out of L1 with the big usercopy, hardware caches not being fully associative. > I think you've been staring at profiles too much. In instruction-level > profiles, the atomic ops stand out a lot. But that's at least partly > artificial - they are a serialization point on x86, so things get > accounted to them. So they tend to be the collection point for > everything around them in an OoO CPU. Yes, which leads to a fun game of whack-a-mole when you eliminate one atomic op and then everything just ends up piling up behind a different atomic op - but for the buffered read path, the folio get/put are the only atomic ops. > For example, the fact that Kent complains about the page cache and > talks about large folios is completely ludicrous. I've seen the > benchmarks of real loads. Kent - you're not close to any limits, you > are often a factor of two to five off other filesystems. We're not > talking "a few percent", and we're not talking "the atomics are > hurting". Yes, there's a bunch of places where bcachefs is still slow; it'll get there :) If you've got those benchmarks handy and they're ones I haven't seen, I'd love to take a look; the one that always jumps out at people is small O_DIRECT reads, and that hasn't been a priority because O_DIRECT doesn't matter to most people nearly as much as they think it does. There's a bunch of stuff still to work through; another that comes to mind is that we need a free inodes btree to eliminate scanning in inode create, and that was half a day of work - except it also needs sharding (i.e. leaf nodes can't span certain boundaries), and for that I need variable sized btree nodes so we aren't burning stupid amounts of memory - and that's something we need anyways, with the number of btrees growing like it is. Another fun one that I just discovered while I was hanging out at Darrick's - the journal was stalling on high iodepth workloads; the device write buffer fills up, write latency goes up, and suddenly the journal can't write quickly enough when it's only submitting one write at a time. So there's a fix for 6.9 queued up that lets the journal keep multiple writes in flight. 
That one was worth mentioning because another fix would've been to add a way to signal backpressure to /above/ the filesystem, so that we don't hit such big queuing delays within the filesystem; right now user writes don't hit backpressure until submit_bio() blocks because the request queue is full. I've been seeing other performance corner cases where it looks like such a mechanism would be helpful. I except I've got a solid year or two ahead of me of mastly just working through performance bugs - standing up a lot of automated perf testing adn whatnot. But, one thing at a time... ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-25 13:10 ` Matthew Wilcox 2024-02-25 17:03 ` Linus Torvalds @ 2024-02-25 17:32 ` Kent Overstreet 1 sibling, 0 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-25 17:32 UTC (permalink / raw) To: Matthew Wilcox Cc: Linus Torvalds, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner On Sun, Feb 25, 2024 at 01:10:39PM +0000, Matthew Wilcox wrote: > On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote: > > Before large folios, we had people very much bottlenecked by 4k page > > overhead on sequential IO; my customer/sponsor was one of them. > > > > Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the > > profiles and looked at the filemap.c code it wasn't hard to see why; > > we'd walk a radix tree, do an atomic op (get the page), then do a 4k > > usercopy... hence the work I did to break up > > generic_file_buffered_read() and vectorize it, which was a huge > > improvement. > > There's also the small random 64 byte read case that we haven't optimised > for yet. That also bottlenecks on the page refcount atomic op. > > The proposed solution to that was double-copy; look up the page without > bumping its refcount, copy to a buffer, look up the page again to be > sure it's still there, copy from the buffer to userspace. > > Except that can go wrong under really unlikely circumstances. Look up the > page, page gets freed, page gets reallocated to slab, we copy sensitive > data from it, page gets freed again, page gets reallocated to the same > spot in the file (!), lookup says "yup the same page is there". > We'd need a seqcount or something to be sure the page hasn't moved. yes, generation numbers are the standard solution to ABA... ^ permalink raw reply [flat|nested] 90+ messages in thread
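To make the double-copy-plus-recheck idea being discussed concrete, here
is a minimal userspace sketch with a generation counter playing the
seqcount role. cached_page and read_small_nolock() are made-up names,
not kernel APIs; a kernel version would do the unreferenced copy under
RCU and bump the generation when the page is freed or reused:

/*
 * Sketch only: copy a small read without taking a page reference, then
 * verify the page wasn't recycled underneath us (the ABA case above).
 */
#include <stdatomic.h>
#include <string.h>

#define PAGE_SZ 4096

struct cached_page {
	_Atomic unsigned long gen;	/* bumped on every (re)allocation */
	char data[PAGE_SZ];
};

/* Returns 0 on success, -1 if the caller must retry or take a ref. */
static int read_small_nolock(struct cached_page *p, size_t off, size_t len,
			     char *out)
{
	char tmp[64];
	unsigned long gen;

	if (len > sizeof(tmp) || off + len > PAGE_SZ)
		return -1;			/* not a small read */

	gen = atomic_load_explicit(&p->gen, memory_order_acquire);
	memcpy(tmp, p->data + off, len);	/* first copy, no refcount */
	if (atomic_load_explicit(&p->gen, memory_order_acquire) != gen)
		return -1;			/* page was recycled: ABA */

	memcpy(out, tmp, len);			/* second copy to the caller */
	return 0;
}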
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-24  4:12 ` Matthew Wilcox
  2024-02-24 17:31   ` Linus Torvalds
@ 2024-02-24 17:55   ` Luis Chamberlain
  1 sibling, 0 replies; 90+ messages in thread
From: Luis Chamberlain @ 2024-02-24 17:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav,
      Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason,
      Johannes Weiner, Linus Torvalds

On Sat, Feb 24, 2024 at 04:12:31AM +0000, Matthew Wilcox wrote:
> On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> > Vs
> > ~ 7,000 MiB/s with buffered IO
>
> Profile?  My guess is that you're bottlenecked on the xa_lock between
> memory reclaim removing folios from the page cache and the various
> threads adding folios to the page cache.

If it was lock contention I was hoping to use perf lock record on fio,
then

  perf lock report -F acquired,contended,avg_wait,wait_total

If the contention was on locking xa_lock, it would creep up here, no?

 Name                acquired  contended    avg wait  total wait
 cgroup_rstat_lock      90132      90132    26.41 us      2.38 s
 event_mutex            32538      32538     1.40 ms     45.61 s
                        23476      23476   123.48 us      2.90 s
                        20803      20803    47.58 us   989.73 ms
                        11921      11921    31.19 us   371.82 ms
                         9389       9389   102.65 us   963.80 ms
                         7763       7763    21.86 us   169.69 ms
                         1736       1736    15.49 us    26.89 ms
                          743        743   308.30 us   229.07 ms
                          667        667   269.69 us   179.88 ms
                          522        522    36.64 us    19.13 ms
                          335        335    19.38 us     6.49 ms
                          328        328   157.10 us    51.53 ms
                          296        296   278.22 us    82.35 ms
                          288        288   214.82 us    61.87 ms
                          282        282   314.38 us    88.65 ms
                          275        275   128.98 us    35.47 ms
                          269        269   141.99 us    38.19 ms
                          264        264   277.73 us    73.32 ms
                          260        260   160.02 us    41.61 ms
 event_mutex              251        251   242.03 us    60.75 ms
                          248        248    12.47 us     3.09 ms
                          246        246   328.33 us    80.77 ms
                          245        245   189.83 us    46.51 ms
                          245        245   275.17 us    67.42 ms
                          235        235   152.49 us    35.84 ms
                          235        235    38.55 us     9.06 ms
                          228        228   137.27 us    31.30 ms
                          224        224    94.65 us    21.20 ms
                          221        221   198.13 us    43.79 ms
                          220        220   411.64 us    90.56 ms
                          214        214   291.08 us    62.29 ms
                          209        209   132.94 us    27.79 ms
                          207        207   364.20 us    75.39 ms
                          204        204   346.68 us    70.72 ms
                          194        194   169.77 us    32.94 ms
                          181        181   137.87 us    24.95 ms
                          181        181   154.78 us    28.01 ms
                          172        172   145.11 us    24.96 ms
                          169        169   124.30 us    21.01 ms
                          168        168   378.92 us    63.66 ms
                          161        161    91.64 us    14.75 ms
                          161        161   264.51 us    42.59 ms
                          153        153    85.53 us    13.09 ms
                          150        150   383.28 us    57.49 ms
                          148        148    91.24 us    13.50 ms

I'll have to nose dive some more.. but for the life of me I can't see
the expected xa_lock contention.

  Luis

^ permalink raw reply	[flat|nested] 90+ messages in thread
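One caveat worth noting on the above: perf lock record relies on the
lock tracepoints of a lockdep-enabled kernel, so a contended spinlock
like the xa_lock can be easy to miss depending on kernel config. On a
recent enough perf, the BPF-based contention mode should be able to see
it; flags as per perf-lock(1):

  # whole system, aggregate by caller, while the workload runs
  perf lock contention -a -b -- sleep 10

  # same, but aggregate per lock instance/address
  perf lock contention -a -b -l -- sleep 10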
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain 2024-02-24 4:12 ` Matthew Wilcox @ 2024-02-25 5:24 ` Kent Overstreet 2024-02-26 12:22 ` Dave Chinner 2024-02-27 10:07 ` Kent Overstreet 3 siblings, 0 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-25 5:24 UTC (permalink / raw) To: Luis Chamberlain Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > Part of the testing we have done with LBS was to do some performance > tests on XFS to ensure things are not regressing. Building linux is a > fine decent test and we did some random cloud instance tests on that and > presented that at Plumbers, but it doesn't really cut it if we want to > push things to the limit though. What are the limits to buffered IO > and how do we test that? Who keeps track of it? > > The obvious recurring tension is that for really high performance folks > just recommend to use birect IO. But if you are stress testing changes > to a filesystem and want to push buffered IO to its limits it makes > sense to stick to buffered IO, otherwise how else do we test it? > > It is good to know limits to buffered IO too because some workloads > cannot use direct IO. For instance PostgreSQL doesn't have direct IO > support and even as late as the end of last year we learned that adding > direct IO to PostgreSQL would be difficult. Chris Mason has noted also > that direct IO can also force writes during reads (?)... Anyway, testing > the limits of buffered IO limits to ensure you are not creating > regressions when doing some page cache surgery seems like it might be > useful and a sensible thing to do .... The good news is we have not found > regressions with LBS but all the testing seems to beg the question, of what > are the limits of buffered IO anyway, and how does it scale? Do we know, do > we care? Do we keep track of it? How does it compare to direct IO for some > workloads? How big is the delta? How do we best test that? How do we > automate all that? Do we want to automatically test this to avoid regressions? > > The obvious issues with some workloads for buffered IO is having a > possible penality if you are not really re-using folios added to the > page cache. Jens Axboe reported a while ago issues with workloads with > random reads over a data set 10x the size of RAM and also proposed > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more > like direct IO with kernel pages and a memcpy(), and it requires > further serialization to be implemented that we already do for > direct IO for writes. There at least seems to be agreement that if we're > going to provide an enhancement or alternative that we should strive to not > make the same mistakes we've done with direct IO. The rationale for some > workloads to use buffered IO is it helps reduce some tail latencies, so > that's something to live up to. > > On that same thread Christoph also mentioned the possibility of a direct > IO variant which can leverage the cache. Is that something we want to > move forward with? The thing to consider here would be an improved O_SYNC. There's a fair amount of tree walking and thread to thread cacheline bouncing that would be avoided by just calling .write_folios() and kicking bios off from .write_iter(). 
OTOH - the way it's done now is probably the best possible way of splitting up the work between multiple threads, so I'd expect this approach to get less throughput than current O_SYNC. Luis, are you profiling these workloads? I haven't looked at high throughput profiles of the buffered IO path in years, and that's a good place to start. ^ permalink raw reply [flat|nested] 90+ messages in thread
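For reference, the usual starting point for such profiles would be
something like the following; the fio job file name is a placeholder:

  perf record -a -g -- fio <job.fio>
  perf report --no-children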
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain 2024-02-24 4:12 ` Matthew Wilcox 2024-02-25 5:24 ` Kent Overstreet @ 2024-02-26 12:22 ` Dave Chinner 2024-02-27 10:07 ` Kent Overstreet 3 siblings, 0 replies; 90+ messages in thread From: Dave Chinner @ 2024-02-26 12:22 UTC (permalink / raw) To: Luis Chamberlain Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > I recently ran a different type of simple test, focused on sequantial writes > to fill capacity, with write workload essentially matching your RAM, so > having parity with your RAM. Technically in the case of max size that I > tested the writes were just *slightly* over the RAM, that's a minor > technicality given I did other tests with similar sizes which showed similar > results... This test should be possible to reproduce then if you have more > than enough RAM to spare. In this case the system uses 1 TiB RAM, using > pmem to avoid drive variance / GC / and other drive shenanigans. > > So pmem grub setup: > > memmap=500G!4G memmap=3G!504G > > As noted earlier, surely, DIO / DAX is best for pmem (and I actually get > a difference between using just DIO and DAX, but that digresses), but > when one is wishing to test buffered IO on purpose it makes sense to do > this. Yes, we can test tmpfs too... but I believe that topic will be > brought up at LSFMM separately. The delta with DIO and buffered IO on > XFS is astronomical: > > ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64 > Vs > ~ 7,000 MiB/s with buffered IO You're not testing apples to apples. Buffered writes to the same superblock serialise on IO submission, not write() calls, so it doesn't matter how much concurrency you have in write() syscalls. That is, streaming buffered write throughput is entirely limited by the number of IOs that the bdi flusher thread can submit. For ext4, XFS and btrfs, delayed allocation means that this writeback thread is also doing extent allocation for all IO, and hence the single writeback thread for buffered writes is the performance limiting factor for them. It doesn't matter how fast you can copy in to the kernel, it can only drain as fast as it can submit IO. As soon as this writeback thread is CPU bound, incoming buffered write()s will be throttle back to the rate at which memory can be cleaned by the writeback thread. Direct IO doesn't have this limitation - it's an orange in comparison because IO is always submitted by the task that does the write() syscall. Hence it inherently scales out to the limit of the underlying hardware and it is not limited by the throughput of a single CPU like page cache writeback is. If you wonder why people are saying "issue sync_file_range() periodically" to improved buffered write throughput, it's because it moves the async writeback submission for that inode out of the single background writeback thread and into task context where IO submission can be trivially parallelised. Just like direct IO.... IOWs, the issue you are demonstrating is the inherent limitations in single threaded write-behind cache flushing, and the solution to that specific bottleneck is to enable concurrent writeback submission from the same file and/or superblock via various available manual mechanisms. 
An automatic way of doing this for large streaming writes is to switch
from write-behind to near-write-through, such that the majority of
write IO is submitted asynchronously from the write() syscall. Think of
how readahead from read() context pulls in data that is likely to be
needed soon - sequential writes should trigger similar behaviour where
we do async write-behind of the previous write()s in the context of the
current write. Track a sequential write window like we do readahead,
and trigger async writeback for such streaming writes from the write()
context...

That doesn't solve the huge tarball problem where we create millions of
small files in a couple of seconds, then have to wait for
single-threaded writeback to drain them to the storage at 50,000
files/s. We can create files and get the data into the cache far faster
and with way more concurrency than the page cache can push the data
back to the storage itself.

IOWs, the problems with page cache write throughput really have nothing
to do with write() scalability, folios or filesystem block sizes. The
fundamental problem is single-threaded writeback IO submission and the
throttling of incoming writes to whatever speed it runs at when CPU
bound....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 90+ messages in thread
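A minimal sketch of the manual write-behind pattern described above,
using the existing sync_file_range() syscall; stream_write() is an
illustrative helper, not an existing API, and the 64MB window is an
arbitrary choice:

/*
 * After each window of a streaming buffered write, asynchronously kick
 * writeback for the window just filled and wait on the one before it,
 * so IO is submitted from write() context instead of piling up behind
 * the single flusher thread.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define WINDOW	(64L << 20)

static ssize_t stream_write(int fd, const char *buf, size_t len, off_t *pos)
{
	ssize_t ret = pwrite(fd, buf, len, *pos);

	if (ret <= 0)
		return ret;
	*pos += ret;

	if (*pos % WINDOW < ret) {		/* just crossed a boundary */
		off_t win = *pos / WINDOW * WINDOW;

		/* async submission of the window we just filled */
		sync_file_range(fd, win - WINDOW, WINDOW,
				SYNC_FILE_RANGE_WRITE);
		/* bound dirty memory: wait for the window before that */
		if (win >= 2 * WINDOW)
			sync_file_range(fd, win - 2 * WINDOW, WINDOW,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
	}
	return ret;
}

The second call is the backpressure: the write()ing task waits on an
older window, so dirty memory stays bounded while submission happens
with the same task-context concurrency as direct IO.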
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-23 23:59 [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO Luis Chamberlain ` (2 preceding siblings ...) 2024-02-26 12:22 ` Dave Chinner @ 2024-02-27 10:07 ` Kent Overstreet 2024-02-27 14:08 ` Luis Chamberlain 2024-02-27 22:13 ` Dave Chinner 3 siblings, 2 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-27 10:07 UTC (permalink / raw) To: Luis Chamberlain Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > Part of the testing we have done with LBS was to do some performance > tests on XFS to ensure things are not regressing. Building linux is a > fine decent test and we did some random cloud instance tests on that and > presented that at Plumbers, but it doesn't really cut it if we want to > push things to the limit though. What are the limits to buffered IO > and how do we test that? Who keeps track of it? > > The obvious recurring tension is that for really high performance folks > just recommend to use birect IO. But if you are stress testing changes > to a filesystem and want to push buffered IO to its limits it makes > sense to stick to buffered IO, otherwise how else do we test it? > > It is good to know limits to buffered IO too because some workloads > cannot use direct IO. For instance PostgreSQL doesn't have direct IO > support and even as late as the end of last year we learned that adding > direct IO to PostgreSQL would be difficult. Chris Mason has noted also > that direct IO can also force writes during reads (?)... Anyway, testing > the limits of buffered IO limits to ensure you are not creating > regressions when doing some page cache surgery seems like it might be > useful and a sensible thing to do .... The good news is we have not found > regressions with LBS but all the testing seems to beg the question, of what > are the limits of buffered IO anyway, and how does it scale? Do we know, do > we care? Do we keep track of it? How does it compare to direct IO for some > workloads? How big is the delta? How do we best test that? How do we > automate all that? Do we want to automatically test this to avoid regressions? > > The obvious issues with some workloads for buffered IO is having a > possible penality if you are not really re-using folios added to the > page cache. Jens Axboe reported a while ago issues with workloads with > random reads over a data set 10x the size of RAM and also proposed > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more > like direct IO with kernel pages and a memcpy(), and it requires > further serialization to be implemented that we already do for > direct IO for writes. There at least seems to be agreement that if we're > going to provide an enhancement or alternative that we should strive to not > make the same mistakes we've done with direct IO. The rationale for some > workloads to use buffered IO is it helps reduce some tail latencies, so > that's something to live up to. > > On that same thread Christoph also mentioned the possibility of a direct > IO variant which can leverage the cache. Is that something we want to > move forward with? 
>
> Chris Mason also listed a few other desirables if we do:
>
> - Allowing concurrent writes (xfs DIO does this now)

AFAIK every filesystem allows concurrent direct writes, not just xfs,
it's _buffered_ writes that we care about here.

I just pushed a patch to my CI for buffered writes without taking the
inode lock - for bcachefs. It'll be straightforward, but a decent amount
of work, to lift this to the VFS, if people are interested in
collaborating.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

The approach is: for non extending, non appending writes, see if we can
pin the entire range of the pagecache we're writing to; fall back to
taking the inode lock if we can't.

If we do a short write because of a page fault (despite previously
faulting in the userspace buffer), there is no way to completely prevent
torn writes and atomicity breakage; we could at least try a trylock on
the inode lock, I didn't do that here.

For lifting this to the VFS, this needs
 - My darray code, which I'll be moving to include/linux/ in the 6.9
   merge window
 - My pagecache add lock - we need this for synchronization with hole
   punching and truncate when we don't have the inode lock.
 - My vectorized buffered write path lifted to filemap.c, which means we
   need some sort of vectorized replacement for .write_begin and
   .write_end

^ permalink raw reply	[flat|nested] 90+ messages in thread
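As a userspace analogue of that optimistic strategy, purely for
illustration: chunk_lock[] stands in for pinning pagecache folios, the
range is assumed shorter than NR_CHUNKS chunks, and extending/appending
writes are assumed to take the exclusive path as in the patch above:

/*
 * Trylock every cache chunk the write covers; if any chunk is
 * contended, back out and let the caller take the exclusive inode lock.
 */
#include <pthread.h>
#include <stdbool.h>

#define CHUNK_SHIFT	12
#define NR_CHUNKS	1024

static pthread_mutex_t chunk_lock[NR_CHUNKS];

static void chunk_locks_init(void)
{
	for (int i = 0; i < NR_CHUNKS; i++)
		pthread_mutex_init(&chunk_lock[i], NULL);
}

/* Returns false if the caller must fall back to the inode lock. */
static bool write_range_optimistic(unsigned long start, unsigned long end)
{
	unsigned long first = start >> CHUNK_SHIFT;
	unsigned long last = end >> CHUNK_SHIFT;
	unsigned long i;

	for (i = first; i <= last; i++) {
		if (pthread_mutex_trylock(&chunk_lock[i % NR_CHUNKS])) {
			while (i-- > first)	/* back out what we got */
				pthread_mutex_unlock(&chunk_lock[i % NR_CHUNKS]);
			return false;
		}
	}

	/* ... copy the user data into the pinned range here ... */

	for (i = first; i <= last; i++)
		pthread_mutex_unlock(&chunk_lock[i % NR_CHUNKS]);
	return true;
}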
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 10:07 ` Kent Overstreet @ 2024-02-27 14:08 ` Luis Chamberlain 2024-02-27 14:57 ` Kent Overstreet 2024-02-27 22:13 ` Dave Chinner 1 sibling, 1 reply; 90+ messages in thread From: Luis Chamberlain @ 2024-02-27 14:08 UTC (permalink / raw) To: Kent Overstreet Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > > Part of the testing we have done with LBS was to do some performance > > tests on XFS to ensure things are not regressing. Building linux is a > > fine decent test and we did some random cloud instance tests on that and > > presented that at Plumbers, but it doesn't really cut it if we want to > > push things to the limit though. What are the limits to buffered IO > > and how do we test that? Who keeps track of it? > > > > The obvious recurring tension is that for really high performance folks > > just recommend to use birect IO. But if you are stress testing changes > > to a filesystem and want to push buffered IO to its limits it makes > > sense to stick to buffered IO, otherwise how else do we test it? > > > > It is good to know limits to buffered IO too because some workloads > > cannot use direct IO. For instance PostgreSQL doesn't have direct IO > > support and even as late as the end of last year we learned that adding > > direct IO to PostgreSQL would be difficult. Chris Mason has noted also > > that direct IO can also force writes during reads (?)... Anyway, testing > > the limits of buffered IO limits to ensure you are not creating > > regressions when doing some page cache surgery seems like it might be > > useful and a sensible thing to do .... The good news is we have not found > > regressions with LBS but all the testing seems to beg the question, of what > > are the limits of buffered IO anyway, and how does it scale? Do we know, do > > we care? Do we keep track of it? How does it compare to direct IO for some > > workloads? How big is the delta? How do we best test that? How do we > > automate all that? Do we want to automatically test this to avoid regressions? > > > > The obvious issues with some workloads for buffered IO is having a > > possible penality if you are not really re-using folios added to the > > page cache. Jens Axboe reported a while ago issues with workloads with > > random reads over a data set 10x the size of RAM and also proposed > > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more > > like direct IO with kernel pages and a memcpy(), and it requires > > further serialization to be implemented that we already do for > > direct IO for writes. There at least seems to be agreement that if we're > > going to provide an enhancement or alternative that we should strive to not > > make the same mistakes we've done with direct IO. The rationale for some > > workloads to use buffered IO is it helps reduce some tail latencies, so > > that's something to live up to. > > > > On that same thread Christoph also mentioned the possibility of a direct > > IO variant which can leverage the cache. Is that something we want to > > move forward with? 
> >
> > Chris Mason also listed a few other desirables if we do:
> >
> > - Allowing concurrent writes (xfs DIO does this now)
>
> AFAIK every filesystem allows concurrent direct writes, not just xfs,
> it's _buffered_ writes that we care about here.

The context above was a possible direct IO variant, that's why direct IO
was mentioned and that XFS at least had support.

> I just pushed a patch to my CI for buffered writes without taking the
> inode lock - for bcachefs. It'll be straightforward, but a decent amount
> of work, to lift this to the VFS, if people are interested in
> collaborating.
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking

Neat, this is sort of what I wanted to get a sense for, if this sort of
topic was worth discussing at LSFMM.

> The approach is: for non extending, non appending writes, see if we can
> pin the entire range of the pagecache we're writing to; fall back to
> taking the inode lock if we can't.

Perhaps a silly thought... but my initial reaction is: would it make
sense for the page cache to make this easier for us? It is not clear to
me, but my first reaction to seeing some of these deltas was: what if we
had something like the space split up into groups, as we do with XFS
agcounts, so that each group deals with its own ranges? I considered
this before profiling, and as with Matthew I figured it might be lock
contention. It very likely is not for my test case, and as Linus and
Dave have clarified, we are both penalized and also have a
single-threaded writeback. If we had a group split we'd have locks per
group and perhaps a dedicated writeback thread per group.

  Luis

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 14:08 ` Luis Chamberlain @ 2024-02-27 14:57 ` Kent Overstreet 0 siblings, 0 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-27 14:57 UTC (permalink / raw) To: Luis Chamberlain Cc: lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Dave Chinner, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Tue, Feb 27, 2024 at 06:08:57AM -0800, Luis Chamberlain wrote: > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > > > Part of the testing we have done with LBS was to do some performance > > > tests on XFS to ensure things are not regressing. Building linux is a > > > fine decent test and we did some random cloud instance tests on that and > > > presented that at Plumbers, but it doesn't really cut it if we want to > > > push things to the limit though. What are the limits to buffered IO > > > and how do we test that? Who keeps track of it? > > > > > > The obvious recurring tension is that for really high performance folks > > > just recommend to use birect IO. But if you are stress testing changes > > > to a filesystem and want to push buffered IO to its limits it makes > > > sense to stick to buffered IO, otherwise how else do we test it? > > > > > > It is good to know limits to buffered IO too because some workloads > > > cannot use direct IO. For instance PostgreSQL doesn't have direct IO > > > support and even as late as the end of last year we learned that adding > > > direct IO to PostgreSQL would be difficult. Chris Mason has noted also > > > that direct IO can also force writes during reads (?)... Anyway, testing > > > the limits of buffered IO limits to ensure you are not creating > > > regressions when doing some page cache surgery seems like it might be > > > useful and a sensible thing to do .... The good news is we have not found > > > regressions with LBS but all the testing seems to beg the question, of what > > > are the limits of buffered IO anyway, and how does it scale? Do we know, do > > > we care? Do we keep track of it? How does it compare to direct IO for some > > > workloads? How big is the delta? How do we best test that? How do we > > > automate all that? Do we want to automatically test this to avoid regressions? > > > > > > The obvious issues with some workloads for buffered IO is having a > > > possible penality if you are not really re-using folios added to the > > > page cache. Jens Axboe reported a while ago issues with workloads with > > > random reads over a data set 10x the size of RAM and also proposed > > > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more > > > like direct IO with kernel pages and a memcpy(), and it requires > > > further serialization to be implemented that we already do for > > > direct IO for writes. There at least seems to be agreement that if we're > > > going to provide an enhancement or alternative that we should strive to not > > > make the same mistakes we've done with direct IO. The rationale for some > > > workloads to use buffered IO is it helps reduce some tail latencies, so > > > that's something to live up to. > > > > > > On that same thread Christoph also mentioned the possibility of a direct > > > IO variant which can leverage the cache. Is that something we want to > > > move forward with? 
> > > > > > Chris Mason also listed a few other desirables if we do: > > > > > > - Allowing concurrent writes (xfs DIO does this now) > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs, > > it's _buffered_ writes that we care about here. > > The context above was a possible direct IO variant, that's why direct IO > was mentioned and that XFS at least had support. > > > I just pushed a patch to my CI for buffered writes without taking the > > inode lock - for bcachefs. It'll be straightforward, but a decent amount > > of work, to lift this to the VFS, if people are interested in > > collaborating. > > > > https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking > > Neat, this is sort of what I wanted to get a sense for, if this sort of > topic was worth discussing at LSFMM. > > > The approach is: for non extending, non appending writes, see if we can > > pin the entire range of the pagecache we're writing to; fall back to > > taking the inode lock if we can't. > > Perhaps a silly thought... but initial reaction is, would it make sense > for the page cache to make this easier for us, so we have this be > easier? It is not clear to me but my first reaction to seeing some of > these deltas was what if we had something like the space split up, as we > do with XFS agcounts, and so each group deals with its own ranges. I > considered this before profiling, and as with Matthew I figured it might > be lock contenton. It very likely is not for my test case, and as Linus > and Dave has clarified we are both penalized and also have a > singlthreaded writeback. If we had a group split we'd have locks per > group and perhaps a writeback a dedicated thread per group. Wtf are you talking about? ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 10:07 ` Kent Overstreet 2024-02-27 14:08 ` Luis Chamberlain @ 2024-02-27 22:13 ` Dave Chinner 2024-02-27 22:21 ` Kent Overstreet 1 sibling, 1 reply; 90+ messages in thread From: Dave Chinner @ 2024-02-27 22:13 UTC (permalink / raw) To: Kent Overstreet Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote: > > Part of the testing we have done with LBS was to do some performance > > tests on XFS to ensure things are not regressing. Building linux is a > > fine decent test and we did some random cloud instance tests on that and > > presented that at Plumbers, but it doesn't really cut it if we want to > > push things to the limit though. What are the limits to buffered IO > > and how do we test that? Who keeps track of it? > > > > The obvious recurring tension is that for really high performance folks > > just recommend to use birect IO. But if you are stress testing changes > > to a filesystem and want to push buffered IO to its limits it makes > > sense to stick to buffered IO, otherwise how else do we test it? > > > > It is good to know limits to buffered IO too because some workloads > > cannot use direct IO. For instance PostgreSQL doesn't have direct IO > > support and even as late as the end of last year we learned that adding > > direct IO to PostgreSQL would be difficult. Chris Mason has noted also > > that direct IO can also force writes during reads (?)... Anyway, testing > > the limits of buffered IO limits to ensure you are not creating > > regressions when doing some page cache surgery seems like it might be > > useful and a sensible thing to do .... The good news is we have not found > > regressions with LBS but all the testing seems to beg the question, of what > > are the limits of buffered IO anyway, and how does it scale? Do we know, do > > we care? Do we keep track of it? How does it compare to direct IO for some > > workloads? How big is the delta? How do we best test that? How do we > > automate all that? Do we want to automatically test this to avoid regressions? > > > > The obvious issues with some workloads for buffered IO is having a > > possible penality if you are not really re-using folios added to the > > page cache. Jens Axboe reported a while ago issues with workloads with > > random reads over a data set 10x the size of RAM and also proposed > > RWF_UNCACHED as a way to help [0]. As Chinner put it, this seemed more > > like direct IO with kernel pages and a memcpy(), and it requires > > further serialization to be implemented that we already do for > > direct IO for writes. There at least seems to be agreement that if we're > > going to provide an enhancement or alternative that we should strive to not > > make the same mistakes we've done with direct IO. The rationale for some > > workloads to use buffered IO is it helps reduce some tail latencies, so > > that's something to live up to. > > > > On that same thread Christoph also mentioned the possibility of a direct > > IO variant which can leverage the cache. Is that something we want to > > move forward with? 
> > Chris Mason also listed a few other desirables if we do:
> > - Allowing concurrent writes (xfs DIO does this now)
>
> AFAIK every filesystem allows concurrent direct writes, not just xfs,
> it's _buffered_ writes that we care about here.

We could do concurrent buffered writes in XFS - we would just use
the same locking strategy as direct IO and fall back on folio locks
for copy-in exclusion like ext4 does.

The real question is how much of userspace will that break, because
of implicit assumptions that the kernel has always serialised
buffered writes?

> I just pushed a patch to my CI for buffered writes without taking the
> inode lock - for bcachefs. It'll be straightforward, but a decent amount
> of work, to lift this to the VFS, if people are interested in
> collaborating.

Yeah, XFS would just revert to shared inode locking - we still need
the inode lock for things like truncate/fallocate exclusion.

> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-buffered-write-locking
>
> The approach is: for non extending, non appending writes, see if we can
> pin the entire range of the pagecache we're writing to; fall back to
> taking the inode lock if we can't.

XFS just falls back to exclusive locking if the file needs extending.

> If we do a short write because of a page fault (despite previously
> faulting in the userspace buffer), there is no way to completely prevent
> torn writes and atomicity breakage; we could at least try a trylock on
> the inode lock, I didn't do that here.

As soon as we go for concurrent writes, we give up on any concept of
atomicity of buffered writes (esp. w.r.t reads), so this really
doesn't matter at all.

> For lifting this to the VFS, this needs
> - My darray code, which I'll be moving to include/linux/ in the 6.9
>   merge window
> - My pagecache add lock - we need this for synchronization with hole
>   punching and truncate when we don't have the inode lock.
> - My vectorized buffered write path lifted to filemap.c, which means we
>   need some sort of vectorized replacement for .write_begin and
>   .write_end

I don't think we need any of that - I think you're over complicating
it. As long as the filesystems have a mechanism that works for
concurrent DIO writes, they can just reuse that for concurrent
buffered writes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 22:13 ` Dave Chinner @ 2024-02-27 22:21 ` Kent Overstreet 2024-02-27 22:42 ` Dave Chinner 2024-02-27 22:46 ` Linus Torvalds 0 siblings, 2 replies; 90+ messages in thread From: Kent Overstreet @ 2024-02-27 22:21 UTC (permalink / raw) To: Dave Chinner Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote: > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > > AFAIK every filesystem allows concurrent direct writes, not just xfs, > > it's _buffered_ writes that we care about here. > > We could do concurrent buffered writes in XFS - we would just use > the same locking strategy as direct IO and fall back on folio locks > for copy-in exclusion like ext4 does. ext4 code doesn't do that. it takes the inode lock in exclusive mode, just like everyone else. > The real question is how much of userspace will that break, because > of implicit assumptions that the kernel has always serialised > buffered writes? What would break? > > If we do a short write because of a page fault (despite previously > > faulting in the userspace buffer), there is no way to completely prevent > > torn writes an atomicity breakage; we could at least try a trylock on > > the inode lock, I didn't do that here. > > As soon as we go for concurrent writes, we give up on any concept of > atomicity of buffered writes (esp. w.r.t reads), so this really > doesn't matter at all. We've already given up buffered write vs. read atomicity, have for a long time - buffered read path takes no locks. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 22:21 ` Kent Overstreet @ 2024-02-27 22:42 ` Dave Chinner 2024-02-28 7:48 ` [Lsf-pc] " Amir Goldstein 2024-02-27 22:46 ` Linus Torvalds 1 sibling, 1 reply; 90+ messages in thread From: Dave Chinner @ 2024-02-27 22:42 UTC (permalink / raw) To: Kent Overstreet Cc: Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig, Chris Mason, Johannes Weiner, Matthew Wilcox, Linus Torvalds On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote: > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote: > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > > > AFAIK every filesystem allows concurrent direct writes, not just xfs, > > > it's _buffered_ writes that we care about here. > > > > We could do concurrent buffered writes in XFS - we would just use > > the same locking strategy as direct IO and fall back on folio locks > > for copy-in exclusion like ext4 does. > > ext4 code doesn't do that. it takes the inode lock in exclusive mode, > just like everyone else. Uhuh. ext4 does allow concurrent DIO writes. It's just much more constrained than XFS. See ext4_dio_write_checks(). > > The real question is how much of userspace will that break, because > > of implicit assumptions that the kernel has always serialised > > buffered writes? > > What would break? Good question. If you don't know the answer, then you've got the same problem as I have. i.e. we don't know if concurrent applications that use buffered IO extensively (eg. postgres) assume data coherency because of the implicit serialisation occurring during buffered IO writes? > > > If we do a short write because of a page fault (despite previously > > > faulting in the userspace buffer), there is no way to completely prevent > > > torn writes an atomicity breakage; we could at least try a trylock on > > > the inode lock, I didn't do that here. > > > > As soon as we go for concurrent writes, we give up on any concept of > > atomicity of buffered writes (esp. w.r.t reads), so this really > > doesn't matter at all. > > We've already given up buffered write vs. read atomicity, have for a > long time - buffered read path takes no locks. We still have explicit buffered read() vs buffered write() atomicity in XFS via buffered reads taking the inode lock shared (see xfs_file_buffered_read()) because that's what POSIX says we should have. Essentially, we need to explicitly give POSIX the big finger and state that there are no atomicity guarantees given for write() calls of any size, nor are there any guarantees for data coherency for any overlapping concurrent buffered IO operations. Those are things we haven't completely given up yet w.r.t. buffered IO, and enabling concurrent buffered writes will expose to users. So we need to have explicit policies for this and document them clearly in all the places that application developers might look for behavioural hints. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-27 22:42 ` Dave Chinner @ 2024-02-28 7:48 ` Amir Goldstein 2024-02-28 14:01 ` Chris Mason 2024-02-29 0:25 ` Dave Chinner 0 siblings, 2 replies; 90+ messages in thread From: Amir Goldstein @ 2024-02-28 7:48 UTC (permalink / raw) To: Dave Chinner Cc: Kent Overstreet, Pankaj Raghav, Jens Axboe, Chris Mason, Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain, Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds, Christoph Hellwig, Josef Bacik, Jan Kara On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc <lsf-pc@lists.linux-foundation.org> wrote: > > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote: > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote: > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs, > > > > it's _buffered_ writes that we care about here. > > > > > > We could do concurrent buffered writes in XFS - we would just use > > > the same locking strategy as direct IO and fall back on folio locks > > > for copy-in exclusion like ext4 does. > > > > ext4 code doesn't do that. it takes the inode lock in exclusive mode, > > just like everyone else. > > Uhuh. ext4 does allow concurrent DIO writes. It's just much more > constrained than XFS. See ext4_dio_write_checks(). > > > > The real question is how much of userspace will that break, because > > > of implicit assumptions that the kernel has always serialised > > > buffered writes? > > > > What would break? > > Good question. If you don't know the answer, then you've got the > same problem as I have. i.e. we don't know if concurrent > applications that use buffered IO extensively (eg. postgres) assume > data coherency because of the implicit serialisation occurring > during buffered IO writes? > > > > > If we do a short write because of a page fault (despite previously > > > > faulting in the userspace buffer), there is no way to completely prevent > > > > torn writes an atomicity breakage; we could at least try a trylock on > > > > the inode lock, I didn't do that here. > > > > > > As soon as we go for concurrent writes, we give up on any concept of > > > atomicity of buffered writes (esp. w.r.t reads), so this really > > > doesn't matter at all. > > > > We've already given up buffered write vs. read atomicity, have for a > > long time - buffered read path takes no locks. > > We still have explicit buffered read() vs buffered write() atomicity > in XFS via buffered reads taking the inode lock shared (see > xfs_file_buffered_read()) because that's what POSIX says we should > have. > > Essentially, we need to explicitly give POSIX the big finger and > state that there are no atomicity guarantees given for write() calls > of any size, nor are there any guarantees for data coherency for > any overlapping concurrent buffered IO operations. > I have disabled read vs. write atomicity (out-of-tree) to make xfs behave as the other fs ever since Jan has added the invalidate_lock and I believe that Meta kernel has done that way before. > Those are things we haven't completely given up yet w.r.t. buffered > IO, and enabling concurrent buffered writes will expose to users. > So we need to have explicit policies for this and document them > clearly in all the places that application developers might look > for behavioural hints. That's doable - I can try to do that. What is your take regarding opt-in/opt-out of legacy behavior? 
At the time, I have proposed POSIX_FADV_TORN_RW API [1] to opt-out of the legacy POSIX behavior, but I guess that an xfs mount option would make more sense for consistent and clear semantics across the fs - it is easier if all buffered IO to inode behaved the same way. Thanks, Amir. [1] https://lore.kernel.org/linux-xfs/CAOQ4uxguwnx4AxXqp_zjg39ZUaTGJEM2wNUPnNdtiqV2Q9woqA@mail.gmail.com/ ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
  2024-02-28  7:48 ` [Lsf-pc] " Amir Goldstein
@ 2024-02-28 14:01   ` Chris Mason
  0 siblings, 0 replies; 90+ messages in thread
From: Chris Mason @ 2024-02-28 14:01 UTC (permalink / raw)
  To: Amir Goldstein, Dave Chinner
  Cc: Kent Overstreet, Pankaj Raghav, Jens Axboe, Chris Mason,
      Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
      Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds,
      Christoph Hellwig, Josef Bacik, Jan Kara

On 2/28/24 2:48 AM, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
>> Essentially, we need to explicitly give POSIX the big finger and
>> state that there are no atomicity guarantees given for write() calls
>> of any size, nor are there any guarantees for data coherency for
>> any overlapping concurrent buffered IO operations.
>
> I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> as the other fs ever since Jan has added the invalidate_lock and I believe
> that Meta kernel has done that way before.

Hmmm, you might be thinking of my patch to prevent kswapd from getting
stuck on XFS inode reclaim, but I don't think we've ever messed with
write concurrency.

I'm comfortable with the concurrency change in general, but it's not
somewhere I'd be excited about differing from upstream.

Total tangent, but we only carry two XFS patches right now that aren't
upstream. I dropped the inode reclaim patch; the problem stopped
showing up in our profiles, and the impacted workloads changed to
rocksdb for other reasons.

We flip XFS discards back to synchronous. Async discards without any
kind of metering saturate drives when we do bulk deletes, leading to
latency spikes on reads and writes. There's probably a class of flash
that can handle this, but we don't have it.

Unfortunately I also disable large folios on XFS. They are corrupting
xarrays on our v5.19 kernel, with large folios from multiple files
interleaved together in the same file. We'll try again with them on
v6.4 or maybe v6.8, but the repro needs thousands of machines making
NFS noises just to trigger one failure, and I won't be able to debug it
until I can make a more reasonable reproduction.

-chris

^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-28 7:48 ` [Lsf-pc] " Amir Goldstein 2024-02-28 14:01 ` Chris Mason @ 2024-02-29 0:25 ` Dave Chinner 2024-02-29 0:57 ` Kent Overstreet 1 sibling, 1 reply; 90+ messages in thread From: Dave Chinner @ 2024-02-29 0:25 UTC (permalink / raw) To: Amir Goldstein Cc: Kent Overstreet, Pankaj Raghav, Jens Axboe, Chris Mason, Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain, Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds, Christoph Hellwig, Josef Bacik, Jan Kara On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote: > On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc > <lsf-pc@lists.linux-foundation.org> wrote: > > > > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote: > > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote: > > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs, > > > > > it's _buffered_ writes that we care about here. > > > > > > > > We could do concurrent buffered writes in XFS - we would just use > > > > the same locking strategy as direct IO and fall back on folio locks > > > > for copy-in exclusion like ext4 does. > > > > > > ext4 code doesn't do that. it takes the inode lock in exclusive mode, > > > just like everyone else. > > > > Uhuh. ext4 does allow concurrent DIO writes. It's just much more > > constrained than XFS. See ext4_dio_write_checks(). > > > > > > The real question is how much of userspace will that break, because > > > > of implicit assumptions that the kernel has always serialised > > > > buffered writes? > > > > > > What would break? > > > > Good question. If you don't know the answer, then you've got the > > same problem as I have. i.e. we don't know if concurrent > > applications that use buffered IO extensively (eg. postgres) assume > > data coherency because of the implicit serialisation occurring > > during buffered IO writes? > > > > > > > If we do a short write because of a page fault (despite previously > > > > > faulting in the userspace buffer), there is no way to completely prevent > > > > > torn writes an atomicity breakage; we could at least try a trylock on > > > > > the inode lock, I didn't do that here. > > > > > > > > As soon as we go for concurrent writes, we give up on any concept of > > > > atomicity of buffered writes (esp. w.r.t reads), so this really > > > > doesn't matter at all. > > > > > > We've already given up buffered write vs. read atomicity, have for a > > > long time - buffered read path takes no locks. > > > > We still have explicit buffered read() vs buffered write() atomicity > > in XFS via buffered reads taking the inode lock shared (see > > xfs_file_buffered_read()) because that's what POSIX says we should > > have. > > > > Essentially, we need to explicitly give POSIX the big finger and > > state that there are no atomicity guarantees given for write() calls > > of any size, nor are there any guarantees for data coherency for > > any overlapping concurrent buffered IO operations. > > > > I have disabled read vs. write atomicity (out-of-tree) to make xfs behave > as the other fs ever since Jan has added the invalidate_lock and I believe > that Meta kernel has done that way before. > > > Those are things we haven't completely given up yet w.r.t. buffered > > IO, and enabling concurrent buffered writes will expose to users. 
> > So we need to have explicit policies for this and document them > > clearly in all the places that application developers might look > > for behavioural hints. > > That's doable - I can try to do that. > What is your take regarding opt-in/opt-out of legacy behavior? Screw the legacy code, don't even make it an option. No-one should be relying on large buffered writes being atomic anymore, and with high order folios in the page cache most small buffered writes are going to be atomic w.r.t. both reads and writes anyway. > At the time, I have proposed POSIX_FADV_TORN_RW API [1] > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount > option would make more sense for consistent and clear semantics across > the fs - it is easier if all buffered IO to inode behaved the same way. No mount options, just change the behaviour. Applications already have to avoid concurrent overlapping buffered reads and writes if they care about data integrity and coherency, so making buffered writes concurrent doesn't change anything. Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO 2024-02-29 0:25 ` Dave Chinner @ 2024-02-29 0:57 ` Kent Overstreet 2024-03-04 0:46 ` Dave Chinner 0 siblings, 1 reply; 90+ messages in thread From: Kent Overstreet @ 2024-02-29 0:57 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Pankaj Raghav, Jens Axboe, Chris Mason, Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain, Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds, Christoph Hellwig, Josef Bacik, Jan Kara On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote: > On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote: > > On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc > > <lsf-pc@lists.linux-foundation.org> wrote: > > > > > > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote: > > > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote: > > > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > > > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs, > > > > > > it's _buffered_ writes that we care about here. > > > > > > > > > > We could do concurrent buffered writes in XFS - we would just use > > > > > the same locking strategy as direct IO and fall back on folio locks > > > > > for copy-in exclusion like ext4 does. > > > > > > > > ext4 code doesn't do that. it takes the inode lock in exclusive mode, > > > > just like everyone else. > > > > > > Uhuh. ext4 does allow concurrent DIO writes. It's just much more > > > constrained than XFS. See ext4_dio_write_checks(). > > > > > > > > The real question is how much of userspace will that break, because > > > > > of implicit assumptions that the kernel has always serialised > > > > > buffered writes? > > > > > > > > What would break? > > > > > > Good question. If you don't know the answer, then you've got the > > > same problem as I have. i.e. we don't know if concurrent > > > applications that use buffered IO extensively (eg. postgres) assume > > > data coherency because of the implicit serialisation occurring > > > during buffered IO writes? > > > > > > > > > If we do a short write because of a page fault (despite previously > > > > > > faulting in the userspace buffer), there is no way to completely prevent > > > > > > torn writes an atomicity breakage; we could at least try a trylock on > > > > > > the inode lock, I didn't do that here. > > > > > > > > > > As soon as we go for concurrent writes, we give up on any concept of > > > > > atomicity of buffered writes (esp. w.r.t reads), so this really > > > > > doesn't matter at all. > > > > > > > > We've already given up buffered write vs. read atomicity, have for a > > > > long time - buffered read path takes no locks. > > > > > > We still have explicit buffered read() vs buffered write() atomicity > > > in XFS via buffered reads taking the inode lock shared (see > > > xfs_file_buffered_read()) because that's what POSIX says we should > > > have. > > > > > > Essentially, we need to explicitly give POSIX the big finger and > > > state that there are no atomicity guarantees given for write() calls > > > of any size, nor are there any guarantees for data coherency for > > > any overlapping concurrent buffered IO operations. > > > > > > > I have disabled read vs. write atomicity (out-of-tree) to make xfs behave > > as the other fs ever since Jan has added the invalidate_lock and I believe > > that Meta kernel has done that way before. > > > > > Those are things we haven't completely given up yet w.r.t. 
buffered > > > IO, and enabling concurrent buffered writes will expose to users. > > > So we need to have explicit policies for this and document them > > > clearly in all the places that application developers might look > > > for behavioural hints. > > > > That's doable - I can try to do that. > > What is your take regarding opt-in/opt-out of legacy behavior? > > Screw the legacy code, don't even make it an option. No-one should > be relying on large buffered writes being atomic anymore, and with > high order folios in the page cache most small buffered writes are > going to be atomic w.r.t. both reads and writes anyway. That's a new take... > > > At the time, I have proposed POSIX_FADV_TORN_RW API [1] > > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount > > option would make more sense for consistent and clear semantics across > > the fs - it is easier if all buffered IO to inode behaved the same way. > > No mount options, just change the behaviour. Applications already > have to avoid concurrent overlapping buffered reads and writes if > they care about data integrity and coherency, so making buffered > writes concurrent doesn't change anything. Honestly - no. Userspace would really like to see some sort of definition for this kind of behaviour, and if we just change things underneath them without telling anyone, _that's a dick move_. POSIX_FADV_TORN_RW is a terrible name, though. And fadvise() is the wrong API for this because it applies to ranges, this should be an open flag or an fcntl. ^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-29 0:57 ` Kent Overstreet
@ 2024-03-04 0:46 ` Dave Chinner
0 siblings, 0 replies; 90+ messages in thread
From: Dave Chinner @ 2024-03-04 0:46 UTC (permalink / raw)
To: Kent Overstreet
Cc: Amir Goldstein, Pankaj Raghav, Jens Axboe, Chris Mason,
Matthew Wilcox, Daniel Gomez, linux-mm, Luis Chamberlain,
Johannes Weiner, linux-fsdevel, lsf-pc, Linus Torvalds,
Christoph Hellwig, Josef Bacik, Jan Kara

On Wed, Feb 28, 2024 at 07:57:38PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> > > That's doable - I can try to do that.
> > > What is your take regarding opt-in/opt-out of legacy behavior?
> >
> > Screw the legacy code, don't even make it an option. No-one should
> > be relying on large buffered writes being atomic anymore, and with
> > high order folios in the page cache most small buffered writes are
> > going to be atomic w.r.t. both reads and writes anyway.
>
> That's a new take...
>
> > > At the time, I proposed the POSIX_FADV_TORN_RW API [1]
> > > to opt out of the legacy POSIX behavior, but I guess that an xfs mount
> > > option would make more sense for consistent and clear semantics across
> > > the fs - it is easier if all buffered IO to an inode behaves the same way.
> >
> > No mount options, just change the behaviour. Applications already
> > have to avoid concurrent overlapping buffered reads and writes if
> > they care about data integrity and coherency, so making buffered
> > writes concurrent doesn't change anything.
>
> Honestly - no.
>
> Userspace would really like to see some sort of definition for this kind
> of behaviour, and if we just change things underneath them without
> telling anyone, _that's a dick move_.

I don't think you understand the full picture here, Kent.

> POSIX_FADV_TORN_RW is a terrible name, though.

The described behaviour for this advice is the standard behaviour
for ext4, btrfs and most Linux filesystems other than XFS. It has
been for a -long- time. The only filesystem that gives anything
resembling POSIX atomic write behaviour is XFS.

No other filesystem in Linux actually provides the POSIX "buffered
reads won't see partial data from buffered writes in progress"
behaviour that XFS does via the IOLOCK behaviour it uses.

So when I say "screw the legacy apps" I'm talking about the ancient
enterprise applications that still behave as if this POSIX behaviour
is reliable on modern Linux systems. It simply isn't, and these apps
are *already implicitly broken* on most Linux filesystems, and they
need fixing.

> And fadvise() is the wrong API for this because it applies to ranges;
> this should be an open flag or an fcntl.

Not only is it the wrong API, it's also the wrong approach to take.

We have a new API coming through for atomic writes: RWF_ATOMIC. If
an application needs an actual atomic IO guarantee, it will soon be
able to be explicit about that requirement, and applications will
not end up in the situation where the filesystem they happen to use
determines whether an implicit atomic write behaviour is provided.

Indeed, we don't actually say that XFS will always guarantee POSIX
atomic buffered IO semantics - we've just never decided that the
time is right to change the behaviour.

In making such a change to XFS, normal buffered writes will get
mostly the same behaviour as they do now, because we now use high
order folios in the page cache and serialisation will be done
against high-order ranges rather than individual pages.

And applications that actually need atomic IO semantics can use
RWF_ATOMIC, and in that case we can do explicitly serialised
buffered writes that lock out concurrent buffered reads as we do
right now.

IOWs, there is no better time to convert XFS behaviour to match all
the other Linux filesystems than right now. Applications that need
atomic IO guarantees can use RWF_ATOMIC, and everyone else can get
the performance benefits that come from no longer trying to make
buffered IO implicitly "atomic".

-Dave.
-- 
Dave Chinner
david@fromorbit.com
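For concreteness, here is a rough userspace sketch of how the
RWF_ATOMIC interface being discussed is meant to be used. The API was
still going through review at the time of this thread, so the statx
fields and flag names below follow the proposal (which, as eventually
merged, initially covers direct IO only with alignment constraints);
treat the details as assumptions rather than a settled ABI, and note
it needs recent kernel headers:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/stat.h>
	#include <sys/uio.h>
	#include <unistd.h>

	/* Write len bytes at off as a single untorn ("atomic") write. */
	static int untorn_write(int fd, const void *buf, size_t len, off_t off)
	{
		struct statx stx;
		struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

		/* Ask the filesystem/device what untorn write sizes it supports. */
		if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx) != 0)
			return -1;
		if (len < stx.stx_atomic_write_unit_min ||
		    len > stx.stx_atomic_write_unit_max)
			return -1;	/* cannot be done as one untorn write */

		/* The kernel must complete this write all-or-nothing. */
		return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
	}

The key design point Dave is relying on: the guarantee becomes
something an application asks for per-IO, instead of an implicit
property of whichever filesystem it happens to run on.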
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 22:21 ` Kent Overstreet
2024-02-27 22:42 ` Dave Chinner
@ 2024-02-27 22:46 ` Linus Torvalds
2024-02-27 23:00 ` Linus Torvalds
` (2 more replies)
1 sibling, 3 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-27 22:46 UTC (permalink / raw)
To: Kent Overstreet
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, 27 Feb 2024 at 14:21, Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> just like everyone else.

Not for dio, it doesn't.

> > The real question is how much of userspace will that break, because
> > of implicit assumptions that the kernel has always serialised
> > buffered writes?
>
> What would break?

Well, at least in theory you could have concurrent overlapping writes
of folio-crossing records, and currently you do get the guarantee that
one or the other record is written, but relying just on page locking
would mean that you might get a mix of them at page boundaries.

I'm not sure that such a model would make any sense, but if you
*intend* to break if somebody doesn't do write-to-write exclusion,
that's certainly possible.

The fact that we haven't given the atomicity guarantees wrt reads does
imply that nobody can do this kind of crazy thing, but it's an
implication, not a guarantee.

I really don't think such an odd load is sensible (except for the
special case of O_APPEND records, which definitely is sensible), and
it is certainly solvable.

For example, a purely "local lock" model would be to just lock all
pages in order as you write them, and not unlock the previous page
until you've locked the next one. That is a really simple model that
doesn't require any range locking or anything like that, because it
simply relies on all writes always being done strictly in file
position order.

But you'd have to be very careful with deadlocks anyway, in case there
are other cases of multi-page locks. And even without deadlocks, you
might end up having just a lot more lock contention (nested locks can
have *much* worse contention than sequential ones).

There are other models with multi-level locking, but I think we'd like
to try to keep things simple if we change something core like this.

Linus
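To make the "local lock" model concrete, here is a minimal sketch of
that hand-over-hand locking. write_begin_locked() and copy_into_folio()
are assumed stand-ins for the real ->write_begin/copy machinery, not
actual kernel interfaces, and no in-tree write path works this way
today - this is only an illustration of the locking shape:

	/*
	 * Hold at most two folio locks at any time, always acquired in
	 * ascending index order; folio N-1 is unlocked only after folio N
	 * is locked, so another writer can never interleave between them.
	 */
	static void coupled_write(struct address_space *mapping,
				  struct iov_iter *iter, loff_t pos)
	{
		struct folio *prev = NULL;

		while (iov_iter_count(iter)) {
			/* assumed helper: target folio, returned locked */
			struct folio *folio = write_begin_locked(mapping, pos);

			/* assumed helper: copy user data into the folio */
			copy_into_folio(folio, pos, iter);

			if (prev)
				folio_unlock(prev); /* drop N-1 only now */
			prev = folio;
			pos = folio_pos(folio) + folio_size(folio);
		}
		if (prev)
			folio_unlock(prev);
	}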
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 22:46 ` Linus Torvalds
@ 2024-02-27 23:00 ` Linus Torvalds
2024-02-28 2:22 ` Kent Overstreet
2024-02-28 18:18 ` Kent Overstreet
2 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2024-02-27 23:00 UTC (permalink / raw)
To: Kent Overstreet
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, 27 Feb 2024 at 14:46, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> For example, a purely "local lock" model would be to just lock all
> pages in order as you write them, and not unlock the previous page
> until you've locked the next one.

Thinking more about this, maybe that doesn't really guarantee much.

The final state would be that even with concurrent overlapping writes,
one writer did its overlapping write fully and you'd never have mixed
ABAB kind of results, but you could still have concurrent readers see
the two writes progressing concurrently.

Of course, since readers aren't serialized as-is, I'm not sure if
"readers can see intermediate state" is anything new or relevant
anyway. Maybe the worry isn't worth it.

Linus
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 22:46 ` Linus Torvalds
2024-02-27 23:00 ` Linus Torvalds
@ 2024-02-28 2:22 ` Kent Overstreet
2024-02-28 3:00 ` Matthew Wilcox
2 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-28 2:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, Feb 27, 2024 at 02:46:11PM -0800, Linus Torvalds wrote:
> On Tue, 27 Feb 2024 at 14:21, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > just like everyone else.
>
> Not for dio, it doesn't.
>
> > > The real question is how much of userspace will that break, because
> > > of implicit assumptions that the kernel has always serialised
> > > buffered writes?
> >
> > What would break?
>
> Well, at least in theory you could have concurrent overlapping writes
> of folio-crossing records, and currently you do get the guarantee that
> one or the other record is written, but relying just on page locking
> would mean that you might get a mix of them at page boundaries.
>
> I'm not sure that such a model would make any sense, but if you
> *intend* to break if somebody doesn't do write-to-write exclusion,
> that's certainly possible.
>
> The fact that we haven't given the atomicity guarantees wrt reads does
> imply that nobody can do this kind of crazy thing, but it's an
> implication, not a guarantee.
>
> I really don't think such an odd load is sensible (except for the
> special case of O_APPEND records, which definitely is sensible), and
> it is certainly solvable.
>
> For example, a purely "local lock" model would be to just lock all
> pages in order as you write them, and not unlock the previous page
> until you've locked the next one.

The code I'm testing locks _all_ the folios we're writing to
simultaneously, and if they can't all be pinned and locked it just
falls back to the inode lock.

Which does raise the question of whether we've ever attempted to define
a lock ordering on folios. I suspect not, since the folio lock doesn't
even seem to have lockdep support.
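A sketch of that all-or-fallback strategy, hedged: trylock_folios()
and lock_folios() are assumed helpers, not bcachefs's actual
functions, and error handling is elided:

	/*
	 * Opportunistic path: pin and trylock every folio covered by
	 * the write. If any trylock fails, fall back to the exclusive
	 * inode lock; with other writers excluded, blocking
	 * folio_lock() calls in ascending index order are then safe.
	 */
	bool locked_inode = false;

	if (!trylock_folios(mapping, first_index, last_index)) {
		inode_lock(inode);
		lock_folios(mapping, first_index, last_index);
		locked_inode = true;
	}
	/* ... copy into the locked folios, unlock them all, then ... */
	if (locked_inode)
		inode_unlock(inode);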
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 2:22 ` Kent Overstreet
@ 2024-02-28 3:00 ` Matthew Wilcox
2024-02-28 4:22 ` Matthew Wilcox
0 siblings, 1 reply; 90+ messages in thread
From: Matthew Wilcox @ 2024-02-28 3:00 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
Christoph Hellwig, Chris Mason, Johannes Weiner

On Tue, Feb 27, 2024 at 09:22:26PM -0500, Kent Overstreet wrote:
> Which does raise the question of whether we've ever attempted to define
> a lock ordering on folios. I suspect not, since the folio lock doesn't
> even seem to have lockdep support.

We even wrote it down!

	/*
	 * To avoid deadlocks between range_cyclic writeback and callers
	 * that hold pages in PageWriteback to aggregate I/O until
	 * the writeback iteration finishes, we do not loop back to the
	 * start of the file. Doing so causes a page lock/page
	 * writeback access order inversion - we should only ever lock
	 * multiple pages in ascending page->index order, and looping
	 * back to the start of the file violates that rule and causes
	 * deadlocks.
	 */

(I'll take the AR to put this somewhere better, like in the
folio_lock() kernel-doc)
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 3:00 ` Matthew Wilcox
@ 2024-02-28 4:22 ` Matthew Wilcox
2024-02-28 17:34 ` Kent Overstreet
0 siblings, 1 reply; 90+ messages in thread
From: Matthew Wilcox @ 2024-02-28 4:22 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
Christoph Hellwig, Chris Mason, Johannes Weiner

On Wed, Feb 28, 2024 at 03:00:36AM +0000, Matthew Wilcox wrote:
> On Tue, Feb 27, 2024 at 09:22:26PM -0500, Kent Overstreet wrote:
> > Which does raise the question of whether we've ever attempted to define
> > a lock ordering on folios. I suspect not, since the folio lock doesn't
> > even seem to have lockdep support.
>
> We even wrote it down!
>
> 	/*
> 	 * To avoid deadlocks between range_cyclic writeback and callers
> 	 * that hold pages in PageWriteback to aggregate I/O until
> 	 * the writeback iteration finishes, we do not loop back to the
> 	 * start of the file. Doing so causes a page lock/page
> 	 * writeback access order inversion - we should only ever lock
> 	 * multiple pages in ascending page->index order, and looping
> 	 * back to the start of the file violates that rule and causes
> 	 * deadlocks.
> 	 */
>
> (I'll take the AR to put this somewhere better, like in the
> folio_lock() kernel-doc)

Um. I already did.

 * Context: May sleep. If you need to acquire the locks of two or
 * more folios, they must be in order of ascending index, if they are
 * in the same address_space. If they are in different address_spaces,
 * acquire the lock of the folio which belongs to the address_space which
 * has the lowest address in memory first.

Where should I have put this information that you would have found it,
if not in the kernel-doc for folio_lock()?
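Applied to the simplest case, the documented rule looks like this
(illustrative only; folio_a and folio_b stand for two folios in the
same address_space that a caller needs to hold simultaneously):

	/* Lock two folios from one mapping in ascending ->index order. */
	if (folio_a->index > folio_b->index)
		swap(folio_a, folio_b);
	folio_lock(folio_a);	/* lower index first... */
	folio_lock(folio_b);	/* ...then the higher one */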
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 4:22 ` Matthew Wilcox
@ 2024-02-28 17:34 ` Kent Overstreet
2024-02-28 18:04 ` Matthew Wilcox
0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-28 17:34 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
Christoph Hellwig, Chris Mason, Johannes Weiner

On Wed, Feb 28, 2024 at 04:22:32AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 28, 2024 at 03:00:36AM +0000, Matthew Wilcox wrote:
> > On Tue, Feb 27, 2024 at 09:22:26PM -0500, Kent Overstreet wrote:
> > > Which does raise the question of whether we've ever attempted to define
> > > a lock ordering on folios. I suspect not, since the folio lock doesn't
> > > even seem to have lockdep support.
> >
> > We even wrote it down!
> >
> > 	/*
> > 	 * To avoid deadlocks between range_cyclic writeback and callers
> > 	 * that hold pages in PageWriteback to aggregate I/O until
> > 	 * the writeback iteration finishes, we do not loop back to the
> > 	 * start of the file. Doing so causes a page lock/page
> > 	 * writeback access order inversion - we should only ever lock
> > 	 * multiple pages in ascending page->index order, and looping
> > 	 * back to the start of the file violates that rule and causes
> > 	 * deadlocks.
> > 	 */
> >
> > (I'll take the AR to put this somewhere better, like in the
> > folio_lock() kernel-doc)
>
> Um. I already did.
>
>  * Context: May sleep. If you need to acquire the locks of two or
>  * more folios, they must be in order of ascending index, if they are
>  * in the same address_space. If they are in different address_spaces,
>  * acquire the lock of the folio which belongs to the address_space which
>  * has the lowest address in memory first.
>
> Where should I have put this information that you would have found it,
> if not in the kernel-doc for folio_lock()?

I should have seen that :)

But even better would be if we could get lockdep to support folio
locks; then we could define a lockdep comparison function. Of course,
there's no place to stick a lockdep map, but I think technically
lockdep could do everything it needs to do without state in the lock
itself, if only that code didn't make my eyes bleed whenever I have to
dig into it...
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 17:34 ` Kent Overstreet
@ 2024-02-28 18:04 ` Matthew Wilcox
0 siblings, 0 replies; 90+ messages in thread
From: Matthew Wilcox @ 2024-02-28 18:04 UTC (permalink / raw)
To: Kent Overstreet
Cc: Linus Torvalds, Dave Chinner, Luis Chamberlain, lsf-pc,
linux-fsdevel, linux-mm, Daniel Gomez, Pankaj Raghav, Jens Axboe,
Christoph Hellwig, Chris Mason, Johannes Weiner

On Wed, Feb 28, 2024 at 12:34:41PM -0500, Kent Overstreet wrote:
> But even better would be if we could get lockdep to support folio
> locks; then we could define a lockdep comparison function. Of course,
> there's no place to stick a lockdep map, but I think technically
> lockdep could do everything it needs to do without state in the lock
> itself, if only that code didn't make my eyes bleed whenever I have to
> dig into it...

I don't think lockdep can support folio locks. They're taken in
process context and (for ->read_folio) released in interrupt context.
Also we can return to userspace while holding folio locks (I
believe ...)

DEPT claims to be able to handle them, but I haven't looked in detail.
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-27 22:46 ` Linus Torvalds
2024-02-27 23:00 ` Linus Torvalds
2024-02-28 2:22 ` Kent Overstreet
@ 2024-02-28 18:18 ` Kent Overstreet
2024-02-28 19:09 ` Linus Torvalds
2 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-28 18:18 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Tue, Feb 27, 2024 at 02:46:11PM -0800, Linus Torvalds wrote:
> On Tue, 27 Feb 2024 at 14:21, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > just like everyone else.
>
> Not for dio, it doesn't.
>
> > > The real question is how much of userspace will that break, because
> > > of implicit assumptions that the kernel has always serialised
> > > buffered writes?
> >
> > What would break?
>
> Well, at least in theory you could have concurrent overlapping writes
> of folio-crossing records, and currently you do get the guarantee that
> one or the other record is written, but relying just on page locking
> would mean that you might get a mix of them at page boundaries.

I think we can keep that guarantee.

The tricky case was -EFAULT from copy_from_user_nofault(), where we
have to bail out, drop locks, re-fault in the user buffer - and redo
the rest of the write, this time holding the inode lock.

We can't guarantee that partial writes don't happen, but what we can
do is restart the write from the beginning, so the partial write gets
overwritten with a full atomic write. This way, after writes complete
we'll never have weird torn writes left around.
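A sketch of that bail-out-and-refault dance, loosely modeled on the
pattern generic_perform_write() already uses. unlock_all_folios() is
an assumed helper, the surrounding variables (folio, offset, bytes,
written, pos) come from an enclosing write loop that isn't shown, and
the restart-from-the-start behaviour is the proposal described above,
not current mainline behaviour:

	restart:
		/* with the target folios locked, copy without taking faults */
		copied = copy_folio_from_iter_atomic(folio, offset, bytes, iter);
		if (unlikely(copied == 0)) {
			/*
			 * User buffer not resident: we cannot fault with
			 * folio locks held. Drop them, fault the buffer in,
			 * rewind, and redo the whole write so the final
			 * contents are one untorn copy of the source buffer.
			 */
			unlock_all_folios();		/* assumed helper */
			if (fault_in_iov_iter_readable(iter, bytes) == bytes)
				return -EFAULT;		/* genuinely bad address */
			iov_iter_revert(iter, written);
			pos -= written;
			written = 0;
			goto restart;
		}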
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 18:18 ` Kent Overstreet
@ 2024-02-28 19:09 ` Linus Torvalds
2024-02-28 19:29 ` Kent Overstreet
0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-02-28 19:09 UTC (permalink / raw)
To: Kent Overstreet
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, 28 Feb 2024 at 10:18, Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> I think we can keep that guarantee.
>
> The tricky case was -EFAULT from copy_from_user_nofault(), where we
> have to bail out, drop locks, re-fault in the user buffer - and redo
> the rest of the write, this time holding the inode lock.
>
> We can't guarantee that partial writes don't happen, but what we can
> do is restart the write from the beginning, so the partial write gets
> overwritten with a full atomic write.

I think that's a solution that is actually much worse than the thing
it is trying to solve.

Now a concurrent reader can actually see the data change twice or
more. Either because there's another writer that came in in between,
or because of threaded modifications to the source buffer in the first
writer.

So your solution actually makes for noticeably *worse* atomicity
guarantees, not better.

Not the solution. Not at all.

I do think the solution is to just take the inode lock exclusive (when
we have to modify the inode size or the suid/sgid) or shared (to
prevent concurrent i_size modifications), and leave it at that.

And we should probably do a mount flag (for defaults) and an open-time
flag (for specific uses) to let people opt in to this behavior.

Linus
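Sketched as code, the policy Linus suggests reads roughly as below.
The predicate is illustrative only - a real filesystem would use its
own checks for size extension and privilege stripping, and the
suid/sgid test shown is an assumed stand-in, not an exact in-tree
helper:

	/* assumed predicate: does this write change i_size or strip suid? */
	bool excl = iocb->ki_pos + count > i_size_read(inode) ||
		    write_needs_remove_privs(file);

	if (excl)
		inode_lock(inode);	  /* size/suid change: serialise */
	else
		inode_lock_shared(inode); /* plain overwrite: writes run
					   * concurrently; i_size can't move
					   * under us while shared is held */

The shared case is what buys the concurrency: a pure overwrite never
changes inode metadata, so many such writers can proceed at once while
still excluding anyone who needs to move i_size.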
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 19:09 ` Linus Torvalds
@ 2024-02-28 19:29 ` Kent Overstreet
2024-02-28 20:17 ` Linus Torvalds
0 siblings, 1 reply; 90+ messages in thread
From: Kent Overstreet @ 2024-02-28 19:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, Feb 28, 2024 at 11:09:53AM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 10:18, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > I think we can keep that guarantee.
> >
> > The tricky case was -EFAULT from copy_from_user_nofault(), where we
> > have to bail out, drop locks, re-fault in the user buffer - and redo
> > the rest of the write, this time holding the inode lock.
> >
> > We can't guarantee that partial writes don't happen, but what we can
> > do is restart the write from the beginning, so the partial write gets
> > overwritten with a full atomic write.
>
> I think that's a solution that is actually much worse than the thing
> it is trying to solve.
>
> Now a concurrent reader can actually see the data change twice or
> more. Either because there's another writer that came in in between,
> or because of threaded modifications to the source buffer in the first
> writer.

Yeah, that's a wrinkle. But that requires a three-way race to observe:
a race between a reader and multiple writers, with the reader doing
multiple reads.

The more concerning situation to me would be if breaking write
atomicity means that we end up with data in the file that doesn't
correspond to a total ordering of writes; e.g. part of write a, then
write b, then the rest of write a overlaying part of write b.

Maybe that can't happen as long as writes are always happening in
ascending folio order?
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 19:29 ` Kent Overstreet
@ 2024-02-28 20:17 ` Linus Torvalds
2024-02-28 23:21 ` Kent Overstreet
0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2024-02-28 20:17 UTC (permalink / raw)
To: Kent Overstreet
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, 28 Feb 2024 at 11:29, Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> The more concerning situation to me would be if breaking write
> atomicity means that we end up with data in the file that doesn't
> correspond to a total ordering of writes; e.g. part of write a, then
> write b, then the rest of write a overlaying part of write b.
>
> Maybe that can't happen as long as writes are always happening in
> ascending folio order?

So that was what my suggestion about just overlapping one lock at a
time was about - we always do writes in ascending order, and
contiguously (even if the data source obviously isn't necessarily some
contiguous thing).

And I think that's actually fundamental and not something we're likely
to get out of. If you do non-contiguous writes, you'll always treat
them as separate things.

Then just the "lock the next folio before unlocking the previous one"
would already give some relevant guarantees, in that at least you
wouldn't get overlapping writes where the write data would be mixed
up.

So you'd get *some* ordering, and while I wouldn't call it "total
ordering" (and I think with readers not taking locks you can't get
that anyway, because readers will *always* see partial writes), I
think it's much better than some re-write model.

However, the "lock consecutive pages as you go along" does still have
the issue of "you have to be able to take a page fault in the middle".

And right now that actually depends on us always dropping the page
lock in between iterations.

This issue is solvable - if you get a partial read while you hold a
page lock, you can always just see "are the pages I hold locks on not
up-to-date?" and decide to do the read yourself (and mark them
up-to-date). We don't do that right now because it's simpler not to,
but it's not conceptually a huge problem.

It *is* a practical problem, though.

For example, in generic_perform_write(), we've left page locking on
writes to the filesystems (ie it's done by
"->write_begin/->write_end"), so I think in reality that "don't
release the lock on folio N until after you've taken the lock on folio
N+1" isn't actually wonderful. It has some really nasty practical
issues.

And yes, those practical issues are because of historical mistakes
(some of them very much by yours truly). Having one single "page lock"
was probably one of those historical mistakes. If we use a different
bit for serializing page writers, the above problem would go away as
an issue.

ANYWAY.

At least with the current setup we literally depend on that "one page
at a time" behavior right now, and even XFS - which takes the inode
lock shared for reading - does *not* do it for reading a page during a
page fault, for all these reasons.

XFS uses iomap_write_iter() instead of generic_perform_write(), but
the solution there is exactly the same, and the issue is fairly
fundamental (unless you want to always read in pages that you are
going to overwrite first).

This is also one of the (many) reasons I think the POSIX atomicity
model is complete garbage. I think the whole "reads are atomic with
respect to writes" is a feel-good bedtime story. It's simplistic, and
it's just not *real*, because it's basically not compatible with mmap.

So the whole POSIX atomicity story comes from a historical
implementation background and ignores mmap. Fine, people can live in
that kind of "read and write without DIO is special" fairy tale and
think that paper standards are more important than sane semantics. But
I just am not a fan of that.

So I am personally perfectly happy to say "POSIX atomicity is a stupid
paper standard that has no relevance for reality". The read side
*cannot* be atomic wrt the write side.

But at the same time, I obviously then care a _lot_ about actual
existing loads. I may not worry about some POSIX atomicity guarantees,
but I *do* worry about real loads. And I don't think real loads
actually care about concurrent overlapping writes at all - but the "I
don't think" may be another wishful feel-good bedtime story that isn't
based on reality ;)

Linus
* Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO
2024-02-28 20:17 ` Linus Torvalds
@ 2024-02-28 23:21 ` Kent Overstreet
0 siblings, 0 replies; 90+ messages in thread
From: Kent Overstreet @ 2024-02-28 23:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dave Chinner, Luis Chamberlain, lsf-pc, linux-fsdevel, linux-mm,
Daniel Gomez, Pankaj Raghav, Jens Axboe, Christoph Hellwig,
Chris Mason, Johannes Weiner, Matthew Wilcox

On Wed, Feb 28, 2024 at 12:17:50PM -0800, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 11:29, Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > The more concerning situation to me would be if breaking write
> > atomicity means that we end up with data in the file that doesn't
> > correspond to a total ordering of writes; e.g. part of write a, then
> > write b, then the rest of write a overlaying part of write b.
> >
> > Maybe that can't happen as long as writes are always happening in
> > ascending folio order?
>
> So that was what my suggestion about just overlapping one lock at a
> time was about - we always do writes in ascending order, and
> contiguously (even if the data source obviously isn't necessarily some
> contiguous thing).
>
> And I think that's actually fundamental and not something we're likely
> to get out of. If you do non-contiguous writes, you'll always treat
> them as separate things.
>
> Then just the "lock the next folio before unlocking the previous one"
> would already give some relevant guarantees, in that at least you
> wouldn't get overlapping writes where the write data would be mixed
> up.
>
> So you'd get *some* ordering, and while I wouldn't call it "total
> ordering" (and I think with readers not taking locks you can't get
> that anyway, because readers will *always* see partial writes), I
> think it's much better than some re-write model.

Well, the re-write model is for the case where we have to bail out,
drop locks and re-fault, and that's no different here. But that's a
side issue.

The "just take the next lock before dropping the previous lock"
approach - I agree that works just as well. I'm still trying to come
up with a satisfying proof as to why - but one writer can't pass
another writer, it's basically that.

> However, the "lock consecutive pages as you go along" does still have
> the issue of "you have to be able to take a page fault in the middle".
>
> And right now that actually depends on us always dropping the page
> lock in between iterations.
>
> This issue is solvable - if you get a partial read while you hold a
> page lock, you can always just see "are the pages I hold locks on not
> up-to-date?" and decide to do the read yourself (and mark them
> up-to-date). We don't do that right now because it's simpler not to,
> but it's not conceptually a huge problem.

Hang on, I think you're talking about two different issues.

The "oops, copying from the _user_ buffer faulted" is the more
annoying one here. If the destination folio wasn't uptodate, we just
read that inline; we don't drop locks for that (at least my code
doesn't; I'd have to have a look at filemap.c and iomap for those
versions).

There is another funny corner case where the destination folio wasn't
uptodate, and we decided we didn't need to read it in because we were
going to be fully overwriting it, but then we _didn't_ fully overwrite
it - then we have to revert that part of the write. But if that
happened, it was because the user copy didn't copy as much as we
expected, i.e. it would've needed to take a page fault, so we have to
bail out, drop locks and refault anyways.

As an aside, the "copy_(to|from)_user may recurse into arbitrary code
and take arbitrary locks" thing is some scary stuff, and I very much
doubt it's handled correctly everywhere. If I ever do the ioctl v2
proposal I wrote up awhile back, I think part of that should be double
buffering approximately at the syscall layer, so that we're not doing
that from within rando driver/fs code holding rando locks; the 1% of
cases that care about performance can do something special, but most
shouldn't.

> It *is* a practical problem, though.
>
> For example, in generic_perform_write(), we've left page locking on
> writes to the filesystems (ie it's done by
> "->write_begin/->write_end"), so I think in reality that "don't
> release the lock on folio N until after you've taken the lock on folio
> N+1" isn't actually wonderful. It has some really nasty practical
> issues.

yeah, write_begin and write_end are... odd, to say the least.

> And yes, those practical issues are because of historical mistakes
> (some of them very much by yours truly). Having one single "page lock"
> was probably one of those historical mistakes. If we use a different
> bit for serializing page writers, the above problem would go away as
> an issue.

well, it's a difficult situation, a _lot_ of different code ends up
wanting to communicate state machine style through page flags.

but yeah, we're not clear on what the folio lock protects:
 - the data within the folio?
 - is it used for signalling read completion? (sometimes!)
 - is it used for guarding against truncate/invalidate? There are
   multiple different truncate/invalidate paths, doing different
   things based on folio dirty/writeback and perhaps locked
 - does it protect other filesystem state hanging off folio->private?
   you might think so, but I came to the conclusion that that was a
   bad idea

> ANYWAY.
>
> At least with the current setup we literally depend on that "one page
> at a time" behavior right now, and even XFS - which takes the inode
> lock shared for reading - does *not* do it for reading a page during a
> page fault, for all these reasons.
>
> XFS uses iomap_write_iter() instead of generic_perform_write(), but
> the solution there is exactly the same, and the issue is fairly
> fundamental (unless you want to always read in pages that you are
> going to overwrite first).
>
> This is also one of the (many) reasons I think the POSIX atomicity
> model is complete garbage. I think the whole "reads are atomic with
> respect to writes" is a feel-good bedtime story. It's simplistic, and
> it's just not *real*, because it's basically not compatible with mmap.

oh, I wouldn't go that far. I think the real problem is just
undocumented behaviour that differs across filesystems, and we've got
too many filesystems doing their own thing where their smart stuff
should've been lifted up to the VFS (and yes, I am a guilty party here
too).

And another part of what makes this stuff hard is that it's always
such a goddamn hassle to extend userspace interfaces, even just to add
a couple of flags. People have a real fear of that stuff, and when
things are proposed they end up getting bikeshedded to death, and then
we _overdesign_ stuff so that we don't have to repeat that process...

no. Make it easy to add new userspace interfaces, do it in small
bite-sized chunks, one thing at a time, learn from our mistakes, and
deprecate and add new revisions for the stuff that turned out to be
crap.

...What we really want here is just some flags so that userspace can
specify what locking model they want. We're clearly capable of
implementing multiple locking models, and the real nightmare scenario
is some userspace program depending on stupidly subtle atomicity
guarantees that only one filesystem implements AND THERE'S NOTHING TO
GREP FOR IN THE CODE TO FIND IT.