* [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback [not found] <CGME20250129103448epcas5p1f7d71506e4443429a0b0002eb842e749@epcas5p1.samsung.com> @ 2025-01-29 10:26 ` Kundan Kumar 2025-01-29 15:42 ` Christoph Hellwig 2025-01-29 22:51 ` Dave Chinner 0 siblings, 2 replies; 22+ messages in thread From: Kundan Kumar @ 2025-01-29 10:26 UTC (permalink / raw) To: lsf-pc Cc: linux-fsdevel, anuj20.g, mcgrof, joshi.k, david, axboe, clm, hch, willy, gost.dev, Kundan Kumar Greetings everyone, Anuj and I would like to present our proposal for implementing parallel writeback at LSF/MM. This approach could address the writeback performance issue discussed by Luis last year [1]. Currently, the pagecache writeback is executed in a single-threaded manner. Inodes are added to the dirty list, and the delayed writeback is scheduled. The single-threaded writeback then iterates through the dirty list and performs the writeback operations. We have developed a prototype implementation of parallel writeback, where we have made the writeback process per-CPU-based. The b_io, b_dirty, b_dirty_time, and b_more_io lists have also been modified to be per-CPU. When an inode needs to be added to the b_dirty list, we select the next CPU (in a round-robin fashion) and schedule the per-CPU writeback work on the selected CPU. With the per-CPU threads handling the writeback, we have observed a significant increase in IOPS. Here is a test and comparison between the older writeback and the newer parallel writeback, using XFS filesystem: https://github.com/kundanthebest/parallel_writeback/blob/main/README.md In our tests, we have found that with this implementation the buffered I/O IOPS surpasses the DIRECT I/O. We have done very limited testing with NVMe (Optane) and PMEM. There are a few items that we would like to discuss: During the implementation, we noticed several ways in which the writeback IOs are throttled: A) balance_dirty_pages B) writeback_chunk_size C) wb_over_bg_thresh Additionally, there are delayed writeback executions in the form of dirty_writeback_centisecs and dirty_expire_centisecs. With the introduction of per-CPU writeback, we need to determine the appropriate way to adjust the throttling and delayed writeback settings. Lock contention: We have observed that the writeback list_lock and the inode->i_lock are the locks that are most likely to cause contention when we make the thread per-CPU. Our next task will be to convert the writeback list_lock to a per-CPU lock. Another potential improvement could be to assign a specific CPU to an inode. We are preparing an RFC that shows these changes and plan to submit it soon. [1] https://lore.kernel.org/linux-mm/Zd5lORiOCUsARPWq@dread.disaster.area/T/ -- 2.25.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
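As a rough sketch of the round-robin scheme described in the proposal above (a standalone userspace model; the structure and function names are illustrative only and are not the prototype code), each newly dirtied inode is handed to the next writeback context in turn, and each context would then be drained by its own worker:

  #include <stdio.h>
  #include <stdlib.h>

  #define NR_CONTEXTS 4   /* stands in for the number of CPUs */

  /* illustrative stand-ins for the b_dirty tracking, not kernel structures */
  struct inode_stub {
          unsigned long ino;
          struct inode_stub *next;
  };

  struct wb_context {
          struct inode_stub *b_dirty;     /* per-context dirty list */
          unsigned long nr_dirty;
  };

  static struct wb_context wb_ctx[NR_CONTEXTS];
  static unsigned long next_ctx;          /* round-robin cursor */

  /* dirtying an inode: pick the next context in round-robin order */
  static void mark_inode_dirty(struct inode_stub *inode)
  {
          struct wb_context *ctx = &wb_ctx[next_ctx++ % NR_CONTEXTS];

          inode->next = ctx->b_dirty;
          ctx->b_dirty = inode;
          ctx->nr_dirty++;
  }

  int main(void)
  {
          for (unsigned long ino = 1; ino <= 32; ino++) {
                  struct inode_stub *inode = malloc(sizeof(*inode));

                  inode->ino = ino;
                  mark_inode_dirty(inode);
          }

          /* each context would be drained by its own writeback worker */
          for (int i = 0; i < NR_CONTEXTS; i++)
                  printf("context %d: %lu dirty inodes\n", i, wb_ctx[i].nr_dirty);
          return 0;
  }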
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 10:26 ` [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback Kundan Kumar @ 2025-01-29 15:42 ` Christoph Hellwig 2025-01-31 9:57 ` Kundan Kumar 0 siblings, 1 reply; 22+ messages in thread From: Christoph Hellwig @ 2025-01-29 15:42 UTC (permalink / raw) To: Kundan Kumar Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, david, axboe, clm, hch, willy, gost.dev On Wed, Jan 29, 2025 at 03:56:27PM +0530, Kundan Kumar wrote: > and b_more_io lists have also been modified to be per-CPU. When an inode needs > to be added to the b_dirty list, we select the next CPU (in a round-robin > fashion) and schedule the per-CPU writeback work on the selected CPU. I don't think per-cpu is the right shard here. You want to write related data together. A first approximation might be inodes. FYI, a really good "benchmark" is if you can use this parallel writeback code to replace the btrfs workqueue threads spawned to handle checksumming and compression. ^ permalink raw reply [flat|nested] 22+ messages in thread
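One way to read the "shard by inode" suggestion is that all data of a given inode should always map to the same writeback context, for example by hashing the inode number rather than rotating a per-CPU cursor. A minimal sketch of such a mapping (illustrative only; the constant is the 64-bit golden-ratio multiplier, and nr_ctx is assumed to be a power of two here):

  #include <stdio.h>

  /*
   * Illustrative only: hash the inode number so that all data of one
   * inode always lands in the same writeback context.
   */
  static unsigned int inode_to_wb_ctx(unsigned long long ino, unsigned int nr_ctx)
  {
          return (unsigned int)((ino * 0x61C8864680B583EBULL) >> 32) & (nr_ctx - 1);
  }

  int main(void)
  {
          for (unsigned long long ino = 100; ino < 108; ino++)
                  printf("inode %llu -> wb context %u\n", ino,
                         inode_to_wb_ctx(ino, 4));
          return 0;
  }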
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 15:42 ` Christoph Hellwig @ 2025-01-31 9:57 ` Kundan Kumar 0 siblings, 0 replies; 22+ messages in thread From: Kundan Kumar @ 2025-01-31 9:57 UTC (permalink / raw) To: Christoph Hellwig Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, david, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 566 bytes --] On 29/01/25 04:42PM, Christoph Hellwig wrote: >I don't think per-cpu is the right shard here. You want to write >related data together. A first approximation might be inodes. > We plan to affine an inode to a particular writeback thread: on the first encounter of an inode we select one writeback context, and we use the same context when the inode issues IO again. The number of writeback contexts can be changed from per-CPU and can be decided using other factors, as in this discussion. https://lore.kernel.org/all/20250131093209.6luwm4ny5kj34jqc@green245/ [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
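The "affine an inode on first encounter" idea can be sketched as a sticky assignment: the first time an inode is dirtied it picks a context, which is then cached in the inode and reused for all later IO. A small userspace model of that (field and function names are illustrative only, not the planned patches):

  #include <stdio.h>

  #define NO_CTX (-1)

  /* illustrative inode stub that remembers its writeback context */
  struct inode_stub {
          unsigned long ino;
          int wb_ctx;             /* NO_CTX until first dirtying */
  };

  static unsigned int nr_ctx = 4;
  static unsigned int next_ctx;

  /*
   * First dirtying picks a context (round-robin here, but it could be
   * NUMA- or geometry-based); later dirtyings reuse the same one.
   */
  static int inode_wb_ctx(struct inode_stub *inode)
  {
          if (inode->wb_ctx == NO_CTX)
                  inode->wb_ctx = next_ctx++ % nr_ctx;
          return inode->wb_ctx;
  }

  int main(void)
  {
          struct inode_stub a = { .ino = 42, .wb_ctx = NO_CTX };

          printf("first dirtying  -> ctx %d\n", inode_wb_ctx(&a));
          printf("second dirtying -> ctx %d\n", inode_wb_ctx(&a));  /* same ctx */
          return 0;
  }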
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 10:26 ` [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback Kundan Kumar 2025-01-29 15:42 ` Christoph Hellwig @ 2025-01-29 22:51 ` Dave Chinner 2025-01-31 9:32 ` Kundan Kumar 1 sibling, 1 reply; 22+ messages in thread From: Dave Chinner @ 2025-01-29 22:51 UTC (permalink / raw) To: Kundan Kumar Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev On Wed, Jan 29, 2025 at 03:56:27PM +0530, Kundan Kumar wrote: > Greetings everyone, > > Anuj and I would like to present our proposal for implementing parallel > writeback at LSF/MM. This approach could address the writeback performance > issue discussed by Luis last year [1]. > > Currently, the pagecache writeback is executed in a single-threaded manner. > Inodes are added to the dirty list, and the delayed writeback is scheduled. > The single-threaded writeback then iterates through the dirty list and performs > the writeback operations. > > We have developed a prototype implementation of parallel writeback, where we > have made the writeback process per-CPU-based. The b_io, b_dirty, b_dirty_time, > and b_more_io lists have also been modified to be per-CPU. When an inode needs > to be added to the b_dirty list, we select the next CPU (in a round-robin > fashion) and schedule the per-CPU writeback work on the selected CPU. My own experiments and experience tell me that per-cpu is not the right architecture for expanding writeback concurrency. The actual writeback parallelism that can be achieved is determined by the underlying filesystem architecture and geometry, not the CPU layout of the machine. I think some background info on writeback concurrency is also necessary before we go any further. If we go back many years to before we had the current write throttling architecture, we go back to an age where balance_dirty_pages() throttled by submitting foreground IO itself. It was throttled by the block layer submission queue running out of slots rather than any specific performance measurement. Under memory pressure, this led to highly concurrent dispatch of dirty pages, but very poor selection of dirty inodes/pages to dispatch. The result was that this foreground submission would often dispatch single page IOs, and issue tens of them concurrently across all user processes that were trying to write data. IOWs, throttling caused "IO breakdown" where the IO pattern ended up being small random IO exactly when we need to perform bulk cleaning of dirty pages. We needed bulk page cleaning throughput, not random dirty page shootdown. When we moved to "writeback is only done by the single BDI flusher thread" architecture, we significantly improved system behaviour under heavy load because writeback now had a tight focus on issuing IO from a single inode at a time as efficiently as possible. i.e. when we had large sequential dirty file regions, the writeback IO was all large sequential IO, and that didn't get chopped up by another hundred processes issuing IO from random other inodes at the same time. IOWs, having too much parallelism in writeback for the underlying storage and/or filesystem can be far more harmful to system performance under load than having too little parallelism to drive the filesystem/hardware to its maximum performance. 
This is an underlying constraint that everyone needs to understand before discussing writeback concurrency changes: excessive writeback concurrency can result in worse outcomes than having no writeback concurrency under sustained heavy production loads.... > With the per-CPU threads handling the writeback, we have observed a significant > increase in IOPS. Here is a test and comparison between the older writeback and > the newer parallel writeback, using XFS filesystem: > https://github.com/kundanthebest/parallel_writeback/blob/main/README.md What kernel is that from? > In our tests, we have found that with this implementation the buffered I/O > IOPS surpasses the DIRECT I/O. We have done very limited testing with NVMe > (Optane) and PMEM. Buffered IO often goes faster than direct for these workloads because the buffered IO is aggregated in memory to much larger, less random sequential IOs at writeback time. Writing the data in fewer, larger, better ordered IOs is almost always faster (at both the filesystem and storage hardware layers) than a pure 4kB random IO submission pattern. i.e. with enough RAM, this random write workload using buffered IO is pretty much guaranteed to outperform direct IO regardless of the underlying writeback concurrency. > There are a few items that we would like to discuss: > > During the implementation, we noticed several ways in which the writeback IOs > are throttled: > A) balance_dirty_pages > B) writeback_chunk_size > C) wb_over_bg_thresh > Additionally, there are delayed writeback executions in the form of > dirty_writeback_centisecs and dirty_expire_centisecs. > > With the introduction of per-CPU writeback, we need to determine the > appropriate way to adjust the throttling and delayed writeback settings. Yup, that's kinda my point - residency of the dirty data set in the page cache influences throughput in a workload like this more than writeback parallelism does. > Lock contention: > We have observed that the writeback list_lock and the inode->i_lock are the > locks that are most likely to cause contention when we make the thread per-CPU. You clearly aren't driving the system hard enough with small IOs. :) Once you load up the system with lots of small files (e.g. with fsmark to create millions of 4kB files concurrently) and use flush-on-close writeback semantics to drive writeback IO submission concurrency, the profiles are typically dominated by folio allocation and freeing in the page cache and XFS journal free space management (i.e. transaction reservation management around timestamp updates, block allocation and unwritten extent conversion.) That is, writeback parallelism in XFS is determined by the underlying allocation parallelism of the filesystem and journal size/throughput because it uses delayed allocation (same goes for ext4 and btrfs). In the case of XFS, allocation parallelism is determined by the filesytem geometry - the number of allocation groups created at mkfs time. So if you have 4 allocation groups (AGs) in XFS (typical default for small filesystems), then you can only be doing 4 extent allocation operations at once. Hence it doesn't matter how many concurrent writeback contexts the system has, the filesystem will limit concurrency in many workloads to just 4. i.e. the filesystem geometry determines writeback concurrency, not the number of CPUs in the system. 
Not every writeback will require allocation in XFS (speculative delalloc beyond EOF greatly reduces this overhead) but when you have hundreds of thousands of small files that each require allocation (e.g. cloning or untarring a large source tree), then writeback has a 1:1 ratio between per-inode writeback and extent allocation. This is the case where filesystem writeback concurrency really matters... > Our next task will be to convert the writeback list_lock to a per-CPU lock. > Another potential improvement could be to assign a specific CPU to an inode. As I mentioned above, per-cpu writeback doesn't seem like the right architecture to me. IMO, what we need writeback to be optimised for is asynchronous IO dispatch through each filesystem; our writeback IOPS problems in XFS largely stem from the per-IO cpu overhead of block allocation in the filesystems (i.e. delayed allocation). Because of the way we do allocation locality in XFS, we always try to allocate data extents in the same allocation group as the inode is located. Hence if we have a writeback context per AG, and we always know what AG an inode is located in, we always know which writeback context it should belong to. Hence what a filesystem like XFS needs is the ability to say to the writeback infrastructure "I need N writeback contexts" and then have the ability to control which writeback context an inode is attached to. A writeback context would be an entirely self contained object with dirty lists, locks to protect the lists, a (set of) worker threads that does the writeback dispatch, etc. Then the VFS will naturally write back all the inodes in a given AG independently of the inodes in any other AG, and there will be almost no serialisation on allocation or IO submission between writeback contexts. And there won't be any excessive writeback concurrency at the filesystem layer so contention problems during writeback go away, too. IOWs, for XFS we really don't want CPU aligned writeback lists, we want filesystem geometry aligned writeback contexts... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
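The per-AG scheme described above boils down to two pieces of information the filesystem would hand to the writeback layer: how many contexts it wants (the AG count) and which context each inode belongs to (the AG the inode lives in). A simplified userspace model of that mapping follows; real XFS extracts the AG number from the upper bits of the inode number via XFS_INO_TO_AGNO(), and the agino_bits shift below is only a stand-in for that per-filesystem layout:

  #include <stdio.h>

  /* Simplified model of "one writeback context per XFS AG"; illustrative only. */
  struct fs_geometry {
          unsigned int agcount;           /* number of allocation groups */
          unsigned int agino_bits;        /* bits used for the in-AG inode number */
  };

  /* The filesystem tells the writeback layer "I need agcount contexts"... */
  static unsigned int fs_nr_wb_contexts(const struct fs_geometry *geo)
  {
          return geo->agcount;
  }

  /* ...and maps each inode to the context of the AG it lives in. */
  static unsigned int fs_inode_to_wb_context(const struct fs_geometry *geo,
                                             unsigned long long ino)
  {
          return (unsigned int)(ino >> geo->agino_bits) % geo->agcount;
  }

  int main(void)
  {
          struct fs_geometry geo = { .agcount = 4, .agino_bits = 32 };
          unsigned long long inos[] = { 0x100000080ULL, 0x200000123ULL, 0x345ULL };

          printf("writeback contexts needed: %u\n", fs_nr_wb_contexts(&geo));
          for (int i = 0; i < 3; i++)
                  printf("inode 0x%llx -> context %u\n", inos[i],
                         fs_inode_to_wb_context(&geo, inos[i]));
          return 0;
  }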
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 22:51 ` Dave Chinner @ 2025-01-31 9:32 ` Kundan Kumar 2025-01-31 17:06 ` Luis Chamberlain 2025-02-04 2:50 ` Dave Chinner 0 siblings, 2 replies; 22+ messages in thread From: Kundan Kumar @ 2025-01-31 9:32 UTC (permalink / raw) To: Dave Chinner Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 2121 bytes --] >IOWs, having too much parallelism in writeback for the underlying >storage and/or filesystem can be far more harmful to system >performance under load than having too little parallelism to drive >the filesystem/hardware to it's maximum performance. With increasing speed of devices we would like to improve the performance of buffered IO as well. This will help the applications (DB, AI/ML) using buffered I/O. If more parallelism is causing side effects, we can reduce it using some factor like: 1) writeback context per NUMA node. 2) Fixed number of writeback contexts, say min(10, numcpu). 3) NUMCPU/N number of writeback contexts. 4) Writeback context based on FS geometry like per AG for XFS, as per your suggestion. > >What kernel is that from? > 6.12.0-rc4+ commit d165768847839f8d1ae5f8081ecc018a190d50e8 >i.e. with enough RAM, this random write workload using buffered IO >is pretty much guaranteed to outperform direct IO regardless of the >underlying writeback concurrency. We tested making sure RAM is available for both buffered and direct IO. On a system with 32GB RAM we issued 24GB IO through 24 jobs on a PMEM device. fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \ --ioengine=io_uring --time_based=1 -runtime=120 --numjobs=24 --size=1G \ --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting We can see the results, which show direct IO exceeding buffered IO by a big margin. BW (MiB/s) buffered dontcache %improvement direct %improvement randwrite (bs=4k) 3393 5397 59.06% 9315 174.53% >IMO, we need writeback to be optimised for is asynchronous IO >dispatch through each filesystems; our writeback IOPS problems in >XFS largely stem from the per-IO cpu overhead of block allocation in >the filesystems (i.e. delayed allocation). This is a good idea, but it means we will not be able to parallelize within an AG. I will spend some time to build a POC with per AG writeback context, and compare it with per-cpu writeback performance and extent fragmentation. Other filesystems using delayed allocation will also need a similar scheme. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-31 9:32 ` Kundan Kumar @ 2025-01-31 17:06 ` Luis Chamberlain 2025-02-04 2:50 ` Dave Chinner 1 sibling, 0 replies; 22+ messages in thread From: Luis Chamberlain @ 2025-01-31 17:06 UTC (permalink / raw) To: Kundan Kumar Cc: Dave Chinner, lsf-pc, linux-fsdevel, anuj20.g, joshi.k, axboe, clm, hch, willy, gost.dev, Javier González On Fri, Jan 31, 2025 at 03:02:09PM +0530, Kundan Kumar wrote: > > IOWs, having too much parallelism in writeback for the underlying > > storage and/or filesystem can be far more harmful to system > > performance under load than having too little parallelism to drive > > the filesystem/hardware to it's maximum performance. > > With increasing speed of devices we would like to improve the performance of > buffered IO as well. This will help the applications(DB, AI/ML) using buffered > I/O. If more parallelism is causing side effect, we can reduce it using some > factor like: > 1) writeback context per NUMA node. > 2) Fixed number of writeback contexts, say min(10, numcpu). > 3) NUMCPU/N number of writeback contexts. Based on Dave's feedback, it would seem not using 4) can in the worst case make things worse in certain heavy workloads. So an opt-in rather than default would probably be best for 1-3. > 4) Writeback context based on FS geometry like per AG for XFS, as per your > suggestion. To this later point 4): This is not the first time having the ability to gather filesystem topology somehow comes up for more interesting enhancements, FDP being the other one. I don't think we have a generic way to gather this information today, and so do we want a way to at least allow internal users to query for something like this? Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
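Purely as a strawman for the "query the filesystem topology" question above -- this is not an existing kernel interface, and every name here is hypothetical -- the shape could be as small as a per-superblock callback that reports how many independent write domains (XFS AGs, btrfs devices, zones, ...) the filesystem has, so internal users such as writeback could size themselves from it:

  #include <stdio.h>

  /* Hypothetical shape only -- not an existing kernel API. */
  struct fs_topology_hint {
          unsigned int nr_write_domains;  /* e.g. XFS AG count or btrfs device count */
  };

  struct sb_stub {
          const char *fstype;
          void (*get_topology_hint)(const struct sb_stub *sb,
                                    struct fs_topology_hint *hint);
  };

  /* what an example filesystem might report */
  static void example_get_hint(const struct sb_stub *sb,
                               struct fs_topology_hint *hint)
  {
          (void)sb;
          hint->nr_write_domains = 4;     /* pretend: 4 allocation groups */
  }

  int main(void)
  {
          struct sb_stub sb = { "examplefs", example_get_hint };
          struct fs_topology_hint hint;

          sb.get_topology_hint(&sb, &hint);
          printf("%s reports %u independent write domains\n", sb.fstype,
                 hint.nr_write_domains);
          return 0;
  }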
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-31 9:32 ` Kundan Kumar 2025-01-31 17:06 ` Luis Chamberlain @ 2025-02-04 2:50 ` Dave Chinner 2025-02-04 5:06 ` Christoph Hellwig 1 sibling, 1 reply; 22+ messages in thread From: Dave Chinner @ 2025-02-04 2:50 UTC (permalink / raw) To: Kundan Kumar Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev On Fri, Jan 31, 2025 at 03:02:09PM +0530, Kundan Kumar wrote: > > IOWs, having too much parallelism in writeback for the underlying > > storage and/or filesystem can be far more harmful to system > > performance under load than having too little parallelism to drive > > the filesystem/hardware to it's maximum performance. > > With increasing speed of devices we would like to improve the performance of > buffered IO as well. This will help the applications(DB, AI/ML) using buffered > I/O. If more parallelism is causing side effect, we can reduce it using some > factor like: > 1) writeback context per NUMA node. I doubt that will create enough concurrency for a typical small server or desktop machine that only has a single NUMA node but has a couple of fast nvme SSDs in it. > 2) Fixed number of writeback contexts, say min(10, numcpu). > 3) NUMCPU/N number of writeback contexts. These don't take into account the concurrency available from the underlying filesystem or storage. That's the point I was making - CPU count has -zero- relationship to the concurrency the filesystem and/or storage provide the system. It is fundamentally incorrect to base decisions about IO concurrency on the number of CPU cores in the system. Same goes for using NUMA nodes, a fixed minimum, etc. > 4) Writeback context based on FS geometry like per AG for XFS, as per your > suggestion. Leave the concurrency and queue affinity selection up to the filesystem. If they don't specify anything, then we remain with a single queue and writeback thread. > > i.e. with enough RAM, this random write workload using buffered IO > > is pretty much guaranteed to outperform direct IO regardless of the > > underlying writeback concurrency. > > We tested making sure RAM is available for both buffered and direct IO. On a > system with 32GB RAM we issued 24GB IO through 24 jobs on a PMEM device. > > fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \ > --ioengine=io_uring --time_based=1 -runtime=120 --numjobs=24 --size=1G \ > --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting > > We can see the results which show direct IO exceed buffered IO by big margin. > > BW (MiB/s) buffered dontcache %improvement direct %improvement > randwrite (bs=4k) 3393 5397 59.06% 9315 174.53% DIO is faster on PMEM because it only copies the data once (i.e. user to PMEM) whilst buffered IO copies the data twice (user to page cache to PMEM). The IO has exactly the same CPU overhead regardless of in-memory page cache aggregation for buffered IO because IO cpu usage is dominated by the data copy overhead. Throughput benchmarks on PMEM are -always- CPU and memory bandwidth bound doing synchronous data copying. PMEM benchmarks may respond to CPU count based optimisations, but that should not be taken as "infrastructure needs to be CPU-count optimised". This is why it is fundamentally wrong to use PMEM for IO infrastructure optimisation: the behaviour it displays is completely different to pretty much all other types of IO device we support. 
PMEM is a niche, and it performs very poorly compared to a handful of NVMe SSDs being driven by the same CPUs... > > IMO, we need writeback to be optimised for is asynchronous IO > > dispatch through each filesystems; our writeback IOPS problems in > > XFS largely stem from the per-IO cpu overhead of block allocation in > > the filesystems (i.e. delayed allocation). > > This is a good idea, but it means we will not be able to paralellize within > an AG. The XFS allocation concurrency model scales out across AGs, not within AGs. If we need writeback concurrency within an AG (e.g. for pure overwrite concurrency) then we want async dispatch worker threads per writeback queue, not multiple writeback queues per AG.... > I will spend some time to build a POC with per AG writeback context, and > compare it with per-cpu writeback performance and extent fragmentation. Other > filesystems using delayed allocation will also need a similar scheme. The two are not exclusive like you seem to think they are. If you want to optimise your filesystem geometry for CPU count based concurrency, then on XFS all you need to do is this: # mkfs.xfs -d concurrency=nr_cpus /dev/pmem0 mkfs will then create a geometry that has sufficient concurrency for the number of CPUs in the host system. Hence we end up with -all- in-memory and on-disk filesystem structures on that PMEM device being CPU count optimised, not just writeback. Writeback is just a small part of a much larger system, and we already optimise filesystem and IO concurrency via filesystem geometry. Writeback needs to work within those same controls we already provide users with. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-04 2:50 ` Dave Chinner @ 2025-02-04 5:06 ` Christoph Hellwig 2025-02-04 7:08 ` Dave Chinner 2025-02-10 17:28 ` [Lsf-pc] " Jan Kara 0 siblings, 2 replies; 22+ messages in thread From: Christoph Hellwig @ 2025-02-04 5:06 UTC (permalink / raw) To: Dave Chinner Cc: Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > I doubt that will create enough concurrency for a typical small > server or desktop machine that only has a single NUMA node but has a > couple of fast nvme SSDs in it. > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > 3) NUMCPU/N number of writeback contexts. > > These don't take into account the concurrency available from > the underlying filesystem or storage. > > That's the point I was making - CPU count has -zero- relationship to > the concurrency the filesystem and/or storage provide the system. It > is fundamentally incorrect to base decisions about IO concurrency on > the number of CPU cores in the system. Yes. But as mentioned in my initial reply there is a use case for more WB threads than fs writeback contexts, which is when the writeback threads do CPU intensive work like compression. Being able to do that from normal writeback threads vs forking out to fs level threads would really simplify the btrfs code a lot. Not really interesting for XFS right now of course. Or in other words: fs / device geometry really should be the main driver, but if a file system supports compression (or really expensive data checksums) being able to scale up the number of threads per context might still make sense. But that's really the advanced part, we'll need to get the fs geometry aligned to work first. > > > XFS largely stem from the per-IO cpu overhead of block allocation in > > > the filesystems (i.e. delayed allocation). > > > > This is a good idea, but it means we will not be able to paralellize within > > an AG. > > The XFS allocation concurrency model scales out across AGs, not > within AGs. If we need writeback concurrency within an AG (e.g. for > pure overwrite concurrency) then we want async dispatch worker > threads per writeback queue, not multiple writeback queues per > AG.... Unless the computational work mentioned above is involved there would be something really wrong if we're saturating a CPU per AG. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-04 5:06 ` Christoph Hellwig @ 2025-02-04 7:08 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2025-02-04 7:08 UTC (permalink / raw) To: Christoph Hellwig Cc: Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue, Feb 04, 2025 at 06:06:42AM +0100, Christoph Hellwig wrote: > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > I doubt that will create enough concurrency for a typical small > > server or desktop machine that only has a single NUMA node but has a > > couple of fast nvme SSDs in it. > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > 3) NUMCPU/N number of writeback contexts. > > > > These don't take into account the concurrency available from > > the underlying filesystem or storage. > > > > That's the point I was making - CPU count has -zero- relationship to > > the concurrency the filesystem and/or storage provide the system. It > > is fundamentally incorrect to base decisions about IO concurrency on > > the number of CPU cores in the system. > > Yes. But as mention in my initial reply there is a use case for more > WB threads than fs writeback contexts, which is when the writeback I understand that - there's more than one reason we may want multiple IO dispatch threads per writeback context. > threads do CPU intensive work like compression. Being able to do that > from normal writeback threads vs forking out out to fs level threads > would really simply the btrfs code a lot. Not really interesting for > XFS right now of course. > > Or in other words: fs / device geometry really should be the main > driver, but if a file systems supports compression (or really expensive > data checksums) being able to scale up the numbes of threads per > context might still make sense. But that's really the advanced part, > we'll need to get the fs geometry aligned to work first. > > > > > XFS largely stem from the per-IO cpu overhead of block allocation in > > > > the filesystems (i.e. delayed allocation). > > > > > > This is a good idea, but it means we will not be able to paralellize within > > > an AG. > > > > The XFS allocation concurrency model scales out across AGs, not > > within AGs. If we need writeback concurrency within an AG (e.g. for > > pure overwrite concurrency) then we want async dispatch worker > > threads per writeback queue, not multiple writeback queues per > > AG.... > > Unless the computational work mentioned above is involved there would be > something really wrong if we're saturating a CPU per AG. ... and the common case I'm thinking of here is stalling writeback on the AGF lock. i.e. another non-writeback task holding the AGF lock - inode cluster allocation, directory block allocation, freeing space in that AG, etc - will stall any writeback that requires allocation in that AG if we don't have multiple dispatch threads per writeback context available. That, IMO, is future optimisation; just moving to a writeback context per AG would solve a lot of the current "single thread CPU bound" writeback throughput limitations we have right now, especially when it comes to writing lots of small files across many directories in a very short period of time (i.e. opening tarballs, rsync, recursive dir copies, etc). -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-04 5:06 ` Christoph Hellwig 2025-02-04 7:08 ` Dave Chinner @ 2025-02-10 17:28 ` Jan Kara 2025-02-11 1:13 ` Dave Chinner 1 sibling, 1 reply; 22+ messages in thread From: Jan Kara @ 2025-02-10 17:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue 04-02-25 06:06:42, Christoph Hellwig wrote: > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > I doubt that will create enough concurrency for a typical small > > server or desktop machine that only has a single NUMA node but has a > > couple of fast nvme SSDs in it. > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > 3) NUMCPU/N number of writeback contexts. > > > > These don't take into account the concurrency available from > > the underlying filesystem or storage. > > > > That's the point I was making - CPU count has -zero- relationship to > > the concurrency the filesystem and/or storage provide the system. It > > is fundamentally incorrect to base decisions about IO concurrency on > > the number of CPU cores in the system. > > Yes. But as mention in my initial reply there is a use case for more > WB threads than fs writeback contexts, which is when the writeback > threads do CPU intensive work like compression. Being able to do that > from normal writeback threads vs forking out out to fs level threads > would really simply the btrfs code a lot. Not really interesting for > XFS right now of course. > > Or in other words: fs / device geometry really should be the main > driver, but if a file systems supports compression (or really expensive > data checksums) being able to scale up the numbes of threads per > context might still make sense. But that's really the advanced part, > we'll need to get the fs geometry aligned to work first. As I'm reading the thread it sounds to me the writeback subsystem should provide an API for the filesystem to configure number of writeback contexts which would be kind of similar to what we currently do for cgroup aware writeback? Currently we create writeback context per cgroup so now additionally we'll have some property like "inode writeback locality" that will also influence what inode->i_wb gets set to and hence where mark_inode_dirty() files inodes etc. Then additionally you'd like to have possibility to have more processes operate on one struct wb_writeback (again configurable by filesystem?) to handle cases of cpu intensive writeback submission more efficiently. Sounds about right? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-10 17:28 ` [Lsf-pc] " Jan Kara @ 2025-02-11 1:13 ` Dave Chinner 2025-02-11 13:43 ` Jan Kara ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Dave Chinner @ 2025-02-11 1:13 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote: > On Tue 04-02-25 06:06:42, Christoph Hellwig wrote: > > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > > I doubt that will create enough concurrency for a typical small > > > server or desktop machine that only has a single NUMA node but has a > > > couple of fast nvme SSDs in it. > > > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > > 3) NUMCPU/N number of writeback contexts. > > > > > > These don't take into account the concurrency available from > > > the underlying filesystem or storage. > > > > > > That's the point I was making - CPU count has -zero- relationship to > > > the concurrency the filesystem and/or storage provide the system. It > > > is fundamentally incorrect to base decisions about IO concurrency on > > > the number of CPU cores in the system. > > > > Yes. But as mention in my initial reply there is a use case for more > > WB threads than fs writeback contexts, which is when the writeback > > threads do CPU intensive work like compression. Being able to do that > > from normal writeback threads vs forking out out to fs level threads > > would really simply the btrfs code a lot. Not really interesting for > > XFS right now of course. > > > > Or in other words: fs / device geometry really should be the main > > driver, but if a file systems supports compression (or really expensive > > data checksums) being able to scale up the numbes of threads per > > context might still make sense. But that's really the advanced part, > > we'll need to get the fs geometry aligned to work first. > > As I'm reading the thread it sounds to me the writeback subsystem should > provide an API for the filesystem to configure number of writeback > contexts which would be kind of similar to what we currently do for cgroup > aware writeback? Yes, that's pretty much what I've been trying to say. > Currently we create writeback context per cgroup so now > additionally we'll have some property like "inode writeback locality" that > will also influence what inode->i_wb gets set to and hence where > mark_inode_dirty() files inodes etc. Well, that's currently selected by __inode_attach_wb() based on whether there is a memcg attached to the folio/task being dirtied or not. If there isn't a cgroup based writeback task, then it uses the bdi->wb as the wb context. In my mind, what you are describing above sounds like we would be heading down the same road list_lru started down back in 2012 to support NUMA scalability for LRU based memory reclaim. i.e. we originally had a single global LRU list for important caches. This didn't scale up, so I introduced the list_lru construct to abstract the physical layout of the LRU from the objects being stored on it and the reclaim infrastructure walking it. That gave us per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. The fundamental concept was that we abstract away the sharding of the object tracking into per-physical-node structures via generic infrastructure (i.e. list_lru). 
Then memcgs needed memory reclaim, and so they were added as extra lists with a different indexing mechanism to the list-lru contexts. These weren't per-node lists because there could be thousands of them. Hence it was just a single "global" list per memcg, and so it didn't scale on large machines. This wasn't seen as a problem initially, but a few years later applications using memcgs wanted to scale properly on large NUMA systems. So now we have each memcg tracking the physical per-node memory usage for reclaim purposes (i.e. combinatorial explosion of memcg vs per-node lists). Hence suggesting "physically sharded lists for global objects, single per-cgroup lists for cgroup-owned objects" sounds like exactly the same problem space progression is about to play out with writeback contexts. i.e. we shard the global writeback context into a set of physically sharded lists for scalability and performance reasons, but leave cgroups with the old single threaded list constructs. Then someone says "my cgroup based workload doesn't perform the same as a global workload" and we're off to solve the problem list_lru solves again. So.... Should we be looking towards using a subset of the existing list_lru functionality for writeback contexts here? i.e. create a list_lru object with N-way scalability, allow the fs to provide an inode-number-to-list mapping function, and use the list_lru interfaces to abstract away everything physical and cgroup related for tracking dirty inodes? Then selecting inodes for writeback becomes a list_lru_walk() variant depending on what needs to be written back (e.g. physical node, memcg, both, everything that is dirty everywhere, etc). > Then additionally you'd like to have possibility to have more processes > operate on one struct wb_writeback (again configurable by filesystem?) to > handle cases of cpu intensive writeback submission more efficiently. Yeah, that is a separate problem... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
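The list_lru-inspired idea above can be modelled in userspace as an N-way sharded dirty-inode structure where the shard index comes from a filesystem-supplied mapping callback (AG number, device, ...) rather than from the NUMA node of the object, and each writeback context only ever walks its own shard. A sketch under those assumptions (names are illustrative and this is not the kernel list_lru API):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct dirty_node {
          unsigned long long ino;
          struct dirty_node *next;
  };

  struct dirty_shard {
          pthread_mutex_t lock;           /* per-shard lock, no global list_lock */
          struct dirty_node *head;
  };

  struct dirty_list {
          unsigned int nr_shards;
          unsigned int (*ino_to_shard)(unsigned long long ino, unsigned int nr);
          struct dirty_shard *shards;
  };

  static void dirty_list_init(struct dirty_list *dl, unsigned int nr,
                              unsigned int (*map)(unsigned long long, unsigned int))
  {
          dl->nr_shards = nr;
          dl->ino_to_shard = map;
          dl->shards = calloc(nr, sizeof(*dl->shards));
          for (unsigned int i = 0; i < nr; i++)
                  pthread_mutex_init(&dl->shards[i].lock, NULL);
  }

  /* the fs-provided callback decides which shard a dirty inode lands on */
  static void dirty_list_add(struct dirty_list *dl, struct dirty_node *node)
  {
          struct dirty_shard *s =
                  &dl->shards[dl->ino_to_shard(node->ino, dl->nr_shards)];

          pthread_mutex_lock(&s->lock);
          node->next = s->head;
          s->head = node;
          pthread_mutex_unlock(&s->lock);
  }

  /* list_lru_walk()-style: each writeback context walks only its own shard */
  static void dirty_list_walk(struct dirty_list *dl, unsigned int shard,
                              void (*cb)(struct dirty_node *))
  {
          struct dirty_shard *s = &dl->shards[shard];

          pthread_mutex_lock(&s->lock);
          for (struct dirty_node *n = s->head; n; n = n->next)
                  cb(n);
          pthread_mutex_unlock(&s->lock);
  }

  static unsigned int map_by_low_bits(unsigned long long ino, unsigned int nr)
  {
          return (unsigned int)(ino % nr);        /* stand-in for an AG mapping */
  }

  static void print_node(struct dirty_node *n)
  {
          printf("  inode %llu\n", n->ino);
  }

  int main(void)
  {
          struct dirty_list dl;
          struct dirty_node nodes[8];

          dirty_list_init(&dl, 4, map_by_low_bits);
          for (int i = 0; i < 8; i++) {
                  nodes[i].ino = 100 + i;
                  dirty_list_add(&dl, &nodes[i]);
          }
          for (unsigned int s = 0; s < 4; s++) {
                  printf("shard %u:\n", s);
                  dirty_list_walk(&dl, s, print_node);
          }
          return 0;
  }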
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-11 1:13 ` Dave Chinner @ 2025-02-11 13:43 ` Jan Kara 2025-02-20 14:19 ` Kundan Kumar 2025-03-12 17:47 ` Luis Chamberlain 2 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2025-02-11 13:43 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Christoph Hellwig, Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue 11-02-25 12:13:18, Dave Chinner wrote: > On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote: > > On Tue 04-02-25 06:06:42, Christoph Hellwig wrote: > > > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > > > I doubt that will create enough concurrency for a typical small > > > > server or desktop machine that only has a single NUMA node but has a > > > > couple of fast nvme SSDs in it. > > > > > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > > > 3) NUMCPU/N number of writeback contexts. > > > > > > > > These don't take into account the concurrency available from > > > > the underlying filesystem or storage. > > > > > > > > That's the point I was making - CPU count has -zero- relationship to > > > > the concurrency the filesystem and/or storage provide the system. It > > > > is fundamentally incorrect to base decisions about IO concurrency on > > > > the number of CPU cores in the system. > > > > > > Yes. But as mention in my initial reply there is a use case for more > > > WB threads than fs writeback contexts, which is when the writeback > > > threads do CPU intensive work like compression. Being able to do that > > > from normal writeback threads vs forking out out to fs level threads > > > would really simply the btrfs code a lot. Not really interesting for > > > XFS right now of course. > > > > > > Or in other words: fs / device geometry really should be the main > > > driver, but if a file systems supports compression (or really expensive > > > data checksums) being able to scale up the numbes of threads per > > > context might still make sense. But that's really the advanced part, > > > we'll need to get the fs geometry aligned to work first. > > > > As I'm reading the thread it sounds to me the writeback subsystem should > > provide an API for the filesystem to configure number of writeback > > contexts which would be kind of similar to what we currently do for cgroup > > aware writeback? > > Yes, that's pretty much what I've been trying to say. > > > Currently we create writeback context per cgroup so now > > additionally we'll have some property like "inode writeback locality" that > > will also influence what inode->i_wb gets set to and hence where > > mark_inode_dirty() files inodes etc. > > Well, that's currently selected by __inode_attach_wb() based on > whether there is a memcg attached to the folio/task being dirtied or > not. If there isn't a cgroup based writeback task, then it uses the > bdi->wb as the wb context. > > In my mind, what you are describing above sounds like we would be > heading down the same road list_lru started down back in 2012 to > support NUMA scalability for LRU based memory reclaim. > > i.e. we originally had a single global LRU list for important > caches. This didn't scale up, so I introduced the list_lru construct > to abstract the physical layout of the LRU from the objects being > stored on it and the reclaim infrastructure walking it. That gave us > per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. 
The > fundamental concept was that we abstract away the sharding of the > object tracking into per-physical-node structures via generic > infrastructure (i.e. list_lru). > > Then memcgs needed memory reclaim, and so they were added as extra > lists with a different indexing mechanism to the list-lru contexts. > These weren't per-node lists because there could be thousands of > them. Hence it was just a single "global" list per memcg, and so it > didn't scale on large machines. > > This wasn't seen as a problem initially, but a few years later > applications using memcgs wanted to scale properly on large NUMA > systems. So now we have each memcg tracking the physical per-node > memory usage for reclaim purposes (i.e. combinatorial explosion of > memcg vs per-node lists). > > Hence suggesting "physically sharded lists for global objects, > single per-cgroup lists for cgroup-owned objects" sounds like > exactly the same problem space progression is about to play out with > writeback contexts. > > i.e. we shared the global writeback context into a set of physically > sharded lists for scalability and perofrmance reasons, but leave > cgroups with the old single threaded list constructs. Then someone > says "my cgroup based workload doesn't perform the same as a global > workload" and we're off to solve the problem list_lru solves again. > > So.... > > Should we be looking towards using a subset of the existing list_lru > functionality for writeback contexts here? i.e. create a list_lru > object with N-way scalability, allow the fs to provide an > inode-number-to-list mapping function, and use the list_lru > interfaces to abstract away everything physical and cgroup related > for tracking dirty inodes? Interesting idea. Indeed, the similarity with problems list_lru is solving is significant. I like the idea. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-11 1:13 ` Dave Chinner 2025-02-11 13:43 ` Jan Kara @ 2025-02-20 14:19 ` Kundan Kumar 2025-03-13 20:22 ` Jan Kara 2025-03-12 17:47 ` Luis Chamberlain 2 siblings, 1 reply; 22+ messages in thread From: Kundan Kumar @ 2025-02-20 14:19 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 1976 bytes --] >Well, that's currently selected by __inode_attach_wb() based on >whether there is a memcg attached to the folio/task being dirtied or >not. If there isn't a cgroup based writeback task, then it uses the >bdi->wb as the wb context. We have created a proof of concept for per-AG context-based writeback, as described in [1]. The AG is mapped to a writeback context (wb_ctx). Using the filesystem handler, __mark_inode_dirty() selects the writeback context corresponding to the inode. We attempted to handle memcg and bdi based writeback in a similar manner. This approach aims to maintain the original writeback semantics while providing parallelism. This helps in pushing more data early to the device, trying to ease the write pressure faster. [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ >Then selecting inodes for writeback becomes a list_lru_walk() >variant depending on what needs to be written back (e.g. physical >node, memcg, both, everything that is dirty everywhere, etc). We considered using list_lru to track inodes within a writeback context. This can be implemented as: struct bdi_writeback { struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list struct list_lru b_io_dirty_inodes_lru; ... ... }; By doing this, we would obtain a sharded list of inodes per NUMA node. However, we would also need per-NUMA writeback contexts. Otherwise, even if inodes are NUMA-sharded, a single writeback context would still process them sequentially, limiting parallelism. But there’s a concern: NUMA-based writeback contexts are not aligned with filesystem geometry, which could negatively impact delayed allocation and writeback efficiency, as you pointed out in your previous reply [2]. Would it be better to let the filesystem dictate the number of writeback threads, rather than enforcing a per-NUMA model? Do you see it differently? [2] https://lore.kernel.org/all/Z5qw_1BOqiFum5Dn@dread.disaster.area/ [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-20 14:19 ` Kundan Kumar @ 2025-03-13 20:22 ` Jan Kara 2025-03-18 3:42 ` Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Jan Kara @ 2025-03-13 20:22 UTC (permalink / raw) To: Kundan Kumar Cc: Dave Chinner, Jan Kara, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Thu 20-02-25 19:49:22, Kundan Kumar wrote: > > Well, that's currently selected by __inode_attach_wb() based on > > whether there is a memcg attached to the folio/task being dirtied or > > not. If there isn't a cgroup based writeback task, then it uses the > > bdi->wb as the wb context. > > We have created a proof of concept for per-AG context-based writeback, as > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using > the filesystem handler, __mark_inode_dirty() selects writeback context > corresponding to the inode. > > We attempted to handle memcg and bdi based writeback in a similar manner. > This approach aims to maintain the original writeback semantics while > providing parallelism. This helps in pushing more data early to the > device, trying to ease the write pressure faster. > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ Yeah, I've seen the patches. Sorry for not getting to you earlier. > > Then selecting inodes for writeback becomes a list_lru_walk() > > variant depending on what needs to be written back (e.g. physical > > node, memcg, both, everything that is dirty everywhere, etc). > > We considered using list_lru to track inodes within a writeback context. > This can be implemented as: > struct bdi_writeback { > struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list > struct list_lru b_io_dirty_inodes_lru; > ... > ... > }; > By doing this, we would obtain a sharded list of inodes per NUMA node. I think you've misunderstood Dave's suggestion here. list_lru was given as an example of a structure for inspiration. We cannot take it directly as is for writeback purposes because we don't want to be sharding based on NUMA nodes but rather based on some other (likely FS driven) criteria. > However, we would also need per-NUMA writeback contexts. Otherwise, > even if inodes are NUMA-sharded, a single writeback context would stil > process them sequentially, limiting parallelism. But there’s a concern: > NUMA-based writeback contexts are not aligned with filesystem geometry, > which could negatively impact delayed allocation and writeback efficiency, > as you pointed out in your previous reply [2]. > > Would it be better to let the filesystem dictate the number of writeback > threads, rather than enforcing a per-NUMA model? I was thinking about how to best parallelize the writeback and I think there are two quite different demands for which we probably want two different levels of parallelism. One case is the situation when the filesystem for example has multiple underlying devices (like btrfs or bcachefs) or for other reasons writeback to different parts is fairly independent (like for different XFS AGs). Here we want parallelism at rather high level I think including separate dirty throttling, tracking of writeback bandwidth etc.. 
It is *almost* like separate bdis (struct backing_dev_info) but I think it would be technically and also conceptually somewhat easier to do the multiplexing by factoring out: struct bdi_writeback wb; /* the root writeback info for this bdi */ struct list_head wb_list; /* list of all wbs */ #ifdef CONFIG_CGROUP_WRITEBACK struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ #endif wait_queue_head_t wb_waitq; into a new structure (looking for a good name - bdi_writeback_context???) that can get multiplied (filesystem can create its own bdi on mount and configure the number of bdi_writeback_contexts it wants). We also need to add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() which will return appropriate bdi_writeback_context (or perhaps just its index?) for an inode. This will be used by the filesystem to direct writeback code where the inode should go. This is kind of what Kundan did in the last revision of his patches but I hope this approach should somewhat limit the changes necessary to writeback infrastructure - the patch 2 in his series is really unreviewably large... Then another case is a situation where either the amount of CPU work is rather high for IO submission (cases like Christoph mentioned where filesystem needs to do checksumming on submission or similar) or simply the device is rather fast for a single submission thread and the FS doesn't have a sensible way to partition inodes (e.g. for ext4 there's no meaningful way of partitioning inodes into independent groups - ext4 allocation groups are small and inodes often span multiple groups and the sets of groups used by different inodes randomly overlap). In this case I think we want single dirty throttling instance, single writeback throughput estimation, single set of dirty inode lists etc. The level where the parallelism needs to happen is fairly low - I'd say duplicate: struct delayed_work dwork; /* work item used for writeback */ in struct bdi_writeback. Again, the number of dworks should be configurable when creating bdi for the filesystem. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
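For concreteness, the factoring described above might look roughly like the sketch below -- the field list is exactly the one quoted from today's struct backing_dev_info, the names bdi_writeback_context and get_inode_wb_context are the ones proposed in this mail, and everything else (the include, how the context array is sized, and how the hook gets wired into __inode_attach_wb()) is left open and only assumed here:

  #include <linux/backing-dev-defs.h>

  /* sketch of the proposed factoring, not an actual patch */
  struct bdi_writeback_context {
          struct bdi_writeback wb;                /* root writeback info for this context */
          struct list_head wb_list;               /* list of all wbs */
  #ifdef CONFIG_CGROUP_WRITEBACK
          struct radix_tree_root cgwb_tree;       /* radix tree of active cgroup wbs */
          struct rw_semaphore wb_switch_rwsem;    /* no cgwb switch while syncing */
  #endif
          wait_queue_head_t wb_waitq;
  };

  /*
   * struct backing_dev_info would then carry an array of these, sized by the
   * filesystem when it creates its own bdi at mount time, and a new
   * sb->s_ops->get_inode_wb_context(inode) hook -- called from
   * __inode_attach_wb() -- would return the context (or its index) that the
   * inode belongs to.
   */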
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-13 20:22 ` Jan Kara @ 2025-03-18 3:42 ` Dave Chinner [not found] ` <CGME20250319081521epcas5p39ab71751aef70c73ba0f664db852ad69@epcas5p3.samsung.com> 2025-03-18 6:41 ` Kundan Kumar 2025-03-18 11:37 ` Anuj Gupta 2 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2025-03-18 3:42 UTC (permalink / raw) To: Jan Kara Cc: Kundan Kumar, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Thu, Mar 13, 2025 at 09:22:00PM +0100, Jan Kara wrote: > On Thu 20-02-25 19:49:22, Kundan Kumar wrote: > > > Well, that's currently selected by __inode_attach_wb() based on > > > whether there is a memcg attached to the folio/task being dirtied or > > > not. If there isn't a cgroup based writeback task, then it uses the > > > bdi->wb as the wb context. > > > > We have created a proof of concept for per-AG context-based writeback, as > > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using > > the filesystem handler, __mark_inode_dirty() selects writeback context > > corresponding to the inode. > > > > We attempted to handle memcg and bdi based writeback in a similar manner. > > This approach aims to maintain the original writeback semantics while > > providing parallelism. This helps in pushing more data early to the > > device, trying to ease the write pressure faster. > > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ > > Yeah, I've seen the patches. Sorry for not getting to you earlier. > > > > Then selecting inodes for writeback becomes a list_lru_walk() > > > variant depending on what needs to be written back (e.g. physical > > > node, memcg, both, everything that is dirty everywhere, etc). > > > > We considered using list_lru to track inodes within a writeback context. > > This can be implemented as: > > struct bdi_writeback { > > struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list > > struct list_lru b_io_dirty_inodes_lru; > > ... > > ... > > }; > > By doing this, we would obtain a sharded list of inodes per NUMA node. > > I think you've misunderstood Dave's suggestion here. list_lru was given as > an example of a structure for inspiration. We cannot take it directly as is > for writeback purposes because we don't want to be sharding based on NUMA > nodes but rather based on some other (likely FS driven) criteria. Well, you might say that, but..... ... I was actually thinking of taking the list_lru and abstracting its N-way parallelism away from the numa infrastructure. The NUMA awareness of the list_lru is largely in external APIs. The only implicit NUMA awareness is in the list_lru_add() function where it converts the object being added to the list to a node ID based on where it is physically located in memory. The only other thing that is NUMA specific is that the list is set up with N-way concurrency when N = the number of NUMA nodes in the machine. So, really, it is just a thin veneer of NUMA wrapped around the inherent concurrency built into the structure. IOWs, when we create a list_lru for a numa aware shrinker, we simply use the number of nodes as the N-way parallelism for the list, and the existing per-node infrastructure simply feeds the right numa node ID as the "list index" for it to function as is. 
In the case of writeback parallelism, we could create a list_lru with the number of AGs as the N-way parallelism for the list, and then have the concurrent BDI writeback context (1 per AG) simply provide the AG number as the "list index" for writeback.... > > However, we would also need per-NUMA writeback contexts. Otherwise, > > even if inodes are NUMA-sharded, a single writeback context would stil > > process them sequentially, limiting parallelism. But there’s a concern: > > NUMA-based writeback contexts are not aligned with filesystem geometry, > > which could negatively impact delayed allocation and writeback efficiency, > > as you pointed out in your previous reply [2]. > > > > Would it be better to let the filesystem dictate the number of writeback > > threads, rather than enforcing a per-NUMA model? > > I was thinking about how to best parallelize the writeback and I think > there are two quite different demands for which we probably want two > different levels of parallelism. > > One case is the situation when the filesystem for example has multiple > underlying devices (like btrfs or bcachefs) or for other reasons writeback > to different parts is fairly independent (like for different XFS AGs). Here > we want parallelism at rather high level I think including separate > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like > separate bdis (struct backing_dev_info) but I think it would be technically > and also conceptually somewhat easier to do the multiplexing by factoring > out: > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > struct list_head wb_list; /* list of all wbs */ > #ifdef CONFIG_CGROUP_WRITEBACK > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > #endif > wait_queue_head_t wb_waitq; > > into a new structure (looking for a good name - bdi_writeback_context???) > that can get multiplied (filesystem can create its own bdi on mount and > configure there number of bdi_writeback_contexts it wants). We also need to > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() > which will return appropriate bdi_writeback_context (or perhaps just it's > index?) for an inode. This will be used by the filesystem to direct > writeback code where the inode should go. This is kind of what Kundan did > in the last revision of his patches but I hope this approach should > somewhat limit the changes necessary to writeback infrastructure - the > patch 2 in his series is really unreviewably large... Yes, this is the equivalent of the SHRINKER_NUMA_AWARE flag that triggers the shrinker infrastructure to track work in a NUMA aware way when the subsystem shrinker callbacks are using the NUMA aware list_lru to back it.... > Then another case is a situation where either the amount of CPU work is > rather high for IO submission (cases like Christoph mentioned where > filesystem needs to do checksumming on submission or similar) or simply the > device is rather fast for a single submission thread and the FS doesn't > have a sensible way to partition inodes (e.g. for ext4 there's no > meaningful way of partitioning inodes into independent groups - ext4 > allocation groups are small and inodes often span multiple groups and the > sets of groups used by different inodes randomly overlap). In this case I > think we want single dirty throttling instance, single writeback throughput > estimation, single set of dirty inode lists etc. 
The level where the > parallelism needs to happen is fairly low - I'd say duplicate: > > struct delayed_work dwork; /* work item used for writeback */ > > in struct bdi_writeback. Again, the number of dworks should be configurable > when creating bdi for the filesystem. Right - this is what I described early on in the discussion as a need for both scaling out across bdi contexts as well as scaling up within a single bdi context. That is, even within the context of a BDI writeback context per AG in XFS, we can have need for multiple IO submissions to be in flight at once. e.g. one IO submission gets blocked doing allocation, but there are a heap of pure overwrites queued up behind it. We can submit the pure overwrites without getting blocked on the allocation, but if we don't have IO submission concurrency within a BDI, we cannot do that at all. IOWs, these aren't exclusive optimisations to writeback; there are situations where only one of the two mechanisms can solve the concurrency problem. i.e. we need both high level and low level concurrency mechanisms in filesystems like XFS.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
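To make the abstraction Dave is describing concrete, here is a minimal sketch of an N-way sharded dirty list where the shard index is supplied by the caller (for example the XFS AG number) rather than derived from the NUMA node the object happens to live on. All names below (dirty_list, dirty_list_shard, dirty_list_add) are invented for illustration; this is not the existing list_lru API, only the shape the structure takes once the NUMA veneer is peeled away.

#include <linux/cache.h>
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* One shard per writeback context, e.g. one per XFS AG. */
struct dirty_list_shard {
	spinlock_t		lock;
	struct list_head	list;		/* dirty inodes in this shard */
	long			nr_items;
} ____cacheline_aligned_in_smp;

struct dirty_list {
	unsigned int		nr_shards;	/* N-way parallelism, e.g. AG count */
	struct dirty_list_shard	*shards;
};

static int dirty_list_init(struct dirty_list *dl, unsigned int nr_shards)
{
	unsigned int i;

	dl->shards = kcalloc(nr_shards, sizeof(*dl->shards), GFP_KERNEL);
	if (!dl->shards)
		return -ENOMEM;
	dl->nr_shards = nr_shards;
	for (i = 0; i < nr_shards; i++) {
		spin_lock_init(&dl->shards[i].lock);
		INIT_LIST_HEAD(&dl->shards[i].list);
	}
	return 0;
}

/*
 * The caller supplies the shard index: XFS would pass the inode's AG
 * number, while a NUMA-aware user would pass the node ID, recovering
 * the existing list_lru behaviour.
 */
static void dirty_list_add(struct dirty_list *dl, struct list_head *item,
			   unsigned int idx)
{
	struct dirty_list_shard *shard = &dl->shards[idx % dl->nr_shards];

	spin_lock(&shard->lock);
	list_add_tail(item, &shard->list);
	shard->nr_items++;
	spin_unlock(&shard->lock);
}

A per-shard walk then gives the list_lru_walk()-style selection mentioned earlier in the thread: "write back AG 5" becomes a walk of shards[5] only, and each per-AG writeback context only ever touches its own shard.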
[parent not found: <CGME20250319081521epcas5p39ab71751aef70c73ba0f664db852ad69@epcas5p3.samsung.com>]
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback [not found] ` <CGME20250319081521epcas5p39ab71751aef70c73ba0f664db852ad69@epcas5p3.samsung.com> @ 2025-03-19 8:07 ` Anuj Gupta 0 siblings, 0 replies; 22+ messages in thread From: Anuj Gupta @ 2025-03-19 8:07 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Kundan Kumar, Christoph Hellwig, lsf-pc, linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 3590 bytes --] On Tue, Mar 18, 2025 at 02:42:18PM +1100, Dave Chinner wrote: > On Thu, Mar 13, 2025 at 09:22:00PM +0100, Jan Kara wrote: > > On Thu 20-02-25 19:49:22, Kundan Kumar wrote: > > > > Well, that's currently selected by __inode_attach_wb() based on > > > > whether there is a memcg attached to the folio/task being dirtied or > > > > not. If there isn't a cgroup based writeback task, then it uses the > > > > bdi->wb as the wb context. > > > > > > We have created a proof of concept for per-AG context-based writeback, as > > > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using > > > the filesystem handler, __mark_inode_dirty() selects writeback context > > > corresponding to the inode. > > > > > > We attempted to handle memcg and bdi based writeback in a similar manner. > > > This approach aims to maintain the original writeback semantics while > > > providing parallelism. This helps in pushing more data early to the > > > device, trying to ease the write pressure faster. > > > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ > > > > Yeah, I've seen the patches. Sorry for not getting to you earlier. > > > > > > Then selecting inodes for writeback becomes a list_lru_walk() > > > > variant depending on what needs to be written back (e.g. physical > > > > node, memcg, both, everything that is dirty everywhere, etc). > > > > > > We considered using list_lru to track inodes within a writeback context. > > > This can be implemented as: > > > struct bdi_writeback { > > > struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list > > > struct list_lru b_io_dirty_inodes_lru; > > > ... > > > ... > > > }; > > > By doing this, we would obtain a sharded list of inodes per NUMA node. > > > > I think you've misunderstood Dave's suggestion here. list_lru was given as > > an example of a structure for inspiration. We cannot take it directly as is > > for writeback purposes because we don't want to be sharding based on NUMA > > nodes but rather based on some other (likely FS driven) criteria. > > Well, you might say that, but..... > > ... I was actually thinking of taking the list_lru and abstracting > it's N-way parallelism away from the numa infrastructure. > > The NUMA awareness of the list_lru is largely in external APIs. Th > eonly implicit NUMA awareness is in the list_lru_add() function > where it converts the object being added to the list to a node ID > based on where it is physically located in memory. > > The only other thing that is NUMA specific is that the list is set > up with N-way concurrency when N = the number of NUMA nodes in the > machine. > > So, really, it is just thin veneer of NUMA wrapped around the > inherent concurrency built into the structure. > > IOWs, when we create a list_lru for a numa aware shrinker, we simply > use the number of nodes as the N-way parallelism for the list, > and the existing per-node infrastructure simply feeds the right > numa node ID as the "list index" for it to function as is. 
> > In the case of writeback parallelism, we could create a list_lru > with the number of AGs as the N-way parallism for the list, and then > have the concurrent BDI writeback context (1 per AG) simply provide > the AG number as the "list index" for writeback.... > If we go ahead with Jan's suggestion, each AG will have a separate bdi_writeback_context. Each bdi_writeback_context has its own b_dirty inode list. This naturally partitions inodes per AG. Given that, do we still need AG based sharded list_lru? Am I missing something here? [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-13 20:22 ` Jan Kara 2025-03-18 3:42 ` Dave Chinner @ 2025-03-18 6:41 ` Kundan Kumar 2025-03-18 11:37 ` Anuj Gupta 2 siblings, 0 replies; 22+ messages in thread From: Kundan Kumar @ 2025-03-18 6:41 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 2928 bytes --] >> > Then selecting inodes for writeback becomes a list_lru_walk() >> > variant depending on what needs to be written back (e.g. physical >> > node, memcg, both, everything that is dirty everywhere, etc). >> >> We considered using list_lru to track inodes within a writeback context. >> This can be implemented as: >> struct bdi_writeback { >> struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list >> struct list_lru b_io_dirty_inodes_lru; >> ... >> ... >> }; >> By doing this, we would obtain a sharded list of inodes per NUMA node. > >I think you've misunderstood Dave's suggestion here. list_lru was given as >an example of a structure for inspiration. We cannot take it directly as is >for writeback purposes because we don't want to be sharding based on NUMA >nodes but rather based on some other (likely FS driven) criteria. Makes sense. Thanks for the clarification. >I was thinking about how to best parallelize the writeback and I think >there are two quite different demands for which we probably want two >different levels of parallelism. > >One case is the situation when the filesystem for example has multiple >underlying devices (like btrfs or bcachefs) or for other reasons writeback >to different parts is fairly independent (like for different XFS AGs). Here >we want parallelism at rather high level I think including separate >dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like >separate bdis (struct backing_dev_info) but I think it would be technically >and also conceptually somewhat easier to do the multiplexing by factoring >out: > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > struct list_head wb_list; /* list of all wbs */ >#ifdef CONFIG_CGROUP_WRITEBACK > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ >#endif > wait_queue_head_t wb_waitq; > >into a new structure (looking for a good name - bdi_writeback_context???) >that can get multiplied (filesystem can create its own bdi on mount and >configure there number of bdi_writeback_contexts it wants). We also need to >add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() >which will return appropriate bdi_writeback_context (or perhaps just it's >index?) for an inode. This will be used by the filesystem to direct >writeback code where the inode should go. This is kind of what Kundan did >in the last revision of his patches but I hope this approach should >somewhat limit the changes necessary to writeback infrastructure - the This looks much better than the data structures we had in previous version. I will prepare a new version based on this feedback. >patch 2 in his series is really unreviewably large... I agree. Sorry, will try to streamline the patches in a better fashion in the next iteration. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-13 20:22 ` Jan Kara 2025-03-18 3:42 ` Dave Chinner 2025-03-18 6:41 ` Kundan Kumar @ 2025-03-18 11:37 ` Anuj Gupta 2025-03-19 15:54 ` Jan Kara 2 siblings, 1 reply; 22+ messages in thread From: Anuj Gupta @ 2025-03-18 11:37 UTC (permalink / raw) To: Jan Kara Cc: Kundan Kumar, Dave Chinner, Christoph Hellwig, lsf-pc, linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 38401 bytes --] > I was thinking about how to best parallelize the writeback and I think > there are two quite different demands for which we probably want two > different levels of parallelism. > > One case is the situation when the filesystem for example has multiple > underlying devices (like btrfs or bcachefs) or for other reasons writeback > to different parts is fairly independent (like for different XFS AGs). Here > we want parallelism at rather high level I think including separate > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like > separate bdis (struct backing_dev_info) but I think it would be technically > and also conceptually somewhat easier to do the multiplexing by factoring > out: > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > struct list_head wb_list; /* list of all wbs */ > #ifdef CONFIG_CGROUP_WRITEBACK > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > #endif > wait_queue_head_t wb_waitq; > > into a new structure (looking for a good name - bdi_writeback_context???) > that can get multiplied (filesystem can create its own bdi on mount and > configure there number of bdi_writeback_contexts it wants). We also need to > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() > which will return appropriate bdi_writeback_context (or perhaps just it's > index?) for an inode. This will be used by the filesystem to direct > writeback code where the inode should go. This is kind of what Kundan did > in the last revision of his patches but I hope this approach should > somewhat limit the changes necessary to writeback infrastructure - the > patch 2 in his series is really unreviewably large... Thanks Jan. I tried to modify the existing infra based on your suggestion [1]. This only takes care of the first case that you mentioned. I am yet to start to evaluate and implement the second case (when amount of cpu work is high). The patch is large, because it requires changing all the places that uses bdi's fields that have now been moved to a new structure. I will try to streamline it further. I have omitted the xfs side plumbing for now. Can you please see if this aligns with the scheme that you had in mind? If the infra looks fine, I can plumb XFS changes on top of it. Note: This patch is only compile tested, haven't run any tests on it. It is on top of v6.13 https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v6.13 [1] Subject: [PATCH] writeback: add infra for parallel writeback Introduce a new bdi_writeback_ctx structure that enables us to have multiple writeback contexts for parallel writeback. Existing single threaded writeback will use default_ctx. Filesystem now have the option to create there own bdi aind populate multiple bdi_writeback_ctx in bdi's bdi_wb_ctx_arr (xarray for now, but plan to move to use a better structure like list_lru). 
Introduce a new hook get_inode_wb_ctx() in super_operations which takes inode as a parameter and returns the bdi_wb_ctx corresponding to the inode. Modify all the functions/places that operate on bdi's wb, wb_list, cgwb_tree, wb_switch_rwsem, wb_waitq as these fields have now been moved to bdi_writeback_ctx Store a reference to bdi_wb_ctx in bdi_writeback. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> --- fs/fs-writeback.c | 170 +++++++++++++++++------- include/linux/backing-dev-defs.h | 34 +++-- include/linux/backing-dev.h | 48 +++++-- include/linux/fs.h | 1 + mm/backing-dev.c | 218 +++++++++++++++++++++++-------- mm/page-writeback.c | 5 +- 6 files changed, 345 insertions(+), 131 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 3cd99e2dc6ac..4c47c298f174 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -265,23 +265,27 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) { struct backing_dev_info *bdi = inode_to_bdi(inode); struct bdi_writeback *wb = NULL; + struct bdi_writeback_ctx *bdi_writeback_ctx = + fetch_bdi_writeback_ctx(inode); if (inode_cgwb_enabled(inode)) { struct cgroup_subsys_state *memcg_css; if (folio) { memcg_css = mem_cgroup_css_from_folio(folio); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, + GFP_ATOMIC); } else { /* must pin memcg_css, see wb_get_create() */ memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, + GFP_ATOMIC); css_put(memcg_css); } } if (!wb) - wb = &bdi->wb; + wb = &bdi_writeback_ctx->wb; /* * There may be multiple instances of this function racing to @@ -302,12 +306,15 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) static void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb) { + struct bdi_writeback_ctx *bdi_wb_ctx; + assert_spin_locked(&wb->list_lock); assert_spin_locked(&inode->i_lock); WARN_ON_ONCE(inode->i_state & I_FREEING); inode->i_state &= ~I_SYNC_QUEUED; - if (wb != &wb->bdi->wb) + bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); + if (wb != &bdi_wb_ctx->wb) list_move(&inode->i_io_list, &wb->b_attached); else list_del_init(&inode->i_io_list); @@ -382,14 +389,14 @@ struct inode_switch_wbs_context { struct inode *inodes[]; }; -static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) +static void bdi_wb_ctx_down_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { - down_write(&bdi->wb_switch_rwsem); + down_write(&bdi_wb_ctx->wb_switch_rwsem); } -static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) +static void bdi_wb_ctx_up_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { - up_write(&bdi->wb_switch_rwsem); + up_write(&bdi_wb_ctx->wb_switch_rwsem); } static bool inode_do_switch_wbs(struct inode *inode, @@ -490,7 +497,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) { struct inode_switch_wbs_context *isw = container_of(to_rcu_work(work), struct inode_switch_wbs_context, work); - struct backing_dev_info *bdi = inode_to_bdi(isw->inodes[0]); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(isw->inodes[0]); struct bdi_writeback *old_wb = isw->inodes[0]->i_wb; struct bdi_writeback *new_wb = isw->new_wb; unsigned long nr_switched = 0; @@ -500,7 +507,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) * If 
@inode switches cgwb membership while sync_inodes_sb() is * being issued, sync_inodes_sb() might miss it. Synchronize. */ - down_read(&bdi->wb_switch_rwsem); + down_read(&bdi_wb_ctx->wb_switch_rwsem); /* * By the time control reaches here, RCU grace period has passed @@ -529,7 +536,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) spin_unlock(&new_wb->list_lock); spin_unlock(&old_wb->list_lock); - up_read(&bdi->wb_switch_rwsem); + up_read(&bdi_wb_ctx->wb_switch_rwsem); if (nr_switched) { wb_wakeup(new_wb); @@ -583,6 +590,7 @@ static bool inode_prepare_wbs_switch(struct inode *inode, static void inode_switch_wbs(struct inode *inode, int new_wb_id) { struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); struct cgroup_subsys_state *memcg_css; struct inode_switch_wbs_context *isw; @@ -609,7 +617,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) if (!memcg_css) goto out_free; - isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + isw->new_wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, GFP_ATOMIC); css_put(memcg_css); if (!isw->new_wb) goto out_free; @@ -678,12 +686,12 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) for (memcg_css = wb->memcg_css->parent; memcg_css; memcg_css = memcg_css->parent) { - isw->new_wb = wb_get_create(wb->bdi, memcg_css, GFP_KERNEL); + isw->new_wb = wb_get_create(wb->bdi, wb->bdi_wb_ctx, memcg_css, GFP_KERNEL); if (isw->new_wb) break; } if (unlikely(!isw->new_wb)) - isw->new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ + isw->new_wb = &wb->bdi_wb_ctx->wb; /* wb_get() is noop for bdi's wb */ nr = 0; spin_lock(&wb->list_lock); @@ -995,17 +1003,18 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) */ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, struct wb_writeback_work *base_work, - bool skip_if_busy) + bool skip_if_busy, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct bdi_writeback *last_wb = NULL; - struct bdi_writeback *wb = list_entry(&bdi->wb_list, + struct bdi_writeback *wb = list_entry(&bdi_wb_ctx->wb_list, struct bdi_writeback, bdi_node); might_sleep(); restart: rcu_read_lock(); - list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) { - DEFINE_WB_COMPLETION(fallback_work_done, bdi); + list_for_each_entry_continue_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { + DEFINE_WB_COMPLETION(fallback_work_done, bdi_wb_ctx); struct wb_writeback_work fallback_work; struct wb_writeback_work *work; long nr_pages; @@ -1103,7 +1112,17 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, * And find the associated wb. If the wb isn't there already * there's nothing to flush, don't create one. 
*/ - wb = wb_get_lookup(bdi, memcg_css); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); + if (wb) + break; + } + } else { + wb = wb_get_lookup(&bdi->default_ctx, memcg_css); + } if (!wb) { ret = -ENOENT; goto out_css_put; @@ -1189,8 +1208,8 @@ fs_initcall(cgroup_writeback_init); #else /* CONFIG_CGROUP_WRITEBACK */ -static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } -static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } +static void bdi_wb_ctx_down_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { } +static void bdi_wb_ctx_up_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { } static void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb) @@ -1232,13 +1251,14 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, struct wb_writeback_work *base_work, - bool skip_if_busy) + bool skip_if_busy, + struct bdi_writeback_ctx *bdi_wb_ctx) { might_sleep(); - if (!skip_if_busy || !writeback_in_progress(&bdi->wb)) { + if (!skip_if_busy || !writeback_in_progress(&bdi_wb_ctx->wb)) { base_work->auto_free = 0; - wb_queue_work(&bdi->wb, base_work); + wb_queue_work(&bdi_wb_ctx->wb, base_work); } } @@ -2371,8 +2391,18 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, if (!bdi_has_dirty_io(bdi)) return; - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) - wb_start_writeback(wb, reason); + /* traverse all the bdi_wb_ctx of the bdi */ + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) + wb_start_writeback(wb, reason); + } + } else { + list_for_each_entry_rcu(wb, &bdi->default_ctx.wb_list, bdi_node) + wb_start_writeback(wb, reason); + } } void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, @@ -2427,9 +2457,19 @@ static void wakeup_dirtytime_writeback(struct work_struct *w) list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { struct bdi_writeback *wb; - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) - if (!list_empty(&wb->b_dirty_time)) - wb_wakeup(wb); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) + if (!list_empty(&wb->b_dirty_time)) + wb_wakeup(wb); + } + } else { + list_for_each_entry_rcu(wb, &bdi->default_ctx.wb_list, bdi_node) + if (!list_empty(&wb->b_dirty_time)) + wb_wakeup(wb); + } } rcu_read_unlock(); schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ); @@ -2713,11 +2753,12 @@ static void wait_sb_inodes(struct super_block *sb) mutex_unlock(&sb->s_sync_lock); } -static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, - enum wb_reason reason, bool skip_if_busy) +static void __writeback_inodes_sb_nr_bdi_wb_ctx(struct super_block *sb, + unsigned long nr, enum wb_reason reason, + bool skip_if_busy, + struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi); + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_NONE, @@ -2727,12 +2768,29 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, .reason = reason, }; + bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy, 
bdi_wb_ctx); + wb_wait_for_completion(&done); +} + +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, + enum wb_reason reason, bool skip_if_busy) +{ + struct backing_dev_info *bdi = sb->s_bdi; + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); - wb_wait_for_completion(&done); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, + skip_if_busy, bdi_wb_ctx); + } else { + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, + skip_if_busy, &bdi->default_ctx); + } } /** @@ -2785,17 +2843,10 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) } EXPORT_SYMBOL(try_to_writeback_inodes_sb); -/** - * sync_inodes_sb - sync sb inode pages - * @sb: the superblock - * - * This function writes and waits on any dirty inode belonging to this - * super_block. - */ -void sync_inodes_sb(struct super_block *sb) +static void sync_inodes_bdi_wb_ctx(struct super_block *sb, + struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi); + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_ALL, @@ -2805,6 +2856,22 @@ void sync_inodes_sb(struct super_block *sb) .reason = WB_REASON_SYNC, .for_sync = 1, }; + bdi_wb_ctx_down_write_wb_switch_rwsem(bdi_wb_ctx); + bdi_split_work_to_wbs(bdi, &work, false, bdi_wb_ctx); + wb_wait_for_completion(&done); + bdi_wb_ctx_up_write_wb_switch_rwsem(bdi_wb_ctx); +} + +/** + * sync_inodes_sb - sync sb inode pages + * @sb: the superblock + * + * This function writes and waits on any dirty inode belonging to this + * super_block. + */ +void sync_inodes_sb(struct super_block *sb) +{ + struct backing_dev_info *bdi = sb->s_bdi; /* * Can't skip on !bdi_has_dirty() because we should wait for !dirty @@ -2815,12 +2882,15 @@ void sync_inodes_sb(struct super_block *sb) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ - bdi_down_write_wb_switch_rwsem(bdi); - bdi_split_work_to_wbs(bdi, &work, false); - wb_wait_for_completion(&done); - bdi_up_write_wb_switch_rwsem(bdi); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx); + } + } else { + sync_inodes_bdi_wb_ctx(sb, bdi, &bdi->default_ctx); + } wait_sb_inodes(sb); } EXPORT_SYMBOL(sync_inodes_sb); diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 2ad261082bba..d00ae5b55f33 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -75,10 +75,10 @@ struct wb_completion { * can wait for the completion of all using wb_wait_for_completion(). Work * items which are waited upon aren't freed automatically on completion. 
*/ -#define WB_COMPLETION_INIT(bdi) __WB_COMPLETION_INIT(&(bdi)->wb_waitq) +#define WB_COMPLETION_INIT(bdi_wb_ctx) __WB_COMPLETION_INIT(&(bdi_wb_ctx)->wb_waitq) -#define DEFINE_WB_COMPLETION(cmpl, bdi) \ - struct wb_completion cmpl = WB_COMPLETION_INIT(bdi) +#define DEFINE_WB_COMPLETION(cmpl, bdi_wb_ctx) \ + struct wb_completion cmpl = WB_COMPLETION_INIT(bdi_wb_ctx) /* * Each wb (bdi_writeback) can perform writeback operations, is measured @@ -104,6 +104,7 @@ struct wb_completion { */ struct bdi_writeback { struct backing_dev_info *bdi; /* our parent bdi */ + struct bdi_writeback_ctx *bdi_wb_ctx; unsigned long state; /* Always use atomic bitops on this */ unsigned long last_old_flush; /* last old data flush */ @@ -160,6 +161,16 @@ struct bdi_writeback { #endif }; +struct bdi_writeback_ctx { + struct bdi_writeback wb; /* the root writeback info for this bdi */ + struct list_head wb_list; /* list of all wbs */ +#ifdef CONFIG_CGROUP_WRITEBACK + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ + struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ +#endif + wait_queue_head_t wb_waitq; +}; + struct backing_dev_info { u64 id; struct rb_node rb_node; /* keyed by ->id */ @@ -182,16 +193,13 @@ struct backing_dev_info { * blk-wbt. */ unsigned long last_bdp_sleep; - - struct bdi_writeback wb; /* the root writeback info for this bdi */ - struct list_head wb_list; /* list of all wbs */ + struct bdi_writeback_ctx default_ctx; + bool is_parallel; + int nr_wb_ctx; + struct xarray bdi_wb_ctx_arr; #ifdef CONFIG_CGROUP_WRITEBACK - struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */ - struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ #endif - wait_queue_head_t wb_waitq; - struct device *dev; char dev_name[64]; struct device *owner; @@ -216,7 +224,7 @@ struct wb_lock_cookie { */ static inline bool wb_tryget(struct bdi_writeback *wb) { - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) return percpu_ref_tryget(&wb->refcnt); return true; } @@ -227,7 +235,7 @@ static inline bool wb_tryget(struct bdi_writeback *wb) */ static inline void wb_get(struct bdi_writeback *wb) { - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) percpu_ref_get(&wb->refcnt); } @@ -246,7 +254,7 @@ static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr) return; } - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) percpu_ref_put_many(&wb->refcnt, nr); } diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 8e7af9a03b41..e24cd3fd42b4 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -148,11 +148,29 @@ static inline bool mapping_can_writeback(struct address_space *mapping) return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK; } +#define for_each_bdi_wb_ctx(bdi, wb_ctx) \ + for (int i = 0; (i) < (bdi)->nr_wb_ctx \ + && ((wb_ctx) = xa_load(&(bdi)->bdi_wb_ctx_arr, i), 1); (i)++) + +static inline struct bdi_writeback_ctx *fetch_bdi_writeback_ctx(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct super_block *sb = inode->i_sb; + struct bdi_writeback_ctx *bdi_wb_ctx; + + if (sb->s_op->get_inode_wb_ctx) + bdi_wb_ctx = sb->s_op->get_inode_wb_ctx(inode); + else + bdi_wb_ctx = &bdi->default_ctx; + return bdi_wb_ctx; +} + #ifdef CONFIG_CGROUP_WRITEBACK -struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, +struct bdi_writeback *wb_get_lookup(struct 
bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css); struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp); void wb_memcg_offline(struct mem_cgroup *memcg); @@ -187,16 +205,17 @@ static inline bool inode_cgwb_enabled(struct inode *inode) * Must be called under rcu_read_lock() which protects the returend wb. * NULL if not found. */ -static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct cgroup_subsys_state *memcg_css; struct bdi_writeback *wb; memcg_css = task_css(current, memory_cgrp_id); if (!memcg_css->parent) - return &bdi->wb; + return &bdi_wb_ctx->wb; - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); /* * %current's blkcg equals the effective blkcg of its memcg. No @@ -217,12 +236,13 @@ static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi * wb_find_current(). */ static inline struct bdi_writeback * -wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +wb_get_create_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, gfp_t gfp) { struct bdi_writeback *wb; rcu_read_lock(); - wb = wb_find_current(bdi); + wb = wb_find_current(bdi, bdi_wb_ctx); if (wb && unlikely(!wb_tryget(wb))) wb = NULL; rcu_read_unlock(); @@ -231,7 +251,7 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) struct cgroup_subsys_state *memcg_css; memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, gfp); + wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, gfp); css_put(memcg_css); } return wb; @@ -264,7 +284,7 @@ static inline struct bdi_writeback *inode_to_wb_wbc( * If wbc does not have inode attached, it means cgroup writeback was * disabled when wbc started. Just use the default wb in that case. */ - return wbc->wb ? wbc->wb : &inode_to_bdi(inode)->wb; + return wbc->wb ? 
wbc->wb : &fetch_bdi_writeback_ctx(inode)->wb; } /** @@ -324,20 +344,22 @@ static inline bool inode_cgwb_enabled(struct inode *inode) return false; } -static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { - return &bdi->wb; + return &bdi_wb_ctx->wb; } static inline struct bdi_writeback * -wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +wb_get_create_current(struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx, + gfp_t gfp) { - return &bdi->wb; + return &bdi_wb_ctx->wb; } static inline struct bdi_writeback *inode_to_wb(struct inode *inode) { - return &inode_to_bdi(inode)->wb; + return &fetch_bdi_writeback_ctx(inode)->wb; } static inline struct bdi_writeback *inode_to_wb_wbc( diff --git a/include/linux/fs.h b/include/linux/fs.h index 7e29433c5ecc..667f97c68cd1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2228,6 +2228,7 @@ struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); void (*free_inode)(struct inode *); + struct bdi_writeback_ctx* (*get_inode_wb_ctx)(struct inode *); void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, struct writeback_control *wbc); diff --git a/mm/backing-dev.c b/mm/backing-dev.c index e61bbb1bd622..f904994efba1 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -84,26 +84,48 @@ static void collect_wb_stats(struct wb_stats *stats, } #ifdef CONFIG_CGROUP_WRITEBACK -static void bdi_collect_stats(struct backing_dev_info *bdi, - struct wb_stats *stats) + +static void bdi_wb_ctx_collect_stats(struct bdi_writeback_ctx *bdi_wb_ctx, + struct wb_stats *stats) { struct bdi_writeback *wb; - rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { if (!wb_tryget(wb)) continue; collect_wb_stats(stats, wb); wb_put(wb); } +} +static void bdi_collect_stats(struct backing_dev_info *bdi, + struct wb_stats *stats) +{ + rcu_read_lock(); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + bdi_wb_ctx_collect_stats(bdi_wb_ctx, stats); + } + } else { + bdi_wb_ctx_collect_stats(&bdi->default_ctx, stats); + } rcu_read_unlock(); } #else static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { - collect_wb_stats(stats, &bdi->wb); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + collect_wb_stats(stats, &bdi_wb_ctx->wb); + } + } else { + collect_wb_stats(stats, &bdi->default_ctx.wb); + } } #endif @@ -135,8 +157,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "b_io: %10lu\n" "b_more_io: %10lu\n" "b_dirty_time: %10lu\n" - "bdi_list: %10u\n" - "state: %10lx\n", + "bdi_list: %10u\n", + /* TBD: what to do for state */ + /* "state: %10lx\n", */ K(stats.nr_writeback), K(stats.nr_reclaimable), K(stats.wb_thresh), @@ -149,7 +172,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) stats.nr_io, stats.nr_more_io, stats.nr_dirty_time, - !list_empty(&bdi->bdi_list), bdi->wb.state); + !list_empty(&bdi->bdi_list));/* bdi->wb.state); */ return 0; } @@ -190,17 +213,17 @@ static void wb_stats_show(struct seq_file *m, struct bdi_writeback *wb, wb->state); } -static int cgwb_debug_stats_show(struct seq_file *m, void *v) + +static int 
cgwb_debug_stats_bdi_wb_ctx(struct seq_file *m, + struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = m->private; + struct bdi_writeback *wb; unsigned long background_thresh; unsigned long dirty_thresh; - struct bdi_writeback *wb; global_dirty_limits(&background_thresh, &dirty_thresh); - rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { struct wb_stats stats = { .dirty_thresh = dirty_thresh }; if (!wb_tryget(wb)) @@ -224,6 +247,23 @@ static int cgwb_debug_stats_show(struct seq_file *m, void *v) wb_put(wb); } + return 0; +} + +static int cgwb_debug_stats_show(struct seq_file *m, void *v) +{ + struct backing_dev_info *bdi = m->private; + + rcu_read_lock(); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + cgwb_debug_stats_bdi_wb_ctx(m, bdi_wb_ctx); + } + } else { + cgwb_debug_stats_bdi_wb_ctx(m, &bdi->default_ctx); + } rcu_read_unlock(); return 0; @@ -520,6 +560,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, memset(wb, 0, sizeof(*wb)); wb->bdi = bdi; + wb->bdi_wb_ctx = &bdi->default_ctx; wb->last_old_flush = jiffies; INIT_LIST_HEAD(&wb->b_dirty); INIT_LIST_HEAD(&wb->b_io); @@ -643,11 +684,11 @@ static void cgwb_release(struct percpu_ref *refcnt) queue_work(cgwb_release_wq, &wb->release_work); } -static void cgwb_kill(struct bdi_writeback *wb) +static void cgwb_kill(struct bdi_writeback *wb, struct bdi_writeback_ctx *bdi_wb_ctx) { lockdep_assert_held(&cgwb_lock); - WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); + WARN_ON(!radix_tree_delete(&bdi_wb_ctx->cgwb_tree, wb->memcg_css->id)); list_del(&wb->memcg_node); list_del(&wb->blkcg_node); list_add(&wb->offline_node, &offline_cgwbs); @@ -662,6 +703,7 @@ static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) } static int cgwb_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { struct mem_cgroup *memcg; @@ -678,9 +720,9 @@ static int cgwb_create(struct backing_dev_info *bdi, /* look up again under lock and discard on blkcg mismatch */ spin_lock_irqsave(&cgwb_lock, flags); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); if (wb && wb->blkcg_css != blkcg_css) { - cgwb_kill(wb); + cgwb_kill(wb, bdi_wb_ctx); wb = NULL; } spin_unlock_irqrestore(&cgwb_lock, flags); @@ -721,12 +763,12 @@ static int cgwb_create(struct backing_dev_info *bdi, */ ret = -ENODEV; spin_lock_irqsave(&cgwb_lock, flags); - if (test_bit(WB_registered, &bdi->wb.state) && + if (test_bit(WB_registered, &bdi_wb_ctx->wb.state) && blkcg_cgwb_list->next && memcg_cgwb_list->next) { /* we might have raced another instance of this function */ - ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb); + ret = radix_tree_insert(&bdi_wb_ctx->cgwb_tree, memcg_css->id, wb); if (!ret) { - list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list); + list_add_tail_rcu(&wb->bdi_node, &bdi_wb_ctx->wb_list); list_add(&wb->memcg_node, memcg_cgwb_list); list_add(&wb->blkcg_node, blkcg_cgwb_list); blkcg_pin_online(blkcg_css); @@ -779,16 +821,16 @@ static int cgwb_create(struct backing_dev_info *bdi, * each lookup. On mismatch, the existing wb is discarded and a new one is * created. 
*/ -struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, +struct bdi_writeback *wb_get_lookup(struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css) { struct bdi_writeback *wb; if (!memcg_css->parent) - return &bdi->wb; + return &bdi_wb_ctx->wb; rcu_read_lock(); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); if (wb) { struct cgroup_subsys_state *blkcg_css; @@ -813,6 +855,7 @@ struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, * create one. See wb_get_lookup() for more details. */ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { @@ -821,45 +864,67 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, might_alloc(gfp); do { - wb = wb_get_lookup(bdi, memcg_css); - } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); + } while (!wb && !cgwb_create(bdi, bdi_wb_ctx, memcg_css, gfp)); return wb; } +static int cgwb_bdi_wb_ctx_init(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) +{ + int ret; + + INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC); + init_rwsem(&bdi_wb_ctx->wb_switch_rwsem); + ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + /* better error handling */ + if (!ret) { + bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css; + bdi_wb_ctx->wb.blkcg_css = blkcg_root_css; + } + return ret; +} + static int cgwb_bdi_init(struct backing_dev_info *bdi) { int ret; - INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC); mutex_init(&bdi->cgwb_release_mutex); - init_rwsem(&bdi->wb_switch_rwsem); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; - ret = wb_init(&bdi->wb, bdi, GFP_KERNEL); - if (!ret) { - bdi->wb.memcg_css = &root_mem_cgroup->css; - bdi->wb.blkcg_css = blkcg_root_css; + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + ret = cgwb_bdi_wb_ctx_init(bdi, bdi_wb_ctx); + if (ret) + return ret; + } + } else { + cgwb_bdi_wb_ctx_init(bdi, &bdi->default_ctx); } + return ret; } -static void cgwb_bdi_unregister(struct backing_dev_info *bdi) +/* callers should create a loop and pass bdi_wb_ctx */ +static void cgwb_bdi_unregister(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct radix_tree_iter iter; void **slot; struct bdi_writeback *wb; - WARN_ON(test_bit(WB_registered, &bdi->wb.state)); + WARN_ON(test_bit(WB_registered, &bdi_wb_ctx->wb.state)); spin_lock_irq(&cgwb_lock); - radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) - cgwb_kill(*slot); + radix_tree_for_each_slot(slot, &bdi_wb_ctx->cgwb_tree, &iter, 0) + cgwb_kill(*slot, bdi_wb_ctx); spin_unlock_irq(&cgwb_lock); mutex_lock(&bdi->cgwb_release_mutex); spin_lock_irq(&cgwb_lock); - while (!list_empty(&bdi->wb_list)) { - wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, + while (!list_empty(&bdi_wb_ctx->wb_list)) { + wb = list_first_entry(&bdi_wb_ctx->wb_list, struct bdi_writeback, bdi_node); spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); @@ -930,7 +995,7 @@ void wb_memcg_offline(struct mem_cgroup *memcg) spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node) - cgwb_kill(wb); + cgwb_kill(wb, wb->bdi_wb_ctx); memcg_cgwb_list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); @@ -950,15 +1015,16 @@ void wb_blkcg_offline(struct cgroup_subsys_state *css) spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, list, blkcg_node) - 
cgwb_kill(wb); + cgwb_kill(wb, wb->bdi_wb_ctx); list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); } -static void cgwb_bdi_register(struct backing_dev_info *bdi) +static void cgwb_bdi_register(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { spin_lock_irq(&cgwb_lock); - list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); + list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list); spin_unlock_irq(&cgwb_lock); } @@ -981,14 +1047,31 @@ subsys_initcall(cgwb_init); static int cgwb_bdi_init(struct backing_dev_info *bdi) { - return wb_init(&bdi->wb, bdi, GFP_KERNEL); + int ret; + + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + /* better error handling */ + if (!ret) + return ret; + } + } else { + ret = wb_init(&bdi->default_ctx.wb, bdi, GFP_KERNEL); + } + return ret; } -static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { } +static void cgwb_bdi_unregister(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { } -static void cgwb_bdi_register(struct backing_dev_info *bdi) +/* callers should create a loop and pass bdi_wb_ctx */ +static void cgwb_bdi_register(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { - list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); + list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list); } static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) @@ -1003,12 +1086,14 @@ int bdi_init(struct backing_dev_info *bdi) bdi->dev = NULL; kref_init(&bdi->refcnt); + bdi->nr_wb_ctx = 1; + bdi->is_parallel = false; bdi->min_ratio = 0; bdi->max_ratio = 100 * BDI_RATIO_SCALE; bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); - INIT_LIST_HEAD(&bdi->wb_list); - init_waitqueue_head(&bdi->wb_waitq); + INIT_LIST_HEAD(&bdi->default_ctx.wb_list); + init_waitqueue_head(&bdi->default_ctx.wb_waitq); bdi->last_bdp_sleep = jiffies; return cgwb_bdi_init(bdi); @@ -1095,11 +1180,20 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) if (IS_ERR(dev)) return PTR_ERR(dev); - cgwb_bdi_register(bdi); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + cgwb_bdi_register(bdi, bdi_wb_ctx); + set_bit(WB_registered, &bdi_wb_ctx->wb.state); + } + } else { + cgwb_bdi_register(bdi, &bdi->default_ctx); + set_bit(WB_registered, &bdi->default_ctx.wb.state); + } bdi->dev = dev; bdi_debug_register(bdi, dev_name(dev)); - set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); @@ -1155,8 +1249,17 @@ void bdi_unregister(struct backing_dev_info *bdi) /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - wb_shutdown(&bdi->wb); - cgwb_bdi_unregister(bdi); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + wb_shutdown(&bdi_wb_ctx->wb); + cgwb_bdi_unregister(bdi, bdi_wb_ctx); + } + } else { + wb_shutdown(&bdi->default_ctx.wb); + cgwb_bdi_unregister(bdi, &bdi->default_ctx); + } /* * If this BDI's min ratio has been set, use bdi_set_min_ratio() to @@ -1183,9 +1286,18 @@ static void release_bdi(struct kref *ref) struct backing_dev_info *bdi = container_of(ref, struct backing_dev_info, refcnt); - WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb.state)); WARN_ON_ONCE(bdi->dev); - wb_exit(&bdi->wb); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, 
bdi_wb_ctx) { + WARN_ON_ONCE(test_bit(WB_registered, &bdi_wb_ctx->wb.state)); + wb_exit(&bdi_wb_ctx->wb); + } + } else { + WARN_ON_ONCE(test_bit(WB_registered, &bdi->default_ctx.wb.state)); + wb_exit(&bdi->default_ctx.wb); + } kfree(bdi); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d9861e42b2bd..51420474b4cf 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2101,6 +2101,7 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, { struct inode *inode = mapping->host; struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); struct bdi_writeback *wb = NULL; int ratelimit; int ret = 0; @@ -2110,9 +2111,9 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, return ret; if (inode_cgwb_enabled(inode)) - wb = wb_get_create_current(bdi, GFP_KERNEL); + wb = wb_get_create_current(bdi, bdi_wb_ctx, GFP_KERNEL); if (!wb) - wb = &bdi->wb; + wb = &bdi_wb_ctx->wb; ratelimit = current->nr_dirtied_pause; if (wb->dirty_exceeded) -- 2.25.1 [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
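Since the filesystem-side plumbing is omitted from the patch above, here is one hypothetical way the proposed ->get_inode_wb_ctx() hook could be filled in by a filesystem wanting per-AG writeback contexts, reusing the bdi_wb_ctx_arr and nr_wb_ctx fields introduced by the patch. The mapping through XFS_INO_TO_AGNO() and the assumption that the mount path stored one bdi_writeback_ctx per AG, indexed by AG number, are illustrative choices and not part of the posted code:

/*
 * Hypothetical addition to fs/xfs/xfs_super.c (usual xfs_*.h includes
 * assumed): map the inode to its allocation group and return the
 * writeback context that was set up for that AG at mount time.
 */
static struct bdi_writeback_ctx *xfs_get_inode_wb_ctx(struct inode *inode)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	struct backing_dev_info	*bdi = inode_to_bdi(inode);
	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);

	/* one ctx per AG was stored in bdi->bdi_wb_ctx_arr at mount */
	return xa_load(&bdi->bdi_wb_ctx_arr, agno % bdi->nr_wb_ctx);
}

The hook would then be wired up as .get_inode_wb_ctx = xfs_get_inode_wb_ctx in xfs_super_operations, with the mount path creating the bdi with nr_wb_ctx equal to the AG count before registration.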
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-18 11:37 ` Anuj Gupta @ 2025-03-19 15:54 ` Jan Kara 2025-03-20 7:08 ` Anuj Gupta 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2025-03-19 15:54 UTC (permalink / raw) To: Anuj Gupta Cc: Jan Kara, Kundan Kumar, Dave Chinner, Christoph Hellwig, lsf-pc, linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue 18-03-25 17:07:12, Anuj Gupta wrote: > > I was thinking about how to best parallelize the writeback and I think > > there are two quite different demands for which we probably want two > > different levels of parallelism. > > > > One case is the situation when the filesystem for example has multiple > > underlying devices (like btrfs or bcachefs) or for other reasons writeback > > to different parts is fairly independent (like for different XFS AGs). Here > > we want parallelism at rather high level I think including separate > > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like > > separate bdis (struct backing_dev_info) but I think it would be technically > > and also conceptually somewhat easier to do the multiplexing by factoring > > out: > > > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > > struct list_head wb_list; /* list of all wbs */ > > #ifdef CONFIG_CGROUP_WRITEBACK > > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > > #endif > > wait_queue_head_t wb_waitq; > > > > into a new structure (looking for a good name - bdi_writeback_context???) > > that can get multiplied (filesystem can create its own bdi on mount and > > configure there number of bdi_writeback_contexts it wants). We also need to > > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() > > which will return appropriate bdi_writeback_context (or perhaps just it's > > index?) for an inode. This will be used by the filesystem to direct > > writeback code where the inode should go. This is kind of what Kundan did > > in the last revision of his patches but I hope this approach should > > somewhat limit the changes necessary to writeback infrastructure - the > > patch 2 in his series is really unreviewably large... > > Thanks Jan. > > I tried to modify the existing infra based on your suggestion [1]. This > only takes care of the first case that you mentioned. I am yet to start > to evaluate and implement the second case (when amount of cpu work is > high). The patch is large, because it requires changing all the places > that uses bdi's fields that have now been moved to a new structure. I > will try to streamline it further. > > I have omitted the xfs side plumbing for now. > Can you please see if this aligns with the scheme that you had in mind? > If the infra looks fine, I can plumb XFS changes on top of it. > > Note: This patch is only compile tested, haven't run any tests on it. Sure, this is fine for this early prototyping phase. > Subject: [PATCH] writeback: add infra for parallel writeback > > Introduce a new bdi_writeback_ctx structure that enables us to have > multiple writeback contexts for parallel writeback. > > Existing single threaded writeback will use default_ctx. > > Filesystem now have the option to create there own bdi aind populate > multiple bdi_writeback_ctx in bdi's bdi_wb_ctx_arr (xarray for now, but > plan to move to use a better structure like list_lru). 
> > Introduce a new hook get_inode_wb_ctx() in super_operations which takes > inode as a parameter and returns the bdi_wb_ctx corresponding to the > inode. > > Modify all the functions/places that operate on bdi's wb, wb_list, > cgwb_tree, wb_switch_rwsem, wb_waitq as these fields have now been moved > to bdi_writeback_ctx > > Store a reference to bdi_wb_ctx in bdi_writeback. > > Suggested-by: Jan Kara <jack@suse.cz> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Thanks for the patch! Overall I think we are going in the right direction :) Some comments below. > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 3cd99e2dc6ac..4c47c298f174 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -265,23 +265,27 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) > { > struct backing_dev_info *bdi = inode_to_bdi(inode); > struct bdi_writeback *wb = NULL; > + struct bdi_writeback_ctx *bdi_writeback_ctx = > + fetch_bdi_writeback_ctx(inode); ^^^ I'd call this inode_bdi_writeback_ctx(). I think that name tells a bit more. > if (inode_cgwb_enabled(inode)) { > struct cgroup_subsys_state *memcg_css; > > if (folio) { > memcg_css = mem_cgroup_css_from_folio(folio); > - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); > + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, > + GFP_ATOMIC); > } else { > /* must pin memcg_css, see wb_get_create() */ > memcg_css = task_get_css(current, memory_cgrp_id); > - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); > + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, > + GFP_ATOMIC); > css_put(memcg_css); > } > } > > if (!wb) > - wb = &bdi->wb; > + wb = &bdi_writeback_ctx->wb; Perhaps as a preparation patch, can you please rename bdi->wb to bdi->root_wb (just mechanical replacement) and then carry over this name to bdi_writeback_ctx? Because we have too many wbs here already and as the structure is getting more complex, explanatory naming becomes more important. Thanks! > @@ -1103,7 +1112,17 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, > * And find the associated wb. If the wb isn't there already > * there's nothing to flush, don't create one. > */ > - wb = wb_get_lookup(bdi, memcg_css); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > + > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { > + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); > + if (wb) > + break; > + } > + } else { > + wb = wb_get_lookup(&bdi->default_ctx, memcg_css); > + } Why do you introduce this bdi->is_parallel bool? Why don't we unconditionally do: for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { wb = wb_get_lookup(bdi_wb_ctx, memcg_css); if (wb) break; } It would seem more obvious and with less code duplication... It might require a bit of magic inside for_each_bdi_wb_ctx() but I think it pays off. BTW, you should initialize 'wb' before the loop to NULL. Otherwise I expect some compilers to complain and also for readers it makes the test below obviously do the right thing. > if (!wb) { > ret = -ENOENT; > goto out_css_put; ... > @@ -1232,13 +1251,14 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) > > static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > struct wb_writeback_work *base_work, > - bool skip_if_busy) > + bool skip_if_busy, > + struct bdi_writeback_ctx *bdi_wb_ctx) Nit; logical position for bdi_wb_ctx argument would be just after 'bdi' but see below - I think we could pass only bdi_wb_ctx here after some changes. 
> @@ -2713,11 +2753,12 @@ static void wait_sb_inodes(struct super_block *sb) > mutex_unlock(&sb->s_sync_lock); > } > > -static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > - enum wb_reason reason, bool skip_if_busy) > +static void __writeback_inodes_sb_nr_bdi_wb_ctx(struct super_block *sb, > + unsigned long nr, enum wb_reason reason, > + bool skip_if_busy, > + struct bdi_writeback_ctx *bdi_wb_ctx) > { > - struct backing_dev_info *bdi = sb->s_bdi; > - DEFINE_WB_COMPLETION(done, bdi); > + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); > struct wb_writeback_work work = { > .sb = sb, > .sync_mode = WB_SYNC_NONE, > @@ -2727,12 +2768,29 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > .reason = reason, > }; > > + bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy, bdi_wb_ctx); > + wb_wait_for_completion(&done); > +} > + > +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > + enum wb_reason reason, bool skip_if_busy) > +{ > + struct backing_dev_info *bdi = sb->s_bdi; > + > if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) > return; > WARN_ON(!rwsem_is_locked(&sb->s_umount)); > > - bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); > - wb_wait_for_completion(&done); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > + > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) > + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, > + skip_if_busy, bdi_wb_ctx); > + } else { > + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, > + skip_if_busy, &bdi->default_ctx); > + } > } If we drop this 'is_parallel' thing, I think we don't need the new __writeback_inodes_sb_nr_bdi_wb_ctx() function and can keep everything in one function. > > /** > @@ -2785,17 +2843,10 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) > } > EXPORT_SYMBOL(try_to_writeback_inodes_sb); > > -/** > - * sync_inodes_sb - sync sb inode pages > - * @sb: the superblock > - * > - * This function writes and waits on any dirty inode belonging to this > - * super_block. > - */ > -void sync_inodes_sb(struct super_block *sb) > +static void sync_inodes_bdi_wb_ctx(struct super_block *sb, > + struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx) > { > - struct backing_dev_info *bdi = sb->s_bdi; > - DEFINE_WB_COMPLETION(done, bdi); > + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); > struct wb_writeback_work work = { > .sb = sb, > .sync_mode = WB_SYNC_ALL, > @@ -2805,6 +2856,22 @@ void sync_inodes_sb(struct super_block *sb) > .reason = WB_REASON_SYNC, > .for_sync = 1, > }; > + bdi_wb_ctx_down_write_wb_switch_rwsem(bdi_wb_ctx); > + bdi_split_work_to_wbs(bdi, &work, false, bdi_wb_ctx); > + wb_wait_for_completion(&done); > + bdi_wb_ctx_up_write_wb_switch_rwsem(bdi_wb_ctx); > +} > + > +/** > + * sync_inodes_sb - sync sb inode pages > + * @sb: the superblock > + * > + * This function writes and waits on any dirty inode belonging to this > + * super_block. 
> + */ > +void sync_inodes_sb(struct super_block *sb) > +{ > + struct backing_dev_info *bdi = sb->s_bdi; > > /* > * Can't skip on !bdi_has_dirty() because we should wait for !dirty > @@ -2815,12 +2882,15 @@ void sync_inodes_sb(struct super_block *sb) > return; > WARN_ON(!rwsem_is_locked(&sb->s_umount)); > > - /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ > - bdi_down_write_wb_switch_rwsem(bdi); > - bdi_split_work_to_wbs(bdi, &work, false); > - wb_wait_for_completion(&done); > - bdi_up_write_wb_switch_rwsem(bdi); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { > + sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx); > + } > + } else { > + sync_inodes_bdi_wb_ctx(sb, bdi, &bdi->default_ctx); > + } > wait_sb_inodes(sb); > } > EXPORT_SYMBOL(sync_inodes_sb); The same comment as above. > @@ -104,6 +104,7 @@ struct wb_completion { > */ > struct bdi_writeback { > struct backing_dev_info *bdi; /* our parent bdi */ > + struct bdi_writeback_ctx *bdi_wb_ctx; > > unsigned long state; /* Always use atomic bitops on this */ > unsigned long last_old_flush; /* last old data flush */ > @@ -160,6 +161,16 @@ struct bdi_writeback { > #endif > }; > > +struct bdi_writeback_ctx { > + struct bdi_writeback wb; /* the root writeback info for this bdi */ > + struct list_head wb_list; /* list of all wbs */ > +#ifdef CONFIG_CGROUP_WRITEBACK > + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > + struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > +#endif > + wait_queue_head_t wb_waitq; > +}; > + As I was seeing the patch, I'd also add: struct backing_dev_info *bdi; pointer to bdi_writeback_ctx. Then in most functions, you can just pass bdi_wb_ctx into them instead of both bdi *and* bdi_wb_ctx and it makes interfaces somewhat nicer. > struct backing_dev_info { > u64 id; > struct rb_node rb_node; /* keyed by ->id */ > @@ -182,16 +193,13 @@ struct backing_dev_info { > * blk-wbt. > */ > unsigned long last_bdp_sleep; > - > - struct bdi_writeback wb; /* the root writeback info for this bdi */ > - struct list_head wb_list; /* list of all wbs */ > + struct bdi_writeback_ctx default_ctx; > + bool is_parallel; > + int nr_wb_ctx; > + struct xarray bdi_wb_ctx_arr; I think xarray here is overkill. I'd just make this a plain array: struct bdi_writeback_ctx **bdi_wb_ctx_arr; which will get allocated with nr_wb_ctx entries during bdi_init(). Also I'd make default_ctx just be entry at index 0 in this array. I'm undecided whether it will be clearer to just drop default_ctx altogether or keep it and set: struct bdi_writeback_ctx *default_ctx = bdi_wb_ctx_arr[0]; on init so I'll leave that up to you. > #ifdef CONFIG_CGROUP_WRITEBACK > - struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */ > - struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > #endif > - wait_queue_head_t wb_waitq; > - > struct device *dev; > char dev_name[64]; > struct device *owner; ... 
> +static inline struct bdi_writeback_ctx *fetch_bdi_writeback_ctx(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +	struct super_block *sb = inode->i_sb;
> +	struct bdi_writeback_ctx *bdi_wb_ctx;
> +
> +	if (sb->s_op->get_inode_wb_ctx)
> +		bdi_wb_ctx = sb->s_op->get_inode_wb_ctx(inode);
> +	else
> +		bdi_wb_ctx = &bdi->default_ctx;
> +	return bdi_wb_ctx;
> +}

The indirect call (and the handling in there) might get expensive given all
the places where this is called. For now I think we are fine, but eventually
we might need to find space in struct inode (I know that's a hard sell) for
a u8/u16 to store the writeback context index there. But first let's get
this working before we bury ourselves in (potentially premature) performance
optimizations.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7e29433c5ecc..667f97c68cd1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2228,6 +2228,7 @@ struct super_operations {
>  	struct inode *(*alloc_inode)(struct super_block *sb);
>  	void (*destroy_inode)(struct inode *);
>  	void (*free_inode)(struct inode *);
> +	struct bdi_writeback_ctx* (*get_inode_wb_ctx)(struct inode *);
	                        ^^ wrongly placed space here - it should be
	                           "struct bdi_writeback_ctx *(*get_inode_wb_ctx)".
> 
>  	void (*dirty_inode) (struct inode *, int flags);
>  	int (*write_inode) (struct inode *, struct writeback_control *wbc);

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread
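To make the plain-array suggestion above concrete, here is a rough sketch of
how the contexts could be allocated once during bdi_init(); the helper name
and the error handling are illustrative only and are not taken from the
posted patches:

static int bdi_alloc_wb_ctxs(struct backing_dev_info *bdi, int nr)
{
	int i;

	/* Plain array of pointers, sized once at init time. */
	bdi->bdi_wb_ctx_arr = kcalloc(nr, sizeof(*bdi->bdi_wb_ctx_arr),
				      GFP_KERNEL);
	if (!bdi->bdi_wb_ctx_arr)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		bdi->bdi_wb_ctx_arr[i] = kzalloc(sizeof(*bdi->bdi_wb_ctx_arr[i]),
						 GFP_KERNEL);
		if (!bdi->bdi_wb_ctx_arr[i])
			goto out_free;
		/* The usual wb/cgwb init for each context would go here. */
	}
	bdi->nr_wb_ctx = nr;
	/* Entry 0 plays the role of today's single default context. */
	return 0;

out_free:
	while (i--)
		kfree(bdi->bdi_wb_ctx_arr[i]);
	kfree(bdi->bdi_wb_ctx_arr);
	bdi->bdi_wb_ctx_arr = NULL;
	return -ENOMEM;
}

With nr = 1 this degenerates to the current single-writeback behaviour,
which is what makes dropping is_parallel attractive.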
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback
  2025-03-19 15:54                 ` Jan Kara
@ 2025-03-20  7:08                   ` Anuj Gupta
  0 siblings, 0 replies; 22+ messages in thread
From: Anuj Gupta @ 2025-03-20  7:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kundan Kumar, Dave Chinner, Christoph Hellwig, lsf-pc,
	linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev

> I think xarray here is overkill. I'd just make this a plain array:
>
> 	struct bdi_writeback_ctx **bdi_wb_ctx_arr;
>
> which will get allocated with nr_wb_ctx entries during bdi_init(). Also I'd
> make default_ctx just be the entry at index 0 in this array. I'm undecided
> whether it will be clearer to just drop default_ctx altogether or keep it
> and set:
>
> 	struct bdi_writeback_ctx *default_ctx = bdi_wb_ctx_arr[0];
>
> on init, so I'll leave that up to you.

Killing default_ctx completely seems like the better choice here. This
allows us to unconditionally use the for_each_bdi_wb_ctx loop, even for the
existing single-writeback case, by simply relying on bdi_wb_ctx[0]. It also
eliminates the need for is_parallel, simplifying the design.

Thanks for the detailed review, Jan! Your suggestions look good to me and
help streamline the plumbing. I'll incorporate them in the next version.

Thanks,
Anuj Gupta

^ permalink raw reply	[flat|nested] 22+ messages in thread
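With is_parallel and default_ctx gone, the __writeback_inodes_sb_nr() path
quoted earlier could collapse back into a single function along the lines
below. This is only a sketch based on the hunks in the review and the
existing fs/fs-writeback.c code, assuming the patch's two-argument
for_each_bdi_wb_ctx() and its context-based DEFINE_WB_COMPLETION(); it is
not the actual next version of the series:

static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
				     enum wb_reason reason, bool skip_if_busy)
{
	struct backing_dev_info *bdi = sb->s_bdi;
	struct bdi_writeback_ctx *bdi_wb_ctx;

	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
		return;
	WARN_ON(!rwsem_is_locked(&sb->s_umount));

	/* One pass per writeback context; a single context matches today. */
	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
		DEFINE_WB_COMPLETION(done, bdi_wb_ctx);
		struct wb_writeback_work work = {
			.sb			= sb,
			.sync_mode		= WB_SYNC_NONE,
			.tagged_writepages	= 1,
			.done			= &done,
			.nr_pages		= nr,
			.reason			= reason,
		};

		bdi_split_work_to_wbs(bdi, &work, skip_if_busy, bdi_wb_ctx);
		wb_wait_for_completion(&done);
	}
}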
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback
  2025-02-11  1:13         ` Dave Chinner
  2025-02-11 13:43           ` Jan Kara
  2025-02-20 14:19           ` Kundan Kumar
@ 2025-03-12 17:47           ` Luis Chamberlain
  2025-03-13 19:39             ` Jan Kara
  2 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2025-03-12 17:47 UTC (permalink / raw)
  To: Dave Chinner, David Bueso
  Cc: Jan Kara, Christoph Hellwig, Kundan Kumar, lsf-pc, linux-fsdevel,
	anuj20.g, joshi.k, axboe, clm, willy, gost.dev

On Tue, Feb 11, 2025 at 12:13:18PM +1100, Dave Chinner wrote:
> Should we be looking towards using a subset of the existing list_lru
> functionality for writeback contexts here? i.e. create a list_lru
> object with N-way scalability, allow the fs to provide an
> inode-number-to-list mapping function, and use the list_lru
> interfaces to abstract away everything physical and cgroup related
> for tracking dirty inodes?
>
> Then selecting inodes for writeback becomes a list_lru_walk()
> variant depending on what needs to be written back (e.g. physical
> node, memcg, both, everything that is dirty everywhere, etc).

I *suspect* you're referring to abstracting or sharing list_lru's
sharding-to-NUMA-node functionality, so that we can divide objects across
nodes in similar ways for different use cases? list_lru itself is about
reclaim, not writeback, but from my reading that per-node sharding is the
golden nugget to focus on.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread
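As a purely illustrative example of the "inode-number-to-list mapping
function" Dave describes above, the generic fallback could be as simple as
hashing the inode number, with a filesystem like XFS overriding it to map by
allocation group so that physically related inodes land on the same list.
The function below is hypothetical and not an existing kernel interface:

#include <linux/fs.h>
#include <linux/hash.h>

/*
 * Hypothetical default mapping: spread inodes across nr_lists dirty
 * lists by hashing the inode number.  A filesystem could supply its
 * own mapping (e.g. by XFS allocation group) instead.
 */
static unsigned int inode_to_dirty_list(struct inode *inode,
					unsigned int nr_lists)
{
	return hash_long(inode->i_ino, 32) % nr_lists;
}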
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback
  2025-03-12 17:47           ` Luis Chamberlain
@ 2025-03-13 19:39             ` Jan Kara
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2025-03-13 19:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Dave Chinner, David Bueso, Jan Kara, Christoph Hellwig,
	Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, joshi.k, axboe,
	clm, willy, gost.dev

On Wed 12-03-25 10:47:12, Luis Chamberlain wrote:
> On Tue, Feb 11, 2025 at 12:13:18PM +1100, Dave Chinner wrote:
> > Should we be looking towards using a subset of the existing list_lru
> > functionality for writeback contexts here? i.e. create a list_lru
> > object with N-way scalability, allow the fs to provide an
> > inode-number-to-list mapping function, and use the list_lru
> > interfaces to abstract away everything physical and cgroup related
> > for tracking dirty inodes?
> >
> > Then selecting inodes for writeback becomes a list_lru_walk()
> > variant depending on what needs to be written back (e.g. physical
> > node, memcg, both, everything that is dirty everywhere, etc).
>
> I *suspect* you're referring to abstracting or sharing list_lru's
> sharding-to-NUMA-node functionality, so that we can divide objects across
> nodes in similar ways for different use cases? list_lru itself is about
> reclaim, not writeback, but from my reading that per-node sharding is the
> golden nugget to focus on.

Dave was speaking of list_lru mostly as an example of how we already have a
structure that is both memcg aware and has N-way parallelism built into it
(for NUMA nodes). For writeback we need something similar - we already have
cgroup awareness and now you want to add N-way parallelism. So the idea was
that we could possibly create a generic structure supporting this, usable
for both. But the details matter here for what ends up being practical...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread
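For the sake of discussion, such a generic structure might look something
like the sketch below: one shard per writeback context, with the shard count
and the inode-to-shard mapping supplied by the user of the structure,
loosely mirroring how list_lru keeps one list per NUMA node. All of the
names here are hypothetical, not proposed kernel API:

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* One shard: its own lock plus the usual dirty/io lists. */
struct wb_list_shard {
	spinlock_t		lock;
	struct list_head	b_dirty;
	struct list_head	b_io;
	struct list_head	b_more_io;
	struct list_head	b_dirty_time;
};

/*
 * N-way sharded dirty-inode tracking.  Memcg awareness would layer on
 * top of this, the same way list_lru handles it per NUMA node.
 */
struct wb_sharded_lists {
	unsigned int		nr_shards;
	unsigned int		(*map)(struct inode *inode, unsigned int nr);
	struct wb_list_shard	*shards;	/* nr_shards entries */
};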