* [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback [not found] <CGME20250129103448epcas5p1f7d71506e4443429a0b0002eb842e749@epcas5p1.samsung.com> @ 2025-01-29 10:26 ` Kundan Kumar 2025-01-29 15:42 ` Christoph Hellwig 2025-01-29 22:51 ` Dave Chinner 0 siblings, 2 replies; 22+ messages in thread From: Kundan Kumar @ 2025-01-29 10:26 UTC (permalink / raw) To: lsf-pc Cc: linux-fsdevel, anuj20.g, mcgrof, joshi.k, david, axboe, clm, hch, willy, gost.dev, Kundan Kumar Greetings everyone, Anuj and I would like to present our proposal for implementing parallel writeback at LSF/MM. This approach could address the writeback performance issue discussed by Luis last year [1]. Currently, the pagecache writeback is executed in a single-threaded manner. Inodes are added to the dirty list, and the delayed writeback is scheduled. The single-threaded writeback then iterates through the dirty list and performs the writeback operations. We have developed a prototype implementation of parallel writeback, where we have made the writeback process per-CPU-based. The b_io, b_dirty, b_dirty_time, and b_more_io lists have also been modified to be per-CPU. When an inode needs to be added to the b_dirty list, we select the next CPU (in a round-robin fashion) and schedule the per-CPU writeback work on the selected CPU. With the per-CPU threads handling the writeback, we have observed a significant increase in IOPS. Here is a test and comparison between the older writeback and the newer parallel writeback, using XFS filesystem: https://github.com/kundanthebest/parallel_writeback/blob/main/README.md In our tests, we have found that with this implementation the buffered I/O IOPS surpasses the DIRECT I/O. We have done very limited testing with NVMe (Optane) and PMEM. There are a few items that we would like to discuss: During the implementation, we noticed several ways in which the writeback IOs are throttled: A) balance_dirty_pages B) writeback_chunk_size C) wb_over_bg_thresh Additionally, there are delayed writeback executions in the form of dirty_writeback_centisecs and dirty_expire_centisecs. With the introduction of per-CPU writeback, we need to determine the appropriate way to adjust the throttling and delayed writeback settings. Lock contention: We have observed that the writeback list_lock and the inode->i_lock are the locks that are most likely to cause contention when we make the thread per-CPU. Our next task will be to convert the writeback list_lock to a per-CPU lock. Another potential improvement could be to assign a specific CPU to an inode. We are preparing an RFC that shows these changes and plan to submit it soon. [1] https://lore.kernel.org/linux-mm/Zd5lORiOCUsARPWq@dread.disaster.area/T/ -- 2.25.1 ^ permalink raw reply [flat|nested] 22+ messages in thread
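As a rough sketch of the round-robin scheme described in the proposal above (a standalone userspace model; the structure and function names are illustrative only and are not the prototype code), each newly dirtied inode is handed to the next writeback context in turn, and each context would then be drained by its own worker:

  #include <stdio.h>
  #include <stdlib.h>

  #define NR_CONTEXTS 4   /* stands in for the number of CPUs */

  /* illustrative stand-ins for the b_dirty tracking, not kernel structures */
  struct inode_stub {
          unsigned long ino;
          struct inode_stub *next;
  };

  struct wb_context {
          struct inode_stub *b_dirty;     /* per-context dirty list */
          unsigned long nr_dirty;
  };

  static struct wb_context wb_ctx[NR_CONTEXTS];
  static unsigned long next_ctx;          /* round-robin cursor */

  /* dirtying an inode: pick the next context in round-robin order */
  static void mark_inode_dirty(struct inode_stub *inode)
  {
          struct wb_context *ctx = &wb_ctx[next_ctx++ % NR_CONTEXTS];

          inode->next = ctx->b_dirty;
          ctx->b_dirty = inode;
          ctx->nr_dirty++;
  }

  int main(void)
  {
          for (unsigned long ino = 1; ino <= 32; ino++) {
                  struct inode_stub *inode = malloc(sizeof(*inode));

                  inode->ino = ino;
                  mark_inode_dirty(inode);
          }

          /* each context would be drained by its own writeback worker */
          for (int i = 0; i < NR_CONTEXTS; i++)
                  printf("context %d: %lu dirty inodes\n", i, wb_ctx[i].nr_dirty);
          return 0;
  }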
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 10:26 ` [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback Kundan Kumar @ 2025-01-29 15:42 ` Christoph Hellwig 2025-01-31 9:57 ` Kundan Kumar 0 siblings, 1 reply; 22+ messages in thread From: Christoph Hellwig @ 2025-01-29 15:42 UTC (permalink / raw) To: Kundan Kumar Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, david, axboe, clm, hch, willy, gost.dev On Wed, Jan 29, 2025 at 03:56:27PM +0530, Kundan Kumar wrote: > and b_more_io lists have also been modified to be per-CPU. When an inode needs > to be added to the b_dirty list, we select the next CPU (in a round-robin > fashion) and schedule the per-CPU writeback work on the selected CPU. I don't think per-cpu is the right shard here. You want to write related data together. A first approximation might be inodes. FYI, a really good "benchmark" is if you can use this parallel writeback code to replace the btrfs workqueue threads spawned to handle checksumming and compression. ^ permalink raw reply [flat|nested] 22+ messages in thread
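One way to read the "shard by inode" suggestion is that all data of a given inode should always map to the same writeback context, for example by hashing the inode number rather than rotating a per-CPU cursor. A minimal sketch of such a mapping (illustrative only; the constant is the 64-bit golden-ratio multiplier, and nr_ctx is assumed to be a power of two here):

  #include <stdio.h>

  /*
   * Illustrative only: hash the inode number so that all data of one
   * inode always lands in the same writeback context.
   */
  static unsigned int inode_to_wb_ctx(unsigned long long ino, unsigned int nr_ctx)
  {
          return (unsigned int)((ino * 0x61C8864680B583EBULL) >> 32) & (nr_ctx - 1);
  }

  int main(void)
  {
          for (unsigned long long ino = 100; ino < 108; ino++)
                  printf("inode %llu -> wb context %u\n", ino,
                         inode_to_wb_ctx(ino, 4));
          return 0;
  }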
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 15:42 ` Christoph Hellwig @ 2025-01-31 9:57 ` Kundan Kumar 0 siblings, 0 replies; 22+ messages in thread From: Kundan Kumar @ 2025-01-31 9:57 UTC (permalink / raw) To: Christoph Hellwig Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, david, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 566 bytes --] On 29/01/25 04:42PM, Christoph Hellwig wrote: >I don't think per-cpu is the right shard here. You want to write >related data together. A first approximation might be inodes. > We plan to affine an inode to a particular writeback thread: on the first encounter of an inode we select one writeback context, and we use the same context when the inode issues IO again. The number of writeback contexts can be changed from per-CPU and can be decided using other factors, as in this discussion. https://lore.kernel.org/all/20250131093209.6luwm4ny5kj34jqc@green245/ [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
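The "affine an inode on first encounter" idea can be sketched as a sticky assignment: the first time an inode is dirtied it picks a context, which is then cached in the inode and reused for all later IO. A small userspace model of that (field and function names are illustrative only, not the planned patches):

  #include <stdio.h>

  #define NO_CTX (-1)

  /* illustrative inode stub that remembers its writeback context */
  struct inode_stub {
          unsigned long ino;
          int wb_ctx;             /* NO_CTX until first dirtying */
  };

  static unsigned int nr_ctx = 4;
  static unsigned int next_ctx;

  /*
   * First dirtying picks a context (round-robin here, but it could be
   * NUMA- or geometry-based); later dirtyings reuse the same one.
   */
  static int inode_wb_ctx(struct inode_stub *inode)
  {
          if (inode->wb_ctx == NO_CTX)
                  inode->wb_ctx = next_ctx++ % nr_ctx;
          return inode->wb_ctx;
  }

  int main(void)
  {
          struct inode_stub a = { .ino = 42, .wb_ctx = NO_CTX };

          printf("first dirtying  -> ctx %d\n", inode_wb_ctx(&a));
          printf("second dirtying -> ctx %d\n", inode_wb_ctx(&a));  /* same ctx */
          return 0;
  }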
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 10:26 ` [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback Kundan Kumar 2025-01-29 15:42 ` Christoph Hellwig @ 2025-01-29 22:51 ` Dave Chinner 2025-01-31 9:32 ` Kundan Kumar 1 sibling, 1 reply; 22+ messages in thread From: Dave Chinner @ 2025-01-29 22:51 UTC (permalink / raw) To: Kundan Kumar Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev On Wed, Jan 29, 2025 at 03:56:27PM +0530, Kundan Kumar wrote: > Greetings everyone, > > Anuj and I would like to present our proposal for implementing parallel > writeback at LSF/MM. This approach could address the writeback performance > issue discussed by Luis last year [1]. > > Currently, the pagecache writeback is executed in a single-threaded manner. > Inodes are added to the dirty list, and the delayed writeback is scheduled. > The single-threaded writeback then iterates through the dirty list and performs > the writeback operations. > > We have developed a prototype implementation of parallel writeback, where we > have made the writeback process per-CPU-based. The b_io, b_dirty, b_dirty_time, > and b_more_io lists have also been modified to be per-CPU. When an inode needs > to be added to the b_dirty list, we select the next CPU (in a round-robin > fashion) and schedule the per-CPU writeback work on the selected CPU. My own experiments and experience tell me that per-cpu is not the right architecture for expanding writeback concurrency. The actual writeback parallelism that can be achieved is determined by the underlying filesystem architecture and geometry, not the CPU layout of the machine. I think some background info on writeback concurrency is also necessary before we go any further. If we go back many years to before we had the current write throttling architecture, we go back to an age where balance_dirty_pages() throttled by submitting foreground IO itself. It was throttled by the block layer submission queue running out of slots rather than any specific performance measurement. Under memory pressure, this led to highly concurrent dispatch of dirty pages, but very poor selection of dirty inodes/pages to dispatch. The result was that this foreground submission would often dispatch single page IOs, and issue tens of them concurrently across all user processes that were trying to write data. IOWs, throttling caused "IO breakdown" where the IO pattern ended up being small random IO exactly when we need to perform bulk cleaning of dirty pages. We needed bulk page cleaning throughput, not random dirty page shootdown. When we moved to "writeback is only done by the single BDI flusher thread" architecture, we significantly improved system behaviour under heavy load because writeback now had a tight focus on issuing IO from a single inode at a time as efficiently as possible. i.e. when we had large sequential dirty file regions, the writeback IO was all large sequential IO, and that didn't get chopped up by another hundred processes issuing IO from random other inodes at the same time. IOWs, having too much parallelism in writeback for the underlying storage and/or filesystem can be far more harmful to system performance under load than having too little parallelism to drive the filesystem/hardware to its maximum performance. 
This is an underlying constraint that everyone needs to understand before discussing writeback concurrency changes: excessive writeback concurrency can result in worse outcomes than having no writeback concurrency under sustained heavy production loads.... > With the per-CPU threads handling the writeback, we have observed a significant > increase in IOPS. Here is a test and comparison between the older writeback and > the newer parallel writeback, using XFS filesystem: > https://github.com/kundanthebest/parallel_writeback/blob/main/README.md What kernel is that from? > In our tests, we have found that with this implementation the buffered I/O > IOPS surpasses the DIRECT I/O. We have done very limited testing with NVMe > (Optane) and PMEM. Buffered IO often goes faster than direct for these workloads because the buffered IO is aggregated in memory to much larger, less random sequential IOs at writeback time. Writing the data in fewer, larger, better ordered IOs is almost always faster (at both the filesystem and storage hardware layers) than a pure 4kB random IO submission pattern. i.e. with enough RAM, this random write workload using buffered IO is pretty much guaranteed to outperform direct IO regardless of the underlying writeback concurrency. > There are a few items that we would like to discuss: > > During the implementation, we noticed several ways in which the writeback IOs > are throttled: > A) balance_dirty_pages > B) writeback_chunk_size > C) wb_over_bg_thresh > Additionally, there are delayed writeback executions in the form of > dirty_writeback_centisecs and dirty_expire_centisecs. > > With the introduction of per-CPU writeback, we need to determine the > appropriate way to adjust the throttling and delayed writeback settings. Yup, that's kinda my point - residency of the dirty data set in the page cache influences throughput in a workload like this more than writeback parallelism does. > Lock contention: > We have observed that the writeback list_lock and the inode->i_lock are the > locks that are most likely to cause contention when we make the thread per-CPU. You clearly aren't driving the system hard enough with small IOs. :) Once you load up the system with lots of small files (e.g. with fsmark to create millions of 4kB files concurrently) and use flush-on-close writeback semantics to drive writeback IO submission concurrency, the profiles are typically dominated by folio allocation and freeing in the page cache and XFS journal free space management (i.e. transaction reservation management around timestamp updates, block allocation and unwritten extent conversion.) That is, writeback parallelism in XFS is determined by the underlying allocation parallelism of the filesystem and journal size/throughput because it uses delayed allocation (same goes for ext4 and btrfs). In the case of XFS, allocation parallelism is determined by the filesytem geometry - the number of allocation groups created at mkfs time. So if you have 4 allocation groups (AGs) in XFS (typical default for small filesystems), then you can only be doing 4 extent allocation operations at once. Hence it doesn't matter how many concurrent writeback contexts the system has, the filesystem will limit concurrency in many workloads to just 4. i.e. the filesystem geometry determines writeback concurrency, not the number of CPUs in the system. 
Not every writeback will require allocation in XFS (speculative delalloc beyond EOF greatly reduces this overhead) but when you have hundreds of thousands of small files that each require allocation (e.g. cloning or untarring a large source tree), then writeback has a 1:1 ratio between per-inode writeback and extent allocation. This is the case where filesystem writeback concurrency really matters... > Our next task will be to convert the writeback list_lock to a per-CPU lock. > Another potential improvement could be to assign a specific CPU to an inode. As I mentioned above, per-cpu writeback doesn't seem like the right architecture to me. IMO, what we need writeback to be optimised for is asynchronous IO dispatch through each filesystem; our writeback IOPS problems in XFS largely stem from the per-IO cpu overhead of block allocation in the filesystems (i.e. delayed allocation). Because of the way we do allocation locality in XFS, we always try to allocate data extents in the same allocation group as the inode is located. Hence if we have a writeback context per AG, and we always know what AG an inode is located in, we always know which writeback context it should belong to. Hence what a filesystem like XFS needs is the ability to say to the writeback infrastructure "I need N writeback contexts" and then have the ability to control which writeback context an inode is attached to. A writeback context would be an entirely self contained object with dirty lists, locks to protect the lists, a (set of) worker threads that does the writeback dispatch, etc. Then the VFS will naturally write back all the inodes in a given AG independently of the inodes in any other AG, and there will be almost no serialisation on allocation or IO submission between writeback contexts. And there won't be any excessive writeback concurrency at the filesystem layer so contention problems during writeback go away, too. IOWs, for XFS we really don't want CPU aligned writeback lists, we want filesystem geometry aligned writeback contexts... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
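The per-AG scheme described above boils down to two pieces of information the filesystem would hand to the writeback layer: how many contexts it wants (the AG count) and which context each inode belongs to (the AG the inode lives in). A simplified userspace model of that mapping follows; real XFS extracts the AG number from the upper bits of the inode number via XFS_INO_TO_AGNO(), and the agino_bits shift below is only a stand-in for that per-filesystem layout:

  #include <stdio.h>

  /* Simplified model of "one writeback context per XFS AG"; illustrative only. */
  struct fs_geometry {
          unsigned int agcount;           /* number of allocation groups */
          unsigned int agino_bits;        /* bits used for the in-AG inode number */
  };

  /* The filesystem tells the writeback layer "I need agcount contexts"... */
  static unsigned int fs_nr_wb_contexts(const struct fs_geometry *geo)
  {
          return geo->agcount;
  }

  /* ...and maps each inode to the context of the AG it lives in. */
  static unsigned int fs_inode_to_wb_context(const struct fs_geometry *geo,
                                             unsigned long long ino)
  {
          return (unsigned int)(ino >> geo->agino_bits) % geo->agcount;
  }

  int main(void)
  {
          struct fs_geometry geo = { .agcount = 4, .agino_bits = 32 };
          unsigned long long inos[] = { 0x100000080ULL, 0x200000123ULL, 0x345ULL };

          printf("writeback contexts needed: %u\n", fs_nr_wb_contexts(&geo));
          for (int i = 0; i < 3; i++)
                  printf("inode 0x%llx -> context %u\n", inos[i],
                         fs_inode_to_wb_context(&geo, inos[i]));
          return 0;
  }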
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-29 22:51 ` Dave Chinner @ 2025-01-31 9:32 ` Kundan Kumar 2025-01-31 17:06 ` Luis Chamberlain 2025-02-04 2:50 ` Dave Chinner 0 siblings, 2 replies; 22+ messages in thread From: Kundan Kumar @ 2025-01-31 9:32 UTC (permalink / raw) To: Dave Chinner Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 2121 bytes --] >IOWs, having too much parallelism in writeback for the underlying >storage and/or filesystem can be far more harmful to system >performance under load than having too little parallelism to drive >the filesystem/hardware to it's maximum performance. With increasing speed of devices we would like to improve the performance of buffered IO as well. This will help the applications (DB, AI/ML) using buffered I/O. If more parallelism is causing side effects, we can reduce it using some factor like: 1) writeback context per NUMA node. 2) Fixed number of writeback contexts, say min(10, numcpu). 3) NUMCPU/N number of writeback contexts. 4) Writeback context based on FS geometry like per AG for XFS, as per your suggestion. > >What kernel is that from? > 6.12.0-rc4+ commit d165768847839f8d1ae5f8081ecc018a190d50e8 >i.e. with enough RAM, this random write workload using buffered IO >is pretty much guaranteed to outperform direct IO regardless of the >underlying writeback concurrency. We tested making sure RAM is available for both buffered and direct IO. On a system with 32GB RAM we issued 24GB IO through 24 jobs on a PMEM device. fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \ --ioengine=io_uring --time_based=1 -runtime=120 --numjobs=24 --size=1G \ --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting We can see the results, which show direct IO exceeding buffered IO by a big margin. BW (MiB/s) buffered dontcache %improvement direct %improvement randwrite (bs=4k) 3393 5397 59.06% 9315 174.53% >IMO, we need writeback to be optimised for is asynchronous IO >dispatch through each filesystems; our writeback IOPS problems in >XFS largely stem from the per-IO cpu overhead of block allocation in >the filesystems (i.e. delayed allocation). This is a good idea, but it means we will not be able to parallelize within an AG. I will spend some time to build a POC with per AG writeback context, and compare it with per-cpu writeback performance and extent fragmentation. Other filesystems using delayed allocation will also need a similar scheme. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-31 9:32 ` Kundan Kumar @ 2025-01-31 17:06 ` Luis Chamberlain 2025-02-04 2:50 ` Dave Chinner 1 sibling, 0 replies; 22+ messages in thread From: Luis Chamberlain @ 2025-01-31 17:06 UTC (permalink / raw) To: Kundan Kumar Cc: Dave Chinner, lsf-pc, linux-fsdevel, anuj20.g, joshi.k, axboe, clm, hch, willy, gost.dev, Javier González On Fri, Jan 31, 2025 at 03:02:09PM +0530, Kundan Kumar wrote: > > IOWs, having too much parallelism in writeback for the underlying > > storage and/or filesystem can be far more harmful to system > > performance under load than having too little parallelism to drive > > the filesystem/hardware to it's maximum performance. > > With increasing speed of devices we would like to improve the performance of > buffered IO as well. This will help the applications(DB, AI/ML) using buffered > I/O. If more parallelism is causing side effect, we can reduce it using some > factor like: > 1) writeback context per NUMA node. > 2) Fixed number of writeback contexts, say min(10, numcpu). > 3) NUMCPU/N number of writeback contexts. Based on Dave's feedback, it would seem not using 4) can in the worst case make things worse in certain heavy workloads. So an opt-in rather than default would probably be best for 1-3. > 4) Writeback context based on FS geometry like per AG for XFS, as per your > suggestion. To this later point 4): This is not the first time having the ability to gather filesystem topology somehow comes up for more interesting enhancements, FDP being the other one. I don't think we have a generic way to gather this information today, and so do we want a way to at least allow internal users to query for something like this? Luis ^ permalink raw reply [flat|nested] 22+ messages in thread
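Purely as a strawman for the "query the filesystem topology" question above -- this is not an existing kernel interface, and every name here is hypothetical -- the shape could be as small as a per-superblock callback that reports how many independent write domains (XFS AGs, btrfs devices, zones, ...) the filesystem has, so internal users such as writeback could size themselves from it:

  #include <stdio.h>

  /* Hypothetical shape only -- not an existing kernel API. */
  struct fs_topology_hint {
          unsigned int nr_write_domains;  /* e.g. XFS AG count or btrfs device count */
  };

  struct sb_stub {
          const char *fstype;
          void (*get_topology_hint)(const struct sb_stub *sb,
                                    struct fs_topology_hint *hint);
  };

  /* what an example filesystem might report */
  static void example_get_hint(const struct sb_stub *sb,
                               struct fs_topology_hint *hint)
  {
          (void)sb;
          hint->nr_write_domains = 4;     /* pretend: 4 allocation groups */
  }

  int main(void)
  {
          struct sb_stub sb = { "examplefs", example_get_hint };
          struct fs_topology_hint hint;

          sb.get_topology_hint(&sb, &hint);
          printf("%s reports %u independent write domains\n", sb.fstype,
                 hint.nr_write_domains);
          return 0;
  }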
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-01-31 9:32 ` Kundan Kumar 2025-01-31 17:06 ` Luis Chamberlain @ 2025-02-04 2:50 ` Dave Chinner 2025-02-04 5:06 ` Christoph Hellwig 1 sibling, 1 reply; 22+ messages in thread From: Dave Chinner @ 2025-02-04 2:50 UTC (permalink / raw) To: Kundan Kumar Cc: lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev On Fri, Jan 31, 2025 at 03:02:09PM +0530, Kundan Kumar wrote: > > IOWs, having too much parallelism in writeback for the underlying > > storage and/or filesystem can be far more harmful to system > > performance under load than having too little parallelism to drive > > the filesystem/hardware to it's maximum performance. > > With increasing speed of devices we would like to improve the performance of > buffered IO as well. This will help the applications(DB, AI/ML) using buffered > I/O. If more parallelism is causing side effect, we can reduce it using some > factor like: > 1) writeback context per NUMA node. I doubt that will create enough concurrency for a typical small server or desktop machine that only has a single NUMA node but has a couple of fast nvme SSDs in it. > 2) Fixed number of writeback contexts, say min(10, numcpu). > 3) NUMCPU/N number of writeback contexts. These don't take into account the concurrency available from the underlying filesystem or storage. That's the point I was making - CPU count has -zero- relationship to the concurrency the filesystem and/or storage provide the system. It is fundamentally incorrect to base decisions about IO concurrency on the number of CPU cores in the system. Same goes for using NUMA nodes, a fixed minimum, etc. > 4) Writeback context based on FS geometry like per AG for XFS, as per your > suggestion. Leave the concurrency and queue affinity selection up to the filesystem. If they don't specify anything, then we remain with a single queue and writeback thread. > > i.e. with enough RAM, this random write workload using buffered IO > > is pretty much guaranteed to outperform direct IO regardless of the > > underlying writeback concurrency. > > We tested making sure RAM is available for both buffered and direct IO. On a > system with 32GB RAM we issued 24GB IO through 24 jobs on a PMEM device. > > fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \ > --ioengine=io_uring --time_based=1 -runtime=120 --numjobs=24 --size=1G \ > --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting > > We can see the results which show direct IO exceed buffered IO by big margin. > > BW (MiB/s) buffered dontcache %improvement direct %improvement > randwrite (bs=4k) 3393 5397 59.06% 9315 174.53% DIO is faster on PMEM because it only copies the data once (i.e. user to PMEM) whilst buffered IO copies the data twice (user to page cache to PMEM). The IO has exactly the same CPU overhead regardless of in-memory page cache aggregation for buffered IO because IO cpu usage is dominated by the data copy overhead. Throughput benchmarks on PMEM are -always- CPU and memory bandwidth bound doing synchronous data copying. PMEM benchmarks may respond to CPU count based optimisations, but that should not be taken as "infrastructure needs to be CPU-count optimised". This is why it is fundamentally wrong to use PMEM for IO infrastructure optimisation: the behaviour it displays is completely different to pretty much all other types of IO device we support. 
PMEM is a niche, and it performs very poorly compared to a handful of NVMe SSDs being driven by the same CPUs... > > IMO, we need writeback to be optimised for is asynchronous IO > > dispatch through each filesystems; our writeback IOPS problems in > > XFS largely stem from the per-IO cpu overhead of block allocation in > > the filesystems (i.e. delayed allocation). > > This is a good idea, but it means we will not be able to paralellize within > an AG. The XFS allocation concurrency model scales out across AGs, not within AGs. If we need writeback concurrency within an AG (e.g. for pure overwrite concurrency) then we want async dispatch worker threads per writeback queue, not multiple writeback queues per AG.... > I will spend some time to build a POC with per AG writeback context, and > compare it with per-cpu writeback performance and extent fragmentation. Other > filesystems using delayed allocation will also need a similar scheme. The two are not exclusive like you seem to think they are. If you want to optimise your filesystem geometry for CPU count based concurrency, then on XFS all you need to do is this: # mkfs.xfs -d concurrency=nr_cpus /dev/pmem0 mkfs will then create a geometry that has sufficient concurrency for the number of CPUs in the host system. Hence we end up with -all- in-memory and on-disk filesystem structures on that PMEM device being CPU count optimised, not just writeback. Writeback is just a small part of a much larger system, and we already optimise filesystem and IO concurrency via filesystem geometry. Writeback needs to work within those same controls we already provide users with. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-04 2:50 ` Dave Chinner @ 2025-02-04 5:06 ` Christoph Hellwig 2025-02-04 7:08 ` Dave Chinner 2025-02-10 17:28 ` [Lsf-pc] " Jan Kara 0 siblings, 2 replies; 22+ messages in thread From: Christoph Hellwig @ 2025-02-04 5:06 UTC (permalink / raw) To: Dave Chinner Cc: Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, hch, willy, gost.dev On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > I doubt that will create enough concurrency for a typical small > server or desktop machine that only has a single NUMA node but has a > couple of fast nvme SSDs in it. > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > 3) NUMCPU/N number of writeback contexts. > > These don't take into account the concurrency available from > the underlying filesystem or storage. > > That's the point I was making - CPU count has -zero- relationship to > the concurrency the filesystem and/or storage provide the system. It > is fundamentally incorrect to base decisions about IO concurrency on > the number of CPU cores in the system. Yes. But as mentioned in my initial reply there is a use case for more WB threads than fs writeback contexts, which is when the writeback threads do CPU intensive work like compression. Being able to do that from normal writeback threads vs forking out to fs level threads would really simplify the btrfs code a lot. Not really interesting for XFS right now of course. Or in other words: fs / device geometry really should be the main driver, but if a file system supports compression (or really expensive data checksums) being able to scale up the number of threads per context might still make sense. But that's really the advanced part, we'll need to get the fs geometry aligned to work first. > > > XFS largely stem from the per-IO cpu overhead of block allocation in > > > the filesystems (i.e. delayed allocation). > > > > This is a good idea, but it means we will not be able to paralellize within > > an AG. > > The XFS allocation concurrency model scales out across AGs, not > within AGs. If we need writeback concurrency within an AG (e.g. for > pure overwrite concurrency) then we want async dispatch worker > threads per writeback queue, not multiple writeback queues per > AG.... Unless the computational work mentioned above is involved there would be something really wrong if we're saturating a CPU per AG. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-04 5:06 ` Christoph Hellwig @ 2025-02-04 7:08 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2025-02-04 7:08 UTC (permalink / raw) To: Christoph Hellwig Cc: Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue, Feb 04, 2025 at 06:06:42AM +0100, Christoph Hellwig wrote: > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > I doubt that will create enough concurrency for a typical small > > server or desktop machine that only has a single NUMA node but has a > > couple of fast nvme SSDs in it. > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > 3) NUMCPU/N number of writeback contexts. > > > > These don't take into account the concurrency available from > > the underlying filesystem or storage. > > > > That's the point I was making - CPU count has -zero- relationship to > > the concurrency the filesystem and/or storage provide the system. It > > is fundamentally incorrect to base decisions about IO concurrency on > > the number of CPU cores in the system. > > Yes. But as mention in my initial reply there is a use case for more > WB threads than fs writeback contexts, which is when the writeback I understand that - there's more than one reason we may want multiple IO dispatch threads per writeback context. > threads do CPU intensive work like compression. Being able to do that > from normal writeback threads vs forking out out to fs level threads > would really simply the btrfs code a lot. Not really interesting for > XFS right now of course. > > Or in other words: fs / device geometry really should be the main > driver, but if a file systems supports compression (or really expensive > data checksums) being able to scale up the numbes of threads per > context might still make sense. But that's really the advanced part, > we'll need to get the fs geometry aligned to work first. > > > > > XFS largely stem from the per-IO cpu overhead of block allocation in > > > > the filesystems (i.e. delayed allocation). > > > > > > This is a good idea, but it means we will not be able to paralellize within > > > an AG. > > > > The XFS allocation concurrency model scales out across AGs, not > > within AGs. If we need writeback concurrency within an AG (e.g. for > > pure overwrite concurrency) then we want async dispatch worker > > threads per writeback queue, not multiple writeback queues per > > AG.... > > Unless the computational work mentioned above is involved there would be > something really wrong if we're saturating a CPU per AG. ... and the common case I'm thinking of here is stalling writeback on the AGF lock. i.e. another non-writeback task holding the AGF lock - inode cluster allocation, directory block allocation, freeing space in that AG, etc - will stall any writeback that requires allocation in that AG if we don't have multiple dispatch threads per writeback context available. That, IMO, is future optimisation; just moving to a writeback context per AG would solve a lot of the current "single thread CPU bound" writeback throughput limitations we have right now, especially when it comes to writing lots of small files across many directories in a very short period of time (i.e. opening tarballs, rsync, recursive dir copies, etc). -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-04 5:06 ` Christoph Hellwig 2025-02-04 7:08 ` Dave Chinner @ 2025-02-10 17:28 ` Jan Kara 2025-02-11 1:13 ` Dave Chinner 1 sibling, 1 reply; 22+ messages in thread From: Jan Kara @ 2025-02-10 17:28 UTC (permalink / raw) To: Christoph Hellwig Cc: Dave Chinner, Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue 04-02-25 06:06:42, Christoph Hellwig wrote: > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > I doubt that will create enough concurrency for a typical small > > server or desktop machine that only has a single NUMA node but has a > > couple of fast nvme SSDs in it. > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > 3) NUMCPU/N number of writeback contexts. > > > > These don't take into account the concurrency available from > > the underlying filesystem or storage. > > > > That's the point I was making - CPU count has -zero- relationship to > > the concurrency the filesystem and/or storage provide the system. It > > is fundamentally incorrect to base decisions about IO concurrency on > > the number of CPU cores in the system. > > Yes. But as mention in my initial reply there is a use case for more > WB threads than fs writeback contexts, which is when the writeback > threads do CPU intensive work like compression. Being able to do that > from normal writeback threads vs forking out out to fs level threads > would really simply the btrfs code a lot. Not really interesting for > XFS right now of course. > > Or in other words: fs / device geometry really should be the main > driver, but if a file systems supports compression (or really expensive > data checksums) being able to scale up the numbes of threads per > context might still make sense. But that's really the advanced part, > we'll need to get the fs geometry aligned to work first. As I'm reading the thread it sounds to me the writeback subsystem should provide an API for the filesystem to configure number of writeback contexts which would be kind of similar to what we currently do for cgroup aware writeback? Currently we create writeback context per cgroup so now additionally we'll have some property like "inode writeback locality" that will also influence what inode->i_wb gets set to and hence where mark_inode_dirty() files inodes etc. Then additionally you'd like to have possibility to have more processes operate on one struct wb_writeback (again configurable by filesystem?) to handle cases of cpu intensive writeback submission more efficiently. Sounds about right? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-10 17:28 ` [Lsf-pc] " Jan Kara @ 2025-02-11 1:13 ` Dave Chinner 2025-02-11 13:43 ` Jan Kara ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Dave Chinner @ 2025-02-11 1:13 UTC (permalink / raw) To: Jan Kara Cc: Christoph Hellwig, Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote: > On Tue 04-02-25 06:06:42, Christoph Hellwig wrote: > > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > > I doubt that will create enough concurrency for a typical small > > > server or desktop machine that only has a single NUMA node but has a > > > couple of fast nvme SSDs in it. > > > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > > 3) NUMCPU/N number of writeback contexts. > > > > > > These don't take into account the concurrency available from > > > the underlying filesystem or storage. > > > > > > That's the point I was making - CPU count has -zero- relationship to > > > the concurrency the filesystem and/or storage provide the system. It > > > is fundamentally incorrect to base decisions about IO concurrency on > > > the number of CPU cores in the system. > > > > Yes. But as mention in my initial reply there is a use case for more > > WB threads than fs writeback contexts, which is when the writeback > > threads do CPU intensive work like compression. Being able to do that > > from normal writeback threads vs forking out out to fs level threads > > would really simply the btrfs code a lot. Not really interesting for > > XFS right now of course. > > > > Or in other words: fs / device geometry really should be the main > > driver, but if a file systems supports compression (or really expensive > > data checksums) being able to scale up the numbes of threads per > > context might still make sense. But that's really the advanced part, > > we'll need to get the fs geometry aligned to work first. > > As I'm reading the thread it sounds to me the writeback subsystem should > provide an API for the filesystem to configure number of writeback > contexts which would be kind of similar to what we currently do for cgroup > aware writeback? Yes, that's pretty much what I've been trying to say. > Currently we create writeback context per cgroup so now > additionally we'll have some property like "inode writeback locality" that > will also influence what inode->i_wb gets set to and hence where > mark_inode_dirty() files inodes etc. Well, that's currently selected by __inode_attach_wb() based on whether there is a memcg attached to the folio/task being dirtied or not. If there isn't a cgroup based writeback task, then it uses the bdi->wb as the wb context. In my mind, what you are describing above sounds like we would be heading down the same road list_lru started down back in 2012 to support NUMA scalability for LRU based memory reclaim. i.e. we originally had a single global LRU list for important caches. This didn't scale up, so I introduced the list_lru construct to abstract the physical layout of the LRU from the objects being stored on it and the reclaim infrastructure walking it. That gave us per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. The fundamental concept was that we abstract away the sharding of the object tracking into per-physical-node structures via generic infrastructure (i.e. list_lru). 
Then memcgs needed memory reclaim, and so they were added as extra lists with a different indexing mechanism to the list-lru contexts. These weren't per-node lists because there could be thousands of them. Hence it was just a single "global" list per memcg, and so it didn't scale on large machines. This wasn't seen as a problem initially, but a few years later applications using memcgs wanted to scale properly on large NUMA systems. So now we have each memcg tracking the physical per-node memory usage for reclaim purposes (i.e. combinatorial explosion of memcg vs per-node lists). Hence suggesting "physically sharded lists for global objects, single per-cgroup lists for cgroup-owned objects" sounds like exactly the same problem space progression is about to play out with writeback contexts. i.e. we shard the global writeback context into a set of physically sharded lists for scalability and performance reasons, but leave cgroups with the old single threaded list constructs. Then someone says "my cgroup based workload doesn't perform the same as a global workload" and we're off to solve the problem list_lru solves again. So.... Should we be looking towards using a subset of the existing list_lru functionality for writeback contexts here? i.e. create a list_lru object with N-way scalability, allow the fs to provide an inode-number-to-list mapping function, and use the list_lru interfaces to abstract away everything physical and cgroup related for tracking dirty inodes? Then selecting inodes for writeback becomes a list_lru_walk() variant depending on what needs to be written back (e.g. physical node, memcg, both, everything that is dirty everywhere, etc). > Then additionally you'd like to have possibility to have more processes > operate on one struct wb_writeback (again configurable by filesystem?) to > handle cases of cpu intensive writeback submission more efficiently. Yeah, that is a separate problem... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
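The list_lru-inspired idea above can be modelled in userspace as an N-way sharded dirty-inode structure where the shard index comes from a filesystem-supplied mapping callback (AG number, device, ...) rather than from the NUMA node of the object, and each writeback context only ever walks its own shard. A sketch under those assumptions (names are illustrative and this is not the kernel list_lru API):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct dirty_node {
          unsigned long long ino;
          struct dirty_node *next;
  };

  struct dirty_shard {
          pthread_mutex_t lock;           /* per-shard lock, no global list_lock */
          struct dirty_node *head;
  };

  struct dirty_list {
          unsigned int nr_shards;
          unsigned int (*ino_to_shard)(unsigned long long ino, unsigned int nr);
          struct dirty_shard *shards;
  };

  static void dirty_list_init(struct dirty_list *dl, unsigned int nr,
                              unsigned int (*map)(unsigned long long, unsigned int))
  {
          dl->nr_shards = nr;
          dl->ino_to_shard = map;
          dl->shards = calloc(nr, sizeof(*dl->shards));
          for (unsigned int i = 0; i < nr; i++)
                  pthread_mutex_init(&dl->shards[i].lock, NULL);
  }

  /* the fs-provided callback decides which shard a dirty inode lands on */
  static void dirty_list_add(struct dirty_list *dl, struct dirty_node *node)
  {
          struct dirty_shard *s =
                  &dl->shards[dl->ino_to_shard(node->ino, dl->nr_shards)];

          pthread_mutex_lock(&s->lock);
          node->next = s->head;
          s->head = node;
          pthread_mutex_unlock(&s->lock);
  }

  /* list_lru_walk()-style: each writeback context walks only its own shard */
  static void dirty_list_walk(struct dirty_list *dl, unsigned int shard,
                              void (*cb)(struct dirty_node *))
  {
          struct dirty_shard *s = &dl->shards[shard];

          pthread_mutex_lock(&s->lock);
          for (struct dirty_node *n = s->head; n; n = n->next)
                  cb(n);
          pthread_mutex_unlock(&s->lock);
  }

  static unsigned int map_by_low_bits(unsigned long long ino, unsigned int nr)
  {
          return (unsigned int)(ino % nr);        /* stand-in for an AG mapping */
  }

  static void print_node(struct dirty_node *n)
  {
          printf("  inode %llu\n", n->ino);
  }

  int main(void)
  {
          struct dirty_list dl;
          struct dirty_node nodes[8];

          dirty_list_init(&dl, 4, map_by_low_bits);
          for (int i = 0; i < 8; i++) {
                  nodes[i].ino = 100 + i;
                  dirty_list_add(&dl, &nodes[i]);
          }
          for (unsigned int s = 0; s < 4; s++) {
                  printf("shard %u:\n", s);
                  dirty_list_walk(&dl, s, print_node);
          }
          return 0;
  }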
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-11 1:13 ` Dave Chinner @ 2025-02-11 13:43 ` Jan Kara 2025-02-20 14:19 ` Kundan Kumar 2025-03-12 17:47 ` Luis Chamberlain 2 siblings, 0 replies; 22+ messages in thread From: Jan Kara @ 2025-02-11 13:43 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Christoph Hellwig, Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue 11-02-25 12:13:18, Dave Chinner wrote: > On Mon, Feb 10, 2025 at 06:28:28PM +0100, Jan Kara wrote: > > On Tue 04-02-25 06:06:42, Christoph Hellwig wrote: > > > On Tue, Feb 04, 2025 at 01:50:08PM +1100, Dave Chinner wrote: > > > > I doubt that will create enough concurrency for a typical small > > > > server or desktop machine that only has a single NUMA node but has a > > > > couple of fast nvme SSDs in it. > > > > > > > > > 2) Fixed number of writeback contexts, say min(10, numcpu). > > > > > 3) NUMCPU/N number of writeback contexts. > > > > > > > > These don't take into account the concurrency available from > > > > the underlying filesystem or storage. > > > > > > > > That's the point I was making - CPU count has -zero- relationship to > > > > the concurrency the filesystem and/or storage provide the system. It > > > > is fundamentally incorrect to base decisions about IO concurrency on > > > > the number of CPU cores in the system. > > > > > > Yes. But as mention in my initial reply there is a use case for more > > > WB threads than fs writeback contexts, which is when the writeback > > > threads do CPU intensive work like compression. Being able to do that > > > from normal writeback threads vs forking out out to fs level threads > > > would really simply the btrfs code a lot. Not really interesting for > > > XFS right now of course. > > > > > > Or in other words: fs / device geometry really should be the main > > > driver, but if a file systems supports compression (or really expensive > > > data checksums) being able to scale up the numbes of threads per > > > context might still make sense. But that's really the advanced part, > > > we'll need to get the fs geometry aligned to work first. > > > > As I'm reading the thread it sounds to me the writeback subsystem should > > provide an API for the filesystem to configure number of writeback > > contexts which would be kind of similar to what we currently do for cgroup > > aware writeback? > > Yes, that's pretty much what I've been trying to say. > > > Currently we create writeback context per cgroup so now > > additionally we'll have some property like "inode writeback locality" that > > will also influence what inode->i_wb gets set to and hence where > > mark_inode_dirty() files inodes etc. > > Well, that's currently selected by __inode_attach_wb() based on > whether there is a memcg attached to the folio/task being dirtied or > not. If there isn't a cgroup based writeback task, then it uses the > bdi->wb as the wb context. > > In my mind, what you are describing above sounds like we would be > heading down the same road list_lru started down back in 2012 to > support NUMA scalability for LRU based memory reclaim. > > i.e. we originally had a single global LRU list for important > caches. This didn't scale up, so I introduced the list_lru construct > to abstract the physical layout of the LRU from the objects being > stored on it and the reclaim infrastructure walking it. That gave us > per-NUMA-node LRUs and NUMA-aware shrinkers for memory reclaim. 
The > fundamental concept was that we abstract away the sharding of the > object tracking into per-physical-node structures via generic > infrastructure (i.e. list_lru). > > Then memcgs needed memory reclaim, and so they were added as extra > lists with a different indexing mechanism to the list-lru contexts. > These weren't per-node lists because there could be thousands of > them. Hence it was just a single "global" list per memcg, and so it > didn't scale on large machines. > > This wasn't seen as a problem initially, but a few years later > applications using memcgs wanted to scale properly on large NUMA > systems. So now we have each memcg tracking the physical per-node > memory usage for reclaim purposes (i.e. combinatorial explosion of > memcg vs per-node lists). > > Hence suggesting "physically sharded lists for global objects, > single per-cgroup lists for cgroup-owned objects" sounds like > exactly the same problem space progression is about to play out with > writeback contexts. > > i.e. we shared the global writeback context into a set of physically > sharded lists for scalability and perofrmance reasons, but leave > cgroups with the old single threaded list constructs. Then someone > says "my cgroup based workload doesn't perform the same as a global > workload" and we're off to solve the problem list_lru solves again. > > So.... > > Should we be looking towards using a subset of the existing list_lru > functionality for writeback contexts here? i.e. create a list_lru > object with N-way scalability, allow the fs to provide an > inode-number-to-list mapping function, and use the list_lru > interfaces to abstract away everything physical and cgroup related > for tracking dirty inodes? Interesting idea. Indeed, the similarity with problems list_lru is solving is significant. I like the idea. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-11 1:13 ` Dave Chinner 2025-02-11 13:43 ` Jan Kara @ 2025-02-20 14:19 ` Kundan Kumar 2025-03-13 20:22 ` Jan Kara 2025-03-12 17:47 ` Luis Chamberlain 2 siblings, 1 reply; 22+ messages in thread From: Kundan Kumar @ 2025-02-20 14:19 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 1976 bytes --] >Well, that's currently selected by __inode_attach_wb() based on >whether there is a memcg attached to the folio/task being dirtied or >not. If there isn't a cgroup based writeback task, then it uses the >bdi->wb as the wb context. We have created a proof of concept for per-AG context-based writeback, as described in [1]. The AG is mapped to a writeback context (wb_ctx). Using the filesystem handler, __mark_inode_dirty() selects the writeback context corresponding to the inode. We attempted to handle memcg and bdi based writeback in a similar manner. This approach aims to maintain the original writeback semantics while providing parallelism. This helps in pushing more data early to the device, trying to ease the write pressure faster. [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ >Then selecting inodes for writeback becomes a list_lru_walk() >variant depending on what needs to be written back (e.g. physical >node, memcg, both, everything that is dirty everywhere, etc). We considered using list_lru to track inodes within a writeback context. This can be implemented as: struct bdi_writeback { struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list struct list_lru b_io_dirty_inodes_lru; ... ... }; By doing this, we would obtain a sharded list of inodes per NUMA node. However, we would also need per-NUMA writeback contexts. Otherwise, even if inodes are NUMA-sharded, a single writeback context would still process them sequentially, limiting parallelism. But there’s a concern: NUMA-based writeback contexts are not aligned with filesystem geometry, which could negatively impact delayed allocation and writeback efficiency, as you pointed out in your previous reply [2]. Would it be better to let the filesystem dictate the number of writeback threads, rather than enforcing a per-NUMA model? Do you see it differently? [2] https://lore.kernel.org/all/Z5qw_1BOqiFum5Dn@dread.disaster.area/ [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-02-20 14:19 ` Kundan Kumar @ 2025-03-13 20:22 ` Jan Kara 2025-03-18 3:42 ` Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Jan Kara @ 2025-03-13 20:22 UTC (permalink / raw) To: Kundan Kumar Cc: Dave Chinner, Jan Kara, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Thu 20-02-25 19:49:22, Kundan Kumar wrote: > > Well, that's currently selected by __inode_attach_wb() based on > > whether there is a memcg attached to the folio/task being dirtied or > > not. If there isn't a cgroup based writeback task, then it uses the > > bdi->wb as the wb context. > > We have created a proof of concept for per-AG context-based writeback, as > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using > the filesystem handler, __mark_inode_dirty() selects writeback context > corresponding to the inode. > > We attempted to handle memcg and bdi based writeback in a similar manner. > This approach aims to maintain the original writeback semantics while > providing parallelism. This helps in pushing more data early to the > device, trying to ease the write pressure faster. > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ Yeah, I've seen the patches. Sorry for not getting to you earlier. > > Then selecting inodes for writeback becomes a list_lru_walk() > > variant depending on what needs to be written back (e.g. physical > > node, memcg, both, everything that is dirty everywhere, etc). > > We considered using list_lru to track inodes within a writeback context. > This can be implemented as: > struct bdi_writeback { > struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list > struct list_lru b_io_dirty_inodes_lru; > ... > ... > }; > By doing this, we would obtain a sharded list of inodes per NUMA node. I think you've misunderstood Dave's suggestion here. list_lru was given as an example of a structure for inspiration. We cannot take it directly as is for writeback purposes because we don't want to be sharding based on NUMA nodes but rather based on some other (likely FS driven) criteria. > However, we would also need per-NUMA writeback contexts. Otherwise, > even if inodes are NUMA-sharded, a single writeback context would stil > process them sequentially, limiting parallelism. But there’s a concern: > NUMA-based writeback contexts are not aligned with filesystem geometry, > which could negatively impact delayed allocation and writeback efficiency, > as you pointed out in your previous reply [2]. > > Would it be better to let the filesystem dictate the number of writeback > threads, rather than enforcing a per-NUMA model? I was thinking about how to best parallelize the writeback and I think there are two quite different demands for which we probably want two different levels of parallelism. One case is the situation when the filesystem for example has multiple underlying devices (like btrfs or bcachefs) or for other reasons writeback to different parts is fairly independent (like for different XFS AGs). Here we want parallelism at rather high level I think including separate dirty throttling, tracking of writeback bandwidth etc.. 
It is *almost* like separate bdis (struct backing_dev_info) but I think it would be technically and also conceptually somewhat easier to do the multiplexing by factoring out: struct bdi_writeback wb; /* the root writeback info for this bdi */ struct list_head wb_list; /* list of all wbs */ #ifdef CONFIG_CGROUP_WRITEBACK struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ #endif wait_queue_head_t wb_waitq; into a new structure (looking for a good name - bdi_writeback_context???) that can get multiplied (filesystem can create its own bdi on mount and configure the number of bdi_writeback_contexts it wants). We also need to add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() which will return appropriate bdi_writeback_context (or perhaps just its index?) for an inode. This will be used by the filesystem to direct writeback code where the inode should go. This is kind of what Kundan did in the last revision of his patches but I hope this approach should somewhat limit the changes necessary to writeback infrastructure - the patch 2 in his series is really unreviewably large... Then another case is a situation where either the amount of CPU work is rather high for IO submission (cases like Christoph mentioned where filesystem needs to do checksumming on submission or similar) or simply the device is rather fast for a single submission thread and the FS doesn't have a sensible way to partition inodes (e.g. for ext4 there's no meaningful way of partitioning inodes into independent groups - ext4 allocation groups are small and inodes often span multiple groups and the sets of groups used by different inodes randomly overlap). In this case I think we want single dirty throttling instance, single writeback throughput estimation, single set of dirty inode lists etc. The level where the parallelism needs to happen is fairly low - I'd say duplicate: struct delayed_work dwork; /* work item used for writeback */ in struct bdi_writeback. Again, the number of dworks should be configurable when creating bdi for the filesystem. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 22+ messages in thread
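For concreteness, the factoring described above might look roughly like the sketch below -- the field list is exactly the one quoted from today's struct backing_dev_info, the names bdi_writeback_context and get_inode_wb_context are the ones proposed in this mail, and everything else (the include, how the context array is sized, and how the hook gets wired into __inode_attach_wb()) is left open and only assumed here:

  #include <linux/backing-dev-defs.h>

  /* sketch of the proposed factoring, not an actual patch */
  struct bdi_writeback_context {
          struct bdi_writeback wb;                /* root writeback info for this context */
          struct list_head wb_list;               /* list of all wbs */
  #ifdef CONFIG_CGROUP_WRITEBACK
          struct radix_tree_root cgwb_tree;       /* radix tree of active cgroup wbs */
          struct rw_semaphore wb_switch_rwsem;    /* no cgwb switch while syncing */
  #endif
          wait_queue_head_t wb_waitq;
  };

  /*
   * struct backing_dev_info would then carry an array of these, sized by the
   * filesystem when it creates its own bdi at mount time, and a new
   * sb->s_ops->get_inode_wb_context(inode) hook -- called from
   * __inode_attach_wb() -- would return the context (or its index) that the
   * inode belongs to.
   */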
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-13 20:22 ` Jan Kara @ 2025-03-18 3:42 ` Dave Chinner [not found] ` <CGME20250319081521epcas5p39ab71751aef70c73ba0f664db852ad69@epcas5p3.samsung.com> 2025-03-18 6:41 ` Kundan Kumar 2025-03-18 11:37 ` Anuj Gupta 2 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2025-03-18 3:42 UTC (permalink / raw) To: Jan Kara Cc: Kundan Kumar, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Thu, Mar 13, 2025 at 09:22:00PM +0100, Jan Kara wrote: > On Thu 20-02-25 19:49:22, Kundan Kumar wrote: > > > Well, that's currently selected by __inode_attach_wb() based on > > > whether there is a memcg attached to the folio/task being dirtied or > > > not. If there isn't a cgroup based writeback task, then it uses the > > > bdi->wb as the wb context. > > > > We have created a proof of concept for per-AG context-based writeback, as > > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using > > the filesystem handler, __mark_inode_dirty() selects writeback context > > corresponding to the inode. > > > > We attempted to handle memcg and bdi based writeback in a similar manner. > > This approach aims to maintain the original writeback semantics while > > providing parallelism. This helps in pushing more data early to the > > device, trying to ease the write pressure faster. > > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ > > Yeah, I've seen the patches. Sorry for not getting to you earlier. > > > > Then selecting inodes for writeback becomes a list_lru_walk() > > > variant depending on what needs to be written back (e.g. physical > > > node, memcg, both, everything that is dirty everywhere, etc). > > > > We considered using list_lru to track inodes within a writeback context. > > This can be implemented as: > > struct bdi_writeback { > > struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list > > struct list_lru b_io_dirty_inodes_lru; > > ... > > ... > > }; > > By doing this, we would obtain a sharded list of inodes per NUMA node. > > I think you've misunderstood Dave's suggestion here. list_lru was given as > an example of a structure for inspiration. We cannot take it directly as is > for writeback purposes because we don't want to be sharding based on NUMA > nodes but rather based on some other (likely FS driven) criteria. Well, you might say that, but..... ... I was actually thinking of taking the list_lru and abstracting its N-way parallelism away from the numa infrastructure. The NUMA awareness of the list_lru is largely in external APIs. The only implicit NUMA awareness is in the list_lru_add() function where it converts the object being added to the list to a node ID based on where it is physically located in memory. The only other thing that is NUMA specific is that the list is set up with N-way concurrency when N = the number of NUMA nodes in the machine. So, really, it is just a thin veneer of NUMA wrapped around the inherent concurrency built into the structure. IOWs, when we create a list_lru for a numa aware shrinker, we simply use the number of nodes as the N-way parallelism for the list, and the existing per-node infrastructure simply feeds the right numa node ID as the "list index" for it to function as is. 
In the case of writeback parallelism, we could create a list_lru with the number of AGs as the N-way parallelism for the list, and then have the concurrent BDI writeback context (1 per AG) simply provide the AG number as the "list index" for writeback.... > > However, we would also need per-NUMA writeback contexts. Otherwise, > > even if inodes are NUMA-sharded, a single writeback context would stil > > process them sequentially, limiting parallelism. But there’s a concern: > > NUMA-based writeback contexts are not aligned with filesystem geometry, > > which could negatively impact delayed allocation and writeback efficiency, > > as you pointed out in your previous reply [2]. > > > > Would it be better to let the filesystem dictate the number of writeback > > threads, rather than enforcing a per-NUMA model? > > I was thinking about how to best parallelize the writeback and I think > there are two quite different demands for which we probably want two > different levels of parallelism. > > One case is the situation when the filesystem for example has multiple > underlying devices (like btrfs or bcachefs) or for other reasons writeback > to different parts is fairly independent (like for different XFS AGs). Here > we want parallelism at rather high level I think including separate > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like > separate bdis (struct backing_dev_info) but I think it would be technically > and also conceptually somewhat easier to do the multiplexing by factoring > out: > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > struct list_head wb_list; /* list of all wbs */ > #ifdef CONFIG_CGROUP_WRITEBACK > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > #endif > wait_queue_head_t wb_waitq; > > into a new structure (looking for a good name - bdi_writeback_context???) > that can get multiplied (filesystem can create its own bdi on mount and > configure there number of bdi_writeback_contexts it wants). We also need to > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() > which will return appropriate bdi_writeback_context (or perhaps just it's > index?) for an inode. This will be used by the filesystem to direct > writeback code where the inode should go. This is kind of what Kundan did > in the last revision of his patches but I hope this approach should > somewhat limit the changes necessary to writeback infrastructure - the > patch 2 in his series is really unreviewably large... Yes, this is the equivalent of the SHRINKER_NUMA_AWARE flag that triggers the shrinker infrastructure to track work in a NUMA aware way when the subsystem shrinker callbacks are using the NUMA aware list_lru to back it.... > Then another case is a situation where either the amount of CPU work is > rather high for IO submission (cases like Christoph mentioned where > filesystem needs to do checksumming on submission or similar) or simply the > device is rather fast for a single submission thread and the FS doesn't > have a sensible way to partition inodes (e.g. for ext4 there's no > meaningful way of partitioning inodes into independent groups - ext4 > allocation groups are small and inodes often span multiple groups and the > sets of groups used by different inodes randomly overlap). In this case I > think we want single dirty throttling instance, single writeback throughput > estimation, single set of dirty inode lists etc. 
The level where the > parallelism needs to happen is fairly low - I'd say duplicate: > > struct delayed_work dwork; /* work item used for writeback */ > > in struct bdi_writeback. Again, the number of dworks should be configurable > when creating bdi for the filesystem. Right - this is what I described early on in the discussion as a need for both scaling out across bdi contexts as well as scaling up within a single bdi context. That is, even within the context of a BDI writeback context per AG in XFS, we can have need for multiple IO submissions to be in flight at once. e.g. one IO submission gets blocked doing allocation, but there are a heap of pure overwrites queued up behind it. We can submit the pure overwrites without getting blocked on the allocation, but if we don't have IO submission concurrency within a BDI, we cannot do that at all. IOWs, these aren't exclusive optimisations to writeback; there are situations where only one of the two mechanisms can solve the concurrency problem. i.e. we need both high level and low level concurrency mechanisms in filesystems like XFS.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
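To make the abstraction Dave is describing concrete, here is a minimal sketch of an N-way sharded dirty list where the shard index is supplied by the caller (for example the XFS AG number) rather than derived from the NUMA node the object happens to live on. All names below (dirty_list, dirty_list_shard, dirty_list_add) are invented for illustration; this is not the existing list_lru API, only the shape the structure takes once the NUMA veneer is peeled away.

#include <linux/cache.h>
#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* One shard per writeback context, e.g. one per XFS AG. */
struct dirty_list_shard {
	spinlock_t		lock;
	struct list_head	list;		/* dirty inodes in this shard */
	long			nr_items;
} ____cacheline_aligned_in_smp;

struct dirty_list {
	unsigned int		nr_shards;	/* N-way parallelism, e.g. AG count */
	struct dirty_list_shard	*shards;
};

static int dirty_list_init(struct dirty_list *dl, unsigned int nr_shards)
{
	unsigned int i;

	dl->shards = kcalloc(nr_shards, sizeof(*dl->shards), GFP_KERNEL);
	if (!dl->shards)
		return -ENOMEM;
	dl->nr_shards = nr_shards;
	for (i = 0; i < nr_shards; i++) {
		spin_lock_init(&dl->shards[i].lock);
		INIT_LIST_HEAD(&dl->shards[i].list);
	}
	return 0;
}

/*
 * The caller supplies the shard index: XFS would pass the inode's AG
 * number, while a NUMA-aware user would pass the node ID, recovering
 * the existing list_lru behaviour.
 */
static void dirty_list_add(struct dirty_list *dl, struct list_head *item,
			   unsigned int idx)
{
	struct dirty_list_shard *shard = &dl->shards[idx % dl->nr_shards];

	spin_lock(&shard->lock);
	list_add_tail(item, &shard->list);
	shard->nr_items++;
	spin_unlock(&shard->lock);
}

A per-shard walk then gives the list_lru_walk()-style selection mentioned earlier in the thread: "write back AG 5" becomes a walk of shards[5] only, and each per-AG writeback context only ever touches its own shard.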
[parent not found: <CGME20250319081521epcas5p39ab71751aef70c73ba0f664db852ad69@epcas5p3.samsung.com>]
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback [not found] ` <CGME20250319081521epcas5p39ab71751aef70c73ba0f664db852ad69@epcas5p3.samsung.com> @ 2025-03-19 8:07 ` Anuj Gupta 0 siblings, 0 replies; 22+ messages in thread From: Anuj Gupta @ 2025-03-19 8:07 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Kundan Kumar, Christoph Hellwig, lsf-pc, linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 3590 bytes --] On Tue, Mar 18, 2025 at 02:42:18PM +1100, Dave Chinner wrote: > On Thu, Mar 13, 2025 at 09:22:00PM +0100, Jan Kara wrote: > > On Thu 20-02-25 19:49:22, Kundan Kumar wrote: > > > > Well, that's currently selected by __inode_attach_wb() based on > > > > whether there is a memcg attached to the folio/task being dirtied or > > > > not. If there isn't a cgroup based writeback task, then it uses the > > > > bdi->wb as the wb context. > > > > > > We have created a proof of concept for per-AG context-based writeback, as > > > described in [1]. The AG is mapped to a writeback context (wb_ctx). Using > > > the filesystem handler, __mark_inode_dirty() selects writeback context > > > corresponding to the inode. > > > > > > We attempted to handle memcg and bdi based writeback in a similar manner. > > > This approach aims to maintain the original writeback semantics while > > > providing parallelism. This helps in pushing more data early to the > > > device, trying to ease the write pressure faster. > > > [1] https://lore.kernel.org/all/20250212103634.448437-1-kundan.kumar@samsung.com/ > > > > Yeah, I've seen the patches. Sorry for not getting to you earlier. > > > > > > Then selecting inodes for writeback becomes a list_lru_walk() > > > > variant depending on what needs to be written back (e.g. physical > > > > node, memcg, both, everything that is dirty everywhere, etc). > > > > > > We considered using list_lru to track inodes within a writeback context. > > > This can be implemented as: > > > struct bdi_writeback { > > > struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list > > > struct list_lru b_io_dirty_inodes_lru; > > > ... > > > ... > > > }; > > > By doing this, we would obtain a sharded list of inodes per NUMA node. > > > > I think you've misunderstood Dave's suggestion here. list_lru was given as > > an example of a structure for inspiration. We cannot take it directly as is > > for writeback purposes because we don't want to be sharding based on NUMA > > nodes but rather based on some other (likely FS driven) criteria. > > Well, you might say that, but..... > > ... I was actually thinking of taking the list_lru and abstracting > it's N-way parallelism away from the numa infrastructure. > > The NUMA awareness of the list_lru is largely in external APIs. Th > eonly implicit NUMA awareness is in the list_lru_add() function > where it converts the object being added to the list to a node ID > based on where it is physically located in memory. > > The only other thing that is NUMA specific is that the list is set > up with N-way concurrency when N = the number of NUMA nodes in the > machine. > > So, really, it is just thin veneer of NUMA wrapped around the > inherent concurrency built into the structure. > > IOWs, when we create a list_lru for a numa aware shrinker, we simply > use the number of nodes as the N-way parallelism for the list, > and the existing per-node infrastructure simply feeds the right > numa node ID as the "list index" for it to function as is. 
> > In the case of writeback parallelism, we could create a list_lru > with the number of AGs as the N-way parallism for the list, and then > have the concurrent BDI writeback context (1 per AG) simply provide > the AG number as the "list index" for writeback.... > If we go ahead with Jan's suggestion, each AG will have a separate bdi_writeback_context. Each bdi_writeback_context has its own b_dirty inode list. This naturally partitions inodes per AG. Given that, do we still need AG based sharded list_lru? Am I missing something here? [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-13 20:22 ` Jan Kara 2025-03-18 3:42 ` Dave Chinner @ 2025-03-18 6:41 ` Kundan Kumar 2025-03-18 11:37 ` Anuj Gupta 2 siblings, 0 replies; 22+ messages in thread From: Kundan Kumar @ 2025-03-18 6:41 UTC (permalink / raw) To: Jan Kara Cc: Dave Chinner, Christoph Hellwig, lsf-pc, linux-fsdevel, anuj20.g, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 2928 bytes --] >> > Then selecting inodes for writeback becomes a list_lru_walk() >> > variant depending on what needs to be written back (e.g. physical >> > node, memcg, both, everything that is dirty everywhere, etc). >> >> We considered using list_lru to track inodes within a writeback context. >> This can be implemented as: >> struct bdi_writeback { >> struct list_lru b_dirty_inodes_lru; // instead of a single b_dirty list >> struct list_lru b_io_dirty_inodes_lru; >> ... >> ... >> }; >> By doing this, we would obtain a sharded list of inodes per NUMA node. > >I think you've misunderstood Dave's suggestion here. list_lru was given as >an example of a structure for inspiration. We cannot take it directly as is >for writeback purposes because we don't want to be sharding based on NUMA >nodes but rather based on some other (likely FS driven) criteria. Makes sense. Thanks for the clarification. >I was thinking about how to best parallelize the writeback and I think >there are two quite different demands for which we probably want two >different levels of parallelism. > >One case is the situation when the filesystem for example has multiple >underlying devices (like btrfs or bcachefs) or for other reasons writeback >to different parts is fairly independent (like for different XFS AGs). Here >we want parallelism at rather high level I think including separate >dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like >separate bdis (struct backing_dev_info) but I think it would be technically >and also conceptually somewhat easier to do the multiplexing by factoring >out: > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > struct list_head wb_list; /* list of all wbs */ >#ifdef CONFIG_CGROUP_WRITEBACK > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ >#endif > wait_queue_head_t wb_waitq; > >into a new structure (looking for a good name - bdi_writeback_context???) >that can get multiplied (filesystem can create its own bdi on mount and >configure there number of bdi_writeback_contexts it wants). We also need to >add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() >which will return appropriate bdi_writeback_context (or perhaps just it's >index?) for an inode. This will be used by the filesystem to direct >writeback code where the inode should go. This is kind of what Kundan did >in the last revision of his patches but I hope this approach should >somewhat limit the changes necessary to writeback infrastructure - the This looks much better than the data structures we had in previous version. I will prepare a new version based on this feedback. >patch 2 in his series is really unreviewably large... I agree. Sorry, will try to streamline the patches in a better fashion in the next iteration. [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-13 20:22 ` Jan Kara 2025-03-18 3:42 ` Dave Chinner 2025-03-18 6:41 ` Kundan Kumar @ 2025-03-18 11:37 ` Anuj Gupta 2025-03-19 15:54 ` Jan Kara 2 siblings, 1 reply; 22+ messages in thread From: Anuj Gupta @ 2025-03-18 11:37 UTC (permalink / raw) To: Jan Kara Cc: Kundan Kumar, Dave Chinner, Christoph Hellwig, lsf-pc, linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev [-- Attachment #1: Type: text/plain, Size: 38401 bytes --] > I was thinking about how to best parallelize the writeback and I think > there are two quite different demands for which we probably want two > different levels of parallelism. > > One case is the situation when the filesystem for example has multiple > underlying devices (like btrfs or bcachefs) or for other reasons writeback > to different parts is fairly independent (like for different XFS AGs). Here > we want parallelism at rather high level I think including separate > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like > separate bdis (struct backing_dev_info) but I think it would be technically > and also conceptually somewhat easier to do the multiplexing by factoring > out: > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > struct list_head wb_list; /* list of all wbs */ > #ifdef CONFIG_CGROUP_WRITEBACK > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > #endif > wait_queue_head_t wb_waitq; > > into a new structure (looking for a good name - bdi_writeback_context???) > that can get multiplied (filesystem can create its own bdi on mount and > configure there number of bdi_writeback_contexts it wants). We also need to > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() > which will return appropriate bdi_writeback_context (or perhaps just it's > index?) for an inode. This will be used by the filesystem to direct > writeback code where the inode should go. This is kind of what Kundan did > in the last revision of his patches but I hope this approach should > somewhat limit the changes necessary to writeback infrastructure - the > patch 2 in his series is really unreviewably large... Thanks Jan. I tried to modify the existing infra based on your suggestion [1]. This only takes care of the first case that you mentioned. I am yet to start to evaluate and implement the second case (when amount of cpu work is high). The patch is large, because it requires changing all the places that uses bdi's fields that have now been moved to a new structure. I will try to streamline it further. I have omitted the xfs side plumbing for now. Can you please see if this aligns with the scheme that you had in mind? If the infra looks fine, I can plumb XFS changes on top of it. Note: This patch is only compile tested, haven't run any tests on it. It is on top of v6.13 https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v6.13 [1] Subject: [PATCH] writeback: add infra for parallel writeback Introduce a new bdi_writeback_ctx structure that enables us to have multiple writeback contexts for parallel writeback. Existing single threaded writeback will use default_ctx. Filesystem now have the option to create there own bdi aind populate multiple bdi_writeback_ctx in bdi's bdi_wb_ctx_arr (xarray for now, but plan to move to use a better structure like list_lru). 
Introduce a new hook get_inode_wb_ctx() in super_operations which takes inode as a parameter and returns the bdi_wb_ctx corresponding to the inode. Modify all the functions/places that operate on bdi's wb, wb_list, cgwb_tree, wb_switch_rwsem, wb_waitq as these fields have now been moved to bdi_writeback_ctx Store a reference to bdi_wb_ctx in bdi_writeback. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> --- fs/fs-writeback.c | 170 +++++++++++++++++------- include/linux/backing-dev-defs.h | 34 +++-- include/linux/backing-dev.h | 48 +++++-- include/linux/fs.h | 1 + mm/backing-dev.c | 218 +++++++++++++++++++++++-------- mm/page-writeback.c | 5 +- 6 files changed, 345 insertions(+), 131 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 3cd99e2dc6ac..4c47c298f174 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -265,23 +265,27 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) { struct backing_dev_info *bdi = inode_to_bdi(inode); struct bdi_writeback *wb = NULL; + struct bdi_writeback_ctx *bdi_writeback_ctx = + fetch_bdi_writeback_ctx(inode); if (inode_cgwb_enabled(inode)) { struct cgroup_subsys_state *memcg_css; if (folio) { memcg_css = mem_cgroup_css_from_folio(folio); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, + GFP_ATOMIC); } else { /* must pin memcg_css, see wb_get_create() */ memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, + GFP_ATOMIC); css_put(memcg_css); } } if (!wb) - wb = &bdi->wb; + wb = &bdi_writeback_ctx->wb; /* * There may be multiple instances of this function racing to @@ -302,12 +306,15 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) static void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb) { + struct bdi_writeback_ctx *bdi_wb_ctx; + assert_spin_locked(&wb->list_lock); assert_spin_locked(&inode->i_lock); WARN_ON_ONCE(inode->i_state & I_FREEING); inode->i_state &= ~I_SYNC_QUEUED; - if (wb != &wb->bdi->wb) + bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); + if (wb != &bdi_wb_ctx->wb) list_move(&inode->i_io_list, &wb->b_attached); else list_del_init(&inode->i_io_list); @@ -382,14 +389,14 @@ struct inode_switch_wbs_context { struct inode *inodes[]; }; -static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) +static void bdi_wb_ctx_down_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { - down_write(&bdi->wb_switch_rwsem); + down_write(&bdi_wb_ctx->wb_switch_rwsem); } -static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) +static void bdi_wb_ctx_up_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { - up_write(&bdi->wb_switch_rwsem); + up_write(&bdi_wb_ctx->wb_switch_rwsem); } static bool inode_do_switch_wbs(struct inode *inode, @@ -490,7 +497,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) { struct inode_switch_wbs_context *isw = container_of(to_rcu_work(work), struct inode_switch_wbs_context, work); - struct backing_dev_info *bdi = inode_to_bdi(isw->inodes[0]); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(isw->inodes[0]); struct bdi_writeback *old_wb = isw->inodes[0]->i_wb; struct bdi_writeback *new_wb = isw->new_wb; unsigned long nr_switched = 0; @@ -500,7 +507,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) * If 
@inode switches cgwb membership while sync_inodes_sb() is * being issued, sync_inodes_sb() might miss it. Synchronize. */ - down_read(&bdi->wb_switch_rwsem); + down_read(&bdi_wb_ctx->wb_switch_rwsem); /* * By the time control reaches here, RCU grace period has passed @@ -529,7 +536,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work) spin_unlock(&new_wb->list_lock); spin_unlock(&old_wb->list_lock); - up_read(&bdi->wb_switch_rwsem); + up_read(&bdi_wb_ctx->wb_switch_rwsem); if (nr_switched) { wb_wakeup(new_wb); @@ -583,6 +590,7 @@ static bool inode_prepare_wbs_switch(struct inode *inode, static void inode_switch_wbs(struct inode *inode, int new_wb_id) { struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); struct cgroup_subsys_state *memcg_css; struct inode_switch_wbs_context *isw; @@ -609,7 +617,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id) if (!memcg_css) goto out_free; - isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + isw->new_wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, GFP_ATOMIC); css_put(memcg_css); if (!isw->new_wb) goto out_free; @@ -678,12 +686,12 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb) for (memcg_css = wb->memcg_css->parent; memcg_css; memcg_css = memcg_css->parent) { - isw->new_wb = wb_get_create(wb->bdi, memcg_css, GFP_KERNEL); + isw->new_wb = wb_get_create(wb->bdi, wb->bdi_wb_ctx, memcg_css, GFP_KERNEL); if (isw->new_wb) break; } if (unlikely(!isw->new_wb)) - isw->new_wb = &wb->bdi->wb; /* wb_get() is noop for bdi's wb */ + isw->new_wb = &wb->bdi_wb_ctx->wb; /* wb_get() is noop for bdi's wb */ nr = 0; spin_lock(&wb->list_lock); @@ -995,17 +1003,18 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) */ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, struct wb_writeback_work *base_work, - bool skip_if_busy) + bool skip_if_busy, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct bdi_writeback *last_wb = NULL; - struct bdi_writeback *wb = list_entry(&bdi->wb_list, + struct bdi_writeback *wb = list_entry(&bdi_wb_ctx->wb_list, struct bdi_writeback, bdi_node); might_sleep(); restart: rcu_read_lock(); - list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) { - DEFINE_WB_COMPLETION(fallback_work_done, bdi); + list_for_each_entry_continue_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { + DEFINE_WB_COMPLETION(fallback_work_done, bdi_wb_ctx); struct wb_writeback_work fallback_work; struct wb_writeback_work *work; long nr_pages; @@ -1103,7 +1112,17 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, * And find the associated wb. If the wb isn't there already * there's nothing to flush, don't create one. 
*/ - wb = wb_get_lookup(bdi, memcg_css); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); + if (wb) + break; + } + } else { + wb = wb_get_lookup(&bdi->default_ctx, memcg_css); + } if (!wb) { ret = -ENOENT; goto out_css_put; @@ -1189,8 +1208,8 @@ fs_initcall(cgroup_writeback_init); #else /* CONFIG_CGROUP_WRITEBACK */ -static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } -static void bdi_up_write_wb_switch_rwsem(struct backing_dev_info *bdi) { } +static void bdi_wb_ctx_down_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { } +static void bdi_wb_ctx_up_write_wb_switch_rwsem(struct bdi_writeback_ctx *bdi_wb_ctx) { } static void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb) @@ -1232,13 +1251,14 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, struct wb_writeback_work *base_work, - bool skip_if_busy) + bool skip_if_busy, + struct bdi_writeback_ctx *bdi_wb_ctx) { might_sleep(); - if (!skip_if_busy || !writeback_in_progress(&bdi->wb)) { + if (!skip_if_busy || !writeback_in_progress(&bdi_wb_ctx->wb)) { base_work->auto_free = 0; - wb_queue_work(&bdi->wb, base_work); + wb_queue_work(&bdi_wb_ctx->wb, base_work); } } @@ -2371,8 +2391,18 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, if (!bdi_has_dirty_io(bdi)) return; - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) - wb_start_writeback(wb, reason); + /* traverse all the bdi_wb_ctx of the bdi */ + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) + wb_start_writeback(wb, reason); + } + } else { + list_for_each_entry_rcu(wb, &bdi->default_ctx.wb_list, bdi_node) + wb_start_writeback(wb, reason); + } } void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, @@ -2427,9 +2457,19 @@ static void wakeup_dirtytime_writeback(struct work_struct *w) list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { struct bdi_writeback *wb; - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) - if (!list_empty(&wb->b_dirty_time)) - wb_wakeup(wb); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) + if (!list_empty(&wb->b_dirty_time)) + wb_wakeup(wb); + } + } else { + list_for_each_entry_rcu(wb, &bdi->default_ctx.wb_list, bdi_node) + if (!list_empty(&wb->b_dirty_time)) + wb_wakeup(wb); + } } rcu_read_unlock(); schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ); @@ -2713,11 +2753,12 @@ static void wait_sb_inodes(struct super_block *sb) mutex_unlock(&sb->s_sync_lock); } -static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, - enum wb_reason reason, bool skip_if_busy) +static void __writeback_inodes_sb_nr_bdi_wb_ctx(struct super_block *sb, + unsigned long nr, enum wb_reason reason, + bool skip_if_busy, + struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi); + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_NONE, @@ -2727,12 +2768,29 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, .reason = reason, }; + bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy, 
bdi_wb_ctx); + wb_wait_for_completion(&done); +} + +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, + enum wb_reason reason, bool skip_if_busy) +{ + struct backing_dev_info *bdi = sb->s_bdi; + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); - wb_wait_for_completion(&done); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, + skip_if_busy, bdi_wb_ctx); + } else { + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, + skip_if_busy, &bdi->default_ctx); + } } /** @@ -2785,17 +2843,10 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) } EXPORT_SYMBOL(try_to_writeback_inodes_sb); -/** - * sync_inodes_sb - sync sb inode pages - * @sb: the superblock - * - * This function writes and waits on any dirty inode belonging to this - * super_block. - */ -void sync_inodes_sb(struct super_block *sb) +static void sync_inodes_bdi_wb_ctx(struct super_block *sb, + struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = sb->s_bdi; - DEFINE_WB_COMPLETION(done, bdi); + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_ALL, @@ -2805,6 +2856,22 @@ void sync_inodes_sb(struct super_block *sb) .reason = WB_REASON_SYNC, .for_sync = 1, }; + bdi_wb_ctx_down_write_wb_switch_rwsem(bdi_wb_ctx); + bdi_split_work_to_wbs(bdi, &work, false, bdi_wb_ctx); + wb_wait_for_completion(&done); + bdi_wb_ctx_up_write_wb_switch_rwsem(bdi_wb_ctx); +} + +/** + * sync_inodes_sb - sync sb inode pages + * @sb: the superblock + * + * This function writes and waits on any dirty inode belonging to this + * super_block. + */ +void sync_inodes_sb(struct super_block *sb) +{ + struct backing_dev_info *bdi = sb->s_bdi; /* * Can't skip on !bdi_has_dirty() because we should wait for !dirty @@ -2815,12 +2882,15 @@ void sync_inodes_sb(struct super_block *sb) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ - bdi_down_write_wb_switch_rwsem(bdi); - bdi_split_work_to_wbs(bdi, &work, false); - wb_wait_for_completion(&done); - bdi_up_write_wb_switch_rwsem(bdi); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx); + } + } else { + sync_inodes_bdi_wb_ctx(sb, bdi, &bdi->default_ctx); + } wait_sb_inodes(sb); } EXPORT_SYMBOL(sync_inodes_sb); diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 2ad261082bba..d00ae5b55f33 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -75,10 +75,10 @@ struct wb_completion { * can wait for the completion of all using wb_wait_for_completion(). Work * items which are waited upon aren't freed automatically on completion. 
*/ -#define WB_COMPLETION_INIT(bdi) __WB_COMPLETION_INIT(&(bdi)->wb_waitq) +#define WB_COMPLETION_INIT(bdi_wb_ctx) __WB_COMPLETION_INIT(&(bdi_wb_ctx)->wb_waitq) -#define DEFINE_WB_COMPLETION(cmpl, bdi) \ - struct wb_completion cmpl = WB_COMPLETION_INIT(bdi) +#define DEFINE_WB_COMPLETION(cmpl, bdi_wb_ctx) \ + struct wb_completion cmpl = WB_COMPLETION_INIT(bdi_wb_ctx) /* * Each wb (bdi_writeback) can perform writeback operations, is measured @@ -104,6 +104,7 @@ struct wb_completion { */ struct bdi_writeback { struct backing_dev_info *bdi; /* our parent bdi */ + struct bdi_writeback_ctx *bdi_wb_ctx; unsigned long state; /* Always use atomic bitops on this */ unsigned long last_old_flush; /* last old data flush */ @@ -160,6 +161,16 @@ struct bdi_writeback { #endif }; +struct bdi_writeback_ctx { + struct bdi_writeback wb; /* the root writeback info for this bdi */ + struct list_head wb_list; /* list of all wbs */ +#ifdef CONFIG_CGROUP_WRITEBACK + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ + struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ +#endif + wait_queue_head_t wb_waitq; +}; + struct backing_dev_info { u64 id; struct rb_node rb_node; /* keyed by ->id */ @@ -182,16 +193,13 @@ struct backing_dev_info { * blk-wbt. */ unsigned long last_bdp_sleep; - - struct bdi_writeback wb; /* the root writeback info for this bdi */ - struct list_head wb_list; /* list of all wbs */ + struct bdi_writeback_ctx default_ctx; + bool is_parallel; + int nr_wb_ctx; + struct xarray bdi_wb_ctx_arr; #ifdef CONFIG_CGROUP_WRITEBACK - struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */ - struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ #endif - wait_queue_head_t wb_waitq; - struct device *dev; char dev_name[64]; struct device *owner; @@ -216,7 +224,7 @@ struct wb_lock_cookie { */ static inline bool wb_tryget(struct bdi_writeback *wb) { - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) return percpu_ref_tryget(&wb->refcnt); return true; } @@ -227,7 +235,7 @@ static inline bool wb_tryget(struct bdi_writeback *wb) */ static inline void wb_get(struct bdi_writeback *wb) { - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) percpu_ref_get(&wb->refcnt); } @@ -246,7 +254,7 @@ static inline void wb_put_many(struct bdi_writeback *wb, unsigned long nr) return; } - if (wb != &wb->bdi->wb) + if (wb != &wb->bdi_wb_ctx->wb) percpu_ref_put_many(&wb->refcnt, nr); } diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 8e7af9a03b41..e24cd3fd42b4 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -148,11 +148,29 @@ static inline bool mapping_can_writeback(struct address_space *mapping) return inode_to_bdi(mapping->host)->capabilities & BDI_CAP_WRITEBACK; } +#define for_each_bdi_wb_ctx(bdi, wb_ctx) \ + for (int i = 0; (i) < (bdi)->nr_wb_ctx \ + && ((wb_ctx) = xa_load(&(bdi)->bdi_wb_ctx_arr, i), 1); (i)++) + +static inline struct bdi_writeback_ctx *fetch_bdi_writeback_ctx(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct super_block *sb = inode->i_sb; + struct bdi_writeback_ctx *bdi_wb_ctx; + + if (sb->s_op->get_inode_wb_ctx) + bdi_wb_ctx = sb->s_op->get_inode_wb_ctx(inode); + else + bdi_wb_ctx = &bdi->default_ctx; + return bdi_wb_ctx; +} + #ifdef CONFIG_CGROUP_WRITEBACK -struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, +struct bdi_writeback *wb_get_lookup(struct 
bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css); struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp); void wb_memcg_offline(struct mem_cgroup *memcg); @@ -187,16 +205,17 @@ static inline bool inode_cgwb_enabled(struct inode *inode) * Must be called under rcu_read_lock() which protects the returend wb. * NULL if not found. */ -static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct cgroup_subsys_state *memcg_css; struct bdi_writeback *wb; memcg_css = task_css(current, memory_cgrp_id); if (!memcg_css->parent) - return &bdi->wb; + return &bdi_wb_ctx->wb; - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); /* * %current's blkcg equals the effective blkcg of its memcg. No @@ -217,12 +236,13 @@ static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi * wb_find_current(). */ static inline struct bdi_writeback * -wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +wb_get_create_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, gfp_t gfp) { struct bdi_writeback *wb; rcu_read_lock(); - wb = wb_find_current(bdi); + wb = wb_find_current(bdi, bdi_wb_ctx); if (wb && unlikely(!wb_tryget(wb))) wb = NULL; rcu_read_unlock(); @@ -231,7 +251,7 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) struct cgroup_subsys_state *memcg_css; memcg_css = task_get_css(current, memory_cgrp_id); - wb = wb_get_create(bdi, memcg_css, gfp); + wb = wb_get_create(bdi, bdi_wb_ctx, memcg_css, gfp); css_put(memcg_css); } return wb; @@ -264,7 +284,7 @@ static inline struct bdi_writeback *inode_to_wb_wbc( * If wbc does not have inode attached, it means cgroup writeback was * disabled when wbc started. Just use the default wb in that case. */ - return wbc->wb ? wbc->wb : &inode_to_bdi(inode)->wb; + return wbc->wb ? 
wbc->wb : &fetch_bdi_writeback_ctx(inode)->wb; } /** @@ -324,20 +344,22 @@ static inline bool inode_cgwb_enabled(struct inode *inode) return false; } -static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { - return &bdi->wb; + return &bdi_wb_ctx->wb; } static inline struct bdi_writeback * -wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +wb_get_create_current(struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx, + gfp_t gfp) { - return &bdi->wb; + return &bdi_wb_ctx->wb; } static inline struct bdi_writeback *inode_to_wb(struct inode *inode) { - return &inode_to_bdi(inode)->wb; + return &fetch_bdi_writeback_ctx(inode)->wb; } static inline struct bdi_writeback *inode_to_wb_wbc( diff --git a/include/linux/fs.h b/include/linux/fs.h index 7e29433c5ecc..667f97c68cd1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2228,6 +2228,7 @@ struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); void (*free_inode)(struct inode *); + struct bdi_writeback_ctx* (*get_inode_wb_ctx)(struct inode *); void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, struct writeback_control *wbc); diff --git a/mm/backing-dev.c b/mm/backing-dev.c index e61bbb1bd622..f904994efba1 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -84,26 +84,48 @@ static void collect_wb_stats(struct wb_stats *stats, } #ifdef CONFIG_CGROUP_WRITEBACK -static void bdi_collect_stats(struct backing_dev_info *bdi, - struct wb_stats *stats) + +static void bdi_wb_ctx_collect_stats(struct bdi_writeback_ctx *bdi_wb_ctx, + struct wb_stats *stats) { struct bdi_writeback *wb; - rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { if (!wb_tryget(wb)) continue; collect_wb_stats(stats, wb); wb_put(wb); } +} +static void bdi_collect_stats(struct backing_dev_info *bdi, + struct wb_stats *stats) +{ + rcu_read_lock(); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + bdi_wb_ctx_collect_stats(bdi_wb_ctx, stats); + } + } else { + bdi_wb_ctx_collect_stats(&bdi->default_ctx, stats); + } rcu_read_unlock(); } #else static void bdi_collect_stats(struct backing_dev_info *bdi, struct wb_stats *stats) { - collect_wb_stats(stats, &bdi->wb); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + collect_wb_stats(stats, &bdi_wb_ctx->wb); + } + } else { + collect_wb_stats(stats, &bdi->default_ctx.wb); + } } #endif @@ -135,8 +157,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "b_io: %10lu\n" "b_more_io: %10lu\n" "b_dirty_time: %10lu\n" - "bdi_list: %10u\n" - "state: %10lx\n", + "bdi_list: %10u\n", + /* TBD: what to do for state */ + /* "state: %10lx\n", */ K(stats.nr_writeback), K(stats.nr_reclaimable), K(stats.wb_thresh), @@ -149,7 +172,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) stats.nr_io, stats.nr_more_io, stats.nr_dirty_time, - !list_empty(&bdi->bdi_list), bdi->wb.state); + !list_empty(&bdi->bdi_list));/* bdi->wb.state); */ return 0; } @@ -190,17 +213,17 @@ static void wb_stats_show(struct seq_file *m, struct bdi_writeback *wb, wb->state); } -static int cgwb_debug_stats_show(struct seq_file *m, void *v) + +static int 
cgwb_debug_stats_bdi_wb_ctx(struct seq_file *m, + struct bdi_writeback_ctx *bdi_wb_ctx) { - struct backing_dev_info *bdi = m->private; + struct bdi_writeback *wb; unsigned long background_thresh; unsigned long dirty_thresh; - struct bdi_writeback *wb; global_dirty_limits(&background_thresh, &dirty_thresh); - rcu_read_lock(); - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { + list_for_each_entry_rcu(wb, &bdi_wb_ctx->wb_list, bdi_node) { struct wb_stats stats = { .dirty_thresh = dirty_thresh }; if (!wb_tryget(wb)) @@ -224,6 +247,23 @@ static int cgwb_debug_stats_show(struct seq_file *m, void *v) wb_put(wb); } + return 0; +} + +static int cgwb_debug_stats_show(struct seq_file *m, void *v) +{ + struct backing_dev_info *bdi = m->private; + + rcu_read_lock(); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + cgwb_debug_stats_bdi_wb_ctx(m, bdi_wb_ctx); + } + } else { + cgwb_debug_stats_bdi_wb_ctx(m, &bdi->default_ctx); + } rcu_read_unlock(); return 0; @@ -520,6 +560,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, memset(wb, 0, sizeof(*wb)); wb->bdi = bdi; + wb->bdi_wb_ctx = &bdi->default_ctx; wb->last_old_flush = jiffies; INIT_LIST_HEAD(&wb->b_dirty); INIT_LIST_HEAD(&wb->b_io); @@ -643,11 +684,11 @@ static void cgwb_release(struct percpu_ref *refcnt) queue_work(cgwb_release_wq, &wb->release_work); } -static void cgwb_kill(struct bdi_writeback *wb) +static void cgwb_kill(struct bdi_writeback *wb, struct bdi_writeback_ctx *bdi_wb_ctx) { lockdep_assert_held(&cgwb_lock); - WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); + WARN_ON(!radix_tree_delete(&bdi_wb_ctx->cgwb_tree, wb->memcg_css->id)); list_del(&wb->memcg_node); list_del(&wb->blkcg_node); list_add(&wb->offline_node, &offline_cgwbs); @@ -662,6 +703,7 @@ static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) } static int cgwb_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { struct mem_cgroup *memcg; @@ -678,9 +720,9 @@ static int cgwb_create(struct backing_dev_info *bdi, /* look up again under lock and discard on blkcg mismatch */ spin_lock_irqsave(&cgwb_lock, flags); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); if (wb && wb->blkcg_css != blkcg_css) { - cgwb_kill(wb); + cgwb_kill(wb, bdi_wb_ctx); wb = NULL; } spin_unlock_irqrestore(&cgwb_lock, flags); @@ -721,12 +763,12 @@ static int cgwb_create(struct backing_dev_info *bdi, */ ret = -ENODEV; spin_lock_irqsave(&cgwb_lock, flags); - if (test_bit(WB_registered, &bdi->wb.state) && + if (test_bit(WB_registered, &bdi_wb_ctx->wb.state) && blkcg_cgwb_list->next && memcg_cgwb_list->next) { /* we might have raced another instance of this function */ - ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb); + ret = radix_tree_insert(&bdi_wb_ctx->cgwb_tree, memcg_css->id, wb); if (!ret) { - list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list); + list_add_tail_rcu(&wb->bdi_node, &bdi_wb_ctx->wb_list); list_add(&wb->memcg_node, memcg_cgwb_list); list_add(&wb->blkcg_node, blkcg_cgwb_list); blkcg_pin_online(blkcg_css); @@ -779,16 +821,16 @@ static int cgwb_create(struct backing_dev_info *bdi, * each lookup. On mismatch, the existing wb is discarded and a new one is * created. 
*/ -struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, +struct bdi_writeback *wb_get_lookup(struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css) { struct bdi_writeback *wb; if (!memcg_css->parent) - return &bdi->wb; + return &bdi_wb_ctx->wb; rcu_read_lock(); - wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + wb = radix_tree_lookup(&bdi_wb_ctx->cgwb_tree, memcg_css->id); if (wb) { struct cgroup_subsys_state *blkcg_css; @@ -813,6 +855,7 @@ struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi, * create one. See wb_get_lookup() for more details. */ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx, struct cgroup_subsys_state *memcg_css, gfp_t gfp) { @@ -821,45 +864,67 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, might_alloc(gfp); do { - wb = wb_get_lookup(bdi, memcg_css); - } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); + } while (!wb && !cgwb_create(bdi, bdi_wb_ctx, memcg_css, gfp)); return wb; } +static int cgwb_bdi_wb_ctx_init(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) +{ + int ret; + + INIT_RADIX_TREE(&bdi_wb_ctx->cgwb_tree, GFP_ATOMIC); + init_rwsem(&bdi_wb_ctx->wb_switch_rwsem); + ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + /* better error handling */ + if (!ret) { + bdi_wb_ctx->wb.memcg_css = &root_mem_cgroup->css; + bdi_wb_ctx->wb.blkcg_css = blkcg_root_css; + } + return ret; +} + static int cgwb_bdi_init(struct backing_dev_info *bdi) { int ret; - INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC); mutex_init(&bdi->cgwb_release_mutex); - init_rwsem(&bdi->wb_switch_rwsem); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; - ret = wb_init(&bdi->wb, bdi, GFP_KERNEL); - if (!ret) { - bdi->wb.memcg_css = &root_mem_cgroup->css; - bdi->wb.blkcg_css = blkcg_root_css; + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + ret = cgwb_bdi_wb_ctx_init(bdi, bdi_wb_ctx); + if (ret) + return ret; + } + } else { + cgwb_bdi_wb_ctx_init(bdi, &bdi->default_ctx); } + return ret; } -static void cgwb_bdi_unregister(struct backing_dev_info *bdi) +/* callers should create a loop and pass bdi_wb_ctx */ +static void cgwb_bdi_unregister(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { struct radix_tree_iter iter; void **slot; struct bdi_writeback *wb; - WARN_ON(test_bit(WB_registered, &bdi->wb.state)); + WARN_ON(test_bit(WB_registered, &bdi_wb_ctx->wb.state)); spin_lock_irq(&cgwb_lock); - radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) - cgwb_kill(*slot); + radix_tree_for_each_slot(slot, &bdi_wb_ctx->cgwb_tree, &iter, 0) + cgwb_kill(*slot, bdi_wb_ctx); spin_unlock_irq(&cgwb_lock); mutex_lock(&bdi->cgwb_release_mutex); spin_lock_irq(&cgwb_lock); - while (!list_empty(&bdi->wb_list)) { - wb = list_first_entry(&bdi->wb_list, struct bdi_writeback, + while (!list_empty(&bdi_wb_ctx->wb_list)) { + wb = list_first_entry(&bdi_wb_ctx->wb_list, struct bdi_writeback, bdi_node); spin_unlock_irq(&cgwb_lock); wb_shutdown(wb); @@ -930,7 +995,7 @@ void wb_memcg_offline(struct mem_cgroup *memcg) spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node) - cgwb_kill(wb); + cgwb_kill(wb, wb->bdi_wb_ctx); memcg_cgwb_list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); @@ -950,15 +1015,16 @@ void wb_blkcg_offline(struct cgroup_subsys_state *css) spin_lock_irq(&cgwb_lock); list_for_each_entry_safe(wb, next, list, blkcg_node) - 
cgwb_kill(wb); + cgwb_kill(wb, wb->bdi_wb_ctx); list->next = NULL; /* prevent new wb's */ spin_unlock_irq(&cgwb_lock); } -static void cgwb_bdi_register(struct backing_dev_info *bdi) +static void cgwb_bdi_register(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { spin_lock_irq(&cgwb_lock); - list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); + list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list); spin_unlock_irq(&cgwb_lock); } @@ -981,14 +1047,31 @@ subsys_initcall(cgwb_init); static int cgwb_bdi_init(struct backing_dev_info *bdi) { - return wb_init(&bdi->wb, bdi, GFP_KERNEL); + int ret; + + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + ret = wb_init(&bdi_wb_ctx->wb, bdi, GFP_KERNEL); + /* better error handling */ + if (!ret) + return ret; + } + } else { + ret = wb_init(&bdi->default_ctx.wb, bdi, GFP_KERNEL); + } + return ret; } -static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { } +static void cgwb_bdi_unregister(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { } -static void cgwb_bdi_register(struct backing_dev_info *bdi) +/* callers should create a loop and pass bdi_wb_ctx */ +static void cgwb_bdi_register(struct backing_dev_info *bdi, + struct bdi_writeback_ctx *bdi_wb_ctx) { - list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list); + list_add_tail_rcu(&bdi_wb_ctx->wb.bdi_node, &bdi_wb_ctx->wb_list); } static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb) @@ -1003,12 +1086,14 @@ int bdi_init(struct backing_dev_info *bdi) bdi->dev = NULL; kref_init(&bdi->refcnt); + bdi->nr_wb_ctx = 1; + bdi->is_parallel = false; bdi->min_ratio = 0; bdi->max_ratio = 100 * BDI_RATIO_SCALE; bdi->max_prop_frac = FPROP_FRAC_BASE; INIT_LIST_HEAD(&bdi->bdi_list); - INIT_LIST_HEAD(&bdi->wb_list); - init_waitqueue_head(&bdi->wb_waitq); + INIT_LIST_HEAD(&bdi->default_ctx.wb_list); + init_waitqueue_head(&bdi->default_ctx.wb_waitq); bdi->last_bdp_sleep = jiffies; return cgwb_bdi_init(bdi); @@ -1095,11 +1180,20 @@ int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args) if (IS_ERR(dev)) return PTR_ERR(dev); - cgwb_bdi_register(bdi); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + cgwb_bdi_register(bdi, bdi_wb_ctx); + set_bit(WB_registered, &bdi_wb_ctx->wb.state); + } + } else { + cgwb_bdi_register(bdi, &bdi->default_ctx); + set_bit(WB_registered, &bdi->default_ctx.wb.state); + } bdi->dev = dev; bdi_debug_register(bdi, dev_name(dev)); - set_bit(WB_registered, &bdi->wb.state); spin_lock_bh(&bdi_lock); @@ -1155,8 +1249,17 @@ void bdi_unregister(struct backing_dev_info *bdi) /* make sure nobody finds us on the bdi_list anymore */ bdi_remove_from_list(bdi); - wb_shutdown(&bdi->wb); - cgwb_bdi_unregister(bdi); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { + wb_shutdown(&bdi_wb_ctx->wb); + cgwb_bdi_unregister(bdi, bdi_wb_ctx); + } + } else { + wb_shutdown(&bdi->default_ctx.wb); + cgwb_bdi_unregister(bdi, &bdi->default_ctx); + } /* * If this BDI's min ratio has been set, use bdi_set_min_ratio() to @@ -1183,9 +1286,18 @@ static void release_bdi(struct kref *ref) struct backing_dev_info *bdi = container_of(ref, struct backing_dev_info, refcnt); - WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb.state)); WARN_ON_ONCE(bdi->dev); - wb_exit(&bdi->wb); + if (bdi->is_parallel) { + struct bdi_writeback_ctx *bdi_wb_ctx; + + for_each_bdi_wb_ctx(bdi, 
bdi_wb_ctx) { + WARN_ON_ONCE(test_bit(WB_registered, &bdi_wb_ctx->wb.state)); + wb_exit(&bdi_wb_ctx->wb); + } + } else { + WARN_ON_ONCE(test_bit(WB_registered, &bdi->default_ctx.wb.state)); + wb_exit(&bdi->default_ctx.wb); + } kfree(bdi); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d9861e42b2bd..51420474b4cf 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2101,6 +2101,7 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, { struct inode *inode = mapping->host; struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback_ctx *bdi_wb_ctx = fetch_bdi_writeback_ctx(inode); struct bdi_writeback *wb = NULL; int ratelimit; int ret = 0; @@ -2110,9 +2111,9 @@ int balance_dirty_pages_ratelimited_flags(struct address_space *mapping, return ret; if (inode_cgwb_enabled(inode)) - wb = wb_get_create_current(bdi, GFP_KERNEL); + wb = wb_get_create_current(bdi, bdi_wb_ctx, GFP_KERNEL); if (!wb) - wb = &bdi->wb; + wb = &bdi_wb_ctx->wb; ratelimit = current->nr_dirtied_pause; if (wb->dirty_exceeded) -- 2.25.1 [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
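Since the filesystem-side plumbing is omitted from the patch above, here is one hypothetical way the proposed ->get_inode_wb_ctx() hook could be filled in by a filesystem wanting per-AG writeback contexts, reusing the bdi_wb_ctx_arr and nr_wb_ctx fields introduced by the patch. The mapping through XFS_INO_TO_AGNO() and the assumption that the mount path stored one bdi_writeback_ctx per AG, indexed by AG number, are illustrative choices and not part of the posted code:

/*
 * Hypothetical addition to fs/xfs/xfs_super.c (usual xfs_*.h includes
 * assumed): map the inode to its allocation group and return the
 * writeback context that was set up for that AG at mount time.
 */
static struct bdi_writeback_ctx *xfs_get_inode_wb_ctx(struct inode *inode)
{
	struct xfs_inode	*ip = XFS_I(inode);
	struct xfs_mount	*mp = ip->i_mount;
	struct backing_dev_info	*bdi = inode_to_bdi(inode);
	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ip->i_ino);

	/* one ctx per AG was stored in bdi->bdi_wb_ctx_arr at mount */
	return xa_load(&bdi->bdi_wb_ctx_arr, agno % bdi->nr_wb_ctx);
}

The hook would then be wired up as .get_inode_wb_ctx = xfs_get_inode_wb_ctx in xfs_super_operations, with the mount path creating the bdi with nr_wb_ctx equal to the AG count before registration.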
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback 2025-03-18 11:37 ` Anuj Gupta @ 2025-03-19 15:54 ` Jan Kara 2025-03-20 7:08 ` Anuj Gupta 0 siblings, 1 reply; 22+ messages in thread From: Jan Kara @ 2025-03-19 15:54 UTC (permalink / raw) To: Anuj Gupta Cc: Jan Kara, Kundan Kumar, Dave Chinner, Christoph Hellwig, lsf-pc, linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev On Tue 18-03-25 17:07:12, Anuj Gupta wrote: > > I was thinking about how to best parallelize the writeback and I think > > there are two quite different demands for which we probably want two > > different levels of parallelism. > > > > One case is the situation when the filesystem for example has multiple > > underlying devices (like btrfs or bcachefs) or for other reasons writeback > > to different parts is fairly independent (like for different XFS AGs). Here > > we want parallelism at rather high level I think including separate > > dirty throttling, tracking of writeback bandwidth etc.. It is *almost* like > > separate bdis (struct backing_dev_info) but I think it would be technically > > and also conceptually somewhat easier to do the multiplexing by factoring > > out: > > > > struct bdi_writeback wb; /* the root writeback info for this bdi */ > > struct list_head wb_list; /* list of all wbs */ > > #ifdef CONFIG_CGROUP_WRITEBACK > > struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > > struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > > #endif > > wait_queue_head_t wb_waitq; > > > > into a new structure (looking for a good name - bdi_writeback_context???) > > that can get multiplied (filesystem can create its own bdi on mount and > > configure there number of bdi_writeback_contexts it wants). We also need to > > add a hook sb->s_ops->get_inode_wb_context() called from __inode_attach_wb() > > which will return appropriate bdi_writeback_context (or perhaps just it's > > index?) for an inode. This will be used by the filesystem to direct > > writeback code where the inode should go. This is kind of what Kundan did > > in the last revision of his patches but I hope this approach should > > somewhat limit the changes necessary to writeback infrastructure - the > > patch 2 in his series is really unreviewably large... > > Thanks Jan. > > I tried to modify the existing infra based on your suggestion [1]. This > only takes care of the first case that you mentioned. I am yet to start > to evaluate and implement the second case (when amount of cpu work is > high). The patch is large, because it requires changing all the places > that uses bdi's fields that have now been moved to a new structure. I > will try to streamline it further. > > I have omitted the xfs side plumbing for now. > Can you please see if this aligns with the scheme that you had in mind? > If the infra looks fine, I can plumb XFS changes on top of it. > > Note: This patch is only compile tested, haven't run any tests on it. Sure, this is fine for this early prototyping phase. > Subject: [PATCH] writeback: add infra for parallel writeback > > Introduce a new bdi_writeback_ctx structure that enables us to have > multiple writeback contexts for parallel writeback. > > Existing single threaded writeback will use default_ctx. > > Filesystem now have the option to create there own bdi aind populate > multiple bdi_writeback_ctx in bdi's bdi_wb_ctx_arr (xarray for now, but > plan to move to use a better structure like list_lru). 
> > Introduce a new hook get_inode_wb_ctx() in super_operations which takes > inode as a parameter and returns the bdi_wb_ctx corresponding to the > inode. > > Modify all the functions/places that operate on bdi's wb, wb_list, > cgwb_tree, wb_switch_rwsem, wb_waitq as these fields have now been moved > to bdi_writeback_ctx > > Store a reference to bdi_wb_ctx in bdi_writeback. > > Suggested-by: Jan Kara <jack@suse.cz> > Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> > Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Thanks for the patch! Overall I think we are going in the right direction :) Some comments below. > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 3cd99e2dc6ac..4c47c298f174 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -265,23 +265,27 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio) > { > struct backing_dev_info *bdi = inode_to_bdi(inode); > struct bdi_writeback *wb = NULL; > + struct bdi_writeback_ctx *bdi_writeback_ctx = > + fetch_bdi_writeback_ctx(inode); ^^^ I'd call this inode_bdi_writeback_ctx(). I think that name tells a bit more. > if (inode_cgwb_enabled(inode)) { > struct cgroup_subsys_state *memcg_css; > > if (folio) { > memcg_css = mem_cgroup_css_from_folio(folio); > - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); > + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, > + GFP_ATOMIC); > } else { > /* must pin memcg_css, see wb_get_create() */ > memcg_css = task_get_css(current, memory_cgrp_id); > - wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); > + wb = wb_get_create(bdi, bdi_writeback_ctx, memcg_css, > + GFP_ATOMIC); > css_put(memcg_css); > } > } > > if (!wb) > - wb = &bdi->wb; > + wb = &bdi_writeback_ctx->wb; Perhaps as a preparation patch, can you please rename bdi->wb to bdi->root_wb (just mechanical replacement) and then carry over this name to bdi_writeback_ctx? Because we have too many wbs here already and as the structure is getting more complex, explanatory naming becomes more important. Thanks! > @@ -1103,7 +1112,17 @@ int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, > * And find the associated wb. If the wb isn't there already > * there's nothing to flush, don't create one. > */ > - wb = wb_get_lookup(bdi, memcg_css); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > + > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { > + wb = wb_get_lookup(bdi_wb_ctx, memcg_css); > + if (wb) > + break; > + } > + } else { > + wb = wb_get_lookup(&bdi->default_ctx, memcg_css); > + } Why do you introduce this bdi->is_parallel bool? Why don't we unconditionally do: for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { wb = wb_get_lookup(bdi_wb_ctx, memcg_css); if (wb) break; } It would seem more obvious and with less code duplication... It might require a bit of magic inside for_each_bdi_wb_ctx() but I think it pays off. BTW, you should initialize 'wb' before the loop to NULL. Otherwise I expect some compilers to complain and also for readers it makes the test below obviously do the right thing. > if (!wb) { > ret = -ENOENT; > goto out_css_put; ... > @@ -1232,13 +1251,14 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) > > static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, > struct wb_writeback_work *base_work, > - bool skip_if_busy) > + bool skip_if_busy, > + struct bdi_writeback_ctx *bdi_wb_ctx) Nit; logical position for bdi_wb_ctx argument would be just after 'bdi' but see below - I think we could pass only bdi_wb_ctx here after some changes. 
> @@ -2713,11 +2753,12 @@ static void wait_sb_inodes(struct super_block *sb) > mutex_unlock(&sb->s_sync_lock); > } > > -static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > - enum wb_reason reason, bool skip_if_busy) > +static void __writeback_inodes_sb_nr_bdi_wb_ctx(struct super_block *sb, > + unsigned long nr, enum wb_reason reason, > + bool skip_if_busy, > + struct bdi_writeback_ctx *bdi_wb_ctx) > { > - struct backing_dev_info *bdi = sb->s_bdi; > - DEFINE_WB_COMPLETION(done, bdi); > + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); > struct wb_writeback_work work = { > .sb = sb, > .sync_mode = WB_SYNC_NONE, > @@ -2727,12 +2768,29 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > .reason = reason, > }; > > + bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy, bdi_wb_ctx); > + wb_wait_for_completion(&done); > +} > + > +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, > + enum wb_reason reason, bool skip_if_busy) > +{ > + struct backing_dev_info *bdi = sb->s_bdi; > + > if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) > return; > WARN_ON(!rwsem_is_locked(&sb->s_umount)); > > - bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); > - wb_wait_for_completion(&done); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > + > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) > + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, > + skip_if_busy, bdi_wb_ctx); > + } else { > + __writeback_inodes_sb_nr_bdi_wb_ctx(sb, nr, reason, > + skip_if_busy, &bdi->default_ctx); > + } > } If we drop this 'is_parallel' thing, I think we don't need the new __writeback_inodes_sb_nr_bdi_wb_ctx() function and can keep everything in one function. > > /** > @@ -2785,17 +2843,10 @@ void try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) > } > EXPORT_SYMBOL(try_to_writeback_inodes_sb); > > -/** > - * sync_inodes_sb - sync sb inode pages > - * @sb: the superblock > - * > - * This function writes and waits on any dirty inode belonging to this > - * super_block. > - */ > -void sync_inodes_sb(struct super_block *sb) > +static void sync_inodes_bdi_wb_ctx(struct super_block *sb, > + struct backing_dev_info *bdi, struct bdi_writeback_ctx *bdi_wb_ctx) > { > - struct backing_dev_info *bdi = sb->s_bdi; > - DEFINE_WB_COMPLETION(done, bdi); > + DEFINE_WB_COMPLETION(done, bdi_wb_ctx); > struct wb_writeback_work work = { > .sb = sb, > .sync_mode = WB_SYNC_ALL, > @@ -2805,6 +2856,22 @@ void sync_inodes_sb(struct super_block *sb) > .reason = WB_REASON_SYNC, > .for_sync = 1, > }; > + bdi_wb_ctx_down_write_wb_switch_rwsem(bdi_wb_ctx); > + bdi_split_work_to_wbs(bdi, &work, false, bdi_wb_ctx); > + wb_wait_for_completion(&done); > + bdi_wb_ctx_up_write_wb_switch_rwsem(bdi_wb_ctx); > +} > + > +/** > + * sync_inodes_sb - sync sb inode pages > + * @sb: the superblock > + * > + * This function writes and waits on any dirty inode belonging to this > + * super_block. 
> + */ > +void sync_inodes_sb(struct super_block *sb) > +{ > + struct backing_dev_info *bdi = sb->s_bdi; > > /* > * Can't skip on !bdi_has_dirty() because we should wait for !dirty > @@ -2815,12 +2882,15 @@ void sync_inodes_sb(struct super_block *sb) > return; > WARN_ON(!rwsem_is_locked(&sb->s_umount)); > > - /* protect against inode wb switch, see inode_switch_wbs_work_fn() */ > - bdi_down_write_wb_switch_rwsem(bdi); > - bdi_split_work_to_wbs(bdi, &work, false); > - wb_wait_for_completion(&done); > - bdi_up_write_wb_switch_rwsem(bdi); > + if (bdi->is_parallel) { > + struct bdi_writeback_ctx *bdi_wb_ctx; > > + for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) { > + sync_inodes_bdi_wb_ctx(sb, bdi, bdi_wb_ctx); > + } > + } else { > + sync_inodes_bdi_wb_ctx(sb, bdi, &bdi->default_ctx); > + } > wait_sb_inodes(sb); > } > EXPORT_SYMBOL(sync_inodes_sb); The same comment as above. > @@ -104,6 +104,7 @@ struct wb_completion { > */ > struct bdi_writeback { > struct backing_dev_info *bdi; /* our parent bdi */ > + struct bdi_writeback_ctx *bdi_wb_ctx; > > unsigned long state; /* Always use atomic bitops on this */ > unsigned long last_old_flush; /* last old data flush */ > @@ -160,6 +161,16 @@ struct bdi_writeback { > #endif > }; > > +struct bdi_writeback_ctx { > + struct bdi_writeback wb; /* the root writeback info for this bdi */ > + struct list_head wb_list; /* list of all wbs */ > +#ifdef CONFIG_CGROUP_WRITEBACK > + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > + struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > +#endif > + wait_queue_head_t wb_waitq; > +}; > + As I was seeing the patch, I'd also add: struct backing_dev_info *bdi; pointer to bdi_writeback_ctx. Then in most functions, you can just pass bdi_wb_ctx into them instead of both bdi *and* bdi_wb_ctx and it makes interfaces somewhat nicer. > struct backing_dev_info { > u64 id; > struct rb_node rb_node; /* keyed by ->id */ > @@ -182,16 +193,13 @@ struct backing_dev_info { > * blk-wbt. > */ > unsigned long last_bdp_sleep; > - > - struct bdi_writeback wb; /* the root writeback info for this bdi */ > - struct list_head wb_list; /* list of all wbs */ > + struct bdi_writeback_ctx default_ctx; > + bool is_parallel; > + int nr_wb_ctx; > + struct xarray bdi_wb_ctx_arr; I think xarray here is overkill. I'd just make this a plain array: struct bdi_writeback_ctx **bdi_wb_ctx_arr; which will get allocated with nr_wb_ctx entries during bdi_init(). Also I'd make default_ctx just be entry at index 0 in this array. I'm undecided whether it will be clearer to just drop default_ctx altogether or keep it and set: struct bdi_writeback_ctx *default_ctx = bdi_wb_ctx_arr[0]; on init so I'll leave that up to you. > #ifdef CONFIG_CGROUP_WRITEBACK > - struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ > struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */ > - struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */ > #endif > - wait_queue_head_t wb_waitq; > - > struct device *dev; > char dev_name[64]; > struct device *owner; ... 
> +static inline struct bdi_writeback_ctx *fetch_bdi_writeback_ctx(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +	struct super_block *sb = inode->i_sb;
> +	struct bdi_writeback_ctx *bdi_wb_ctx;
> +
> +	if (sb->s_op->get_inode_wb_ctx)
> +		bdi_wb_ctx = sb->s_op->get_inode_wb_ctx(inode);
> +	else
> +		bdi_wb_ctx = &bdi->default_ctx;
> +	return bdi_wb_ctx;
> +}

The indirect call (and the handling in there) might get expensive given all
the places where this is called. For now I think we are fine, but eventually
we might need to find space in struct inode (I know that's a hard sell) for
a u8/u16 to store the writeback context index there. But first let's get
this working before we bury ourselves in (potentially premature) performance
optimizations.

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7e29433c5ecc..667f97c68cd1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2228,6 +2228,7 @@ struct super_operations {
>  	struct inode *(*alloc_inode)(struct super_block *sb);
>  	void (*destroy_inode)(struct inode *);
>  	void (*free_inode)(struct inode *);
> +	struct bdi_writeback_ctx* (*get_inode_wb_ctx)(struct inode *);
	                        ^^ wrongly placed space here - it should be
	                           "struct bdi_writeback_ctx *(*get_inode_wb_ctx)".
> 
>  	void (*dirty_inode) (struct inode *, int flags);
>  	int (*write_inode) (struct inode *, struct writeback_control *wbc);

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread
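To make the plain-array suggestion above concrete, here is a rough sketch of
how the contexts could be allocated once during bdi_init(); the helper name
and the error handling are illustrative only and are not taken from the
posted patches:

static int bdi_alloc_wb_ctxs(struct backing_dev_info *bdi, int nr)
{
	int i;

	/* Plain array of pointers, sized once at init time. */
	bdi->bdi_wb_ctx_arr = kcalloc(nr, sizeof(*bdi->bdi_wb_ctx_arr),
				      GFP_KERNEL);
	if (!bdi->bdi_wb_ctx_arr)
		return -ENOMEM;

	for (i = 0; i < nr; i++) {
		bdi->bdi_wb_ctx_arr[i] = kzalloc(sizeof(*bdi->bdi_wb_ctx_arr[i]),
						 GFP_KERNEL);
		if (!bdi->bdi_wb_ctx_arr[i])
			goto out_free;
		/* The usual wb/cgwb init for each context would go here. */
	}
	bdi->nr_wb_ctx = nr;
	/* Entry 0 plays the role of today's single default context. */
	return 0;

out_free:
	while (i--)
		kfree(bdi->bdi_wb_ctx_arr[i]);
	kfree(bdi->bdi_wb_ctx_arr);
	bdi->bdi_wb_ctx_arr = NULL;
	return -ENOMEM;
}

With nr = 1 this degenerates to the current single-writeback behaviour,
which is what makes dropping is_parallel attractive.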
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback
  2025-03-19 15:54                 ` Jan Kara
@ 2025-03-20  7:08                   ` Anuj Gupta
  0 siblings, 0 replies; 22+ messages in thread
From: Anuj Gupta @ 2025-03-20  7:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Kundan Kumar, Dave Chinner, Christoph Hellwig, lsf-pc,
	linux-fsdevel, mcgrof, joshi.k, axboe, clm, willy, gost.dev

> I think xarray here is overkill. I'd just make this a plain array:
>
> 	struct bdi_writeback_ctx **bdi_wb_ctx_arr;
>
> which will get allocated with nr_wb_ctx entries during bdi_init(). Also I'd
> make default_ctx just be the entry at index 0 in this array. I'm undecided
> whether it will be clearer to just drop default_ctx altogether or keep it
> and set:
>
> 	struct bdi_writeback_ctx *default_ctx = bdi_wb_ctx_arr[0];
>
> on init, so I'll leave that up to you.

Killing default_ctx completely seems like the better choice here. This
allows us to unconditionally use the for_each_bdi_wb_ctx loop, even for the
existing single-writeback case, by simply relying on bdi_wb_ctx[0]. It also
eliminates the need for is_parallel, simplifying the design.

Thanks for the detailed review, Jan! Your suggestions look good to me and
help streamline the plumbing. I'll incorporate them in the next version.

Thanks,
Anuj Gupta

^ permalink raw reply	[flat|nested] 22+ messages in thread
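With is_parallel and default_ctx gone, the __writeback_inodes_sb_nr() path
quoted earlier could collapse back into a single function along the lines
below. This is only a sketch based on the hunks in the review and the
existing fs/fs-writeback.c code, assuming the patch's two-argument
for_each_bdi_wb_ctx() and its context-based DEFINE_WB_COMPLETION(); it is
not the actual next version of the series:

static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
				     enum wb_reason reason, bool skip_if_busy)
{
	struct backing_dev_info *bdi = sb->s_bdi;
	struct bdi_writeback_ctx *bdi_wb_ctx;

	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
		return;
	WARN_ON(!rwsem_is_locked(&sb->s_umount));

	/* One pass per writeback context; a single context matches today. */
	for_each_bdi_wb_ctx(bdi, bdi_wb_ctx) {
		DEFINE_WB_COMPLETION(done, bdi_wb_ctx);
		struct wb_writeback_work work = {
			.sb			= sb,
			.sync_mode		= WB_SYNC_NONE,
			.tagged_writepages	= 1,
			.done			= &done,
			.nr_pages		= nr,
			.reason			= reason,
		};

		bdi_split_work_to_wbs(bdi, &work, skip_if_busy, bdi_wb_ctx);
		wb_wait_for_completion(&done);
	}
}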
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback
  2025-02-11  1:13         ` Dave Chinner
  2025-02-11 13:43           ` Jan Kara
  2025-02-20 14:19           ` Kundan Kumar
@ 2025-03-12 17:47           ` Luis Chamberlain
  2025-03-13 19:39             ` Jan Kara
  2 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2025-03-12 17:47 UTC (permalink / raw)
  To: Dave Chinner, David Bueso
  Cc: Jan Kara, Christoph Hellwig, Kundan Kumar, lsf-pc, linux-fsdevel,
	anuj20.g, joshi.k, axboe, clm, willy, gost.dev

On Tue, Feb 11, 2025 at 12:13:18PM +1100, Dave Chinner wrote:
> Should we be looking towards using a subset of the existing list_lru
> functionality for writeback contexts here? i.e. create a list_lru
> object with N-way scalability, allow the fs to provide an
> inode-number-to-list mapping function, and use the list_lru
> interfaces to abstract away everything physical and cgroup related
> for tracking dirty inodes?
>
> Then selecting inodes for writeback becomes a list_lru_walk()
> variant depending on what needs to be written back (e.g. physical
> node, memcg, both, everything that is dirty everywhere, etc).

I *suspect* you're referring to abstracting or sharing list_lru's
sharding-to-NUMA-node functionality, so that we can divide objects across
nodes in similar ways for different use cases? list_lru itself is about
reclaim, not writeback, but from my reading that per-node sharding is the
golden nugget to focus on.

  Luis

^ permalink raw reply	[flat|nested] 22+ messages in thread
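As a purely illustrative example of the "inode-number-to-list mapping
function" Dave describes above, the generic fallback could be as simple as
hashing the inode number, with a filesystem like XFS overriding it to map by
allocation group so that physically related inodes land on the same list.
The function below is hypothetical and not an existing kernel interface:

#include <linux/fs.h>
#include <linux/hash.h>

/*
 * Hypothetical default mapping: spread inodes across nr_lists dirty
 * lists by hashing the inode number.  A filesystem could supply its
 * own mapping (e.g. by XFS allocation group) instead.
 */
static unsigned int inode_to_dirty_list(struct inode *inode,
					unsigned int nr_lists)
{
	return hash_long(inode->i_ino, 32) % nr_lists;
}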
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback
  2025-03-12 17:47           ` Luis Chamberlain
@ 2025-03-13 19:39             ` Jan Kara
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2025-03-13 19:39 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Dave Chinner, David Bueso, Jan Kara, Christoph Hellwig,
	Kundan Kumar, lsf-pc, linux-fsdevel, anuj20.g, joshi.k, axboe,
	clm, willy, gost.dev

On Wed 12-03-25 10:47:12, Luis Chamberlain wrote:
> On Tue, Feb 11, 2025 at 12:13:18PM +1100, Dave Chinner wrote:
> > Should we be looking towards using a subset of the existing list_lru
> > functionality for writeback contexts here? i.e. create a list_lru
> > object with N-way scalability, allow the fs to provide an
> > inode-number-to-list mapping function, and use the list_lru
> > interfaces to abstract away everything physical and cgroup related
> > for tracking dirty inodes?
> >
> > Then selecting inodes for writeback becomes a list_lru_walk()
> > variant depending on what needs to be written back (e.g. physical
> > node, memcg, both, everything that is dirty everywhere, etc).
>
> I *suspect* you're referring to abstracting or sharing list_lru's
> sharding-to-NUMA-node functionality, so that we can divide objects across
> nodes in similar ways for different use cases? list_lru itself is about
> reclaim, not writeback, but from my reading that per-node sharding is the
> golden nugget to focus on.

Dave was speaking of list_lru mostly as an example of how we already have a
structure that is both memcg aware and has N-way parallelism built into it
(for NUMA nodes). For writeback we need something similar - we already have
cgroup awareness and now you want to add N-way parallelism. So the idea was
that we could possibly create a generic structure supporting this, usable
for both. But the details matter here for what ends up being practical...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread
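For the sake of discussion, such a generic structure might look something
like the sketch below: one shard per writeback context, with the shard count
and the inode-to-shard mapping supplied by the user of the structure,
loosely mirroring how list_lru keeps one list per NUMA node. All of the
names here are hypothetical, not proposed kernel API:

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* One shard: its own lock plus the usual dirty/io lists. */
struct wb_list_shard {
	spinlock_t		lock;
	struct list_head	b_dirty;
	struct list_head	b_io;
	struct list_head	b_more_io;
	struct list_head	b_dirty_time;
};

/*
 * N-way sharded dirty-inode tracking.  Memcg awareness would layer on
 * top of this, the same way list_lru handles it per NUMA node.
 */
struct wb_sharded_lists {
	unsigned int		nr_shards;
	unsigned int		(*map)(struct inode *inode, unsigned int nr);
	struct wb_list_shard	*shards;	/* nr_shards entries */
};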