linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/4] writeback: Avoid lockups when switching inodes
@ 2025-09-09 14:44 Jan Kara
  2025-09-09 14:44 ` [PATCH 1/4] writeback: Avoid contention on wb->list_lock " Jan Kara
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Jan Kara @ 2025-09-09 14:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Tejun Heo, Jan Kara

Hello!

This patch series addresses lockups reported by users which happen when a
systemd unit that has read lots of files from a filesystem mounted with the
lazytime mount option exits. See patch 3 for more details and a reproducer.

There are two main issues that arise when switching many inodes between wbs:

1) Multiple workers will be spawned to do the switching but they all contend
on the same wb->list_lock, making all the parallelism pointless and just
wasting CPU time.

2) Sorting of the wb->b_dirty list by dirtied_time_when is inherently slow
because each insertion requires a linear scan of the list.

Patches 1-3 address these problems, patch 4 adds a tracepoint for better
observability of inode writeback switching.

								Honza

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] writeback: Avoid contention on wb->list_lock when switching inodes
  2025-09-09 14:44 [PATCH 0/4] writeback: Avoid lockups when switching inodes Jan Kara
@ 2025-09-09 14:44 ` Jan Kara
  2025-09-09 16:52   ` Tejun Heo
  2025-09-09 14:44 ` [PATCH 2/4] writeback: Avoid softlockup when switching many inodes Jan Kara
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-09 14:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Tejun Heo, Jan Kara

There can be multiple inode switch work items trying to switch inodes
to / from the same wb. This can happen in particular when a cgroup that
owns many (thousands of) inodes exits and we need to switch them all. In
this case several inode_switch_wbs_work_fn() instances will just be
spinning on the same wb->list_lock while only one of them makes forward
progress. This wastes CPU cycles and quickly leads to softlockup reports
and an unusable system.

Instead of running several inode_switch_wbs_work_fn() instances in
parallel, all switching to the same wb and contending on wb->list_lock,
run just one instance and let the other isw items targeting this wb
queue up behind the one being processed.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 42 +++++++++++++++++++++++++++++---
 include/linux/backing-dev-defs.h |  5 +++-
 mm/backing-dev.c                 |  2 ++
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a07b8cf73ae2..3f3e6efd5d78 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -369,6 +369,8 @@ static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode)
 
 struct inode_switch_wbs_context {
 	struct rcu_work		work;
+	/* List of queued switching contexts for new_wb */
+	struct list_head	list;
 
 	/*
 	 * Multiple inodes can be switched at once.  The switching procedure
@@ -486,10 +488,8 @@ static bool inode_do_switch_wbs(struct inode *inode,
 	return switched;
 }
 
-static void inode_switch_wbs_work_fn(struct work_struct *work)
+static void process_inode_switch_wbs_work(struct inode_switch_wbs_context *isw)
 {
-	struct inode_switch_wbs_context *isw =
-		container_of(to_rcu_work(work), struct inode_switch_wbs_context, work);
 	struct backing_dev_info *bdi = inode_to_bdi(isw->inodes[0]);
 	struct bdi_writeback *old_wb = isw->inodes[0]->i_wb;
 	struct bdi_writeback *new_wb = isw->new_wb;
@@ -539,10 +539,44 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
 	for (inodep = isw->inodes; *inodep; inodep++)
 		iput(*inodep);
 	wb_put(new_wb);
-	kfree(isw);
 	atomic_dec(&isw_nr_in_flight);
 }
 
+static void inode_switch_wbs_work_fn(struct work_struct *work)
+{
+	struct inode_switch_wbs_context *isw =
+		container_of(to_rcu_work(work), struct inode_switch_wbs_context, work);
+	struct bdi_writeback *new_wb = isw->new_wb;
+	bool switch_running;
+
+	spin_lock_irq(&new_wb->work_lock);
+	switch_running = !list_empty(&new_wb->switch_wbs_ctxs);
+	list_add_tail(&isw->list, &new_wb->switch_wbs_ctxs);
+	spin_unlock_irq(&new_wb->work_lock);
+
+	/*
+	 * Let's leave the real work for the running worker since we'd just
+	 * contend with it on wb->list_lock anyway.
+	 */
+	if (switch_running)
+		return;
+
+	/* OK, we will be doing the switching work */
+	wb_get(new_wb);
+	spin_lock_irq(&new_wb->work_lock);
+	while (!list_empty(&new_wb->switch_wbs_ctxs)) {
+		isw = list_first_entry(&new_wb->switch_wbs_ctxs,
+				       struct inode_switch_wbs_context, list);
+		spin_unlock_irq(&new_wb->work_lock);
+		process_inode_switch_wbs_work(isw);
+		spin_lock_irq(&new_wb->work_lock);
+		list_del(&isw->list);
+		kfree(isw);
+	}
+	spin_unlock_irq(&new_wb->work_lock);
+	wb_put(new_wb);
+}
+
 static bool inode_prepare_wbs_switch(struct inode *inode,
 				     struct bdi_writeback *new_wb)
 {
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 2ad261082bba..f94fd458c248 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -136,7 +136,8 @@ struct bdi_writeback {
 	int dirty_exceeded;
 	enum wb_reason start_all_reason;
 
-	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
+	spinlock_t work_lock;		/* protects work_list,
+					   dwork scheduling, and switch_wbs_ctxs */
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
 	struct delayed_work bw_dwork;	/* work item used for bandwidth estimate */
@@ -152,6 +153,8 @@ struct bdi_writeback {
 	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
 	struct list_head b_attached;	/* attached inodes, protected by list_lock */
 	struct list_head offline_node;	/* anchored at offline_cgwbs */
+	struct list_head switch_wbs_ctxs;	/* queued contexts for
+						 * writeback switching */
 
 	union {
 		struct work_struct release_work;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 783904d8c5ef..ce0aa7d03cc5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -709,6 +709,7 @@ static int cgwb_create(struct backing_dev_info *bdi,
 	wb->memcg_css = memcg_css;
 	wb->blkcg_css = blkcg_css;
 	INIT_LIST_HEAD(&wb->b_attached);
+	INIT_LIST_HEAD(&wb->switch_wbs_ctxs);
 	INIT_WORK(&wb->release_work, cgwb_release_workfn);
 	set_bit(WB_registered, &wb->state);
 	bdi_get(bdi);
@@ -839,6 +840,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	if (!ret) {
 		bdi->wb.memcg_css = &root_mem_cgroup->css;
 		bdi->wb.blkcg_css = blkcg_root_css;
+		INIT_LIST_HEAD(&bdi->wb.switch_wbs_ctxs);
 	}
 	return ret;
 }
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] writeback: Avoid softlockup when switching many inodes
  2025-09-09 14:44 [PATCH 0/4] writeback: Avoid lockups when switching inodes Jan Kara
  2025-09-09 14:44 ` [PATCH 1/4] writeback: Avoid contention on wb->list_lock " Jan Kara
@ 2025-09-09 14:44 ` Jan Kara
  2025-09-09 16:55   ` Tejun Heo
  2025-09-09 14:44 ` [PATCH 3/4] writeback: Avoid excessively long inode switching times Jan Kara
  2025-09-09 14:44 ` [PATCH 4/4] writeback: Add tracepoint to track pending inode switches Jan Kara
  3 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-09 14:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Tejun Heo, Jan Kara

process_inode_switch_wbs_work() can be switching over 100 inodes to a
different cgroup. Since switching an inode requires counting all dirty &
under-writeback pages in the address space of each inode, this can take
a significant amount of time. Add a possibility to reschedule after
processing each inode to avoid softlockups.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3f3e6efd5d78..125f477c34c1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -502,6 +502,7 @@ static void process_inode_switch_wbs_work(struct inode_switch_wbs_context *isw)
 	 */
 	down_read(&bdi->wb_switch_rwsem);
 
+	inodep = isw->inodes;
 	/*
 	 * By the time control reaches here, RCU grace period has passed
 	 * since I_WB_SWITCH assertion and all wb stat update transactions
@@ -512,6 +513,7 @@ static void process_inode_switch_wbs_work(struct inode_switch_wbs_context *isw)
 	 * gives us exclusion against all wb related operations on @inode
 	 * including IO list manipulations and stat updates.
 	 */
+relock:
 	if (old_wb < new_wb) {
 		spin_lock(&old_wb->list_lock);
 		spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING);
@@ -520,10 +522,17 @@ static void process_inode_switch_wbs_work(struct inode_switch_wbs_context *isw)
 		spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING);
 	}
 
-	for (inodep = isw->inodes; *inodep; inodep++) {
+	while (*inodep) {
 		WARN_ON_ONCE((*inodep)->i_wb != old_wb);
 		if (inode_do_switch_wbs(*inodep, old_wb, new_wb))
 			nr_switched++;
+		inodep++;
+		if (*inodep && need_resched()) {
+			spin_unlock(&new_wb->list_lock);
+			spin_unlock(&old_wb->list_lock);
+			cond_resched();
+			goto relock;
+		}
 	}
 
 	spin_unlock(&new_wb->list_lock);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] writeback: Avoid excessively long inode switching times
  2025-09-09 14:44 [PATCH 0/4] writeback: Avoid lockups when switching inodes Jan Kara
  2025-09-09 14:44 ` [PATCH 1/4] writeback: Avoid contention on wb->list_lock " Jan Kara
  2025-09-09 14:44 ` [PATCH 2/4] writeback: Avoid softlockup when switching many inodes Jan Kara
@ 2025-09-09 14:44 ` Jan Kara
  2025-09-09 16:56   ` Tejun Heo
  2025-09-09 14:44 ` [PATCH 4/4] writeback: Add tracepoint to track pending inode switches Jan Kara
  3 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-09 14:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Tejun Heo, Jan Kara

With the lazytime mount option enabled we can be switching many dirty
inodes to the parent cgroup on cgroup exit. The numbers observed in
practice when the systemd slice of a large cron job exits can easily
reach hundreds of thousands or even millions of inodes. However, the
logic in inode_do_switch_wbs() which sorts the inode into the
appropriate place in the b_dirty list of the target wb has linear
complexity in the number of dirty inodes, so the overall time complexity
of switching all the inodes is quadratic. This leads to workers being
pegged for hours, consuming 100% of the CPU while switching inodes to
the parent wb.
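
As a rough back-of-the-envelope illustration (the numbers are made up for
scale, they are not taken from the report): if inserting the k-th inode has
to scan on the order of k already-switched entries in b_dirty, switching N
inodes costs roughly

  1 + 2 + ... + N ~= N^2 / 2  list-entry visits

so for N = 1,000,000 inodes that is on the order of 5 * 10^11 visits, easily
enough to keep a worker at 100% CPU for hours.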

Simple reproducer of the issue:
  FILES=10000
  # Filesystem mounted with lazytime mount option
  MNT=/mnt/
  echo "Creating files and switching timestamps"
  for (( j = 0; j < 50; j ++ )); do
      mkdir $MNT/dir$j
      for (( i = 0; i < $FILES; i++ )); do
          echo "foo" >$MNT/dir$j/file$i
      done
      touch -a -t 202501010000 $MNT/dir$j/file*
  done
  wait
  echo "Syncing and flushing"
  sync
  echo 3 >/proc/sys/vm/drop_caches

  echo "Reading all files from a cgroup"
  mkdir /sys/fs/cgroup/unified/mycg1 || exit
  echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
  for (( j = 0; j < 50; j ++ )); do
      cat /mnt/dir$j/file* >/dev/null &
  done
  wait
  echo "Switching wbs"
  # Now rmdir the cgroup after the script exits

We need to maintain the b_dirty list ordering to keep writeback happy,
so instead of sorting the inode into the appropriate place just append
it at the end of the list and clobber its dirtied_time_when. This may
result in inode writeback starting later after a cgroup switch, but
cgroup switches are rare so it shouldn't matter much. Since the cgroup
had write access to the inode, there are no practical DoS concerns.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 125f477c34c1..7765c3deccd6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -447,22 +447,23 @@ static bool inode_do_switch_wbs(struct inode *inode,
 	 * Transfer to @new_wb's IO list if necessary.  If the @inode is dirty,
 	 * the specific list @inode was on is ignored and the @inode is put on
 	 * ->b_dirty which is always correct including from ->b_dirty_time.
-	 * The transfer preserves @inode->dirtied_when ordering.  If the @inode
-	 * was clean, it means it was on the b_attached list, so move it onto
-	 * the b_attached list of @new_wb.
+	 * If the @inode was clean, it means it was on the b_attached list, so
+	 * move it onto the b_attached list of @new_wb.
 	 */
 	if (!list_empty(&inode->i_io_list)) {
 		inode->i_wb = new_wb;
 
 		if (inode->i_state & I_DIRTY_ALL) {
-			struct inode *pos;
-
-			list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
-				if (time_after_eq(inode->dirtied_when,
-						  pos->dirtied_when))
-					break;
+			/*
+			 * We need to keep b_dirty list sorted by
+			 * dirtied_time_when. However properly sorting the
+			 * inode in the list gets too expensive when switching
+			 * many inodes. So just attach inode at the end of the
+			 * dirty list and clobber the dirtied_time_when.
+			 */
+			inode->dirtied_time_when = jiffies;
 			inode_io_list_move_locked(inode, new_wb,
-						  pos->i_io_list.prev);
+						  &new_wb->b_dirty);
 		} else {
 			inode_cgwb_move_to_attached(inode, new_wb);
 		}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] writeback: Add tracepoint to track pending inode switches
  2025-09-09 14:44 [PATCH 0/4] writeback: Avoid lockups when switching inodes Jan Kara
                   ` (2 preceding siblings ...)
  2025-09-09 14:44 ` [PATCH 3/4] writeback: Avoid excessively long inode switching times Jan Kara
@ 2025-09-09 14:44 ` Jan Kara
  2025-09-09 16:57   ` Tejun Heo
  3 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-09 14:44 UTC (permalink / raw)
  To: Christian Brauner; +Cc: linux-fsdevel, Tejun Heo, Jan Kara

Add trace_inode_switch_wbs_queue tracepoint to allow insight into how
many inodes are queued to switch their bdi_writeback structure.
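
Once applied, the event can be consumed through tracefs like the other
writeback tracepoints; a minimal example (the sample values below are made
up):

  # echo 1 > /sys/kernel/tracing/events/writeback/inode_switch_wbs_queue/enable
  # cat /sys/kernel/tracing/trace_pipe
  ... inode_switch_wbs_queue: bdi 8:0: old_cgroup_ino=1342 new_cgroup_ino=1 count=124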

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                |  4 ++++
 include/trace/events/writeback.h | 29 +++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7765c3deccd6..094b78ce6d72 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -658,6 +658,8 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	if (!isw->new_wb)
 		goto out_free;
 
+	trace_inode_switch_wbs_queue(inode->i_wb, isw->new_wb, 1);
+
 	if (!inode_prepare_wbs_switch(inode, isw->new_wb))
 		goto out_free;
 
@@ -752,6 +754,8 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb)
 		return restart;
 	}
 
+	trace_inode_switch_wbs_queue(wb, isw->new_wb, nr);
+
 	/*
 	 * In addition to synchronizing among switchers, I_WB_SWITCH tells
 	 * the RCU protected stat update paths to grab the i_page
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 1e23919c0da9..c08aff044e80 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -213,6 +213,35 @@ TRACE_EVENT(inode_foreign_history,
 	)
 );
 
+TRACE_EVENT(inode_switch_wbs_queue,
+
+	TP_PROTO(struct bdi_writeback *old_wb, struct bdi_writeback *new_wb,
+		 unsigned int count),
+
+	TP_ARGS(old_wb, new_wb, count),
+
+	TP_STRUCT__entry(
+		__array(char,		name, 32)
+		__field(ino_t,		old_cgroup_ino)
+		__field(ino_t,		new_cgroup_ino)
+		__field(unsigned int,	count)
+	),
+
+	TP_fast_assign(
+		strscpy_pad(__entry->name, bdi_dev_name(old_wb->bdi), 32);
+		__entry->old_cgroup_ino	= __trace_wb_assign_cgroup(old_wb);
+		__entry->new_cgroup_ino	= __trace_wb_assign_cgroup(new_wb);
+		__entry->count		= count;
+	),
+
+	TP_printk("bdi %s: old_cgroup_ino=%lu new_cgroup_ino=%lu count=%u",
+		__entry->name,
+		(unsigned long)__entry->old_cgroup_ino,
+		(unsigned long)__entry->new_cgroup_ino,
+		__entry->count
+	)
+);
+
 TRACE_EVENT(inode_switch_wbs,
 
 	TP_PROTO(struct inode *inode, struct bdi_writeback *old_wb,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] writeback: Avoid contention on wb->list_lock when switching inodes
  2025-09-09 14:44 ` [PATCH 1/4] writeback: Avoid contention on wb->list_lock " Jan Kara
@ 2025-09-09 16:52   ` Tejun Heo
  2025-09-10  8:19     ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2025-09-09 16:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel

Hello, Jan.

On Tue, Sep 09, 2025 at 04:44:02PM +0200, Jan Kara wrote:
> There can be multiple inode switch work items trying to switch inodes
> to / from the same wb. This can happen in particular when a cgroup that
> owns many (thousands of) inodes exits and we need to switch them all. In
> this case several inode_switch_wbs_work_fn() instances will just be
> spinning on the same wb->list_lock while only one of them makes forward
> progress. This wastes CPU cycles and quickly leads to softlockup reports
> and an unusable system.
> 
> Instead of running several inode_switch_wbs_work_fn() instances in
> parallel, all switching to the same wb and contending on wb->list_lock,
> run just one instance and let the other isw items targeting this wb
> queue up behind the one being processed.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
...
> +static void inode_switch_wbs_work_fn(struct work_struct *work)
> +{
> +	struct inode_switch_wbs_context *isw =
> +		container_of(to_rcu_work(work), struct inode_switch_wbs_context, work);
> +	struct bdi_writeback *new_wb = isw->new_wb;
> +	bool switch_running;
> +
> +	spin_lock_irq(&new_wb->work_lock);
> +	switch_running = !list_empty(&new_wb->switch_wbs_ctxs);
> +	list_add_tail(&isw->list, &new_wb->switch_wbs_ctxs);
> +	spin_unlock_irq(&new_wb->work_lock);
> +
> +	/*
> +	 * Let's leave the real work for the running worker since we'd just
> +	 * contend with it on wb->list_lock anyway.
> +	 */
> +	if (switch_running)
> +		return;
> +
> +	/* OK, we will be doing the switching work */
> +	wb_get(new_wb);
> +	spin_lock_irq(&new_wb->work_lock);
> +	while (!list_empty(&new_wb->switch_wbs_ctxs)) {
> +		isw = list_first_entry(&new_wb->switch_wbs_ctxs,
> +				       struct inode_switch_wbs_context, list);
> +		spin_unlock_irq(&new_wb->work_lock);
> +		process_inode_switch_wbs_work(isw);
> +		spin_lock_irq(&new_wb->work_lock);
> +		list_del(&isw->list);
> +		kfree(isw);
> +	}
> +	spin_unlock_irq(&new_wb->work_lock);
> +	wb_put(new_wb);
> +}

Would it be easier to achieve the same effect if we just reduced @max_active
when creating inode_switch_wbs? If we update cgroup_writeback_init() to use
the following instead:

        isw_wq = alloc_workqueue("inode_switch_wbs", WQ_UNBOUND, 1);

Wouldn't that achieve the same thing? Note the addition of WQ_UNBOUND isn't
strictly necessary but we're in the process of defaulting to unbound
workqueues, so might as well update it together. I can't think of any reason
why this would require per-cpu behavior.
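
(For reference, quoting from memory so worth double-checking against the
tree: cgroup_writeback_init() currently does roughly

        isw_wq = alloc_workqueue("inode_switch_wbs", 0, 0);

i.e. a bound workqueue with the default max_active, which is what allows the
mutually contending work items in the first place.)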

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] writeback: Avoid softlockup when switching many inodes
  2025-09-09 14:44 ` [PATCH 2/4] writeback: Avoid softlockup when switching many inodes Jan Kara
@ 2025-09-09 16:55   ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2025-09-09 16:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel

On Tue, Sep 09, 2025 at 04:44:03PM +0200, Jan Kara wrote:
> process_inode_switch_wbs_work() can be switching over 100 inodes to a
> different cgroup. Since switching an inode requires counting all dirty &
> under-writeback pages in the address space of each inode, this can take
> a significant amount of time. Add a possibility to reschedule after
> processing each inode to avoid softlockups.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] writeback: Avoid excessively long inode switching times
  2025-09-09 14:44 ` [PATCH 3/4] writeback: Avoid excessively long inode switching times Jan Kara
@ 2025-09-09 16:56   ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2025-09-09 16:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel

On Tue, Sep 09, 2025 at 04:44:04PM +0200, Jan Kara wrote:
> With the lazytime mount option enabled we can be switching many dirty
> inodes to the parent cgroup on cgroup exit. The numbers observed in
> practice when the systemd slice of a large cron job exits can easily
> reach hundreds of thousands or even millions of inodes. However, the
> logic in inode_do_switch_wbs() which sorts the inode into the
> appropriate place in the b_dirty list of the target wb has linear
> complexity in the number of dirty inodes, so the overall time complexity
> of switching all the inodes is quadratic. This leads to workers being
> pegged for hours, consuming 100% of the CPU while switching inodes to
> the parent wb.
> 
> Simple reproducer of the issue:
>   FILES=10000
>   # Filesystem mounted with lazytime mount option
>   MNT=/mnt/
>   echo "Creating files and switching timestamps"
>   for (( j = 0; j < 50; j ++ )); do
>       mkdir $MNT/dir$j
>       for (( i = 0; i < $FILES; i++ )); do
>           echo "foo" >$MNT/dir$j/file$i
>       done
>       touch -a -t 202501010000 $MNT/dir$j/file*
>   done
>   wait
>   echo "Syncing and flushing"
>   sync
>   echo 3 >/proc/sys/vm/drop_caches
> 
>   echo "Reading all files from a cgroup"
>   mkdir /sys/fs/cgroup/unified/mycg1 || exit
>   echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
>   for (( j = 0; j < 50; j ++ )); do
>       cat /mnt/dir$j/file* >/dev/null &
>   done
>   wait
>   echo "Switching wbs"
>   # Now rmdir the cgroup after the script exits
> 
> We need to maintain the b_dirty list ordering to keep writeback happy,
> so instead of sorting the inode into the appropriate place just append
> it at the end of the list and clobber its dirtied_time_when. This may
> result in inode writeback starting later after a cgroup switch, but
> cgroup switches are rare so it shouldn't matter much. Since the cgroup
> had write access to the inode, there are no practical DoS concerns.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/4] writeback: Add tracepoint to track pending inode switches
  2025-09-09 14:44 ` [PATCH 4/4] writeback: Add tracepoint to track pending inode switches Jan Kara
@ 2025-09-09 16:57   ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2025-09-09 16:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel

On Tue, Sep 09, 2025 at 04:44:05PM +0200, Jan Kara wrote:
> Add trace_inode_switch_wbs_queue tracepoint to allow insight into how
> many inodes are queued to switch their bdi_writeback structure.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] writeback: Avoid contention on wb->list_lock when switching inodes
  2025-09-09 16:52   ` Tejun Heo
@ 2025-09-10  8:19     ` Jan Kara
  2025-09-10 17:10       ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-10  8:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, Christian Brauner, linux-fsdevel

Hello Tejun,

On Tue 09-09-25 06:52:27, Tejun Heo wrote:
> On Tue, Sep 09, 2025 at 04:44:02PM +0200, Jan Kara wrote:
> > There can be multiple inode switch work items trying to switch inodes
> > to / from the same wb. This can happen in particular when a cgroup that
> > owns many (thousands of) inodes exits and we need to switch them all. In
> > this case several inode_switch_wbs_work_fn() instances will just be
> > spinning on the same wb->list_lock while only one of them makes forward
> > progress. This wastes CPU cycles and quickly leads to softlockup reports
> > and an unusable system.
> > 
> > Instead of running several inode_switch_wbs_work_fn() instances in
> > parallel, all switching to the same wb and contending on wb->list_lock,
> > run just one instance and let the other isw items targeting this wb
> > queue up behind the one being processed.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> ...
> > +static void inode_switch_wbs_work_fn(struct work_struct *work)
> > +{
> > +	struct inode_switch_wbs_context *isw =
> > +		container_of(to_rcu_work(work), struct inode_switch_wbs_context, work);
> > +	struct bdi_writeback *new_wb = isw->new_wb;
> > +	bool switch_running;
> > +
> > +	spin_lock_irq(&new_wb->work_lock);
> > +	switch_running = !list_empty(&new_wb->switch_wbs_ctxs);
> > +	list_add_tail(&isw->list, &new_wb->switch_wbs_ctxs);
> > +	spin_unlock_irq(&new_wb->work_lock);
> > +
> > +	/*
> > +	 * Let's leave the real work for the running worker since we'd just
> > +	 * contend with it on wb->list_lock anyway.
> > +	 */
> > +	if (switch_running)
> > +		return;
> > +
> > +	/* OK, we will be doing the switching work */
> > +	wb_get(new_wb);
> > +	spin_lock_irq(&new_wb->work_lock);
> > +	while (!list_empty(&new_wb->switch_wbs_ctxs)) {
> > +		isw = list_first_entry(&new_wb->switch_wbs_ctxs,
> > +				       struct inode_switch_wbs_context, list);
> > +		spin_unlock_irq(&new_wb->work_lock);
> > +		process_inode_switch_wbs_work(isw);
> > +		spin_lock_irq(&new_wb->work_lock);
> > +		list_del(&isw->list);
> > +		kfree(isw);
> > +	}
> > +	spin_unlock_irq(&new_wb->work_lock);
> > +	wb_put(new_wb);
> > +}
> 
> Would it be easier to achieve the same effect if we just reduced @max_active
> when creating inode_switch_wbs? If we update cgroup_writeback_init() to use
> the following instead:
> 
>         isw_wq = alloc_workqueue("inode_switch_wbs", WQ_UNBOUND, 1);
> 
> Wouldn't that achieve the same thing? Note the addition of WQ_UNBOUND isn't
> strictly necessary but we're in the process of defaulting to unbound
> workqueues, so might as well update it together. I can't think of any reason
> why this would require per-cpu behavior.

Well, reducing @max_active to 1 will certainly deal with the list_lock
contention as well. But I didn't want to do that as on a busy container
system I assume there can be switching happening between different pairs of
cgroups. With the approach in this patch switches with different target
cgroups can still run in parallel. I don't have any real world data to back
that assumption so if you think this parallelism isn't really needed and we
are fine with at most one switch happening in the system, switching
max_active to 1 is certainly simple enough.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] writeback: Avoid contention on wb->list_lock when switching inodes
  2025-09-10  8:19     ` Jan Kara
@ 2025-09-10 17:10       ` Tejun Heo
  2025-09-11 11:30         ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2025-09-10 17:10 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel

Hello, Jan.

On Wed, Sep 10, 2025 at 10:19:36AM +0200, Jan Kara wrote:
> Well, reducing @max_active to 1 will certainly deal with the list_lock
> contention as well. But I didn't want to do that as on a busy container
> system I assume there can be switching happening between different pairs of
> cgroups. With the approach in this patch switches with different target
> cgroups can still run in parallel. I don't have any real world data to back
> that assumption so if you think this parallelism isn't really needed and we
> are fine with at most one switch happening in the system, switching
> max_active to 1 is certainly simple enough.

What bothers me is that the concurrency doesn't match between the work items
being scheduled and the actual execution and we're resolving that by early
exiting from some work items. It just feels like an roundabout way to do it
with extra code. I think there are better ways to achieve per-bdi_writeback
concurrency:

- Move work_struct from isw to bdi_writeback and schedule the work item on
  the target wb which processes isw's queued on the bdi_writeback.

- Or have a per-wb workqueue with max_active limit so that concurrency is
  regulated per-wb.

The latter is a bit simpler but does cost more memory as workqueue_struct
isn't tiny. The former is a bit more complicated but most likely less so
than the current code. What do you think?
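
To make the first variant a bit more concrete, a very rough sketch (the
struct/field names are made up, and reference counting, error handling and
the RCU grace period that must pass before an isw may be processed are all
left out):

	/* in struct bdi_writeback (illustrative only) */
	struct work_struct	switch_work;	/* single switcher per wb */
	struct list_head	switch_wbs_ctxs;	/* pending isw contexts */

	static void wb_switch_work_fn(struct work_struct *work)
	{
		struct bdi_writeback *wb =
			container_of(work, struct bdi_writeback, switch_work);
		struct inode_switch_wbs_context *isw;

		spin_lock_irq(&wb->work_lock);
		while ((isw = list_first_entry_or_null(&wb->switch_wbs_ctxs,
				struct inode_switch_wbs_context, list))) {
			list_del(&isw->list);
			spin_unlock_irq(&wb->work_lock);
			process_inode_switch_wbs_work(isw);
			kfree(isw);
			spin_lock_irq(&wb->work_lock);
		}
		spin_unlock_irq(&wb->work_lock);
	}

The queueing side would then just add the isw to wb->switch_wbs_ctxs under
work_lock and queue_work() the per-wb item.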

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] writeback: Avoid contention on wb->list_lock when switching inodes
  2025-09-10 17:10       ` Tejun Heo
@ 2025-09-11 11:30         ` Jan Kara
  2025-09-11 12:01           ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-11 11:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, Christian Brauner, linux-fsdevel

On Wed 10-09-25 07:10:12, Tejun Heo wrote:
> Hello, Jan.
> 
> On Wed, Sep 10, 2025 at 10:19:36AM +0200, Jan Kara wrote:
> > Well, reducing @max_active to 1 will certainly deal with the list_lock
> > contention as well. But I didn't want to do that as on a busy container
> > system I assume there can be switching happening between different pairs of
> > cgroups. With the approach in this patch switches with different target
> > cgroups can still run in parallel. I don't have any real world data to back
> > that assumption so if you think this parallelism isn't really needed and we
> > are fine with at most one switch happening in the system, switching
> > max_active to 1 is certainly simple enough.
> 
> What bothers me is that the concurrency doesn't match between the work items
> being scheduled and the actual execution and we're resolving that by early
> exiting from some work items. It just feels like a roundabout way to do it
> with extra code. I think there are better ways to achieve per-bdi_writeback
> concurrency:
> 
> - Move work_struct from isw to bdi_writeback and schedule the work item on
>   the target wb which processes isw's queued on the bdi_writeback.
> 
> - Or have a per-wb workqueue with max_active limit so that concurrency is
>   regulated per-wb.
> 
> The latter is a bit simpler but does cost more memory as workqueue_struct
> isn't tiny. The former is a bit more complicated but most likely less so
> than the current code. What do you think?

That's a fair objection and good idea. I'll rework the patch to go with
the first variant.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] writeback: Avoid contention on wb->list_lock when switching inodes
  2025-09-11 11:30         ` Jan Kara
@ 2025-09-11 12:01           ` Jan Kara
  2025-09-12 10:39             ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-09-11 12:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, Christian Brauner, linux-fsdevel

On Thu 11-09-25 13:30:13, Jan Kara wrote:
> On Wed 10-09-25 07:10:12, Tejun Heo wrote:
> > Hello, Jan.
> > 
> > On Wed, Sep 10, 2025 at 10:19:36AM +0200, Jan Kara wrote:
> > > Well, reducing @max_active to 1 will certainly deal with the list_lock
> > > contention as well. But I didn't want to do that as on a busy container
> > > system I assume there can be switching happening between different pairs of
> > > cgroups. With the approach in this patch switches with different target
> > > cgroups can still run in parallel. I don't have any real world data to back
> > > that assumption so if you think this parallelism isn't really needed and we
> > > are fine with at most one switch happening in the system, switching
> > > max_active to 1 is certainly simple enough.
> > 
> > What bothers me is that the concurrency doesn't match between the work items
> > being scheduled and the actual execution and we're resolving that by early
> > exiting from some work items. It just feels like a roundabout way to do it
> > with extra code. I think there are better ways to achieve per-bdi_writeback
> > concurrency:
> > 
> > - Move work_struct from isw to bdi_writeback and schedule the work item on
> >   the target wb which processes isw's queued on the bdi_writeback.
> > 
> > - Or have a per-wb workqueue with max_active limit so that concurrency is
> >   regulated per-wb.
> > 
> > The latter is a bit simpler but does cost more memory as workqueue_struct
> > isn't tiny. The former is a bit more complicated but most likely less so
> > than the current code. What do you think?
> 
> That's a fair objection and good idea. I'll rework the patch to go with
> the first variant.

I've realized why I didn't do something like this from the beginning. The
slight snag is that you can start switching an inode's wb only once an RCU
grace period expires (so that the I_WB_SWITCH setting is guaranteed to be
visible). This makes moving the work struct to bdi_writeback slightly
tricky. It shouldn't be too bad so I think I'll try it and see how it
looks.
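
Purely as a sketch of one possible shape (not necessarily what v2 will end
up doing): keep a per-isw rcu_head and do the queueing onto the per-wb list
from the RCU callback, once the grace period has elapsed, e.g.:

	static void inode_switch_wbs_rcu_fn(struct rcu_head *head)
	{
		struct inode_switch_wbs_context *isw =
			container_of(head, struct inode_switch_wbs_context,
				     rcu_head);
		struct bdi_writeback *new_wb = isw->new_wb;
		unsigned long flags;

		/* Runs from softirq context after the grace period. */
		spin_lock_irqsave(&new_wb->work_lock, flags);
		list_add_tail(&isw->list, &new_wb->switch_wbs_ctxs);
		spin_unlock_irqrestore(&new_wb->work_lock, flags);
		queue_work(isw_wq, &new_wb->switch_work);
	}

with inode_switch_wbs() / cleanup_offline_cgwb() calling
call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn) instead of queueing the
rcu_work directly. The wb lifetime would still need a reference held until
the work item has run.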

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] writeback: Avoid contention on wb->list_lock when switching inodes
  2025-09-11 12:01           ` Jan Kara
@ 2025-09-12 10:39             ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2025-09-12 10:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jan Kara, Christian Brauner, linux-fsdevel

On Thu 11-09-25 14:01:36, Jan Kara wrote:
> On Thu 11-09-25 13:30:13, Jan Kara wrote:
> > On Wed 10-09-25 07:10:12, Tejun Heo wrote:
> > > Hello, Jan.
> > > 
> > > On Wed, Sep 10, 2025 at 10:19:36AM +0200, Jan Kara wrote:
> > > > Well, reducing @max_active to 1 will certainly deal with the list_lock
> > > > contention as well. But I didn't want to do that as on a busy container
> > > > system I assume there can be switching happening between different pairs of
> > > > cgroups. With the approach in this patch switches with different target
> > > > cgroups can still run in parallel. I don't have any real world data to back
> > > > that assumption so if you think this parallelism isn't really needed and we
> > > > are fine with at most one switch happening in the system, switching
> > > > max_active to 1 is certainly simple enough.
> > > 
> > > What bothers me is that the concurrency doesn't match between the work items
> > > being scheduled and the actual execution and we're resolving that by early
> > > exiting from some work items. It just feels like a roundabout way to do it
> > > with extra code. I think there are better ways to achieve per-bdi_writeback
> > > concurrency:
> > > 
> > > - Move work_struct from isw to bdi_writeback and schedule the work item on
> > >   the target wb which processes isw's queued on the bdi_writeback.
> > > 
> > > - Or have a per-wb workqueue with max_active limit so that concurrency is
> > >   regulated per-wb.
> > > 
> > > The latter is a bit simpler but does cost more memory as workqueue_struct
> > > isn't tiny. The former is a bit more complicated but most likely less so
> > > than the current code. What do you think?
> > 
> > That's a fair objection and good idea. I'll rework the patch to go with
> > the first variant.
> 
> I've realized why I didn't do something like this from the beginning. The
> slight snag is that you can start switching an inode's wb only once an RCU
> grace period expires (so that the I_WB_SWITCH setting is guaranteed to be
> visible). This makes moving the work struct to bdi_writeback slightly
> tricky. It shouldn't be too bad so I think I'll try it and see how it
> looks.

It actually worked out pretty nicely I think. Posting v2 with the
changes...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-09-12 10:39 UTC | newest]

Thread overview: 14+ messages
2025-09-09 14:44 [PATCH 0/4] writeback: Avoid lockups when switching inodes Jan Kara
2025-09-09 14:44 ` [PATCH 1/4] writeback: Avoid contention on wb->list_lock " Jan Kara
2025-09-09 16:52   ` Tejun Heo
2025-09-10  8:19     ` Jan Kara
2025-09-10 17:10       ` Tejun Heo
2025-09-11 11:30         ` Jan Kara
2025-09-11 12:01           ` Jan Kara
2025-09-12 10:39             ` Jan Kara
2025-09-09 14:44 ` [PATCH 2/4] writeback: Avoid softlockup when switching many inodes Jan Kara
2025-09-09 16:55   ` Tejun Heo
2025-09-09 14:44 ` [PATCH 3/4] writeback: Avoid excessively long inode switching times Jan Kara
2025-09-09 16:56   ` Tejun Heo
2025-09-09 14:44 ` [PATCH 4/4] writeback: Add tracepoint to track pending inode switches Jan Kara
2025-09-09 16:57   ` Tejun Heo
