public inbox for linux-mtd@lists.infradead.org
* [PATCH v2] lib/list_sort: introduce list_sort_nonatomic() and clean up scheduling workarounds
@ 2026-03-17 16:59 Kuan-Wei Chiu
  2026-03-18  5:57 ` Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: Kuan-Wei Chiu @ 2026-03-17 16:59 UTC (permalink / raw)
  To: richard, akpm
  Cc: chengzhihao1, hch, jserv, eleanor15x, marscheng, linux-mtd,
	linux-kernel, Kuan-Wei Chiu

Historically, list_sort() implemented a hack in merge_final():
    if (unlikely(!++count))
        cmp(priv, b, b);

This was designed specifically so that callers could periodically
invoke cond_resched() within their comparison functions when merging
highly unbalanced lists.

However, an audit of the kernel tree reveals that only fs/ubifs/ relies
on this mechanism. For the vast majority of list_sort() users (such as
block layer IO schedulers and file systems), this results in completely
wasted function calls. In the worst-case scenario (merging an already
sorted list where 'a' is exhausted quickly), it results in
approximately (N/2)/256 unnecessary cmp() calls.

To clean up this API, eliminate the overhead for generic users, and
consolidate the scheduling logic:
1. Introduce list_sort_nonatomic(), which explicitly calls
   cond_resched() within its inner merge loops.
2. Remove the dummy cmp(priv, b, b) fallback from standard list_sort(),
   saving unnecessary function calls and improving determinism for all
   other subsystems.
3. Convert the sole user (fs/ubifs/) to the new API and completely
   remove cond_resched() from UBIFS's comparison callbacks, keeping
   its comparison logic free of scheduling concerns.

This change leaves the generic list_sort() completely free of
scheduling hacks, simplifies UBIFS's callbacks, and ensures that legacy
long-list sorting workloads remain safe from soft lockups on
non-preemptible kernels.

Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
---
Changes in v2:
- Drop the 'count' rate-limiter for cond_resched().
- Remove cond_resched() from UBIFS comparison callbacks.

 fs/ubifs/gc.c             |   6 +-
 fs/ubifs/replay.c         |   3 +-
 include/linux/list_sort.h |   3 +
 lib/list_sort.c           | 174 ++++++++++++++++++++++----------------
 4 files changed, 106 insertions(+), 80 deletions(-)

diff --git a/fs/ubifs/gc.c b/fs/ubifs/gc.c
index 0bf08b7755b8..54fac33224fa 100644
--- a/fs/ubifs/gc.c
+++ b/fs/ubifs/gc.c
@@ -109,7 +109,6 @@ static int data_nodes_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct ubifs_scan_node *sa, *sb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -153,7 +152,6 @@ static int nondata_nodes_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct ubifs_scan_node *sa, *sb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -277,8 +275,8 @@ static int sort_nodes(struct ubifs_info *c, struct ubifs_scan_leb *sleb,
 	}
 
 	/* Sort data and non-data nodes */
-	list_sort(c, &sleb->nodes, &data_nodes_cmp);
-	list_sort(c, nondata, &nondata_nodes_cmp);
+	list_sort_nonatomic(c, &sleb->nodes, &data_nodes_cmp);
+	list_sort_nonatomic(c, nondata, &nondata_nodes_cmp);
 
 	err = dbg_check_data_nodes_order(c, &sleb->nodes);
 	if (err)
diff --git a/fs/ubifs/replay.c b/fs/ubifs/replay.c
index a9a568f4a868..c639394e60ac 100644
--- a/fs/ubifs/replay.c
+++ b/fs/ubifs/replay.c
@@ -305,7 +305,6 @@ static int replay_entries_cmp(void *priv, const struct list_head *a,
 	struct ubifs_info *c = priv;
 	struct replay_entry *ra, *rb;
 
-	cond_resched();
 	if (a == b)
 		return 0;
 
@@ -329,7 +328,7 @@ static int apply_replay_list(struct ubifs_info *c)
 	struct replay_entry *r;
 	int err;
 
-	list_sort(c, &c->replay_list, &replay_entries_cmp);
+	list_sort_nonatomic(c, &c->replay_list, &replay_entries_cmp);
 
 	list_for_each_entry(r, &c->replay_list, list) {
 		cond_resched();
diff --git a/include/linux/list_sort.h b/include/linux/list_sort.h
index 453105f74e05..f7af29073d48 100644
--- a/include/linux/list_sort.h
+++ b/include/linux/list_sort.h
@@ -11,4 +11,7 @@ typedef int __attribute__((nonnull(2,3))) (*list_cmp_func_t)(void *,
 
 __attribute__((nonnull(2,3)))
 void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp);
+
+__attribute__((nonnull(2, 3)))
+void list_sort_nonatomic(void *priv, struct list_head *head, list_cmp_func_t cmp);
 #endif
diff --git a/lib/list_sort.c b/lib/list_sort.c
index a310ecb7ccc0..9f9d12ae23ab 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -3,6 +3,7 @@
 #include <linux/export.h>
 #include <linux/list_sort.h>
 #include <linux/list.h>
+#include <linux/sched.h>
 
 /*
  * Returns a list organized in an intermediate format suited
@@ -11,11 +12,14 @@
  */
 __attribute__((nonnull(2,3,4)))
 static struct list_head *merge(void *priv, list_cmp_func_t cmp,
-				struct list_head *a, struct list_head *b)
+				struct list_head *a, struct list_head *b,
+				bool may_schedule)
 {
 	struct list_head *head, **tail = &head;
 
 	for (;;) {
+		if (may_schedule)
+			cond_resched();
 		/* if equal, take 'a' -- important for sort stability */
 		if (cmp(priv, a, b) <= 0) {
 			*tail = a;
@@ -47,12 +51,13 @@ static struct list_head *merge(void *priv, list_cmp_func_t cmp,
  */
 __attribute__((nonnull(2,3,4,5)))
 static void merge_final(void *priv, list_cmp_func_t cmp, struct list_head *head,
-			struct list_head *a, struct list_head *b)
+			struct list_head *a, struct list_head *b, bool may_schedule)
 {
 	struct list_head *tail = head;
-	u8 count = 0;
 
 	for (;;) {
+		if (may_schedule)
+			cond_resched();
 		/* if equal, take 'a' -- important for sort stability */
 		if (cmp(priv, a, b) <= 0) {
 			tail->next = a;
@@ -79,12 +84,11 @@ static void merge_final(void *priv, list_cmp_func_t cmp, struct list_head *head,
 		/*
 		 * If the merge is highly unbalanced (e.g. the input is
 		 * already sorted), this loop may run many iterations.
-		 * Continue callbacks to the client even though no
-		 * element comparison is needed, so the client's cmp()
-		 * routine can invoke cond_resched() periodically.
+		 * If may_schedule is true, invoke cond_resched() to
+		 * avoid soft lockups.
 		 */
-		if (unlikely(!++count))
-			cmp(priv, b, b);
+		if (may_schedule)
+			cond_resched();
 		b->prev = tail;
 		tail = b;
 		b = b->next;
@@ -95,6 +99,75 @@ static void merge_final(void *priv, list_cmp_func_t cmp, struct list_head *head,
 	head->prev = tail;
 }
 
+static void __list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp, bool may_schedule)
+{
+	struct list_head *list = head->next, *pending = NULL;
+	size_t count = 0;	/* Count of pending */
+
+	if (list == head->prev)	/* Zero or one elements */
+		return;
+
+	/* Convert to a null-terminated singly-linked list. */
+	head->prev->next = NULL;
+
+	/*
+	 * Data structure invariants:
+	 * - All lists are singly linked and null-terminated; prev
+	 *   pointers are not maintained.
+	 * - pending is a prev-linked "list of lists" of sorted
+	 *   sublists awaiting further merging.
+	 * - Each of the sorted sublists is power-of-two in size.
+	 * - Sublists are sorted by size and age, smallest & newest at front.
+	 * - There are zero to two sublists of each size.
+	 * - A pair of pending sublists are merged as soon as the number
+	 *   of following pending elements equals their size (i.e.
+	 *   each time count reaches an odd multiple of that size).
+	 *   That ensures each later final merge will be at worst 2:1.
+	 * - Each round consists of:
+	 *   - Merging the two sublists selected by the highest bit
+	 *     which flips when count is incremented, and
+	 *   - Adding an element from the input as a size-1 sublist.
+	 */
+	do {
+		size_t bits;
+		struct list_head **tail = &pending;
+
+		/* Find the least-significant clear bit in count */
+		for (bits = count; bits & 1; bits >>= 1)
+			tail = &(*tail)->prev;
+		/* Do the indicated merge */
+		if (likely(bits)) {
+			struct list_head *a = *tail, *b = a->prev;
+
+			a = merge(priv, cmp, b, a, may_schedule);
+			/* Install the merged result in place of the inputs */
+			a->prev = b->prev;
+			*tail = a;
+		}
+
+		/* Move one element from input list to pending */
+		list->prev = pending;
+		pending = list;
+		list = list->next;
+		pending->next = NULL;
+		count++;
+	} while (list);
+
+	/* End of input; merge together all the pending lists. */
+	list = pending;
+	pending = pending->prev;
+	for (;;) {
+		struct list_head *next = pending->prev;
+
+		if (!next)
+			break;
+		list = merge(priv, cmp, pending, list, may_schedule);
+		pending = next;
+	}
+	/* The final merge, rebuilding prev links */
+	merge_final(priv, cmp, head, pending, list, may_schedule);
+}
+
 /**
  * list_sort - sort a list
  * @priv: private data, opaque to list_sort(), passed to @cmp
@@ -185,73 +258,26 @@ static void merge_final(void *priv, list_cmp_func_t cmp, struct list_head *head,
  * of size 2^k varies from 2^(k-1) (cases 3 and 5 when x == 0) to
  * 2^(k+1) - 1 (second merge of case 5 when x == 2^(k-1) - 1).
  */
-__attribute__((nonnull(2,3)))
+__attribute__((nonnull(2, 3)))
 void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp)
 {
-	struct list_head *list = head->next, *pending = NULL;
-	size_t count = 0;	/* Count of pending */
-
-	if (list == head->prev)	/* Zero or one elements */
-		return;
-
-	/* Convert to a null-terminated singly-linked list. */
-	head->prev->next = NULL;
-
-	/*
-	 * Data structure invariants:
-	 * - All lists are singly linked and null-terminated; prev
-	 *   pointers are not maintained.
-	 * - pending is a prev-linked "list of lists" of sorted
-	 *   sublists awaiting further merging.
-	 * - Each of the sorted sublists is power-of-two in size.
-	 * - Sublists are sorted by size and age, smallest & newest at front.
-	 * - There are zero to two sublists of each size.
-	 * - A pair of pending sublists are merged as soon as the number
-	 *   of following pending elements equals their size (i.e.
-	 *   each time count reaches an odd multiple of that size).
-	 *   That ensures each later final merge will be at worst 2:1.
-	 * - Each round consists of:
-	 *   - Merging the two sublists selected by the highest bit
-	 *     which flips when count is incremented, and
-	 *   - Adding an element from the input as a size-1 sublist.
-	 */
-	do {
-		size_t bits;
-		struct list_head **tail = &pending;
-
-		/* Find the least-significant clear bit in count */
-		for (bits = count; bits & 1; bits >>= 1)
-			tail = &(*tail)->prev;
-		/* Do the indicated merge */
-		if (likely(bits)) {
-			struct list_head *a = *tail, *b = a->prev;
-
-			a = merge(priv, cmp, b, a);
-			/* Install the merged result in place of the inputs */
-			a->prev = b->prev;
-			*tail = a;
-		}
-
-		/* Move one element from input list to pending */
-		list->prev = pending;
-		pending = list;
-		list = list->next;
-		pending->next = NULL;
-		count++;
-	} while (list);
-
-	/* End of input; merge together all the pending lists. */
-	list = pending;
-	pending = pending->prev;
-	for (;;) {
-		struct list_head *next = pending->prev;
-
-		if (!next)
-			break;
-		list = merge(priv, cmp, pending, list);
-		pending = next;
-	}
-	/* The final merge, rebuilding prev links */
-	merge_final(priv, cmp, head, pending, list);
+	__list_sort(priv, head, cmp, false);
 }
 EXPORT_SYMBOL(list_sort);
+
+/**
+ * list_sort_nonatomic - sort a list with conditional rescheduling
+ * @priv: private data, opaque to list_sort_nonatomic(), passed to @cmp
+ * @head: the list to sort
+ * @cmp: the elements comparison function
+ *
+ * This variant of list_sort() periodically invokes cond_resched()
+ * during the sorting process. It should be used for sorting very
+ * long lists where the operation might otherwise cause soft lockups.
+ */
+__attribute__((nonnull(2, 3)))
+void list_sort_nonatomic(void *priv, struct list_head *head, list_cmp_func_t cmp)
+{
+	__list_sort(priv, head, cmp, true);
+}
+EXPORT_SYMBOL(list_sort_nonatomic);
-- 
2.53.0.851.ga537e3e6e9-goog


______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH v2] lib/list_sort: introduce list_sort_nonatomic() and clean up scheduling workarounds
  2026-03-17 16:59 [PATCH v2] lib/list_sort: introduce list_sort_nonatomic() and clean up scheduling workarounds Kuan-Wei Chiu
@ 2026-03-18  5:57 ` Christoph Hellwig
  2026-03-18 10:07   ` Richard Weinberger
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2026-03-18  5:57 UTC (permalink / raw)
  To: Kuan-Wei Chiu
  Cc: richard, akpm, chengzhihao1, hch, jserv, eleanor15x, marscheng,
	linux-mtd, linux-kernel

On Tue, Mar 17, 2026 at 04:59:05PM +0000, Kuan-Wei Chiu wrote:
> Historically, list_sort() implemented a hack in merge_final():
>     if (unlikely(!++count))
>         cmp(priv, b, b);
> 
> This was designed specifically so that callers could periodically
> invoke cond_resched() within their comparison functions when merging
> highly unbalanced lists.
> 
> However, an audit of the kernel tree reveals that only fs/ubifs/ relies
> on this mechanism. For the vast majority of list_sort() users (such as
> block layer IO schedulers and file systems), this results in completely
> wasted function calls. In the worst-case scenario (merging an already
> sorted list where 'a' is exhausted quickly), it results in
> approximately (N/2)/256 unnecessary cmp() calls.
> 
> To clean up this API, eliminate the overhead for generic users, and
> consolidate the scheduling logic:
> 1. Introduce list_sort_nonatomic(), which explicitly calls
>    cond_resched() within its inner merge loops.
> 2. Remove the dummy cmp(priv, b, b) fallback from standard list_sort(),
>    saving unnecessary function calls and improving determinism for all
>    other subsystems.
> 3. Convert the sole user (fs/ubifs/) to the new API and completely
>    remove cond_resched() from UBIFS's comparison callbacks, keeping
>    its comparison logic free of scheduling concerns.
> 
> This change leaves the generic list_sort() completely free of
> scheduling hacks, simplifies UBIFS's callbacks, and ensures that legacy
> long-list sorting workloads remain safe from soft lockups on
> non-preemptible kernels.

As said before, we really should not add the extra nonatomic API
but just do the right thing, and drop the cond_resched() in ubifs
in a prep patch.




* Re: [PATCH v2] lib/list_sort: introduce list_sort_nonatomic() and clean up scheduling workarounds
  2026-03-18  5:57 ` Christoph Hellwig
@ 2026-03-18 10:07   ` Richard Weinberger
  2026-03-19  5:37     ` Kuan-Wei Chiu
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Weinberger @ 2026-03-18 10:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kuan-Wei Chiu, Andrew Morton, chengzhihao1, jserv, eleanor15x,
	marscheng, linux-mtd, linux-kernel

----- Original Message -----
> From: "Christoph Hellwig" <hch@infradead.org>
>> This change leaves the generic list_sort() completely free of
>> scheduling hacks, simplifies UBIFS's callbacks, and ensures that legacy
>> long-list sorting workloads remain safe from soft lockups on
>> non-preemptible kernels.
> 
> As said before we really should not add the extra nonatomic API
> and just do the right thing, and drop the cond_resched in ubifs
> in a prep patch.

I think you are right. After inspecting UBIFS's usage of list_sort()
I feel more confident that we can remove the calls to cond_resched()
from the compare functions.

The compare functions are rather cheap; they don't do (blocking)
MTD I/O.
In the GC case each list contains at most as many UBIFS nodes as you
can stuff into a single LEB.
The replay case is a little different: the replay list can contain
elements from multiple LEBs. But the UBIFS journal is limited to
a few LEBs, so the list is likely always at most a few thousand
elements long.
So we are always talking about calling the compare functions a few
thousand times, not millions of times.

Thanks,
//richard



* Re: [PATCH v2] lib/list_sort: introduce list_sort_nonatomic() and clean up scheduling workarounds
  2026-03-18 10:07   ` Richard Weinberger
@ 2026-03-19  5:37     ` Kuan-Wei Chiu
  0 siblings, 0 replies; 4+ messages in thread
From: Kuan-Wei Chiu @ 2026-03-19  5:37 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Christoph Hellwig, Andrew Morton, chengzhihao1, jserv, eleanor15x,
	marscheng, linux-mtd, linux-kernel

On Wed, Mar 18, 2026 at 11:07:51AM +0100, Richard Weinberger wrote:
> ----- Original Message -----
> > From: "Christoph Hellwig" <hch@infradead.org>
> >> This change leaves the generic list_sort() completely free of
> >> scheduling hacks, simplifies UBIFS's callbacks, and ensures that legacy
> >> long-list sorting workloads remain safe from soft lockups on
> >> non-preemptible kernels.
> > 
> > As said before we really should not add the extra nonatomic API
> > and just do the right thing, and drop the cond_resched in ubifs
> > in a prep patch.
> 
> I think you are right. After inspecting UBIFS's usage of list_sort()
> I feel more confident that we can remove the calls to cond_resched()
> from the compare functions.
> 
> The compare functions are rather cheap; they don't do (blocking)
> MTD I/O.
> In the GC case each list contains at most as many UBIFS nodes as you
> can stuff into a single LEB.
> The replay case is a little different: the replay list can contain
> elements from multiple LEBs. But the UBIFS journal is limited to
> a few LEBs, so the list is likely always at most a few thousand
> elements long.
> So we are always talking about calling the compare functions a few
> thousand times, not millions of times.
> 
Great, thanks for verifying this.

I'll prepare a v3 that drops the cond_resched() calls from UBIFS's
cmp() callbacks and removes the if (!++count) workaround from
list_sort().

Regards,
Kuan-Wei



end of thread, other threads:[~2026-03-19  5:37 UTC | newest]

Thread overview: 4+ messages
2026-03-17 16:59 [PATCH v2] lib/list_sort: introduce list_sort_nonatomic() and clean up scheduling workarounds Kuan-Wei Chiu
2026-03-18  5:57 ` Christoph Hellwig
2026-03-18 10:07   ` Richard Weinberger
2026-03-19  5:37     ` Kuan-Wei Chiu
