[PATCH 1/2] lib: more scalable list

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2] lib: more scalable list_sort()
@ 2010-01-21  4:51 Don Mullis
  2010-01-21  5:17 ` [PATCH 2/2] lib: revise list_sort() comment Don Mullis
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Don Mullis @ 2010-01-21  4:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: airlied, andi, david, dedekind

The use of list_sort() by UBIFS looks like it could generate long
lists; this alternative implementation scales better, reaching ~3x
performance gain as list length approaches the L2 cache size.

Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
gcc-4.4, with flags extracted from an Ubuntu kernel build.  Object
size is 552 bytes versus 405 for Mark J. Roberts' code.

Worst case for either implementation is a list length just over a POT,
and to roughly the same degree, so here are results for a range of
2^N+1 lengths.  List elements were 16 bytes each including malloc
overhead; random initial order.

                      time (msec)
                      Roberts
                      |       Mullis
loop_count  length    |       |    ratio
2000000       3     188     211    1.122
1000000       5     193     158    0.818
 500000       9     232     160    0.689
 250000      17     235     161    0.685
 125000      33     254     178    0.700
  62500      65     284     200    0.704
  31250     129     301     213    0.707
  15625     257     313     224    0.715
   7812     513     332     237    0.713
   3906    1025     361     261    0.722
   1953    2049     390     276    0.707  ~ L1 size
    976    4097     553     323    0.584
    488    8193     678     362    0.533
    244   16385     771     396    0.513
    122   32769     848     430    0.507
     61   65537     918     458    0.498
     30  131073    1216     550    0.452
     15  262145    2557     961    0.375  ~ L2 size
      7  524289    5650    1764    0.312
      3 1048577    6287    2141    0.340
      1 2097153    3650    1588    0.435


Mark's code does not actually implement the usual or generic
mergesort, but rather a variant from Simon Tatham described here:

    http://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html

Simon's algorithm performs O(log N) passes over the entire input list,
doing merges of sublists that double in size on each pass.  The
generic algorithm instead merges pairs of equal length lists as early
as possible, in recursive order.  For either algorithm, the elements
that extend the list beyond power-of-two length are a special case
handled as nearly as possible as a "rounding-up" to a full POT.

Some intuition for the locality of reference implications of merge
order may be gotten by watching this animation:

    http://www.sorting-algorithms.com/merge-sort

Simon's algorithm requires only O(1) extra space rather than the
generic algorithm's O(log N), but in my non-recursive implementation
the actual O(log N) data is merely a vector of ~20 pointers, which
I've put on the stack.

Stability of the sort: distinct elements that compare equal emerge
from the sort in the same order as with Mark's code, for simple test
cases.

A kernel that uses drm.ko appears to run normally with this change; I
have no suitable hardware to similarly test the use by UBIFS.

Signed-off-by: Don Mullis <don.mullis@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
---
 lib/list_sort.c |  135 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 70 insertions(+), 65 deletions(-)

Index: linux-2.6/lib/list_sort.c
===================================================================
--- linux-2.6.orig/lib/list_sort.c	2010-01-19 20:10:31.000000000 -0800
+++ linux-2.6/lib/list_sort.c	2010-01-19 20:19:26.000000000 -0800
@@ -4,94 +4,99 @@
 #include <linux/slab.h>
 #include <linux/list.h>
 
+#define MAX_LIST_SIZE_BITS 20
+
+static struct list_head *merge(void *priv,
+				int (*cmp)(void *priv, struct list_head *a,
+					struct list_head *b),
+				struct list_head *a, struct list_head *b)
+{
+	struct list_head head, *tail = &head;
+
+	while (a && b) {
+		/* if equal, take 'a' -- important for sort stability */
+		if ((*cmp)(priv, a, b) <= 0) {
+			tail->next = a;
+			a = a->next;
+		} else {
+			tail->next = b;
+			b = b->next;
+		}
+		tail = tail->next;
+	}
+	tail->next = a?:b;
+	return head.next;
+}
+
+static void restore_back_links(struct list_head *head)
+{
+	struct list_head *p;
+
+	for (p = head; p->next; p = p->next)
+		p->next->prev = p;
+	head->prev = p;
+}
+
 /**
  * list_sort - sort a list.
  * @priv: private data, passed to @cmp
  * @head: the list to sort
  * @cmp: the elements comparison function
  *
- * This function has been implemented by Mark J Roberts <mjr@znex.org>. It
- * implements "merge sort" which has O(nlog(n)) complexity. The list is sorted
- * in ascending order.
+ * This function implements "merge sort" which has O(nlog(n)) complexity.
+ * The list is sorted in ascending order.
  *
  * The comparison function @cmp is supposed to return a negative value if @a is
  * less than @b, and a positive value if @a is greater than @b. If @a and @b
  * are equivalent, then it does not matter what this function returns.
  */
 void list_sort(void *priv, struct list_head *head,
-	       int (*cmp)(void *priv, struct list_head *a,
-			  struct list_head *b))
+		int (*cmp)(void *priv, struct list_head *a,
+			struct list_head *b))
 {
-	struct list_head *p, *q, *e, *list, *tail, *oldhead;
-	int insize, nmerges, psize, qsize, i;
+	struct list_head *part[MAX_LIST_SIZE_BITS+1]; /* sorted partial lists --
+							 last slot a sentinel */
+	int lev;  /* index into part[] */
+	int max_lev = -1;
+	struct list_head *list;
 
 	if (list_empty(head))
 		return;
 
+	memset(part, 0, sizeof(part));
+
+	/*
+	 * Rather than take time to maintain the back links through the merges,
+	 * we'll recreate them afterward.  Null-terminate the input list too.
+	 */
 	list = head->next;
-	list_del(head);
-	insize = 1;
-	for (;;) {
-		p = oldhead = list;
-		list = tail = NULL;
-		nmerges = 0;
-
-		while (p) {
-			nmerges++;
-			q = p;
-			psize = 0;
-			for (i = 0; i < insize; i++) {
-				psize++;
-				q = q->next == oldhead ? NULL : q->next;
-				if (!q)
-					break;
-			}
+	head->prev->next = NULL;
 
-			qsize = insize;
-			while (psize > 0 || (qsize > 0 && q)) {
-				if (!psize) {
-					e = q;
-					q = q->next;
-					qsize--;
-					if (q == oldhead)
-						q = NULL;
-				} else if (!qsize || !q) {
-					e = p;
-					p = p->next;
-					psize--;
-					if (p == oldhead)
-						p = NULL;
-				} else if (cmp(priv, p, q) <= 0) {
-					e = p;
-					p = p->next;
-					psize--;
-					if (p == oldhead)
-						p = NULL;
-				} else {
-					e = q;
-					q = q->next;
-					qsize--;
-					if (q == oldhead)
-						q = NULL;
-				}
-				if (tail)
-					tail->next = e;
-				else
-					list = e;
-				e->prev = tail;
-				tail = e;
+	while (list) {
+		struct list_head *cur = list;
+		list = list->next;
+		cur->next = NULL;
+
+		for (lev = 0; part[lev]; lev++) {
+			cur = merge(priv, cmp, part[lev], cur);
+			part[lev] = NULL;
+		}
+		if (lev > max_lev) {
+			if (unlikely(lev >= ARRAY_SIZE(part)-1)) {
+				printk_once(KERN_DEBUG "list passed to"
+					" list_sort() too long for"
+					" efficiency\n");
+				lev--;
 			}
-			p = q;
+			max_lev = lev;
 		}
+		part[lev] = cur;
+	}
 
-		tail->next = list;
-		list->prev = tail;
-
-		if (nmerges <= 1)
-			break;
+	for (lev = 0; lev <= max_lev; lev++)
+		list = merge(priv, cmp, list, part[lev]);
 
-		insize *= 2;
-	}
+	restore_back_links(list);
 
 	head->next = list;
 	head->prev = list->prev;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/2] lib: revise list_sort() comment
  2010-01-21  4:51 [PATCH 1/2] lib: more scalable list_sort() Don Mullis
@ 2010-01-21  5:17 ` Don Mullis
  2010-01-21 19:11   ` Olaf Titz
  2010-01-21  9:22 ` [PATCH 1/2] lib: more scalable list_sort() Artem Bityutskiy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Don Mullis @ 2010-01-21  5:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: airlied, andi, david, dedekind

Clarify and correct header comment of list_sort().

Signed-off-by: Don Mullis <don.mullis@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
---
 lib/list_sort.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux-2.6/lib/list_sort.c
===================================================================
--- linux-2.6.orig/lib/list_sort.c	2010-01-19 22:26:03.000000000 -0800
+++ linux-2.6/lib/list_sort.c	2010-01-19 22:28:19.000000000 -0800
@@ -38,17 +38,19 @@ static void restore_back_links(struct li
 }
 
 /**
- * list_sort - sort a list.
- * @priv: private data, passed to @cmp
+ * list_sort - sort a list
+ * @priv: private data, opaque to list_sort(), passed to @cmp
  * @head: the list to sort
  * @cmp: the elements comparison function
  *
- * This function implements "merge sort" which has O(nlog(n)) complexity.
- * The list is sorted in ascending order.
+ * This function implements "merge sort", which has O(nlog(n))
+ * complexity.
  *
- * The comparison function @cmp is supposed to return a negative value if @a is
- * less than @b, and a positive value if @a is greater than @b. If @a and @b
- * are equivalent, then it does not matter what this function returns.
+ * The comparison function @cmp must return a negative value if @a
+ * should sort before @b, and a positive value if @a should sort after
+ * @b. If @a and @b are equivalent, and their original relative
+ * ordering is to be preserved, @cmp should return 0; otherwise, the
+ * return value does not matter.
  */
 void list_sort(void *priv, struct list_head *head,
 		int (*cmp)(void *priv, struct list_head *a,

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-21  4:51 [PATCH 1/2] lib: more scalable list_sort() Don Mullis
  2010-01-21  5:17 ` [PATCH 2/2] lib: revise list_sort() comment Don Mullis
@ 2010-01-21  9:22 ` Artem Bityutskiy
  2010-01-21  9:54   ` Dave Chinner
  2010-01-21 17:59 ` Andi Kleen
  2010-08-04 14:04 ` Artem Bityutskiy
  3 siblings, 1 reply; 22+ messages in thread
From: Artem Bityutskiy @ 2010-01-21  9:22 UTC (permalink / raw)
  To: Don Mullis; +Cc: linux-kernel, airlied, andi, david

On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.
> 
> Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> gcc-4.4, with flags extracted from an Ubuntu kernel build.  Object
> size is 552 bytes versus 405 for Mark J. Roberts' code.
> 
> Worst case for either implementation is a list length just over a POT,
> and to roughly the same degree, so here are results for a range of
> 2^N+1 lengths.  List elements were 16 bytes each including malloc
> overhead; random initial order.
> 

Could you please add a debugging function which would be compiled-out
normally, and which would check that on the output 'list_sort()' gives
really sorted list, and number of elements in the list stays the same.
You'd call this function before returning from list_sort(). Something
like:

#ifdef DEBUG_LIST_SORT
static int list_check(void *priv, struct list_head *head,
                      int (*cmp)(void *priv, struct list_head *a,
                                 struct list_head *b))
{
   /* Checking */
}
#else
#define list_check(priv, head, cmp) 0
#endif

This will provide more confidence in the algorithm correctness for
everyone who modifies 'list_sort()'.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-21  9:22 ` [PATCH 1/2] lib: more scalable list_sort() Artem Bityutskiy
@ 2010-01-21  9:54   ` Dave Chinner
  2010-01-21 11:44     ` Artem Bityutskiy
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-21  9:54 UTC (permalink / raw)
  To: Artem Bityutskiy; +Cc: Don Mullis, linux-kernel, airlied, andi

On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
> On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > The use of list_sort() by UBIFS looks like it could generate long
> > lists; this alternative implementation scales better, reaching ~3x
> > performance gain as list length approaches the L2 cache size.
> > 
> > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > gcc-4.4, with flags extracted from an Ubuntu kernel build.  Object
> > size is 552 bytes versus 405 for Mark J. Roberts' code.
> > 
> > Worst case for either implementation is a list length just over a POT,
> > and to roughly the same degree, so here are results for a range of
> > 2^N+1 lengths.  List elements were 16 bytes each including malloc
> > overhead; random initial order.
> > 
> 
> Could you please add a debugging function which would be compiled-out
> normally, and which would check that on the output 'list_sort()' gives
> really sorted list, and number of elements in the list stays the same.
> You'd call this function before returning from list_sort(). Something
> like:
> 
> #ifdef DEBUG_LIST_SORT
> static int list_check(void *priv, struct list_head *head,
>                       int (*cmp)(void *priv, struct list_head *a,
>                                  struct list_head *b))
> {
>    /* Checking */
> }
> #else
> #define list_check(priv, head, cmp) 0
> #endif
> 
> This will provide more confidence in the algorithm correctness for
> everyone who modifies 'list_sort()'.

I'd suggest the same method as employed in lib/sort.c - a
simple userspace program that verifies correct operation is included
in lib/sort.c....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-21  9:54   ` Dave Chinner
@ 2010-01-21 11:44     ` Artem Bityutskiy
  2010-01-21 16:34       ` Don Mullis
  0 siblings, 1 reply; 22+ messages in thread
From: Artem Bityutskiy @ 2010-01-21 11:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Don Mullis, linux-kernel, airlied, andi

On Thu, 2010-01-21 at 20:54 +1100, Dave Chinner wrote:
> On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
> > On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > > The use of list_sort() by UBIFS looks like it could generate long
> > > lists; this alternative implementation scales better, reaching ~3x
> > > performance gain as list length approaches the L2 cache size.
> > > 
> > > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > > gcc-4.4, with flags extracted from an Ubuntu kernel build.  Object
> > > size is 552 bytes versus 405 for Mark J. Roberts' code.
> > > 
> > > Worst case for either implementation is a list length just over a POT,
> > > and to roughly the same degree, so here are results for a range of
> > > 2^N+1 lengths.  List elements were 16 bytes each including malloc
> > > overhead; random initial order.
> > > 
> > 
> > Could you please add a debugging function which would be compiled-out
> > normally, and which would check that on the output 'list_sort()' gives
> > really sorted list, and number of elements in the list stays the same.
> > You'd call this function before returning from list_sort(). Something
> > like:
> > 
> > #ifdef DEBUG_LIST_SORT
> > static int list_check(void *priv, struct list_head *head,
> >                       int (*cmp)(void *priv, struct list_head *a,
> >                                  struct list_head *b))
> > {
> >    /* Checking */
> > }
> > #else
> > #define list_check(priv, head, cmp) 0
> > #endif
> > 
> > This will provide more confidence in the algorithm correctness for
> > everyone who modifies 'list_sort()'.
> 
> I'd suggest the same method as employed in lib/sort.c - a
> simple userspace program that verifies correct operation is included
> in lib/sort.c....

Yeah, that's also an option.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-21 11:44     ` Artem Bityutskiy
@ 2010-01-21 16:34       ` Don Mullis
  0 siblings, 0 replies; 22+ messages in thread
From: Don Mullis @ 2010-01-21 16:34 UTC (permalink / raw)
  To: dedekind; +Cc: Dave Chinner, linux-kernel, airlied, andi

Artem Bityutskiy <dedekind@infradead.org> writes:

> On Thu, 2010-01-21 at 20:54 +1100, Dave Chinner wrote:
>> On Thu, Jan 21, 2010 at 11:22:55AM +0200, Artem Bityutskiy wrote:
>> > 
>> > Could you please add a debugging function which would be compiled-out
>> > normally, and which would check that on the output 'list_sort()' gives
>> > really sorted list, and number of elements in the list stays the same.
>> > You'd call this function before returning from list_sort(). Something
>> > like:
>> > 
>> > #ifdef DEBUG_LIST_SORT
>> > static int list_check(void *priv, struct list_head *head,
>> >                       int (*cmp)(void *priv, struct list_head *a,
>> >                                  struct list_head *b))
>> > {
>> >    /* Checking */
>> > }
>> > #else
>> > #define list_check(priv, head, cmp) 0
>> > #endif
>> > 
>> > This will provide more confidence in the algorithm correctness for
>> > everyone who modifies 'list_sort()'.
>> 
>> I'd suggest the same method as employed in lib/sort.c - a
>> simple userspace program that verifies correct operation is included
>> in lib/sort.c....
>
> Yeah, that's also an option.

Okay. The regression test in lib/sort.c is kernel-space, run once at
boot. I'd like to do something similar for lib/list_sort.c, conditioned
on DEBUG_LIST_SORT. I would extend the testing to verify stability as
well as sort order and number of elements.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-21  4:51 [PATCH 1/2] lib: more scalable list_sort() Don Mullis
  2010-01-21  5:17 ` [PATCH 2/2] lib: revise list_sort() comment Don Mullis
  2010-01-21  9:22 ` [PATCH 1/2] lib: more scalable list_sort() Artem Bityutskiy
@ 2010-01-21 17:59 ` Andi Kleen
  2010-01-22  3:17   ` Don Mullis
  2010-08-04 14:04 ` Artem Bityutskiy
  3 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2010-01-21 17:59 UTC (permalink / raw)
  To: Don Mullis; +Cc: linux-kernel, airlied, andi, david, dedekind

On Wed, Jan 20, 2010 at 08:51:26PM -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.

If this can really be called with long lists 
the function likely needs (optional) need_resched()s
Otherwise it could ruin scheduling latencies.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib: revise list_sort() comment
  2010-01-21  5:17 ` [PATCH 2/2] lib: revise list_sort() comment Don Mullis
@ 2010-01-21 19:11   ` Olaf Titz
  2010-01-22  4:54     ` Don Mullis
  0 siblings, 1 reply; 22+ messages in thread
From: Olaf Titz @ 2010-01-21 19:11 UTC (permalink / raw)
  To: linux-kernel

> + * The comparison function @cmp must return a negative value if @a
> + * should sort before @b, and a positive value if @a should sort after
> + * @b. If @a and @b are equivalent, and their original relative
> + * ordering is to be preserved, @cmp should return 0; otherwise, the
> + * return value does not matter.

This "otherwise... does not matter" sounds funny and confusing. Either
read this as "the return value does not matter if it is neither <0, >0
or ==0" or "the return value does not matter if the function wants it
to be ignored". :-)

Just omitting the "otherwise" clause would be clearer.

Olaf

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-21 17:59 ` Andi Kleen
@ 2010-01-22  3:17   ` Don Mullis
  2010-01-22 10:43     ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Don Mullis @ 2010-01-22  3:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, airlied, david, dedekind

Andi Kleen <andi@firstfloor.org> writes:

> On Wed, Jan 20, 2010 at 08:51:26PM -0800, Don Mullis wrote:
>> The use of list_sort() by UBIFS looks like it could generate long
>> lists; this alternative implementation scales better, reaching ~3x
>> performance gain as list length approaches the L2 cache size.
>
> If this can really be called with long lists 
> the function likely needs (optional) need_resched()s
> Otherwise it could ruin scheduling latencies.
>
> -Andi

Being just a dumb library routine, list_sort() has no idea what context
it's been called in, how long a list a particular client could pass in,
nor how expensive the client's cmp() callback might be.

The cmp() callback already passes back a client-private pointer.
Hanging off of this could be a count of calls, or timing information,
maintained by the client.  Whenever some threshold is reached, the
client's cmp() could do whatever good CPU-sharing citizenship required.

This doesn't address the final O(n) pass over the list to restore the
back links.  So the cost of that pass would dictate the upper limit on
list length for a client already using the cmp() call-counting/timing
trick to break up the earlier compare-and-merge passes.

If that's not good enough, a more complicated solution would be
required.  But I'm hoping we don't need to go there yet.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] lib: revise list_sort() comment
  2010-01-21 19:11   ` Olaf Titz
@ 2010-01-22  4:54     ` Don Mullis
  0 siblings, 0 replies; 22+ messages in thread
From: Don Mullis @ 2010-01-22  4:54 UTC (permalink / raw)
  To: Olaf Titz; +Cc: linux-kernel

Olaf Titz <olaf@bigred.inka.de> writes:

>> + * The comparison function @cmp must return a negative value if @a
>> + * should sort before @b, and a positive value if @a should sort after
>> + * @b. If @a and @b are equivalent, and their original relative
>> + * ordering is to be preserved, @cmp should return 0; otherwise, the
>> + * return value does not matter.
>
> This "otherwise... does not matter" sounds funny and confusing. Either
> read this as "the return value does not matter if it is neither <0, >0
> or ==0" or "the return value does not matter if the function wants it
> to be ignored". :-)
>
> Just omitting the "otherwise" clause would be clearer.
>
> Olaf

Okay, I will simplify the wording.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-22  3:17   ` Don Mullis
@ 2010-01-22 10:43     ` Andi Kleen
  2010-01-22 12:29       ` Artem Bityutskiy
  2010-01-23  8:28       ` Dave Chinner
  0 siblings, 2 replies; 22+ messages in thread
From: Andi Kleen @ 2010-01-22 10:43 UTC (permalink / raw)
  To: Don Mullis; +Cc: linux-kernel, airlied, david, dedekind

Don Mullis <don.mullis@gmail.com> writes:
>
> Being just a dumb library routine, list_sort() has no idea what context
> it's been called in, how long a list a particular client could pass in,
> nor how expensive the client's cmp() callback might be.
>
> The cmp() callback already passes back a client-private pointer.
> Hanging off of this could be a count of calls, or timing information,
> maintained by the client.  Whenever some threshold is reached, the
> client's cmp() could do whatever good CPU-sharing citizenship required.

need_resched() does all the timing/thresholding (it checks the 
reschedule flag set by the timer interrupt). You just have to call it.
But preferable not in the inner loop, but in a outer one. It's
not hyper-expensive, but it's not free either.

The drawback is that if it's called the context always has to
allow sleeping, so it might need to be optional.

Anyways a better fix might be simply to ensure in the caller
that lists never get as long that they become a scheduling
hazard. But you indicated that ubifs would pass very long lists?
Perhaps ubifs (and other calls who might have that problem) simply
needs to be fixed.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-22 10:43     ` Andi Kleen
@ 2010-01-22 12:29       ` Artem Bityutskiy
  2010-01-22 17:55         ` Don Mullis
  2010-01-23  8:28       ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Artem Bityutskiy @ 2010-01-22 12:29 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Don Mullis, linux-kernel, airlied, david

On Fri, 2010-01-22 at 11:43 +0100, Andi Kleen wrote:
> Don Mullis <don.mullis@gmail.com> writes:
> >
> > Being just a dumb library routine, list_sort() has no idea what context
> > it's been called in, how long a list a particular client could pass in,
> > nor how expensive the client's cmp() callback might be.
> >
> > The cmp() callback already passes back a client-private pointer.
> > Hanging off of this could be a count of calls, or timing information,
> > maintained by the client.  Whenever some threshold is reached, the
> > client's cmp() could do whatever good CPU-sharing citizenship required.
> 
> need_resched() does all the timing/thresholding (it checks the 
> reschedule flag set by the timer interrupt). You just have to call it.
> But preferable not in the inner loop, but in a outer one. It's
> not hyper-expensive, but it's not free either.
> 
> The drawback is that if it's called the context always has to
> allow sleeping, so it might need to be optional.
> 
> Anyways a better fix might be simply to ensure in the caller
> that lists never get as long that they become a scheduling
> hazard. But you indicated that ubifs would pass very long lists?
> Perhaps ubifs (and other calls who might have that problem) simply
> needs to be fixed.

No, they are not very long. A hundred or so I guess, rarely. But we need
to check what is really the worst case, but it should not be too many.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-22 12:29       ` Artem Bityutskiy
@ 2010-01-22 17:55         ` Don Mullis
  0 siblings, 0 replies; 22+ messages in thread
From: Don Mullis @ 2010-01-22 17:55 UTC (permalink / raw)
  To: dedekind; +Cc: Andi Kleen, linux-kernel, airlied, david

Artem Bityutskiy <dedekind@infradead.org> writes:

> On Fri, 2010-01-22 at 11:43 +0100, Andi Kleen wrote:
>> Don Mullis <don.mullis@gmail.com> writes:
>> >
>> > Being just a dumb library routine, list_sort() has no idea what context
>> > it's been called in, how long a list a particular client could pass in,
>> > nor how expensive the client's cmp() callback might be.
>> >
>> > The cmp() callback already passes back a client-private pointer.
>> > Hanging off of this could be a count of calls, or timing information,
>> > maintained by the client.  Whenever some threshold is reached, the
>> > client's cmp() could do whatever good CPU-sharing citizenship required.
>> 
>> need_resched() does all the timing/thresholding (it checks the 
>> reschedule flag set by the timer interrupt). You just have to call it.
>> But preferable not in the inner loop, but in a outer one. It's
>> not hyper-expensive, but it's not free either.
>> 
>> The drawback is that if it's called the context always has to
>> allow sleeping, so it might need to be optional.
>> 
>> Anyways a better fix might be simply to ensure in the caller
>> that lists never get as long that they become a scheduling
>> hazard. But you indicated that ubifs would pass very long lists?
>> Perhaps ubifs (and other calls who might have that problem) simply
>> needs to be fixed.
>
> No, they are not very long. A hundred or so I guess, rarely. But we need
> to check what is really the worst case, but it should not be too many.

I suggest for now we leave scheduling issues as the caller's
responsibility, and keep list_sort() simple.  Wouldn't want to be
getting any email like this:

         http://lwn.net/Articles/366768/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-22 10:43     ` Andi Kleen
  2010-01-22 12:29       ` Artem Bityutskiy
@ 2010-01-23  8:28       ` Dave Chinner
  2010-01-23 11:35         ` Andi Kleen
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-23  8:28 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Don Mullis, linux-kernel, airlied, dedekind

On Fri, Jan 22, 2010 at 11:43:21AM +0100, Andi Kleen wrote:
> Don Mullis <don.mullis@gmail.com> writes:
> >
> > Being just a dumb library routine, list_sort() has no idea what context
> > it's been called in, how long a list a particular client could pass in,
> > nor how expensive the client's cmp() callback might be.
> >
> > The cmp() callback already passes back a client-private pointer.
> > Hanging off of this could be a count of calls, or timing information,
> > maintained by the client.  Whenever some threshold is reached, the
> > client's cmp() could do whatever good CPU-sharing citizenship required.
> 
> need_resched() does all the timing/thresholding (it checks the 
> reschedule flag set by the timer interrupt). You just have to call it.
> But preferable not in the inner loop, but in a outer one. It's
> not hyper-expensive, but it's not free either.
> 
> The drawback is that if it's called the context always has to
> allow sleeping, so it might need to be optional.
> 
> Anyways a better fix might be simply to ensure in the caller
> that lists never get as long that they become a scheduling
> hazard. But you indicated that ubifs would pass very long lists?
> Perhaps ubifs (and other calls who might have that problem) simply
> needs to be fixed.

Burning CPU time to save on IO is a very valid tradeoff in
filesystem design - burning a few hundred millieseconds of CPU
time can result in savcwinge tens of seconds of IO time. Hence
passing big long lists to be sorted is not an indication of broken
design, it's an indication of understanding CPU time vs IO time
tradeoffs during design...

For example, the use I have for list sort (sorting delayed write
buffers in ascending block order before Io submission) will result
in XFS asking for lists in the order of tens to hundreds of
thousands of items to be sorted. The code I already is showing wall
clock time reductions of 30-40% for stuff like kernel tarball
extraction or creation of millions of small files. This is because
the number of buffers being issued for IO is far larger than we
should sanely hold in the elevator request queues, but pre-sorting
them results in the elevator getting a near 100% merge efficiency of
delayed write buffers instead of near zero.  Hence we get much, much
better IO patterns out of the filesystem and things go faster.

Hence I think scalability and minimising scheduling latency for
list_sort() is definitely important. I was kind of planning to address
this when the problem was a performance limiting factor, but Don
has made a pre-emptive strike. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-23  8:28       ` Dave Chinner
@ 2010-01-23 11:35         ` Andi Kleen
  2010-01-23 16:05           ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2010-01-23 11:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andi Kleen, Don Mullis, linux-kernel, airlied, dedekind

> Burning CPU time to save on IO is a very valid tradeoff in
> filesystem design - burning a few hundred millieseconds of CPU
> time can result in savcwinge tens of seconds of IO time. Hence
> passing big long lists to be sorted is not an indication of broken
> design, it's an indication of understanding CPU time vs IO time
> tradeoffs during design...

Burning long CPU time in kernel code without latency breaker code is always
a sign of broken design. When you burn you have to check for 
reschedules. It's that simple.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-23 11:35         ` Andi Kleen
@ 2010-01-23 16:05           ` Dave Chinner
  2010-01-24 20:59             ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2010-01-23 16:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Don Mullis, linux-kernel, airlied, dedekind

On Sat, Jan 23, 2010 at 12:35:51PM +0100, Andi Kleen wrote:
> > Burning CPU time to save on IO is a very valid tradeoff in
> > filesystem design - burning a few hundred millieseconds of CPU
> > time can result in savcwinge tens of seconds of IO time. Hence
> > passing big long lists to be sorted is not an indication of broken
> > design, it's an indication of understanding CPU time vs IO time
> > tradeoffs during design...
> 
> Burning long CPU time in kernel code without latency breaker code is always
> a sign of broken design.

It's a characteristic of a sub-optimal implementation, not bad
design. Plenty of code has been fixed over the years simply by
adding cond_resched() to loops that have triggered latency
warnings.

Similarly, adding cond_resched() to list_sort means you can stop
worrying about the scheduling latency impact of sorting long lists.
I fail to see why you're making such a big deal out of this.....

Cheers

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-23 16:05           ` Dave Chinner
@ 2010-01-24 20:59             ` Andi Kleen
  2010-01-24 21:10               ` Artem Bityutskiy
  2010-01-25  3:41               ` Dave Chinner
  0 siblings, 2 replies; 22+ messages in thread
From: Andi Kleen @ 2010-01-24 20:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andi Kleen, Don Mullis, linux-kernel, airlied, dedekind

> Similarly, adding cond_resched() to list_sort means you can stop
> worrying about the scheduling latency impact of sorting long lists.
> I fail to see why you're making such a big deal out of this.....

It's not easy to add it to a low level library function like list_sort()
because you would need to ensure that all callers are allowed to sleep.

Or even if all callers currently allow it future ones might not.

So it would need a new flag or so if needed, completely changing 
the interface.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-24 20:59             ` Andi Kleen
@ 2010-01-24 21:10               ` Artem Bityutskiy
  2010-01-24 22:38                 ` Don Mullis
  2010-01-25  3:41               ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Artem Bityutskiy @ 2010-01-24 21:10 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Dave Chinner, Don Mullis, linux-kernel, airlied

On Sun, 2010-01-24 at 21:59 +0100, Andi Kleen wrote:
> > Similarly, adding cond_resched() to list_sort means you can stop
> > worrying about the scheduling latency impact of sorting long lists.
> > I fail to see why you're making such a big deal out of this.....
> 
> It's not easy to add it to a low level library function like list_sort()
> because you would need to ensure that all callers are allowed to sleep.
> 
> Or even if all callers currently allow it future ones might not.
> 
> So it would need a new flag or so if needed, completely changing 
> the interface.

Don made IMO a good proposal for the caller to add cond_reshed() in its
'cmp()' callback, if needed. Shouldn't that work fine?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-24 21:10               ` Artem Bityutskiy
@ 2010-01-24 22:38                 ` Don Mullis
  0 siblings, 0 replies; 22+ messages in thread
From: Don Mullis @ 2010-01-24 22:38 UTC (permalink / raw)
  To: dedekind; +Cc: Andi Kleen, Dave Chinner, linux-kernel, airlied

Artem Bityutskiy <dedekind1@gmail.com> writes:

> Don made IMO a good proposal for the caller to add cond_reshed() in its
> 'cmp()' callback, if needed. Shouldn't that work fine?

In the list_sort() implementation I posted, there was an O(n) loop that
contained no calls back to cmp().  The V2 implementation I'm testing
right now fixes that -- all inner loops will call back to cmp().  It's
also a bit faster, though ~100 bytes bigger.

If there's any other objection to pushing responsibility for calling
cond_resched() back to the client, via the cmp() routine, please, let's
hear it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-24 20:59             ` Andi Kleen
  2010-01-24 21:10               ` Artem Bityutskiy
@ 2010-01-25  3:41               ` Dave Chinner
  1 sibling, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-01-25  3:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Don Mullis, linux-kernel, airlied, dedekind

On Sun, Jan 24, 2010 at 09:59:06PM +0100, Andi Kleen wrote:
> > Similarly, adding cond_resched() to list_sort means you can stop
> > worrying about the scheduling latency impact of sorting long lists.
> > I fail to see why you're making such a big deal out of this.....
> 
> It's not easy to add it to a low level library function like list_sort()
> because you would need to ensure that all callers are allowed to sleep.

might_sleep() annotations are typically used in this case....

> Or even if all callers currently allow it future ones might not.

You're trying to make this way more complex than the current
requirements need it to be.  Just declaring a simple rule -
"list_sort() can sleep" - and all these problems go away until
someone really needs list sorting in atomic context.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-01-21  4:51 [PATCH 1/2] lib: more scalable list_sort() Don Mullis
                   ` (2 preceding siblings ...)
  2010-01-21 17:59 ` Andi Kleen
@ 2010-08-04 14:04 ` Artem Bityutskiy
  2010-08-07  7:50   ` Artem Bityutskiy
  3 siblings, 1 reply; 22+ messages in thread
From: Artem Bityutskiy @ 2010-08-04 14:04 UTC (permalink / raw)
  To: Don Mullis; +Cc: linux-kernel, airlied, andi, david

On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> The use of list_sort() by UBIFS looks like it could generate long
> lists; this alternative implementation scales better, reaching ~3x
> performance gain as list length approaches the L2 cache size.
> 
> Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> gcc-4.4, with flags extracted from an Ubuntu kernel build.  Object
> size is 552 bytes versus 405 for Mark J. Roberts' code.
> 
> Worst case for either implementation is a list length just over a POT,
> and to roughly the same degree, so here are results for a range of
> 2^N+1 lengths.  List elements were 16 bytes each including malloc
> overhead; random initial order.

This patch breaks UBIFS. I did not have time to dig deeper, but the
symptoms is that list_sort() calls the 'cmp()' function with bogus
'struct list_head *a' parameter, which did not exist in the original
list.

I see this on 2.6.35-rc1. Did not try the release yet. But when I revert
your patch - everything works fine.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] lib: more scalable list_sort()
  2010-08-04 14:04 ` Artem Bityutskiy
@ 2010-08-07  7:50   ` Artem Bityutskiy
  0 siblings, 0 replies; 22+ messages in thread
From: Artem Bityutskiy @ 2010-08-07  7:50 UTC (permalink / raw)
  To: Don Mullis; +Cc: linux-kernel, airlied, andi, david

On Wed, 2010-08-04 at 17:04 +0300, Artem Bityutskiy wrote:
> On Wed, 2010-01-20 at 20:51 -0800, Don Mullis wrote:
> > The use of list_sort() by UBIFS looks like it could generate long
> > lists; this alternative implementation scales better, reaching ~3x
> > performance gain as list length approaches the L2 cache size.
> > 
> > Stand-alone program timings were run on a Core 2 duo L1=32KB L2=4MB,
> > gcc-4.4, with flags extracted from an Ubuntu kernel build.  Object
> > size is 552 bytes versus 405 for Mark J. Roberts' code.
> > 
> > Worst case for either implementation is a list length just over a POT,
> > and to roughly the same degree, so here are results for a range of
> > 2^N+1 lengths.  List elements were 16 bytes each including malloc
> > overhead; random initial order.
> 
> This patch breaks UBIFS. I did not have time to dig deeper, but the
> symptoms is that list_sort() calls the 'cmp()' function with bogus
> 'struct list_head *a' parameter, which did not exist in the original
> list.

Sorry, it appeared to be that UBIFS 'cmp()' function is broken, so your
patches revealed the issue. Sorry for noise and thanks for revealing
UBIFS problem!

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2010-08-07  7:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-01-21  4:51 [PATCH 1/2] lib: more scalable list_sort() Don Mullis
2010-01-21  5:17 ` [PATCH 2/2] lib: revise list_sort() comment Don Mullis
2010-01-21 19:11   ` Olaf Titz
2010-01-22  4:54     ` Don Mullis
2010-01-21  9:22 ` [PATCH 1/2] lib: more scalable list_sort() Artem Bityutskiy
2010-01-21  9:54   ` Dave Chinner
2010-01-21 11:44     ` Artem Bityutskiy
2010-01-21 16:34       ` Don Mullis
2010-01-21 17:59 ` Andi Kleen
2010-01-22  3:17   ` Don Mullis
2010-01-22 10:43     ` Andi Kleen
2010-01-22 12:29       ` Artem Bityutskiy
2010-01-22 17:55         ` Don Mullis
2010-01-23  8:28       ` Dave Chinner
2010-01-23 11:35         ` Andi Kleen
2010-01-23 16:05           ` Dave Chinner
2010-01-24 20:59             ` Andi Kleen
2010-01-24 21:10               ` Artem Bityutskiy
2010-01-24 22:38                 ` Don Mullis
2010-01-25  3:41               ` Dave Chinner
2010-08-04 14:04 ` Artem Bityutskiy
2010-08-07  7:50   ` Artem Bityutskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).