* [Linux-ia64] PATCH: performance problems with swiotlb.c
@ 2001-12-03 20:12 Luck, Tony
2001-12-03 21:55 ` David Mosberger
0 siblings, 1 reply; 2+ messages in thread
From: Luck, Tony @ 2001-12-03 20:12 UTC (permalink / raw)
To: linux-ia64
[-- Attachment #1: Type: text/plain, Size: 1593 bytes --]
This problem was found and this fix suggested by Dori Eldar here
at Intel (I just critiqued it for a while and pointed out some
corner cases that needed to be addressed).
There are performance problems with the current swiotlb.c bounce
buffer allocation code. Users with large systems full of devices
that require bounce buffers can sometimes find that they need to
increase the number of bounce buffers available, using the swiotlb
boot-time option, to avoid panicking when buffers run out.
However, this can result in slow allocation/free of buffers, as the
swiotlb code spends a lot of CPU time coalescing blocks. On one
benchmark this fix raised Ethernet throughput from around 40 Mb/s
to 95 Mb/s while reducing CPU load from 100% to 20%.
The basis of the fix is to partition the space reserved for bounce
buffers into smaller segments, so that we place an upper bound on
the amount of work needed to coalesce blocks. In addition to the
performance boost, this patch also fixes one real bug that Dori
found while testing. map_single() would pick a "stride" based on
the number of slots needed for the request ... but if this stride
is not a power of two, the "do { ... } while (index != wrap);" loop
can spin indefinitely. He changed that to use a stride of 1 because
he couldn't see the benefit of the larger stride ... nor can I ... e.g.
when looking for 5 slots you might look at an allocation map that
looks like this:
3 <- look here, 3<5 so skip down 5 slots
2
1
0
5
4 <- now look here, missing the large enough block that began
on the previous slot.
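To see how the loop can spin forever, here is a simplified, standalone
model of the old search loop (a hypothetical reconstruction for
illustration, not the exact kernel code): the index advances by "stride"
and resets to 0 when it runs past the end of the table, so once the index
has reset, only multiples of the stride are ever visited again, and a
starting slot that is not such a multiple is never revisited.

```c
/* Simplified model of the old map_single() search loop (hypothetical
 * reconstruction, not the exact kernel code). Returns the number of
 * iterations until the loop revisits its starting slot, or -1 if it
 * fails to do so within `limit` steps (i.e. it would spin forever). */
static int steps_until_wrap(int start, int stride, int nslabs, int limit)
{
	int index = start, wrap = start, steps = 0;

	do {
		index += stride;
		if (index >= nslabs)
			index = 0;	/* reset to 0, not modulo nslabs */
		if (++steps > limit)
			return -1;	/* never returns to `wrap` */
	} while (index != wrap);

	return steps;
}
```

With start 0 and stride 4 in a 16-slot table the loop terminates after
four steps, but with start 3 and stride 5 the index cycles through
0, 5, 10, 15 after the first reset and never hits slot 3 again.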
-Tony Luck
[-- Attachment #2: patch-swiotlb --]
[-- Type: application/octet-stream, Size: 3233 bytes --]
diff -ru ../../REF/2.4.16-ia64-011128/arch/ia64/lib/swiotlb.c linux/arch/ia64/lib/swiotlb.c
--- ../../REF/2.4.16-ia64-011128/arch/ia64/lib/swiotlb.c Wed Nov 28 16:55:04 2001
+++ linux/arch/ia64/lib/swiotlb.c Mon Dec 3 11:41:51 2001
@@ -27,6 +27,16 @@
#define ALIGN(val, align) ((unsigned long) \
(((unsigned long) (val) + ((align) - 1)) & ~((align) - 1)))
+#define OFFSET(val,align) ((unsigned long) \
+ ( (val) & ( (align) - 1)))
+
+/*
+ * Maximum allowable number of contiguous slabs to map,
+ * must be a power of 2. What is the appropriate value ?
+ * The complexity of {map,unmap}_single is linearly dependent on this value.
+ */
+#define IO_TLB_SEGSIZE 128
+
/*
* log of the size of each IO TLB slab. The number of slabs is command line controllable.
*/
@@ -65,10 +75,15 @@
setup_io_tlb_npages (char *str)
{
io_tlb_nslabs = simple_strtoul(str, NULL, 0) << (PAGE_SHIFT - IO_TLB_SHIFT);
+
+ /* avoid tail segment of size < IO_TLB_SEGSIZE */
+ io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+
return 1;
}
__setup("swiotlb=", setup_io_tlb_npages);
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data structures for
* the software IO TLB used to implement the PCI DMA API.
@@ -88,12 +103,12 @@
/*
* Allocate and initialize the free list array. This array is used
- * to find contiguous free memory regions of size 2^IO_TLB_SHIFT between
- * io_tlb_start and io_tlb_end.
+ * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
+ * between io_tlb_start and io_tlb_end.
*/
io_tlb_list = alloc_bootmem(io_tlb_nslabs * sizeof(int));
for (i = 0; i < io_tlb_nslabs; i++)
- io_tlb_list[i] = io_tlb_nslabs - i;
+ io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
io_tlb_index = 0;
io_tlb_orig_addr = alloc_bootmem(io_tlb_nslabs * sizeof(char *));
@@ -120,7 +135,7 @@
if (size > (1 << PAGE_SHIFT))
stride = (1 << (PAGE_SHIFT - IO_TLB_SHIFT));
else
- stride = nslots;
+ stride = 1;
if (!nslots)
BUG();
@@ -147,7 +162,8 @@
for (i = index; i < index + nslots; i++)
io_tlb_list[i] = 0;
- for (i = index - 1; (i >= 0) && io_tlb_list[i]; i--)
+ for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE -1)
+ && io_tlb_list[i]; i--)
io_tlb_list[i] = ++count;
dma_addr = io_tlb_start + (index << IO_TLB_SHIFT);
@@ -213,7 +229,8 @@
*/
spin_lock_irqsave(&io_tlb_lock, flags);
{
- int count = ((index + nslots) < io_tlb_nslabs ? io_tlb_list[index + nslots] : 0);
+ int count = ((index + nslots) < ALIGN(index + 1, IO_TLB_SEGSIZE) ?
+ io_tlb_list[index + nslots] : 0);
/*
* Step 1: return the slots to the free list, merging the slots with
* superceeding slots
@@ -224,7 +241,8 @@
* Step 2: merge the returned slots with the preceeding slots, if
* available (non zero)
*/
- for (i = index - 1; (i >= 0) && io_tlb_list[i]; i--)
+ for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE -1) &&
+ io_tlb_list[i]; i--)
io_tlb_list[i] = ++count;
}
spin_unlock_irqrestore(&io_tlb_lock, flags);
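For reference, the segment arithmetic the patch relies on can be sketched
outside the kernel (a standalone illustration using the patch's constants;
the helpers mirror the OFFSET macro, which assumes a power-of-two align):

```c
#define IO_TLB_SEGSIZE 128	/* as in the patch */

/* Mirrors OFFSET(val, align); valid only for power-of-two align. */
static unsigned long seg_offset(unsigned long val, unsigned long align)
{
	return val & (align - 1);
}

/* Initial free count stored for slot i: the distance to the end of its
 * segment, so a forward scan never has to look past IO_TLB_SEGSIZE slots. */
static int init_count(int i)
{
	return IO_TLB_SEGSIZE - seg_offset(i, IO_TLB_SEGSIZE);
}

/* The backward-merge loops stop when slot i is the last slot of the
 * previous segment, i.e. OFFSET(i, SEGSIZE) == SEGSIZE - 1, bounding
 * the coalescing work to one segment. */
static int at_segment_boundary(int i)
{
	return seg_offset(i, IO_TLB_SEGSIZE) == IO_TLB_SEGSIZE - 1;
}
```

So slot 0 starts with a free count of 128, slot 127 with 1, and slot 128
(the first slot of the next segment) with 128 again; merging backward from
slot 128 stops immediately because slot 127 is a segment boundary.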
* Re: [Linux-ia64] PATCH: performance problems with swiotlb.c
2001-12-03 20:12 [Linux-ia64] PATCH: performance problems with swiotlb.c Luck, Tony
@ 2001-12-03 21:55 ` David Mosberger
0 siblings, 0 replies; 2+ messages in thread
From: David Mosberger @ 2001-12-03 21:55 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 3 Dec 2001 12:12:07 -0800, "Luck, Tony" <tony.luck@intel.com> said:
Tony> The basis of the fix is to partition the space reserved for
Tony> bounce buffers into smaller segments so that we place an upper
Tony> bound on the amount of work needed to coalesce blocks.
Looks good to me. I applied the patch.
Tony> In addition to the performace boost, this patch also fixes one
Tony> real bug that Dori found while testing. map_single() would
Tony> pick a "stride" based on the number of slots needed for the
Tony> request ... but if this stride is not a power of two, the "do
Tony> { ... } while (index != wrap);" loop can spin indefinitely. He
Tony> changed that to use a stride of 1 because he couldn't see the
Tony> benefit of the larger stride ... nor can I ... e.g. when
Tony> looking for 5 slots you might look at an allocation map that
Tony> looks like this:
Tony>     3 <- look here, 3<5 so skip down 5 slots
Tony>     2
Tony>     1
Tony>     0
Tony>     5
Tony>     4 <- now look here, missing the large enough block that began
Tony>        on the previous slot.
I'm not sure either. I assume the stride was intended to reduce
searching overheads. Perhaps Asit or Goutham would know?
Thanks,
--david