This problem was found and this fix suggested by Dori Eldar here at Intel (I just critiqued it for a while and pointed out some corner cases that needed to be addressed).

There are performance problems with the current swiotlb.c bounce buffer allocation code. Users with large systems full of devices that require bounce buffers can sometimes find that they need to increase the number of bounce buffers available, using the swiotlb boot-time option, to avoid panicking when running out of buffers. However, this can result in slow allocation/free of buffers, as the swiotlb code spends a lot of CPU time coalescing blocks. On one benchmark this fix raised ethernet throughput from around 40 Mb/s to 95 Mb/s while reducing CPU load from 100% to 20%.

The basis of the fix is to partition the space reserved for bounce buffers into smaller segments, so that we place an upper bound on the amount of work needed to coalesce blocks.

In addition to the performance boost, this patch also fixes one real bug that Dori found while testing. map_single() would pick a "stride" based on the number of slots needed for the request ... but if this stride is not a power of two, the "do { ... } while (index != wrap);" loop can spin indefinitely. He changed that to use a stride of 1 because he couldn't see the benefit of the larger stride ... nor can I. E.g. when looking for 5 slots you might look at an allocation map that looks like this:

	3 <- look here, 3 < 5 so skip down 5 slots
	2
	1
	0
	5
	4 <- now look here, missing the large-enough block that
	     began on the previous slot

-Tony Luck