This problem was found and this fix suggested by Dori Eldar here at Intel (I just critiqued it for a while and pointed out some corner cases that needed to be addressed).

There are performance problems with the current swiotlb.c bounce buffer allocation code. Users with large systems full of devices that require bounce buffers can sometimes find that they need to increase the number of bounce buffers available, using the swiotlb boot-time option, to avoid panicking when running out of buffers. However, this can result in slow allocation/free of buffers, as the swiotlb code spends a lot of CPU time coalescing blocks. On one benchmark this fix raised ethernet throughput from around 40 Mb/s to 95 Mb/s while reducing CPU load from 100% to 20%.

The basis of the fix is to partition the space reserved for bounce buffers into smaller segments, so that we place an upper bound on the amount of work needed to coalesce blocks.

In addition to the performance boost, this patch also fixes one real bug that Dori found while testing. map_single() would pick a "stride" based on the number of slots needed for the request ... but if this stride is not a power of two, the "do { ... } while (index != wrap);" loop can spin indefinitely. He changed that to use a stride of 1 because he couldn't see the benefit of the larger stride ... nor can I. E.g. when looking for 5 slots you might look at an allocation map that looks like this:

	3 <- look here, 3 < 5 so skip down 5 slots
	2
	1
	0
	5
	4 <- now look here, missing the large-enough block that
	     began on the previous slot

-Tony Luck