[PATCH] dm-thin: optimize power of two block size

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] dm-thin: optimize power of two block size
@ 2012-06-18 14:09 Mikulas Patocka
  2012-06-18 16:35 ` Joe Thornber
  0 siblings, 1 reply; 4+ messages in thread
From: Mikulas Patocka @ 2012-06-18 14:09 UTC (permalink / raw)
  To: Mike Snitzer, Edward Thornber, Alasdair G. Kergon; +Cc: dm-devel

Hi

This patch should be applied after 
dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes 
power-of-two blocksize.

Mikulas

---

dm-thin: optimize power of two block size

dm-thin will be most likely used with a block size that is a power of
two. So it should be optimized for this case.

This patch changes division and modulo operations to shifts and bit
masks if block size is a power of two.

A test that bi_sector is divisible by a block size is removed from
io_overlaps_block. Device mapper never sends bios that span block
boundary. Consequently, if we tested that bi_size is equivalent to block
size, bi_sector must already be on a block boundary.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm-thin.c |   27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

Index: linux-3.4.2-fast/drivers/md/dm-thin.c
===================================================================
--- linux-3.4.2-fast.orig/drivers/md/dm-thin.c	2012-06-18 15:38:53.000000000 +0200
+++ linux-3.4.2-fast/drivers/md/dm-thin.c	2012-06-18 16:06:15.000000000 +0200
@@ -512,6 +512,7 @@ struct pool {
 
 	dm_block_t low_water_blocks;
 	uint32_t sectors_per_block;
+	int sectors_per_block_shift;
 
 	struct pool_features pf;
 	unsigned low_water_triggered:1;	/* A dm event has been sent */
@@ -678,7 +679,10 @@ static dm_block_t get_bio_block(struct t
 {
 	sector_t block_nr = bio->bi_sector;
 
-	(void) sector_div(block_nr, tc->pool->sectors_per_block);
+	if (tc->pool->sectors_per_block_shift < 0)
+		(void) sector_div(block_nr, tc->pool->sectors_per_block);
+	else
+		block_nr >>= tc->pool->sectors_per_block_shift;
 
 	return block_nr;
 }
@@ -689,8 +693,12 @@ static void remap(struct thin_c *tc, str
 	sector_t bi_sector = bio->bi_sector;
 
 	bio->bi_bdev = tc->pool_dev->bdev;
-	bio->bi_sector = (block * pool->sectors_per_block) +
-			 sector_div(bi_sector, pool->sectors_per_block);
+	if (tc->pool->sectors_per_block_shift < 0)
+		bio->bi_sector = (block * pool->sectors_per_block) +
+				 sector_div(bi_sector, pool->sectors_per_block);
+	else
+		bio->bi_sector = (block << pool->sectors_per_block_shift) |
+				(bi_sector & (pool->sectors_per_block - 1));
 }
 
 static void remap_to_origin(struct thin_c *tc, struct bio *bio)
@@ -935,10 +943,7 @@ static void process_prepared(struct pool
  */
 static int io_overlaps_block(struct pool *pool, struct bio *bio)
 {
-	sector_t bi_sector = bio->bi_sector;
-
-	return !sector_div(bi_sector, pool->sectors_per_block) &&
-		(bio->bi_size == (pool->sectors_per_block << SECTOR_SHIFT));
+	return bio->bi_size == (pool->sectors_per_block << SECTOR_SHIFT);
 }
 
 static int io_overwrites_block(struct pool *pool, struct bio *bio)
@@ -1241,7 +1246,9 @@ static void process_discard(struct thin_
 			 * part of the discard that is in a subsequent
 			 * block.
 			 */
-			sector_t offset = bio->bi_sector - (block * pool->sectors_per_block);
+			sector_t offset = pool->sectors_per_block_shift >= 0 ?
+			      bio->bi_sector & (pool->sectors_per_block - 1) :
+			      bio->bi_sector - block * pool->sectors_per_block;
 			unsigned remaining = (pool->sectors_per_block - offset) << SECTOR_SHIFT;
 			bio->bi_size = min(bio->bi_size, remaining);
 
@@ -1718,6 +1725,10 @@ static struct pool *pool_create(struct m
 
 	pool->pmd = pmd;
 	pool->sectors_per_block = block_size;
+	if (block_size & (block_size - 1))
+		pool->sectors_per_block_shift = -1;
+	else
+		pool->sectors_per_block_shift = __ffs(block_size);
 	pool->low_water_blocks = 0;
 	pool_features_init(&pool->pf);
 	pool->prison = prison_create(PRISON_CELLS);

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] dm-thin: optimize power of two block size
  2012-06-18 14:09 [PATCH] dm-thin: optimize power of two block size Mikulas Patocka
@ 2012-06-18 16:35 ` Joe Thornber
  2012-06-25  1:53   ` Mikulas Patocka
  0 siblings, 1 reply; 4+ messages in thread
From: Joe Thornber @ 2012-06-18 16:35 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Mike Snitzer, dm-devel, Alasdair G. Kergon

On Mon, Jun 18, 2012 at 10:09:56AM -0400, Mikulas Patocka wrote:
> Hi
> 
> This patch should be applied after 
> dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes 
> power-of-two blocksize.

I'm going to nack this unless you can provide a benchmark that shows
it measurably improves performance for some architecture somewhere.
And a real benchmark, with io going through all the devices, not just
a micro benchmark of the 'if' in a tight loop.

- Joe

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] dm-thin: optimize power of two block size
  2012-06-18 16:35 ` Joe Thornber
@ 2012-06-25  1:53   ` Mikulas Patocka
  2012-06-25 14:09     ` Joe Thornber
  0 siblings, 1 reply; 4+ messages in thread
From: Mikulas Patocka @ 2012-06-25  1:53 UTC (permalink / raw)
  To: Joe Thornber; +Cc: Mike Snitzer, dm-devel, Alasdair G. Kergon

On Mon, 18 Jun 2012, Joe Thornber wrote:

> On Mon, Jun 18, 2012 at 10:09:56AM -0400, Mikulas Patocka wrote:
> > Hi
> > 
> > This patch should be applied after 
> > dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes 
> > power-of-two blocksize.
> 
> I'm going to nack this unless you can provide a benchmark that shows
> it measurably improves performance for some architecture somewhere.
> And a real benchmark, with io going through all the devices, not just
> a micro benchmark of the 'if' in a tight loop.
> 
> - Joe

Hi

Here are some tests ran on the collection of my computers.

This is a do_div benchmark, the source is here:
http://people.redhat.com/~mpatocka/testcases/do_div_benchmark.c
For the "bignum" test, I replaced 0x12345678 with 0xff12345678LL (so that 
do_div divides real 64-bit numbers).

It is especially slow on PA-RISC and Alpha because they don't have a 
divide instruction.

PA-RISC 900MHz 64-bit:
shift+mask:		4 ticks		(4.4ns)
shift+mask bignum:	4 ticks		(4.4ns)
do_div:			825 ticks	(917ns)
do_div bignum:		825 ticks	(917ns)

UltraSparc2 440MHz 64-bit:
shift+mask:		3 ticks		(6.8ns)
shift+mask bignum:	3 ticks		(6.8ns)
do_div:			87 ticks	(198ns)
do_div bignum:		93 ticks	(211ns)

Alpha ev45 233MHz 64-bit:
shift+mask:		7 ticks		(30ns)
shift+mask bignum:	8 ticks		(34ns)
do_div:			598 ticks	(2563ns)
do_div bignum:		897 ticks	(3844ns)

Pentium 3 850MHz:
shift+mask:		12.25 ticks	(14ns)
shift+mask bignum:	16 ticks	(19ns)
do_div:			63.5 ticks	(75ns)
do_div bignum:		94 ticks	(111ns)

Core2 Xeon 1600MHz 64-bit:
shift+mask:		3.2 ticks	(2ns)
shift+mask bignum:	3.4 ticks	(2.1ns)
do_div:			64 ticks	(40ns)
do_div bignum:		64 ticks	(40ns)

K10 Opteron 2300MHz 64-bit:
shift+mask:		3 ticks		(1.3ns)
shift+mask bignum:	3 ticks		(1.3ns)
do_div:			46 ticks	(20ns)
do_div bignum:		57 ticks	(28ns)

---

On that PA-RISC machine, I set up dm-stripe target consisting of two 
stripes on a ramdisk, with 4k stripe size. I performed
dd if=/dev/mapper/stripe of=/dev/null bs=512 count=100000 iflag=direct
With the optimization patches: 38.2-38.5 MB/s
Without the optimization patches: 35.3-35.6 MB/s

With larger io size:
dd if=/dev/mapper/stripe of=/dev/null bs=1M count=200 iflag=direct
With the optimization patches: 269-272 MB/s
Without the optimization patches: 250-253 MB/s

Tests with dm-thin on PA-RISC:
A device with 512MB pool and 512MB metadata on ramdisks, 64k chunk.

Overwrite the first time with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 91.0-91.4
With the optimization patches: 90.6-91.6

Subsequent overwrite with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 104 MB/s
With the optimization patches: 104 MB/s

Read the overwritten device with
dd if=/dev/mapper/thin of=/dev/null bs=1M iflag=direct
Without the optimization patches: 252-254 MB/s
With the optimization patches: 257-258 MB/s

So the conclusion is that is that that divide instruction degrades 
transfer speed, especially on dm-stripe with 4k stripe size (on dm-thin it 
is measurable only with raw read, the difference is smaller because it has 
a minimum chunk size 64k).

The question is why do you want to avoid such optimization? If it is 
because of source code clarity, we can create #define sector_div_optimized 
that optimizes the common case of power-of-two divisor and the code would 
be no more complicated than with sector div. Or do you have some other 
reasons?

BTW. when unloading the dm-thin device with debugging enabled (the tests 
were done with debugging disabled), I got this message:
device-mapper: space map checker: free block counts differ, checker 
131060, sm-disk:130991
--- so there is supposedly some bug? The kernel is 3.4.3.

Mikulas

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] dm-thin: optimize power of two block size
  2012-06-25  1:53   ` Mikulas Patocka
@ 2012-06-25 14:09     ` Joe Thornber
  0 siblings, 0 replies; 4+ messages in thread
From: Joe Thornber @ 2012-06-25 14:09 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Mike Snitzer, dm-devel, Alasdair G. Kergon

On Sun, Jun 24, 2012 at 09:53:22PM -0400, Mikulas Patocka wrote:
> So the conclusion is that is that that divide instruction degrades 
> transfer speed, especially on dm-stripe with 4k stripe size (on dm-thin it 
> is measurable only with raw read, the difference is smaller because it has 
> a minimum chunk size 64k).
> 
> 
> The question is why do you want to avoid such optimization?

You've conviced me.  I just wanted proof, which you've done very
nicely.  Thankyou.

> BTW. when unloading the dm-thin device with debugging enabled (the tests 
> were done with debugging disabled), I got this message:
> device-mapper: space map checker: free block counts differ, checker 
> 131060, sm-disk:130991
> --- so there is supposedly some bug? The kernel is 3.4.3.

That message is ok.  I'm going to remove the sm-checker in 3.6.  It's
not earning it's keep.

- Joe

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-06-25 14:09 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-06-18 14:09 [PATCH] dm-thin: optimize power of two block size Mikulas Patocka
2012-06-18 16:35 ` Joe Thornber
2012-06-25  1:53   ` Mikulas Patocka
2012-06-25 14:09     ` Joe Thornber

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.