* [PATCH] dm-thin: optimize power of two block size
@ 2012-06-18 14:09 Mikulas Patocka
2012-06-18 16:35 ` Joe Thornber
0 siblings, 1 reply; 4+ messages in thread
From: Mikulas Patocka @ 2012-06-18 14:09 UTC
To: Mike Snitzer, Edward Thornber, Alasdair G. Kergon; +Cc: dm-devel
Hi
This patch should be applied after
dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes
power-of-two blocksize.
Mikulas
---
dm-thin: optimize power of two block size
dm-thin will most likely be used with a block size that is a power of
two, so it should be optimized for this case.
This patch changes division and modulo operations to shifts and bit
masks when the block size is a power of two.
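As an illustration of the transformation described above, here is a minimal
userspace sketch, not the kernel code. `struct pool_geom` and `remap_sector`
are hypothetical names introduced for this example; only the two field names
mirror the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the two pool fields used by the patch. */
struct pool_geom {
	uint32_t sectors_per_block;
	int sectors_per_block_shift;	/* -1 if sectors_per_block is not a power of two */
};

/* Map a sector of the thin device to a sector inside data block `data_block`. */
static uint64_t remap_sector(const struct pool_geom *g, uint64_t data_block,
			     uint64_t sector)
{
	assert(g->sectors_per_block != 0);

	if (g->sectors_per_block_shift < 0)
		/* General case: multiply, plus the remainder of a division. */
		return data_block * g->sectors_per_block +
		       sector % g->sectors_per_block;

	/* Power-of-two case: shift, plus the low bits selected by a mask. */
	return (data_block << g->sectors_per_block_shift) |
	       (sector & (g->sectors_per_block - 1));
}
```

With 128 sectors per block (shift 7), sector 1000 mapped to data block 5 lands
on sector 744 on either path.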
The test that bi_sector is divisible by the block size is removed from
io_overlaps_block. Device mapper never sends bios that span a block
boundary. Consequently, if bi_size is equal to the block size, bi_sector
must already be on a block boundary.
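The alignment argument can be checked by brute force. The sketch below is a
userspace illustration only; `within_one_block` and
`full_size_implies_aligned` are hypothetical helpers, and sizes are counted in
sectors rather than bytes for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* 1 if the range [start, start + len) lies within a single block of `bs` sectors. */
static int within_one_block(uint64_t start, uint64_t len, uint64_t bs)
{
	assert(bs != 0);
	return len > 0 && start / bs == (start + len - 1) / bs;
}

/*
 * 1 if, for every start below `max_start`, a range that is exactly one
 * block long and does not span a boundary also starts on a boundary.
 */
static int full_size_implies_aligned(uint64_t bs, uint64_t max_start)
{
	for (uint64_t start = 0; start < max_start; start++)
		if (within_one_block(start, bs, bs) && start % bs != 0)
			return 0;
	return 1;
}
```

The check holds for power-of-two and non-power-of-two block sizes alike,
because a block-sized range that starts off a boundary always crosses the
next one.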
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/md/dm-thin.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
Index: linux-3.4.2-fast/drivers/md/dm-thin.c
===================================================================
--- linux-3.4.2-fast.orig/drivers/md/dm-thin.c 2012-06-18 15:38:53.000000000 +0200
+++ linux-3.4.2-fast/drivers/md/dm-thin.c 2012-06-18 16:06:15.000000000 +0200
@@ -512,6 +512,7 @@ struct pool {
 	dm_block_t low_water_blocks;
 	uint32_t sectors_per_block;
+	int sectors_per_block_shift;
 	struct pool_features pf;
 	unsigned low_water_triggered:1;	/* A dm event has been sent */
@@ -678,7 +679,10 @@ static dm_block_t get_bio_block(struct t
 {
 	sector_t block_nr = bio->bi_sector;
-	(void) sector_div(block_nr, tc->pool->sectors_per_block);
+	if (tc->pool->sectors_per_block_shift < 0)
+		(void) sector_div(block_nr, tc->pool->sectors_per_block);
+	else
+		block_nr >>= tc->pool->sectors_per_block_shift;
 	return block_nr;
 }
@@ -689,8 +693,12 @@ static void remap(struct thin_c *tc, str
 	sector_t bi_sector = bio->bi_sector;
 	bio->bi_bdev = tc->pool_dev->bdev;
-	bio->bi_sector = (block * pool->sectors_per_block) +
-		sector_div(bi_sector, pool->sectors_per_block);
+	if (tc->pool->sectors_per_block_shift < 0)
+		bio->bi_sector = (block * pool->sectors_per_block) +
+			sector_div(bi_sector, pool->sectors_per_block);
+	else
+		bio->bi_sector = (block << pool->sectors_per_block_shift) |
+			(bi_sector & (pool->sectors_per_block - 1));
 }
 static void remap_to_origin(struct thin_c *tc, struct bio *bio)
@@ -935,10 +943,7 @@ static void process_prepared(struct pool
  */
 static int io_overlaps_block(struct pool *pool, struct bio *bio)
 {
-	sector_t bi_sector = bio->bi_sector;
-
-	return !sector_div(bi_sector, pool->sectors_per_block) &&
-		(bio->bi_size == (pool->sectors_per_block << SECTOR_SHIFT));
+	return bio->bi_size == (pool->sectors_per_block << SECTOR_SHIFT);
 }
 static int io_overwrites_block(struct pool *pool, struct bio *bio)
@@ -1241,7 +1246,9 @@ static void process_discard(struct thin_
 			 * part of the discard that is in a subsequent
 			 * block.
 			 */
-			sector_t offset = bio->bi_sector - (block * pool->sectors_per_block);
+			sector_t offset = pool->sectors_per_block_shift >= 0 ?
+				bio->bi_sector & (pool->sectors_per_block - 1) :
+				bio->bi_sector - block * pool->sectors_per_block;
 			unsigned remaining = (pool->sectors_per_block - offset) << SECTOR_SHIFT;
 			bio->bi_size = min(bio->bi_size, remaining);
@@ -1718,6 +1725,10 @@ static struct pool *pool_create(struct m
 	pool->pmd = pmd;
 	pool->sectors_per_block = block_size;
+	if (block_size & (block_size - 1))
+		pool->sectors_per_block_shift = -1;
+	else
+		pool->sectors_per_block_shift = __ffs(block_size);
 	pool->low_water_blocks = 0;
 	pool_features_init(&pool->pf);
 	pool->prison = prison_create(PRISON_CELLS);
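The shift computed in the pool_create hunk can be sketched in portable
userspace C as follows. This is an illustration under assumptions:
`block_size_shift` is a hypothetical name, and a simple loop stands in for the
kernel's `__ffs()`, which is not available outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return log2(block_size) if block_size is a power of two,
 * or -1 otherwise (the value the patch uses to select the division path).
 */
static int block_size_shift(uint32_t block_size)
{
	int shift = 0;

	assert(block_size != 0);

	/* A power of two has exactly one bit set, so x & (x - 1) is zero. */
	if (block_size & (block_size - 1))
		return -1;

	/* Count trailing zero bits; this is the index of the single set bit. */
	while (!(block_size & 1)) {
		block_size >>= 1;
		shift++;
	}
	return shift;
}
```

For example, a block size of 128 sectors yields a shift of 7, while 384
sectors yields -1.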
* Re: [PATCH] dm-thin: optimize power of two block size
2012-06-18 14:09 [PATCH] dm-thin: optimize power of two block size Mikulas Patocka
@ 2012-06-18 16:35 ` Joe Thornber
2012-06-25 1:53 ` Mikulas Patocka
0 siblings, 1 reply; 4+ messages in thread
From: Joe Thornber @ 2012-06-18 16:35 UTC
To: Mikulas Patocka; +Cc: Mike Snitzer, dm-devel, Alasdair G. Kergon
On Mon, Jun 18, 2012 at 10:09:56AM -0400, Mikulas Patocka wrote:
> Hi
>
> This patch should be applied after
> dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes
> power-of-two blocksize.
I'm going to nack this unless you can provide a benchmark that shows
it measurably improves performance for some architecture somewhere.
And a real benchmark, with io going through all the devices, not just
a micro benchmark of the 'if' in a tight loop.
- Joe
* Re: [PATCH] dm-thin: optimize power of two block size
2012-06-18 16:35 ` Joe Thornber
@ 2012-06-25 1:53 ` Mikulas Patocka
2012-06-25 14:09 ` Joe Thornber
0 siblings, 1 reply; 4+ messages in thread
From: Mikulas Patocka @ 2012-06-25 1:53 UTC
To: Joe Thornber; +Cc: Mike Snitzer, dm-devel, Alasdair G. Kergon
On Mon, 18 Jun 2012, Joe Thornber wrote:
> On Mon, Jun 18, 2012 at 10:09:56AM -0400, Mikulas Patocka wrote:
> > Hi
> >
> > This patch should be applied after
> > dm-thin-support-for-non-power-of-2-pool-blocksize.patch. It optimizes
> > power-of-two blocksize.
>
> I'm going to nack this unless you can provide a benchmark that shows
> it measurably improves performance for some architecture somewhere.
> And a real benchmark, with io going through all the devices, not just
> a micro benchmark of the 'if' in a tight loop.
>
> - Joe
Hi
Here are some tests run on my collection of computers.
This is a do_div benchmark, the source is here:
http://people.redhat.com/~mpatocka/testcases/do_div_benchmark.c
For the "bignum" test, I replaced 0x12345678 with 0xff12345678LL (so that
do_div divides real 64-bit numbers).
do_div is especially slow on PA-RISC and Alpha because they don't have a
divide instruction.
PA-RISC 900MHz 64-bit:
shift+mask: 4 ticks (4.4ns)
shift+mask bignum: 4 ticks (4.4ns)
do_div: 825 ticks (917ns)
do_div bignum: 825 ticks (917ns)
UltraSparc2 440MHz 64-bit:
shift+mask: 3 ticks (6.8ns)
shift+mask bignum: 3 ticks (6.8ns)
do_div: 87 ticks (198ns)
do_div bignum: 93 ticks (211ns)
Alpha ev45 233MHz 64-bit:
shift+mask: 7 ticks (30ns)
shift+mask bignum: 8 ticks (34ns)
do_div: 598 ticks (2563ns)
do_div bignum: 897 ticks (3844ns)
Pentium 3 850MHz:
shift+mask: 12.25 ticks (14ns)
shift+mask bignum: 16 ticks (19ns)
do_div: 63.5 ticks (75ns)
do_div bignum: 94 ticks (111ns)
Core2 Xeon 1600MHz 64-bit:
shift+mask: 3.2 ticks (2ns)
shift+mask bignum: 3.4 ticks (2.1ns)
do_div: 64 ticks (40ns)
do_div bignum: 64 ticks (40ns)
K10 Opteron 2300MHz 64-bit:
shift+mask: 3 ticks (1.3ns)
shift+mask bignum: 3 ticks (1.3ns)
do_div: 46 ticks (20ns)
do_div bignum: 57 ticks (28ns)
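The nanosecond figures in the list above follow from the tick counts and the
clock frequency. A minimal sketch of that arithmetic, with hypothetical helper
names `ticks_to_ns` and `slowdown` (this is not the linked benchmark source):

```c
#include <assert.h>

/* Convert CPU clock ticks to nanoseconds for a clock running at `mhz` MHz. */
static double ticks_to_ns(double ticks, double mhz)
{
	assert(mhz > 0.0);
	return ticks * 1000.0 / mhz;
}

/* How many times slower the division path is than the shift+mask path. */
static double slowdown(double div_ticks, double shift_ticks)
{
	assert(shift_ticks > 0.0);
	return div_ticks / shift_ticks;
}
```

For the PA-RISC row, 825 ticks at 900 MHz is about 917 ns, and 825 / 4 gives
a ratio of roughly 206 between do_div and shift+mask.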
---
On that PA-RISC machine, I set up a dm-stripe target consisting of two
stripes on a ramdisk, with a 4k stripe size. I performed
dd if=/dev/mapper/stripe of=/dev/null bs=512 count=100000 iflag=direct
With the optimization patches: 38.2-38.5 MB/s
Without the optimization patches: 35.3-35.6 MB/s
With larger io size:
dd if=/dev/mapper/stripe of=/dev/null bs=1M count=200 iflag=direct
With the optimization patches: 269-272 MB/s
Without the optimization patches: 250-253 MB/s
Tests with dm-thin on PA-RISC:
A device with 512MB pool and 512MB metadata on ramdisks, 64k chunk.
Overwrite the first time with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 91.0-91.4 MB/s
With the optimization patches: 90.6-91.6 MB/s
Subsequent overwrite with
dd if=/dev/zero of=/dev/mapper/thin bs=1M oflag=direct
Without the optimization patches: 104 MB/s
With the optimization patches: 104 MB/s
Read the overwritten device with
dd if=/dev/mapper/thin of=/dev/null bs=1M iflag=direct
Without the optimization patches: 252-254 MB/s
With the optimization patches: 257-258 MB/s
So the conclusion is that the divide instruction degrades transfer
speed, especially on dm-stripe with a 4k stripe size (on dm-thin it is
measurable only with raw reads; the difference is smaller because dm-thin
has a minimum chunk size of 64k).
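To put the throughput numbers in relative terms, the change can be expressed
as a percentage. The helper below is a hypothetical illustration, applied here
to the low end of each reported range:

```c
#include <assert.h>

/* Relative throughput change in percent, given MB/s without and with the patches. */
static double percent_gain(double without_mbs, double with_mbs)
{
	assert(without_mbs > 0.0);
	return (with_mbs - without_mbs) / without_mbs * 100.0;
}
```

For the dm-stripe test with 512-byte reads, going from 35.3 MB/s to 38.2 MB/s
is a gain of about 8.2%. For the dm-thin read test, going from 252 MB/s to
257 MB/s is a gain of about 2%.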
The question is, why do you want to avoid such an optimization? If it is
because of source code clarity, we can create a #define sector_div_optimized
that optimizes the common case of a power-of-two divisor, and the code would
be no more complicated than with sector_div. Or do you have some other
reasons?
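One possible shape for such a helper is sketched below in userspace C. The
name `sector_div_optimized` comes from the proposal above, but the signature
and the use of an inline function instead of a macro are assumptions made for
this example. Like sector_div, it divides the sector in place and returns the
remainder.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Divide *sector by `div` in place and return the remainder.
 * `shift` is log2(div) when div is a power of two, or -1 otherwise.
 */
static inline uint32_t sector_div_optimized(uint64_t *sector, uint32_t div,
					    int shift)
{
	uint32_t rem;

	assert(div != 0);

	if (shift >= 0) {
		/* Common case: power-of-two divisor, no division needed. */
		rem = (uint32_t)(*sector & (div - 1));
		*sector >>= shift;
	} else {
		/* Fallback: a real 64-bit division. */
		rem = (uint32_t)(*sector % div);
		*sector /= div;
	}
	return rem;
}
```

A caller would precompute the shift once, when the device is created, and pass
it on every I/O so that the branch is the only per-request overhead.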
BTW. when unloading the dm-thin device with debugging enabled (the tests
were done with debugging disabled), I got this message:
device-mapper: space map checker: free block counts differ, checker
131060, sm-disk:130991
--- so there is supposedly some bug? The kernel is 3.4.3.
Mikulas
* Re: [PATCH] dm-thin: optimize power of two block size
2012-06-25 1:53 ` Mikulas Patocka
@ 2012-06-25 14:09 ` Joe Thornber
0 siblings, 0 replies; 4+ messages in thread
From: Joe Thornber @ 2012-06-25 14:09 UTC
To: Mikulas Patocka; +Cc: Mike Snitzer, dm-devel, Alasdair G. Kergon
On Sun, Jun 24, 2012 at 09:53:22PM -0400, Mikulas Patocka wrote:
> So the conclusion is that the divide instruction degrades
> transfer speed, especially on dm-stripe with 4k stripe size (on dm-thin it
> is measurable only with raw read, the difference is smaller because it has
> a minimum chunk size 64k).
>
>
> The question is why do you want to avoid such optimization?
You've convinced me. I just wanted proof, which you've done very
nicely. Thank you.
> BTW. when unloading the dm-thin device with debugging enabled (the tests
> were done with debugging disabled), I got this message:
> device-mapper: space map checker: free block counts differ, checker
> 131060, sm-disk:130991
> --- so there is supposedly some bug? The kernel is 3.4.3.
That message is ok. I'm going to remove the sm-checker in 3.6. It's
not earning its keep.
- Joe