* [RFC][PATCH] dm: improve read performance @ 2010-12-27 11:19 Mustafa Mesanovic 2010-12-27 11:54 ` Neil Brown 0 siblings, 1 reply; 15+ messages in thread From: Mustafa Mesanovic @ 2010-12-27 11:19 UTC (permalink / raw) To: dm-devel, akpm, snitzer, neilb Cc: linux-kernel, heiko.carstens, cotte, ehrhardt From: Mustafa Mesanovic <mume@linux.vnet.ibm.com> A short explanation in prior: in this case we have "stacked" dm devices. Two multipathed luns combined together to one striped logical volume. I/O throughput degradation happens at __bio_add_page when bio's get checked upon max_sectors. In this setup max_sectors is always set to 8 -> what is 4KiB. A standalone striped logical volume on luns which are not multipathed do not have the problem: the logical volume will take over the max_sectors from luns below. Same happens with luns which are multipathed -> the multipathed targets have the same max_sectors as the luns below. So "magic" happens only when target has no own merge_fn and below lying devices have a merge function -> we got then max_sectors=PAGE_SIZE >> 9. This patch prevents that max_sectors will be set to PAGE_SIZE >> 9. Instead it will use the minimum max_sectors value from below devices. Using the patch improves read I/O up to 3x. In this specific case from 600MiB/s up to 1800MiB/s. Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com> --- dm-table.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6/drivers/md/dm-table.c =================================================================== --- linux-2.6.orig/drivers/md/dm-table.c 2010-12-23 13:49:18.000000000 +0100 +++ linux-2.6/drivers/md/dm-table.c 2010-12-23 13:50:22.000000000 +0100 @@ -518,7 +518,7 @@ if (q->merge_bvec_fn && !ti->type->merge) blk_limits_max_hw_sectors(limits, - (unsigned int) (PAGE_SIZE >> 9)); + q->limits.max_sectors); return 0; } EXPORT_SYMBOL_GPL(dm_set_device_limits); ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH] dm: improve read performance 2010-12-27 11:19 [RFC][PATCH] dm: improve read performance Mustafa Mesanovic @ 2010-12-27 11:54 ` Neil Brown 2010-12-27 12:23 ` Mustafa Mesanovic 0 siblings, 1 reply; 15+ messages in thread From: Neil Brown @ 2010-12-27 11:54 UTC (permalink / raw) To: Mustafa Mesanovic Cc: dm-devel, akpm, snitzer, linux-kernel, heiko.carstens, cotte, ehrhardt On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic <mume@linux.vnet.ibm.com> wrote: > From: Mustafa Mesanovic <mume@linux.vnet.ibm.com> > > A short explanation in prior: in this case we have "stacked" dm devices. > Two multipathed luns combined together to one striped logical volume. > > I/O throughput degradation happens at __bio_add_page when bio's get checked > upon max_sectors. In this setup max_sectors is always set to 8 -> what is > 4KiB. > A standalone striped logical volume on luns which are not multipathed do not > have the problem: the logical volume will take over the max_sectors from luns > below. > > Same happens with luns which are multipathed -> the multipathed targets have > the same max_sectors as the luns below. > > So "magic" happens only when target has no own merge_fn and below lying > devices > have a merge function -> we got then max_sectors=PAGE_SIZE >> 9. > This patch prevents that max_sectors will be set to PAGE_SIZE >> 9. > Instead it will use the minimum max_sectors value from below devices. > > Using the patch improves read I/O up to 3x. In this specific case from 600MiB/s > up to 1800MiB/s. and using this patch will cause IO to fail sometimes. If an IO request which is larger than a page crosses a device boundary in the underlying e.g. RAID0, the RAID0 will return an error as such things should not happen - they are prevented by merge_bvec_fn. If merge_bvec_fn is not being honoured, then you MUST limit requests to a single entry iovec of at most one page. NeilBrown > > Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com> > --- > > dm-table.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > Index: linux-2.6/drivers/md/dm-table.c > =================================================================== > --- linux-2.6.orig/drivers/md/dm-table.c 2010-12-23 13:49:18.000000000 +0100 > +++ linux-2.6/drivers/md/dm-table.c 2010-12-23 13:50:22.000000000 +0100 > @@ -518,7 +518,7 @@ > > if (q->merge_bvec_fn && !ti->type->merge) > blk_limits_max_hw_sectors(limits, > - (unsigned int) (PAGE_SIZE >> 9)); > + q->limits.max_sectors); > return 0; > } > EXPORT_SYMBOL_GPL(dm_set_device_limits); ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH] dm: improve read performance 2010-12-27 11:54 ` Neil Brown @ 2010-12-27 12:23 ` Mustafa Mesanovic 2011-03-07 10:10 ` Mustafa Mesanovic 0 siblings, 1 reply; 15+ messages in thread From: Mustafa Mesanovic @ 2010-12-27 12:23 UTC (permalink / raw) To: Neil Brown Cc: dm-devel, akpm, snitzer, linux-kernel, heiko.carstens, cotte, ehrhardt On Mon December 27 2010 12:54:59 Neil Brown wrote: > On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic > > <mume@linux.vnet.ibm.com> wrote: > > From: Mustafa Mesanovic <mume@linux.vnet.ibm.com> > > > > A short explanation in prior: in this case we have "stacked" dm devices. > > Two multipathed luns combined together to one striped logical volume. > > > > I/O throughput degradation happens at __bio_add_page when bio's get > > checked upon max_sectors. In this setup max_sectors is always set to 8 > > -> what is 4KiB. > > A standalone striped logical volume on luns which are not multipathed do > > not have the problem: the logical volume will take over the max_sectors > > from luns below. > > > > Same happens with luns which are multipathed -> the multipathed targets > > have the same max_sectors as the luns below. > > > > So "magic" happens only when target has no own merge_fn and below lying > > devices > > have a merge function -> we got then max_sectors=PAGE_SIZE >> 9. > > This patch prevents that max_sectors will be set to PAGE_SIZE >> 9. > > Instead it will use the minimum max_sectors value from below devices. > > > > Using the patch improves read I/O up to 3x. In this specific case from > > 600MiB/s up to 1800MiB/s. > > and using this patch will cause IO to fail sometimes. > If an IO request which is larger than a page crosses a device boundary in > the underlying e.g. RAID0, the RAID0 will return an error as such things > should not happen - they are prevented by merge_bvec_fn. > > If merge_bvec_fn is not being honoured, then you MUST limit requests to a > single entry iovec of at most one page. > > NeilBrown > Thank you for that hint, I will try to write a merge_bvec_fn for dm-stripe.c which solves the problem, if that is ok? Mustafa Mesanovic > > Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com> > > --- > > > > dm-table.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > Index: linux-2.6/drivers/md/dm-table.c > > =================================================================== > > --- linux-2.6.orig/drivers/md/dm-table.c 2010-12-23 13:49:18.000000000 > > +0100 +++ linux-2.6/drivers/md/dm-table.c 2010-12-23 13:50:22.000000000 > > +0100 @@ -518,7 +518,7 @@ > > > > if (q->merge_bvec_fn && !ti->type->merge) > > > > blk_limits_max_hw_sectors(limits, > > > > - (unsigned int) (PAGE_SIZE >> 9)); > > + q->limits.max_sectors); > > > > return 0; > > > > } > > EXPORT_SYMBOL_GPL(dm_set_device_limits); > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH] dm: improve read performance 2010-12-27 12:23 ` Mustafa Mesanovic @ 2011-03-07 10:10 ` Mustafa Mesanovic 2011-03-08 2:21 ` [PATCH v3] dm stripe: implement merge method Mike Snitzer 2011-03-17 5:12 ` [RFC][PATCH] dm: improve read performance Nikanth Karthikesan 0 siblings, 2 replies; 15+ messages in thread From: Mustafa Mesanovic @ 2011-03-07 10:10 UTC (permalink / raw) To: Neil Brown, akpm, snitzer Cc: dm-devel, linux-kernel, heiko.carstens, cotte, ehrhardt On 12/27/2010 01:23 PM, Mustafa Mesanovic wrote: > On Mon December 27 2010 12:54:59 Neil Brown wrote: >> On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic >> >> <mume@linux.vnet.ibm.com> wrote: >>> From: Mustafa Mesanovic<mume@linux.vnet.ibm.com> >>> >>> A short explanation in prior: in this case we have "stacked" dm devices. >>> Two multipathed luns combined together to one striped logical volume. >>> >>> I/O throughput degradation happens at __bio_add_page when bio's get >>> checked upon max_sectors. In this setup max_sectors is always set to 8 >>> -> what is 4KiB. >>> A standalone striped logical volume on luns which are not multipathed do >>> not have the problem: the logical volume will take over the max_sectors >>> from luns below. [...] >>> Using the patch improves read I/O up to 3x. In this specific case from >>> 600MiB/s up to 1800MiB/s. >> and using this patch will cause IO to fail sometimes. >> If an IO request which is larger than a page crosses a device boundary in >> the underlying e.g. RAID0, the RAID0 will return an error as such things >> should not happen - they are prevented by merge_bvec_fn. >> >> If merge_bvec_fn is not being honoured, then you MUST limit requests to a >> single entry iovec of at most one page. >> >> NeilBrown >> > Thank you for that hint, I will try to write a merge_bvec_fn for dm-stripe.c > which solves the problem, if that is ok? > > Mustafa Mesanovic > Now here my new suggestion to fix this issue, what is your opinion? I tested this with different setups, and it worked fine and I had very good performance improvements. [RFC][PATCH] dm: improve read performance - v2 This patch adds a merge_fn for the dm stripe target. This merge_fn prevents dm_set_device_limits() setting the max_sectors to 4KiB (PAGE_SIZE). (As in a prior patch already mentioned.) Now the read performance improved up to 3x higher compared to before. What happened before: I/O throughput degradation happened at __bio_add_page() when bio's got checked at the very beginning upon max_sectors. In this setup max_sectors is always set to 8. So bio's entered the dm target with a max of 4KiB. Now dm-stripe target will have its own merge_fn so max_sectors will not pushed down to 8 (4KiB), and bio's can get bigger than 4KiB. Signed-off-by: Mustafa Mesanovic<mume@linux.vnet.ibm.com> --- dm-stripe.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) Index: linux-2.6/drivers/md/dm-stripe.c =================================================================== --- linux-2.6.orig/drivers/md/dm-stripe.c 2011-02-28 10:23:37.000000000 +0100 +++ linux-2.6/drivers/md/dm-stripe.c 2011-02-28 10:24:29.000000000 +0100 @@ -396,6 +396,29 @@ blk_limits_io_opt(limits, chunk_size * sc->stripes); } +static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct stripe_c *sc = (struct stripe_c *) ti->private; + sector_t offset, chunk; + uint32_t stripe; + struct request_queue *q; + + offset = bvm->bi_sector - ti->begin; + chunk = offset>> sc->chunk_shift; + stripe = sector_div(chunk, sc->stripes); + + if (!bdev_get_queue(sc->stripe[stripe].dev->bdev)->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = sc->stripe[stripe].dev->bdev; + q = bdev_get_queue(bvm->bi_bdev); + bvm->bi_sector = sc->stripe[stripe].physical_start + + (chunk<< sc->chunk_shift) + (offset& sc->chunk_mask); + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + static struct target_type stripe_target = { .name = "striped", .version = {1, 3, 1}, @@ -403,6 +426,7 @@ .ctr = stripe_ctr, .dtr = stripe_dtr, .map = stripe_map, + .merge = stripe_merge, .end_io = stripe_end_io, .status = stripe_status, .iterate_devices = stripe_iterate_devices, ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3] dm stripe: implement merge method 2011-03-07 10:10 ` Mustafa Mesanovic @ 2011-03-08 2:21 ` Mike Snitzer 2011-03-08 10:29 ` Mustafa Mesanovic 2011-03-16 20:21 ` [PATCH v4] " Mike Snitzer 2011-03-17 5:12 ` [RFC][PATCH] dm: improve read performance Nikanth Karthikesan 1 sibling, 2 replies; 15+ messages in thread From: Mike Snitzer @ 2011-03-08 2:21 UTC (permalink / raw) To: Mustafa Mesanovic Cc: Neil Brown, akpm, cotte, dm-devel, heiko.carstens, linux-kernel, ehrhardt Hello Mustafa, On Mon, Mar 07 2011 at 5:10am -0500, Mustafa Mesanovic <mume@linux.vnet.ibm.com> wrote: > On 12/27/2010 01:23 PM, Mustafa Mesanovic wrote: > >On Mon December 27 2010 12:54:59 Neil Brown wrote: > >>On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic > >> > >><mume@linux.vnet.ibm.com> wrote: > >>>From: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > >>> > >>>A short explanation in prior: in this case we have "stacked" dm devices. > >>>Two multipathed luns combined together to one striped logical volume. > >>> > >>>I/O throughput degradation happens at __bio_add_page when bio's get > >>>checked upon max_sectors. In this setup max_sectors is always set to 8 > >>>-> what is 4KiB. > >>>A standalone striped logical volume on luns which are not multipathed do > >>>not have the problem: the logical volume will take over the max_sectors > >>>from luns below. > [...] > > >>>Using the patch improves read I/O up to 3x. In this specific case from > >>>600MiB/s up to 1800MiB/s. > >>and using this patch will cause IO to fail sometimes. > >>If an IO request which is larger than a page crosses a device boundary in > >>the underlying e.g. RAID0, the RAID0 will return an error as such things > >>should not happen - they are prevented by merge_bvec_fn. > >> > >>If merge_bvec_fn is not being honoured, then you MUST limit requests to a > >>single entry iovec of at most one page. > >> > >>NeilBrown > >> > >Thank you for that hint, I will try to write a merge_bvec_fn for dm-stripe.c > >which solves the problem, if that is ok? > > > >Mustafa Mesanovic > > > Now here my new suggestion to fix this issue, what is your opinion? > I tested this with different setups, and it worked fine and I had > very good performance improvements. > > [RFC][PATCH] dm: improve read performance - v2 > > This patch adds a merge_fn for the dm stripe target. This merge_fn > prevents dm_set_device_limits() setting the max_sectors to 4KiB > (PAGE_SIZE). (As in a prior patch already mentioned.) > > Now the read performance improved up to 3x higher compared to before. > > What happened before: > I/O throughput degradation happened at __bio_add_page() when bio's got checked > at the very beginning upon max_sectors. In this setup max_sectors is always > set to 8. So bio's entered the dm target with a max of 4KiB. So to recap: - you are stacking dm-stripe on DM multipath devices - dm_set_device_limits() will set max_hw_sectors to PAGE_SIZE if: 1) q->merge_bvec_fn is defined (as is the case for all DM devices) 2) target does not define a merge_fn (as was the case for dm-stripe) - dm_merge_bvec (aka DM's q->merge_bvec_fn) will only allow a single page for the situation described in the previous point. > Now dm-stripe target will have its own merge_fn so max_sectors will not > pushed down to 8 (4KiB), and bio's can get bigger than 4KiB. > > Signed-off-by: Mustafa Mesanovic<mume@linux.vnet.ibm.com> Here is a revised version of your patch that uses the relatively new stripe_map_sector(), removes an unnecessary cast, and a few other cleanups. This v3 should work as well as your v2 but should be suitable for upstream inclusion -- authorship of the change is still attributed to you as I only updated the code slightly. I'd appreciate it if you could verify your config still performs as expected. Thanks, Mike From: Mustafa Mesanovic <mume@linux.vnet.ibm.com> When the stripe target's underlying devices provide a merge_bvec_fn (e.g. DM) it is important to call down to them when building a biovec that doesn't span a stripe boundary. Without the merge method, a striped DM device stacked on DM devices would cause bios with a single page to be submitted. The small bios that were built resulted in unnecessary overhead that hurt performance. In one setup, striped DM multipath devices, read performance was improved from 600MB/s to 1800MB/s. Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/md/dm-stripe.c | 23 ++++++++++++++++++++++- 1 files changed, 22 insertions(+), 1 deletions(-) diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index dddfa14..fdfc1c6 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -396,9 +396,29 @@ static void stripe_io_hints(struct dm_target *ti, blk_limits_io_opt(limits, chunk_size * sc->stripes); } +static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct stripe_c *sc = ti->private; + sector_t bvm_sector = bvm->bi_sector; + uint32_t stripe; + struct request_queue *q; + + stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector); + + q = bdev_get_queue(sc->stripe[stripe].dev->bdev); + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = sc->stripe[stripe].dev->bdev; + bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector; + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + static struct target_type stripe_target = { .name = "striped", - .version = {1, 3, 1}, + .version = {1, 3, 2}, .module = THIS_MODULE, .ctr = stripe_ctr, .dtr = stripe_dtr, @@ -407,6 +427,7 @@ static struct target_type stripe_target = { .status = stripe_status, .iterate_devices = stripe_iterate_devices, .io_hints = stripe_io_hints, + .merge = stripe_merge, }; int __init dm_stripe_init(void) ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v3] dm stripe: implement merge method 2011-03-08 2:21 ` [PATCH v3] dm stripe: implement merge method Mike Snitzer @ 2011-03-08 10:29 ` Mustafa Mesanovic 2011-03-08 16:48 ` Mike Snitzer 2011-03-16 20:21 ` [PATCH v4] " Mike Snitzer 1 sibling, 1 reply; 15+ messages in thread From: Mustafa Mesanovic @ 2011-03-08 10:29 UTC (permalink / raw) To: Mike Snitzer, Neil Brown, akpm, dm-devel Cc: cotte, heiko.carstens, linux-kernel, ehrhardt On 03/08/2011 03:21 AM, Mike Snitzer wrote: > Hello Mustafa, > > On Mon, Mar 07 2011 at 5:10am -0500, > Mustafa Mesanovic<mume@linux.vnet.ibm.com> wrote: > >> On 12/27/2010 01:23 PM, Mustafa Mesanovic wrote: >>> On Mon December 27 2010 12:54:59 Neil Brown wrote: >>>> On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic >>>> >>>> <mume@linux.vnet.ibm.com> wrote: >>>>> From: Mustafa Mesanovic<mume@linux.vnet.ibm.com> >>>>> >>>>> A short explanation in prior: in this case we have "stacked" dm devices. >>>>> Two multipathed luns combined together to one striped logical volume. >>>>> >>>>> I/O throughput degradation happens at __bio_add_page when bio's get >>>>> checked upon max_sectors. In this setup max_sectors is always set to 8 >>>>> -> what is 4KiB. >>>>> A standalone striped logical volume on luns which are not multipathed do >>>>> not have the problem: the logical volume will take over the max_sectors >>>> >from luns below. >> [...] >> >>>>> Using the patch improves read I/O up to 3x. In this specific case from >>>>> 600MiB/s up to 1800MiB/s. >>>> and using this patch will cause IO to fail sometimes. >>>> If an IO request which is larger than a page crosses a device boundary in >>>> the underlying e.g. RAID0, the RAID0 will return an error as such things >>>> should not happen - they are prevented by merge_bvec_fn. >>>> >>>> If merge_bvec_fn is not being honoured, then you MUST limit requests to a >>>> single entry iovec of at most one page. >>>> >>>> NeilBrown >>>> >>> Thank you for that hint, I will try to write a merge_bvec_fn for dm-stripe.c >>> which solves the problem, if that is ok? >>> >>> Mustafa Mesanovic >>> >> Now here my new suggestion to fix this issue, what is your opinion? >> I tested this with different setups, and it worked fine and I had >> very good performance improvements. >> >> [RFC][PATCH] dm: improve read performance - v2 >> >> This patch adds a merge_fn for the dm stripe target. This merge_fn >> prevents dm_set_device_limits() setting the max_sectors to 4KiB >> (PAGE_SIZE). (As in a prior patch already mentioned.) >> >> Now the read performance improved up to 3x higher compared to before. >> >> What happened before: >> I/O throughput degradation happened at __bio_add_page() when bio's got checked >> at the very beginning upon max_sectors. In this setup max_sectors is always >> set to 8. So bio's entered the dm target with a max of 4KiB. > So to recap: > - you are stacking dm-stripe on DM multipath devices > - dm_set_device_limits() will set max_hw_sectors to PAGE_SIZE if: > 1) q->merge_bvec_fn is defined (as is the case for all DM devices) > 2) target does not define a merge_fn (as was the case for dm-stripe) > - dm_merge_bvec (aka DM's q->merge_bvec_fn) will only allow a single > page for the situation described in the previous point. > >> Now dm-stripe target will have its own merge_fn so max_sectors will not >> pushed down to 8 (4KiB), and bio's can get bigger than 4KiB. >> >> Signed-off-by: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > Here is a revised version of your patch that uses the relatively new > stripe_map_sector(), removes an unnecessary cast, and a few other > cleanups. > > This v3 should work as well as your v2 but should be suitable for > upstream inclusion -- authorship of the change is still attributed to > you as I only updated the code slightly. I'd appreciate it if you could > verify your config still performs as expected. > > Thanks, > Mike > > > From: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > > When the stripe target's underlying devices provide a merge_bvec_fn > (e.g. DM) it is important to call down to them when building a biovec > that doesn't span a stripe boundary. > > Without the merge method, a striped DM device stacked on DM devices > would cause bios with a single page to be submitted. The small bios > that were built resulted in unnecessary overhead that hurt performance. > > In one setup, striped DM multipath devices, read performance was > improved from 600MB/s to 1800MB/s. > > Signed-off-by: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > Signed-off-by: Mike Snitzer<snitzer@redhat.com> > --- > drivers/md/dm-stripe.c | 23 ++++++++++++++++++++++- > 1 files changed, 22 insertions(+), 1 deletions(-) > > diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c > index dddfa14..fdfc1c6 100644 > --- a/drivers/md/dm-stripe.c > +++ b/drivers/md/dm-stripe.c > @@ -396,9 +396,29 @@ static void stripe_io_hints(struct dm_target *ti, > blk_limits_io_opt(limits, chunk_size * sc->stripes); > } > > +static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm, > + struct bio_vec *biovec, int max_size) > +{ > + struct stripe_c *sc = ti->private; > + sector_t bvm_sector = bvm->bi_sector; > + uint32_t stripe; > + struct request_queue *q; > + > + stripe_map_sector(sc, bvm_sector,&stripe,&bvm_sector); > + > + q = bdev_get_queue(sc->stripe[stripe].dev->bdev); > + if (!q->merge_bvec_fn) > + return max_size; > + > + bvm->bi_bdev = sc->stripe[stripe].dev->bdev; > + bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector; > + > + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); > +} > + > static struct target_type stripe_target = { > .name = "striped", > - .version = {1, 3, 1}, > + .version = {1, 3, 2}, > .module = THIS_MODULE, > .ctr = stripe_ctr, > .dtr = stripe_dtr, > @@ -407,6 +427,7 @@ static struct target_type stripe_target = { > .status = stripe_status, > .iterate_devices = stripe_iterate_devices, > .io_hints = stripe_io_hints, > + .merge = stripe_merge, > }; > > int __init dm_stripe_init(void) > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ Mike, your changes are working well and performing even a bit better. Are there any further comments from others, or can consider putting it upstream. Regards, Mustafa ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3] dm stripe: implement merge method 2011-03-08 10:29 ` Mustafa Mesanovic @ 2011-03-08 16:48 ` Mike Snitzer 2011-03-10 14:02 ` Mustafa Mesanovic 0 siblings, 1 reply; 15+ messages in thread From: Mike Snitzer @ 2011-03-08 16:48 UTC (permalink / raw) To: Mustafa Mesanovic Cc: Neil Brown, akpm, dm-devel, cotte, heiko.carstens, linux-kernel, ehrhardt, Alasdair G. Kergon, Jeff Moyer On Tue, Mar 08 2011 at 5:29am -0500, Mustafa Mesanovic <mume@linux.vnet.ibm.com> wrote: > On 03/08/2011 03:21 AM, Mike Snitzer wrote: > > >Here is a revised version of your patch that uses the relatively new > >stripe_map_sector(), removes an unnecessary cast, and a few other > >cleanups. > > > >This v3 should work as well as your v2 but should be suitable for > >upstream inclusion -- authorship of the change is still attributed to > >you as I only updated the code slightly. I'd appreciate it if you could > >verify your config still performs as expected. > > Mike, > > your changes are working well and performing even a bit better. Good to know. > Are there any further comments from others, or can consider putting it > upstream. I chatted with Alasdair and one concern he had was: does the existence of stripe_merge() ever hurt due to the fact that stripe_map_sector() is performed twice (once for .merge and again for .map)? May be that a really small chuck_size proves to be more problematic but using a small chunk_size generally isn't productive to begin with. In any case, it clearly helps your workload. Could you explain your config in more detail? - what is your chunk_size? - how many stripes (how many mpath devices)? - what is the performance, of your test workload, of a single underlying mpath device? And, in particular, what is your test workload? - What is the nature of your IO (are you using a particular tool)? - Are you using AIO? - How many threads? - Are you driving deep queue depths? Etc. I have various configs that I'll be testing to help verify the benefit. The only other change Alasdair request is that the target version should be bumped to 1.4 (rather than 1.3.2). Given that I can put some time to this now: we should be able to sort all this out for upstream inclusion in 2.6.39. Thanks, Mike ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3] dm stripe: implement merge method 2011-03-08 16:48 ` Mike Snitzer @ 2011-03-10 14:02 ` Mustafa Mesanovic 2011-03-12 22:42 ` Mike Snitzer 0 siblings, 1 reply; 15+ messages in thread From: Mustafa Mesanovic @ 2011-03-10 14:02 UTC (permalink / raw) To: Mike Snitzer, dm-devel Cc: Neil Brown, akpm, cotte, heiko.carstens, linux-kernel, ehrhardt, Alasdair G. Kergon, Jeff Moyer On 03/08/2011 05:48 PM, Mike Snitzer wrote: > On Tue, Mar 08 2011 at 5:29am -0500, > Mustafa Mesanovic<mume@linux.vnet.ibm.com> wrote: > >> On 03/08/2011 03:21 AM, Mike Snitzer wrote: >> >>> Here is a revised version of your patch that uses the relatively new >>> stripe_map_sector(), removes an unnecessary cast, and a few other >>> cleanups. >>> >>> This v3 should work as well as your v2 but should be suitable for >>> upstream inclusion -- authorship of the change is still attributed to >>> you as I only updated the code slightly. I'd appreciate it if you could >>> verify your config still performs as expected. >> Mike, >> >> your changes are working well and performing even a bit better. > Good to know. > >> Are there any further comments from others, or can consider putting it >> upstream. > I chatted with Alasdair and one concern he had was: does the existence > of stripe_merge() ever hurt due to the fact that stripe_map_sector() is > performed twice (once for .merge and again for .map)? May be that a > really small chuck_size proves to be more problematic but using a small > chunk_size generally isn't productive to begin with. > > In any case, it clearly helps your workload. > > Could you explain your config in more detail? > - what is your chunk_size? > - how many stripes (how many mpath devices)? > - what is the performance, of your test workload, of a single underlying > mpath device? > > And, in particular, what is your test workload? > - What is the nature of your IO (are you using a particular tool)? > - Are you using AIO? > - How many threads? > - Are you driving deep queue depths? Etc. > > I have various configs that I'll be testing to help verify the benefit. > The only other change Alasdair request is that the target version should > be bumped to 1.4 (rather than 1.3.2). > > Given that I can put some time to this now: we should be able to sort > all this out for upstream inclusion in 2.6.39. > > Thanks, > Mike Mike, the setup that I have used to verify and check upon the changes consisted of: - Benchmark iozone (seq write, seq read, random read and write), filesize 2000m, with 32 processes (no AIO used). - Disk-Setup 2 disks (queue_depth=192) -> each disk with 8 paths -> multipathed (multibus, rr_min_io=1) And a striped LVM out of these two (chunk_size=64KiB). The benchmark then runs on this LV. In detail: I was previosly working on a 2.6.32.x-stable kernel, thats where I created the stripe_merge function for. There I have seen improvements up to 3x -> ~600MiB -> ~1800MiB With the git-kernel only 1.25x better when applying the patch, but its still significantly better! ~990MiB -> ~1300MiB A single mpath device has ~1300MiB throughput. Regards, Mustafa ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3] dm stripe: implement merge method 2011-03-10 14:02 ` Mustafa Mesanovic @ 2011-03-12 22:42 ` Mike Snitzer 2011-03-14 11:54 ` Mustafa Mesanovic 0 siblings, 1 reply; 15+ messages in thread From: Mike Snitzer @ 2011-03-12 22:42 UTC (permalink / raw) To: Mustafa Mesanovic Cc: dm-devel, Neil Brown, akpm, cotte, heiko.carstens, linux-kernel, ehrhardt, Alasdair G. Kergon, Jeff Moyer [-- Attachment #1: Type: text/plain, Size: 3530 bytes --] Hi Mustafa, On Thu, Mar 10 2011 at 9:02am -0500, Mustafa Mesanovic <mume@linux.vnet.ibm.com> wrote: > On 03/08/2011 05:48 PM, Mike Snitzer wrote: > >In any case, it clearly helps your workload. > > > >Could you explain your config in more detail? > >- what is your chunk_size? > >- how many stripes (how many mpath devices)? > >- what is the performance, of your test workload, of a single underlying > > mpath device? > > > >And, in particular, what is your test workload? > >- What is the nature of your IO (are you using a particular tool)? > >- Are you using AIO? > >- How many threads? > >- Are you driving deep queue depths? Etc. > > > >I have various configs that I'll be testing to help verify the benefit. > >The only other change Alasdair request is that the target version should > >be bumped to 1.4 (rather than 1.3.2). > > > >Given that I can put some time to this now: we should be able to sort > >all this out for upstream inclusion in 2.6.39. > > > >Thanks, > >Mike > Mike, > > the setup that I have used to verify and check upon the changes > consisted of: > > - Benchmark > iozone (seq write, seq read, random read and write), > filesize 2000m, with 32 processes (no AIO used). > > - Disk-Setup > 2 disks (queue_depth=192) -> each disk with 8 paths > -> multipathed (multibus, rr_min_io=1) > > And a striped LVM out of these two (chunk_size=64KiB). > > The benchmark then runs on this LV. What record size are you using? Which filesystem are you using? Also, were you using O_DIRECT? If not then I'm having a hard time understanding why implementing stripe_merge was so beneficial for you. stripe_merge doesn't help buffered IO. Please share your exact iozone command line. In my testing with aio-stress I have seen the number of calls to stripe_map be inversely proportional to the record size (when record size is <= chunk_size). That is, with the following aio-stress commandline: aio-stress -O -o 0 -o 1 -r $RECORD_SIZE -d 64 -b 16 -i 16 -s 2048 /dev/snitm/striped_lv I varied the $RECORD_SIZE from 4k to 256k (striped_lv is using a 64k chunk_size across 8 mpath devices). The number of stripe_map_sector() calls resulting from having implemented stripe_merge is fixed at 1048560 (when reading and then writing 2048m). And there is one stripe_map_sector() call for each stripe_map() call. The following table shows the stripe_map_sector and stripe_map call count for writes then reads of 2048m (using $record_size AIO). AIO does make use of dm_merge_bvec and stripe_merge. record_size stripe_map_sector calls stripe_map calls 4k 2097152 1048592 8k 1572864 524304 16k 1310720 262160 32k 1179648 131088 64k 1114112 65552 128k 1114112 65552 256k 1114112 65552 The above shows that bios are being assembled using larger payloads (up to chunk_size) given that AIO does make use of stripe_merge. When I did the same accounting (via attached systemtap script) for a buffered iozone run with a file size of 2000m (using -i 0 -i 1 -i 2) I saw that dm_merge_bvec() was _never_ called and the number of stripe_map_sector calls was very close to the stripe_map calls. Mike p.s. All the above aside, one of our more elaborate benchmarks against XFS has seen a significant benefit from stripe_merge() being present... I still need to understand that benchmark's IO workload though. [-- Attachment #2: dm_stripe_map_count.stp --] [-- Type: text/plain, Size: 606 bytes --] # stap dm_stripe_map_count.stp -c '<benchmark>' global dm_merge_bvec_count global stripe_map_sector_count global stripe_map_count probe module("dm_mod").function("dm_merge_bvec") { dm_merge_bvec_count++ } probe module("dm_mod").function("stripe_map_sector") { stripe_map_sector_count++ } probe module("dm_mod").function("stripe_map") { stripe_map_count++ } probe end { printf ("dm_merge_bvec_count calls: %d\n", dm_merge_bvec_count) printf ("stripe_map_sector calls: %d\n", stripe_map_sector_count) printf ("stripe_map calls: %d\n", stripe_map_count) } ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3] dm stripe: implement merge method 2011-03-12 22:42 ` Mike Snitzer @ 2011-03-14 11:54 ` Mustafa Mesanovic 2011-03-14 14:33 ` Mike Snitzer 0 siblings, 1 reply; 15+ messages in thread From: Mustafa Mesanovic @ 2011-03-14 11:54 UTC (permalink / raw) To: Mike Snitzer, dm-devel Cc: Neil Brown, akpm, cotte, heiko.carstens, linux-kernel, ehrhardt, Alasdair G. Kergon, Jeff Moyer On 03/12/2011 11:42 PM, Mike Snitzer wrote: > Hi Mustafa, > > On Thu, Mar 10 2011 at 9:02am -0500, > Mustafa Mesanovic<mume@linux.vnet.ibm.com> wrote: > >> On 03/08/2011 05:48 PM, Mike Snitzer wrote: >>> In any case, it clearly helps your workload. >>> >>> Could you explain your config in more detail? >>> - what is your chunk_size? >>> - how many stripes (how many mpath devices)? >>> - what is the performance, of your test workload, of a single underlying >>> mpath device? >>> >>> And, in particular, what is your test workload? >>> - What is the nature of your IO (are you using a particular tool)? >>> - Are you using AIO? >>> - How many threads? >>> - Are you driving deep queue depths? Etc. >>> >>> I have various configs that I'll be testing to help verify the benefit. >>> The only other change Alasdair request is that the target version should >>> be bumped to 1.4 (rather than 1.3.2). >>> >>> Given that I can put some time to this now: we should be able to sort >>> all this out for upstream inclusion in 2.6.39. >>> >>> Thanks, >>> Mike >> Mike, >> >> the setup that I have used to verify and check upon the changes >> consisted of: >> >> - Benchmark >> iozone (seq write, seq read, random read and write), >> filesize 2000m, with 32 processes (no AIO used). >> >> - Disk-Setup >> 2 disks (queue_depth=192) -> each disk with 8 paths >> -> multipathed (multibus, rr_min_io=1) >> >> And a striped LVM out of these two (chunk_size=64KiB). >> >> The benchmark then runs on this LV. > What record size are you using? > Which filesystem are you using? > Also, were you using O_DIRECT? If not then I'm having a hard time > understanding why implementing stripe_merge was so beneficial for you. > stripe_merge doesn't help buffered IO. > > Please share your exact iozone command line. > > In my testing with aio-stress I have seen the number of calls to > stripe_map be inversely proportional to the record size (when record > size is<= chunk_size). > > That is, with the following aio-stress commandline: > aio-stress -O -o 0 -o 1 -r $RECORD_SIZE -d 64 -b 16 -i 16 -s 2048 /dev/snitm/striped_lv > > I varied the $RECORD_SIZE from 4k to 256k (striped_lv is using a 64k > chunk_size across 8 mpath devices). > > The number of stripe_map_sector() calls resulting from having > implemented stripe_merge is fixed at 1048560 (when reading and then > writing 2048m). And there is one stripe_map_sector() call for each > stripe_map() call. > > The following table shows the stripe_map_sector and stripe_map call > count for writes then reads of 2048m (using $record_size AIO). AIO does > make use of dm_merge_bvec and stripe_merge. > > record_size stripe_map_sector calls stripe_map calls > 4k 2097152 1048592 > 8k 1572864 524304 > 16k 1310720 262160 > 32k 1179648 131088 > 64k 1114112 65552 > 128k 1114112 65552 > 256k 1114112 65552 > > The above shows that bios are being assembled using larger payloads (up > to chunk_size) given that AIO does make use of stripe_merge. > > When I did the same accounting (via attached systemtap script) for a > buffered iozone run with a file size of 2000m (using -i 0 -i 1 -i 2) I > saw that dm_merge_bvec() was _never_ called and the number of > stripe_map_sector calls was very close to the stripe_map calls. > > Mike > > p.s. > All the above aside, one of our more elaborate benchmarks against XFS > has seen a significant benefit from stripe_merge() being present... I > still need to understand that benchmark's IO workload though. I used 64k record size, and ext3 as filesystem. No, I was not using O_DIRECT. But I have measured as well with O_DIRECT, and the benefits there are significant too. stripe_merge() helps a lot. The reason of splitting I/O records into 4KiB chunks happens at dm_set_device_limits(), thats what I explained in my v1 patch. If the target has no own merge_fn, max_sectors will be set to PAGE_SIZE, what in my case is 4KiB. Then __bio_add_page checks upon max_sectors and does not add any more pages to a bio. The bio stays at 4KiB. Now by avoiding the "wrong" setting of max_sectors for the dm target, __bio_add_page will be able to add more than one page to the bios. So this is my iozone call: # iozone -s 2000m -r 64k -t 32 -e -w -R -C -i 0 -F<mntpt>/Child0 ....<mntpt>/Child31 For direct I/O (O_DIRECT) add '-I'. dm_merge_bvec/stripe_merge is being called only on reads, thats what I have observed when I was testing the patch on my 2.6.32.x-stable kernel. Maybe it depends if the I/O is page cached or aio based...this might be worth a further analysis. On writes another path must be walked through, but I have not further analysed it so far. In think it helps to avoid "overhead" in passing always 4KiB bios to the dm target. In my opinion it is "cheaper"/"faster" to pass one big bio down to the dm target instead of passing 4KiB max each bio. I used iostat to check on the devices and the sizes of the requests, just try to start an iostat process which collects I/O statistics during your runs. e.g. 'iostat -dmx 2> outfile&' - check out "avgrq-sz". And yes during my iostat runs I figured out that the writes are still dropping into the dm in 4KiB chunks, this is what I will analyse next. Maybe there will be another patch(es) to fix that. Mustafa ps: aio-stress did not work for me, sorry but I did not have the time to check on that and to search where the error might be... ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3] dm stripe: implement merge method 2011-03-14 11:54 ` Mustafa Mesanovic @ 2011-03-14 14:33 ` Mike Snitzer 0 siblings, 0 replies; 15+ messages in thread From: Mike Snitzer @ 2011-03-14 14:33 UTC (permalink / raw) To: Mustafa Mesanovic Cc: dm-devel, Neil Brown, akpm, cotte, heiko.carstens, linux-kernel, ehrhardt, Alasdair G. Kergon, Jeff Moyer On Mon, Mar 14 2011 at 7:54am -0400, Mustafa Mesanovic <mume@linux.vnet.ibm.com> wrote: > On 03/12/2011 11:42 PM, Mike Snitzer wrote: > >What record size are you using? > >Which filesystem are you using? > >Also, were you using O_DIRECT? If not then I'm having a hard time > >understanding why implementing stripe_merge was so beneficial for you. > >stripe_merge doesn't help buffered IO. To clarify (thanks to the clue about dropping caches): stripe_merge doesn't help _cached_ buffered IO ;) > I used 64k record size, and ext3 as filesystem. > > No, I was not using O_DIRECT. But I have measured as well with O_DIRECT, and > the benefits there are significant too. > > stripe_merge() helps a lot. The reason of splitting I/O records into 4KiB > chunks happens at dm_set_device_limits(), thats what I explained in my v1 patch. > If the target has no own merge_fn, max_sectors will be set to PAGE_SIZE, what > in my case is 4KiB. Then __bio_add_page checks upon max_sectors and does not > add any more pages to a bio. The bio stays at 4KiB. > > Now by avoiding the "wrong" setting of max_sectors for the dm target, > __bio_add_page will be able to add more than one page to the bios. Right, I understand the limitation that the patch addresses. But given that I hadn't dropped caches I was missing _why_ it was helping you so much for buffered IO. > So this is my iozone call: > # iozone -s 2000m -r 64k -t 32 -e -w -R -C -i 0 > -F<mntpt>/Child0 ....<mntpt>/Child31 > For direct I/O (O_DIRECT) add '-I'. I've been using a comparable iozone commandline (except I was using -i 0 -i 1 -i 2 in a single iozone run). If I write (-i 0), drop caches, and then read (-i 1) I see the benefit associated with stripe_merge() and buffered reads: iozone -s 2000m -r 64k -t 8 -e -w -i 0 -F ... dm_merge_bvec_count calls: 192 stripe_map_sector calls: 8192527 stripe_map calls: 8192921 (do _not_ drop_caches) iozone -s 2000m -r 64k -t 8 -e -w -i 1 -F ... dm_merge_bvec_count calls: 0 stripe_map_sector calls: 6 stripe_map calls: 262 echo 3 > /proc/sys/vm/drop_caches iozone -s 2000m -r 64k -t 8 -e -w -i 1 -F ... ... dm_merge_bvec_count calls: 8147899 stripe_map_sector calls: 4205979 stripe_map calls: 247675 > dm_merge_bvec/stripe_merge is being called only on reads, thats what I have > observed when I was testing the patch on my 2.6.32.x-stable kernel. Maybe it > depends if the I/O is page cached or aio based...this might be worth a > further analysis. On writes another path must be walked through, but I have > not further analysed it so far. Direct I/O (and AIO) writes do use dm_merge_bvec/stripe_merge: iozone -s 2000m -r 64k -t 8 -e -w -I -i 0 -F ... dm_merge_bvec_count calls: 16344806 stripe_map_sector calls: 8683179 stripe_map calls: 511595 > In think it helps to avoid "overhead" in passing always 4KiB bios to the > dm target. In my opinion it is "cheaper"/"faster" to pass one big bio > down to the dm target instead of passing 4KiB max each bio. > > I used iostat to check on the devices and the sizes of the requests, just try > to start an iostat process which collects I/O statistics during your > runs. e.g. 'iostat -dmx 2> outfile&' - check out "avgrq-sz". > > And yes during my iostat runs I figured out that the writes are still dropping > into the dm in 4KiB chunks, this is what I will analyse next. Yes, for buffered writes that is definitely the case. My iostat runs show this too, e.g.: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util dm-50 0.00 11970.00 0.00 45.50 0.00 23264.00 1022.59 175.31 1398.88 21.98 100.00 dm-52 0.00 11967.50 0.00 41.50 0.00 20742.00 999.61 170.87 1330.83 24.10 100.00 dm-56 0.00 11962.00 0.00 47.50 0.00 24320.00 1024.00 167.65 1324.91 21.05 100.00 dm-57 0.00 11970.00 0.00 45.00 0.00 23040.00 1024.00 174.03 1409.62 22.22 100.00 dm-58 0.00 11970.00 0.00 47.00 0.00 23808.00 1013.11 176.86 1415.36 21.28 100.00 dm-60 0.00 11962.00 0.00 45.00 0.00 23040.00 1024.00 171.99 1375.34 22.22 100.00 dm-62 0.00 11962.00 0.00 59.50 0.00 29952.00 1006.79 139.20 1129.12 16.81 100.00 dm-64 0.00 11967.50 0.00 63.50 0.00 32510.00 1023.94 136.29 1116.40 15.75 100.00 dm-66 0.00 0.00 0.00 96459.50 0.00 385838.00 8.00 167097.20 675.99 0.01 100.00 NOTE: dm-66 is the bio-based striped_lv volume, the other dm-* are the underlying request-based mpath devices. > Maybe there will be another patch(es) to fix that. Doubtful, and it certainly isn't a DM-only phenomenon. writeback is always done in terms of PAGE_SIZE IOs. > Mustafa > > ps: > aio-stress did not work for me, sorry but I did not have the time to check on that > and to search where the error might be... Odd, works fine for me. I'm using the following commndline: aio-stress -O -o 0 -o 1 -r 64 -d 128 -b 16 -i 16 -s 2048 /dev/snitm/striped_lv Can you share how you're using it and what the error is? ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v4] dm stripe: implement merge method 2011-03-08 2:21 ` [PATCH v3] dm stripe: implement merge method Mike Snitzer 2011-03-08 10:29 ` Mustafa Mesanovic @ 2011-03-16 20:21 ` Mike Snitzer 1 sibling, 0 replies; 15+ messages in thread From: Mike Snitzer @ 2011-03-16 20:21 UTC (permalink / raw) To: Mustafa Mesanovic, Alasdair G. Kergon Cc: cotte, heiko.carstens, linux-kernel, ehrhardt, dm-devel, akpm From: Mustafa Mesanovic <mume@linux.vnet.ibm.com> When the stripe target's underlying devices provide a merge_bvec_fn (like all DM devices do via dm_merge_bvec) it is important to call down to them when building a biovec that doesn't span a stripe boundary. Without the merge method, a striped DM device stacked on DM devices would cause bios with a single page to be submitted. The small bios that were built resulted in unnecessary overhead that hurt performance. This change really helps filesystems (e.g. XFS and now ext4) which take care to assemble larger bios. By implementing stripe_merge(), DM and the stripe target no longer undermine the filesystem's work by only allowing a single page per bio. Buffered IO sees the biggest improvement (particularly uncached reads, buffered writes to a lesser degree). This is especially so for more capable "enterprise" storage LUNs. The performance improvement has been measured to be ~12-35% -- when a reasonable chunk_size is used (e.g. 64K) in conjunction with a stripe count that is a power of 2. In contrast, the performance penalty is ~5-7% for the pathological worst case stripe configuration (small chunk_size with a stripe count that is not a power of 2). The reason for this is that stripe_map_sector() is now called once for every call to dm_merge_bvec(). stripe_map_sector() will use slower division if stripe count isn't a power of 2. Signed-off-by: Mustafa Mesanovic <mume@linux.vnet.ibm.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> --- drivers/md/dm-stripe.c | 23 ++++++++++++++++++++++- 1 files changed, 22 insertions(+), 1 deletions(-) diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index dddfa14..3d80cf0 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -396,9 +396,29 @@ static void stripe_io_hints(struct dm_target *ti, blk_limits_io_opt(limits, chunk_size * sc->stripes); } +static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm, + struct bio_vec *biovec, int max_size) +{ + struct stripe_c *sc = ti->private; + sector_t bvm_sector = bvm->bi_sector; + uint32_t stripe; + struct request_queue *q; + + stripe_map_sector(sc, bvm_sector, &stripe, &bvm_sector); + + q = bdev_get_queue(sc->stripe[stripe].dev->bdev); + if (!q->merge_bvec_fn) + return max_size; + + bvm->bi_bdev = sc->stripe[stripe].dev->bdev; + bvm->bi_sector = sc->stripe[stripe].physical_start + bvm_sector; + + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); +} + static struct target_type stripe_target = { .name = "striped", - .version = {1, 3, 1}, + .version = {1, 4, 0}, .module = THIS_MODULE, .ctr = stripe_ctr, .dtr = stripe_dtr, @@ -407,6 +427,7 @@ static struct target_type stripe_target = { .status = stripe_status, .iterate_devices = stripe_iterate_devices, .io_hints = stripe_io_hints, + .merge = stripe_merge, }; int __init dm_stripe_init(void) ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH] dm: improve read performance 2011-03-07 10:10 ` Mustafa Mesanovic 2011-03-08 2:21 ` [PATCH v3] dm stripe: implement merge method Mike Snitzer @ 2011-03-17 5:12 ` Nikanth Karthikesan 2011-03-17 13:08 ` Mike Snitzer 1 sibling, 1 reply; 15+ messages in thread From: Nikanth Karthikesan @ 2011-03-17 5:12 UTC (permalink / raw) To: Mustafa Mesanovic Cc: Neil Brown, akpm, snitzer, dm-devel, linux-kernel, heiko.carstens, cotte, ehrhardt, hare On Monday, March 07, 2011 03:40:01 pm Mustafa Mesanovic wrote: > On 12/27/2010 01:23 PM, Mustafa Mesanovic wrote: > > On Mon December 27 2010 12:54:59 Neil Brown wrote: > >> On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic > >> > >> <mume@linux.vnet.ibm.com> wrote: > >>> From: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > >>> > >>> A short explanation in prior: in this case we have "stacked" dm > >>> devices. Two multipathed luns combined together to one striped logical > >>> volume. > >>> > >>> I/O throughput degradation happens at __bio_add_page when bio's get > >>> checked upon max_sectors. In this setup max_sectors is always set to 8 > >>> -> what is 4KiB. > >>> A standalone striped logical volume on luns which are not multipathed > >>> do not have the problem: the logical volume will take over the > >>> max_sectors from luns below. > > [...] > > >>> Using the patch improves read I/O up to 3x. In this specific case from > >>> 600MiB/s up to 1800MiB/s. > >> > >> and using this patch will cause IO to fail sometimes. > >> If an IO request which is larger than a page crosses a device boundary > >> in the underlying e.g. RAID0, the RAID0 will return an error as such > >> things should not happen - they are prevented by merge_bvec_fn. > >> > >> If merge_bvec_fn is not being honoured, then you MUST limit requests to > >> a single entry iovec of at most one page. > >> > >> NeilBrown > > > > Thank you for that hint, I will try to write a merge_bvec_fn for > > dm-stripe.c which solves the problem, if that is ok? > > > > Mustafa Mesanovic > > Now here my new suggestion to fix this issue, what is your opinion? > I tested this with different setups, and it worked fine and I had > very good performance improvements. > Some minor style nitpicks. > [RFC][PATCH] dm: improve read performance - v2 > > This patch adds a merge_fn for the dm stripe target. This merge_fn > prevents dm_set_device_limits() setting the max_sectors to 4KiB > (PAGE_SIZE). (As in a prior patch already mentioned.) > Now the read performance improved up to 3x higher compared to before. > > What happened before: > I/O throughput degradation happened at __bio_add_page() when bio's got > checked at the very beginning upon max_sectors. In this setup max_sectors > is always set to 8. So bio's entered the dm target with a max of 4KiB. > > Now dm-stripe target will have its own merge_fn so max_sectors will not > pushed down to 8 (4KiB), and bio's can get bigger than 4KiB. > > Signed-off-by: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > --- > > dm-stripe.c | 24 ++++++++++++++++++++++++ > 1 file changed, 24 insertions(+) > > Index: linux-2.6/drivers/md/dm-stripe.c > =================================================================== > --- linux-2.6.orig/drivers/md/dm-stripe.c 2011-02-28 10:23:37.000000000 > +0100 +++ linux-2.6/drivers/md/dm-stripe.c 2011-02-28 10:24:29.000000000 > +0100 @@ -396,6 +396,29 @@ > blk_limits_io_opt(limits, chunk_size * sc->stripes); > } > > +static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm, > + struct bio_vec *biovec, int max_size) > +{ > + struct stripe_c *sc = (struct stripe_c *) ti->private; > + sector_t offset, chunk; > + uint32_t stripe; > + struct request_queue *q; > + > + offset = bvm->bi_sector - ti->begin; > + chunk = offset>> sc->chunk_shift; > + stripe = sector_div(chunk, sc->stripes); > + > + if (!bdev_get_queue(sc->stripe[stripe].dev->bdev)->merge_bvec_fn) > + return max_size; > + > + bvm->bi_bdev = sc->stripe[stripe].dev->bdev; > + q = bdev_get_queue(bvm->bi_bdev); Initializing q at the top would simplify the check fro merge_bvec_fn above. > + bvm->bi_sector = sc->stripe[stripe].physical_start + > + (chunk<< sc->chunk_shift) + (offset& sc->chunk_mask); > + Can this be written as bvm->bi_sector = sc->stripe[stripe].physical_start + bvm->bi_sector - ti->begin; or even better bvm->bi_sector = sc->stripe[stripe].physical_start + dm_target_offset(ti, bvm->bi_sector); > > + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); > +} > + > static struct target_type stripe_target = { > .name = "striped", > .version = {1, 3, 1}, > @@ -403,6 +426,7 @@ > .ctr = stripe_ctr, > .dtr = stripe_dtr, > .map = stripe_map, > + .merge = stripe_merge, > .end_io = stripe_end_io, > .status = stripe_status, > .iterate_devices = stripe_iterate_devices, > > > Reviewed-by: Nikanth Karthikesan <knikanth@suse.de> ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH] dm: improve read performance 2011-03-17 5:12 ` [RFC][PATCH] dm: improve read performance Nikanth Karthikesan @ 2011-03-17 13:08 ` Mike Snitzer 2011-03-18 4:59 ` Nikanth Karthikesan 0 siblings, 1 reply; 15+ messages in thread From: Mike Snitzer @ 2011-03-17 13:08 UTC (permalink / raw) To: Nikanth Karthikesan Cc: Mustafa Mesanovic, Neil Brown, akpm, dm-devel, linux-kernel, heiko.carstens, cotte, ehrhardt, hare On Thu, Mar 17 2011 at 1:12am -0400, Nikanth Karthikesan <knikanth@suse.de> wrote: > On Monday, March 07, 2011 03:40:01 pm Mustafa Mesanovic wrote: > > On 12/27/2010 01:23 PM, Mustafa Mesanovic wrote: > > > On Mon December 27 2010 12:54:59 Neil Brown wrote: > > >> On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic > > >> > > >> <mume@linux.vnet.ibm.com> wrote: > > >>> From: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > > >>> > > >>> A short explanation in prior: in this case we have "stacked" dm > > >>> devices. Two multipathed luns combined together to one striped logical > > >>> volume. > > >>> > > >>> I/O throughput degradation happens at __bio_add_page when bio's get > > >>> checked upon max_sectors. In this setup max_sectors is always set to 8 > > >>> -> what is 4KiB. > > >>> A standalone striped logical volume on luns which are not multipathed > > >>> do not have the problem: the logical volume will take over the > > >>> max_sectors from luns below. > > > > [...] > > > > >>> Using the patch improves read I/O up to 3x. In this specific case from > > >>> 600MiB/s up to 1800MiB/s. > > >> > > >> and using this patch will cause IO to fail sometimes. > > >> If an IO request which is larger than a page crosses a device boundary > > >> in the underlying e.g. RAID0, the RAID0 will return an error as such > > >> things should not happen - they are prevented by merge_bvec_fn. > > >> > > >> If merge_bvec_fn is not being honoured, then you MUST limit requests to > > >> a single entry iovec of at most one page. > > >> > > >> NeilBrown > > > > > > Thank you for that hint, I will try to write a merge_bvec_fn for > > > dm-stripe.c which solves the problem, if that is ok? > > > > > > Mustafa Mesanovic > > > > Now here my new suggestion to fix this issue, what is your opinion? > > I tested this with different setups, and it worked fine and I had > > very good performance improvements. > > > > Some minor style nitpicks. > > > [RFC][PATCH] dm: improve read performance - v2 > > > > This patch adds a merge_fn for the dm stripe target. This merge_fn > > prevents dm_set_device_limits() setting the max_sectors to 4KiB > > (PAGE_SIZE). (As in a prior patch already mentioned.) > > Now the read performance improved up to 3x higher compared to before. > > > > What happened before: > > I/O throughput degradation happened at __bio_add_page() when bio's got > > checked at the very beginning upon max_sectors. In this setup max_sectors > > is always set to 8. So bio's entered the dm target with a max of 4KiB. > > > > Now dm-stripe target will have its own merge_fn so max_sectors will not > > pushed down to 8 (4KiB), and bio's can get bigger than 4KiB. > > > > Signed-off-by: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > > --- > > > > dm-stripe.c | 24 ++++++++++++++++++++++++ > > 1 file changed, 24 insertions(+) > > > > Index: linux-2.6/drivers/md/dm-stripe.c > > =================================================================== > > --- linux-2.6.orig/drivers/md/dm-stripe.c 2011-02-28 10:23:37.000000000 > > +0100 +++ linux-2.6/drivers/md/dm-stripe.c 2011-02-28 10:24:29.000000000 > > +0100 @@ -396,6 +396,29 @@ > > blk_limits_io_opt(limits, chunk_size * sc->stripes); > > } > > > > +static int stripe_merge(struct dm_target *ti, struct bvec_merge_data *bvm, > > + struct bio_vec *biovec, int max_size) > > +{ > > + struct stripe_c *sc = (struct stripe_c *) ti->private; > > + sector_t offset, chunk; > > + uint32_t stripe; > > + struct request_queue *q; > > + > > + offset = bvm->bi_sector - ti->begin; > > + chunk = offset>> sc->chunk_shift; > > + stripe = sector_div(chunk, sc->stripes); > > + > > + if (!bdev_get_queue(sc->stripe[stripe].dev->bdev)->merge_bvec_fn) > > + return max_size; > > + > > + bvm->bi_bdev = sc->stripe[stripe].dev->bdev; > > + q = bdev_get_queue(bvm->bi_bdev); > > Initializing q at the top would simplify the check fro merge_bvec_fn above. > > > + bvm->bi_sector = sc->stripe[stripe].physical_start + > > + (chunk<< sc->chunk_shift) + (offset& sc->chunk_mask); > > + > > Can this be written as > > bvm->bi_sector = sc->stripe[stripe].physical_start + > bvm->bi_sector - ti->begin; > > or even better > bvm->bi_sector = sc->stripe[stripe].physical_start + > dm_target_offset(ti, bvm->bi_sector); > > > > > + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); > > +} > > + > > static struct target_type stripe_target = { > > .name = "striped", > > .version = {1, 3, 1}, > > @@ -403,6 +426,7 @@ > > .ctr = stripe_ctr, > > .dtr = stripe_dtr, > > .map = stripe_map, > > + .merge = stripe_merge, > > .end_io = stripe_end_io, > > .status = stripe_status, > > .iterate_devices = stripe_iterate_devices, > > > > > > > > > Reviewed-by: Nikanth Karthikesan <knikanth@suse.de> You reviewed an old version, v4 was posted to dm-devel and is available here: https://patchwork.kernel.org/patch/639801/ It should address all your concerns. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC][PATCH] dm: improve read performance 2011-03-17 13:08 ` Mike Snitzer @ 2011-03-18 4:59 ` Nikanth Karthikesan 0 siblings, 0 replies; 15+ messages in thread From: Nikanth Karthikesan @ 2011-03-18 4:59 UTC (permalink / raw) To: Mike Snitzer Cc: Mustafa Mesanovic, Neil Brown, akpm, dm-devel, linux-kernel, heiko.carstens, cotte, ehrhardt, hare On Thursday, March 17, 2011 06:38:46 pm Mike Snitzer wrote: > On Thu, Mar 17 2011 at 1:12am -0400, > > Nikanth Karthikesan <knikanth@suse.de> wrote: > > On Monday, March 07, 2011 03:40:01 pm Mustafa Mesanovic wrote: > > > On 12/27/2010 01:23 PM, Mustafa Mesanovic wrote: > > > > On Mon December 27 2010 12:54:59 Neil Brown wrote: > > > >> On Mon, 27 Dec 2010 12:19:55 +0100 Mustafa Mesanovic > > > >> > > > >> <mume@linux.vnet.ibm.com> wrote: > > > >>> From: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > > > >>> > > > >>> A short explanation in prior: in this case we have "stacked" dm > > > >>> devices. Two multipathed luns combined together to one striped > > > >>> logical volume. > > > >>> > > > >>> I/O throughput degradation happens at __bio_add_page when bio's get > > > >>> checked upon max_sectors. In this setup max_sectors is always set > > > >>> to 8 -> what is 4KiB. > > > >>> A standalone striped logical volume on luns which are not > > > >>> multipathed do not have the problem: the logical volume will take > > > >>> over the max_sectors from luns below. > > > > > > [...] > > > > > > >>> Using the patch improves read I/O up to 3x. In this specific case > > > >>> from 600MiB/s up to 1800MiB/s. > > > >> > > > >> and using this patch will cause IO to fail sometimes. > > > >> If an IO request which is larger than a page crosses a device > > > >> boundary in the underlying e.g. RAID0, the RAID0 will return an > > > >> error as such things should not happen - they are prevented by > > > >> merge_bvec_fn. > > > >> > > > >> If merge_bvec_fn is not being honoured, then you MUST limit requests > > > >> to a single entry iovec of at most one page. > > > >> > > > >> NeilBrown > > > > > > > > Thank you for that hint, I will try to write a merge_bvec_fn for > > > > dm-stripe.c which solves the problem, if that is ok? > > > > > > > > Mustafa Mesanovic > > > > > > Now here my new suggestion to fix this issue, what is your opinion? > > > I tested this with different setups, and it worked fine and I had > > > very good performance improvements. > > > > Some minor style nitpicks. > > > > > [RFC][PATCH] dm: improve read performance - v2 > > > > > > This patch adds a merge_fn for the dm stripe target. This merge_fn > > > prevents dm_set_device_limits() setting the max_sectors to 4KiB > > > (PAGE_SIZE). (As in a prior patch already mentioned.) > > > Now the read performance improved up to 3x higher compared to before. > > > > > > What happened before: > > > I/O throughput degradation happened at __bio_add_page() when bio's got > > > checked at the very beginning upon max_sectors. In this setup > > > max_sectors is always set to 8. So bio's entered the dm target with a > > > max of 4KiB. > > > > > > Now dm-stripe target will have its own merge_fn so max_sectors will not > > > pushed down to 8 (4KiB), and bio's can get bigger than 4KiB. > > > > > > Signed-off-by: Mustafa Mesanovic<mume@linux.vnet.ibm.com> > > > --- > > > > > > dm-stripe.c | 24 ++++++++++++++++++++++++ > > > 1 file changed, 24 insertions(+) > > > > > > Index: linux-2.6/drivers/md/dm-stripe.c > > > =================================================================== > > > --- linux-2.6.orig/drivers/md/dm-stripe.c 2011-02-28 10:23:37.000000000 > > > +0100 +++ linux-2.6/drivers/md/dm-stripe.c 2011-02-28 > > > 10:24:29.000000000 +0100 @@ -396,6 +396,29 @@ > > > > > > blk_limits_io_opt(limits, chunk_size * sc->stripes); > > > > > > } > > > > > > +static int stripe_merge(struct dm_target *ti, struct bvec_merge_data > > > *bvm, + struct bio_vec *biovec, int max_size) > > > +{ > > > + struct stripe_c *sc = (struct stripe_c *) ti->private; > > > + sector_t offset, chunk; > > > + uint32_t stripe; > > > + struct request_queue *q; > > > + > > > + offset = bvm->bi_sector - ti->begin; > > > + chunk = offset>> sc->chunk_shift; > > > + stripe = sector_div(chunk, sc->stripes); > > > + > > > + if (!bdev_get_queue(sc->stripe[stripe].dev->bdev)->merge_bvec_fn) > > > + return max_size; > > > + > > > + bvm->bi_bdev = sc->stripe[stripe].dev->bdev; > > > + q = bdev_get_queue(bvm->bi_bdev); > > > > Initializing q at the top would simplify the check fro merge_bvec_fn > > above. > > > > > + bvm->bi_sector = sc->stripe[stripe].physical_start + > > > + (chunk<< sc->chunk_shift) + (offset& sc->chunk_mask); > > > + > > > > Can this be written as > > > > bvm->bi_sector = sc->stripe[stripe].physical_start + > > > > bvm->bi_sector - ti->begin; > > > > or even better > > bvm->bi_sector = sc->stripe[stripe].physical_start + > > > > dm_target_offset(ti, bvm->bi_sector); > > > > > > + return min(max_size, q->merge_bvec_fn(q, bvm, biovec)); > > > +} > > > + > > > > > > static struct target_type stripe_target = { > > > > > > .name = "striped", > > > .version = {1, 3, 1}, > > > > > > @@ -403,6 +426,7 @@ > > > > > > .ctr = stripe_ctr, > > > .dtr = stripe_dtr, > > > .map = stripe_map, > > > > > > + .merge = stripe_merge, > > > > > > .end_io = stripe_end_io, > > > .status = stripe_status, > > > .iterate_devices = stripe_iterate_devices, > > > > Reviewed-by: Nikanth Karthikesan <knikanth@suse.de> > > You reviewed an old version, v4 was posted to dm-devel and is > available here: https://patchwork.kernel.org/patch/639801/ > oops.. sorry. > It should address all your concerns. Yes, it does. Thanks Nikanth ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2011-03-18 5:03 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-12-27 11:19 [RFC][PATCH] dm: improve read performance Mustafa Mesanovic 2010-12-27 11:54 ` Neil Brown 2010-12-27 12:23 ` Mustafa Mesanovic 2011-03-07 10:10 ` Mustafa Mesanovic 2011-03-08 2:21 ` [PATCH v3] dm stripe: implement merge method Mike Snitzer 2011-03-08 10:29 ` Mustafa Mesanovic 2011-03-08 16:48 ` Mike Snitzer 2011-03-10 14:02 ` Mustafa Mesanovic 2011-03-12 22:42 ` Mike Snitzer 2011-03-14 11:54 ` Mustafa Mesanovic 2011-03-14 14:33 ` Mike Snitzer 2011-03-16 20:21 ` [PATCH v4] " Mike Snitzer 2011-03-17 5:12 ` [RFC][PATCH] dm: improve read performance Nikanth Karthikesan 2011-03-17 13:08 ` Mike Snitzer 2011-03-18 4:59 ` Nikanth Karthikesan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox