linux-block.vger.kernel.org archive mirror
* [PATCH RFC] block: fix bio merge checks when virt_boundary is set
@ 2016-03-15 15:17 Vitaly Kuznetsov
  2016-03-15 16:03 ` Keith Busch
  2016-03-16 15:40 ` Ming Lei
  0 siblings, 2 replies; 12+ messages in thread
From: Vitaly Kuznetsov @ 2016-03-15 15:17 UTC
  To: linux-block
  Cc: linux-kernel, Jens Axboe, Dan Williams, Martin K. Petersen,
	Sagi Grimberg, Mike Snitzer, K. Y. Srinivasan, Cathy Avery,
	Keith Busch

The Hyper-V storage driver, which switched to using virt_boundary some time
ago, experiences a significant slowdown on non-page-aligned I/O. For example:

With virt_boundary set:
 # time mkfs.ntfs -Q -s 512 /dev/sdc1
 ...
 real	0m9.406s
 user	0m0.014s
 sys	0m0.672s

Without virt_boundary set (unsafe):
 # time mkfs.ntfs -Q -s 512 /dev/sdc1
 ...
 real	0m6.657s
 user	0m0.012s
 sys	0m6.423s

The reason for the slowdown is that bios don't get merged and we end up
sending many short requests to the host. My investigation led me to
the following code (__bvec_gap_to_prev()):

    return offset ||
           ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));

Here is an example: we have two bio_vecs with the following content:
    bprv.bv_offset = 512
    bprv.bv_len = 512

    bnxt.bv_offset = 1024
    bnxt.bv_len = 512

    bprv.bv_page == bnxt.bv_page
    virt_boundary is set to PAGE_SIZE-1

The above-mentioned code will report that a gap will appear if we merge
these two (as offset = 1024), but this doesn't look sane: the second vector
starts exactly where the first one ends, in the same page. On top of that,
we have the following optimization in bio_add_pc_page():

    if (page == prev->bv_page &&
        offset == prev->bv_offset + prev->bv_len) {
            prev->bv_len += len;
            bio->bi_iter.bi_size += len;
            goto done;
    }

But we don't have such a check in the other places that check virt_boundary.
Modify the check in __bvec_gap_to_prev() to the following:
1) When bprv->bv_page == bnxt->bv_page, report a gap only if
   bnxt->bv_offset != bprv->bv_offset + bprv->bv_len.
2) Otherwise, keep the existing (bprv->bv_offset + bprv->bv_len) &
   queue_virt_boundary(q) check.

Reported-by: John R. Kozee II <jkozee@bowser-morner.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
- The condition I'm changing has been there since SG_GAPS, so I may be
  missing something important, thus RFC.
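
For illustration, the example from the description can be replayed in a small
userspace model. This is only a sketch, not kernel code: struct model_bvec,
old_gap(), new_gap(), check_commit_message_example() and the 4 KiB
MODEL_PAGE_SIZE are hypothetical stand-ins, and the two functions simply
mirror the old and the proposed body of __bvec_gap_to_prev() from the diff
below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct bio_vec (model only). */
struct model_bvec {
	const void *bv_page;	/* identity of the backing page */
	unsigned int bv_len;
	unsigned int bv_offset;
};

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Old check: any non-zero offset of the next vector counts as a gap. */
bool old_gap(unsigned long boundary_mask, const struct model_bvec *prv,
	     unsigned int offset)
{
	return offset || ((prv->bv_offset + prv->bv_len) & boundary_mask);
}

/* Proposed check: same-page vectors gap only if they are not contiguous. */
bool new_gap(unsigned long boundary_mask, const struct model_bvec *prv,
	     const struct model_bvec *nxt)
{
	if (prv->bv_page == nxt->bv_page)
		return nxt->bv_offset != prv->bv_offset + prv->bv_len;
	return (prv->bv_offset + prv->bv_len) & boundary_mask;
}

/* Replays the two vectors from the description (512@512 and 512@1024). */
void check_commit_message_example(void)
{
	static const char page[1];
	struct model_bvec prv = { .bv_page = page, .bv_len = 512, .bv_offset = 512 };
	struct model_bvec nxt = { .bv_page = page, .bv_len = 512, .bv_offset = 1024 };
	unsigned long mask = MODEL_PAGE_SIZE - 1;

	assert(old_gap(mask, &prv, nxt.bv_offset));	/* old check: gap */
	assert(!new_gap(mask, &prv, &nxt));		/* new check: no gap */
}
```

Calling check_commit_message_example() from a test main() exercises both
variants. Note that in this model the different-page branch of new_gap() never
looks at nxt->bv_offset, which may be relevant to the question about the
removed 'offset' check raised later in the thread.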
---
 block/bio-integrity.c  |  7 +++++--
 block/bio.c            |  4 +++-
 block/blk-merge.c      |  2 +-
 include/linux/blkdev.h | 17 +++++++++--------
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 711e4d8d..f8560da 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -136,7 +136,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 			   unsigned int len, unsigned int offset)
 {
 	struct bio_integrity_payload *bip = bio_integrity(bio);
-	struct bio_vec *iv;
+	struct bio_vec *iv, bv;
 
 	if (bip->bip_vcnt >= bip->bip_max_vcnt) {
 		printk(KERN_ERR "%s: bip_vec full\n", __func__);
@@ -144,10 +144,13 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
 	}
 
 	iv = bip->bip_vec + bip->bip_vcnt;
+	bv.bv_page = page;
+	bv.bv_len = len;
+	bv.bv_offset = offset;
 
 	if (bip->bip_vcnt &&
 	    bvec_gap_to_prev(bdev_get_queue(bio->bi_bdev),
-			     &bip->bip_vec[bip->bip_vcnt - 1], offset))
+			     &bip->bip_vec[bip->bip_vcnt - 1], &bv))
 		return 0;
 
 	iv->bv_page = page;
diff --git a/block/bio.c b/block/bio.c
index cf75915..1583581 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -730,6 +730,8 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 	 */
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
+		struct bio_vec bv = {.bv_page = page, .bv_len = len,
+				     .bv_offset = offset};
 
 		if (page == prev->bv_page &&
 		    offset == prev->bv_offset + prev->bv_len) {
@@ -742,7 +744,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvec_gap_to_prev(q, prev, offset))
+		if (bvec_gap_to_prev(q, prev, &bv))
 			return 0;
 	}
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2613531..8c6c3e2 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -100,7 +100,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 		 * If the queue doesn't support SG gaps and adding this
 		 * offset would create a gap, disallow it.
 		 */
-		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
+		if (bvprvp && bvec_gap_to_prev(q, bvprvp, &bv))
 			goto split;
 
 		if (sectors + (bv.bv_len >> 9) > max_sectors) {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 413c84f..b4fa29d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1373,10 +1373,11 @@ static inline void put_dev_sector(Sector p)
 }
 
 static inline bool __bvec_gap_to_prev(struct request_queue *q,
-				struct bio_vec *bprv, unsigned int offset)
+				struct bio_vec *bprv, struct bio_vec *bnxt)
 {
-	return offset ||
-		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
+	if (bprv->bv_page == bnxt->bv_page)
+		return bnxt->bv_offset != bprv->bv_offset + bprv->bv_len;
+	return (bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q);
 }
 
 /*
@@ -1384,11 +1385,11 @@ static inline bool __bvec_gap_to_prev(struct request_queue *q,
  * the SG list. Most drivers don't care about this, but some do.
  */
 static inline bool bvec_gap_to_prev(struct request_queue *q,
-				struct bio_vec *bprv, unsigned int offset)
+				struct bio_vec *bprv, struct bio_vec *bnxt)
 {
 	if (!queue_virt_boundary(q))
 		return false;
-	return __bvec_gap_to_prev(q, bprv, offset);
+	return __bvec_gap_to_prev(q, bprv, bnxt);
 }
 
 static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
@@ -1400,7 +1401,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
 		bio_get_last_bvec(prev, &pb);
 		bio_get_first_bvec(next, &nb);
 
-		return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
+		return __bvec_gap_to_prev(q, &pb, &nb);
 	}
 
 	return false;
@@ -1545,7 +1546,7 @@ static inline bool integrity_req_gap_back_merge(struct request *req,
 	struct bio_integrity_payload *bip_next = bio_integrity(next);
 
 	return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
-				bip_next->bip_vec[0].bv_offset);
+				&bip_next->bip_vec[0]);
 }
 
 static inline bool integrity_req_gap_front_merge(struct request *req,
@@ -1555,7 +1556,7 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
 	struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
 
 	return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
-				bip_next->bip_vec[0].bv_offset);
+				&bip_next->bip_vec[0]);
 }
 
 #else /* CONFIG_BLK_DEV_INTEGRITY */
-- 
2.5.0


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-15 15:17 [PATCH RFC] block: fix bio merge checks when virt_boundary is set Vitaly Kuznetsov
@ 2016-03-15 16:03 ` Keith Busch
  2016-03-16 10:17   ` Vitaly Kuznetsov
  2016-03-16 15:40 ` Ming Lei
  1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2016-03-15 16:03 UTC
  To: Vitaly Kuznetsov
  Cc: linux-block, linux-kernel, Jens Axboe, Dan Williams,
	Martin K. Petersen, Sagi Grimberg, Mike Snitzer, K. Y. Srinivasan,
	Cathy Avery

On Tue, Mar 15, 2016 at 04:17:56PM +0100, Vitaly Kuznetsov wrote:
> The reason of the slowdown is the fact that bios don't get merged and we
> end up sending many short requests to the host. My investigation led me to
> the following code (__bvec_gap_to_prev()):
> 
>     return offset ||
>            ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
> 
> Here is an example: we have two bio_vec with the following content:
>     bprv.bv_offset = 512
>     bprv.bv_len = 512
> 
>     bnxt.bv_offset = 1024
>     bnxt.bv_len = 512
> 
>     bprv.bv_page == bnxt.bv_page
>     virt_boundary is set to PAGE_SIZE-1
> 
> The above mentioned code will report that a gap will appear if we merge
> these two (as offset = 1024) but this doesn't look sane. On top of that,
> we have the following optimization in bio_add_pc_page():
> 
>     if (page == prev->bv_page &&
>         offset == prev->bv_offset + prev->bv_len) {
>             prev->bv_len += len;
>             bio->bi_iter.bi_size += len;
>             goto done;
>         }

This part sounds odd. Why is a filesystem using bio_add_pc_page? Shouldn't
these go through "bio_add_page" instead? That already has an optimization
to combine bio's within the same page.


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-15 16:03 ` Keith Busch
@ 2016-03-16 10:17   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 12+ messages in thread
From: Vitaly Kuznetsov @ 2016-03-16 10:17 UTC
  To: Keith Busch
  Cc: linux-block, linux-kernel, Jens Axboe, Dan Williams,
	Martin K. Petersen, Sagi Grimberg, Mike Snitzer, K. Y. Srinivasan,
	Cathy Avery

Keith Busch <keith.busch@intel.com> writes:

> On Tue, Mar 15, 2016 at 04:17:56PM +0100, Vitaly Kuznetsov wrote:
>> The reason of the slowdown is the fact that bios don't get merged and we
>> end up sending many short requests to the host. My investigation led me to
>> the following code (__bvec_gap_to_prev()):
>> 
>>     return offset ||
>>            ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>> 
>> Here is an example: we have two bio_vec with the following content:
>>     bprv.bv_offset = 512
>>     bprv.bv_len = 512
>> 
>>     bnxt.bv_offset = 1024
>>     bnxt.bv_len = 512
>> 
>>     bprv.bv_page == bnxt.bv_page
>>     virt_boundary is set to PAGE_SIZE-1
>> 
>> The above mentioned code will report that a gap will appear if we merge
>> these two (as offset = 1024) but this doesn't look sane. On top of that,
>> we have the following optimization in bio_add_pc_page():
>> 
>>     if (page == prev->bv_page &&
>>         offset == prev->bv_offset + prev->bv_len) {
>>             prev->bv_len += len;
>>             bio->bi_iter.bi_size += len;
>>             goto done;
>>         }
>
> This part sounds odd. Why is a filesystem using bio_add_pc_page? Shouldn't
> these go through "bio_add_page" instead? That already has an optimization
> to combine bio's within the same page.

I'm not sure I know enough to comment here, and it is most probably unrelated
to the issue I'm seeing (bio_add_pc_page() doesn't show up when I run
'mkfs.ntfs'). In this particular place, however, there is a same-page check
before we call bvec_gap_to_prev(); there is no such check in the other places,
so bios in the same page are always split:

return offset || ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));

will always return 'true' when the second bio starts at a non-zero offset
within the page, because 'offset' is the offset of the second bio. That's
what I'm trying to address.

-- 
  Vitaly


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-15 15:17 [PATCH RFC] block: fix bio merge checks when virt_boundary is set Vitaly Kuznetsov
  2016-03-15 16:03 ` Keith Busch
@ 2016-03-16 15:40 ` Ming Lei
  2016-03-16 16:26   ` Vitaly Kuznetsov
  1 sibling, 1 reply; 12+ messages in thread
From: Ming Lei @ 2016-03-16 15:40 UTC
  To: Vitaly Kuznetsov
  Cc: linux-block, Linux Kernel Mailing List, Jens Axboe, Dan Williams,
	Martin K. Petersen, Sagi Grimberg, Mike Snitzer, K. Y. Srinivasan,
	Cathy Avery, Keith Busch

On Tue, Mar 15, 2016 at 11:17 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> Hyper-V storage driver, which switched to using virt_boundary some time
> ago, experiences significant slowdown on non-page-aligned IO. E.g.
>
> With virt_boundary set:
>  # time mkfs.ntfs -Q -s 512 /dev/sdc1
>  ...
>  real   0m9.406s
>  user   0m0.014s
>  sys    0m0.672s
>
> Without virt_boundary set (unsafe):
>  # time mkfs.ntfs -Q -s 512 /dev/sdc1
>  ...
>  real   0m6.657s
>  user   0m0.012s
>  sys    0m6.423s
>
> The reason of the slowdown is the fact that bios don't get merged and we
> end up sending many short requests to the host. My investigation led me to
> the following code (__bvec_gap_to_prev()):
>
>     return offset ||
>            ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>
> Here is an example: we have two bio_vec with the following content:
>     bprv.bv_offset = 512
>     bprv.bv_len = 512
>
>     bnxt.bv_offset = 1024
>     bnxt.bv_len = 512
>
>     bprv.bv_page == bnxt.bv_page
>     virt_boundary is set to PAGE_SIZE-1
>
> The above mentioned code will report that a gap will appear if we merge
> these two (as offset = 1024) but this doesn't look sane. On top of that,
> we have the following optimization in bio_add_pc_page():
>
>     if (page == prev->bv_page &&
>         offset == prev->bv_offset + prev->bv_len) {
>             prev->bv_len += len;
>             bio->bi_iter.bi_size += len;
>             goto done;
>         }
>
> But we don't have such check in other places, which check virt_boundary.

We do have the above merge in bio_add_page(), so the two bios in
your example above shouldn't be observed if the two buffers
are added to the bio via bio_add_page().

If you do see short bios as in the example above, maybe you need to check
the ntfs code:

- whether bio_add_page() is used to add the buffers
- whether one standalone bio is used to transfer each 512 bytes, even though
  they are in the same page and the sectors are contiguous
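
The same-page merge referred to above can be sketched in userspace as follows.
This is a simplified model under stated assumptions, not the kernel's
bio_add_page(): struct model_bvec, struct model_bio, MODEL_MAX_VECS,
model_bio_add_page() and demo_same_page_merge() are hypothetical names, and
only the "extend the previous vector" step is modeled.

```c
#include <assert.h>

#define MODEL_MAX_VECS 4	/* arbitrary capacity for the model */

/* Hypothetical stand-ins for struct bio_vec and struct bio (model only). */
struct model_bvec {
	const void *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct model_bio {
	struct model_bvec vecs[MODEL_MAX_VECS];
	unsigned int vcnt;	/* number of vectors in use */
	unsigned int size;	/* total payload in bytes */
};

/* Returns the number of bytes added, or 0 if the model bio is full. */
unsigned int model_bio_add_page(struct model_bio *bio, const void *page,
				unsigned int len, unsigned int offset)
{
	if (bio->vcnt > 0) {
		struct model_bvec *prev = &bio->vecs[bio->vcnt - 1];

		/* Same page and contiguous: grow the previous vector. */
		if (page == prev->bv_page &&
		    offset == prev->bv_offset + prev->bv_len) {
			prev->bv_len += len;
			bio->size += len;
			return len;
		}
	}

	if (bio->vcnt >= MODEL_MAX_VECS)
		return 0;

	/* Otherwise start a new vector. */
	bio->vecs[bio->vcnt].bv_page = page;
	bio->vecs[bio->vcnt].bv_len = len;
	bio->vecs[bio->vcnt].bv_offset = offset;
	bio->vcnt++;
	bio->size += len;
	return len;
}

/* Two 512-byte buffers, contiguous in one page, end up in one vector. */
void demo_same_page_merge(void)
{
	static const char page[1];
	struct model_bio bio = { .vcnt = 0 };

	model_bio_add_page(&bio, page, 512, 512);
	model_bio_add_page(&bio, page, 512, 1024);
	assert(bio.vcnt == 1);
	assert(bio.vecs[0].bv_len == 1024);
}
```

In this model the two buffers from the example never become separate vectors
inside one bio; they can still end up in separate bios if the submitter
builds one bio per buffer.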

> Modify the check in __bvec_gap_to_prev() to the following:
> 1) Report no gap in case bnxt->bv_offset == bprv->bv_offset + bprv->bv_len
>    when bprv.bv_page == bnxt.bv_page.
> 2) Continue reporting no gap in (bprv->bv_offset + bprv->bv_len) &
>    queue_virt_boundary(q) case.
>
> Reported-by: John R. Kozee II <jkozee@bowser-morner.com>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
> - The condition I'm changing was there since SG_GAPS so I may be missing
>   something important, thus RFC.
> ---
>  block/bio-integrity.c  |  7 +++++--
>  block/bio.c            |  4 +++-
>  block/blk-merge.c      |  2 +-
>  include/linux/blkdev.h | 17 +++++++++--------
>  4 files changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
> index 711e4d8d..f8560da 100644
> --- a/block/bio-integrity.c
> +++ b/block/bio-integrity.c
> @@ -136,7 +136,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
>                            unsigned int len, unsigned int offset)
>  {
>         struct bio_integrity_payload *bip = bio_integrity(bio);
> -       struct bio_vec *iv;
> +       struct bio_vec *iv, bv;
>
>         if (bip->bip_vcnt >= bip->bip_max_vcnt) {
>                 printk(KERN_ERR "%s: bip_vec full\n", __func__);
> @@ -144,10 +144,13 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
>         }
>
>         iv = bip->bip_vec + bip->bip_vcnt;
> +       bv.bv_page = page;
> +       bv.bv_len = len;
> +       bv.bv_offset = offset;
>
>         if (bip->bip_vcnt &&
>             bvec_gap_to_prev(bdev_get_queue(bio->bi_bdev),
> -                            &bip->bip_vec[bip->bip_vcnt - 1], offset))
> +                            &bip->bip_vec[bip->bip_vcnt - 1], &bv))
>                 return 0;
>
>         iv->bv_page = page;
> diff --git a/block/bio.c b/block/bio.c
> index cf75915..1583581 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -730,6 +730,8 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>          */
>         if (bio->bi_vcnt > 0) {
>                 struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +               struct bio_vec bv = {.bv_page = page, .bv_len = len,
> +                                    .bv_offset = offset};
>
>                 if (page == prev->bv_page &&
>                     offset == prev->bv_offset + prev->bv_len) {
> @@ -742,7 +744,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>                  * If the queue doesn't support SG gaps and adding this
>                  * offset would create a gap, disallow it.
>                  */
> -               if (bvec_gap_to_prev(q, prev, offset))
> +               if (bvec_gap_to_prev(q, prev, &bv))
>                         return 0;
>         }
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..8c6c3e2 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -100,7 +100,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>                  * If the queue doesn't support SG gaps and adding this
>                  * offset would create a gap, disallow it.
>                  */
> -               if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
> +               if (bvprvp && bvec_gap_to_prev(q, bvprvp, &bv))
>                         goto split;
>
>                 if (sectors + (bv.bv_len >> 9) > max_sectors) {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 413c84f..b4fa29d 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1373,10 +1373,11 @@ static inline void put_dev_sector(Sector p)
>  }
>
>  static inline bool __bvec_gap_to_prev(struct request_queue *q,
> -                               struct bio_vec *bprv, unsigned int offset)
> +                               struct bio_vec *bprv, struct bio_vec *bnxt)
>  {
> -       return offset ||
> -               ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
> +       if (bprv->bv_page == bnxt->bv_page)
> +               return bnxt->bv_offset != bprv->bv_offset + bprv->bv_len;
> +       return (bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q);

Why do you remove the check on 'offset'?

>  }
>
>  /*
> @@ -1384,11 +1385,11 @@ static inline bool __bvec_gap_to_prev(struct request_queue *q,
>   * the SG list. Most drivers don't care about this, but some do.
>   */
>  static inline bool bvec_gap_to_prev(struct request_queue *q,
> -                               struct bio_vec *bprv, unsigned int offset)
> +                               struct bio_vec *bprv, struct bio_vec *bnxt)
>  {
>         if (!queue_virt_boundary(q))
>                 return false;
> -       return __bvec_gap_to_prev(q, bprv, offset);
> +       return __bvec_gap_to_prev(q, bprv, bnxt);
>  }
>
>  static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
> @@ -1400,7 +1401,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
>                 bio_get_last_bvec(prev, &pb);
>                 bio_get_first_bvec(next, &nb);
>
> -               return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
> +               return __bvec_gap_to_prev(q, &pb, &nb);
>         }
>
>         return false;
> @@ -1545,7 +1546,7 @@ static inline bool integrity_req_gap_back_merge(struct request *req,
>         struct bio_integrity_payload *bip_next = bio_integrity(next);
>
>         return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
> -                               bip_next->bip_vec[0].bv_offset);
> +                               &bip_next->bip_vec[0]);
>  }
>
>  static inline bool integrity_req_gap_front_merge(struct request *req,
> @@ -1555,7 +1556,7 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
>         struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
>
>         return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
> -                               bip_next->bip_vec[0].bv_offset);
> +                               &bip_next->bip_vec[0]);
>  }
>
>  #else /* CONFIG_BLK_DEV_INTEGRITY */
> --
> 2.5.0



-- 
Ming Lei


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-16 15:40 ` Ming Lei
@ 2016-03-16 16:26   ` Vitaly Kuznetsov
  2016-03-16 22:38     ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Vitaly Kuznetsov @ 2016-03-16 16:26 UTC
  To: Ming Lei
  Cc: linux-block, Linux Kernel Mailing List, Jens Axboe, Dan Williams,
	Martin K. Petersen, Sagi Grimberg, Mike Snitzer, K. Y. Srinivasan,
	Cathy Avery, Keith Busch

Ming Lei <tom.leiming@gmail.com> writes:

> On Tue, Mar 15, 2016 at 11:17 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> Hyper-V storage driver, which switched to using virt_boundary some time
>> ago, experiences significant slowdown on non-page-aligned IO. E.g.
>>
>> With virt_boundary set:
>>  # time mkfs.ntfs -Q -s 512 /dev/sdc1
>>  ...
>>  real   0m9.406s
>>  user   0m0.014s
>>  sys    0m0.672s
>>
>> Without virt_boundary set (unsafe):
>>  # time mkfs.ntfs -Q -s 512 /dev/sdc1
>>  ...
>>  real   0m6.657s
>>  user   0m0.012s
>>  sys    0m6.423s
>>
>> The reason of the slowdown is the fact that bios don't get merged and we
>> end up sending many short requests to the host. My investigation led me to
>> the following code (__bvec_gap_to_prev()):
>>
>>     return offset ||
>>            ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>>
>> Here is an example: we have two bio_vec with the following content:
>>     bprv.bv_offset = 512
>>     bprv.bv_len = 512
>>
>>     bnxt.bv_offset = 1024
>>     bnxt.bv_len = 512
>>
>>     bprv.bv_page == bnxt.bv_page
>>     virt_boundary is set to PAGE_SIZE-1
>>
>> The above mentioned code will report that a gap will appear if we merge
>> these two (as offset = 1024) but this doesn't look sane. On top of that,
>> we have the following optimization in bio_add_pc_page():
>>
>>     if (page == prev->bv_page &&
>>         offset == prev->bv_offset + prev->bv_len) {
>>             prev->bv_len += len;
>>             bio->bi_iter.bi_size += len;
>>             goto done;
>>         }
>>
>> But we don't have such check in other places, which check virt_boundary.
>
> We do have the above merge in bio_add_page(), so the two bios in
> your above example shouldn't have been observed if the two buffers
> are added to bio via the bio_add_page().
>
> If you see short bios in above example, maybe you need to check ntfs code:
>
> - if bio_add_page() is used to add buffer
> - if using one standalone bio to transfer each 512byte, even they
> are in same page and the sector is continuous

I'm not using ntfs; mkfs.ntfs is a userspace application that shows the
regression when virt_boundary is in place. I should have avoided
mentioning bio_add_pc_page() here, as it is unrelated to the issue.

In particular, I'm concerned about the following call sites:
blk_bio_segment_split()
ll_back_merge_fn()
ll_front_merge_fn()

>> Modify the check in __bvec_gap_to_prev() to the following:
>> 1) Report no gap in case bnxt->bv_offset == bprv->bv_offset + bprv->bv_len
>>    when bprv.bv_page == bnxt.bv_page.
>> 2) Continue reporting no gap in (bprv->bv_offset + bprv->bv_len) &
>>    queue_virt_boundary(q) case.
>>
>> Reported-by: John R. Kozee II <jkozee@bowser-morner.com>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> - The condition I'm changing was there since SG_GAPS so I may be missing
>>   something important, thus RFC.
>> ---
>>  block/bio-integrity.c  |  7 +++++--
>>  block/bio.c            |  4 +++-
>>  block/blk-merge.c      |  2 +-
>>  include/linux/blkdev.h | 17 +++++++++--------
>>  4 files changed, 18 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
>> index 711e4d8d..f8560da 100644
>> --- a/block/bio-integrity.c
>> +++ b/block/bio-integrity.c
>> @@ -136,7 +136,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
>>                            unsigned int len, unsigned int offset)
>>  {
>>         struct bio_integrity_payload *bip = bio_integrity(bio);
>> -       struct bio_vec *iv;
>> +       struct bio_vec *iv, bv;
>>
>>         if (bip->bip_vcnt >= bip->bip_max_vcnt) {
>>                 printk(KERN_ERR "%s: bip_vec full\n", __func__);
>> @@ -144,10 +144,13 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
>>         }
>>
>>         iv = bip->bip_vec + bip->bip_vcnt;
>> +       bv.bv_page = page;
>> +       bv.bv_len = len;
>> +       bv.bv_offset = offset;
>>
>>         if (bip->bip_vcnt &&
>>             bvec_gap_to_prev(bdev_get_queue(bio->bi_bdev),
>> -                            &bip->bip_vec[bip->bip_vcnt - 1], offset))
>> +                            &bip->bip_vec[bip->bip_vcnt - 1], &bv))
>>                 return 0;
>>
>>         iv->bv_page = page;
>> diff --git a/block/bio.c b/block/bio.c
>> index cf75915..1583581 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -730,6 +730,8 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>>          */
>>         if (bio->bi_vcnt > 0) {
>>                 struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
>> +               struct bio_vec bv = {.bv_page = page, .bv_len = len,
>> +                                    .bv_offset = offset};
>>
>>                 if (page == prev->bv_page &&
>>                     offset == prev->bv_offset + prev->bv_len) {
>> @@ -742,7 +744,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>>                  * If the queue doesn't support SG gaps and adding this
>>                  * offset would create a gap, disallow it.
>>                  */
>> -               if (bvec_gap_to_prev(q, prev, offset))
>> +               if (bvec_gap_to_prev(q, prev, &bv))
>>                         return 0;
>>         }
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 2613531..8c6c3e2 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -100,7 +100,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>                  * If the queue doesn't support SG gaps and adding this
>>                  * offset would create a gap, disallow it.
>>                  */
>> -               if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
>> +               if (bvprvp && bvec_gap_to_prev(q, bvprvp, &bv))
>>                         goto split;
>>
>>                 if (sectors + (bv.bv_len >> 9) > max_sectors) {
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 413c84f..b4fa29d 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1373,10 +1373,11 @@ static inline void put_dev_sector(Sector p)
>>  }
>>
>>  static inline bool __bvec_gap_to_prev(struct request_queue *q,
>> -                               struct bio_vec *bprv, unsigned int offset)
>> +                               struct bio_vec *bprv, struct bio_vec *bnxt)
>>  {
>> -       return offset ||
>> -               ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>> +       if (bprv->bv_page == bnxt->bv_page)
>> +               return bnxt->bv_offset != bprv->bv_offset + bprv->bv_len;
>> +       return (bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q);
>
> Why do you remove check on 'offset'?
>

Because this check is wrong in my opinion and that's what's causing the
issue.

Let me try to give my example again.

We have two bios,

     bprv.bv_offset = 512
     bprv.bv_len = 512

     bnxt.bv_offset = 1024
     bnxt.bv_len = 512

     bprv.bv_page == bnxt.bv_page
     virt_boundary is set to PAGE_SIZE-1

we call __bvec_gap_to_prev(q, &bprv, bnxt.bv_offset) and the 'offset' check
will report that a gap will appear if we merge these two bios. This
seems wrong.

>>  }
>>
>>  /*
>> @@ -1384,11 +1385,11 @@ static inline bool __bvec_gap_to_prev(struct request_queue *q,
>>   * the SG list. Most drivers don't care about this, but some do.
>>   */
>>  static inline bool bvec_gap_to_prev(struct request_queue *q,
>> -                               struct bio_vec *bprv, unsigned int offset)
>> +                               struct bio_vec *bprv, struct bio_vec *bnxt)
>>  {
>>         if (!queue_virt_boundary(q))
>>                 return false;
>> -       return __bvec_gap_to_prev(q, bprv, offset);
>> +       return __bvec_gap_to_prev(q, bprv, bnxt);
>>  }
>>
>>  static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
>> @@ -1400,7 +1401,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
>>                 bio_get_last_bvec(prev, &pb);
>>                 bio_get_first_bvec(next, &nb);
>>
>> -               return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
>> +               return __bvec_gap_to_prev(q, &pb, &nb);
>>         }
>>
>>         return false;
>> @@ -1545,7 +1546,7 @@ static inline bool integrity_req_gap_back_merge(struct request *req,
>>         struct bio_integrity_payload *bip_next = bio_integrity(next);
>>
>>         return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
>> -                               bip_next->bip_vec[0].bv_offset);
>> +                               &bip_next->bip_vec[0]);
>>  }
>>
>>  static inline bool integrity_req_gap_front_merge(struct request *req,
>> @@ -1555,7 +1556,7 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
>>         struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
>>
>>         return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
>> -                               bip_next->bip_vec[0].bv_offset);
>> +                               &bip_next->bip_vec[0]);
>>  }
>>
>>  #else /* CONFIG_BLK_DEV_INTEGRITY */
>> --
>> 2.5.0

-- 
  Vitaly


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-16 16:26   ` Vitaly Kuznetsov
@ 2016-03-16 22:38     ` Keith Busch
  2016-03-17 11:20       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2016-03-16 22:38 UTC
  To: Vitaly Kuznetsov
  Cc: Ming Lei, linux-block, Linux Kernel Mailing List, Jens Axboe,
	Dan Williams, Martin K. Petersen, Sagi Grimberg, Mike Snitzer,
	K. Y. Srinivasan, Cathy Avery

On Wed, Mar 16, 2016 at 05:26:28PM +0100, Vitaly Kuznetsov wrote:
> Ming Lei <tom.leiming@gmail.com> writes:
> > We do have the above merge in bio_add_page(), so the two bios in
> > your above example shouldn't have been observed if the two buffers
> > are added to bio via the bio_add_page().
> >
> > If you see short bios in above example, maybe you need to check ntfs code:
> >
> > - if bio_add_page() is used to add buffer
> > - if using one standalone bio to transfer each 512byte, even they
> > are in same page and the sector is continuous
> 
> I'm not using ntfs, mkfs.ntfs is a userspace application which shows the
> regression when virt_boundary is in place. I should have avoided
> mentioning bio_add_pc_page() here as it is unrelated to the issue.
> 
> In particular, I'm concerned about the following call sites:
> blk_bio_segment_split()
> ll_back_merge_fn()
> ll_front_merge_fn()

I don't think blk_bio_segment_split would have seen such a bio vector
if its pages were added with bio_add_page. Those should already have
been combined. In any case, I think you can get what you're after just
by moving the gap check after BIOVEC_PHYS_MERGEABLE. Does the following
look ok to you?

---
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2613531..4aa8e44 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -96,13 +96,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 	const unsigned max_sectors = get_max_io_size(q, bio);
 
 	bio_for_each_segment(bv, bio, iter) {
-		/*
-		 * If the queue doesn't support SG gaps and adding this
-		 * offset would create a gap, disallow it.
-		 */
-		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
-			goto split;
-
 		if (sectors + (bv.bv_len >> 9) > max_sectors) {
 			/*
 			 * Consider this a new segment if we're splitting in
@@ -139,6 +132,13 @@ new_segment:
 		if (nsegs == queue_max_segments(q))
 			goto split;
 
+		/*
+		 * If the queue doesn't support SG gaps and adding this
+		 * offset would create a gap, disallow it.
+		 */
+		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
+			goto split;
+
 		nsegs++;
 		bvprv = bv;
 		bvprvp = &bvprv;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 413c84f..69cffbe 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1400,7 +1400,8 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
 		bio_get_last_bvec(prev, &pb);
 		bio_get_first_bvec(next, &nb);
 
-		return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
+		if (!BIOVEC_PHYS_MERGEABLE(&pb, &nb))
+			return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
 	}
 
 	return false;
--


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-16 22:38     ` Keith Busch
@ 2016-03-17 11:20       ` Vitaly Kuznetsov
  2016-03-17 16:39         ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Vitaly Kuznetsov @ 2016-03-17 11:20 UTC (permalink / raw)
  To: Keith Busch
  Cc: Ming Lei, linux-block, Linux Kernel Mailing List, Jens Axboe,
	Dan Williams, Martin K. Petersen, Sagi Grimberg, Mike Snitzer,
	K. Y. Srinivasan, Cathy Avery

Keith Busch <keith.busch@intel.com> writes:

> On Wed, Mar 16, 2016 at 05:26:28PM +0100, Vitaly Kuznetsov wrote:
>> Ming Lei <tom.leiming@gmail.com> writes:
>> > We do have the above merge in bio_add_page(), so the two bios in
>> > your above example shouldn't have been observed if the two buffers
>> > are added to bio via the bio_add_page().
>> >
>> > If you see short bios in above example, maybe you need to check ntfs code:
>> >
>> > - if bio_add_page() is used to add buffer
>> > - if using one standalone bio to transfer each 512byte, even they
>> > are in same page and the sector is continuous
>> 
>> I'm not using ntfs, mkfs.ntfs is a userspace application which shows the
>> regression when virt_boundary is in place. I should have avoided
>> mentioning bio_add_pc_page() here as it is unrelated to the issue.
>> 
>> In particular, I'm concearned about the following call sites:
>> blk_bio_segment_split()
>> ll_back_merge_fn()
>> ll_front_merge_fn()
>
> I don't think blk_bio_segment_split would have seen such a bio vector
> if it pages were added with bio_add_page. Those should already have
> been combined. In any case, I think you can get what you're after just
> by moving the gap check after BIOVEC_PHYS_MERGABLE. Does the following
> look ok to you?
>

Thanks, it does.

Just tested against 4.5, the test was:

# time mkfs.ntfs -s 512 -Q /dev/sdc1

The results are:

non-patched kernel:
real 0m35.552s
user 0m0.006s
sys 0m28.316s

my patch:
real 0m6.277s
user 0m0.010s
sys 0m5.870s

your patch:
real 0m4.247s
user 0m0.005s
sys 0m4.136s

Will you send it or would you like me to do that with your Suggested-by?

(a nitpick below)

> ---
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..4aa8e44 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -96,13 +96,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>  	const unsigned max_sectors = get_max_io_size(q, bio);
>
>  	bio_for_each_segment(bv, bio, iter) {
> -		/*
> -		 * If the queue doesn't support SG gaps and adding this
> -		 * offset would create a gap, disallow it.
> -		 */
> -		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
> -			goto split;
> -
>  		if (sectors + (bv.bv_len >> 9) > max_sectors) {
>  			/*
>  			 * Consider this a new segment if we're splitting in
> @@ -139,6 +132,13 @@ new_segment:
>  		if (nsegs == queue_max_segments(q))
>  			goto split;
>
> +		/*
> +		 * If the queue doesn't support SG gaps and adding this
> +		 * offset would create a gap, disallow it.
> +		 */
> +		if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
> +			goto split;
> +
>  		nsegs++;
>  		bvprv = bv;
>  		bvprvp = &bvprv;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 413c84f..69cffbe 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1400,7 +1400,8 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
>  		bio_get_last_bvec(prev, &pb);
>  		bio_get_first_bvec(next, &nb);
>
> -		return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
> +		if (!BIOVEC_PHYS_MERGEABLE(&pb, &nb))
> +			return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
>  	}

Any reason to put this check here rather than in __bvec_gap_to_prev()?
I find it misleading that __bvec_gap_to_prev() reports a gap when offset
!= 0 without checking BIOVEC_PHYS_MERGEABLE().
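
[Editorial illustration, not part of the original mail.] The point above can
be sketched as a small userspace model. It assumes a bvec can be reduced to a
physical page address, an offset and a length; the names bvec_model,
gap_to_prev(), phys_mergeable() and gap_unless_contiguous() are hypothetical
stand-ins for this sketch, not kernel API. The first helper mirrors the
expression quoted from __bvec_gap_to_prev() at the start of the thread; the
last one shows what skipping the gap check for physically contiguous bvecs
would look like.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct bio_vec: the page is reduced to its
 * physical base address (an assumption made for this sketch). */
struct bvec_model {
	unsigned long long page_phys;
	unsigned int offset;
	unsigned int len;
};

/* Mirrors the expression quoted from __bvec_gap_to_prev(): any non-zero
 * offset of the next bvec, or a previous bvec that does not end on the
 * virt boundary, is reported as a gap. */
static bool gap_to_prev(unsigned long long virt_boundary_mask,
			const struct bvec_model *prev,
			unsigned int next_offset)
{
	return next_offset ||
	       ((prev->offset + prev->len) & virt_boundary_mask);
}

/* True when next starts at the byte right after prev ends. */
static bool phys_mergeable(const struct bvec_model *prev,
			   const struct bvec_model *next)
{
	return prev->page_phys + prev->offset + prev->len ==
	       next->page_phys + next->offset;
}

/* Hypothetical variant: contiguous bvecs never count as a gap. */
static bool gap_unless_contiguous(unsigned long long virt_boundary_mask,
				  const struct bvec_model *prev,
				  const struct bvec_model *next)
{
	if (phys_mergeable(prev, next))
		return false;
	return gap_to_prev(virt_boundary_mask, prev, next->offset);
}

/* The two-bvec example from the first mail: same page, 512 + 512 bytes,
 * followed by 512 bytes at offset 1024, virt_boundary = PAGE_SIZE - 1. */
static void check_thread_example(void)
{
	struct bvec_model prev = { 0x100000ULL, 512, 512 };
	struct bvec_model next = { 0x100000ULL, 1024, 512 };

	assert(gap_to_prev(4095, &prev, next.offset));       /* gap reported */
	assert(!gap_unless_contiguous(4095, &prev, &next));  /* no gap */
}
```

Calling check_thread_example() exercises both variants on the example data.
The model ignores segment size and segment boundary limits, which matter for
merging across bios.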

>
>  	return false;
> --

-- 
  Vitaly


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-17 11:20       ` Vitaly Kuznetsov
@ 2016-03-17 16:39         ` Keith Busch
  2016-03-18  2:59           ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2016-03-17 16:39 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Ming Lei, linux-block, Linux Kernel Mailing List, Jens Axboe,
	Dan Williams, Martin K. Petersen, Sagi Grimberg, Mike Snitzer,
	K. Y. Srinivasan, Cathy Avery

On Thu, Mar 17, 2016 at 12:20:28PM +0100, Vitaly Kuznetsov wrote:
> Keith Busch <keith.busch@intel.com> writes:
> > been combined. In any case, I think you can get what you're after just
> > by moving the gap check after BIOVEC_PHYS_MERGABLE. Does the following
> > look ok to you?
> >
> 
> Thanks, it does.

Cool, thanks for confirming.

> Will you send it or would you like me to do that with your Suggested-by?

I'm not confident yet that this doesn't break anything, particularly since
we moved the gap check after the length check. I just wanted to confirm
the concept addressed your concern, but I still need to take a closer look
and test before submitting.


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-17 16:39         ` Keith Busch
@ 2016-03-18  2:59           ` Ming Lei
  2016-03-30 13:07             ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2016-03-18  2:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: Vitaly Kuznetsov, linux-block, Linux Kernel Mailing List,
	Jens Axboe, Dan Williams, Martin K. Petersen, Sagi Grimberg,
	Mike Snitzer, K. Y. Srinivasan, Cathy Avery

On Fri, Mar 18, 2016 at 12:39 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Thu, Mar 17, 2016 at 12:20:28PM +0100, Vitaly Kuznetsov wrote:
>> Keith Busch <keith.busch@intel.com> writes:
>> > been combined. In any case, I think you can get what you're after just
>> > by moving the gap check after BIOVEC_PHYS_MERGABLE. Does the following
>> > look ok to you?
>> >
>>
>> Thanks, it does.
>
> Cool, thanks for confirming.
>
>> Will you send it or would you like me to do that with your Suggested-by?
>
> I'm not confident yet this doesn't break anything, particularly since
> we moved the gap check after the length check. Just wanted to confirm
> the concept addressed your concern, but still need to take a closer look
> and test before submitting.

IMO, the change to blk_bio_segment_split() is correct: this is really an sg
gap, so the check should be done between segments rather than between bvecs.
It is therefore reasonable to move the check to just before a new segment is
populated.

But I am still not sure that the 2nd change, in bio_will_gap(), which should
fix Vitaly's problem, is completely correct. bio_will_gap() is used to check
whether two bios may be merged. Even if two bios are physically contiguous,
the last bvec of the 1st bio and the first bvec of the 2nd bio might not end
up in the same segment because of the segment size limit.
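
[Editorial illustration, not part of the original mail.] The concern above can
be sketched as a standalone model: physical contiguity alone does not
guarantee that two bvecs land in one segment. The struct queue_limits_model
and the function can_share_segment() are hypothetical names for this sketch.
Using the start of the tail segment for the boundary test is a simplifying
assumption, not a copy of the kernel macros.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the queue limits that matter here. */
struct queue_limits_model {
	unsigned long long seg_boundary_mask;
	unsigned int max_segment_size;
};

/*
 * Can a bvec starting at next_start (next_len bytes, next_len > 0) be
 * appended to the segment [seg_start, seg_start + seg_len)?
 */
static bool can_share_segment(const struct queue_limits_model *lim,
			      unsigned long long seg_start,
			      unsigned int seg_len,
			      unsigned long long next_start,
			      unsigned int next_len)
{
	unsigned long long mask = lim->seg_boundary_mask;
	unsigned long long last = next_start + next_len - 1;

	/* 1) The new data must follow the segment with no hole. */
	if (seg_start + seg_len != next_start)
		return false;
	/* 2) First and last byte must sit in the same boundary window. */
	if ((seg_start | mask) != (last | mask))
		return false;
	/* 3) The grown segment must not exceed the size limit. */
	if ((unsigned long long)seg_len + next_len > lim->max_segment_size)
		return false;
	return true;
}

/* Contiguous data that still cannot share a segment: the limit is 4096. */
static void check_size_limit_example(void)
{
	struct queue_limits_model small = { 0xffffffffULL, 4096 };
	struct queue_limits_model large = { 0xffffffffULL, 65536 };

	assert(!can_share_segment(&small, 0x10000ULL, 4096, 0x11000ULL, 512));
	assert(can_share_segment(&large, 0x10000ULL, 4096, 0x11000ULL, 512));
}
```

With a 4096-byte segment limit the contiguous 512-byte bvec starts a new
segment, so a gap check between the two bios would still be needed in that
case.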

The root cause might be in blkdev_writepage(); I guess these small
bios come from there.

thanks,
Ming Lei


* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-18  2:59           ` Ming Lei
@ 2016-03-30 13:07             ` Ming Lei
  2016-04-20 13:48               ` Vitaly Kuznetsov
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2016-03-30 13:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Vitaly Kuznetsov, linux-block, Linux Kernel Mailing List,
	Jens Axboe, Dan Williams, Martin K. Petersen, Sagi Grimberg,
	Mike Snitzer, K. Y. Srinivasan, Cathy Avery

[-- Attachment #1: Type: text/plain, Size: 1941 bytes --]

On Fri, Mar 18, 2016 at 10:59 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Fri, Mar 18, 2016 at 12:39 AM, Keith Busch <keith.busch@intel.com> wrote:
>> On Thu, Mar 17, 2016 at 12:20:28PM +0100, Vitaly Kuznetsov wrote:
>>> Keith Busch <keith.busch@intel.com> writes:
>>> > been combined. In any case, I think you can get what you're after just
>>> > by moving the gap check after BIOVEC_PHYS_MERGABLE. Does the following
>>> > look ok to you?
>>> >
>>>
>>> Thanks, it does.
>>
>> Cool, thanks for confirming.
>>
>>> Will you send it or would you like me to do that with your Suggested-by?
>>
>> I'm not confident yet this doesn't break anything, particularly since
>> we moved the gap check after the length check. Just wanted to confirm
>> the concept addressed your concern, but still need to take a closer look
>> and test before submitting.
>
> IMO, the change on blk_bio_segment_split() is correct, because actually it
> is a sg gap and the check should have been done between segments
> instead of bvecs. So it is reasonable to move the check just before populating
> a new segment.

Thinking about the 1st part of the change further, it looks correct in concept
but wrong for the current implementation. Because of bio/request merging,
blk_rq_map_sg() may in theory end a segment at any bvec, so I guess that is
why every bvec after the first needs the check to make sure there is no sg gap.
It looks like a very crazy limit, :-)

>
> But for the 2nd change in bio_will_gap(), which should fix Vitaly's problem, I
> am still not sure if it is completely correct. bio_will_gap() is used
> to check if two
> bios may be merged. Suppose two bios are continues physically, the last bvec
> in 1st bio and the first bvec in 2nd bio might not be in one same segment
> because of segment size limit.

How about the attached patch?


>
> The root cause might be from blkdev_writepage(), and I guess these small
> bios are from there.
>
> thanks,
> Ming Lei



-- 
Ming Lei

[-- Attachment #2: 0001-block-loose-check-on-sg-gap.patch --]
[-- Type: text/x-patch, Size: 2257 bytes --]

From 5f60ae1d686f025445fdf09f546d4d055d255ce9 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@canonical.com>
Date: Fri, 18 Mar 2016 12:41:53 +0800
Subject: [PATCH] block: loose check on sg gap

If the last bvec of the 1st bio and the 1st bvec of the next
bio are physically contiguous, and the latter can be merged
into the last segment of the 1st bio, we should consider that
they don't violate the sg gap (or virt boundary) limit.

Vitaly reported that lots of unmergeable small bios are observed
when running mkfs.ntfs on Hyper-V virtual storage, and that
performance becomes quite low, so this patch is intended to fix
the performance issue.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 include/linux/blkdev.h | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7e5d7e0..3962527 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1394,6 +1394,25 @@ static inline bool bvec_gap_to_prev(struct request_queue *q,
 	return __bvec_gap_to_prev(q, bprv, offset);
 }
 
+/*
+ * Check if the two bvecs from two bios can be merged to one segment.
+ * If yes, no need to check gap between the two bios since the 1st bio
+ * and the 1st bvec in the 2nd bio can be handled in one segment.
+ */
+static inline bool bios_segs_mergeable(struct request_queue *q,
+		struct bio *prev, struct bio_vec *prev_last_bv,
+		struct bio_vec *next_first_bv)
+{
+	if (!BIOVEC_PHYS_MERGEABLE(prev_last_bv, next_first_bv))
+		return false;
+	if (!BIOVEC_SEG_BOUNDARY(q, prev_last_bv, next_first_bv))
+		return false;
+	if (prev->bi_seg_back_size + next_first_bv->bv_len >
+			queue_max_segment_size(q))
+		return false;
+	return true;
+}
+
 static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
 			 struct bio *next)
 {
@@ -1403,7 +1422,8 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
 		bio_get_last_bvec(prev, &pb);
 		bio_get_first_bvec(next, &nb);
 
-		return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
+		if (!bios_segs_mergeable(q, prev, &pb, &nb))
+			return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
 	}
 
 	return false;
-- 
1.9.1



* Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-03-30 13:07             ` Ming Lei
@ 2016-04-20 13:48               ` Vitaly Kuznetsov
  2016-12-15 14:03                 ` Dexuan Cui
  0 siblings, 1 reply; 12+ messages in thread
From: Vitaly Kuznetsov @ 2016-04-20 13:48 UTC (permalink / raw)
  To: Ming Lei
  Cc: Keith Busch, linux-block, Linux Kernel Mailing List, Jens Axboe,
	Dan Williams, Martin K. Petersen, Sagi Grimberg, Mike Snitzer,
	K. Y. Srinivasan, Cathy Avery

Ming Lei <tom.leiming@gmail.com> writes:

> On Fri, Mar 18, 2016 at 10:59 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>> On Fri, Mar 18, 2016 at 12:39 AM, Keith Busch <keith.busch@intel.com> wrote:
>>> On Thu, Mar 17, 2016 at 12:20:28PM +0100, Vitaly Kuznetsov wrote:
>>>> Keith Busch <keith.busch@intel.com> writes:
>>>> > been combined. In any case, I think you can get what you're after just
>>>> > by moving the gap check after BIOVEC_PHYS_MERGABLE. Does the following
>>>> > look ok to you?
>>>> >
>>>>
>>>> Thanks, it does.
>>>
>>> Cool, thanks for confirming.
>>>
>>>> Will you send it or would you like me to do that with your Suggested-by?
>>>
>>> I'm not confident yet this doesn't break anything, particularly since
>>> we moved the gap check after the length check. Just wanted to confirm
>>> the concept addressed your concern, but still need to take a closer look
>>> and test before submitting.
>>
>> IMO, the change on blk_bio_segment_split() is correct, because actually it
>> is a sg gap and the check should have been done between segments
>> instead of bvecs. So it is reasonable to move the check just before populating
>> a new segment.
>
> Thinking of the 1st part change further, looks it is just correct in concept,
> but wrong from current implementation. Because of bios/reqs merge,
> blk_rq_map_sg() may end one segment in any bvec in theroy, so I guess
> that is why each non-1st bvec need the check to make sure no sg gap.
> Looks a very crazy limit, :-)
>
>>
>> But for the 2nd change in bio_will_gap(), which should fix Vitaly's problem, I
>> am still not sure if it is completely correct. bio_will_gap() is used
>> to check if two
>> bios may be merged. Suppose two bios are continues physically, the last bvec
>> in 1st bio and the first bvec in 2nd bio might not be in one same segment
>> because of segment size limit.
>
> How about the attached patch?
>

I just wanted to revive the discussion as the issue persists. I
re-tested your patch against 4.6-rc4 and it effectively solves the
issue.

pre-patch:
# time mkfs.ntfs /dev/sdb1
Cluster size has been automatically set to 4096 bytes.
Initializing device with zeroes: 100% - Done.
Creating NTFS volume structures.
mkntfs completed successfully. Have a nice day.

real 8m10.977s
user 0m0.115s
sys 0m12.672s

post-patch:
# time mkfs.ntfs /dev/sdb1
Cluster size has been automatically set to 4096 bytes.
Initializing device with zeroes: 100% - Done.
Creating NTFS volume structures.
mkntfs completed successfully. Have a nice day.

real 0m42.430s
user 0m0.171s
sys 0m7.675s

Will you send this patch? Please let me know if I can further
assist. Thanks!

-- 
  Vitaly


* RE: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
  2016-04-20 13:48               ` Vitaly Kuznetsov
@ 2016-12-15 14:03                 ` Dexuan Cui
  0 siblings, 0 replies; 12+ messages in thread
From: Dexuan Cui @ 2016-12-15 14:03 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Ming Lei
  Cc: Keith Busch, linux-block@vger.kernel.org,
	Linux Kernel Mailing List, Jens Axboe, Dan Williams,
	Martin K. Petersen, Sagi Grimberg, Long Li, Mike Snitzer,
	KY Srinivasan, Cathy Avery

> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Vitaly Kuznetsov
> Sent: Wednesday, April 20, 2016 21:48
> To: Ming Lei <tom.leiming@gmail.com>
> Cc: Keith Busch <keith.busch@intel.com>; linux-block@vger.kernel.org; Linux
> Kernel Mailing List <linux-kernel@vger.kernel.org>; Jens Axboe
> <axboe@kernel.dk>; Dan Williams <dan.j.williams@intel.com>; Martin K.
> Petersen <martin.petersen@oracle.com>; Sagi Grimberg
> <sagig@mellanox.com>; Mike Snitzer <snitzer@redhat.com>; KY Srinivasan
> <kys@microsoft.com>; Cathy Avery <cavery@redhat.com>
> Subject: Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
>
> Ming Lei <tom.leiming@gmail.com> writes:
>
> > On Fri, Mar 18, 2016 at 10:59 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> >> On Fri, Mar 18, 2016 at 12:39 AM, Keith Busch <keith.busch@intel.com> wrote:
> >>> On Thu, Mar 17, 2016 at 12:20:28PM +0100, Vitaly Kuznetsov wrote:
> >>>> Keith Busch <keith.busch@intel.com> writes:
> >>>> > been combined. In any case, I think you can get what you're after just
> >>>> > by moving the gap check after BIOVEC_PHYS_MERGABLE. Does the following
> >>>> > look ok to you?
> >>>> >
> >>>>
> >>>> Thanks, it does.
> >>>
> >>> Cool, thanks for confirming.
> >>>
> >>>> Will you send it or would you like me to do that with your Suggested-by?
> >>>
> >>> I'm not confident yet this doesn't break anything, particularly since
> >>> we moved the gap check after the length check. Just wanted to confirm
> >>> the concept addressed your concern, but still need to take a closer look
> >>> and test before submitting.
> >>
> >> IMO, the change on blk_bio_segment_split() is correct, because actually it
> >> is a sg gap and the check should have been done between segments
> >> instead of bvecs. So it is reasonable to move the check just before populating
> >> a new segment.
> >
> > Thinking of the 1st part change further, looks it is just correct in concept,
> > but wrong from current implementation. Because of bios/reqs merge,
> > blk_rq_map_sg() may end one segment in any bvec in theroy, so I guess
> > that is why each non-1st bvec need the check to make sure no sg gap.
> > Looks a very crazy limit, :-)
> >
> >>
> >> But for the 2nd change in bio_will_gap(), which should fix Vitaly's problem, I
> >> am still not sure if it is completely correct. bio_will_gap() is used
> >> to check if two
> >> bios may be merged. Suppose two bios are continues physically, the last bvec
> >> in 1st bio and the first bvec in 2nd bio might not be in one same segment
> >> because of segment size limit.
> >
> > How about the attached patch?
> >
>
> I just wanted to revive the discussion as the issue persists. I
> re-tested your patch against 4.6-rc4 and it efficiently solves the
> issue.
>
> pre-patch:
> # time mkfs.ntfs /dev/sdb1
> Cluster size has been automatically set to 4096 bytes.
> Initializing device with zeroes: 100% - Done.
> Creating NTFS volume structures.
> mkntfs completed successfully. Have a nice day.
>
> real 8m10.977s
> user 0m0.115s
> sys 0m12.672s
>
> post-patch:
> # time mkfs.ntfs /dev/sdb1
> Cluster size has been automatically set to 4096 bytes.
> Initializing device with zeroes: 100% - Done.
> Creating NTFS volume structures.
> mkntfs completed successfully. Have a nice day.
>
> real 0m42.430s
> user 0m0.171s
> sys 0m7.675s
>
> Will you send this patch? Please let me know if I can further
> assist. Thanks!
>
> --
>   Vitaly

Hi, I'm reviving the thread because I'm suffering from exactly the same issue.
This is the thread I created today:
"Big I/O requests are split into small ones due to unaligned ext4 partition boundary?"
http://marc.info/?t=148180346100002&r=1&w=2

Ming's patch can fix this issue for me.

Stable 4.4 and later are affected too.
I didn't check 4.3.x kernels, but for Linux guest on Hyper-V, any kernel with the
patch "storvsc: get rid of bounce buffer"
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=81988a0e6b031bc80da15257201810ddcf989e64
should be affected.

Thanks,
-- Dexuan


end of thread, other threads:[~2016-12-15 14:03 UTC | newest]

Thread overview: 12+ messages
2016-03-15 15:17 [PATCH RFC] block: fix bio merge checks when virt_boundary is set Vitaly Kuznetsov
2016-03-15 16:03 ` Keith Busch
2016-03-16 10:17   ` Vitaly Kuznetsov
2016-03-16 15:40 ` Ming Lei
2016-03-16 16:26   ` Vitaly Kuznetsov
2016-03-16 22:38     ` Keith Busch
2016-03-17 11:20       ` Vitaly Kuznetsov
2016-03-17 16:39         ` Keith Busch
2016-03-18  2:59           ` Ming Lei
2016-03-30 13:07             ` Ming Lei
2016-04-20 13:48               ` Vitaly Kuznetsov
2016-12-15 14:03                 ` Dexuan Cui
