From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: Ming Lei <tom.leiming@gmail.com>
Cc: linux-block@vger.kernel.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>,
	Dan Williams <dan.j.williams@intel.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Sagi Grimberg <sagig@mellanox.com>,
	Mike Snitzer <snitzer@redhat.com>,
	"K. Y. Srinivasan" <kys@microsoft.com>,
	Cathy Avery <cavery@redhat.com>,
	Keith Busch <keith.busch@intel.com>
Subject: Re: [PATCH RFC] block: fix bio merge checks when virt_boundary is set
Date: Wed, 16 Mar 2016 17:26:28 +0100
Message-ID: <87oaae4cej.fsf@vitty.brq.redhat.com>
In-Reply-To: <CACVXFVMs-fOaNVUyedg66T83MiOc7a+op3LeHVt6ogVBQnYdeQ@mail.gmail.com> (Ming Lei's message of "Wed, 16 Mar 2016 23:40:02 +0800")

Ming Lei <tom.leiming@gmail.com> writes:

> On Tue, Mar 15, 2016 at 11:17 PM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> Hyper-V storage driver, which switched to using virt_boundary some time
>> ago, experiences significant slowdown on non-page-aligned IO. E.g.
>>
>> With virt_boundary set:
>>  # time mkfs.ntfs -Q -s 512 /dev/sdc1
>>  ...
>>  real   0m9.406s
>>  user   0m0.014s
>>  sys    0m0.672s
>>
>> Without virt_boundary set (unsafe):
>>  # time mkfs.ntfs -Q -s 512 /dev/sdc1
>>  ...
>>  real   0m6.657s
>>  user   0m0.012s
>>  sys    0m6.423s
>>
>> The reason of the slowdown is the fact that bios don't get merged and we
>> end up sending many short requests to the host. My investigation led me to
>> the following code (__bvec_gap_to_prev()):
>>
>>     return offset ||
>>            ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>>
>> Here is an example: we have two bio_vec with the following content:
>>     bprv.bv_offset = 512
>>     bprv.bv_len = 512
>>
>>     bnxt.bv_offset = 1024
>>     bnxt.bv_len = 512
>>
>>     bprv.bv_page == bnxt.bv_page
>>     virt_boundary is set to PAGE_SIZE-1
>>
>> The above mentioned code will report that a gap will appear if we merge
>> these two (as offset = 1024) but this doesn't look sane. On top of that,
>> we have the following optimization in bio_add_pc_page():
>>
>>     if (page == prev->bv_page &&
>>         offset == prev->bv_offset + prev->bv_len) {
>>             prev->bv_len += len;
>>             bio->bi_iter.bi_size += len;
>>             goto done;
>>         }
>>
>> But we don't have such check in other places, which check virt_boundary.
>
> We do have the above merge in bio_add_page(), so the two bios in
> your above example shouldn't have been observed if the two buffers
> are added to bio via the bio_add_page().
>
> If you see short bios in above example, maybe you need to check ntfs code:
>
> - if bio_add_page() is used to add buffer
> - if using one standalone bio to transfer each 512byte, even they
> are in same page and the sector is continuous

I'm not using ntfs; mkfs.ntfs is a userspace application that exposes the
regression when virt_boundary is in place. I should have avoided
mentioning bio_add_pc_page() here, as it is unrelated to the issue.

In particular, I'm concerned about the following call sites:
blk_bio_segment_split()
ll_back_merge_fn()
ll_front_merge_fn()
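
To illustrate why these sites matter, here is a minimal userspace sketch
(not kernel code). It assumes that each site treats a reported gap as
"split here / do not merge", as the `goto split` in
blk_bio_segment_split() does. struct model_bvec, gap_offset_check() and
count_pieces() are hypothetical names made up for this illustration; the
predicate mirrors the current expression in __bvec_gap_to_prev().

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct bio_vec; 'page' is only an identity. */
struct model_bvec {
	const void *page;
	unsigned int len;
	unsigned int offset;
};

/* Mirrors the current check: a non-zero offset of the next vector, or an
 * unaligned end of the previous one, is reported as a gap. */
static bool gap_offset_check(unsigned int mask,
			     const struct model_bvec *prv,
			     const struct model_bvec *nxt)
{
	return nxt->offset || ((prv->offset + prv->len) & mask);
}

/* Count how many pieces a run of vectors falls into if every reported
 * gap forces a split (a simplification of the real merge logic). */
static size_t count_pieces(const struct model_bvec *v, size_t n,
			   unsigned int mask)
{
	size_t pieces = n ? 1 : 0;

	for (size_t i = 1; i < n; i++)
		if (gap_offset_check(mask, &v[i - 1], &v[i]))
			pieces++;
	return pieces;
}
```

Under this model, eight back-to-back 512-byte vectors in one page (with
mask PAGE_SIZE-1) end up as eight separate pieces, while two full,
page-aligned vectors on different pages stay in one piece.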

>> Modify the check in __bvec_gap_to_prev() to the following:
>> 1) Report no gap in case bnxt->bv_offset == bprv->bv_offset + bprv->bv_len
>>    when bprv.bv_page == bnxt.bv_page.
>> 2) Continue reporting no gap in (bprv->bv_offset + bprv->bv_len) &
>>    queue_virt_boundary(q) case.
>>
>> Reported-by: John R. Kozee II <jkozee@bowser-morner.com>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> - The condition I'm changing was there since SG_GAPS so I may be missing
>>   something important, thus RFC.
>> ---
>>  block/bio-integrity.c  |  7 +++++--
>>  block/bio.c            |  4 +++-
>>  block/blk-merge.c      |  2 +-
>>  include/linux/blkdev.h | 17 +++++++++--------
>>  4 files changed, 18 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/bio-integrity.c b/block/bio-integrity.c
>> index 711e4d8d..f8560da 100644
>> --- a/block/bio-integrity.c
>> +++ b/block/bio-integrity.c
>> @@ -136,7 +136,7 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
>>                            unsigned int len, unsigned int offset)
>>  {
>>         struct bio_integrity_payload *bip = bio_integrity(bio);
>> -       struct bio_vec *iv;
>> +       struct bio_vec *iv, bv;
>>
>>         if (bip->bip_vcnt >= bip->bip_max_vcnt) {
>>                 printk(KERN_ERR "%s: bip_vec full\n", __func__);
>> @@ -144,10 +144,13 @@ int bio_integrity_add_page(struct bio *bio, struct page *page,
>>         }
>>
>>         iv = bip->bip_vec + bip->bip_vcnt;
>> +       bv.bv_page = page;
>> +       bv.bv_len = len;
>> +       bv.bv_offset = offset;
>>
>>         if (bip->bip_vcnt &&
>>             bvec_gap_to_prev(bdev_get_queue(bio->bi_bdev),
>> -                            &bip->bip_vec[bip->bip_vcnt - 1], offset))
>> +                            &bip->bip_vec[bip->bip_vcnt - 1], &bv))
>>                 return 0;
>>
>>         iv->bv_page = page;
>> diff --git a/block/bio.c b/block/bio.c
>> index cf75915..1583581 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -730,6 +730,8 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>>          */
>>         if (bio->bi_vcnt > 0) {
>>                 struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];
>> +               struct bio_vec bv = {.bv_page = page, .bv_len = len,
>> +                                    .bv_offset = offset};
>>
>>                 if (page == prev->bv_page &&
>>                     offset == prev->bv_offset + prev->bv_len) {
>> @@ -742,7 +744,7 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page
>>                  * If the queue doesn't support SG gaps and adding this
>>                  * offset would create a gap, disallow it.
>>                  */
>> -               if (bvec_gap_to_prev(q, prev, offset))
>> +               if (bvec_gap_to_prev(q, prev, &bv))
>>                         return 0;
>>         }
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 2613531..8c6c3e2 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -100,7 +100,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>                  * If the queue doesn't support SG gaps and adding this
>>                  * offset would create a gap, disallow it.
>>                  */
>> -               if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
>> +               if (bvprvp && bvec_gap_to_prev(q, bvprvp, &bv))
>>                         goto split;
>>
>>                 if (sectors + (bv.bv_len >> 9) > max_sectors) {
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 413c84f..b4fa29d 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1373,10 +1373,11 @@ static inline void put_dev_sector(Sector p)
>>  }
>>
>>  static inline bool __bvec_gap_to_prev(struct request_queue *q,
>> -                               struct bio_vec *bprv, unsigned int offset)
>> +                               struct bio_vec *bprv, struct bio_vec *bnxt)
>>  {
>> -       return offset ||
>> -               ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
>> +       if (bprv->bv_page == bnxt->bv_page)
>> +               return bnxt->bv_offset != bprv->bv_offset + bprv->bv_len;
>> +       return (bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q);
>
> Why do you remove check on 'offset'?
>

Because this check is wrong in my opinion and that's what's causing the
issue.

Let me try to give my example again.

We have two bios whose bio_vecs are:

     bprv.bv_offset = 512
     bprv.bv_len = 512

     bnxt.bv_offset = 1024
     bnxt.bv_len = 512

     bprv.bv_page == bnxt.bv_page
     virt_boundary is set to PAGE_SIZE-1

We call __bvec_gap_to_prev(q, &bprv, bnxt.bv_offset), and the 'offset'
check reports that a gap will appear if we merge these two bios, simply
because offset (1024) is non-zero. This seems wrong.
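
The example can be reproduced with a small userspace sketch (not kernel
code). struct model_bvec, gap_old() and gap_new() are hypothetical
stand-ins for struct bio_vec and for the two variants of
__bvec_gap_to_prev() shown in the diff; the mask argument plays the role
of queue_virt_boundary(q).

```c
#include <stdbool.h>

/* Userspace stand-in for struct bio_vec; 'page' is only an identity. */
struct model_bvec {
	const void *page;
	unsigned int len;
	unsigned int offset;
};

/* Current check: any non-zero offset of the next vector is a gap. */
static bool gap_old(unsigned int mask, const struct model_bvec *prv,
		    unsigned int next_offset)
{
	return next_offset || ((prv->offset + prv->len) & mask);
}

/* Check proposed in the patch: within one page, only a discontinuity
 * between the two vectors counts as a gap; otherwise the end of the
 * previous vector must sit on the boundary. */
static bool gap_new(unsigned int mask, const struct model_bvec *prv,
		    const struct model_bvec *nxt)
{
	if (prv->page == nxt->page)
		return nxt->offset != prv->offset + prv->len;
	return (prv->offset + prv->len) & mask;
}
```

With bprv = {page, 512, 512}, bnxt = {page, 512, 1024} and mask = 4095,
gap_old() returns true and gap_new() returns false, although the two
vectors are physically contiguous in both cases.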

>>  }
>>
>>  /*
>> @@ -1384,11 +1385,11 @@ static inline bool __bvec_gap_to_prev(struct request_queue *q,
>>   * the SG list. Most drivers don't care about this, but some do.
>>   */
>>  static inline bool bvec_gap_to_prev(struct request_queue *q,
>> -                               struct bio_vec *bprv, unsigned int offset)
>> +                               struct bio_vec *bprv, struct bio_vec *bnxt)
>>  {
>>         if (!queue_virt_boundary(q))
>>                 return false;
>> -       return __bvec_gap_to_prev(q, bprv, offset);
>> +       return __bvec_gap_to_prev(q, bprv, bnxt);
>>  }
>>
>>  static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
>> @@ -1400,7 +1401,7 @@ static inline bool bio_will_gap(struct request_queue *q, struct bio *prev,
>>                 bio_get_last_bvec(prev, &pb);
>>                 bio_get_first_bvec(next, &nb);
>>
>> -               return __bvec_gap_to_prev(q, &pb, nb.bv_offset);
>> +               return __bvec_gap_to_prev(q, &pb, &nb);
>>         }
>>
>>         return false;
>> @@ -1545,7 +1546,7 @@ static inline bool integrity_req_gap_back_merge(struct request *req,
>>         struct bio_integrity_payload *bip_next = bio_integrity(next);
>>
>>         return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
>> -                               bip_next->bip_vec[0].bv_offset);
>> +                               &bip_next->bip_vec[0]);
>>  }
>>
>>  static inline bool integrity_req_gap_front_merge(struct request *req,
>> @@ -1555,7 +1556,7 @@ static inline bool integrity_req_gap_front_merge(struct request *req,
>>         struct bio_integrity_payload *bip_next = bio_integrity(req->bio);
>>
>>         return bvec_gap_to_prev(req->q, &bip->bip_vec[bip->bip_vcnt - 1],
>> -                               bip_next->bip_vec[0].bv_offset);
>> +                               &bip_next->bip_vec[0]);
>>  }
>>
>>  #else /* CONFIG_BLK_DEV_INTEGRITY */
>> --
>> 2.5.0
>>

-- 
  Vitaly
