public inbox for linux-kernel@vger.kernel.org
From: Ming Lin <mlin@kernel.org>
To: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	dm-devel@redhat.com, linux-kernel@vger.kernel.org,
	Christoph Hellwig <hch@lst.de>, Jeff Moyer <jmoyer@redhat.com>,
	Dongsu Park <dpark@posteo.net>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	"Alasdair G. Kergon" <agk@redhat.com>
Subject: Re: [PATCH v5 00/11] simplify block layer based on immutable biovecs
Date: Mon, 27 Jul 2015 15:11:30 -0700
Message-ID: <1438035090.28978.19.camel@ssi>
In-Reply-To: <20150727175048.GA18183@redhat.com>

On Mon, 2015-07-27 at 13:50 -0400, Mike Snitzer wrote:
> On Thu, Jul 23 2015 at  2:21pm -0400,
> Ming Lin <mlin@kernel.org> wrote:
> 
> > On Mon, 2015-07-13 at 11:35 -0400, Mike Snitzer wrote:
> > > On Mon, Jul 13 2015 at  1:12am -0400,
> > > Ming Lin <mlin@kernel.org> wrote:
> > > 
> > > > On Mon, 2015-07-06 at 00:11 -0700, mlin@kernel.org wrote:
> > > > > Hi Mike,
> > > > > 
> > > > > On Wed, 2015-06-10 at 17:46 -0400, Mike Snitzer wrote:
> > > > > > I've been busy getting DM changes for the 4.2 merge window finalized.
> > > > > > As such I haven't connected with others on the team to discuss this
> > > > > > issue.
> > > > > > 
> > > > > > I'll see if we can make time in the next 2 days.  But I also have
> > > > > > RHEL-specific kernel deadlines I'm coming up against.
> > > > > > 
> > > > > > Seems late to be staging this extensive a change for 4.2... are you
> > > > > > pushing for this code to land in the 4.2 merge window?  Or do we have
> > > > > > time to work this further and target the 4.3 merge?
> > > > > > 
> > > > > 
> > > > > 4.2-rc1 was out.
> > > > > Would you have time to work together for 4.3 merge? 
> > > > 
> > > > Ping ...
> > > > 
> > > > What can I do to move forward?
> > > 
> > > You can show further testing.  Particularly that you've covered all the
> > > edge cases.
> > > 
> > > Until someone can produce some perf test results where they are actually
> > > properly controlling for the splitting, we have no useful information.
> > > 
> > > The primary concerns associated with this patchset are:
> > > 1) In the context of RAID, XFS's use of bio_add_page() used to build up
> > >    optimal IOs when the underlying block device provides striping info
> > >    via IO limits.  With this patchset how large will bios become in
> > >    practice _without_ bio_add_page() being bounded by the underlying IO
> > >    limits?
> > 
> > Totally new to XFS code.
> > Did you mean xfs_buf_ioapply_map() -> bio_add_page()?
> 
> Yes.  But there is also:
> xfs_vm_writepage -> xfs_submit_ioend -> xfs_bio_add_buffer -> bio_add_page
> 
> Basically in the old code XFS sized IO accordingly based on the
> bio_add_page feedback loop.
> 
> > The largest size could be BIO_MAX_PAGES pages, that is 256 pages(1M
> > bytes).
> 
> Independent of this late splitting work (but related): we really should
> look to fixup/extend BIO_MAX_PAGES to cover just barely "too large"
> configurations, e.g. 10+2 RAID6 with 128K chunk, so 1280K for a full
> stripe.  Ideally we'd be able to read/write full stripes.
> 
> > > 2) The late splitting that occurs for the (presumably) large bios that
> > >    are sent down.. how does it cope/perform in the face of very
> > >    low/fragmented system memory?
> > 
> > I tested in qemu-kvm with 1G/1100M/1200M memory.
> > 10 HDDs were attached to qemu via virtio-blk.
> > Then created MD RAID6 array and mkfs.xfs on it.
> > 
> > I use bs=2M, so there will be a lot of bio splits.
> > 
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1200
> > time_based
> > group_reporting
> > numjobs=8
> > gtod_reduce=0
> > norandommap
> > 
> > [job1]
> > bs=2M
> > directory=/mnt
> > size=100M
> > rw=write
> > 
> > Here are the results:
> > 
> > memory		4.2-rc2		4.2-rc2-patched
> > ------		-------		---------------
> > 1G		OOM		OOM
> > 1100M		fail		OK
> > 1200M		OK		OK
> > 
> > "fail" means it hit a page allocation failure.
> > http://minggr.net/pub/block_patches_tests/dmesg.4.2.0-rc2
> > 
> > I tested 3 times for each kernel to confirm that with 1100M memory,
> > 4.2-rc2 always hit a page allocation failure and 4.2-rc2-patched is OK.
> > 
> > So the patched kernel performs better in this case.
> 
> Interesting.  Seems to prove Kent's broader point that he used mempools
> and handles allocations better than the old code did.
> 
> > > 3) More open-ended comment than question: Linux has evolved to perform
> > >    well on "enterprise" systems.  We generally don't fall off a cliff on 
> > >    performance like we used to.  The concern associated with this
> > >    patchset is that if it goes in without _real_ due-diligence on
> > >    "enterprise" scale systems and workloads it'll be too late once we
> > >    notice the problem(s).
> > > 
> > > So we really need answers to 1 and 2 above in order to feel better about
> > > the risks associated with 3.
> > > 
> > > Alasdair's feedback to you on testing still applies (and hasn't been
> > > done AFAIK):
> > > https://www.redhat.com/archives/dm-devel/2015-May/msg00203.html
> > > 
> > > Particularly:
> > > "you might need to instrument the kernels to tell you the sizes of the
> > > bios being created and the amount of splitting actually happening."
> > 
> > I added a debug patch to record the amount of splitting that actually
> > happened: https://goo.gl/Iiyg4Y
> > 
> > In the qemu 1200M memory test case,
> > 
> > $ cat /sys/block/md0/queue/split
> > discard split: 0, write same split: 0, segment split: 27400
> > 
> > > 
> > > and
> > > 
> > > "You may also want to test systems with a restricted amount of available
> > > memory to show how the splitting via worker thread performs.  (Again,
> > > instrument to prove the extent to which the new code is being exercised.)"
> > 
> > Does the above test with qemu make sense?
> 
> The test is showing that systems with limited memory are performing
> better but, without looking at the patchset in detail, I'm not sure what
> your splitting accounting patch is showing.
> 
> Are you saying that:
> 1) the code only splits via worker threads
> 2) with 27400 splits in the 1200M case the splitting certainly isn't
>    making things any worse.

With this patchset, bio_add_page() always creates as large a bio as
possible (1M bytes max). The debug patch counts how many times a bio was
split due to device limits, for example when bio->bi_phys_segments >
queue_max_segments(q).

It's more interesting if we look at how many bios are allocated for each
application IO request.

e.g. 10+2 RAID6 with 128K chunk.

Assume we only consider the device's max_segments limit.

# cat /sys/block/md0/queue/max_segments 
126

So blk_queue_split() will split a bio that needs more than 126 segments;
with one page per segment that is 126 pages (504K bytes).

Let's do a 1280K request.

# dd if=/dev/zero of=/dev/md0 bs=1280k count=1 oflag=direct

With the debug patch below,

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a4aa6e5..2fde2ce 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -259,6 +259,10 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_split(q, &bio, q->bio_split);
 
+	if (!strcmp(current->comm, "dd") && bio_data_dir(bio) == WRITE)
+		printk("%s: bio %p, offset %lu, size %uK\n", __func__,
+			bio, bio->bi_iter.bi_sector<<9, bio->bi_iter.bi_size>>10);
+
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
 		bio_io_error(bio);

For the non-patched kernel, 10 bios were allocated.

[   11.921775] md_make_request: bio ffff8800469c5d00, offset 0, size 128K
[   11.945692] md_make_request: bio ffff8800471df700, offset 131072, size 128K
[   11.946596] md_make_request: bio ffff8800471df200, offset 262144, size 128K
[   11.947694] md_make_request: bio ffff8800471df300, offset 393216, size 128K
[   11.949421] md_make_request: bio ffff8800471df900, offset 524288, size 128K
[   11.956345] md_make_request: bio ffff8800471df000, offset 655360, size 128K
[   11.957586] md_make_request: bio ffff8800471dfb00, offset 786432, size 128K
[   11.959086] md_make_request: bio ffff8800471dfc00, offset 917504, size 128K
[   11.964221] md_make_request: bio ffff8800471df400, offset 1048576, size 128K
[   11.965117] md_make_request: bio ffff8800471df800, offset 1179648, size 128K

For the patched kernel, only 2 bios were allocated in the base case, with 0 splits.

[   20.034036] md_make_request: bio ffff880046a2ee00, offset 0, size 1024K
[   20.046104] md_make_request: bio ffff880046a2e500, offset 1048576, size 256K

In the worst case, 4 bios are allocated, with 2 splits.
One such worst case is memory so fragmented that the 1M bio comprises
256 bi_phys_segments, so it needs 2 splits.

1280K = 1M + 256K

ffff880046a30900 and ffff880046a21500 are the original bios.
ffff880046a30200 and ffff880046a21e00 are the split bios.

[   13.049323] md_make_request: bio ffff880046a30200, offset 0, size 504K
[   13.080057] md_make_request: bio ffff880046a21e00, offset 516096, size 504K
[   13.082857] md_make_request: bio ffff880046a30900, offset 1032192, size 16K
[   13.084983] md_make_request: bio ffff880046a21500, offset 1048576, size 256K

# cat /sys/block/md0/queue/split 
discard split: 0, write same split: 0, segment split: 2

> 
> But for me the bigger take away is: the old merge_bvec code (no late
> splitting) is more prone to allocation failure than the new code.

Yes, as I showed above.

> 
> On that point alone I'm OK with this patchset going forward.
> 
> I'll review the implementation details as they relate to DM now, but
> that is just a formality.  My hope is that I'll be able to provide my
> Acked-by very soon.

Great! Thanks.

> 
> Mike


