From: Boaz Harrosh <boaz@plexistor.com>
To: Ming Lei <tom.leiming@gmail.com>, lsf-pc@lists.linuxfoundation.org
Cc: linux-block@vger.kernel.org,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>
Subject: Re: [LSF/MM ATTEND] block: multipage bvecs
Date: Sun, 28 Feb 2016 13:17:43 +0200	[thread overview]
Message-ID: <56D2D757.2000204@plexistor.com> (raw)
In-Reply-To: <CACVXFVPH2z9H9ZudEV9yirg2mJuz6f7Z4LB5MQrabFEh82Xa7A@mail.gmail.com>

On 02/26/2016 06:33 PM, Ming Lei wrote:
> Hi,
> 
> I'd like to participate in LSF/MM and discuss multipage bvecs.
> 
> Kent posted the idea[1] before, but it was never pushed out.
> I have studied multipage bvecs for a while, and I think
> it is a good way to improve the block subsystem.
> 
> Multipage bvecs mean that one 'struct bio_vec' can hold
> multiple physically contiguous pages, instead of the single
> page used in the current kernel.
> 
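For reference, the vector in question is 'struct bio_vec' (the 4.x-era
definition, from include/linux/blk_types.h); "multipage" simply means
bv_len may extend past the end of bv_page:

	struct bio_vec {
		struct page	*bv_page;	/* first page of the segment */
		unsigned int	bv_len;		/* may span several pages once
						 * multipage bvecs are allowed */
		unsigned int	bv_offset;	/* byte offset into bv_page */
	};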

Hi Ming Lei

This is an interesting topic for me.

I don't know if you ever tried it, but I did. If I take a regular
SSD or a PCIe flash card in my machine, point a single bvec at a
page with bv_len = PAGE_SIZE * 8, and call submit_bio, I get eight
pages' worth of IO from that one bvec and it all just works.
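
A minimal sketch of such a submission (illustrative only, not the exact
code; 'bdev', 'sector' and 'my_end_io' stand in for whatever the caller
supplies, and alloc_pages() at order 3 provides the 8 physically
contiguous pages):

	struct page *pages = alloc_pages(GFP_KERNEL, 3); /* 2^3 = 8 pages */
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);      /* one bvec is enough */

	bio->bi_bdev = bdev;                      /* target device */
	bio->bi_iter.bi_sector = sector;
	bio->bi_io_vec[0].bv_page = pages;
	bio->bi_io_vec[0].bv_offset = 0;
	bio->bi_io_vec[0].bv_len = PAGE_SIZE * 8; /* one bvec, eight pages */
	bio->bi_vcnt = 1;
	bio->bi_iter.bi_size = PAGE_SIZE * 8;
	bio->bi_end_io = my_end_io;               /* hypothetical completion hook */

	submit_bio(READ, bio);                    /* 4.x submit_bio(rw, bio) */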

Yes, yes, I know it would break a bunch of other places, and the
single-bvec case probably works better today. My point is only
that the current code is not that picky about assuming a single
page per bvec.

I would like to see an audit and test cases done in this regard,
but keeping the current API and making this transparent. I think
all the places you mention below can be made transparent to
"big bvecs" if coded carefully, and there need not be a separate
API for multipage vs. single-page bvecs. It should all just work.
I might be wrong, as I have not looked at this deeply, but my gut
feeling is that it is possible.

Thanks for bringing up the issue
Boaz

> IMO, there are several advantages to supporting multipage bvecs:
> 
> - currently one bio from bio_alloc() can hold at most 256
> vectors, which means one bio can transfer at most
> 1MB (256 * 4K). With multipage bvecs a filesystem can submit a
> bigger chunk via a single bio, because big physically contiguous
> segments are very common.
> 
> - the CPU time consumed iterating the bvec table should decrease
> 
> - block merging gets simplified a lot: segments can be merged
> right inside bio_add_page(), so single-page bvecs needn't be
> stored in the bvec table, and the segment can later be split for
> the driver at the proper size. blk_bio_map_sg() gets simplified
> too. Lately block merging has become a bit complicated and we
> have seen quite a few bug reports/fixes in this area.
> 
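The merge-in-bio_add_page() idea might look roughly like the following
(a hypothetical helper, assuming physical contiguity is the only merge
criterion and ignoring queue limits):

	/* Try to extend the last bvec instead of appending a new one. */
	static bool try_merge_page(struct bio *bio, struct page *page,
				   unsigned int len, unsigned int off)
	{
		struct bio_vec *bv;

		if (!bio->bi_vcnt)
			return false;

		bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
		if (page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len !=
		    page_to_phys(page) + off)
			return false;	/* not physically contiguous */

		bv->bv_len += len;	/* the bvec becomes multipage */
		bio->bi_iter.bi_size += len;
		return true;
	}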
> I'd like to hear opinions from fs folks about multipage-bvec-based
> bios, because this brings some change to the bio interface (one
> bio will represent bigger I/O than before).
> 
> Also I hope to discuss the implementation with folks from fs, dm,
> md, bcache, ... because this feature will bring changes to those
> subsystems. So far, I have the following ideas:
> 
> 1) change on bio_for_each_segment()
> 
> The bvec returned from this iterator helper needs to remain a
> single-page vector as before, so most users of the bio iterator
> don't need to change.
> 
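This is the part I believe can be made transparent: the iterator can
slice a multipage bvec into single-page views on the fly. A hypothetical
conversion helper (names invented here) could be:

	/* Return the single-page slice that starts 'done' bytes into a
	 * multipage bvec.  The pages are physically contiguous, so the
	 * page structs can be indexed off bv_page. */
	static inline struct bio_vec mp_to_sp_bvec(const struct bio_vec *mp,
						   unsigned int done)
	{
		unsigned int off = mp->bv_offset + done;
		struct bio_vec sp;

		sp.bv_page   = mp->bv_page + off / PAGE_SIZE;
		sp.bv_offset = off % PAGE_SIZE;
		sp.bv_len    = min_t(unsigned int, PAGE_SIZE - sp.bv_offset,
				     mp->bv_len - done);
		return sp;
	}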
> 2) change on bio_for_each_segment_all()
> 
> bio_for_each_segment_all() has to be changed because callers may
> modify the bvec and currently assume it is always single-page.
> 
> I think bio_for_each_segment_all() needs to be split into
> bio_for_each_segment_all_rd() and bio_for_each_segment_all_wt().
> 
> Both new helpers return a pointer to a bio_vec as before.
> 
> *_rd() is used to iterate over each vector for reading the
> pointed-to bvec; the caller must not write to this vector. This
> helper can still return a single-page bvec as before, so one
> extra local/temp 'bio_vec' variable has to be added for the
> conversion from multipage bvec to single-page bvec.
> 
> *_wt() is used to iterate over each vector when changing the
> bvec, and is only allowed for iterating bios with single-page
> bvecs; there are just a few such cases, such as bio bounce,
> bio_alloc_pages(), raid1 and raid10.
> 
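Taking the proposed *_rd() form at face value (the exact signature is a
guess on my part; 'sp' is the extra temp bvec mentioned above and 'bv'
points at it, never at the real table), usage would look like:

	struct bio_vec sp;	/* temp filled in by the iterator each step */
	struct bio_vec *bv;	/* points at 'sp', not at bio->bi_io_vec */
	int i;

	bio_for_each_segment_all_rd(bv, bio, i, sp) {
		flush_dcache_page(bv->bv_page);	/* reading is fine */
		/* writes through 'bv' would be lost, by design */
	}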
> 3) change bvecs of cloned bio
> In cases such as bio bounce and raid1, one bio is cloned from the
> incoming bio, and each bvec of the cloned bio may be updated. We
> have to introduce a single-page version of bio_clone() so the
> cloned bio only includes single-page bvecs; then the bvecs can be
> updated as before.
> 
> One problem is that the cloned bio may not be able to hold all
> the single-page bvecs converted from the multipage bvecs in the
> source bio; one simple solution is to split the source bio and
> make sure its size can't be bigger than 1MB (256 single-page
> vectors).
> 
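With 4K pages that 1MB cap is exactly BIO_MAX_PAGES (256) worth of
single-page vectors; a sketch of enforcing it before the single-page
clone ('bs' being some caller-owned bio_set):

	unsigned int max_sectors = BIO_MAX_PAGES << (PAGE_SHIFT - 9);
	struct bio *split = NULL;

	if (bio_sectors(bio) > max_sectors) {
		/* 'split' gets the first 1MB; 'bio' keeps the remainder */
		split = bio_split(bio, max_sectors, GFP_NOIO, bs);
	}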
> 4) introduce bio_for_each_mp_segment()
> 
> The bvec returned from this iterator helper will be a multipage
> bvec, which is the actual/real segment, so drivers may switch to
> this helper if they can handle multipage segments directly, which
> should be the common case.
> 
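Assuming the proposed name and an iterator shaped like today's
bio_for_each_segment() ('sg' being a caller-provided scatterlist
cursor), a driver could then map whole segments directly:

	struct bvec_iter iter;
	struct bio_vec seg;

	bio_for_each_mp_segment(seg, bio, iter) {
		/* 'seg' may span many physically contiguous pages;
		 * program it as a single DMA segment. */
		sg_set_page(sg, seg.bv_page, seg.bv_len, seg.bv_offset);
		sg = sg_next(sg);
	}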
> 
> [1] http://marc.info/?l=linux-kernel&m=141680246629547&w=2
> 
> Thanks,
> Ming Lei


Thread overview: 24+ messages
2016-02-26 16:33 [LSF/MM ATTEND] block: multipage bvecs Ming Lei
2016-02-28 11:17 ` Boaz Harrosh [this message]
2016-02-28 14:34   ` Ming Lei
2016-02-28 14:41     ` Ming Lei
2016-02-28 16:01   ` [Lsf-pc] " James Bottomley
2016-02-29  9:41     ` Boaz Harrosh
2016-02-28 16:08   ` Christoph Hellwig
2016-02-29 10:16     ` Boaz Harrosh
2016-02-29 15:46       ` James Bottomley
2016-02-28 16:07 ` Christoph Hellwig
2016-02-28 16:26   ` James Bottomley
2016-02-28 16:29     ` Christoph Hellwig
2016-02-28 16:45       ` James Bottomley
2016-02-28 16:59         ` Ming Lei
2016-02-28 17:09           ` James Bottomley
2016-02-28 18:49             ` Ming Lei
2016-03-03  8:58 ` Christoph Hellwig
2016-03-03 11:04   ` Ming Lei
2016-03-03 12:11     ` Christoph Hellwig
2016-03-03 23:49       ` Ming Lin
2016-03-07  8:44       ` Ming Lei
2016-03-21 15:55         ` Christoph Hellwig
2016-03-22  0:12           ` Ming Lei
2016-03-05  8:35 ` Ming Lei
