linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Boaz Harrosh <bharrosh@panasas.com>
To: <tao.peng@emc.com>
Cc: <bergwolf@gmail.com>, <bhalevy@tonian.com>,
	<Trond.Myklebust@netapp.com>, <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH-RESEND 4/4] pnfsblock: do not ask for layout in pg_init
Date: Thu, 1 Dec 2011 10:00:40 -0800	[thread overview]
Message-ID: <4ED7C0C8.8010504@panasas.com> (raw)
In-Reply-To: <4ED7BA55.8010200@panasas.com>

On 12/01/2011 09:33 AM, Boaz Harrosh wrote:
> On 11/30/2011 09:05 PM, tao.peng@emc.com wrote:
>>> -----Original Message-----
>> Why return or forget the 1 lo_seg? What you really need to avoid seg
>> list exploding is to have LRU based caching and merge them when
>> necessary, instead of asking and dropping lseg again and again...
>>
> 
> Ya Ya, I'm talking in abstract now giving the all picture. If in our
> above case the application has closed the file and will not ever open it
> again then I'm right, right? of course once you got it you can keep it
> around in a cache. Though I think that with heavy segmenting keeping segs
> passed ROC is extremely harder to manage. Be careful with that in current
> implementation.
> 
>> Removing the IO_MAX limitation can be a second optimization. I was
>> hoping to remove it if current IO_MAX thing turns out hurting
>> performance. And one reason for IO_MAX is to avoid the likelihood
>> that server returns short layout, because current implementation is
>> limited to retry nfs_read/write_data as a whole instead of splitting
>> it up. I think that if we do it this way, the IO_MAX can be removed
>> later when necessary, by introducing a splitting mechanism either on
>> nfs_read/write_data or on desc. Now that you ask for it, I think
>> following approach is possible:
>>
>> 1. remove the limit on IO_MAX at .pg_test.
>> 2. ask for layout at .pg_doio for the size of current IO desc
>> 3. if server returns short layout, split nfs_read/write_data or desc and issue the pagelist covered by lseg.
>> 4. do 2 and 3 in a loop until all pages in current desc is handled.
>>
> 
> So in effect you are doing what I suggest two passes 
> one: what's the next hole,
> second: Collect pages slicing by returned lo_seg
> 
> Don't you think it is more simple to do a 3 line preliminary
> loop in "one:"?
> 
> and keep the current code that is now exactly built to do
> "second:"
> 
> You are suggesting to effectively repeat current code using
> the first .pg_init...pg_doio for one: and hacking for blocks
> "second:"
> 
> I want 3 lines of code for one: and keep second: exactly as is.
> 
> <snip>
> 
>> Looks like you are suggesting going through the dirty page list twice
>> before issuing IO, one just for getting the IO size information and
>> another one for page collapsing. The whole point of moving layoutget
>> to .pg_doio is to collect real IO size there because we don't know it
>> at .pg_init. And it is done without any changes to generic NFS IO
>> path. I'm not sure if it is appropriate to change generic IO routine
>> to collect the information before .pg_init (I'd be happy too if we
>> can do it there).
> 
> That's what you are suggesting as well look in your step 4.:
> 	do 2 and 3 in a loop until all pages in current desc is handled
> sounds like another loop on all pages no?
> 
> BTW: we are already doing two passes in the system we already looked
> through all the pages when building the io desc at .write/read_pages
> 
> At first we can do above 3 lines loop in .pg_init. Second we can
> just collect that information in generic nfs at the first loop
> we are already doing
> 
>>>
>>>     a. We want it at .pg_init so we have a layout at .pg_test to inspect.
>>>
>>>        Done properly will let you, in blocks, slice by extents at .pg_test
>>>        and .write_pages can send the complete paglist to md (bio chaining)
>>>
>> Unlike objects and files, blocks don't slice by extents, not at .pg_test, nor at .read/write_pagelist.
>>  
> 
> What? What do you do? Send a scsi scatter list command?
> I don't think so. Somewhere you have to see an extent boundary of the data
> and send the continue of the next extent on disk as a new block request
> in a new bio with a new disk offset. No?
> 
> I'm not saying to do this right away, but you could simplify the code a lot
> by slicing by extent inside .pg_test
> 
> <>
>>>
>>>     At first version:
>>>       A good approximation which gives you an exact middle point
>>>       between blocks B.2 and objects/files B.2, is dirty count.
>>>     At later patch:
>>>       Have generic NFS collect the above O1, N1, and Nn for you and base
>>>       your decision on that.
>>>
>>
>> Well, unless you put both the two parts in... The first version is
>> ignoring the fact that blocks MDS cannot give out file stripping
>> information as easily as objects and files do. And I will stand
>> against it alone because all it does is to benefit objects while
>> hurting blocks (files don't care because they use whole file layout,
>> at least for now).
> 
> Lets make a computerize lets find O1..N1 and put in the Generic

computerize => compromise

> code for everyone objects files-segmented and blocks. Because I
> need it two. And I'd like O1..Nn but for first step I'd love
> O1..N1 a lot.
> 
> Please see my point. You created a system that only blocks can
> benefit from unless objects repeats the same exact but duplicated
> hacks as blocks.
> 
> Please do the *very* simple thing I suggest which we can all enjoy.
> 
>>
>> Otherwise, I would suggest having private hack for blocks because we have a real problem to solve.
>>
> 
> It's just as real for objects, and files when they will do segments.
> 
> My suggestion is to have two helpers at pnfs.c one for blocks and
> objects, one for files. Which can be called in .pg_init.
> The blocks/objects does a simple loop counting the first contiguous
> chunk and asks for a layout, like today. files one just sends
> all-file request.
> 
> Boaz
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


  reply	other threads:[~2011-12-01 18:00 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-03  4:56 [PATCH-RESEND 4/4] pnfsblock: do not ask for layout in pg_init Peng Tao
2011-11-29 17:48 ` Jim Rees
2011-11-30  5:43   ` tao.peng
2011-11-30 12:57 ` Benny Halevy
2011-11-30 13:17   ` Peng Tao
2011-12-01  1:18     ` Boaz Harrosh
2011-12-01  5:05       ` tao.peng
2011-12-01  9:57         ` Benny Halevy
2011-12-01 17:33         ` Boaz Harrosh
2011-12-01 18:00           ` Boaz Harrosh [this message]
2011-12-02  4:59           ` tao.peng
2011-12-07 14:08             ` Boaz Harrosh
2011-12-08  3:32               ` tao.peng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4ED7C0C8.8010504@panasas.com \
    --to=bharrosh@panasas.com \
    --cc=Trond.Myklebust@netapp.com \
    --cc=bergwolf@gmail.com \
    --cc=bhalevy@tonian.com \
    --cc=linux-nfs@vger.kernel.org \
    --cc=tao.peng@emc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).