Re: [PATCH 11/12] NFSv4.1: layoutcommit

linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
@ 2011-03-24 13:57 William A. (Andy) Adamson
  2011-03-24 16:37 ` Benny Halevy
  0 siblings, 1 reply; 15+ messages in thread
From: William A. (Andy) Adamson @ 2011-03-24 13:57 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Fred Isaman, Trond Myklebust, NFS list

>> Only whole file layout support means that there is only one IOMODE_RW layout
>> segment.
>>
>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
>> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
>> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
>> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
>> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>
>The code in this patch is new and different enough from the one I/we
>signed-off originally that they don't make sense here.

Hi Benny

OK with me

>>
>>+             /* references matched in nfs4_layoutcommit_release */
>> +             wdata->lseg->pls_lc_cred =
>> +                     get_rpccred(wdata->args.context->state->owner->so_cred);
>> +             mark_inode_dirty_sync(wdata->inode);
>> +             dprintk("%s: Set layoutcommit for inode %lu ",
>> +                     __func__, wdata->inode->i_ino);
>> +     }
>> +     if (end_pos > wdata->lseg->pls_end_pos)
>> +             wdata->lseg->pls_end_pos = end_pos;
>
> The end_pos is essentially per inode, why maintain it per lseg?
> How do you see this working with multiple lsegs in mind?

The end-pos is per lseg, not per inode - each layoutcommit applies to
a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.

>From Section 18.42.3
.  The byte-range being committed is
   specified through the byte-range (loca_offset and loca_length).  This
   byte-range MUST overlap with one or more existing layouts previously
   granted via LAYOUTGET


   Also, loca_last_write_offset MUST overlap the range
   described by loca_offset and loca_length.

For the multiple lseg case: if the lsegs are merged, bookeeping
end_pos per lseg just works. If a layoutdriver does not use merged
lsegs, then there is a bit of work to do to walk the list of lsegs and
determine the final end_pos for a given LAYOUTCOMMIT.  If there are
multiple non-contiguous lsegs, each used for WRITEs then multiple
LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
byte-range will not overlap as required.

>> +pnfs_layoutcommit_inode(struct inode *inode, int sync)
>
> "bool sync" makes more sense

>> +{
>> +     struct nfs4_layoutcommit_data *data;
>> +     struct nfs_inode *nfsi = NFS_I(inode);
>> +     struct pnfs_layout_segment *lseg;
>> +     struct rpc_cred *cred;
>> +     loff_t end_pos;
>> +     int status = 0;
>> +
>> +     dprintk("--> %s inode %lu\n", __func__, inode->i_ino);
>> +
>> +     /* Note kzalloc ensures data->res.seq_res.sr_slot == NULL */
>> +     data = kzalloc(sizeof(*data), GFP_NOFS);
>> +     spin_lock(&inode->i_lock);
>> +
> >+     if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>
> previously (i.e. in the linux-pnfs tree :) this function is called only
> if layoutcommit_needed(), now I worry may waste a kzalloc too frequently.
> I suggest testing (and not clearing) NFS_INO_LAYOUTCOMMIT before doing
> the allocation to prevent that.

Agreed.

>> +     end_pos = lseg->pls_end_pos;
>> +     cred = lseg->pls_lc_cred;
>> +     lseg->pls_end_pos = 0;
>> +     lseg->pls_lc_cred = NULL;
>> +
>> +     if (!data) {
>
> eh?
> why not test this before test_and_clear_bit(NFS_INO_LAYOUTCOMMIT ?

Because we should clear the LAYOUTCOMMIT needed information from the inode.
The LAYOUTCOMMIT for the file layout is an optimization. If the client
can't alloc the required buffer, the compound just won't be sent.

-->Andy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-24 13:57 [PATCH 11/12] NFSv4.1: layoutcommit William A. (Andy) Adamson
@ 2011-03-24 16:37 ` Benny Halevy
  2011-03-24 16:48   ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: Benny Halevy @ 2011-03-24 16:37 UTC (permalink / raw)
  To: William A. (Andy) Adamson; +Cc: Fred Isaman, Trond Myklebust, NFS list

On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>>> Only whole file layout support means that there is only one IOMODE_RW layout
>>> segment.
>>>
>>> Signed-off-by: Andy Adamson <andros@netapp.com>
>>> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
>>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
>>> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
>>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>>> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
>>> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
>>> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
>>> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
>>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>>
>> The code in this patch is new and different enough from the one I/we
>> signed-off originally that they don't make sense here.
> 
> Hi Benny
> 
> OK with me
> 
>>>
>>> +             /* references matched in nfs4_layoutcommit_release */
>>> +             wdata->lseg->pls_lc_cred =
>>> +                     get_rpccred(wdata->args.context->state->owner->so_cred);
>>> +             mark_inode_dirty_sync(wdata->inode);
>>> +             dprintk("%s: Set layoutcommit for inode %lu ",
>>> +                     __func__, wdata->inode->i_ino);
>>> +     }
>>> +     if (end_pos > wdata->lseg->pls_end_pos)
>>> +             wdata->lseg->pls_end_pos = end_pos;
>>
>> The end_pos is essentially per inode, why maintain it per lseg?
>> How do you see this working with multiple lsegs in mind?
> 
> The end-pos is per lseg, not per inode - each layoutcommit applies to
> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
> 
> From Section 18.42.3
> .  The byte-range being committed is
>    specified through the byte-range (loca_offset and loca_length).  This
>    byte-range MUST overlap with one or more existing layouts previously
>    granted via LAYOUTGET
> 
> 
>    Also, loca_last_write_offset MUST overlap the range
>    described by loca_offset and loca_length.
> 
> For the multiple lseg case: if the lsegs are merged, bookeeping
> end_pos per lseg just works. If a layoutdriver does not use merged
> lsegs, then there is a bit of work to do to walk the list of lsegs and
> determine the final end_pos for a given LAYOUTCOMMIT.  If there are
> multiple non-contiguous lsegs, each used for WRITEs then multiple
> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
> byte-range will not overlap as required.
> 

For the current layout types I believe that the LAYOUTCOMMIT can "merge"
multiple layout segments into a single LAYOUTCOMMIT, with a byte range
covering all segments and a last_byte_written offset which is just the maximum.
Future layout types may need this method though...

Benny

>>> +pnfs_layoutcommit_inode(struct inode *inode, int sync)
>>
>> "bool sync" makes more sense
> 
>>> +{
>>> +     struct nfs4_layoutcommit_data *data;
>>> +     struct nfs_inode *nfsi = NFS_I(inode);
>>> +     struct pnfs_layout_segment *lseg;
>>> +     struct rpc_cred *cred;
>>> +     loff_t end_pos;
>>> +     int status = 0;
>>> +
>>> +     dprintk("--> %s inode %lu\n", __func__, inode->i_ino);
>>> +
>>> +     /* Note kzalloc ensures data->res.seq_res.sr_slot == NULL */
>>> +     data = kzalloc(sizeof(*data), GFP_NOFS);
>>> +     spin_lock(&inode->i_lock);
>>> +
>>> +     if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
>>
>> previously (i.e. in the linux-pnfs tree :) this function is called only
>> if layoutcommit_needed(), now I worry may waste a kzalloc too frequently.
>> I suggest testing (and not clearing) NFS_INO_LAYOUTCOMMIT before doing
>> the allocation to prevent that.
> 
> Agreed.
> 
>>> +     end_pos = lseg->pls_end_pos;
>>> +     cred = lseg->pls_lc_cred;
>>> +     lseg->pls_end_pos = 0;
>>> +     lseg->pls_lc_cred = NULL;
>>> +
>>> +     if (!data) {
>>
>> eh?
>> why not test this before test_and_clear_bit(NFS_INO_LAYOUTCOMMIT ?
> 
> Because we should clear the LAYOUTCOMMIT needed information from the inode.
> The LAYOUTCOMMIT for the file layout is an optimization. If the client
> can't alloc the required buffer, the compound just won't be sent.
> 
> -->Andy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-24 16:37 ` Benny Halevy
@ 2011-03-24 16:48   ` Trond Myklebust
  2011-03-24 16:54     ` Fred Isaman
  2011-03-24 16:58     ` Benny Halevy
  0 siblings, 2 replies; 15+ messages in thread
From: Trond Myklebust @ 2011-03-24 16:48 UTC (permalink / raw)
  To: Benny Halevy; +Cc: William A. (Andy) Adamson, Fred Isaman, NFS list

On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
> >>> Only whole file layout support means that there is only one IOMODE_RW layout
> >>> segment.
> >>>
> >>> Signed-off-by: Andy Adamson <andros@netapp.com>
> >>> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
> >>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> >>> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
> >>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> >>> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
> >>> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
> >>> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
> >>> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
> >>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> >>
> >> The code in this patch is new and different enough from the one I/we
> >> signed-off originally that they don't make sense here.
> > 
> > Hi Benny
> > 
> > OK with me
> > 
> >>>
> >>> +             /* references matched in nfs4_layoutcommit_release */
> >>> +             wdata->lseg->pls_lc_cred =
> >>> +                     get_rpccred(wdata->args.context->state->owner->so_cred);
> >>> +             mark_inode_dirty_sync(wdata->inode);
> >>> +             dprintk("%s: Set layoutcommit for inode %lu ",
> >>> +                     __func__, wdata->inode->i_ino);
> >>> +     }
> >>> +     if (end_pos > wdata->lseg->pls_end_pos)
> >>> +             wdata->lseg->pls_end_pos = end_pos;
> >>
> >> The end_pos is essentially per inode, why maintain it per lseg?
> >> How do you see this working with multiple lsegs in mind?
> > 
> > The end-pos is per lseg, not per inode - each layoutcommit applies to
> > a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
> > 
> > From Section 18.42.3
> > .  The byte-range being committed is
> >    specified through the byte-range (loca_offset and loca_length).  This
> >    byte-range MUST overlap with one or more existing layouts previously
> >    granted via LAYOUTGET
> > 
> > 
> >    Also, loca_last_write_offset MUST overlap the range
> >    described by loca_offset and loca_length.
> > 
> > For the multiple lseg case: if the lsegs are merged, bookeeping
> > end_pos per lseg just works. If a layoutdriver does not use merged
> > lsegs, then there is a bit of work to do to walk the list of lsegs and
> > determine the final end_pos for a given LAYOUTCOMMIT.  If there are
> > multiple non-contiguous lsegs, each used for WRITEs then multiple
> > LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
> > byte-range will not overlap as required.
> > 
> 
> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
> covering all segments and a last_byte_written offset which is just the maximum.
> Future layout types may need this method though...

Is that safe?

What if I'm doing blocks and have written layout segment 1 & 3, but not
layout segment 2? I don't want to have the MDS commit layout segment 2,
and make the (lack of) data there visible to future readers.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-24 16:48   ` Trond Myklebust
@ 2011-03-24 16:54     ` Fred Isaman
  2011-03-24 16:58     ` Benny Halevy
  1 sibling, 0 replies; 15+ messages in thread
From: Fred Isaman @ 2011-03-24 16:54 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Benny Halevy, William A. (Andy) Adamson, NFS list

On Thu, Mar 24, 2011 at 12:48 PM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
>> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>> >>> Only whole file layout support means that there is only one IOMODE_RW layout
>> >>> segment.
>> >>>
>> >>> Signed-off-by: Andy Adamson <andros@netapp.com>
>> >>> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
>> >>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
>> >>> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
>> >>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>> >>> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
>> >>> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
>> >>> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
>> >>> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
>> >>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>> >>
>> >> The code in this patch is new and different enough from the one I/we
>> >> signed-off originally that they don't make sense here.
>> >
>> > Hi Benny
>> >
>> > OK with me
>> >
>> >>>
>> >>> +             /* references matched in nfs4_layoutcommit_release */
>> >>> +             wdata->lseg->pls_lc_cred =
>> >>> +                     get_rpccred(wdata->args.context->state->owner->so_cred);
>> >>> +             mark_inode_dirty_sync(wdata->inode);
>> >>> +             dprintk("%s: Set layoutcommit for inode %lu ",
>> >>> +                     __func__, wdata->inode->i_ino);
>> >>> +     }
>> >>> +     if (end_pos > wdata->lseg->pls_end_pos)
>> >>> +             wdata->lseg->pls_end_pos = end_pos;
>> >>
>> >> The end_pos is essentially per inode, why maintain it per lseg?
>> >> How do you see this working with multiple lsegs in mind?
>> >
>> > The end-pos is per lseg, not per inode - each layoutcommit applies to
>> > a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>> >
>> > From Section 18.42.3
>> > .  The byte-range being committed is
>> >    specified through the byte-range (loca_offset and loca_length).  This
>> >    byte-range MUST overlap with one or more existing layouts previously
>> >    granted via LAYOUTGET
>> >
>> >
>> >    Also, loca_last_write_offset MUST overlap the range
>> >    described by loca_offset and loca_length.
>> >
>> > For the multiple lseg case: if the lsegs are merged, bookeeping
>> > end_pos per lseg just works. If a layoutdriver does not use merged
>> > lsegs, then there is a bit of work to do to walk the list of lsegs and
>> > determine the final end_pos for a given LAYOUTCOMMIT.  If there are
>> > multiple non-contiguous lsegs, each used for WRITEs then multiple
>> > LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
>> > byte-range will not overlap as required.
>> >
>>
>> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
>> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
>> covering all segments and a last_byte_written offset which is just the maximum.
>> Future layout types may need this method though...
>
> Is that safe?
>
> What if I'm doing blocks and have written layout segment 1 & 3, but not
> layout segment 2? I don't want to have the MDS commit layout segment 2,
> and make the (lack of) data there visible to future readers.
>

No, it is not safe.  Avoiding this problem is one of the major reasons
for putting the bookkeeping in the lseg.

Fred

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-24 16:48   ` Trond Myklebust
  2011-03-24 16:54     ` Fred Isaman
@ 2011-03-24 16:58     ` Benny Halevy
  2011-03-24 17:15       ` Trond Myklebust
  1 sibling, 1 reply; 15+ messages in thread
From: Benny Halevy @ 2011-03-24 16:58 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: William A. (Andy) Adamson, Fred Isaman, NFS list, David Black,
	Rees, James

On 2011-03-24 18:48, Trond Myklebust wrote:
> On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
>> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>>>>> Only whole file layout support means that there is only one IOMODE_RW layout
>>>>> segment.
>>>>>
>>>>> Signed-off-by: Andy Adamson <andros@netapp.com>
>>>>> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
>>>>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
>>>>> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
>>>>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>>>>> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
>>>>> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
>>>>> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
>>>>> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
>>>>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>>>>
>>>> The code in this patch is new and different enough from the one I/we
>>>> signed-off originally that they don't make sense here.
>>>
>>> Hi Benny
>>>
>>> OK with me
>>>
>>>>>
>>>>> +             /* references matched in nfs4_layoutcommit_release */
>>>>> +             wdata->lseg->pls_lc_cred =
>>>>> +                     get_rpccred(wdata->args.context->state->owner->so_cred);
>>>>> +             mark_inode_dirty_sync(wdata->inode);
>>>>> +             dprintk("%s: Set layoutcommit for inode %lu ",
>>>>> +                     __func__, wdata->inode->i_ino);
>>>>> +     }
>>>>> +     if (end_pos > wdata->lseg->pls_end_pos)
>>>>> +             wdata->lseg->pls_end_pos = end_pos;
>>>>
>>>> The end_pos is essentially per inode, why maintain it per lseg?
>>>> How do you see this working with multiple lsegs in mind?
>>>
>>> The end-pos is per lseg, not per inode - each layoutcommit applies to
>>> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>>>
>>> From Section 18.42.3
>>> .  The byte-range being committed is
>>>    specified through the byte-range (loca_offset and loca_length).  This
>>>    byte-range MUST overlap with one or more existing layouts previously
>>>    granted via LAYOUTGET
>>>
>>>
>>>    Also, loca_last_write_offset MUST overlap the range
>>>    described by loca_offset and loca_length.
>>>
>>> For the multiple lseg case: if the lsegs are merged, bookeeping
>>> end_pos per lseg just works. If a layoutdriver does not use merged
>>> lsegs, then there is a bit of work to do to walk the list of lsegs and
>>> determine the final end_pos for a given LAYOUTCOMMIT.  If there are
>>> multiple non-contiguous lsegs, each used for WRITEs then multiple
>>> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
>>> byte-range will not overlap as required.
>>>
>>
>> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
>> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
>> covering all segments and a last_byte_written offset which is just the maximum.
>> Future layout types may need this method though...
> 
> Is that safe?
> 
> What if I'm doing blocks and have written layout segment 1 & 3, but not
> layout segment 2? I don't want to have the MDS commit layout segment 2,
> and make the (lack of) data there visible to future readers.
> 

I'm not the real expert on pnfs-blocks but my interpretation of rfc5663 is that the
list of extents in pnfs_block_layoutupdate4 may be sparse (or holey if you'd like).
Note that the client may have written just parts of the layout it got in one layout segment.
In this case too, you don't want to send multiple LAYOUTCOMMITs for each contiguous
area...

Benny

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-24 16:58     ` Benny Halevy
@ 2011-03-24 17:15       ` Trond Myklebust
  2011-03-25  9:39         ` Benny Halevy
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-24 17:15 UTC (permalink / raw)
  To: Benny Halevy
  Cc: William A. (Andy) Adamson, Fred Isaman, NFS list, David Black,
	Rees, James

On Thu, 2011-03-24 at 18:58 +0200, Benny Halevy wrote:
> On 2011-03-24 18:48, Trond Myklebust wrote:
> > On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
> >> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
> >>>>> Only whole file layout support means that there is only one IOMODE_RW layout
> >>>>> segment.
> >>>>>
> >>>>> Signed-off-by: Andy Adamson <andros@netapp.com>
> >>>>> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
> >>>>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> >>>>> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
> >>>>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> >>>>> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
> >>>>> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
> >>>>> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
> >>>>> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
> >>>>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
> >>>>
> >>>> The code in this patch is new and different enough from the one I/we
> >>>> signed-off originally that they don't make sense here.
> >>>
> >>> Hi Benny
> >>>
> >>> OK with me
> >>>
> >>>>>
> >>>>> +             /* references matched in nfs4_layoutcommit_release */
> >>>>> +             wdata->lseg->pls_lc_cred =
> >>>>> +                     get_rpccred(wdata->args.context->state->owner->so_cred);
> >>>>> +             mark_inode_dirty_sync(wdata->inode);
> >>>>> +             dprintk("%s: Set layoutcommit for inode %lu ",
> >>>>> +                     __func__, wdata->inode->i_ino);
> >>>>> +     }
> >>>>> +     if (end_pos > wdata->lseg->pls_end_pos)
> >>>>> +             wdata->lseg->pls_end_pos = end_pos;
> >>>>
> >>>> The end_pos is essentially per inode, why maintain it per lseg?
> >>>> How do you see this working with multiple lsegs in mind?
> >>>
> >>> The end-pos is per lseg, not per inode - each layoutcommit applies to
> >>> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
> >>>
> >>> From Section 18.42.3
> >>> .  The byte-range being committed is
> >>>    specified through the byte-range (loca_offset and loca_length).  This
> >>>    byte-range MUST overlap with one or more existing layouts previously
> >>>    granted via LAYOUTGET
> >>>
> >>>
> >>>    Also, loca_last_write_offset MUST overlap the range
> >>>    described by loca_offset and loca_length.
> >>>
> >>> For the multiple lseg case: if the lsegs are merged, bookeeping
> >>> end_pos per lseg just works. If a layoutdriver does not use merged
> >>> lsegs, then there is a bit of work to do to walk the list of lsegs and
> >>> determine the final end_pos for a given LAYOUTCOMMIT.  If there are
> >>> multiple non-contiguous lsegs, each used for WRITEs then multiple
> >>> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
> >>> byte-range will not overlap as required.
> >>>
> >>
> >> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
> >> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
> >> covering all segments and a last_byte_written offset which is just the maximum.
> >> Future layout types may need this method though...
> > 
> > Is that safe?
> > 
> > What if I'm doing blocks and have written layout segment 1 & 3, but not
> > layout segment 2? I don't want to have the MDS commit layout segment 2,
> > and make the (lack of) data there visible to future readers.
> > 
> 
> I'm not the real expert on pnfs-blocks but my interpretation of rfc5663 is that the
> list of extents in pnfs_block_layoutupdate4 may be sparse (or holey if you'd like).
> Note that the client may have written just parts of the layout it got in one layout segment.
> In this case too, you don't want to send multiple LAYOUTCOMMITs for each contiguous
> area...

Sure, but my understanding was that RFC5663 supports copy on write files
and that the actual copying of the block may need to be done by the
client (see section 2.3.4).

If the new block is still uninitialised when you call LAYOUTCOMMIT, then
that will corrupt the file if the client then dies before it finishes
processing segment 2, since the previously valid data blocks are being
replaced by uninitialised ones (which I presume will convert that
section into a pre-allocated hole???).

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-24 17:15       ` Trond Myklebust
@ 2011-03-25  9:39         ` Benny Halevy
  0 siblings, 0 replies; 15+ messages in thread
From: Benny Halevy @ 2011-03-25  9:39 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: William A. (Andy) Adamson, Fred Isaman, NFS list, David Black,
	Rees, James

On 2011-03-24 19:15, Trond Myklebust wrote:

> On Thu, 2011-03-24 at 18:58 +0200, Benny Halevy wrote:
>> On 2011-03-24 18:48, Trond Myklebust wrote:
>>> On Thu, 2011-03-24 at 18:37 +0200, Benny Halevy wrote:
>>>> On 2011-03-24 15:57, William A. (Andy) Adamson wrote:
>>>>>>> Only whole file layout support means that there is only one IOMODE_RW layout
>>>>>>> segment.
>>>>>>>
>>>>>>> Signed-off-by: Andy Adamson <andros@netapp.com>
>>>>>>> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
>>>>>>> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
>>>>>>> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
>>>>>>> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
>>>>>>> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
>>>>>>> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
>>>>>>> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
>>>>>>> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
>>>>>>> Signed-off-by: Benny Halevy <bhalevy@panasas.com>
>>>>>> The code in this patch is new and different enough from the one I/we
>>>>>> signed-off originally that they don't make sense here.
>>>>> Hi Benny
>>>>>
>>>>> OK with me
>>>>>
>>>>>>> +             /* references matched in nfs4_layoutcommit_release */
>>>>>>> +             wdata->lseg->pls_lc_cred =
>>>>>>> +                     get_rpccred(wdata->args.context->state->owner->so_cred);
>>>>>>> +             mark_inode_dirty_sync(wdata->inode);
>>>>>>> +             dprintk("%s: Set layoutcommit for inode %lu ",
>>>>>>> +                     __func__, wdata->inode->i_ino);
>>>>>>> +     }
>>>>>>> +     if (end_pos > wdata->lseg->pls_end_pos)
>>>>>>> +             wdata->lseg->pls_end_pos = end_pos;
>>>>>> The end_pos is essentially per inode, why maintain it per lseg?
>>>>>> How do you see this working with multiple lsegs in mind?
>>>>> The end-pos is per lseg, not per inode - each layoutcommit applies to
>>>>> a range of WRITES for a layoutsegment over the LAYOUTCOMMIT range.
>>>>>
>>>>> From Section 18.42.3
>>>>> .  The byte-range being committed is
>>>>>    specified through the byte-range (loca_offset and loca_length).  This
>>>>>    byte-range MUST overlap with one or more existing layouts previously
>>>>>    granted via LAYOUTGET
>>>>>
>>>>>
>>>>>    Also, loca_last_write_offset MUST overlap the range
>>>>>    described by loca_offset and loca_length.
>>>>>
>>>>> For the multiple lseg case: if the lsegs are merged, bookeeping
>>>>> end_pos per lseg just works. If a layoutdriver does not use merged
>>>>> lsegs, then there is a bit of work to do to walk the list of lsegs and
>>>>> determine the final end_pos for a given LAYOUTCOMMIT.  If there are
>>>>> multiple non-contiguous lsegs, each used for WRITEs then multiple
>>>>> LAYOUTCOMMITs will need to be sent, otherwise the LAYOUTCOMMIT
>>>>> byte-range will not overlap as required.
>>>>>
>>>> For the current layout types I believe that the LAYOUTCOMMIT can "merge"
>>>> multiple layout segments into a single LAYOUTCOMMIT, with a byte range
>>>> covering all segments and a last_byte_written offset which is just the maximum.
>>>> Future layout types may need this method though...
>>> Is that safe?
>>>
>>> What if I'm doing blocks and have written layout segment 1 & 3, but not
>>> layout segment 2? I don't want to have the MDS commit layout segment 2,
>>> and make the (lack of) data there visible to future readers.
>>>
>> I'm not the real expert on pnfs-blocks but my interpretation of rfc5663 is that the
>> list of extents in pnfs_block_layoutupdate4 may be sparse (or holey if you'd like).
>> Note that the client may have written just parts of the layout it got in one layout segment.
>> In this case too, you don't want to send multiple LAYOUTCOMMITs for each contiguous
>> area...
> Sure, but my understanding was that RFC5663 supports copy on write files
> and that the actual copying of the block may need to be done by the
> client (see section 2.3.4).
>
> If the new block is still uninitialised when you call LAYOUTCOMMIT, then
> that will corrupt the file if the client then dies before it finishes
> processing segment 2, since the previously valid data blocks are being
> replaced by uninitialised ones (which I presume will convert that
> section into a pre-allocated hole???).
>
The extents that LAYOUTCOMMITted MUST NOT be PNFS_BLOCK_INVALID_DATA,
as noted in 2.3.2:

   The bex_state
   field of each extent in the blu_commit_list MUST be set to
   PNFS_BLOCK_READ_WRITE_DATA.

Benny



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
@ 2011-03-24 14:45 William A. (Andy) Adamson
  2011-03-24 17:06 ` Benny Halevy
  0 siblings, 1 reply; 15+ messages in thread
From: William A. (Andy) Adamson @ 2011-03-24 14:45 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Fred Isaman, Trond Myklebust, NFS list

Hi Benny

int sync is used because the struct writeback_control->sync_mode (an
enum) is assigned.

-->Andy

>> +pnfs_layoutcommit_inode(struct inode *inode, int sync)
>
> "bool sync" makes more sense

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-24 14:45 William A. (Andy) Adamson
@ 2011-03-24 17:06 ` Benny Halevy
  0 siblings, 0 replies; 15+ messages in thread
From: Benny Halevy @ 2011-03-24 17:06 UTC (permalink / raw)
  To: William A. (Andy) Adamson; +Cc: Fred Isaman, Trond Myklebust, NFS list

On 2011-03-24 16:45, William A. (Andy) Adamson wrote:
> Hi Benny
> 
> int sync is used because the struct writeback_control->sync_mode (an
> enum) is assigned.

Then the compiler will do the assignment into bool once (I hope it
optimizes for the enum values which are {0,1}) and thereafter use the
bool value.

Benny

> 
> -->Andy
> 
>>> +pnfs_layoutcommit_inode(struct inode *inode, int sync)
>>
>> "bool sync" makes more sense

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 00/12] NFSv4.1 pnfs wave5 submission (try 2)
@ 2011-03-23 13:27 Fred Isaman
  2011-03-23 13:27 ` [PATCH 11/12] NFSv4.1: layoutcommit Fred Isaman
  0 siblings, 1 reply; 15+ messages in thread
From: Fred Isaman @ 2011-03-23 13:27 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust

This is try 2 of the submission.  The first accidently had Andy's
LAYOUTCOMMIT patch squashed into one of the other patches.  No other
change was made from version 1.

These are the "wave5" pnfs patches, which implement COMMIT to data
server and LAYOUTCOMMIT, using the new rules from the proposed errata
which has come out of the Feb 2011 interim IETF meeting.  At this
point, the intent is that the files layout pNFS client is spec
compliant, though it has some obvious limitations, such as using only
whole file layouts.

These apply to Trond's trond/nfs-for-next branch, and are also available
at git://linux-nfs.org/~iisaman/linux-pnfs.git under the tag
wave5-submission-02

Fred

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-23 13:27 [PATCH 00/12] NFSv4.1 pnfs wave5 submission (try 2) Fred Isaman
@ 2011-03-23 13:27 ` Fred Isaman
  2011-03-23 20:33   ` Benny Halevy
  2011-03-23 21:00   ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Fred Isaman @ 2011-03-23 13:27 UTC (permalink / raw)
  To: linux-nfs; +Cc: Trond Myklebust

From: Andy Adamson <andros@netapp.com>

The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to
the data server (as opposed to the MDS) and the data server WRITE
is not NFS_FILE_SYNC.

Only whole file layout support means that there is only one IOMODE_RW layout
segment.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
Tested-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Fred Isaman <iisaman@netapp.com>
---
 fs/nfs/file.c           |    3 +
 fs/nfs/nfs4_fs.h        |    2 +
 fs/nfs/nfs4filelayout.c |   18 +++++++
 fs/nfs/nfs4proc.c       |   94 ++++++++++++++++++++++++++++++++++
 fs/nfs/nfs4xdr.c        |  129 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/nfs/pnfs.c           |   94 ++++++++++++++++++++++++++++++++++
 fs/nfs/pnfs.h           |    9 +++-
 fs/nfs/write.c          |   15 +++++-
 include/linux/nfs4.h    |    1 +
 include/linux/nfs_fs.h  |    1 +
 include/linux/nfs_xdr.h |   23 ++++++++
 11 files changed, 387 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d85a534..85cb95d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -326,6 +326,9 @@ nfs_file_fsync(struct file *file, int datasync)
 		ret = xchg(&ctx->error, 0);
 	if (!ret && status < 0)
 		ret = status;
+	if (!ret && !datasync)
+		/* application has asked for meta-data sync */
+		ret = pnfs_layoutcommit_inode(inode, 1);
 	return ret;
 }
 
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index c64be1c..1e612d1 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -262,6 +262,8 @@ extern int nfs4_proc_destroy_session(struct nfs4_session *);
 extern int nfs4_init_session(struct nfs_server *server);
 extern int nfs4_proc_get_lease_time(struct nfs_client *clp,
 		struct nfs_fsinfo *fsinfo);
+extern int nfs4_proc_layoutcommit(struct nfs4_layoutcommit_data *data,
+				  int sync);
 
 static inline bool
 is_ds_only_client(struct nfs_client *clp)
diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
index 97e75a2..fc1a0e9 100644
--- a/fs/nfs/nfs4filelayout.c
+++ b/fs/nfs/nfs4filelayout.c
@@ -154,6 +154,23 @@ static int filelayout_read_done_cb(struct rpc_task *task,
 }
 
 /*
+ * We reference the rpc_cred of the first WRITE that triggers the need for
+ * a LAYOUTCOMMIT, and use it to send the layoutcommit compound.
+ * rfc5661 is not clear about which credential should be used.
+ */
+static void
+filelayout_set_layoutcommit(struct nfs_write_data *wdata)
+{
+	if (FILELAYOUT_LSEG(wdata->lseg)->commit_through_mds ||
+	    wdata->res.verf->committed == NFS_FILE_SYNC)
+		return;
+
+	pnfs_set_layoutcommit(wdata);
+	dprintk("%s ionde %lu pls_end_pos %lu\n", __func__, wdata->inode->i_ino,
+		(unsigned long) wdata->lseg->pls_end_pos);
+}
+
+/*
  * Call ops for the async read/write cases
  * In the case of dense layouts, the offset needs to be reset to its
  * original value.
@@ -210,6 +227,7 @@ static int filelayout_write_done_cb(struct rpc_task *task,
 		return -EAGAIN;
 	}
 
+	filelayout_set_layoutcommit(data);
 	return 0;
 }
 
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 5d61ccc..6f2f402 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5616,6 +5616,100 @@ int nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
 }
 EXPORT_SYMBOL_GPL(nfs4_proc_getdeviceinfo);
 
+static void nfs4_layoutcommit_prepare(struct rpc_task *task, void *calldata)
+{
+	struct nfs4_layoutcommit_data *data = calldata;
+	struct nfs_server *server = NFS_SERVER(data->args.inode);
+
+	if (nfs4_setup_sequence(server, &data->args.seq_args,
+				&data->res.seq_res, 1, task))
+		return;
+	rpc_call_start(task);
+}
+
+static void
+nfs4_layoutcommit_done(struct rpc_task *task, void *calldata)
+{
+	struct nfs4_layoutcommit_data *data = calldata;
+	struct nfs_server *server = NFS_SERVER(data->args.inode);
+
+	if (!nfs4_sequence_done(task, &data->res.seq_res))
+		return;
+
+	switch (task->tk_status) { /* Just ignore these failures */
+	case NFS4ERR_DELEG_REVOKED: /* layout was recalled */
+	case NFS4ERR_BADIOMODE:     /* no IOMODE_RW layout for range */
+	case NFS4ERR_BADLAYOUT:     /* no layout */
+	case NFS4ERR_GRACE:	    /* loca_recalim always false */
+		task->tk_status = 0;
+	}
+
+	if (nfs4_async_handle_error(task, server, NULL) == -EAGAIN) {
+		nfs_restart_rpc(task, server->nfs_client);
+		return;
+	}
+
+	if (task->tk_status == 0)
+		nfs_post_op_update_inode_force_wcc(data->args.inode,
+						   data->res.fattr);
+}
+
+static void nfs4_layoutcommit_release(void *calldata)
+{
+	struct nfs4_layoutcommit_data *data = calldata;
+
+	/* Matched by references in pnfs_set_layoutcommit */
+	put_lseg(data->lseg);
+	put_rpccred(data->cred);
+	kfree(data);
+}
+
+static const struct rpc_call_ops nfs4_layoutcommit_ops = {
+	.rpc_call_prepare = nfs4_layoutcommit_prepare,
+	.rpc_call_done = nfs4_layoutcommit_done,
+	.rpc_release = nfs4_layoutcommit_release,
+};
+
+int
+nfs4_proc_layoutcommit(struct nfs4_layoutcommit_data *data, int sync)
+{
+	struct rpc_message msg = {
+		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_LAYOUTCOMMIT],
+		.rpc_argp = &data->args,
+		.rpc_resp = &data->res,
+		.rpc_cred = data->cred,
+	};
+	struct rpc_task_setup task_setup_data = {
+		.task = &data->task,
+		.rpc_client = NFS_CLIENT(data->args.inode),
+		.rpc_message = &msg,
+		.callback_ops = &nfs4_layoutcommit_ops,
+		.callback_data = data,
+		.flags = RPC_TASK_ASYNC,
+	};
+	struct rpc_task *task;
+	int status = 0;
+
+	dprintk("NFS: %4d initiating layoutcommit call. sync %d "
+		"lbw: %llu inode %lu\n",
+		data->task.tk_pid, sync,
+		data->args.lastbytewritten,
+		data->args.inode->i_ino);
+
+	task = rpc_run_task(&task_setup_data);
+	if (IS_ERR(task))
+		return PTR_ERR(task);
+	if (!sync)
+		goto out;
+	status = nfs4_wait_for_completion_rpc_task(task);
+	if (status != 0)
+		goto out;
+	status = task->tk_status;
+out:
+	dprintk("%s: status %d\n", __func__, status);
+	rpc_put_task(task);
+	return status;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 struct nfs4_state_recovery_ops nfs40_reboot_recovery_ops = {
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 07cdf92..207d399 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -324,6 +324,18 @@ static int nfs4_stat_to_errno(int);
 #define decode_layoutget_maxsz	(op_decode_hdr_maxsz + 8 + \
 				decode_stateid_maxsz + \
 				XDR_QUADLEN(PNFS_LAYOUT_MAXSIZE))
+#define encode_layoutcommit_maxsz (op_encode_hdr_maxsz +          \
+				2 /* offset */ + \
+				2 /* length */ + \
+				1 /* reclaim */ + \
+				encode_stateid_maxsz + \
+				1 /* new offset (true) */ + \
+				2 /* last byte written */ + \
+				1 /* nt_timechanged (false) */ + \
+				1 /* layoutupdate4 layout type */ + \
+				1 /* NULL filelayout layoutupdate4 payload */)
+#define decode_layoutcommit_maxsz (op_decode_hdr_maxsz + 3)
+
 #else /* CONFIG_NFS_V4_1 */
 #define encode_sequence_maxsz	0
 #define decode_sequence_maxsz	0
@@ -727,6 +739,17 @@ static int nfs4_stat_to_errno(int);
 				decode_sequence_maxsz + \
 				decode_putfh_maxsz +        \
 				decode_layoutget_maxsz)
+#define NFS4_enc_layoutcommit_sz (compound_encode_hdr_maxsz + \
+				encode_sequence_maxsz +\
+				encode_putfh_maxsz + \
+				encode_layoutcommit_maxsz + \
+				encode_getattr_maxsz)
+#define NFS4_dec_layoutcommit_sz (compound_decode_hdr_maxsz + \
+				decode_sequence_maxsz + \
+				decode_putfh_maxsz + \
+				decode_layoutcommit_maxsz + \
+				decode_getattr_maxsz)
+
 
 const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH +
 				      compound_encode_hdr_maxsz +
@@ -1816,6 +1839,34 @@ encode_layoutget(struct xdr_stream *xdr,
 	hdr->nops++;
 	hdr->replen += decode_layoutget_maxsz;
 }
+
+static int
+encode_layoutcommit(struct xdr_stream *xdr,
+		    const struct nfs4_layoutcommit_args *args,
+		    struct compound_hdr *hdr)
+{
+	__be32 *p;
+
+	dprintk("%s: lbw: %llu type: %d\n", __func__, args->lastbytewritten,
+		NFS_SERVER(args->inode)->pnfs_curr_ld->id);
+
+	p = reserve_space(xdr, 48 + NFS4_STATEID_SIZE);
+	*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
+	/* Only whole file layouts */
+	p = xdr_encode_hyper(p, 0); /* offset */
+	p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
+	*p++ = cpu_to_be32(0); /* reclaim */
+	p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
+	*p++ = cpu_to_be32(1); /* newoffset = TRUE */
+	p = xdr_encode_hyper(p, args->lastbytewritten);
+	*p++ = cpu_to_be32(0); /* Never send time_modify_changed */
+	*p++ = cpu_to_be32(NFS_SERVER(args->inode)->pnfs_curr_ld->id);/* type */
+	*p++ = cpu_to_be32(0); /* no file layout payload */
+
+	hdr->nops++;
+	hdr->replen += decode_layoutcommit_maxsz;
+	return 0;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 /*
@@ -2607,6 +2658,26 @@ static void nfs4_xdr_enc_layoutget(struct rpc_rqst *req,
 	encode_layoutget(xdr, args, &hdr);
 	encode_nops(&hdr);
 }
+
+/*
+ *  Encode LAYOUTCOMMIT request
+ */
+static int nfs4_xdr_enc_layoutcommit(struct rpc_rqst *req,
+				     struct xdr_stream *xdr,
+				     struct nfs4_layoutcommit_args *args)
+{
+	struct compound_hdr hdr = {
+		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
+	};
+
+	encode_compound_hdr(xdr, req, &hdr);
+	encode_sequence(xdr, &args->seq_args, &hdr);
+	encode_putfh(xdr, NFS_FH(args->inode), &hdr);
+	encode_layoutcommit(xdr, args, &hdr);
+	encode_getfattr(xdr, args->bitmask, &hdr);
+	encode_nops(&hdr);
+	return 0;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 static void print_overflow_msg(const char *func, const struct xdr_stream *xdr)
@@ -5007,6 +5078,35 @@ out_overflow:
 	print_overflow_msg(__func__, xdr);
 	return -EIO;
 }
+
+static int decode_layoutcommit(struct xdr_stream *xdr,
+			       struct rpc_rqst *req,
+			       struct nfs4_layoutcommit_res *res)
+{
+	__be32 *p;
+	__u32 sizechanged;
+	int status;
+
+	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
+	if (status)
+		return status;
+
+	p = xdr_inline_decode(xdr, 4);
+	if (unlikely(!p))
+		goto out_overflow;
+	sizechanged = be32_to_cpup(p);
+
+	if (sizechanged) {
+		/* throw away new size */
+		p = xdr_inline_decode(xdr, 8);
+		if (unlikely(!p))
+			goto out_overflow;
+	}
+	return 0;
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 /*
@@ -6068,6 +6168,34 @@ static int nfs4_xdr_dec_layoutget(struct rpc_rqst *rqstp,
 out:
 	return status;
 }
+
+/*
+ * Decode LAYOUTCOMMIT response
+ */
+static int nfs4_xdr_dec_layoutcommit(struct rpc_rqst *rqstp,
+				     struct xdr_stream *xdr,
+				     struct nfs4_layoutcommit_res *res)
+{
+	struct compound_hdr hdr;
+	int status;
+
+	status = decode_compound_hdr(xdr, &hdr);
+	if (status)
+		goto out;
+	status = decode_sequence(xdr, &res->seq_res, rqstp);
+	if (status)
+		goto out;
+	status = decode_putfh(xdr);
+	if (status)
+		goto out;
+	status = decode_layoutcommit(xdr, rqstp, res);
+	if (status)
+		goto out;
+	decode_getfattr(xdr, res->fattr, res->server,
+			!RPC_IS_ASYNC(rqstp->rq_task));
+out:
+	return status;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 /**
@@ -6269,6 +6397,7 @@ struct rpc_procinfo	nfs4_procedures[] = {
 	PROC(RECLAIM_COMPLETE,	enc_reclaim_complete,	dec_reclaim_complete),
 	PROC(GETDEVICEINFO,	enc_getdeviceinfo,	dec_getdeviceinfo),
 	PROC(LAYOUTGET,		enc_layoutget,		dec_layoutget),
+	PROC(LAYOUTCOMMIT,	enc_layoutcommit,	dec_layoutcommit),
 #endif /* CONFIG_NFS_V4_1 */
 };
 
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index c675659..2a08ca0 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -946,3 +946,97 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
 	dprintk("%s End (trypnfs:%d)\n", __func__, trypnfs);
 	return trypnfs;
 }
+
+/*
+ * Currently there is only one (whole file) write lseg.
+ */
+static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
+{
+	struct pnfs_layout_segment *lseg, *rv = NULL;
+
+	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
+		if (lseg->pls_range.iomode == IOMODE_RW)
+			rv = lseg;
+	return rv;
+}
+
+void
+pnfs_set_layoutcommit(struct nfs_write_data *wdata)
+{
+	struct nfs_inode *nfsi = NFS_I(wdata->inode);
+	loff_t end_pos = wdata->args.offset + wdata->res.count;
+
+	spin_lock(&nfsi->vfs_inode.i_lock);
+	if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
+		/* references matched in nfs4_layoutcommit_release */
+		get_lseg(wdata->lseg);
+		wdata->lseg->pls_lc_cred =
+			get_rpccred(wdata->args.context->state->owner->so_cred);
+		mark_inode_dirty_sync(wdata->inode);
+		dprintk("%s: Set layoutcommit for inode %lu ",
+			__func__, wdata->inode->i_ino);
+	}
+	if (end_pos > wdata->lseg->pls_end_pos)
+		wdata->lseg->pls_end_pos = end_pos;
+	spin_unlock(&nfsi->vfs_inode.i_lock);
+}
+EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
+
+int
+pnfs_layoutcommit_inode(struct inode *inode, int sync)
+{
+	struct nfs4_layoutcommit_data *data;
+	struct nfs_inode *nfsi = NFS_I(inode);
+	struct pnfs_layout_segment *lseg;
+	struct rpc_cred *cred;
+	loff_t end_pos;
+	int status = 0;
+
+	dprintk("--> %s inode %lu\n", __func__, inode->i_ino);
+
+	/* Note kzalloc ensures data->res.seq_res.sr_slot == NULL */
+	data = kzalloc(sizeof(*data), GFP_NOFS);
+	spin_lock(&inode->i_lock);
+
+	if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
+		spin_unlock(&inode->i_lock);
+		kfree(data);
+		goto out;
+	}
+	/*
+	 * Currently only one (whole file) write lseg which is referenced
+	 * in pnfs_set_layoutcommit and will be found.
+	 */
+	lseg = pnfs_list_write_lseg(inode);
+
+	end_pos = lseg->pls_end_pos;
+	cred = lseg->pls_lc_cred;
+	lseg->pls_end_pos = 0;
+	lseg->pls_lc_cred = NULL;
+
+	if (!data) {
+		put_lseg(lseg);
+		spin_unlock(&inode->i_lock);
+		put_rpccred(cred);
+		status = -ENOMEM;
+		goto out;
+	} else {
+		memcpy(&data->args.stateid.data, nfsi->layout->plh_stateid.data,
+			sizeof(nfsi->layout->plh_stateid.data));
+	}
+	spin_unlock(&inode->i_lock);
+
+	data->args.inode = inode;
+	data->lseg = lseg;
+	data->cred = cred;
+	nfs_fattr_init(&data->fattr);
+	data->args.bitmask = NFS_SERVER(inode)->cache_consistency_bitmask;
+	data->res.fattr = &data->fattr;
+	data->args.lastbytewritten = end_pos - 1;
+	data->res.server = NFS_SERVER(inode);
+
+	status = nfs4_proc_layoutcommit(data, sync);
+out:
+	dprintk("<-- %s status %d\n", __func__, status);
+	return status;
+}
diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
index 5370f1b..0806c77 100644
--- a/fs/nfs/pnfs.h
+++ b/fs/nfs/pnfs.h
@@ -43,6 +43,8 @@ struct pnfs_layout_segment {
 	atomic_t pls_refcount;
 	unsigned long pls_flags;
 	struct pnfs_layout_hdr *pls_layout;
+	struct rpc_cred	*pls_lc_cred; /* LAYOUTCOMMIT credential */
+	loff_t pls_end_pos; /* LAYOUTCOMMIT write end */
 };
 
 enum pnfs_try_status {
@@ -152,7 +154,8 @@ bool pnfs_roc(struct inode *ino);
 void pnfs_roc_release(struct inode *ino);
 void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
 bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
-
+void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
+int pnfs_layoutcommit_inode(struct inode *inode, int sync);
 
 static inline int lo_fail_bit(u32 iomode)
 {
@@ -325,6 +328,10 @@ static inline void pnfs_clear_request_commit(struct nfs_page *req)
 {
 }
 
+static inline int pnfs_layoutcommit_inode(struct inode *inode, int sync)
+{
+	return 0;
+}
 #endif /* CONFIG_NFS_V4_1 */
 
 #endif /* FS_NFS_PNFS_H */
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index e7aeda0..a03c11f 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1562,7 +1562,20 @@ static int nfs_commit_unstable_pages(struct inode *inode, struct writeback_contr
 
 int nfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return nfs_commit_unstable_pages(inode, wbc);
+	int ret;
+
+	ret = nfs_commit_unstable_pages(inode, wbc);
+	if (ret >= 0 && test_bit(NFS_INO_LAYOUTCOMMIT, &NFS_I(inode)->flags)) {
+		int status, sync = wbc->sync_mode;
+
+		if (wbc->nonblocking || wbc->for_background)
+				sync = 0;
+
+		status = pnfs_layoutcommit_inode(inode, sync);
+		if (status < 0)
+			return status;
+	}
+	return ret;
 }
 
 /*
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index 134716e..bf22cfa 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -560,6 +560,7 @@ enum {
 	NFSPROC4_CLNT_RECLAIM_COMPLETE,
 	NFSPROC4_CLNT_LAYOUTGET,
 	NFSPROC4_CLNT_GETDEVICEINFO,
+	NFSPROC4_CLNT_LAYOUTCOMMIT,
 };
 
 /* nfs41 types */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 9d434f0..5765126 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -223,6 +223,7 @@ struct nfs_inode {
 #define NFS_INO_FSCACHE_LOCK	(6)		/* FS-Cache cookie management lock */
 #define NFS_INO_COMMIT		(7)		/* inode is committing unstable writes */
 #define NFS_INO_PNFS_COMMIT	(8)		/* use pnfs code for commit */
+#define NFS_INO_LAYOUTCOMMIT	(9)		/* layoutcommit required */
 
 static inline struct nfs_inode *NFS_I(const struct inode *inode)
 {
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index ac0c0e5..84f3585 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -236,6 +236,29 @@ struct nfs4_getdeviceinfo_res {
 	struct nfs4_sequence_res seq_res;
 };
 
+struct nfs4_layoutcommit_args {
+	nfs4_stateid stateid;
+	__u64 lastbytewritten;
+	struct inode *inode;
+	const u32 *bitmask;
+	struct nfs4_sequence_args seq_args;
+};
+
+struct nfs4_layoutcommit_res {
+	struct nfs_fattr *fattr;
+	const struct nfs_server *server;
+	struct nfs4_sequence_res seq_res;
+};
+
+struct nfs4_layoutcommit_data {
+	struct rpc_task task;
+	struct nfs_fattr fattr;
+	struct pnfs_layout_segment *lseg;
+	struct rpc_cred *cred;
+	struct nfs4_layoutcommit_args args;
+	struct nfs4_layoutcommit_res res;
+};
+
 /*
  * Arguments to the open call.
  */
-- 
1.7.2.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-23 13:27 ` [PATCH 11/12] NFSv4.1: layoutcommit Fred Isaman
@ 2011-03-23 20:33   ` Benny Halevy
  2011-03-23 21:00   ` Christoph Hellwig
  1 sibling, 0 replies; 15+ messages in thread
From: Benny Halevy @ 2011-03-23 20:33 UTC (permalink / raw)
  To: Fred Isaman, Andy Adamson; +Cc: linux-nfs, Trond Myklebust

On 2011-03-23 15:27, Fred Isaman wrote:
> From: Andy Adamson <andros@netapp.com>
> 
> The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to
> the data server (as opposed to the MDS) and the data server WRITE
> is not NFS_FILE_SYNC.
> 
> Only whole file layout support means that there is only one IOMODE_RW layout
> segment.
> 
> Signed-off-by: Andy Adamson <andros@netapp.com>
> Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
> Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
> Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
> Signed-off-by: Mingyang Guo <guomingyang@nrchpc.ac.cn>
> Signed-off-by: Tao Guo <guotao@nrchpc.ac.cn>
> Signed-off-by: Zhang Jingwang <zhangjingwang@nrchpc.ac.cn>
> Tested-by: Boaz Harrosh <bharrosh@panasas.com>
> Signed-off-by: Benny Halevy <bhalevy@panasas.com>

The code in this patch is new and different enough from the one I/we
signed-off originally that they don't make sense here.

> Signed-off-by: Fred Isaman <iisaman@netapp.com>
> ---
>  fs/nfs/file.c           |    3 +
>  fs/nfs/nfs4_fs.h        |    2 +
>  fs/nfs/nfs4filelayout.c |   18 +++++++
>  fs/nfs/nfs4proc.c       |   94 ++++++++++++++++++++++++++++++++++
>  fs/nfs/nfs4xdr.c        |  129 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nfs/pnfs.c           |   94 ++++++++++++++++++++++++++++++++++
>  fs/nfs/pnfs.h           |    9 +++-
>  fs/nfs/write.c          |   15 +++++-
>  include/linux/nfs4.h    |    1 +
>  include/linux/nfs_fs.h  |    1 +
>  include/linux/nfs_xdr.h |   23 ++++++++
>  11 files changed, 387 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index d85a534..85cb95d 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -326,6 +326,9 @@ nfs_file_fsync(struct file *file, int datasync)
>  		ret = xchg(&ctx->error, 0);
>  	if (!ret && status < 0)
>  		ret = status;
> +	if (!ret && !datasync)
> +		/* application has asked for meta-data sync */
> +		ret = pnfs_layoutcommit_inode(inode, 1);
>  	return ret;
>  }
>  
> diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
> index c64be1c..1e612d1 100644
> --- a/fs/nfs/nfs4_fs.h
> +++ b/fs/nfs/nfs4_fs.h
> @@ -262,6 +262,8 @@ extern int nfs4_proc_destroy_session(struct nfs4_session *);
>  extern int nfs4_init_session(struct nfs_server *server);
>  extern int nfs4_proc_get_lease_time(struct nfs_client *clp,
>  		struct nfs_fsinfo *fsinfo);
> +extern int nfs4_proc_layoutcommit(struct nfs4_layoutcommit_data *data,
> +				  int sync);

bool sync is better

>  
>  static inline bool
>  is_ds_only_client(struct nfs_client *clp)
> diff --git a/fs/nfs/nfs4filelayout.c b/fs/nfs/nfs4filelayout.c
> index 97e75a2..fc1a0e9 100644
> --- a/fs/nfs/nfs4filelayout.c
> +++ b/fs/nfs/nfs4filelayout.c
> @@ -154,6 +154,23 @@ static int filelayout_read_done_cb(struct rpc_task *task,
>  }
>  
>  /*
> + * We reference the rpc_cred of the first WRITE that triggers the need for
> + * a LAYOUTCOMMIT, and use it to send the layoutcommit compound.
> + * rfc5661 is not clear about which credential should be used.
> + */
> +static void
> +filelayout_set_layoutcommit(struct nfs_write_data *wdata)
> +{
> +	if (FILELAYOUT_LSEG(wdata->lseg)->commit_through_mds ||
> +	    wdata->res.verf->committed == NFS_FILE_SYNC)
> +		return;
> +
> +	pnfs_set_layoutcommit(wdata);
> +	dprintk("%s ionde %lu pls_end_pos %lu\n", __func__, wdata->inode->i_ino,
> +		(unsigned long) wdata->lseg->pls_end_pos);
> +}
> +
> +/*
>   * Call ops for the async read/write cases
>   * In the case of dense layouts, the offset needs to be reset to its
>   * original value.
> @@ -210,6 +227,7 @@ static int filelayout_write_done_cb(struct rpc_task *task,
>  		return -EAGAIN;
>  	}
>  
> +	filelayout_set_layoutcommit(data);
>  	return 0;
>  }
>  
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 5d61ccc..6f2f402 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -5616,6 +5616,100 @@ int nfs4_proc_getdeviceinfo(struct nfs_server *server, struct pnfs_device *pdev)
>  }
>  EXPORT_SYMBOL_GPL(nfs4_proc_getdeviceinfo);
>  
> +static void nfs4_layoutcommit_prepare(struct rpc_task *task, void *calldata)
> +{
> +	struct nfs4_layoutcommit_data *data = calldata;
> +	struct nfs_server *server = NFS_SERVER(data->args.inode);
> +
> +	if (nfs4_setup_sequence(server, &data->args.seq_args,
> +				&data->res.seq_res, 1, task))
> +		return;
> +	rpc_call_start(task);
> +}
> +
> +static void
> +nfs4_layoutcommit_done(struct rpc_task *task, void *calldata)
> +{
> +	struct nfs4_layoutcommit_data *data = calldata;
> +	struct nfs_server *server = NFS_SERVER(data->args.inode);
> +
> +	if (!nfs4_sequence_done(task, &data->res.seq_res))
> +		return;
> +
> +	switch (task->tk_status) { /* Just ignore these failures */
> +	case NFS4ERR_DELEG_REVOKED: /* layout was recalled */
> +	case NFS4ERR_BADIOMODE:     /* no IOMODE_RW layout for range */
> +	case NFS4ERR_BADLAYOUT:     /* no layout */
> +	case NFS4ERR_GRACE:	    /* loca_recalim always false */
> +		task->tk_status = 0;
> +	}
> +
> +	if (nfs4_async_handle_error(task, server, NULL) == -EAGAIN) {
> +		nfs_restart_rpc(task, server->nfs_client);
> +		return;
> +	}
> +
> +	if (task->tk_status == 0)
> +		nfs_post_op_update_inode_force_wcc(data->args.inode,
> +						   data->res.fattr);
> +}
> +
> +static void nfs4_layoutcommit_release(void *calldata)
> +{
> +	struct nfs4_layoutcommit_data *data = calldata;
> +
> +	/* Matched by references in pnfs_set_layoutcommit */
> +	put_lseg(data->lseg);
> +	put_rpccred(data->cred);
> +	kfree(data);
> +}
> +
> +static const struct rpc_call_ops nfs4_layoutcommit_ops = {
> +	.rpc_call_prepare = nfs4_layoutcommit_prepare,
> +	.rpc_call_done = nfs4_layoutcommit_done,
> +	.rpc_release = nfs4_layoutcommit_release,
> +};
> +
> +int
> +nfs4_proc_layoutcommit(struct nfs4_layoutcommit_data *data, int sync)

bool sync

> +{
> +	struct rpc_message msg = {
> +		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_LAYOUTCOMMIT],
> +		.rpc_argp = &data->args,
> +		.rpc_resp = &data->res,
> +		.rpc_cred = data->cred,
> +	};
> +	struct rpc_task_setup task_setup_data = {
> +		.task = &data->task,
> +		.rpc_client = NFS_CLIENT(data->args.inode),
> +		.rpc_message = &msg,
> +		.callback_ops = &nfs4_layoutcommit_ops,
> +		.callback_data = data,
> +		.flags = RPC_TASK_ASYNC,
> +	};
> +	struct rpc_task *task;
> +	int status = 0;
> +
> +	dprintk("NFS: %4d initiating layoutcommit call. sync %d "
> +		"lbw: %llu inode %lu\n",
> +		data->task.tk_pid, sync,
> +		data->args.lastbytewritten,
> +		data->args.inode->i_ino);
> +
> +	task = rpc_run_task(&task_setup_data);
> +	if (IS_ERR(task))
> +		return PTR_ERR(task);
> +	if (!sync)
> +		goto out;
> +	status = nfs4_wait_for_completion_rpc_task(task);
> +	if (status != 0)
> +		goto out;
> +	status = task->tk_status;
> +out:
> +	dprintk("%s: status %d\n", __func__, status);
> +	rpc_put_task(task);
> +	return status;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  struct nfs4_state_recovery_ops nfs40_reboot_recovery_ops = {
> diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
> index 07cdf92..207d399 100644
> --- a/fs/nfs/nfs4xdr.c
> +++ b/fs/nfs/nfs4xdr.c
> @@ -324,6 +324,18 @@ static int nfs4_stat_to_errno(int);
>  #define decode_layoutget_maxsz	(op_decode_hdr_maxsz + 8 + \
>  				decode_stateid_maxsz + \
>  				XDR_QUADLEN(PNFS_LAYOUT_MAXSIZE))
> +#define encode_layoutcommit_maxsz (op_encode_hdr_maxsz +          \
> +				2 /* offset */ + \
> +				2 /* length */ + \
> +				1 /* reclaim */ + \
> +				encode_stateid_maxsz + \
> +				1 /* new offset (true) */ + \
> +				2 /* last byte written */ + \
> +				1 /* nt_timechanged (false) */ + \
> +				1 /* layoutupdate4 layout type */ + \
> +				1 /* NULL filelayout layoutupdate4 payload */)
> +#define decode_layoutcommit_maxsz (op_decode_hdr_maxsz + 3)
> +
>  #else /* CONFIG_NFS_V4_1 */
>  #define encode_sequence_maxsz	0
>  #define decode_sequence_maxsz	0
> @@ -727,6 +739,17 @@ static int nfs4_stat_to_errno(int);
>  				decode_sequence_maxsz + \
>  				decode_putfh_maxsz +        \
>  				decode_layoutget_maxsz)
> +#define NFS4_enc_layoutcommit_sz (compound_encode_hdr_maxsz + \
> +				encode_sequence_maxsz +\
> +				encode_putfh_maxsz + \
> +				encode_layoutcommit_maxsz + \
> +				encode_getattr_maxsz)
> +#define NFS4_dec_layoutcommit_sz (compound_decode_hdr_maxsz + \
> +				decode_sequence_maxsz + \
> +				decode_putfh_maxsz + \
> +				decode_layoutcommit_maxsz + \
> +				decode_getattr_maxsz)
> +
>  
>  const u32 nfs41_maxwrite_overhead = ((RPC_MAX_HEADER_WITH_AUTH +
>  				      compound_encode_hdr_maxsz +
> @@ -1816,6 +1839,34 @@ encode_layoutget(struct xdr_stream *xdr,
>  	hdr->nops++;
>  	hdr->replen += decode_layoutget_maxsz;
>  }
> +
> +static int
> +encode_layoutcommit(struct xdr_stream *xdr,
> +		    const struct nfs4_layoutcommit_args *args,
> +		    struct compound_hdr *hdr)
> +{
> +	__be32 *p;
> +
> +	dprintk("%s: lbw: %llu type: %d\n", __func__, args->lastbytewritten,
> +		NFS_SERVER(args->inode)->pnfs_curr_ld->id);
> +
> +	p = reserve_space(xdr, 48 + NFS4_STATEID_SIZE);
> +	*p++ = cpu_to_be32(OP_LAYOUTCOMMIT);
> +	/* Only whole file layouts */
> +	p = xdr_encode_hyper(p, 0); /* offset */
> +	p = xdr_encode_hyper(p, NFS4_MAX_UINT64); /* length */
> +	*p++ = cpu_to_be32(0); /* reclaim */
> +	p = xdr_encode_opaque_fixed(p, args->stateid.data, NFS4_STATEID_SIZE);
> +	*p++ = cpu_to_be32(1); /* newoffset = TRUE */
> +	p = xdr_encode_hyper(p, args->lastbytewritten);
> +	*p++ = cpu_to_be32(0); /* Never send time_modify_changed */
> +	*p++ = cpu_to_be32(NFS_SERVER(args->inode)->pnfs_curr_ld->id);/* type */

s/type/layout type/

> +	*p++ = cpu_to_be32(0); /* no file layout payload */
> +
> +	hdr->nops++;
> +	hdr->replen += decode_layoutcommit_maxsz;
> +	return 0;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  /*
> @@ -2607,6 +2658,26 @@ static void nfs4_xdr_enc_layoutget(struct rpc_rqst *req,
>  	encode_layoutget(xdr, args, &hdr);
>  	encode_nops(&hdr);
>  }
> +
> +/*
> + *  Encode LAYOUTCOMMIT request
> + */
> +static int nfs4_xdr_enc_layoutcommit(struct rpc_rqst *req,
> +				     struct xdr_stream *xdr,
> +				     struct nfs4_layoutcommit_args *args)
> +{
> +	struct compound_hdr hdr = {
> +		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
> +	};
> +
> +	encode_compound_hdr(xdr, req, &hdr);
> +	encode_sequence(xdr, &args->seq_args, &hdr);
> +	encode_putfh(xdr, NFS_FH(args->inode), &hdr);
> +	encode_layoutcommit(xdr, args, &hdr);
> +	encode_getfattr(xdr, args->bitmask, &hdr);
> +	encode_nops(&hdr);
> +	return 0;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  static void print_overflow_msg(const char *func, const struct xdr_stream *xdr)
> @@ -5007,6 +5078,35 @@ out_overflow:
>  	print_overflow_msg(__func__, xdr);
>  	return -EIO;
>  }
> +
> +static int decode_layoutcommit(struct xdr_stream *xdr,
> +			       struct rpc_rqst *req,
> +			       struct nfs4_layoutcommit_res *res)
> +{
> +	__be32 *p;
> +	__u32 sizechanged;
> +	int status;
> +
> +	status = decode_op_hdr(xdr, OP_LAYOUTCOMMIT);
> +	if (status)
> +		return status;
> +
> +	p = xdr_inline_decode(xdr, 4);
> +	if (unlikely(!p))
> +		goto out_overflow;
> +	sizechanged = be32_to_cpup(p);
> +
> +	if (sizechanged) {
> +		/* throw away new size */
> +		p = xdr_inline_decode(xdr, 8);
> +		if (unlikely(!p))
> +			goto out_overflow;
> +	}
> +	return 0;
> +out_overflow:
> +	print_overflow_msg(__func__, xdr);
> +	return -EIO;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  /*
> @@ -6068,6 +6168,34 @@ static int nfs4_xdr_dec_layoutget(struct rpc_rqst *rqstp,
>  out:
>  	return status;
>  }
> +
> +/*
> + * Decode LAYOUTCOMMIT response
> + */
> +static int nfs4_xdr_dec_layoutcommit(struct rpc_rqst *rqstp,
> +				     struct xdr_stream *xdr,
> +				     struct nfs4_layoutcommit_res *res)
> +{
> +	struct compound_hdr hdr;
> +	int status;
> +
> +	status = decode_compound_hdr(xdr, &hdr);
> +	if (status)
> +		goto out;
> +	status = decode_sequence(xdr, &res->seq_res, rqstp);
> +	if (status)
> +		goto out;
> +	status = decode_putfh(xdr);
> +	if (status)
> +		goto out;
> +	status = decode_layoutcommit(xdr, rqstp, res);
> +	if (status)
> +		goto out;
> +	decode_getfattr(xdr, res->fattr, res->server,
> +			!RPC_IS_ASYNC(rqstp->rq_task));
> +out:
> +	return status;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  /**
> @@ -6269,6 +6397,7 @@ struct rpc_procinfo	nfs4_procedures[] = {
>  	PROC(RECLAIM_COMPLETE,	enc_reclaim_complete,	dec_reclaim_complete),
>  	PROC(GETDEVICEINFO,	enc_getdeviceinfo,	dec_getdeviceinfo),
>  	PROC(LAYOUTGET,		enc_layoutget,		dec_layoutget),
> +	PROC(LAYOUTCOMMIT,	enc_layoutcommit,	dec_layoutcommit),
>  #endif /* CONFIG_NFS_V4_1 */
>  };
>  
> diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
> index c675659..2a08ca0 100644
> --- a/fs/nfs/pnfs.c
> +++ b/fs/nfs/pnfs.c
> @@ -946,3 +946,97 @@ pnfs_try_to_read_data(struct nfs_read_data *rdata,
>  	dprintk("%s End (trypnfs:%d)\n", __func__, trypnfs);
>  	return trypnfs;
>  }
> +
> +/*
> + * Currently there is only one (whole file) write lseg.
> + */
> +static struct pnfs_layout_segment *pnfs_list_write_lseg(struct inode *inode)
> +{
> +	struct pnfs_layout_segment *lseg, *rv = NULL;
> +
> +	list_for_each_entry(lseg, &NFS_I(inode)->layout->plh_segs, pls_list)
> +		if (lseg->pls_range.iomode == IOMODE_RW)
> +			rv = lseg;
> +	return rv;
> +}
> +
> +void
> +pnfs_set_layoutcommit(struct nfs_write_data *wdata)
> +{
> +	struct nfs_inode *nfsi = NFS_I(wdata->inode);
> +	loff_t end_pos = wdata->args.offset + wdata->res.count;
> +
> +	spin_lock(&nfsi->vfs_inode.i_lock);
> +	if (!test_and_set_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {
> +		/* references matched in nfs4_layoutcommit_release */
> +		get_lseg(wdata->lseg);
> +		wdata->lseg->pls_lc_cred =
> +			get_rpccred(wdata->args.context->state->owner->so_cred);
> +		mark_inode_dirty_sync(wdata->inode);
> +		dprintk("%s: Set layoutcommit for inode %lu ",
> +			__func__, wdata->inode->i_ino);
> +	}
> +	if (end_pos > wdata->lseg->pls_end_pos)
> +		wdata->lseg->pls_end_pos = end_pos;

The end_pos is essentially per inode, why maintain it per lseg?
How do you see this working with multiple lsegs in mind?

> +	spin_unlock(&nfsi->vfs_inode.i_lock);
> +}
> +EXPORT_SYMBOL_GPL(pnfs_set_layoutcommit);
> +
> +int
> +pnfs_layoutcommit_inode(struct inode *inode, int sync)

"bool sync" makes more sense

> +{
> +	struct nfs4_layoutcommit_data *data;
> +	struct nfs_inode *nfsi = NFS_I(inode);
> +	struct pnfs_layout_segment *lseg;
> +	struct rpc_cred *cred;
> +	loff_t end_pos;
> +	int status = 0;
> +
> +	dprintk("--> %s inode %lu\n", __func__, inode->i_ino);
> +
> +	/* Note kzalloc ensures data->res.seq_res.sr_slot == NULL */
> +	data = kzalloc(sizeof(*data), GFP_NOFS);
> +	spin_lock(&inode->i_lock);
> +
> +	if (!test_and_clear_bit(NFS_INO_LAYOUTCOMMIT, &nfsi->flags)) {

previously (i.e. in the linux-pnfs tree :) this function is called only
if layoutcommit_needed(), now I worry may waste a kzalloc too frequently.
I suggest testing (and not clearing) NFS_INO_LAYOUTCOMMIT before doing
the allocation to prevent that.

> +		spin_unlock(&inode->i_lock);
> +		kfree(data);
> +		goto out;
> +	}
> +	/*
> +	 * Currently only one (whole file) write lseg which is referenced
> +	 * in pnfs_set_layoutcommit and will be found.
> +	 */
> +	lseg = pnfs_list_write_lseg(inode);
> +
> +	end_pos = lseg->pls_end_pos;
> +	cred = lseg->pls_lc_cred;
> +	lseg->pls_end_pos = 0;
> +	lseg->pls_lc_cred = NULL;
> +
> +	if (!data) {

eh?
why not test this before test_and_clear_bit(NFS_INO_LAYOUTCOMMIT ?

Benny

> +		put_lseg(lseg);
> +		spin_unlock(&inode->i_lock);
> +		put_rpccred(cred);
> +		status = -ENOMEM;
> +		goto out;
> +	} else {
> +		memcpy(&data->args.stateid.data, nfsi->layout->plh_stateid.data,
> +			sizeof(nfsi->layout->plh_stateid.data));
> +	}
> +	spin_unlock(&inode->i_lock);
> +
> +	data->args.inode = inode;
> +	data->lseg = lseg;
> +	data->cred = cred;
> +	nfs_fattr_init(&data->fattr);
> +	data->args.bitmask = NFS_SERVER(inode)->cache_consistency_bitmask;
> +	data->res.fattr = &data->fattr;
> +	data->args.lastbytewritten = end_pos - 1;
> +	data->res.server = NFS_SERVER(inode);
> +
> +	status = nfs4_proc_layoutcommit(data, sync);
> +out:
> +	dprintk("<-- %s status %d\n", __func__, status);
> +	return status;
> +}
> diff --git a/fs/nfs/pnfs.h b/fs/nfs/pnfs.h
> index 5370f1b..0806c77 100644
> --- a/fs/nfs/pnfs.h
> +++ b/fs/nfs/pnfs.h
> @@ -43,6 +43,8 @@ struct pnfs_layout_segment {
>  	atomic_t pls_refcount;
>  	unsigned long pls_flags;
>  	struct pnfs_layout_hdr *pls_layout;
> +	struct rpc_cred	*pls_lc_cred; /* LAYOUTCOMMIT credential */
> +	loff_t pls_end_pos; /* LAYOUTCOMMIT write end */
>  };
>  
>  enum pnfs_try_status {
> @@ -152,7 +154,8 @@ bool pnfs_roc(struct inode *ino);
>  void pnfs_roc_release(struct inode *ino);
>  void pnfs_roc_set_barrier(struct inode *ino, u32 barrier);
>  bool pnfs_roc_drain(struct inode *ino, u32 *barrier);
> -
> +void pnfs_set_layoutcommit(struct nfs_write_data *wdata);
> +int pnfs_layoutcommit_inode(struct inode *inode, int sync);
>  
>  static inline int lo_fail_bit(u32 iomode)
>  {
> @@ -325,6 +328,10 @@ static inline void pnfs_clear_request_commit(struct nfs_page *req)
>  {
>  }
>  
> +static inline int pnfs_layoutcommit_inode(struct inode *inode, int sync)
> +{
> +	return 0;
> +}
>  #endif /* CONFIG_NFS_V4_1 */
>  
>  #endif /* FS_NFS_PNFS_H */
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index e7aeda0..a03c11f 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1562,7 +1562,20 @@ static int nfs_commit_unstable_pages(struct inode *inode, struct writeback_contr
>  
>  int nfs_write_inode(struct inode *inode, struct writeback_control *wbc)
>  {
> -	return nfs_commit_unstable_pages(inode, wbc);
> +	int ret;
> +
> +	ret = nfs_commit_unstable_pages(inode, wbc);
> +	if (ret >= 0 && test_bit(NFS_INO_LAYOUTCOMMIT, &NFS_I(inode)->flags)) {
> +		int status, sync = wbc->sync_mode;
> +
> +		if (wbc->nonblocking || wbc->for_background)
> +				sync = 0;
> +
> +		status = pnfs_layoutcommit_inode(inode, sync);
> +		if (status < 0)
> +			return status;
> +	}
> +	return ret;
>  }
>  
>  /*
> diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
> index 134716e..bf22cfa 100644
> --- a/include/linux/nfs4.h
> +++ b/include/linux/nfs4.h
> @@ -560,6 +560,7 @@ enum {
>  	NFSPROC4_CLNT_RECLAIM_COMPLETE,
>  	NFSPROC4_CLNT_LAYOUTGET,
>  	NFSPROC4_CLNT_GETDEVICEINFO,
> +	NFSPROC4_CLNT_LAYOUTCOMMIT,
>  };
>  
>  /* nfs41 types */
> diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
> index 9d434f0..5765126 100644
> --- a/include/linux/nfs_fs.h
> +++ b/include/linux/nfs_fs.h
> @@ -223,6 +223,7 @@ struct nfs_inode {
>  #define NFS_INO_FSCACHE_LOCK	(6)		/* FS-Cache cookie management lock */
>  #define NFS_INO_COMMIT		(7)		/* inode is committing unstable writes */
>  #define NFS_INO_PNFS_COMMIT	(8)		/* use pnfs code for commit */
> +#define NFS_INO_LAYOUTCOMMIT	(9)		/* layoutcommit required */
>  
>  static inline struct nfs_inode *NFS_I(const struct inode *inode)
>  {
> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
> index ac0c0e5..84f3585 100644
> --- a/include/linux/nfs_xdr.h
> +++ b/include/linux/nfs_xdr.h
> @@ -236,6 +236,29 @@ struct nfs4_getdeviceinfo_res {
>  	struct nfs4_sequence_res seq_res;
>  };
>  
> +struct nfs4_layoutcommit_args {
> +	nfs4_stateid stateid;
> +	__u64 lastbytewritten;
> +	struct inode *inode;
> +	const u32 *bitmask;
> +	struct nfs4_sequence_args seq_args;
> +};
> +
> +struct nfs4_layoutcommit_res {
> +	struct nfs_fattr *fattr;
> +	const struct nfs_server *server;
> +	struct nfs4_sequence_res seq_res;
> +};
> +
> +struct nfs4_layoutcommit_data {
> +	struct rpc_task task;
> +	struct nfs_fattr fattr;
> +	struct pnfs_layout_segment *lseg;
> +	struct rpc_cred *cred;
> +	struct nfs4_layoutcommit_args args;
> +	struct nfs4_layoutcommit_res res;
> +};
> +
>  /*
>   * Arguments to the open call.
>   */


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-23 13:27 ` [PATCH 11/12] NFSv4.1: layoutcommit Fred Isaman
  2011-03-23 20:33   ` Benny Halevy
@ 2011-03-23 21:00   ` Christoph Hellwig
  2011-03-23 21:15     ` Trond Myklebust
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2011-03-23 21:00 UTC (permalink / raw)
  To: Fred Isaman; +Cc: linux-nfs, Trond Myklebust

> @@ -326,6 +326,9 @@ nfs_file_fsync(struct file *file, int datasync)
>  		ret = xchg(&ctx->error, 0);
>  	if (!ret && status < 0)
>  		ret = status;
> +	if (!ret && !datasync)
> +		/* application has asked for meta-data sync */
> +		ret = pnfs_layoutcommit_inode(inode, 1);

I don't think that is correct.  AFAIK the layoutcommit is needed to
get back to the data after a crash - basically the only thing a
!datasync fsync can skip is dirty timestamps.

>  int nfs_write_inode(struct inode *inode, struct writeback_control *wbc)
>  {
> -	return nfs_commit_unstable_pages(inode, wbc);
> +	int ret;
> +
> +	ret = nfs_commit_unstable_pages(inode, wbc);
> +	if (ret >= 0 && test_bit(NFS_INO_LAYOUTCOMMIT, &NFS_I(inode)->flags)) {
> +		int status, sync = wbc->sync_mode;
> +
> +		if (wbc->nonblocking || wbc->for_background)
> +				sync = 0;

incorrect indentation.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-23 21:00   ` Christoph Hellwig
@ 2011-03-23 21:15     ` Trond Myklebust
  2011-03-23 21:21       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Trond Myklebust @ 2011-03-23 21:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Fred Isaman, linux-nfs

On Wed, 2011-03-23 at 17:00 -0400, Christoph Hellwig wrote:
> > @@ -326,6 +326,9 @@ nfs_file_fsync(struct file *file, int datasync)
> >  		ret = xchg(&ctx->error, 0);
> >  	if (!ret && status < 0)
> >  		ret = status;
> > +	if (!ret && !datasync)
> > +		/* application has asked for meta-data sync */
> > +		ret = pnfs_layoutcommit_inode(inode, 1);
> 
> I don't think that is correct.  AFAIK the layoutcommit is needed to
> get back to the data after a crash - basically the only thing a
> !datasync fsync can skip is dirty timestamps.

Hi Christoph,

We had a spec clarification about this recently. The result is that for
'files' layout types:

1) DATA_SYNC writes and unstable writes with COMMIT must store enough
data to disk to allow the server to recover the data if it crashes (i.e.
it provides the equivalent of O_DSYNC and/or fdatasync()). FILE_SYNC
writes must commit all data + metadata (i.e. they do the equivalent of
O_SYNC writes).
2) This means that layoutcommit is only needed when the layout specifies
COMMIT to the data servers, or if the writes to the data servers return
a DATA_SYNC. All it does is commit the time stamps + file size to disk
on the metadata server (and so you avoid the need to do an expensive
recovery process on crash).

IOW: the above should be correct.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-23 21:15     ` Trond Myklebust
@ 2011-03-23 21:21       ` Christoph Hellwig
  2011-03-23 21:26         ` Trond Myklebust
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2011-03-23 21:21 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Christoph Hellwig, Fred Isaman, linux-nfs

On Wed, Mar 23, 2011 at 05:15:15PM -0400, Trond Myklebust wrote:
> We had a spec clarification about this recently. The result is that for
> 'files' layout types:
> 
> 1) DATA_SYNC writes and unstable writes with COMMIT must store enough
> data to disk to allow the server to recover the data if it crashes (i.e.
> it provides the equivalent of O_DSYNC and/or fdatasync()). FILE_SYNC
> writes must commit all data + metadata (i.e. they do the equivalent of
> O_SYNC writes).
> 2) This means that layoutcommit is only needed when the layout specifies
> COMMIT to the data servers, or if the writes to the data servers return
> a DATA_SYNC. All it does is commit the time stamps + file size to disk
> on the metadata server (and so you avoid the need to do an expensive
> recovery process on crash).
> 
> IOW: the above should be correct.

Okay, care to add a comment explaining this fact?  The name layoutcommit
doesn't quite suggest that it doesn't commit anything but timestamps
and size.

Can you recover the size from the data servers with the code above?  The
size should also be on disk for an fdatasync.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 11/12] NFSv4.1: layoutcommit
  2011-03-23 21:21       ` Christoph Hellwig
@ 2011-03-23 21:26         ` Trond Myklebust
  0 siblings, 0 replies; 15+ messages in thread
From: Trond Myklebust @ 2011-03-23 21:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Fred Isaman, linux-nfs

On Wed, 2011-03-23 at 17:21 -0400, Christoph Hellwig wrote:
> Okay, care to add a comment explaining this fact?  The name layoutcommit
> doesn't quite suggest that it doesn't commit anything but timestamps
> and size.
> 
> Can you recover the size from the data servers with the code above?  The
> size should also be on disk for an fdatasync.

Yes. The specification is that there can be no loss of data (even in the
case of a server crash) once the writes are committed to the data
server.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-03-25  9:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-24 13:57 [PATCH 11/12] NFSv4.1: layoutcommit William A. (Andy) Adamson
2011-03-24 16:37 ` Benny Halevy
2011-03-24 16:48   ` Trond Myklebust
2011-03-24 16:54     ` Fred Isaman
2011-03-24 16:58     ` Benny Halevy
2011-03-24 17:15       ` Trond Myklebust
2011-03-25  9:39         ` Benny Halevy
  -- strict thread matches above, loose matches on Subject: below --
2011-03-24 14:45 William A. (Andy) Adamson
2011-03-24 17:06 ` Benny Halevy
2011-03-23 13:27 [PATCH 00/12] NFSv4.1 pnfs wave5 submission (try 2) Fred Isaman
2011-03-23 13:27 ` [PATCH 11/12] NFSv4.1: layoutcommit Fred Isaman
2011-03-23 20:33   ` Benny Halevy
2011-03-23 21:00   ` Christoph Hellwig
2011-03-23 21:15     ` Trond Myklebust
2011-03-23 21:21       ` Christoph Hellwig
2011-03-23 21:26         ` Trond Myklebust

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).