upgrade/downgrade race

* upgrade/downgrade race
@ 2015-09-09 13:37 Andrew W Elble
  2015-09-09 15:58 ` Andrew W Elble
  2015-09-09 17:12 ` Trond Myklebust
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew W Elble @ 2015-09-09 13:37 UTC (permalink / raw)
  To: linux-nfs


In attempting to troubleshoot other issues, we've run into this race
with 4.1.4 (both client and server) with a few cherry-picked patches
from upstream. This is my attempt at a redacted packet-capture.

These all affect the same fh/stateid:

116 -> OPEN (will be an upgrade / for write)
117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6)

121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
122 -> OPEN (completed first / seqid = 0x7)

Attempts to write using that stateid fail because the stateid doesn't
have write access.

Any thoughts? I can share more data from the capture if needed.

==========

116     Sep  3, 2015 15:07:04.163242000 EDT     V4 Call SEQUENCE | PUTFH | OPEN DH: 0x8b975243/ | ACCESS FH: 0x8b975243, [Check: RD MD XT XE] | GETATTR FH: 0x8b975243
117     Sep  3, 2015 15:07:04.163289000 EDT     V4 Call SEQUENCE | PUTFH | OPEN_DOWNGRADE | GETATTR FH: 0x8b975243

...

121     Sep  3, 2015 15:07:04.163426000 EDT     V4 Reply (Call In 117) SEQUENCE | PUTFH | OPEN_DOWNGRADE | GETATTR
122     Sep  3, 2015 15:07:04.163443000 EDT     V4 Reply (Call In 116) SEQUENCE | PUTFH | OPEN StateID: 0x1f68 | ACCESS, [Allowed: RD MD XT XE] | GETATTR

...

155     Sep  3, 2015 15:07:04.165286000 EDT     V4 Call SEQUENCE | TEST_STATEID
156     Sep  3, 2015 15:07:04.165417000 EDT     V4 Reply (Call In 155) SEQUENCE | TEST_STATEID
157     Sep  3, 2015 15:07:04.165469000 EDT     V4 Call SEQUENCE | PUTFH | WRITE StateID: 0x072b Offset: 0 Len: 289 | GETATTR FH: 0x8b975243
158     Sep  3, 2015 15:07:04.165597000 EDT     V4 Reply (Call In 157) SEQUENCE | PUTFH | WRITE Status: NFS4ERR_OPENMODE
159     Sep  3, 2015 15:07:04.165713000 EDT     V4 Call SEQUENCE | TEST_STATEID
160     Sep  3, 2015 15:07:04.165839000 EDT     V4 Reply (Call In 159) SEQUENCE | TEST_STATEID
161     Sep  3, 2015 15:07:04.165913000 EDT     V4 Call SEQUENCE | PUTFH | WRITE StateID: 0x072b Offset: 0 Len: 289 | GETATTR FH: 0x8b975243
162     Sep  3, 2015 15:07:04.166040000 EDT     V4 Reply (Call In 161) SEQUENCE | PUTFH | WRITE Status: NFS4ERR_OPENMODE
163     Sep  3, 2015 15:07:04.166153000 EDT     V4 Call SEQUENCE | TEST_STATEID
164     Sep  3, 2015 15:07:04.166284000 EDT     V4 Reply (Call In 163) SEQUENCE | TEST_STATEID
165     Sep  3, 2015 15:07:04.166335000 EDT     V4 Call SEQUENCE | PUTFH | WRITE StateID: 0x072b Offset: 0 Len: 289 | GETATTR FH: 0x8b975243
166     Sep  3, 2015 15:07:04.166463000 EDT     V4 Reply (Call In 165) SEQUENCE | PUTFH | WRITE Status: NFS4ERR_OPENMODE

Thanks,

Andy

-- 
Andrew W. Elble
aweits@discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

* Re: upgrade/downgrade race
  2015-09-09 13:37 upgrade/downgrade race Andrew W Elble
@ 2015-09-09 15:58 ` Andrew W Elble
  2015-09-09 17:12 ` Trond Myklebust
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew W Elble @ 2015-09-09 15:58 UTC (permalink / raw)
  To: linux-nfs


Or, put into really half-baked (I've only spent an evening looking at
nfs client code) terms, isn't something like this required?

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 11fe5d7..15b8150 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1923,8 +1923,16 @@ static void nfs4_open_done(struct rpc_task *task, void *calldata)
 		renew_lease(data->o_res.server, data->timestamp);
 		if (!(data->o_res.rflags & NFS4_OPEN_RESULT_CONFIRM))
 			nfs_confirm_seqid(&data->owner->so_seqid, 0);
+
+		if (nfs4_stateid_is_newer(&data->state->open_stateid, &data->o_res.stateid) &&
+		    !can_open_cached(data->state, data->o_arg.fmode, data->o_arg.open_flags)) {
+			rpc_restart_call_prepare(task);
+			goto out;
+		}
 	}
 	data->rpc_done = 1;
+out:
+	return;
 }
 
 static void nfs4_open_release(void *calldata)


* Re: upgrade/downgrade race
  2015-09-09 13:37 upgrade/downgrade race Andrew W Elble
  2015-09-09 15:58 ` Andrew W Elble
@ 2015-09-09 17:12 ` Trond Myklebust
  2015-09-09 17:49   ` Trond Myklebust
  1 sibling, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2015-09-09 17:12 UTC (permalink / raw)
  To: Andrew W Elble; +Cc: Linux NFS Mailing List

On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
>
> In attempting to troubleshoot other issues, we've run into this race
> with 4.1.4 (both client and server) with a few cherry-picked patches
> from upstream. This is my attempt at a redacted packet-capture.
>
> These all affect the same fh/stateid:
>
> 116 -> OPEN (will be an upgrade / for write)
> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6
>
> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
> 122 -> OPEN (completed first / seqid = 0x7)
>
> Attempts to write using that stateid fail because the stateid doesn't
> have write access.
>
> Any thoughts? I can share more data from the capture if needed.

Bruce & Jeff,

Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
being executed after the OPEN here? Surely, if that is the case, the
server should be returning NFS4ERR_OLD_STATEID and failing the
OPEN_DOWNGRADE operation?

Cheers
  Trond

* Re: upgrade/downgrade race
  2015-09-09 17:12 ` Trond Myklebust
@ 2015-09-09 17:49   ` Trond Myklebust
  2015-09-09 18:49     ` Jeff Layton
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2015-09-09 17:49 UTC (permalink / raw)
  To: Andrew W Elble, Bruce James Fields, Jeffrey Layton; +Cc: Linux NFS Mailing List

+Bruce, +Jeff...

On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
> On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
>>
>> In attempting to troubleshoot other issues, we've run into this race
>> with 4.1.4 (both client and server) with a few cherry-picked patches
>> from upstream. This is my attempt at a redacted packet-capture.
>>
>> These all affect the same fh/stateid:
>>
>> 116 -> OPEN (will be an upgrade / for write)
>> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6
>>
>> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
>> 122 -> OPEN (completed first / seqid = 0x7)
>>
>> Attempts to write using that stateid fail because the stateid doesn't
>> have write access.
>>
>> Any thoughts? I can share more data from the capture if needed.
>
> Bruce & Jeff,
>
> Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
> being executed after the OPEN here? Surely, if that is the case, the
> server should be returning NFS4ERR_OLD_STATEID and failing the
> OPEN_DOWNGRADE operation?
>
> Cheers
>   Trond

* Re: upgrade/downgrade race
  2015-09-09 17:49   ` Trond Myklebust
@ 2015-09-09 18:49     ` Jeff Layton
  2015-09-09 19:01       ` Trond Myklebust
  2015-09-09 19:04       ` Bruce James Fields
  0 siblings, 2 replies; 16+ messages in thread
From: Jeff Layton @ 2015-09-09 18:49 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew W Elble, Bruce James Fields, Linux NFS Mailing List

On Wed, 9 Sep 2015 13:49:44 -0400
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> +Bruce, +Jeff...
> 
> On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
> <trond.myklebust@primarydata.com> wrote:
> > On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
> >>
> >> In attempting to troubleshoot other issues, we've run into this race
> >> with 4.1.4 (both client and server) with a few cherry-picked patches
> >> from upstream. This is my attempt at a redacted packet-capture.
> >>
> >> These all affect the same fh/stateid:
> >>
> >> 116 -> OPEN (will be an upgrade / for write)
> >> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6
> >>
> >> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
> >> 122 -> OPEN (completed first / seqid = 0x7)
> >>
> >> Attempts to write using that stateid fail because the stateid doesn't
> >> have write access.
> >>
> >> Any thoughts? I can share more data from the capture if needed.
> >
> > Bruce & Jeff,
> >
> > Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
> > being executed after the OPEN here? Surely, if that is the case, the
> > server should be returning NFS4ERR_OLD_STATEID and failing the
> > OPEN_DOWNGRADE operation?
> >

The problem there is that we do the seqid checks at the beginning of
the operation. In this case it's likely that it was 0x6 when the
OPEN_DOWNGRADE started. The OPEN completed first though and bumped the
seqid, and then the downgrade finished and bumped it again. When we bump
the seqid we don't verify it against what came in originally.
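
The interleaving described above can be replayed in a small user-space
model. This is only a sketch: struct model_stateid, model_check(),
model_commit() and model_replay_race() are hypothetical names, not nfsd
code, and the starting access mode is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one open stateid; not nfsd's real structure. */
struct model_stateid {
	uint32_t seqid;     /* current generation of the stateid */
	bool     can_write; /* true while the open carries write access */
};

/* Step 1 of a seqid-morphing op: validate the caller's seqid up front. */
static bool model_check(const struct model_stateid *st, uint32_t sent)
{
	return sent == st->seqid;
}

/* Step 2: apply the change and bump, without re-checking step 1. */
static uint32_t model_commit(struct model_stateid *st, bool can_write)
{
	st->can_write = can_write;
	return ++st->seqid;
}

/*
 * Replay the capture's ordering in a single thread: the downgrade passes
 * its check at 0x6, the OPEN commits (0x7), and then the downgrade commits
 * (0x8) and removes write access even though its check is now stale.
 */
static struct model_stateid model_replay_race(void)
{
	/* Assumed starting point: seqid 0x6 with write access held. */
	struct model_stateid st = { .seqid = 0x6, .can_write = true };
	bool downgrade_ok = model_check(&st, 0x6);

	model_commit(&st, true);          /* OPEN for write -> 0x7 */
	if (downgrade_ok)
		model_commit(&st, false); /* OPEN_DOWNGRADE -> 0x8 */
	return st;
}
```

In this model the final state has seqid 0x8 and no write access, which
matches the NFS4ERR_OPENMODE failures seen on the later WRITEs.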

The question is whether that's wrong from the POV of the spec. RFC5661
doesn't seem to explicitly require that we serialize such operations on
the server. The closest thing I can find is this in 3.3.12:

"The server is required to increment the "seqid" field by
 one at each transition of the stateid.  This is important since the
 client will inspect the seqid in OPEN stateids to determine the order
 of OPEN processing done by the server."

If we do need to fix this on the server, it's likely to be pretty ugly:

We'd either need to serialize seqid morphing operations (ugh), or make
update_stateid do a cmpxchg to swap it into place (or add some extra
locking around it), and then have some way to unwind all of the changes
if that fails. That may be impossible however -- we're likely closing
struct files after all.
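
For illustration, the compare-and-swap idea might look like the
following in portable C11. This is a sketch under assumptions: struct
gen_counter and gen_bump_if_unchanged() are made-up names, the kernel
would use its own cmpxchg() primitive rather than <stdatomic.h>, and
seqid wraparound is ignored.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the stateid generation counter. */
struct gen_counter {
	_Atomic uint32_t seqid;
};

/*
 * Bump the seqid only if it still equals the value that was validated at
 * the start of the operation. Returns false if another operation bumped it
 * first; the caller would then have to unwind its other changes, which is
 * the hard part noted above.
 */
static bool gen_bump_if_unchanged(struct gen_counter *g, uint32_t validated)
{
	uint32_t expected = validated;

	return atomic_compare_exchange_strong(&g->seqid, &expected,
					      validated + 1);
}
```

The swap itself is cheap; the cost is in undoing any side effects that
were already applied when it fails.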

Now, all of that said, I think the client has some bugs in its seqid
handling as well. It should have realized that the stateid was a r/o
one after the OPEN_DOWNGRADE came back with the higher seqid, but it
still issued a WRITE just afterward. That seems wrong.

-- 
Jeff Layton <jeff.layton@primarydata.com>

* Re: upgrade/downgrade race
  2015-09-09 18:49     ` Jeff Layton
@ 2015-09-09 19:01       ` Trond Myklebust
  2015-09-09 19:18         ` Jeff Layton
  2015-09-09 19:04       ` Bruce James Fields
  1 sibling, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2015-09-09 19:01 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Andrew W Elble, Bruce James Fields, Linux NFS Mailing List

On Wed, Sep 9, 2015 at 2:49 PM, Jeff Layton <jeff.layton@primarydata.com> wrote:
> On Wed, 9 Sep 2015 13:49:44 -0400
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>
>> +Bruce, +Jeff...
>>
>> On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
>> <trond.myklebust@primarydata.com> wrote:
>> > On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
>> >>
>> >> In attempting to troubleshoot other issues, we've run into this race
>> >> with 4.1.4 (both client and server) with a few cherry-picked patches
>> >> from upstream. This is my attempt at a redacted packet-capture.
>> >>
>> >> These all affect the same fh/stateid:
>> >>
>> >> 116 -> OPEN (will be an upgrade / for write)
>> >> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6
>> >>
>> >> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
>> >> 122 -> OPEN (completed first / seqid = 0x7)
>> >>
>> >> Attempts to write using that stateid fail because the stateid doesn't
>> >> have write access.
>> >>
>> >> Any thoughts? I can share more data from the capture if needed.
>> >
>> > Bruce & Jeff,
>> >
>> > Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
>> > being executed after the OPEN here? Surely, if that is the case, the
>> > server should be returning NFS4ERR_OLD_STATEID and failing the
>> > OPEN_DOWNGRADE operation?
>> >
>
> The problem there is that we do the seqid checks at the beginning of
> the operation. In this case it's likely that it was 0x6 when the
> OPEN_DOWNGRADE started. The OPEN completed first though and bumped the
> seqid, and then the downgrade finished and bumped it again. When we bump
> the seqid we don't verify it against what came in originally.
>
> The question is whether that's wrong from the POV of the spec. RFC5661
> doesn't seem to explicitly require that we serialize such operations on
> the server. The closest thing I can find is this in 3.3.12:

RFC5661, section 8.2.2
  Except for layout stateids (Section 12.5.3), when a client sends a
   stateid to the server, it has two choices with regard to the seqid
   sent.  It may set the seqid to zero to indicate to the server that it
   wishes the most up-to-date seqid for that stateid's "other" field to
   be used.  This would be the common choice in the case of a stateid
   sent with a READ or WRITE operation.  It also may set a non-zero
   value, in which case the server checks if that seqid is the correct
   one.  In that case, the server is required to return
   NFS4ERR_OLD_STATEID if the seqid is lower than the most current value
   and NFS4ERR_BAD_STATEID if the seqid is greater than the most current
   value.  This would be the common choice in the case of stateids sent
   with a CLOSE or OPEN_DOWNGRADE.  Because OPENs may be sent in
   parallel for the same owner, a client might close a file without
   knowing that an OPEN upgrade had been done by the server, changing
   the lock in question.  If CLOSE were sent with a zero seqid, the OPEN
   upgrade would be cancelled before the client even received an
   indication that an upgrade had happened.

The suggestion there is clearly that the client can rely on the server
not reordering those CLOSE/OPEN_DOWNGRADE operations w.r.t. a parallel
OPEN. Otherwise, what is the difference between sending a non-zero
seqid and zero?
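
The check that section describes can be sketched as a small function.
The enum and seqid_check() are illustrative names rather than server
code, and the sketch assumes seqid wraparound can be ignored.

```c
#include <stdint.h>

/* Illustrative outcomes; the names mirror the RFC 5661 error names. */
enum seqid_status {
	SEQID_OK,
	SEQID_OLD_STATEID, /* would map to NFS4ERR_OLD_STATEID */
	SEQID_BAD_STATEID, /* would map to NFS4ERR_BAD_STATEID */
};

/* Compare the seqid a client sent against the server's current value. */
static enum seqid_status seqid_check(uint32_t sent, uint32_t current)
{
	if (sent == 0)
		return SEQID_OK;          /* "use the most up-to-date seqid" */
	if (sent < current)
		return SEQID_OLD_STATEID; /* client is behind the server */
	if (sent > current)
		return SEQID_BAD_STATEID; /* seqid the server never issued */
	return SEQID_OK;
}
```

With the values from the capture, seqid_check(0x6, 0x7) yields
SEQID_OLD_STATEID, which is the result expected for the OPEN_DOWNGRADE
once the OPEN has moved the stateid to 0x7.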

> "The server is required to increment the "seqid" field by
>  one at each transition of the stateid.  This is important since the
>  client will inspect the seqid in OPEN stateids to determine the order
>  of OPEN processing done by the server."
>
> If we do need to fix this on the server, it's likely to be pretty ugly:
>
> We'd either need to serialize seqid morphing operations (ugh), or make
> update_stateid do an cmpxchg to swap it into place (or add some extra
> locking around it), and then have some way to unwind all of the changes
> if that fails. That may be impossible however -- we're likely closing
> struct files after all.

Updates to the state are already required to be atomic. You can't have
a stateid where an OPEN_DOWNGRADE or CLOSE only partially succeeded.

>
> Now, all of that said, I think the client has some bugs in its seqid
> handling as well. It should have realized that the stateid was a r/o
> one after the OPEN_DOWNGRADE came back with the higher seqid, but it
> still issued a WRITE just afterward. That seems wrong.

No. The client is relying on the server not reordering the
OPEN_DOWNGRADE. It expects either the OPEN to happen first and the
OPEN_DOWNGRADE to fail, or the OPEN_DOWNGRADE to happen first and
both operations to succeed.

Trond

* Re: upgrade/downgrade race
  2015-09-09 18:49     ` Jeff Layton
  2015-09-09 19:01       ` Trond Myklebust
@ 2015-09-09 19:04       ` Bruce James Fields
  1 sibling, 0 replies; 16+ messages in thread
From: Bruce James Fields @ 2015-09-09 19:04 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Trond Myklebust, Andrew W Elble, Linux NFS Mailing List

On Wed, Sep 09, 2015 at 02:49:35PM -0400, Jeff Layton wrote:
> On Wed, 9 Sep 2015 13:49:44 -0400
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
> > +Bruce, +Jeff...
> > 
> > On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
> > <trond.myklebust@primarydata.com> wrote:
> > > On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
> > >>
> > >> In attempting to troubleshoot other issues, we've run into this race
> > >> with 4.1.4 (both client and server) with a few cherry-picked patches
> > >> from upstream. This is my attempt at a redacted packet-capture.
> > >>
> > >> These all affect the same fh/stateid:
> > >>
> > >> 116 -> OPEN (will be an upgrade / for write)
> > >> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6
> > >>
> > >> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
> > >> 122 -> OPEN (completed first / seqid = 0x7)
> > >>
> > >> Attempts to write using that stateid fail because the stateid doesn't
> > >> have write access.
> > >>
> > >> Any thoughts? I can share more data from the capture if needed.
> > >
> > > Bruce & Jeff,
> > >
> > > Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
> > > being executed after the OPEN here? Surely, if that is the case, the
> > > server should be returning NFS4ERR_OLD_STATEID and failing the
> > > OPEN_DOWNGRADE operation?
> > >
> 
> The problem there is that we do the seqid checks at the beginning of
> the operation. In this case it's likely that it was 0x6 when the
> OPEN_DOWNGRADE started. The OPEN completed first though and bumped the
> seqid, and then the downgrade finished and bumped it again. When we bump
> the seqid we don't verify it against what came in originally.
> 
> The question is whether that's wrong from the POV of the spec. RFC5661
> doesn't seem to explicitly require that we serialize such operations on
> the server. The closest thing I can find is this in 3.3.12:
> 
> "The server is required to increment the "seqid" field by
>  one at each transition of the stateid.  This is important since the
>  client will inspect the seqid in OPEN stateids to determine the order
>  of OPEN processing done by the server."
> 
> If we do need to fix this on the server, it's likely to be pretty ugly:
> 
> We'd either need to serialize seqid morphing operations (ugh),

I thought that was required.

--b.

> or make
> update_stateid do an cmpxchg to swap it into place (or add some extra
> locking around it), and then have some way to unwind all of the changes
> if that fails. That may be impossible however -- we're likely closing
> struct files after all.
> 
> Now, all of that said, I think the client has some bugs in its seqid
> handling as well. It should have realized that the stateid was a r/o
> one after the OPEN_DOWNGRADE came back with the higher seqid, but it
> still issued a WRITE just afterward. That seems wrong.
> 
> -- 
> Jeff Layton <jeff.layton@primarydata.com>

* Re: upgrade/downgrade race
  2015-09-09 19:01       ` Trond Myklebust
@ 2015-09-09 19:18         ` Jeff Layton
  2015-09-09 20:40           ` Bruce James Fields
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2015-09-09 19:18 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Andrew W Elble, Bruce James Fields, Linux NFS Mailing List

On Wed, 9 Sep 2015 15:01:54 -0400
Trond Myklebust <trond.myklebust@primarydata.com> wrote:

> On Wed, Sep 9, 2015 at 2:49 PM, Jeff Layton <jeff.layton@primarydata.com> wrote:
> > On Wed, 9 Sep 2015 13:49:44 -0400
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> >
> >> +Bruce, +Jeff...
> >>
> >> On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
> >> <trond.myklebust@primarydata.com> wrote:
> >> > On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
> >> >>
> >> >> In attempting to troubleshoot other issues, we've run into this race
> >> >> with 4.1.4 (both client and server) with a few cherry-picked patches
> >> >> from upstream. This is my attempt at a redacted packet-capture.
> >> >>
> >> >> These all affect the same fh/stateid:
> >> >>
> >> >> 116 -> OPEN (will be an upgrade / for write)
> >> >> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6
> >> >>
> >> >> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
> >> >> 122 -> OPEN (completed first / seqid = 0x7)
> >> >>
> >> >> Attempts to write using that stateid fail because the stateid doesn't
> >> >> have write access.
> >> >>
> >> >> Any thoughts? I can share more data from the capture if needed.
> >> >
> >> > Bruce & Jeff,
> >> >
> >> > Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
> >> > being executed after the OPEN here? Surely, if that is the case, the
> >> > server should be returning NFS4ERR_OLD_STATEID and failing the
> >> > OPEN_DOWNGRADE operation?
> >> >
> >
> > The problem there is that we do the seqid checks at the beginning of
> > the operation. In this case it's likely that it was 0x6 when the
> > OPEN_DOWNGRADE started. The OPEN completed first though and bumped the
> > seqid, and then the downgrade finished and bumped it again. When we bump
> > the seqid we don't verify it against what came in originally.
> >
> > The question is whether that's wrong from the POV of the spec. RFC5661
> > doesn't seem to explicitly require that we serialize such operations on
> > the server. The closest thing I can find is this in 3.3.12:
> 
> RFC5661, section 8.2.2
>   Except for layout stateids (Section 12.5.3), when a client sends a
>    stateid to the server, it has two choices with regard to the seqid
>    sent.  It may set the seqid to zero to indicate to the server that it
>    wishes the most up-to-date seqid for that stateid's "other" field to
>    be used.  This would be the common choice in the case of a stateid
>    sent with a READ or WRITE operation.  It also may set a non-zero
>    value, in which case the server checks if that seqid is the correct
>    one.  In that case, the server is required to return
>    NFS4ERR_OLD_STATEID if the seqid is lower than the most current value
>    and NFS4ERR_BAD_STATEID if the seqid is greater than the most current
>    value.  This would be the common choice in the case of stateids sent
>    with a CLOSE or OPEN_DOWNGRADE.  Because OPENs may be sent in
>    parallel for the same owner, a client might close a file without
>    knowing that an OPEN upgrade had been done by the server, changing
>    the lock in question.  If CLOSE were sent with a zero seqid, the OPEN
>    upgrade would be cancelled before the client even received an
>    indication that an upgrade had happened.
> 
> The suggestion there is clearly that the client can rely on the server
> not reordering those CLOSE/OPEN_DOWNGRADE operations w.r.t. a parallel
> OPEN. Otherwise, what is the difference between sending a non-zero
> seqid and zero?
> 
> > "The server is required to increment the "seqid" field by
> >  one at each transition of the stateid.  This is important since the
> >  client will inspect the seqid in OPEN stateids to determine the order
> >  of OPEN processing done by the server."
> >
> > If we do need to fix this on the server, it's likely to be pretty ugly:
> >
> > We'd either need to serialize seqid morphing operations (ugh), or make
> > update_stateid do an cmpxchg to swap it into place (or add some extra
> > locking around it), and then have some way to unwind all of the changes
> > if that fails. That may be impossible however -- we're likely closing
> > struct files after all.
> 
> Updates to the state are already required to be atomic. You can't have
> a stateid where an OPEN_DOWNGRADE or CLOSE only partially succeeded.
> 
> >
> > Now, all of that said, I think the client has some bugs in its seqid
> > handling as well. It should have realized that the stateid was a r/o
> > one after the OPEN_DOWNGRADE came back with the higher seqid, but it
> > still issued a WRITE just afterward. That seems wrong.
> 
> No. The client is relying on the server not reordering the
> OPEN_DOWNGRADE. It expects either for the OPEN to happen first, and
> the OPEN_DOWNGRADE to fail, or for the OPEN_DOWNGRADE to happen first,
> and for both operations to succeed.
> 
> Trond

In that case, the "simple" fix would be to add a mutex to
nfs4_ol_stateid. Lock that in nfs4_preprocess_seqid_op, and ensure that
we unlock it after bumping the seqid (or on error).
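
As a rough user-space model of that idea (pthread mutex instead of a
kernel mutex; struct locked_stateid and locked_update() are hypothetical
names, and the seqid-zero and wraparound cases are left out):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a stateid that carries its own lock. */
struct locked_stateid {
	pthread_mutex_t lock;
	uint32_t        seqid;
	bool            can_write;
};

/*
 * Check the caller's seqid and bump it under one lock, so that no other
 * seqid-morphing operation can slip in between the check and the bump.
 * Returns the new seqid, or 0 if the seqid that was sent is not current.
 */
static uint32_t locked_update(struct locked_stateid *st, uint32_t sent,
			      bool can_write)
{
	uint32_t result = 0;

	pthread_mutex_lock(&st->lock);
	if (sent == st->seqid) {
		st->can_write = can_write;
		result = ++st->seqid;
	}
	pthread_mutex_unlock(&st->lock); /* also reached on the error path */
	return result;
}
```

With the check and the bump serialized this way, a downgrade that sent
0x6 after the OPEN had already moved the stateid to 0x7 would be
rejected instead of being applied on top of it.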

Bruce, any thoughts?
-- 
Jeff Layton <jeff.layton@primarydata.com>

* Re: upgrade/downgrade race
  2015-09-09 19:18         ` Jeff Layton
@ 2015-09-09 20:40           ` Bruce James Fields
  2015-09-09 21:00             ` Jeff Layton
  0 siblings, 1 reply; 16+ messages in thread
From: Bruce James Fields @ 2015-09-09 20:40 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Trond Myklebust, Andrew W Elble, Linux NFS Mailing List

On Wed, Sep 09, 2015 at 03:18:01PM -0400, Jeff Layton wrote:
> On Wed, 9 Sep 2015 15:01:54 -0400
> Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> 
> > On Wed, Sep 9, 2015 at 2:49 PM, Jeff Layton <jeff.layton@primarydata.com> wrote:
> > > On Wed, 9 Sep 2015 13:49:44 -0400
> > > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > >
> > >> +Bruce, +Jeff...
> > >>
> > >> On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
> > >> <trond.myklebust@primarydata.com> wrote:
> > >> > On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
> > >> >>
> > >> >> In attempting to troubleshoot other issues, we've run into this race
> > >> >> with 4.1.4 (both client and server) with a few cherry-picked patches
> > >> >> from upstream. This is my attempt at a redacted packet-capture.
> > >> >>
> > >> >> These all affect the same fh/stateid:
> > >> >>
> > >> >> 116 -> OPEN (will be an upgrade / for write)
> > >> >> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6
> > >> >>
> > >> >> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
> > >> >> 122 -> OPEN (completed first / seqid = 0x7)
> > >> >>
> > >> >> Attempts to write using that stateid fail because the stateid doesn't
> > >> >> have write access.
> > >> >>
> > >> >> Any thoughts? I can share more data from the capture if needed.
> > >> >
> > >> > Bruce & Jeff,
> > >> >
> > >> > Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
> > >> > being executed after the OPEN here? Surely, if that is the case, the
> > >> > server should be returning NFS4ERR_OLD_STATEID and failing the
> > >> > OPEN_DOWNGRADE operation?
> > >> >
> > >
> > > The problem there is that we do the seqid checks at the beginning of
> > > the operation. In this case it's likely that it was 0x6 when the
> > > OPEN_DOWNGRADE started. The OPEN completed first though and bumped the
> > > seqid, and then the downgrade finished and bumped it again. When we bump
> > > the seqid we don't verify it against what came in originally.
> > >
> > > The question is whether that's wrong from the POV of the spec. RFC5661
> > > doesn't seem to explicitly require that we serialize such operations on
> > > the server. The closest thing I can find is this in 3.3.12:
> > 
> > RFC5661, section 8.2.2
> >   Except for layout stateids (Section 12.5.3), when a client sends a
> >    stateid to the server, it has two choices with regard to the seqid
> >    sent.  It may set the seqid to zero to indicate to the server that it
> >    wishes the most up-to-date seqid for that stateid's "other" field to
> >    be used.  This would be the common choice in the case of a stateid
> >    sent with a READ or WRITE operation.  It also may set a non-zero
> >    value, in which case the server checks if that seqid is the correct
> >    one.  In that case, the server is required to return
> >    NFS4ERR_OLD_STATEID if the seqid is lower than the most current value
> >    and NFS4ERR_BAD_STATEID if the seqid is greater than the most current
> >    value.  This would be the common choice in the case of stateids sent
> >    with a CLOSE or OPEN_DOWNGRADE.  Because OPENs may be sent in
> >    parallel for the same owner, a client might close a file without
> >    knowing that an OPEN upgrade had been done by the server, changing
> >    the lock in question.  If CLOSE were sent with a zero seqid, the OPEN
> >    upgrade would be cancelled before the client even received an
> >    indication that an upgrade had happened.
> > 
> > The suggestion there is clearly that the client can rely on the server
> > not reordering those CLOSE/OPEN_DOWNGRADE operations w.r.t. a parallel
> > OPEN. Otherwise, what is the difference between sending a non-zero
> > seqid and zero?
> > 
> > > "The server is required to increment the "seqid" field by
> > >  one at each transition of the stateid.  This is important since the
> > >  client will inspect the seqid in OPEN stateids to determine the order
> > >  of OPEN processing done by the server."
> > >
> > > If we do need to fix this on the server, it's likely to be pretty ugly:
> > >
> > > We'd either need to serialize seqid morphing operations (ugh), or make
> > > update_stateid do an cmpxchg to swap it into place (or add some extra
> > > locking around it), and then have some way to unwind all of the changes
> > > if that fails. That may be impossible however -- we're likely closing
> > > struct files after all.
> > 
> > Updates to the state are already required to be atomic. You can't have
> > a stateid where an OPEN_DOWNGRADE or CLOSE only partially succeeded.
> > 
> > >
> > > Now, all of that said, I think the client has some bugs in its seqid
> > > handling as well. It should have realized that the stateid was a r/o
> > > one after the OPEN_DOWNGRADE came back with the higher seqid, but it
> > > still issued a WRITE just afterward. That seems wrong.
> > 
> > No. The client is relying on the server not reordering the
> > OPEN_DOWNGRADE. It expects either for the OPEN to happen first, and
> > the OPEN_DOWNGRADE to fail, or for the OPEN_DOWNGRADE to happen first,
> > and for both operations to succeed.
> > 
> > Trond
> 
> In that case, the "simple" fix would be to add a mutex to
> nfs4_ol_stateid. Lock that in nfs4_preprocess_seqid_op, and ensure that
> we unlock it after bumping the seqid (or on error).
> 
> Bruce, any thoughts?

Why isn't nfsd4_cstate_assign_replay()/nfsd4_cstate_clear_replay()
already doing this with the so_replay.rp_mutex lock?

Looking at it.... OK, sorry, that's 4.0 only.  I don't know if that
should be shared in the session case.

--b.

* Re: upgrade/downgrade race
  2015-09-09 20:40           ` Bruce James Fields
@ 2015-09-09 21:00             ` Jeff Layton
  2015-09-09 21:39               ` Bruce James Fields
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2015-09-09 21:00 UTC (permalink / raw)
  To: Bruce James Fields
  Cc: Trond Myklebust, Andrew W Elble, Linux NFS Mailing List

On Wed, 9 Sep 2015 16:40:36 -0400
Bruce James Fields <bfields@fieldses.org> wrote:

> On Wed, Sep 09, 2015 at 03:18:01PM -0400, Jeff Layton wrote:
> > On Wed, 9 Sep 2015 15:01:54 -0400
> > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > 
> > > On Wed, Sep 9, 2015 at 2:49 PM, Jeff Layton <jeff.layton@primarydata.com> wrote:
> > > > On Wed, 9 Sep 2015 13:49:44 -0400
> > > > Trond Myklebust <trond.myklebust@primarydata.com> wrote:
> > > >
> > > >> +Bruce, +Jeff...
> > > >>
> > > >> On Wed, Sep 9, 2015 at 1:12 PM, Trond Myklebust
> > > >> <trond.myklebust@primarydata.com> wrote:
> > > >> > On Wed, Sep 9, 2015 at 9:37 AM, Andrew W Elble <aweits@rit.edu> wrote:
> > > >> >>
> > > >> >> In attempting to troubleshoot other issues, we've run into this race
> > > >> >> with 4.1.4 (both client and server) with a few cherry-picked patches
> > > >> >> from upstream. This is my attempt at a redacted packet-capture.
> > > >> >>
> > > >> >> These all affect the same fh/stateid:
> > > >> >>
> > > >> >> 116 -> OPEN (will be an upgrade / for write)
> > > >> >> 117 -> OPEN_DOWNGRADE (to read for the existing stateid / seqid = 0x6)
> > > >> >>
> > > >> >> 121 -> OPEN_DOWNGRADE (completed last / seqid = 0x8)
> > > >> >> 122 -> OPEN (completed first / seqid = 0x7)
> > > >> >>
> > > >> >> Attempts to write using that stateid fail because the stateid doesn't
> > > >> >> have write access.
> > > >> >>
> > > >> >> Any thoughts? I can share more data from the capture if needed.
> > > >> >
> > > >> > Bruce & Jeff,
> > > >> >
> > > >> > Given that the client sent a non-zero seqid, why is the OPEN_DOWNGRADE
> > > >> > being executed after the OPEN here? Surely, if that is the case, the
> > > >> > server should be returning NFS4ERR_OLD_STATEID and failing the
> > > >> > OPEN_DOWNGRADE operation?
> > > >> >
> > > >
> > > > The problem there is that we do the seqid checks at the beginning of
> > > > the operation. In this case it's likely that it was 0x6 when the
> > > > OPEN_DOWNGRADE started. The OPEN completed first though and bumped the
> > > > seqid, and then the downgrade finished and bumped it again. When we bump
> > > > the seqid we don't verify it against what came in originally.
> > > >
> > > > The question is whether that's wrong from the POV of the spec. RFC5661
> > > > doesn't seem to explicitly require that we serialize such operations on
> > > > the server. The closest thing I can find is this in 3.3.12:
> > > 
> > > RFC5661, section 8.2.2
> > >   Except for layout stateids (Section 12.5.3), when a client sends a
> > >    stateid to the server, it has two choices with regard to the seqid
> > >    sent.  It may set the seqid to zero to indicate to the server that it
> > >    wishes the most up-to-date seqid for that stateid's "other" field to
> > >    be used.  This would be the common choice in the case of a stateid
> > >    sent with a READ or WRITE operation.  It also may set a non-zero
> > >    value, in which case the server checks if that seqid is the correct
> > >    one.  In that case, the server is required to return
> > >    NFS4ERR_OLD_STATEID if the seqid is lower than the most current value
> > >    and NFS4ERR_BAD_STATEID if the seqid is greater than the most current
> > >    value.  This would be the common choice in the case of stateids sent
> > >    with a CLOSE or OPEN_DOWNGRADE.  Because OPENs may be sent in
> > >    parallel for the same owner, a client might close a file without
> > >    knowing that an OPEN upgrade had been done by the server, changing
> > >    the lock in question.  If CLOSE were sent with a zero seqid, the OPEN
> > >    upgrade would be cancelled before the client even received an
> > >    indication that an upgrade had happened.
> > > 
> > > The suggestion there is clearly that the client can rely on the server
> > > not reordering those CLOSE/OPEN_DOWNGRADE operations w.r.t. a parallel
> > > OPEN. Otherwise, what is the difference between sending a non-zero
> > > seqid and zero?
> > > 
> > > > "The server is required to increment the "seqid" field by
> > > >  one at each transition of the stateid.  This is important since the
> > > >  client will inspect the seqid in OPEN stateids to determine the order
> > > >  of OPEN processing done by the server."
> > > >
> > > > If we do need to fix this on the server, it's likely to be pretty ugly:
> > > >
> > > > We'd either need to serialize seqid morphing operations (ugh), or make
> > > > update_stateid do a cmpxchg to swap it into place (or add some extra
> > > > locking around it), and then have some way to unwind all of the changes
> > > > if that fails. That may be impossible however -- we're likely closing
> > > > struct files after all.
> > > 
> > > Updates to the state are already required to be atomic. You can't have
> > > a stateid where an OPEN_DOWNGRADE or CLOSE only partially succeeded.
> > > 
> > > >
> > > > Now, all of that said, I think the client has some bugs in its seqid
> > > > handling as well. It should have realized that the stateid was a r/o
> > > > one after the OPEN_DOWNGRADE came back with the higher seqid, but it
> > > > still issued a WRITE just afterward. That seems wrong.
> > > 
> > > No. The client is relying on the server not reordering the
> > > OPEN_DOWNGRADE. It expects either for the OPEN to happen first, and
> > > the OPEN_DOWNGRADE to fail, or for the OPEN_DOWNGRADE to happen first,
> > > and for both operations to succeed.
> > > 
> > > Trond
> > 
> > In that case, the "simple" fix would be to add a mutex to
> > nfs4_ol_stateid. Lock that in nfs4_preprocess_seqid_op, and ensure that
> > we unlock it after bumping the seqid (or on error).
> > 
> > Bruce, any thoughts?
> 
> Why isn't nfsd4_cstate_assign_replay()/nfsd4_cstate_clear_replay()
> already doing this with the so_replay.rp_mutex lock?
> 
> Looking at it.... OK, sorry, that's 4.0 only.  I don't know if that
> should be shared in the session case.
> 

Yeah, that's probably a bit heavyweight for v4.1. That mutex is in the
stateowner struct. The same stateowner could be opening different
files, and we wouldn't want to serialize those. I think we'd need
something in the stateid struct itself.

Trond also pointed out that we don't really need to serialize OPEN
calls, so we might be best off with something like a rw semaphore. Take
the read lock in OPEN, and the write lock for OPEN_DOWNGRADE/CLOSE.
LOCK/LOCKU will also need similar treatment of course. I'm not sure
about LAYOUTGET/LAYOUTRETURN/CLOSE though.

-- 
Jeff Layton <jeff.layton@primarydata.com>

* Re: upgrade/downgrade race
  2015-09-09 21:00             ` Jeff Layton
@ 2015-09-09 21:39               ` Bruce James Fields
  2015-09-09 22:08                 ` Jeff Layton
  2015-09-12 12:10                 ` Jeff Layton
  0 siblings, 2 replies; 16+ messages in thread
From: Bruce James Fields @ 2015-09-09 21:39 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Trond Myklebust, Andrew W Elble, Linux NFS Mailing List

On Wed, Sep 09, 2015 at 05:00:37PM -0400, Jeff Layton wrote:
> On Wed, 9 Sep 2015 16:40:36 -0400
> Bruce James Fields <bfields@fieldses.org> wrote:
> 
> > Why isn't nfsd4_cstate_assign_replay()/nfsd4_cstate_clear_replay()
> > already doing this with the so_replay.rp_mutex lock?
> > 
> > Looking at it.... OK, sorry, that's 4.0 only.  I don't know if that
> > should be shared in the session case.
> > 
> 
> Yeah, that's probably a bit heavyweight for v4.1. That mutex is in the
> stateowner struct. The same stateowner could be opening different
> files, and we wouldn't want to serialize those. I think we'd need
> something in the stateid struct itself.
> 
> Trond also pointed out that we don't really need to serialize OPEN
> calls, so we might be best off with something like a rw semaphore. Take
> the read lock in OPEN, and the write lock for OPEN_DOWNGRADE/CLOSE.
> LOCK/LOCKU will also need similar treatment of course.

OK, I think I agree.  LOCK and LOCKU both need exclusive locks, right?

> I'm not sure about LAYOUTGET/LAYOUTRETURN/CLOSE though.

Me neither.

--b.

* Re: upgrade/downgrade race
  2015-09-09 21:39               ` Bruce James Fields
@ 2015-09-09 22:08                 ` Jeff Layton
  2015-09-12 12:10                 ` Jeff Layton
  1 sibling, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2015-09-09 22:08 UTC (permalink / raw)
  To: Bruce James Fields
  Cc: Trond Myklebust, Andrew W Elble, Linux NFS Mailing List

On Wed, 9 Sep 2015 17:39:07 -0400
Bruce James Fields <bfields@fieldses.org> wrote:

> On Wed, Sep 09, 2015 at 05:00:37PM -0400, Jeff Layton wrote:
> > Yeah, that's probably a bit heavyweight for v4.1. That mutex is in the
> > stateowner struct. The same stateowner could be opening different
> > files, and we wouldn't want to serialize those. I think we'd need
> > something in the stateid struct itself.
> > 
> > Trond also pointed out that we don't really need to serialize OPEN
> > calls, so we might be best off with something like a rw semaphore. Take
> > the read lock in OPEN, and the write lock for OPEN_DOWNGRADE/CLOSE.
> > LOCK/LOCKU will also need similar treatment of course.
> 
> OK, I think I agree.  LOCK and LOCKU both need exclusive locks, right?
> 

Right. Those can't really run in parallel.

> > I'm not sure about LAYOUTGET/LAYOUTRETURN/CLOSE though.
> 
> Me neither.
> 

Then we should probably start with the cases we do know, and then
extend the locking to those if it's required. I'll look at it when I
get time but it may be a bit...
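
For reference, the lock modes discussed so far can be written down as a small table. This is only a summary sketch of the thread, with hypothetical enum and function names. CLOSE is listed as exclusive per the earlier proposal, and the layout operations are deliberately left undecided.

```c
#include <assert.h>

/* Hypothetical names, used only to summarize the discussion. */
enum model_op {
	MODEL_OP_OPEN,
	MODEL_OP_OPEN_DOWNGRADE,
	MODEL_OP_CLOSE,
	MODEL_OP_LOCK,
	MODEL_OP_LOCKU,
	MODEL_OP_LAYOUTGET,
	MODEL_OP_LAYOUTRETURN,
};

enum model_lock_mode {
	MODEL_LOCK_SHARED,	/* read side of the rw semaphore */
	MODEL_LOCK_EXCLUSIVE,	/* write side of the rw semaphore */
	MODEL_LOCK_UNDECIDED,	/* not settled in this thread */
};

static enum model_lock_mode model_lock_mode_for(enum model_op op)
{
	switch (op) {
	case MODEL_OP_OPEN:
		return MODEL_LOCK_SHARED;
	case MODEL_OP_OPEN_DOWNGRADE:
	case MODEL_OP_CLOSE:
	case MODEL_OP_LOCK:
	case MODEL_OP_LOCKU:
		return MODEL_LOCK_EXCLUSIVE;
	case MODEL_OP_LAYOUTGET:
	case MODEL_OP_LAYOUTRETURN:
		return MODEL_LOCK_UNDECIDED;
	}
	assert(!"unknown operation");
	return MODEL_LOCK_UNDECIDED;
}
```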

-- 
Jeff Layton <jeff.layton@primarydata.com>

* Re: upgrade/downgrade race
  2015-09-09 21:39               ` Bruce James Fields
  2015-09-09 22:08                 ` Jeff Layton
@ 2015-09-12 12:10                 ` Jeff Layton
  2015-09-12 12:27                   ` Andrew W Elble
  2015-09-15 11:49                   ` Andrew W Elble
  1 sibling, 2 replies; 16+ messages in thread
From: Jeff Layton @ 2015-09-12 12:10 UTC (permalink / raw)
  To: Bruce James Fields
  Cc: Trond Myklebust, Andrew W Elble, Linux NFS Mailing List

On Wed, 9 Sep 2015 17:39:07 -0400
Bruce James Fields <bfields@fieldses.org> wrote:

> On Wed, Sep 09, 2015 at 05:00:37PM -0400, Jeff Layton wrote:
> > Yeah, that's probably a bit heavyweight for v4.1. That mutex is in the
> > stateowner struct. The same stateowner could be opening different
> > files, and we wouldn't want to serialize those. I think we'd need
> > something in the stateid struct itself.
> > 
> > Trond also pointed out that we don't really need to serialize OPEN
> > calls, so we might be best off with something like a rw semaphore. Take
> > the read lock in OPEN, and the write lock for OPEN_DOWNGRADE/CLOSE.
> > LOCK/LOCKU will also need similar treatment of course.
> 
> OK, I think I agree.  LOCK and LOCKU both need exclusive locks, right?
> 
> > I'm not sure about LAYOUTGET/LAYOUTRETURN/CLOSE though.
> 
> Me neither.
> 
> --b.

Andrew, could you test this patch out? This just covers open and lock
stateids. If it works, I'll clean up the comments and resend it to the
list as a PATCH email.

Assuming that it does, we'll need to consider what (if anything) to do
about layout stateids...

-- 
Jeff Layton <jeff.layton@primarydata.com>

[-- Attachment #2: 0001-nfsd-serialize-state-seqid-morphing-operations.patch --]

* Re: upgrade/downgrade race
  2015-09-12 12:10                 ` Jeff Layton
@ 2015-09-12 12:27                   ` Andrew W Elble
  2015-09-15 11:49                   ` Andrew W Elble
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew W Elble @ 2015-09-12 12:27 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Bruce James Fields, Trond Myklebust, Linux NFS Mailing List

> Andrew, could you test this patch out? This just covers open and lock
> stateids. If it works, I'll clean up the comments and resend it to the
> list as a PATCH email.

Will do, thank you!

> Assuming that it does, we'll need to consider what (if anything) to do
> about layout stateids...

-- 
Andrew W. Elble
aweits@discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

* Re: upgrade/downgrade race
  2015-09-12 12:10                 ` Jeff Layton
  2015-09-12 12:27                   ` Andrew W Elble
@ 2015-09-15 11:49                   ` Andrew W Elble
  2015-09-15 11:59                     ` Jeff Layton
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew W Elble @ 2015-09-15 11:49 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Bruce James Fields, Trond Myklebust, Linux NFS Mailing List


> Andrew, could you test this patch out? This just covers open and lock
> stateids. If it works, I'll clean up the comments and resend it to the
> list as a PATCH email.
>
> Assuming that it does, we'll need to consider what (if anything) to do
> about layout stateids...

Jeff,

   We've run with no issues overnight. I'll let you know if we see
   anything weird in the next few days. (It generally takes a few days at
   this point to trigger anything out of the ordinary.)


Thanks,

Andy

-- 
Andrew W. Elble
aweits@discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

* Re: upgrade/downgrade race
  2015-09-15 11:49                   ` Andrew W Elble
@ 2015-09-15 11:59                     ` Jeff Layton
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2015-09-15 11:59 UTC (permalink / raw)
  To: Andrew W Elble
  Cc: Bruce James Fields, Trond Myklebust, Linux NFS Mailing List

On Tue, 15 Sep 2015 07:49:33 -0400
Andrew W Elble <aweits@rit.edu> wrote:

> 
> > Andrew, could you test this patch out? This just covers open and lock
> > stateids. If it works, I'll clean up the comments and resend it to the
> > list as a PATCH email.
> >
> > Assuming that it does, we'll need to consider what (if anything) to do
> > about layout stateids...
> 
> Jeff,
> 
>    We've run with no issues overnight. I'll let you know if we see
>    anything weird in the next few days (It generally takes a few days at
>    this point to trigger anything out-of-the-ordinary)
> 
> 
> Thanks,
> 
> Andy
> 

Great. Thanks for helping test it!

-- 
Jeff Layton <jeff.layton@primarydata.com>
