linux-nfs.vger.kernel.org archive mirror
* nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
@ 2013-09-10 18:49 Joschi Brauchle
  2013-09-10 20:35 ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Joschi Brauchle @ 2013-09-10 18:49 UTC (permalink / raw)
  To: linux-nfs

Hello everyone,

we are administering an NFS high-availability cluster running on
SLES11SP1 with kernel 2.6.32.59. Just recently, one of the cluster
machines was updated to SLES11SP3 with kernel 3.0.82.


We are now experiencing severe hangs on NFS clients when the SLES11SP3
server is running the NFS services. An strace of the hanging processes
on the client side shows that they are waiting 60+ seconds for a
"utime()" call to complete.


The problem we see matches the one described in the thread "v3.5
nfsd4 regression; utime sometimes takes 40+ seconds to return". If the
NFS server is running on SLES11SP3, the little test program provided in
that thread hangs at the "utime()" call for 60+ seconds. It hangs each
time it is run! If SLES11SP1 is providing the NFS services, it finishes
right away with 0 seconds delay, each time.
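
A minimal sketch of such a timing test, assuming a Python interpreter is available on the client. It is not the program from the referenced thread, and the function name `time_utime` is made up for illustration. It only measures how long a single timestamp update takes on a given file:

```python
import os
import sys
import time


def time_utime(path):
    """Set atime/mtime of path to 'now' and return the seconds the call took."""
    start = time.monotonic()
    os.utime(path, None)  # issues a utime-family syscall (utimensat on Linux)
    return time.monotonic() - start


# Only runs when a file on the NFS mount is passed as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    target = sys.argv[1]
    with open(target, "a"):
        pass  # make sure the file exists before timing the call
    print(f"utime({target}) took {time_utime(target):.1f}s")
```

Running it repeatedly against a file on the affected mount, and once more with the other server active, should make the difference between an immediate return and a delay of tens of seconds visible.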


Now, in the server-side log files of SLES11SP3 we see these messages (not
so on SP1):
--------------
kernel: [99381.184976] RPC: AUTH_GSS upcall timed out.
kernel: [99381.184978] Please check user daemon is running.
--------------
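
To relate these messages to the observed hangs, it may help to pull their timestamps out of the kernel log. The following is a rough sketch that assumes the log lines look exactly like the two shown above; the helper name `gss_timeout_stamps` is hypothetical:

```python
import re

# Matches lines like: "kernel: [99381.184976] RPC: AUTH_GSS upcall timed out."
GSS_TIMEOUT = re.compile(r"\[\s*(\d+\.\d+)\]\s+RPC: AUTH_GSS upcall timed out")


def gss_timeout_stamps(lines):
    """Return the kernel timestamps (seconds since boot) of GSS upcall timeouts."""
    stamps = []
    for line in lines:
        match = GSS_TIMEOUT.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps


# The two lines quoted in this message, used as sample input.
sample = [
    "kernel: [99381.184976] RPC: AUTH_GSS upcall timed out.",
    "kernel: [99381.184978] Please check user daemon is running.",
]
print(gss_timeout_stamps(sample))
```

Comparing the returned timestamps with the times at which the client test hangs would show whether the two events coincide.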

We have always been running the NFS server without rpc.gssd on the 
server side, as the init script for the nfsserver also does not start 
rpc.gssd.


Once we started rpc.gssd on the SLES11SP3 server, the test utility on
the client shows that the first call to "utime()" succeeds right away,
while the second call takes ~25s to complete. After that, any subsequent
runs of the utility finish without further delay.


So can anyone confirm that with kernel 3.0+ the rpc.gssd daemon is also 
required on the server side for correct operation?

Has there been a change between kernel 2.6.32.59 and 3.0.x?

Thus, is the init script of the nfsserver in SLES11SP3 indeed failing to
start rpc.gssd?

Thank you for your help!

Best regards,
-- 
Dipl.-Ing. Joschi Brauchle, M.S.

Institute for Communications Engineering (LNT)
Technische Universitaet Muenchen (TUM)
80290 Munich, Germany

Tel (work): +49 89 289-23474
Fax (work): +49 89 289-23490
E-mail: joschi.brauchle@tum.de
Web: http://www.lnt.ei.tum.de/




* Re: nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
  2013-09-10 18:49 nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82) Joschi Brauchle
@ 2013-09-10 20:35 ` J. Bruce Fields
  2013-09-10 21:48   ` Joschi Brauchle
  0 siblings, 1 reply; 8+ messages in thread
From: J. Bruce Fields @ 2013-09-10 20:35 UTC (permalink / raw)
  To: Joschi Brauchle; +Cc: linux-nfs

On Tue, Sep 10, 2013 at 08:49:05PM +0200, Joschi Brauchle wrote:
> So can anyone confirm that with kernel 3.0+ the rpc.gssd daemon is
> also required on the server side for correct operation?
> 
> Has there been a change between kernel 2.6.32.59 and 3.0.x?
> 
> Thus, is the init script of the nfsserver in SLES11SP3 indeed
> failing to start rpc.gssd?

It should be starting rpc.gssd to allow callbacks, yes.

--b.

* Re: nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
  2013-09-10 20:35 ` J. Bruce Fields
@ 2013-09-10 21:48   ` Joschi Brauchle
  2013-09-10 21:55     ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Joschi Brauchle @ 2013-09-10 21:48 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs@vger.kernel.org

On 10.09.2013 at 22:35, "J. Bruce Fields" <bfields@fieldses.org> wrote:

> It should be starting rpc.gssd to allow callbacks, yes.
> 
> --b.

Ok, we will run rpc.gssd on the server. Thanks. 

Could you please comment on whether having the NFS clients hang on utime() calls is to be expected when *not* running rpc.gssd? Or is this a problem that needs to be investigated?

Best regards,
J Brauchle

* Re: nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
  2013-09-10 21:48   ` Joschi Brauchle
@ 2013-09-10 21:55     ` J. Bruce Fields
  2013-09-10 22:08       ` Joschi Brauchle
  0 siblings, 1 reply; 8+ messages in thread
From: J. Bruce Fields @ 2013-09-10 21:55 UTC (permalink / raw)
  To: Joschi Brauchle; +Cc: linux-nfs@vger.kernel.org

On Tue, Sep 10, 2013 at 09:48:12PM +0000, Joschi Brauchle wrote:
> Ok, we will run rpc.gssd on the server. Thanks. 
> 
> Could you please comment on whether having the NFS clients hang on utime() calls is to be expected when *not* running rpc.gssd? Or is this a problem that needs to be investigated?

I think what happens is the utime call breaks a delegation, and the
delay is because the lack of gssd prevents the server from calling back
to the client to tell it that its delegation is broken, so the
delegation has to time out.

That said, the server does a null callback to the client to test whether
callbacks are working before it gives out any delegations, so I'm
surprised it wouldn't have noticed the broken callbacks then.

--b.

* Re: nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
  2013-09-10 21:55     ` J. Bruce Fields
@ 2013-09-10 22:08       ` Joschi Brauchle
  2013-09-10 22:11         ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Joschi Brauchle @ 2013-09-10 22:08 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs@vger.kernel.org

On 10.09.2013 at 23:55, "J. Bruce Fields" <bfields@fieldses.org> wrote:

> I think what happens is the utime call breaks a delegation, and the
> delay is because the lack of gssd prevents the server from calling back
> to the client to tell it that its delegation is broken, so the
> delegation has to time out.
> 
> That said, the server does a null callback to the client to test whether
> callbacks are working before it gives out any delegations, so I'm
> surprised it wouldn't have noticed the broken callbacks then.
> 
> --b.

Is there any information I can provide to figure this out? At what time is the null callback sent to the client? Maybe I can tcpdump that sequence...

Best regards,
J Brauchle

* Re: nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
  2013-09-10 22:08       ` Joschi Brauchle
@ 2013-09-10 22:11         ` J. Bruce Fields
  2013-09-13 11:32           ` Joschi Brauchle
  0 siblings, 1 reply; 8+ messages in thread
From: J. Bruce Fields @ 2013-09-10 22:11 UTC (permalink / raw)
  To: Joschi Brauchle; +Cc: linux-nfs@vger.kernel.org

On Tue, Sep 10, 2013 at 10:08:36PM +0000, Joschi Brauchle wrote:
> Is there any information I can provide to figure this out? At what time is the null callback sent to the client? Maybe I can tcpdump that sequence...

It happens when the client sends a SETCLIENTID, which I think will be
the first time it opens a file.  Running "rpcdebug -m nfsd -s proc" on
the server and then looking in the logs afterwards might also be
enlightening.
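
One way to capture just the interesting window is to switch the flag on, trigger the client activity, and switch it off again. The sketch below wraps that in Python, assuming root privileges on the server and that `rpcdebug -c` clears a flag in the same way `-s` sets it; the helper names `rpcdebug_cmd` and `capture_window` are made up for illustration:

```python
import subprocess


def rpcdebug_cmd(enable, module="nfsd", flag="proc"):
    """Build the rpcdebug command that sets (-s) or clears (-c) a debug flag."""
    return ["rpcdebug", "-m", module, "-s" if enable else "-c", flag]


def capture_window(action, run=subprocess.check_call):
    """Enable nfsd 'proc' debugging, run action(), then always clear the flag."""
    run(rpcdebug_cmd(True))
    try:
        return action()
    finally:
        run(rpcdebug_cmd(False))
```

Here `action` could be a function that simply waits while a fresh client mounts the export and opens a file, so that the SETCLIENTID and the callback probe land in the kernel log while debugging is on.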

--b.

* Re: nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
  2013-09-10 22:11         ` J. Bruce Fields
@ 2013-09-13 11:32           ` Joschi Brauchle
  2013-09-17 13:31             ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Joschi Brauchle @ 2013-09-13 11:32 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: linux-nfs@vger.kernel.org

On 09/11/2013 12:11 AM, J. Bruce Fields wrote:
> It happens when the client sends a SETCLIENTID, which I think will be
> the first time it opens a file.  Running "rpcdebug -m nfsd -s proc" on
> the server and then looking in the logs afterwards might also be
> enlightening.
>
> --b.
>

After three days of testing the NFS server *with* rpc.gssd running with 
multiple NFS clients, we made the following observation:

The hangs on "utime()" calls have **not** disappeared by simply starting
rpc.gssd on the server. The problem persists!

It seems that
a) on machines that were already connected to the NFS server when
rpc.gssd was started, the hangs disappear *mostly*. That is, running the
utime test program causes about 1 spurious hang every 10 minutes.
b) on machines that connect to the NFS server at a later time (with
rpc.gssd already running on the server), the hangs seem to appear on
every "utime()" call.

The server emits spurious "RPC: AUTH_GSS upcall timed out. Please check
user daemon is running." messages, although rpc.gssd is running. This
may or may not be related, as this message may also be caused by clients
where the root user accesses NFS shares with a "host/<hostname>" credential.

I have uploaded the log output produced with "rpcdebug -m nfsd -s proc"
to pastebin.com. Get it with
pbget http://pastebin.com/N34r5kWE

The IP of the newly connected host is: 192.168.109.154 and its 
SETCLIENTID call was logged. Unfortunately, this log was created while 
*many* other NFS clients were connected, hence it may not be too useful.
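
A rough way to narrow such a noisy log down is to keep only the lines that literally mention the new client's address. This is a sketch under the assumption that the relevant lines contain the IP as plain text; many debug lines may not name the client at all, so it narrows the log rather than isolating one client. The function name and the sample lines are invented placeholders, not real nfsd output:

```python
import re


def lines_for_client(lines, client_ip):
    """Return the log lines that mention client_ip as a complete address."""
    # Reject matches embedded in a longer number, e.g. ...154 inside ...1540.
    pattern = re.compile(r"(?<![\d.])" + re.escape(client_ip) + r"(?!\d)")
    return [line for line in lines if pattern.search(line)]


# Invented placeholder lines, not real nfsd debug output.
sample = [
    "example entry mentioning 192.168.109.154",
    "example entry mentioning 192.168.109.15",
    "example entry mentioning 192.168.109.1540",
]
print(lines_for_client(sample, "192.168.109.154"))
```

Only the first sample line is kept, because the other two contain a different address.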

I'd be very grateful for any help or instructions on debugging/fixing 
this problem.


* Re: nfsd4: utime sometimes takes 40+ seconds to return (but on SLES11SP3 with kernel 3.0.82)
  2013-09-13 11:32           ` Joschi Brauchle
@ 2013-09-17 13:31             ` J. Bruce Fields
  0 siblings, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2013-09-17 13:31 UTC (permalink / raw)
  To: Joschi Brauchle; +Cc: linux-nfs@vger.kernel.org

On Fri, Sep 13, 2013 at 01:32:47PM +0200, Joschi Brauchle wrote:
> After three days of testing the NFS server *with* rpc.gssd running
> with multiple NFS clients, we made the following observation:
> 
> The hangs on "utime()" calls have **not** disappeared by simply starting
> rpc.gssd on the server. The problem persists!
> 
> It seems that
> a) on machines that were already connected to the NFS server when
> rpc.gssd was started, the hangs disappear *mostly*. That is, running
> the utime test program causes about 1 spurious hang every 10
> minutes.
> b) on machines that connect to the NFS server at a later time
> (with rpc.gssd already running on the server), the hangs seem to
> appear on every "utime()" call.
> 
> The server emits spurious "RPC: AUTH_GSS upcall timed out. Please
> check user daemon is running." messages, although rpc.gssd is
> running. This may or may not be related, as this message may also be
> caused by clients where the root user accesses NFS shares with a
> "host/<hostname>" credential.
> 
> I have uploaded the log output produced with "rpcdebug -m nfsd -s proc"
> to pastebin.com. Get it with
> pbget http://pastebin.com/N34r5kWE
> 
> The IP of the newly connected host is: 192.168.109.154 and its
> SETCLIENTID call was logged. Unfortunately, this log was created
> while *many* other NFS clients were connected, hence it may not be
> too useful.
> 
> I'd be very grateful for any help or instructions on
> debugging/fixing this problem.

NFSv4.0 callbacks have just been broken for a while, I think; I'll look
into it.

Meanwhile you should be able to work around this by disabling leases on
the server (so, "echo 0 >/proc/sys/fs/leases-enable" before starting
nfsd).
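
The same workaround can be scripted, for example from a cluster start-up hook. This is a minimal sketch assuming root privileges and that it runs before nfsd is started; the helper name `set_leases` is hypothetical, and the `path` parameter exists only so the function can be tried against an ordinary file:

```python
from pathlib import Path

LEASES_ENABLE = Path("/proc/sys/fs/leases-enable")


def set_leases(enabled, path=LEASES_ENABLE):
    """Write 1 or 0 to the leases sysctl file and return the previous value."""
    previous = path.read_text().strip()
    path.write_text("1" if enabled else "0")
    return previous
```

Calling `set_leases(False)` corresponds to the echo command above; keeping the returned previous value makes it easy to restore the setting later.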

(Or if you're more daring and running a very recent upstream kernel,
switching to NFSv4.1 should work too.)

--b.
