linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Trond Myklebust <trond.myklebust@fys.uio.no>
To: Simon Kirby <sim@hostway.ca>
Cc: linux-nfs@vger.kernel.org
Subject: Re: NFS client/sunrpc getting stuck on 2.6.36
Date: Thu, 11 Nov 2010 13:22:47 +0800	[thread overview]
Message-ID: <1289452967.4062.10.camel@heimdal.trondhjem.org> (raw)
In-Reply-To: <20101111023520.GH16939@hostway.ca>

On Wed, 2010-11-10 at 18:35 -0800, Simon Kirby wrote:
> Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
> Unfortunately, it doesn't happen all the time...only certain load
> patterns seem to start it off.  Once it starts, I can't find a way to
> make it recover without rebooting.
> 
> In this case, tcpdump and rpc/nfs debug logs show a repeating command,
> but only 3-4 commands are being sent per second, even though all slots
> are in use, there is a backlog, 300 processes in rpc_wait_bit_killable,
> and the debug logs show that responses are coming back immediately. 
> Weird congestion/backoff issue?
> 
> Over a few seconds, the only numbers changing from "nfsstat" are "calls",
> "authrefresh", and "getattr".  Debug level 1 for nfs and rpc show this
> repeating:
> 
>  NFS: permission(0:4c/5284877), mask=0x1, res=0
>  RPC:  5778 reserved req ffff88012ae042d0 xid 64030e74
>  RPC:  5778 xprt_prepare_transmit
>  RPC:  5778 xprt_transmit(96)
>  RPC:  5778 xmit complete
>  RPC:  5778 xid 64030e74 complete (112 bytes received)
>  RPC:  5778 release request ffff88012ae042d0
>  NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
>  NFS: permission(0:4c/3247737045), mask=0x24, res=0
>  NFS: dentry_delete(tmp/t.sh, 8)
>  NFS: dentry_delete(tmp/9040-16697.done, 8)
>  NFS: dentry_delete(tmp/9040-29371.done, 8)
> 
> The only thing that changes between repeats are the pid and xid.
> 
> Debug level 255 for nfs and rpc:
> 
> NFS call  getattr
> RPC:       new task initialized, procpid 15460
> RPC:       allocated task ffff8800c09f7e00
> RPC:  5321 __rpc_execute flags=0x80
> RPC:  5321 call_start nfs3 proc GETATTR (sync)
> RPC:  5321 call_reserve (status 0)
> RPC:  5321 reserved req ffff88012ae042d0 xid 9e010e74
> RPC:  5321 call_reserveresult (status 0)
> RPC:  5321 call_refresh (status 0)
> RPC:  5321 looking up UNIX cred
> RPC:       looking up UNIX cred
> RPC:  5321 refreshing UNIX cred ffff880011ec1300
> RPC:  5321 call_refreshresult (status 0)
> RPC:  5321 call_allocate (status 0)
> RPC:  5321 allocated buffer of size 552 at ffff880129b1d000
> RPC:  5321 call_bind (status 0)
> RPC:  5321 call_connect xprt ffff88012a74d000 is connected
> RPC:  5321 call_transmit (status 0)
> RPC:  5321 xprt_prepare_transmit
> RPC:  5321 rpc_xdr_encode (status 0)
> RPC:  5321 marshaling UNIX cred ffff880011ec1300
> RPC:  5321 using AUTH_UNIX cred ffff880011ec1300 to wrap rpc data
> RPC:  5321 xprt_transmit(96)
> RPC:       xs_tcp_send_request(96) = 96
> RPC:  5321 xmit complete
> RPC:  5321 sleep_on(queue "xprt_pending" time 4494861481)
> RPC:       xs_tcp_data_ready...
> RPC:       xs_tcp_data_recv started
> RPC:       reading TCP record fragment of length 112
> RPC:       reading XID (4 bytes)
> RPC:       reading reply for XID 9e010e74
> RPC:       reading CALL/REPLY flag (4 bytes)
> RPC:       read reply XID 9e010e74
> RPC:  5321 added to queue ffff88012a74d3f8 "xprt_pending"
> RPC:  5321 setting alarm for 900 ms
> RPC:       XID 9e010e74 read 104 bytes
> RPC:       xprt = ffff88012a74d000, tcp_copied = 112, tcp_offset = 112, tcp_reclen = 112
> RPC:  5321 xid 9e010e74 complete (112 bytes received)
> RPC:  5321 __rpc_wake_up_task (now 4494861481)
> RPC:  5321 disabling timer
> RPC:  5321 removed from queue ffff88012a74d3f8 "xprt_pending"
> RPC:       __rpc_wake_up_task done
> RPC:       xs_tcp_data_recv done
> RPC:       wake_up_next(ffff88012a74d2e8 "xprt_resend")
> RPC:       wake_up_next(ffff88012a74d1d8 "xprt_sending")
> RPC:  5321 call_status (status 112)
> RPC:  5321 call_decode (status 112)
> RPC:  5321 validating UNIX cred ffff880011ec1300
> RPC:  5321 using AUTH_UNIX cred ffff880011ec1300 to unwrap rpc data
> RPC:  5321 call_decode result 0
> RPC:  5321 return 0, status 0
> RPC:  5321 release task
> RPC:       freeing buffer of size 552 at ffff880129b1d000
> RPC:  5321 release request ffff88012ae042d0
> RPC:       wake_up_next(ffff88012a74d508 "xprt_backlog")
> RPC:       rpc_release_client(ffff88012016e400)
> RPC:  5321 freeing task
> NFS reply getattr: 0
> NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
> NFS: (0:4c/3247737045) revalidation complete
> NFS: nfs_lookup_revalidate(HWD/tmp) is valid
> RPC:       looking up Generic cred
> NFS: permission(0:4c/3247737045), mask=0x24, res=0
> NFS: open dir(HWD/tmp)
> RPC:       looking up Generic cred
> NFS: readdir(HWD/tmp) starting at cookie 0
> NFS: readdir_search_pagecache() searching for offset 0
> NFS: find_dirent_page: searching page 0 for target 0
> NFS: found cookie 4 at index 0
> NFS: find_dirent_page: returns 0
> NFS: readdir_search_pagecache: returns 0
> NFS: nfs_do_filldir() filling starting @ cookie 4
> NFS: dentry_delete(tmp/t.sh, 8)
> NFS: dentry_delete(tmp/9040-16697.done, 8)
> NFS: dentry_delete(tmp/9040-29371.done, 8)
> NFS: nfs_do_filldir() filling ended @ cookie 512; returning = 0
> NFS: readdir(HWD/tmp) returns 0
> NFS: readdir(HWD/tmp) starting at cookie 5
> NFS: readdir_search_pagecache() searching for cookie 512
> NFS: find_dirent_page: searching page 0 for target 512
> NFS: find_dirent: examining cookie 4
> NFS: find_dirent: examining cookie 6
> NFS: find_dirent: examining cookie 11
> NFS: find_dirent: examining cookie 18
> NFS: find_dirent: examining cookie 512
> NFS: find_dirent_page: returns -523
> NFS: readdir_search_pagecache: returns -523
> NFS: readdir(HWD/tmp) returns 0
> RPC:       looking up Generic cred
> NFS: permission(0:4c/5284877), mask=0x1, res=0
> NFS: revalidating (0:4c/3247737045)
> 
> 900ms matches the probably-silly nfs mount settings we're currently using:
> 
> rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192
> 
> Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/

timeo=9 is a completely insane retransmit value for a tcp connection.
Please use the default timeo=600, and all will work correctly.

Cheers
  Trond


  reply	other threads:[~2010-11-11  5:22 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-11  2:35 NFS client/sunrpc getting stuck on 2.6.36 Simon Kirby
2010-11-11  5:22 ` Trond Myklebust [this message]
2010-11-11  8:49   ` Simon Kirby
2010-11-19 20:20   ` Simon Kirby
2010-11-19 21:24     ` Trond Myklebust
2010-11-19 22:03       ` Simon Kirby
2010-11-19 22:17         ` Trond Myklebust
2010-11-19 22:58           ` Simon Kirby
2010-11-19 23:17             ` Trond Myklebust
2010-11-21  6:43               ` Simon Kirby
2010-11-21 19:55                 ` Trond Myklebust
2010-11-21  6:40           ` Simon Kirby
2010-11-21 19:54             ` Trond Myklebust
2010-11-24  5:18               ` Simon Kirby
2010-11-24 15:05                 ` Trond Myklebust

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1289452967.4062.10.camel@heimdal.trondhjem.org \
    --to=trond.myklebust@fys.uio.no \
    --cc=linux-nfs@vger.kernel.org \
    --cc=sim@hostway.ca \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).