From: David Flynn <davidf@rd.bbc.co.uk>
To: linux-nfs@vger.kernel.org
Subject: NFS4 BAD_STATEID loop (kernel 3.0)
Date: Mon, 24 Oct 2011 10:40:42 +0000 [thread overview]
Message-ID: <20111024104042.GD32587@rd.bbc.co.uk> (raw)
Dear All,
On a system running kernel 3.0, mounting a Solaris NFS4 export, we
observe a continuous 20Mbit/sec exchange between client and server that had
been occurring for 10 days.
>From /proc/mounts:
home:/home/ /home nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.29.190.21,minorversion=0,local_lock=none,addr=172.29.120.140 0 0
Reconciling logs, we find that around the time the exchange started,
the kernel reports a process having blocked:
[787321.680029] INFO: task bash:17799 blocked for more than 120 seconds.
[787321.680067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[787321.680104] bash D 0000000000000002 0 17799 1 0x00000000
[787321.680131] ffff880173ad9ca8 0000000000000086 ffff88017216b008 0000000000012a40
[787321.680171] ffff880173ad9fd8 0000000000012a40 ffff880173ad8000 0000000000012a40
[787321.680211] 0000000000012a40 0000000000012a40 ffff880173ad9fd8 0000000000012a40
[787321.680251] Call Trace:
[787321.680275] [<ffffffff8110e900>] ? __lock_page+0x70/0x70
[787321.680299] [<ffffffff815fea2c>] io_schedule+0x8c/0xd0
[787321.680321] [<ffffffff8110e90e>] sleep_on_page+0xe/0x20
[787321.680344] [<ffffffff815ff2af>] __wait_on_bit+0x5f/0x90
[787321.680366] [<ffffffff8110ead3>] wait_on_page_bit+0x73/0x80
[787321.680390] [<ffffffff81085980>] ? autoremove_wake_function+0x40/0x40
[787321.680416] [<ffffffff8111aeb5>] ? pagevec_lookup_tag+0x25/0x40
[787321.680439] [<ffffffff8110ed06>] filemap_fdatawait_range+0xf6/0x1a0
[787321.680483] [<ffffffffa022e740>] ? nfs_destroy_directcache+0x20/0x20 [nfs]
[787321.680507] [<ffffffff8111a3b1>] ? do_writepages+0x21/0x40
[787321.680530] [<ffffffff8110ff8b>] ? __filemap_fdatawrite_range+0x5b/0x60
[787321.680555] [<ffffffff81110000>] filemap_write_and_wait_range+0x70/0x80
[787321.680580] [<ffffffff8119b45a>] vfs_fsync_range+0x5a/0x90
[787321.680602] [<ffffffff8119b4fc>] vfs_fsync+0x1c/0x20
[787321.680628] [<ffffffffa0222be4>] nfs_file_flush+0x54/0x80 [nfs]
[787321.680653] [<ffffffff8116d66f>] filp_close+0x3f/0x90
[787321.680675] [<ffffffff8116e097>] sys_close+0xb7/0x120
[787321.680698] [<ffffffff816090c2>] system_call_fastpath+0x16/0x1b
A network trace showed the following exchange below being continually
repeated. Unfortunately this system too has since been rebooted.
Any thoughts on the matter would be most welcome,
Kind regards,
..david
No. Time Source Destination Protocol Size Info
9880 11:40:12.833617 172.29.190.21 172.29.120.140 NFS 1122 V4 COMPOUND Call (Reply In 9881) <EMPTY> PUTFH;WRITE;GETATTR
Frame 9880: 1122 bytes on wire (8976 bits), 1122 bytes captured (8976 bits)
Arrival Time: Oct 17, 2011 11:40:12.833617000 BST
Frame Length: 1122 bytes (8976 bits)
Capture Length: 1122 bytes (8976 bits)
Ethernet II, Src: ChelsioC_06:68:f9 (00:07:43:06:68:f9), Dst: All-HSRP-routers_be (00:00:0c:07:ac:be)
Internet Protocol, Src: 172.29.190.21 (172.29.190.21), Dst: 172.29.120.140 (172.29.120.140)
Transmission Control Protocol, Src Port: 816 (816), Dst Port: nfs (2049), Seq: 5199745, Ack: 275801, Len: 1056
Remote Procedure Call, Type:Call XID:0x5daa6e93
Network File System
[Program Version: 4]
[V4 Procedure: COMPOUND (1)]
Tag: <EMPTY>
length: 0
contents: <EMPTY>
minorversion: 0
Operations (count: 3)
Opcode: PUTFH (22)
filehandle
length: 36
[hash (CRC-32): 0x6e4b15f3]
decode type as: unknown
filehandle: 7df3a75d5e1cd908000ab44c5b000000efc80200000a0300...
Opcode: WRITE (38)
stateid
seqid: 0x00000000
Data: 4e06f15b800f82e300000000
offset: 11392
stable: FILE_SYNC4 (2)
Write length: 814
Data: <DATA>
length: 814
contents: <DATA>
fill bytes: opaque data
Opcode: GETATTR (9)
GETATTR4args
attr_request
bitmap[0] = 0x00000018
[2 attributes requested]
mand_attr: FATTR4_CHANGE (3)
mand_attr: FATTR4_SIZE (4)
bitmap[1] = 0x00300000
[2 attributes requested]
recc_attr: FATTR4_TIME_METADATA (52)
recc_attr: FATTR4_TIME_MODIFY (53)
No. Time Source Destination Protocol Size Info
9881 11:40:12.833956 172.29.120.140 172.29.190.21 NFS 122 V4 COMPOUND Reply (Call In 9880) <EMPTY> PUTFH;WRITE
Frame 9881: 122 bytes on wire (976 bits), 122 bytes captured (976 bits)
Arrival Time: Oct 17, 2011 11:40:12.833956000 BST
[Time delta from previous captured frame: 0.000339000 seconds]
Frame Length: 122 bytes (976 bits)
Capture Length: 122 bytes (976 bits)
Ethernet II, Src: Cisco_1e:f7:80 (00:13:5f:1e:f7:80), Dst: ChelsioC_06:68:f9 (00:07:43:06:68:f9)
Internet Protocol, Src: 172.29.120.140 (172.29.120.140), Dst: 172.29.190.21 (172.29.190.21)
Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 816 (816), Seq: 275801, Ack: 5200801, Len: 56
Remote Procedure Call, Type:Reply XID:0x5daa6e93
Network File System
[Program Version: 4]
[V4 Procedure: COMPOUND (1)]
Status: NFS4ERR_BAD_STATEID (10025)
Tag: <EMPTY>
length: 0
contents: <EMPTY>
Operations (count: 2)
Opcode: PUTFH (22)
Status: NFS4_OK (0)
Opcode: WRITE (38)
Status: NFS4ERR_BAD_STATEID (10025)
next reply other threads:[~2011-10-24 10:40 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-10-24 10:40 David Flynn [this message]
2011-10-24 11:22 ` NFS4 BAD_STATEID loop (kernel 3.0) Trond Myklebust
2011-10-24 13:17 ` David Flynn
2011-10-24 13:32 ` Trond Myklebust
2011-10-24 14:50 ` NFS4 BAD_STATEID loop (kernel 3.0.4) David Flynn
2011-10-24 15:31 ` Trond Myklebust
2011-10-24 15:55 ` David Flynn
2011-10-27 22:17 ` David Flynn
2011-10-29 0:25 ` NFS4ERR_STALE_CLIENTID loop David Flynn
2011-10-29 17:29 ` Trond Myklebust
2011-10-29 18:02 ` David Flynn
2011-10-29 18:22 ` Myklebust, Trond
2011-10-29 18:23 ` Chuck Lever
2011-10-29 18:26 ` Myklebust, Trond
2011-10-29 18:29 ` David Flynn
2011-10-29 18:15 ` J. Bruce Fields
2011-10-29 18:21 ` Myklebust, Trond
2011-10-29 18:47 ` J. Bruce Fields
2011-10-29 18:50 ` Chuck Lever
2011-10-29 19:19 ` Myklebust, Trond
2011-10-29 19:52 ` David Flynn
2011-10-29 20:42 ` Myklebust, Trond
2011-10-29 21:07 ` David Flynn
2011-10-29 21:12 ` Myklebust, Trond
2011-10-31 13:07 ` Chuck Lever
2011-10-31 13:21 ` David Flynn
2011-10-31 13:39 ` Chuck Lever
2011-10-24 13:43 ` NFS4 BAD_STATEID loop (kernel 3.0) Chuck Lever
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111024104042.GD32587@rd.bbc.co.uk \
--to=davidf@rd.bbc.co.uk \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.