* [PATCH 0/1] nfs: Panic when commit fails
@ 2009-10-20 17:44 Terry Loftin
0 siblings, 0 replies; only message in thread
From: Terry Loftin @ 2009-10-20 17:44 UTC (permalink / raw)
To: SteveD, linux-nfs
Steve,
Please consider the following patch. Thanks in advance!
Summary:
At the end of an nfs write, if the nfs commit fails, all the writes
will be rescheduled. They are supposed to be rescheduled as NFS_FILE_SYNC
writes, but the rpc_task structure is not completely intialized and so
the option is not passed. When the rescheduled writes complete, the
return indicates that they are NFS_UNSTABLE and we try to do another
commit. This leads to a Panic because the commit data structure pointer
was set to null in the initial (failed) commit attempt.
Long-Winded Description:
This panic is pretty rare - it requires all writes to complete
succesfully, and the commit to fail.
We had an NFS clilent panic with this stack trace:
PID: 5981 TASK: ffff8801a51c86c0 CPU: 6 COMMAND: "nfsiod"
#0 [ffff88019f1ab800] crash_kexec at ffffffff8026eb22
#1 [ffff88019f1ab8e0] oops_end at ffffffff804c7099
#2 [ffff88019f1ab910] do_page_fault at ffffffff804c8d2a
#3 [ffff88019f1abc90] page_fault at ffffffff804c65d5
#4 [ffff88019f1abd18] nfs_direct_write_complete at ffffffffa0345bf7
#5 [ffff88019f1abdf8] nfs_direct_write_release at ffffffffa0346030
#6 [ffff88019f1abe28] rpc_release_calldata at ffffffffa02ec427
#7 [ffff88019f1abe38] rpc_free_task at ffffffffa02ec4ce
#8 [ffff88019f1abe68] rpc_async_release at ffffffffa02ec58b
#9 [ffff88019f1abe78] run_workqueue at ffffffff80252b17
#10 [ffff88019f1abec8] worker_thread at ffffffff80252c91
#11 [ffff88019f1abf28] kthread at ffffffff8025623d
#12 [ffff88019f1abf48] kernel_thread at ffffffff8021241a
Just prior to this, our nfs server crashed due to hw problems.
The Panic happens in nfs_direct_commit_schedule(), at line ~568 in
fs/nfs/direct.c:
data->inode = dreq->inode;
Here, data is NULL and is the cause of the Panic. This means that
dreq->commit_data is NULL. By inspection, this pointer is only set
to NULL in nfs_direct_commit_schedule(), so we must have called this
function twice.
This panic happens when a commit fails. In nfs_direct_commit_release(),
when a commit fails, we set dreq->flags to NFS_ODIRECT_RESCHED_WRITES
to reschedule all the associated writes. When these writes complete,
we try to do another commit, but, the commit_data pointer has already
been set to NULL during previous commit attempt.
The problem is that we shouldn't be doing a second commit attempt.
In nfs_direct_write_reschedule() we request a sync write by setting
the args->stable pointer to NFS_FILE_SYNC, but the server does not
see this, because the rpc_task structure isn't fully initialized.
Thus the response from the server for these rescheduled writes always
returns NFS_UNSTABLE, which causes us to do a second commit (See:
nfs_direct_write_release()) and causes a Panic.
I added code to nfs_direct_release() to fail every 8th commit. The
panic was reproduced on the 8th commit every time. I then added a
line to initialize the rpc_task structure in nfs_direct_write_reschedule().
I quit testing after logging over 64K commits (with 8K commit failures)
and no Panics, and no write errors detected.
This is pretty much a typo. Every other rpc_task struct appears to
be fully initialized.
-T
^ permalink raw reply [flat|nested] only message in thread
only message in thread, other threads:[~2009-10-20 17:44 UTC | newest]
Thread overview: (only message) (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-20 17:44 [PATCH 0/1] nfs: Panic when commit fails Terry Loftin
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.