* RPC retransmission of write requests containing bogus data
@ 2008-10-17 11:01 Ian Campbell
[not found] ` <1224241273.9053.109.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Ian Campbell @ 2008-10-17 11:01 UTC (permalink / raw)
To: Trond.Myklebust, linux-nfs; +Cc: mchan-dY08KVG/lbpWk0Htik3J/w
[-- Attachment #1: Type: text/plain, Size: 7336 bytes --]
(please CC me, I am not currently subscribed to linux-nfs)
Hi,
For some time now I have been tracking down a mysterious host crash when
using Xen with blktap (userspace paravirtual disk implementation) under
stress conditions. I have now managed to create a simple test case with
no Xen components which appears to show RPC retransmissions of write
requests containing invalid data. I'll start at the beginning and
explain the original bug although the test case is much simpler.
blktap is a userspace daemon which provides paravirtualised disk support
to Xen guests. The pages of data for a block write are mapped from the
Xen guest into a domain 0 driver called blktap which has a userspace
component which implements qcow/vhd/etc and writes the data to the
backing file using O_DIRECT and aio in a zero copy manner. Once the the
aio is completed the pages are returned to the guest and unmapped from
domain 0. When a page is unmapped in this way the pte is set not present
and the PFN is mapped to an invalid MFN.
We have been seeing a crash in the domain 0 network driver's start_xmit
routine when it attempts to access data from a page where the PFN maps
to an invalid MFN. I added some tracing to the kernel and observed this
sequence of events on an individual page:
tap: 0/41 at 4fac7a165ad6 "rpc_init_task" c8792844 0
tap: 1/41 at 4fac7a1663e2 "nfs_direct_write_schedule" ece19850 2000
tap: 2/41 at 4fac7a166cca "call_start" c8792844 0
tap: 3/41 at 4fac7a167540 "call_reserve" c8792844 0
tap: 4/41 at 4fac7a167de6 "call_reserveresult" c8792844 0
tap: 5/41 at 4fac7a168620 "call_allocate" c8792844 0
tap: 6/41 at 4fac7a168f08 "call_bind" c8792844 0
tap: 7/41 at 4fac7a169712 "call_connect" c8792844 0
tap: 8/41 at 4fac7a169f28 "call_transmit" c8792844 0
tap: 9/41 at 4fac7a16a7f2 "call_encode" c8792844 0
tap: 10/41 at 4fac7a16afd8 "call_header" c8792844 0
tap: 11/41 at 4fac7a16bc6e "xs_tcp_send_request" c8792844 0
tap: 12/41 at 4fac7a16c9d0 "tcp_sendpage" 0 0
tap: 13/41 at 4fac7a16cec2 "do_tcp_sendpages (adding data to skb)" cef08b00 0
tap: 14/41 at 4fac7a16e068 "call_transmit_status" c8792844 0
tap: 15/41 at 4fac7a2ed8f4 "tcp_transmit_skb, skb_clone" c9dca500 c9dca5a8
tap: 16/41 at 4faeeeb9f566 "xprt_timer" c8792844 0
tap: 17/41 at 4faeeeb9ff0e "xprt_timer: !req->rq_received" c8792844 0
tap: 18/41 at 4faeeeba08ec "rpc_make_runnable" c8792844 0
tap: 19/41 at 4faeeec117b8 "call_status" c8792844 ffffff92
tap: 20/41 at 4faeeec11faa "timeout (minor)" c8792844 0
tap: 21/41 at 4faeeec12778 "call_bind" c8792844 0
tap: 22/41 at 4faeeec12ef8 "call_connect" c8792844 0
tap: 23/41 at 4faeeec13678 "call_transmit" c8792844 0
tap: 24/41 at 4faeeec13e46 "call_encode" c8792844 0
tap: 25/41 at 4faeeec145ae "call_header" c8792844 0
tap: 26/41 at 4faeeec15082 "xs_tcp_send_request" c8792844 0
tap: 27/41 at 4faeeec15d1e "tcp_sendpage" 0 0
tap: 28/41 at 4faeeec161ce "do_tcp_sendpages (adding data to skb)" d06afe40 0
tap: 29/41 at 4faeeec172ea "call_transmit_status" c8792844 0
tap: 30/41 at 4faf2e3280d4 "rpc_make_runnable" c8792844 0
tap: 31/41 at 4faf2e3440c4 "call_status" c8792844 88
tap: 32/41 at 4faf2e3449d6 "call_decode" c8792844 0
tap: 33/41 at 4faf2e345240 "call_verify" c8792844 0
tap: 34/41 at 4faf2e345a9e "rpc_exit_task" c8792844 0
tap: 35/41 at 4faf2e34652a "nfs_direct_write_result" ece19850 2000
tap: 36/41 at 4faf2e351000 "nfs_direct_write_release (completing)" ece19850 0
tap: 37/41 at 4faf2e3517ec "nfs_direct_write_complete,DEFAULT w/ iocb" ece19850 0
tap: 38/41 at 4faf2e35205c "nfs_direct_free_writedata" ece19850 0
tap: 39/41 at 4faf2e51026a "fast_flush_area" 0 0
tap: 40/41 at 4faf33e1813a "tcp_transmit_skb, skb_clone" d06afe40 d06afee8
(nb: fast_flush_area is the blktap function which returns the pages to
the guest and unmaps them from domain 0, it is called via ioctl from the
userspace process once the aio write returns successfully. 4fac7.... is
the tsc, processor is 2.33GHz)
So what we see is the initial request being constructed and transmitted
(around 11/41-15/41) followed by a timeout ~60s later (16/41-20/41)
which causes us to queue a retransmit (26/41-29/41) but, critically, not
yet actually transmit it. At this point we get the reply to the original
request and complete the NFS write (35/41-38/41) returning success to
userspace which causes it to unmap the pages and return them to the
guest (39/41). At this point (40/41) we attempt to transmit the
duplicate request and crash because the pages are not present any more.
By using libnetfilter_queue I was then able to reproduce this in
non-stress situations by introducing delays into the network. Stalling
all network traffic with the NFS server for 65s every 1000 packets seems
to do the trick (probably on the aggressive side, I didn't attempt to
minimise the interruption required to reproduce and I realise that this
represents a pretty crappy network and/or server).
Given the observed sequence of events I then constructed a fairly simple
test program (blktest2.c, attached) using O_DIRECT (but not aio, I don't
think it matters) which shows the same issue without involving any Xen
parts. The test writes a buffer filed with a counter to a file and
immediately after the write() returns fills the buffer with 0xdeadbeef.
By using tcpdump I capture and observe duplicated requests on the wire
containing 0xdeadbeef in the payload and not the expected counter
values. I usually see this in a matter of minutes. I've attached a pcap
of a single request/reply pair which was corrupted.
Presumably in the case of a decent NFS server the XID request cache
would prevent the bogus data actually reaching the disk but on a
non-decent server I suspect it might actually lead to corruption (AIUI
the request cache is not a hard requirement of the NFS protocol?).
Perhaps even a decent server might have timed out the entry in the cache
after such a delay?
The Xen case and the blktest2 repro was on 2.6.18. I have also
reproduced the blktest2 case on 2.6.27 native but not with Xen since no
domain 0 support exists just yet.
I can think of several possible option to fix this issue:
* Do not return success to userspace until all duplicated requests
have actually hit the wire, even if the response comes back
earlier than that.
* Cancel queued duplicates if a response comes in late.
* Copy the data pages on retransmission.
I think I prefer the 3rd option since the first two are a bit tricky
since request has been merged into a tcp stream and may have been
fragmented/segmented already etc. I don't think doing the copy on
retransmits has a massive performance impact -- you must already be
suffering from pretty severe network or server issues!
I have CC'd the maintainer of the bnx2 Ethernet driver (Michael Chan)
because I have so far only been able to reproduce this with the broadcom
NIC. Even in the same server if I switch to e1000 I cannot reproduce.
However given the analysis above I'm not convinced it is likely to be a
driver bug.
Ian.
[-- Attachment #2: netoutage.c --]
[-- Type: text/x-csrc, Size: 2928 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <netinet/in.h>
#include <linux/types.h>
#include <linux/netfilter.h> /* for NF_ACCEPT */
#include <libnetfilter_queue/libnetfilter_queue.h>
#include <err.h>
#include <errno.h>
/* "Configuration" */
#define NR 1000 /* Delay 1/NR */
#define HOWLONG 65 /* Delay for HOWLONG seconds */
static int n = 0;
static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
struct nfq_data *nfa, void *data)
{
int id = 0;
struct nfqnl_msg_packet_hdr *ph;
time_t t = time(NULL);
ph = nfq_get_msg_packet_hdr(nfa);
if (ph)
id = ntohl(ph->packet_id);
n++;
if ((n%NR) == 0) {
printf("%d: delaying at %d for %ds ...", t, id, HOWLONG);
fflush(stdout);
sleep(HOWLONG);
printf("\n");
n = 0;
}
return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
}
int main(int argc, char **argv)
{
struct nfq_handle *h;
struct nfq_q_handle *qh;
struct nfnl_handle *nh;
int fd;
int rv;
char buf[4096] __attribute__ ((aligned));
printf("opening library handle\n");
h = nfq_open();
if (!h) {
fprintf(stderr, "error during nfq_open()\n");
exit(1);
}
printf("unbinding existing nf_queue handler for AF_INET (if any)\n");
if (nfq_unbind_pf(h, AF_INET) < 0) {
fprintf(stderr, "error during nfq_unbind_pf()\n");
exit(1);
}
printf("binding nfnetlink_queue as nf_queue handler for AF_INET\n");
if (nfq_bind_pf(h, AF_INET) < 0) {
fprintf(stderr, "error during nfq_bind_pf()\n");
exit(1);
}
printf("binding this socket to queue '0'\n");
qh = nfq_create_queue(h, 0, &cb, NULL);
if (!qh) {
fprintf(stderr, "error during nfq_create_queue()\n");
exit(1);
}
#if 0
printf("setting copy_packet mode\n");
if (nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff) < 0) {
fprintf(stderr, "can't set copy_packet mode\n");
exit(1);
}
#endif
#if 1
printf("setting copy_meta mode\n");
if (nfq_set_mode(qh, NFQNL_COPY_META, 0xffff) < 0) {
fprintf(stderr, "can't set copy_meta mode\n");
exit(1);
}
#endif
nh = nfq_nfnlh(h);
fd = nfnl_fd(nh);
if (nfnl_rcvbufsiz(nh, 10*1024*1024) < 0 ) {
fprintf(stderr, "error during nfnl_rcvbufsiz()\n");
exit(1);
}
while ((rv = recv(fd, buf, sizeof(buf), 0)) && rv >= 0) {
nfq_handle_packet(h, buf, rv);
}
printf("unbinding from queue 0, rv %d, errno %d\n", rv, errno);
nfq_destroy_queue(qh);
printf("closing library handle\n");
nfq_close(h);
return 0;
}
[-- Attachment #3: blktest2.c --]
[-- Type: text/x-csrc, Size: 1973 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <err.h>
#include <errno.h>
#define SLEEP 1
#define SIZE 4096
#define NR 256
#define NONCE 0xdead
#define DOCHECK 0
#define RETRYWRITE 0
static void setup(int *data, int val, int size)
{
int i;
for (i=0; i<SIZE/sizeof(int); i++)
data[i] = val;
}
#if DOCHECK
static void check(int *data, int val, int size)
{
int i;
for (i=0; i<SIZE/sizeof(int); i++) {
if (data[i] != val) {
err(1, "failed to verify %#x got %#x\n",
val, data[i]);
}
}
}
#endif
int main(int argc, char **argv)
{
int fd, rc, n, iter = 0;
const char *path;
int data[NR][SIZE/sizeof(int)];
if (argc != 2) {
fprintf(stderr, "usage: blktest2 [PATH]\n");
exit(1);
}
path = argv[1];
printf("opening %s for O_DIRECT access\n", path);
fd = open(path, O_CREAT|O_DIRECT|O_RDWR, 0666);
if (fd < 0)
err(1,"unable to open file");
while(1) {
if ((iter%10)==0)
printf("iteration %d ...", iter);
/* write */
for(n=0; n<NR; n++) {
do {
if (lseek(fd, n * SIZE, SEEK_SET) < 0)
err(1, "seek for write %d %d\n", iter, n);
setup(data[n], (NONCE<<16)+(iter<<8)+n, sizeof(data[n]));
rc = write(fd, data[n], SIZE);
setup(data[n], 0xdeadbeef, sizeof(data[n]));
if (rc == -1)
warn("write failed");
else if (rc != SIZE)
err(1, "only wrote %d/%d bytes\n", rc, SIZE);
} while(RETRYWRITE && rc == -1 && errno == EIO);
}
#if DOCHECK
/* check */
for(n=0; n<NR; n++) {
do {
if (lseek(fd, n * SIZE, SEEK_SET) < 0)
err(1, "seek for read %d %d\n", iter, n);
rc = read(fd, data[n], SIZE);
if (rc == -1)
warn("read failed");
else if (rc != SIZE)
warn("only read %d/%d bytes\n", rc, SIZE);
} while(rc == -1 && errno == EIO);
check(data[n], (NONCE<<16)+(iter<<8)+n, SIZE);
}
#endif
if ((iter%10)==0)
printf("\n", iter);
iter++;
}
}
[-- Attachment #4: corrupt.pcap --]
[-- Type: application/octet-stream, Size: 5236 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
[not found] ` <1224241273.9053.109.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
@ 2008-10-17 12:48 ` Trond Myklebust
2008-10-17 13:01 ` Ian Campbell
2008-10-17 13:32 ` Talpey, Thomas
1 sibling, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2008-10-17 12:48 UTC (permalink / raw)
To: Ian Campbell; +Cc: linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
On Fri, 2008-10-17 at 12:01 +0100, Ian Campbell wrote:
> (please CC me, I am not currently subscribed to linux-nfs)
>
> Hi,
>
> For some time now I have been tracking down a mysterious host crash when
> using Xen with blktap (userspace paravirtual disk implementation) under
> stress conditions. I have now managed to create a simple test case with
> no Xen components which appears to show RPC retransmissions of write
> requests containing invalid data. I'll start at the beginning and
> explain the original bug although the test case is much simpler.
>
> blktap is a userspace daemon which provides paravirtualised disk support
> to Xen guests. The pages of data for a block write are mapped from the
> Xen guest into a domain 0 driver called blktap which has a userspace
> component which implements qcow/vhd/etc and writes the data to the
> backing file using O_DIRECT and aio in a zero copy manner. Once the the
> aio is completed the pages are returned to the guest and unmapped from
> domain 0. When a page is unmapped in this way the pte is set not present
> and the PFN is mapped to an invalid MFN.
>
> We have been seeing a crash in the domain 0 network driver's start_xmit
> routine when it attempts to access data from a page where the PFN maps
> to an invalid MFN. I added some tracing to the kernel and observed this
> sequence of events on an individual page:
> tap: 0/41 at 4fac7a165ad6 "rpc_init_task" c8792844 0
> tap: 1/41 at 4fac7a1663e2 "nfs_direct_write_schedule" ece19850 2000
> tap: 2/41 at 4fac7a166cca "call_start" c8792844 0
> tap: 3/41 at 4fac7a167540 "call_reserve" c8792844 0
> tap: 4/41 at 4fac7a167de6 "call_reserveresult" c8792844 0
> tap: 5/41 at 4fac7a168620 "call_allocate" c8792844 0
> tap: 6/41 at 4fac7a168f08 "call_bind" c8792844 0
> tap: 7/41 at 4fac7a169712 "call_connect" c8792844 0
> tap: 8/41 at 4fac7a169f28 "call_transmit" c8792844 0
> tap: 9/41 at 4fac7a16a7f2 "call_encode" c8792844 0
> tap: 10/41 at 4fac7a16afd8 "call_header" c8792844 0
> tap: 11/41 at 4fac7a16bc6e "xs_tcp_send_request" c8792844 0
> tap: 12/41 at 4fac7a16c9d0 "tcp_sendpage" 0 0
> tap: 13/41 at 4fac7a16cec2 "do_tcp_sendpages (adding data to skb)" cef08b00 0
> tap: 14/41 at 4fac7a16e068 "call_transmit_status" c8792844 0
> tap: 15/41 at 4fac7a2ed8f4 "tcp_transmit_skb, skb_clone" c9dca500 c9dca5a8
> tap: 16/41 at 4faeeeb9f566 "xprt_timer" c8792844 0
> tap: 17/41 at 4faeeeb9ff0e "xprt_timer: !req->rq_received" c8792844 0
> tap: 18/41 at 4faeeeba08ec "rpc_make_runnable" c8792844 0
> tap: 19/41 at 4faeeec117b8 "call_status" c8792844 ffffff92
> tap: 20/41 at 4faeeec11faa "timeout (minor)" c8792844 0
> tap: 21/41 at 4faeeec12778 "call_bind" c8792844 0
> tap: 22/41 at 4faeeec12ef8 "call_connect" c8792844 0
> tap: 23/41 at 4faeeec13678 "call_transmit" c8792844 0
> tap: 24/41 at 4faeeec13e46 "call_encode" c8792844 0
> tap: 25/41 at 4faeeec145ae "call_header" c8792844 0
> tap: 26/41 at 4faeeec15082 "xs_tcp_send_request" c8792844 0
> tap: 27/41 at 4faeeec15d1e "tcp_sendpage" 0 0
> tap: 28/41 at 4faeeec161ce "do_tcp_sendpages (adding data to skb)" d06afe40 0
> tap: 29/41 at 4faeeec172ea "call_transmit_status" c8792844 0
> tap: 30/41 at 4faf2e3280d4 "rpc_make_runnable" c8792844 0
> tap: 31/41 at 4faf2e3440c4 "call_status" c8792844 88
> tap: 32/41 at 4faf2e3449d6 "call_decode" c8792844 0
> tap: 33/41 at 4faf2e345240 "call_verify" c8792844 0
> tap: 34/41 at 4faf2e345a9e "rpc_exit_task" c8792844 0
> tap: 35/41 at 4faf2e34652a "nfs_direct_write_result" ece19850 2000
> tap: 36/41 at 4faf2e351000 "nfs_direct_write_release (completing)" ece19850 0
> tap: 37/41 at 4faf2e3517ec "nfs_direct_write_complete,DEFAULT w/ iocb" ece19850 0
> tap: 38/41 at 4faf2e35205c "nfs_direct_free_writedata" ece19850 0
> tap: 39/41 at 4faf2e51026a "fast_flush_area" 0 0
> tap: 40/41 at 4faf33e1813a "tcp_transmit_skb, skb_clone" d06afe40 d06afee8
> (nb: fast_flush_area is the blktap function which returns the pages to
> the guest and unmaps them from domain 0, it is called via ioctl from the
> userspace process once the aio write returns successfully. 4fac7.... is
> the tsc, processor is 2.33GHz)
>
> So what we see is the initial request being constructed and transmitted
> (around 11/41-15/41) followed by a timeout ~60s later (16/41-20/41)
> which causes us to queue a retransmit (26/41-29/41) but, critically, not
> yet actually transmit it. At this point we get the reply to the original
> request and complete the NFS write (35/41-38/41) returning success to
> userspace which causes it to unmap the pages and return them to the
> guest (39/41). At this point (40/41) we attempt to transmit the
> duplicate request and crash because the pages are not present any more.
>
> By using libnetfilter_queue I was then able to reproduce this in
> non-stress situations by introducing delays into the network. Stalling
> all network traffic with the NFS server for 65s every 1000 packets seems
> to do the trick (probably on the aggressive side, I didn't attempt to
> minimise the interruption required to reproduce and I realise that this
> represents a pretty crappy network and/or server).
>
> Given the observed sequence of events I then constructed a fairly simple
> test program (blktest2.c, attached) using O_DIRECT (but not aio, I don't
> think it matters) which shows the same issue without involving any Xen
> parts. The test writes a buffer filed with a counter to a file and
> immediately after the write() returns fills the buffer with 0xdeadbeef.
> By using tcpdump I capture and observe duplicated requests on the wire
> containing 0xdeadbeef in the payload and not the expected counter
> values. I usually see this in a matter of minutes. I've attached a pcap
> of a single request/reply pair which was corrupted.
>
> Presumably in the case of a decent NFS server the XID request cache
> would prevent the bogus data actually reaching the disk but on a
> non-decent server I suspect it might actually lead to corruption (AIUI
> the request cache is not a hard requirement of the NFS protocol?).
> Perhaps even a decent server might have timed out the entry in the cache
> after such a delay?
>
> The Xen case and the blktest2 repro was on 2.6.18. I have also
> reproduced the blktest2 case on 2.6.27 native but not with Xen since no
> domain 0 support exists just yet.
>
> I can think of several possible option to fix this issue:
> * Do not return success to userspace until all duplicated requests
> have actually hit the wire, even if the response comes back
> earlier than that.
> * Cancel queued duplicates if a response comes in late.
> * Copy the data pages on retransmission.
>
> I think I prefer the 3rd option since the first two are a bit tricky
> since request has been merged into a tcp stream and may have been
> fragmented/segmented already etc. I don't think doing the copy on
> retransmits has a massive performance impact -- you must already be
> suffering from pretty severe network or server issues!
>
> I have CC'd the maintainer of the bnx2 Ethernet driver (Michael Chan)
> because I have so far only been able to reproduce this with the broadcom
> NIC. Even in the same server if I switch to e1000 I cannot reproduce.
> However given the analysis above I'm not convinced it is likely to be a
> driver bug.
>
> Ian.
I don't see how this could be an RPC bug. The networking layer is
supposed to either copy the data sent to the socket, or take a reference
to any pages that are pushed via the ->sendpage() abi.
IOW: the pages are supposed to be still referenced by the networking
layer even if the NFS layer and page cache have dropped their
references.
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
2008-10-17 12:48 ` Trond Myklebust
@ 2008-10-17 13:01 ` Ian Campbell
[not found] ` <1224248469.9053.119.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Ian Campbell @ 2008-10-17 13:01 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> I don't see how this could be an RPC bug. The networking layer is
> supposed to either copy the data sent to the socket, or take a reference
> to any pages that are pushed via the ->sendpage() abi.
>
> IOW: the pages are supposed to be still referenced by the networking
> layer even if the NFS layer and page cache have dropped their
> references.
The pages are still referenced by the networking layer. The problem is
that the userspace app has been told that the write has completed so it
is free to write new data to those pages.
Ian.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
[not found] ` <1224248469.9053.119.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
@ 2008-10-17 13:22 ` Trond Myklebust
2008-10-17 13:33 ` Ian Campbell
2008-10-20 14:25 ` Ian Campbell
0 siblings, 2 replies; 12+ messages in thread
From: Trond Myklebust @ 2008-10-17 13:22 UTC (permalink / raw)
To: Ian Campbell; +Cc: linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > I don't see how this could be an RPC bug. The networking layer is
> > supposed to either copy the data sent to the socket, or take a reference
> > to any pages that are pushed via the ->sendpage() abi.
> >
> > IOW: the pages are supposed to be still referenced by the networking
> > layer even if the NFS layer and page cache have dropped their
> > references.
>
> The pages are still referenced by the networking layer. The problem is
> that the userspace app has been told that the write has completed so it
> is free to write new data to those pages.
>
> Ian.
OK, I see your point.
Does this happen at all with NFSv4? I ask because the NFSv4 client will
always ensure that the TCP connection gets broken before a
retransmission. I wouldn't therefore expect any races between a reply to
the previous transmission and the new one...
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
[not found] ` <1224241273.9053.109.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2008-10-17 12:48 ` Trond Myklebust
@ 2008-10-17 13:32 ` Talpey, Thomas
[not found] ` <RTPCLUEXC1-PRDVjCRG000001ab-rtwIt2gI0FxT+ZUat5FNkAK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Talpey, Thomas @ 2008-10-17 13:32 UTC (permalink / raw)
To: Ian Campbell; +Cc: Trond.Myklebust, linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
At 07:01 AM 10/17/2008, Ian Campbell wrote:
>(please CC me, I am not currently subscribed to linux-nfs)
>...
>Presumably in the case of a decent NFS server the XID request cache
>would prevent the bogus data actually reaching the disk but on a
>non-decent server I suspect it might actually lead to corruption (AIUI
>the request cache is not a hard requirement of the NFS protocol?).
>Perhaps even a decent server might have timed out the entry in the cache
>after such a delay?
Unfortunately no - because 1) your retransmissions are not, in fact,
duplicates since the data has changed and 2) no NFSv3 reply cache
works perfectly, especially under heavy load. The NFSv4.1 session
addresses this, but that's not at issue here.
This is a really nasty race. The whole thing starts with the dropped
TCP segment evidenced at #2 of your trace. Then, the retransmission
appears to have been scheduled prior to the write reply making it back
to the client through the TCP storm, so the retransmit is actually pending
on the wire while the NFS write operation is completed.
The fix here is to break the connection before retrying, a long-standing
pet peeve of mine that NFSv3 historically does not do. Setting the
clnt->cl_discrtry bit in the RPC client struct is all that's required. The
NFSv4 client does this by default, btw.
Tom.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
2008-10-17 13:22 ` Trond Myklebust
@ 2008-10-17 13:33 ` Ian Campbell
2008-10-20 14:25 ` Ian Campbell
1 sibling, 0 replies; 12+ messages in thread
From: Ian Campbell @ 2008-10-17 13:33 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
On Fri, 2008-10-17 at 09:22 -0400, Trond Myklebust wrote:
> On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> > On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > > I don't see how this could be an RPC bug. The networking layer is
> > > supposed to either copy the data sent to the socket, or take a reference
> > > to any pages that are pushed via the ->sendpage() abi.
> > >
> > > IOW: the pages are supposed to be still referenced by the networking
> > > layer even if the NFS layer and page cache have dropped their
> > > references.
> >
> > The pages are still referenced by the networking layer. The problem is
> > that the userspace app has been told that the write has completed so it
> > is free to write new data to those pages.
> >
> > Ian.
>
> OK, I see your point.
>
> Does this happen at all with NFSv4? I ask because the NFSv4 client will
> always ensure that the TCP connection gets broken before a
> retransmission. I wouldn't therefore expect any races between a reply to
> the previous transmission and the new one...
I believe we've only tested with NFSv3.
Unfortunately I've just lent the box which reproduces this issue to a
colleague for the rest of the day. I'll test with v4 next week.
Ian.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
[not found] ` <RTPCLUEXC1-PRDVjCRG000001ab-rtwIt2gI0FxT+ZUat5FNkAK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
@ 2008-10-17 13:36 ` Trond Myklebust
2008-10-17 13:51 ` Talpey, Thomas
0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2008-10-17 13:36 UTC (permalink / raw)
To: Talpey, Thomas; +Cc: Ian Campbell, linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
On Fri, 2008-10-17 at 09:32 -0400, Talpey, Thomas wrote:
> At 07:01 AM 10/17/2008, Ian Campbell wrote:
> >(please CC me, I am not currently subscribed to linux-nfs)
> >...
> >Presumably in the case of a decent NFS server the XID request cache
> >would prevent the bogus data actually reaching the disk but on a
> >non-decent server I suspect it might actually lead to corruption (AIUI
> >the request cache is not a hard requirement of the NFS protocol?).
> >Perhaps even a decent server might have timed out the entry in the cache
> >after such a delay?
>
> Unfortunately no - because 1) your retransmissions are not, in fact,
> duplicates since the data has changed and 2) no NFSv3 reply cache
> works perfectly, especially under heavy load. The NFSv4.1 session
> addresses this, but that's not at issue here.
>
> This is a really nasty race. The whole thing starts with the dropped
> TCP segment evidenced at #2 of your trace. Then, the retransmission
> appears to have been scheduled prior to the write reply making it back
> to the client through the TCP storm, so the retransmit is actually pending
> on the wire while the NFS write operation is completed.
>
> The fix here is to break the connection before retrying, a long-standing
> pet peeve of mine that NFSv3 historically does not do. Setting the
> clnt->cl_discrtry bit in the RPC client struct is all that's required. The
> NFSv4 client does this by default, btw.
>
> Tom.
It's not a perfect fix, which is why we haven't done that for NFSv3.
When you break the connection, there is the chance that a reply to a
non-idempotent request may get lost, and that the server doesn't
recognise the retransmission due to the above mentioned imperfections
with the replay cache. In that case, the client may get a downright
_wrong_ reply (for instance, it may see an EEXIST reply to a mkdir
request that was actually successful).
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
www.netapp.com
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
2008-10-17 13:36 ` Trond Myklebust
@ 2008-10-17 13:51 ` Talpey, Thomas
0 siblings, 0 replies; 12+ messages in thread
From: Talpey, Thomas @ 2008-10-17 13:51 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Ian Campbell, linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
At 09:36 AM 10/17/2008, Trond Myklebust wrote:
>On Fri, 2008-10-17 at 09:32 -0400, Talpey, Thomas wrote:
>> The fix here is to break the connection before retrying, a long-standing
>> pet peeve of mine that NFSv3 historically does not do.
>
>It's not a perfect fix, which is why we haven't done that for NFSv3.
>
>When you break the connection, there is the chance that a reply to a
>non-idempotent request may get lost, and that the server doesn't
>recognise the retransmission due to the above mentioned imperfections
>with the replay cache. In that case, the client may get a downright
>_wrong_ reply (for instance, it may see an EEXIST reply to a mkdir
>request that was actually successful).
Well, the NFSv4 client will suffer the same issue in the above case.
So, it's choose your poison. The antidote is to adopt NFSv4.1 asap. ;-)
Sorry about the crossing replies btw. My own transmission was scheduled
before pulling email to look for other replies! Hmm...
Tom.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
2008-10-17 13:22 ` Trond Myklebust
2008-10-17 13:33 ` Ian Campbell
@ 2008-10-20 14:25 ` Ian Campbell
[not found] ` <1224512708.15736.12.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
1 sibling, 1 reply; 12+ messages in thread
From: Ian Campbell @ 2008-10-20 14:25 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
[-- Attachment #1: Type: text/plain, Size: 1122 bytes --]
On Fri, 2008-10-17 at 09:22 -0400, Trond Myklebust wrote:
> On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> > On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > > I don't see how this could be an RPC bug. The networking layer is
> > > supposed to either copy the data sent to the socket, or take a reference
> > > to any pages that are pushed via the ->sendpage() abi.
> > >
> > > IOW: the pages are supposed to be still referenced by the networking
> > > layer even if the NFS layer and page cache have dropped their
> > > references.
> >
> > The pages are still referenced by the networking layer. The problem is
> > that the userspace app has been told that the write has completed so it
> > is free to write new data to those pages.
> >
> > Ian.
>
> OK, I see your point.
>
> Does this happen at all with NFSv4? I ask because the NFSv4 client will
> always ensure that the TCP connection gets broken before a
> retransmission. I wouldn't therefore expect any races between a reply to
> the previous transmission and the new one...
It does seem to happen with NFSv4 too (see attached).
Ian.
[-- Attachment #2: corrupt-over-nfsv4.pcap --]
[-- Type: application/octet-stream, Size: 7350 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
[not found] ` <1224512708.15736.12.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
@ 2008-10-20 16:38 ` Ian Campbell
[not found] ` <1224520738.15736.22.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Ian Campbell @ 2008-10-20 16:38 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, mchan-dY08KVG/lbpWk0Htik3J/w
On Mon, 2008-10-20 at 15:25 +0100, Ian Campbell wrote:
> On Fri, 2008-10-17 at 09:22 -0400, Trond Myklebust wrote:
> > On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> > > On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > > > I don't see how this could be an RPC bug. The networking layer is
> > > > supposed to either copy the data sent to the socket, or take a reference
> > > > to any pages that are pushed via the ->sendpage() abi.
> > > >
> > > > IOW: the pages are supposed to be still referenced by the networking
> > > > layer even if the NFS layer and page cache have dropped their
> > > > references.
> > >
> > > The pages are still referenced by the networking layer. The problem is
> > > that the userspace app has been told that the write has completed so it
> > > is free to write new data to those pages.
> > >
> > > Ian.
> >
> > OK, I see your point.
> >
> > Does this happen at all with NFSv4? I ask because the NFSv4 client will
> > always ensure that the TCP connection gets broken before a
> > retransmission. I wouldn't therefore expect any races between a reply to
> > the previous transmission and the new one...
>
> It does seem to happen with NFSv4 too (see attached).
Actually, that was NFSv4 on 2.6.18, I guess I should test with something
newer since it looks like the TCP connection reset stuff is newer (it's
43d78ef2ba5bec26d0315859e8324bfc0be23766 right?)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
[not found] ` <1224520738.15736.22.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
@ 2008-10-23 14:42 ` Ian Campbell
2008-11-11 13:08 ` Ian Campbell
1 sibling, 0 replies; 12+ messages in thread
From: Ian Campbell @ 2008-10-23 14:42 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, Thomas Talpey
(I've dropped Michael to bcc, since I think it's relatively clear this
isn't a NIC driver issue, apologies for the noise Michael)
(sorry to anyone who gets this twice, evolution had a bit of a fit and
I'm not sure if it went out properly)
On Mon, 2008-10-20 at 17:39 +0100, Ian Campbell wrote:
>
> Actually, that was NFSv4 on 2.6.18, I guess I should test with
> something newer since it looks like the TCP connection reset stuff is
> newer (it's 43d78ef2ba5bec26d0315859e8324bfc0be23766 right?)
As expected 2.6.27 appears to be fine when using NFSv4. I used the patch
below with NFSv3 and the problem also went away, although I appreciate
you have concerns with this approach.
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 5ee23e7..759589e 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -623,7 +623,8 @@ static int nfs_init_client(struct nfs_client *clp,
* Create a client RPC handle for doing FSSTAT with UNIX auth only
* - RFC 2623, sec 2.3.2
*/
- error = nfs_create_rpc_client(clp, timeparms, RPC_AUTH_UNIX, 0);
+ error = nfs_create_rpc_client(clp, timeparms, RPC_AUTH_UNIX,
+ RPC_CLNT_CREATE_DISCRTRY);
if (error < 0)
goto error;
nfs_mark_client_ready(clp, NFS_CS_READY);
FYI I've also been experimenting with patches along the lines of the
below. I've not got it to actually work yet though :-/
Ian.
diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
index 64981a2..5c57112 100644
--- a/include/linux/sunrpc/sched.h
+++ b/include/linux/sunrpc/sched.h
@@ -89,6 +89,7 @@ struct rpc_task {
#ifdef RPC_DEBUG
unsigned short tk_pid; /* debugging aid */
#endif
+ int tk_nrtrans; /* number of times this task has transmitted */
};
#define tk_xprt tk_client->cl_xprt
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 99a52aa..72d1ceb 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -884,6 +884,8 @@ void xprt_transmit(struct rpc_task *task)
xprt->stat.req_u += xprt->stat.sends - xprt->stat.recvs;
xprt->stat.bklog_u += xprt->backlog.qlen;
+ task->tk_nrtrans++;
+
/* Don't race with disconnect */
if (!xprt_connected(xprt))
task->tk_status = -ENOTCONN;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 4486c59..1fb5310 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -423,8 +423,9 @@ static int xs_send_kvec(struct socket *sock, struct sockaddr *addr, int addrlen,
return kernel_sendmsg(sock, &msg, NULL, 0, 0);
}
-static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned int base, int more)
+static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned int base, int more, int copy)
{
+ static int once = 0;
struct page **ppage;
unsigned int remainder;
int err, sent = 0;
@@ -440,7 +441,14 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
remainder -= len;
if (remainder != 0 || more)
flags |= MSG_MORE;
- err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+ if (!once && copy) {
+ once = 1;
+ printk("using sock_no_sendpage for retransmit\n");
+ }
+ if (copy)
+ err = sock_no_sendpage(sock, *ppage, base, len, flags);
+ else
+ err = sock->ops->sendpage(sock, *ppage, base, len, flags);
if (remainder == 0 || err != len)
break;
sent += err;
@@ -463,7 +471,7 @@ static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned i
* @base: starting position in the buffer
*
*/
-static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen, struct xdr_buf *xdr, unsigned int base)
+static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen, struct xdr_buf *xdr, unsigned int base, int copy)
{
unsigned int remainder = xdr->len - base;
int err, sent = 0;
@@ -491,7 +499,7 @@ static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen,
if (base < xdr->page_len) {
unsigned int len = xdr->page_len - base;
remainder -= len;
- err = xs_send_pagedata(sock, xdr, base, remainder != 0);
+ err = xs_send_pagedata(sock, xdr, base, remainder != 0, copy);
if (remainder == 0 || err != len)
goto out;
sent += err;
@@ -582,7 +590,7 @@ static int xs_udp_send_request(struct rpc_task *task)
status = xs_sendpages(transport->sock,
xs_addr(xprt),
xprt->addrlen, xdr,
- req->rq_bytes_sent);
+ req->rq_bytes_sent, task->tk_nrtrans);
dprintk("RPC: xs_udp_send_request(%u) = %d\n",
xdr->len - req->rq_bytes_sent, status);
@@ -671,7 +679,7 @@ static int xs_tcp_send_request(struct rpc_task *task)
* called sendmsg(). */
while (1) {
status = xs_sendpages(transport->sock,
- NULL, 0, xdr, req->rq_bytes_sent);
+ NULL, 0, xdr, req->rq_bytes_sent, task->tk_nrtrans);
dprintk("RPC: xs_tcp_send_request(%u) = %d\n",
xdr->len - req->rq_bytes_sent, status);
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: RPC retransmission of write requests containing bogus data
[not found] ` <1224520738.15736.22.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2008-10-23 14:42 ` Ian Campbell
@ 2008-11-11 13:08 ` Ian Campbell
1 sibling, 0 replies; 12+ messages in thread
From: Ian Campbell @ 2008-11-11 13:08 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs
I cannot find my more recent post in my sent box (although I see it on
gmane.org) so apologies for replying to an older mail instead.
I eventually worked around this issue by not doing retransmissions since
the RPC_CLNT_CREATE_DISCRTRY backport to 2.6.18 was non-trivial. We
handle I/O errors at a higher level in our toolstack anyway so this is
not an issue for us.
Ian.
On Mon, 2008-10-20 at 17:39 +0100, Ian Campbell wrote:
> On Mon, 2008-10-20 at 15:25 +0100, Ian Campbell wrote:
> > On Fri, 2008-10-17 at 09:22 -0400, Trond Myklebust wrote:
> > > On Fri, 2008-10-17 at 14:01 +0100, Ian Campbell wrote:
> > > > On Fri, 2008-10-17 at 08:48 -0400, Trond Myklebust wrote:
> > > > > I don't see how this could be an RPC bug. The networking layer is
> > > > > supposed to either copy the data sent to the socket, or take a reference
> > > > > to any pages that are pushed via the ->sendpage() abi.
> > > > >
> > > > > IOW: the pages are supposed to be still referenced by the networking
> > > > > layer even if the NFS layer and page cache have dropped their
> > > > > references.
> > > >
> > > > The pages are still referenced by the networking layer. The problem is
> > > > that the userspace app has been told that the write has completed so it
> > > > is free to write new data to those pages.
> > > >
> > > > Ian.
> > >
> > > OK, I see your point.
> > >
> > > Does this happen at all with NFSv4? I ask because the NFSv4 client will
> > > always ensure that the TCP connection gets broken before a
> > > retransmission. I wouldn't therefore expect any races between a reply to
> > > the previous transmission and the new one...
> >
> > It does seem to happen with NFSv4 too (see attached).
>
> Actually, that was NFSv4 on 2.6.18, I guess I should test with something
> newer since it looks like the TCP connection reset stuff is newer (it's
> 43d78ef2ba5bec26d0315859e8324bfc0be23766 right?)
>
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-11-11 13:08 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-10-17 11:01 RPC retransmission of write requests containing bogus data Ian Campbell
[not found] ` <1224241273.9053.109.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2008-10-17 12:48 ` Trond Myklebust
2008-10-17 13:01 ` Ian Campbell
[not found] ` <1224248469.9053.119.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2008-10-17 13:22 ` Trond Myklebust
2008-10-17 13:33 ` Ian Campbell
2008-10-20 14:25 ` Ian Campbell
[not found] ` <1224512708.15736.12.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2008-10-20 16:38 ` Ian Campbell
[not found] ` <1224520738.15736.22.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>
2008-10-23 14:42 ` Ian Campbell
2008-11-11 13:08 ` Ian Campbell
2008-10-17 13:32 ` Talpey, Thomas
[not found] ` <RTPCLUEXC1-PRDVjCRG000001ab-rtwIt2gI0FxT+ZUat5FNkAK/GNPrWCqfQQ4Iyu8u01E@public.gmane.org>
2008-10-17 13:36 ` Trond Myklebust
2008-10-17 13:51 ` Talpey, Thomas
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.