* Buf starting 2.6.16 - rpc: bad TCP reclen
@ 2006-07-18 11:36 Razvan Gavril
2006-07-19 14:46 ` Bug " Razvan Gavril
0 siblings, 1 reply; 6+ messages in thread
From: Razvan Gavril @ 2006-07-18 11:36 UTC (permalink / raw)
To: nfs
I posted on the Linux kernel mailing list but have had no answer so far.
I have an NFS server and some diskless computers that have their
root mounted via NFS from the server. In certain situations the
diskless computers fail to write correctly to their NFS-mounted
filesystem (some files get corrupted). Looking at the NFS server's
dmesg, I see these messages:
RPC: bad TCP reclen 0x5e9c5bec (non-terminal)
RPC: bad TCP reclen 0x29db3277 (large)
RPC: bad TCP reclen 0x698f6ccf (large)
RPC: bad TCP reclen 0x336160a9 (large)
RPC: bad TCP reclen 0x773ffdff (large)
RPC: bad TCP reclen 0x231b8d5c (non-terminal)
RPC: bad TCP reclen 0x39902af4 (large)
RPC: bad TCP reclen 0x6048d9cc (non-terminal)
RPC: bad TCP reclen 0x212f7e14 (non-terminal)
These errors started to appear when upgrading from 2.6.15 to 2.6.16, and the
problem is still present in the 2.6.17 kernel. So far I have tested these
combinations:
Client - Server - State
------------------------
2.6.15 - 2.6.15 - Works
2.6.15 - 2.6.16 - Errors
2.6.16 - 2.6.16 - Errors
2.6.16 - 2.6.17 - Errors
2.6.17 - 2.6.17 - Errors
From the looks of it, the problem seems to be related to the NFS server
implementation in kernels newer than 2.6.15.
The corrupted writes on the client and the dmesg messages on the server are
easy to reproduce when using Debian on the client computers and running this
script in parallel on more than one client:
#!/bin/bash
while /bin/true; do
    apt-get update
    err=$?
    [[ $err != 0 ]] && echo "Exiting $err" && exit $err

    # you can replace gdb with any other package
    apt-get -y install gdb
    err=$?
    [[ $err != 0 ]] && echo "Exiting $err" && exit $err

    apt-get -y remove gdb
    err=$?
    [[ $err != 0 ]] && echo "Exiting $err" && exit $err

    sleep $(( $RANDOM % 3 ))
done
After a couple of minutes (1-5 min) apt should segfault because one of
its state files got corrupted (/var/lib/dpkg/status or another one).
FYI, the clients DON'T share any files or directories, so a race
condition in apt can't be the cause. For every apt segfault on a
client there is an RPC error message on the server.
I also tried other scripts to reproduce the problem, for example
copying a lot of files (small and big) from one NFS share to another,
but md5sum showed that every copy completed without corruption, so
running apt is the only way to reproduce the bug for now.
I'm here if you need any other info related to this problem.
--
Razvan Gavril
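For reference, a copy-and-verify test of the kind described in the message above could look roughly like the following. This is only a sketch: the script actually used was not posted, so the function names `tree_sums` and `verify_copy` and their directory arguments are assumptions, and it relies on the GNU `md5sum` tool.

```shell
#!/bin/sh
# Sketch only: names and arguments are illustrative, not from the original post.

# Print "checksum  ./relative/path" for every regular file under DIR,
# sorted by path so two trees can be compared line by line.
tree_sums() (
    cd "$1" || exit 1
    find . -type f -exec md5sum {} + | sort -k 2
)

# Copy the contents of SRC into DST, then compare the checksum lists.
# Returns success only if both lists are identical.
verify_copy() (
    src=$1
    dst=$2
    mkdir -p "$dst" || exit 1
    cp -r "$src"/. "$dst"/ || exit 1
    [ "$(tree_sums "$src")" = "$(tree_sums "$dst")" ]
)
```

Calling `verify_copy SRC DST` with two directories on different NFS shares returns a non-zero status if any file differs after the copy. A workload like this mostly issues large sequential writes, so it may simply not exercise the same pattern as apt does.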
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
* Re: Bug starting 2.6.16 - rpc: bad TCP reclen
2006-07-18 11:36 Buf starting 2.6.16 - rpc: bad TCP reclen Razvan Gavril
@ 2006-07-19 14:46 ` Razvan Gavril
2006-07-19 15:22 ` Chuck Lever
0 siblings, 1 reply; 6+ messages in thread
From: Razvan Gavril @ 2006-07-19 14:46 UTC (permalink / raw)
To: Razvan Gavril; +Cc: nfs
Can someone at least confirm this bug and give me an idea where to start
debugging?
Thanks
* Re: Bug starting 2.6.16 - rpc: bad TCP reclen
2006-07-19 14:46 ` Bug " Razvan Gavril
@ 2006-07-19 15:22 ` Chuck Lever
2006-07-19 15:52 ` Razvan Gavril
0 siblings, 1 reply; 6+ messages in thread
From: Chuck Lever @ 2006-07-19 15:22 UTC (permalink / raw)
To: Razvan Gavril; +Cc: nfs
On 7/19/06, Razvan Gavril <razvan.g@plutohome.com> wrote:
> Can someone at least confirm this bug and give me an idea where to start
> debugging?
I haven't seen other reports like this recently. It could be a
hardware problem on your server or in your network (like a bad server
NIC). A network trace would be the way to start tracking this down.
sudo tcpdump -s0 -w /tmp/dump
on your server. Stop the dump when the server starts reporting the
TCP stream problems. Then take a look at the end of the dump with
tethereal.
--
"We who cut mere stones must always be envisioning cathedrals"
-- Quarry worker's creed
* Re: Bug starting 2.6.16 - rpc: bad TCP reclen
2006-07-19 15:22 ` Chuck Lever
@ 2006-07-19 15:52 ` Razvan Gavril
2006-07-20 4:29 ` Neil Brown
0 siblings, 1 reply; 6+ messages in thread
From: Razvan Gavril @ 2006-07-19 15:52 UTC (permalink / raw)
To: Chuck Lever; +Cc: nfs
Chuck Lever wrote:
> I haven't seen other reports like this recently. It could be a
> hardware problem on your server or in your network (like a bad server
> NIC). A network trace would be the way to start tracking this down.
>
> sudo tcpdump -s0 -w /tmp/dump
>
> on your server. Stop the dump when the server starts reporting the
> TCP stream problems. Then take a look at the end of the dump with
> tethereal.
>
A hardware problem is almost impossible since we tested this on a lot of
different computers. I would really like someone to explain what that
message means; it doesn't make much sense to me because I don't know
anything about RPC.
* Re: Bug starting 2.6.16 - rpc: bad TCP reclen
2006-07-19 15:52 ` Razvan Gavril
@ 2006-07-20 4:29 ` Neil Brown
2006-07-20 8:38 ` Bernd Schubert
0 siblings, 1 reply; 6+ messages in thread
From: Neil Brown @ 2006-07-20 4:29 UTC (permalink / raw)
To: Razvan Gavril; +Cc: nfs, Chuck Lever
On Wednesday July 19, razvan.g@plutohome.com wrote:
> Chuck Lever wrote:
>
> > I haven't seen other reports like this recently. It could be a
> > hardware problem on your server or in your network (like a bad server
> > NIC). A network trace would be the way to start tracking this down.
> >
> > sudo tcpdump -s0 -w /tmp/dump
> >
> > on your server. Stop the dump when the server starts reporting the
> > TCP stream problems. Then take a look at the end of the dump with
> > tethereal.
> >
>
> A hardware problem is almost impossible since we tested this on a lot of
> different computers. I would really like someone to explain what that
> message means; it doesn't make much sense to me because I don't know
> anything about RPC.
The message means that data on the TCP connection is corrupted.
With TCP, every RPC message is prefixed by a 4-byte header.
The msb of this number is set to one to show it is the last of a
sequence of fragments (we don't support multiple RPC fragments). The
rest of the number is the number of bytes in the RPC message.
This should be less than about 32000 as we don't support any messages
bigger than this. You are seeing numbers with the msb clear, and
numbers bigger than 32000.

It is very probable that this is not the first corruption in the TCP
stream, just the first that is being reported. I have no idea where
the corruption could be coming from.
NeilBrown
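The header layout described above can be sketched in a few lines of shell. This is an illustration, not the kernel's code: the function name `decode_reclen` and the `MAX_RECLEN` value (standing in for "about 32000") are assumptions, and the input is taken to be the raw 32-bit marker as it appears on the wire.

```shell
#!/bin/sh
# Sketch only: MAX_RECLEN stands in for "about 32000"; the exact
# limit used by the server is not given in this thread.
MAX_RECLEN=32000

# decode_reclen MARKER: split a raw 4-byte record marker (hex or decimal)
# into its last-fragment flag (top bit) and length (low 31 bits),
# then classify it the way the explanation above describes.
decode_reclen() (
    marker=$(( $1 ))
    last=$(( ($marker >> 31) & 1 ))
    len=$(( $marker & 0x7fffffff ))
    if [ "$last" -eq 0 ]; then
        verdict="non-terminal"
    elif [ "$len" -gt "$MAX_RECLEN" ]; then
        verdict="large"
    else
        verdict="ok"
    fi
    printf 'last=%d len=%d %s\n' "$last" "$len" "$verdict"
)
```

For example, `decode_reclen 0x5e9c5bec` (a value from the reported log) is classified as non-terminal because its top bit is clear, while `decode_reclen 0x80000064` is a valid marker for a 100-byte message. All of the "(large)" values in the reported log also have the top bit clear, which would be consistent with the server printing the length after masking off the flag bit; that is a guess, not something confirmed in this thread.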
* Re: Bug starting 2.6.16 - rpc: bad TCP reclen
2006-07-20 4:29 ` Neil Brown
@ 2006-07-20 8:38 ` Bernd Schubert
0 siblings, 0 replies; 6+ messages in thread
From: Bernd Schubert @ 2006-07-20 8:38 UTC (permalink / raw)
To: nfs; +Cc: Neil Brown, Chuck Lever
> The message means that data on the TCP connection is corrupted.
> With TCP, every RPC message is prefixed by a 4-byte header.
> The msb of this number is set to one to show it is the last of a
> sequence of fragments (we don't support multiple RPC fragments). The
> rest of the number is the number of bytes in the RPC message.
> This should be less than about 32000 as we don't support any messages
> bigger than this. You are seeing numbers with the msb clear, and
> numbers bigger than 32000.
>
> It is very probable that this is not the first corruption in the TCP
> stream, just the first that is being reported. I have no idea where
> the corruption could be coming from.
>
Wouldn't TCP errors get corrected by the TCP checksum (retransmit)?
At least that's what my textbook says.
We are occasionally seeing those messages, too. I don't know how to
reproduce them, though.
on our fileserver fileserver: RPC: bad TCP reclen 0x040d0a0d (non-terminal)
on our compute cluster server: RPC: bad TCP reclen 0x2dacc6c9 (large)
Here it's not related to 2.6.16 only. The compute cluster server shows those
messages, and both server and clients are using 2.6.15. Until recently it ran
2.6.11, and we never had those messages at that time.
Our main fileserver is running 2.6.13, most clients still 2.6.11, and some
clients 2.6.16. As far as I remember, those messages began when we updated
the clients from 2.6.11 to 2.6.16.
Thanks,
Bernd
--
Bernd Schubert
PCI / Theoretische Chemie
Universität Heidelberg
INF 229
69120 Heidelberg
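One way to inspect a bad marker is byte by byte, since the record marker is sent in network (big-endian) byte order. The helper below is a small sketch; the name `marker_bytes` is made up for illustration.

```shell
#!/bin/sh
# marker_bytes MARKER: print the four bytes of a 32-bit value,
# most significant byte first, as two-digit hex numbers.
marker_bytes() (
    m=$(( $1 ))
    printf '%02x %02x %02x %02x\n' \
        $(( ($m >> 24) & 0xff )) $(( ($m >> 16) & 0xff )) \
        $(( ($m >> 8) & 0xff )) $(( $m & 0xff ))
)
```

Running `marker_bytes 0x040d0a0d` on the fileserver value above shows that its last three bytes are 0d 0a 0d, the ASCII codes for CR, LF, CR. That could mean the server was reading file or text payload at a point where it expected a record marker, but this reading is speculative and not confirmed anywhere in the thread.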