public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [BUG] NFS with multiple clients connected
@ 2006-07-06 15:15 Razvan Gavril
  2006-07-06 15:27 ` Trond Myklebust
  0 siblings, 1 reply; 3+ messages in thread
From: Razvan Gavril @ 2006-07-06 15:15 UTC (permalink / raw)
  To: linux-kernel

I have a nfs server(kernel-server) which i use as a boot server for 
several other machines on the network. Starting with 2.6.16 i started 
noticing that when having more than one of the clients doing a lot of 
in/out on their mounted nfs shares at list one of then starts to to have 
problems when writing (don't know about reading) files. For example dpkg 
writes strange things it the /var/lib/dpkg/status file even if it worked 
perfectly before the kernel upgrade.

Every time an diskless computer fails to write corectly to the nfs 
filesystem i got this messages on the nfs server (dmesg):

RPC: bad TCP reclen 0x3c390000 (large)
RPC: bad TCP reclen 0x31006261 (non-terminal)
RPC: bad TCP reclen 0x73752070 (non-terminal)
RPC: bad TCP reclen 0x52610100 (non-terminal)

Is very simple to spot this behaver (1 write-error for client / 1 rpc 
message in server's dmesg) because apt-get is always giving an error 
message when the /var/lib/dpkg/status file contains something that it 
shouldn't. An it also can be very ease to reproduce.

I tested with 2.6.17 and got the same error, although when using 2.6.15 
didn't got any errors and the clients worked perfect. Since i'm kind of 
forced to use a kernel version > 2.6.15 i really, really need to solve 
this bug. I would be glad to do it myself but i don't have the knowledge 
to do it so if is anybody that can help i can offer all the information 
that i could and also access to a system so he can track the problem.


--
Razvan Gavril




^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] NFS with multiple clients connected
  2006-07-06 15:15 Razvan Gavril
@ 2006-07-06 15:27 ` Trond Myklebust
  0 siblings, 0 replies; 3+ messages in thread
From: Trond Myklebust @ 2006-07-06 15:27 UTC (permalink / raw)
  To: Razvan Gavril; +Cc: linux-kernel

On Thu, 2006-07-06 at 18:15 +0300, Razvan Gavril wrote:
> I have a nfs server(kernel-server) which i use as a boot server for 
> several other machines on the network. Starting with 2.6.16 i started 
> noticing that when having more than one of the clients doing a lot of 
> in/out on their mounted nfs shares at list one of then starts to to have 
> problems when writing (don't know about reading) files. For example dpkg 
> writes strange things it the /var/lib/dpkg/status file even if it worked 
> perfectly before the kernel upgrade.
> 
> Every time an diskless computer fails to write corectly to the nfs 
> filesystem i got this messages on the nfs server (dmesg):
> 
> RPC: bad TCP reclen 0x3c390000 (large)
> RPC: bad TCP reclen 0x31006261 (non-terminal)
> RPC: bad TCP reclen 0x73752070 (non-terminal)
> RPC: bad TCP reclen 0x52610100 (non-terminal)
> 
> Is very simple to spot this behaver (1 write-error for client / 1 rpc 
> message in server's dmesg) because apt-get is always giving an error 
> message when the /var/lib/dpkg/status file contains something that it 
> shouldn't. An it also can be very ease to reproduce.
> 
> I tested with 2.6.17 and got the same error, although when using 2.6.15 
> didn't got any errors and the clients worked perfect. Since i'm kind of 
> forced to use a kernel version > 2.6.15 i really, really need to solve 
> this bug. I would be glad to do it myself but i don't have the knowledge 
> to do it so if is anybody that can help i can offer all the information 
> that i could and also access to a system so he can track the problem.
> 
> 
> --
> Razvan Gavril

Did the problem start when you upgraded the clients or the server?

Cheers,
  Trond



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] NFS with multiple clients connected
@ 2006-07-06 16:32 Razvan Gavril
  0 siblings, 0 replies; 3+ messages in thread
From: Razvan Gavril @ 2006-07-06 16:32 UTC (permalink / raw)
  To: linux-kernel

Trond Myklebust wrote:

> On Thu, 2006-07-06 at 18:15 +0300, Razvan Gavril wrote:
>> I have a nfs server(kernel-server) which i use as a boot server for 
>> several other machines on the network. Starting with 2.6.16 i started 
>> noticing that when having more than one of the clients doing a lot of 
>> in/out on their mounted nfs shares at list one of then starts to to have 
>> problems when writing (don't know about reading) files. For example dpkg 
>> writes strange things it the /var/lib/dpkg/status file even if it worked 
>> perfectly before the kernel upgrade.
>>
>> Every time an diskless computer fails to write corectly to the nfs 
>> filesystem i got this messages on the nfs server (dmesg):
>>
>> RPC: bad TCP reclen 0x3c390000 (large)
>> RPC: bad TCP reclen 0x31006261 (non-terminal)
>> RPC: bad TCP reclen 0x73752070 (non-terminal)
>> RPC: bad TCP reclen 0x52610100 (non-terminal)
>>
>> Is very simple to spot this behaver (1 write-error for client / 1 rpc 
>> message in server's dmesg) because apt-get is always giving an error 
>> message when the /var/lib/dpkg/status file contains something that it 
>> shouldn't. An it also can be very ease to reproduce.
>>
>> I tested with 2.6.17 and got the same error, although when using 2.6.15 
>> didn't got any errors and the clients worked perfect. Since i'm kind of 
>> forced to use a kernel version > 2.6.15 i really, really need to solve 
>> this bug. I would be glad to do it myself but i don't have the knowledge 
>> to do it so if is anybody that can help i can offer all the information 
>> that i could and also access to a system so he can track the problem.
>>
>>
>> --
>> Razvan Gavril
>
> Did the problem start when you upgraded the clients or the server?
>
> Cheers,
>   Trond
>
>
For now i only tested like this :

Server        Clients      State
-----------------------------------
2.6.15        2.6.15       Works
2.6.16        2.6.16       Fails
2.6.17        2.6.16       Fails
2.6.17        2.6.17       Fails


I did some more testing, i created a script that copies the files from a 
nfs share to another/same nfs share then checks the md5sums of the 
source and destination file, to my big surprise it worked ok. I did the 
test with relative small files (/lib) then with big files (dvds, avi) also.

The problem appears only when 2 diskless computers run apt-get in 
parallel so i suspect something related to bad file descriptors(?) 
,maybe apt is reading / writing / moving it's status files too quick. I 
can reproduce the problem in 20-30 seconds if i let more that 2 diskless 
computers run apt-get in parallel. I need to mention that the two 
diskless computers use completely different shares so there is no race 
condition in apt involved.

--
Razvan Gavril




^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2006-07-06 16:32 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-07-06 16:32 [BUG] NFS with multiple clients connected Razvan Gavril
  -- strict thread matches above, loose matches on Subject: below --
2006-07-06 15:15 Razvan Gavril
2006-07-06 15:27 ` Trond Myklebust

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox