public inbox for linux-nfs@vger.kernel.org
From: Lukas Razik <linux@razik.name>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Jim Rees <rees@umich.edu>,
	Trond Myklebust <Trond.Myklebust@netapp.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [BUG?] Maybe NFS bug since 2.6.37 on SPARC64
Date: Fri, 4 Nov 2011 09:44:18 +0000 (GMT)
Message-ID: <1320399858.11675.YahooMailNeo@web24703.mail.ird.yahoo.com>
In-Reply-To: <39983D1A-70A8-49A1-A4E2-926637780F75@oracle.com>

>>  OK

>>  I've watched wireshark on cluster1 during start up of cluster2 (with linux-2.6.32), which first tries 100003 and then 100005.
>>  The result is that cluster1 doesn't get a datagram for RPC 100003:
>>  http://net.razik.de/linux/T5120/cluster2_NFSROOT_MOUNT.png
>> 
>>  The first ARP request in the screenshot came _after_ the <tag> in this kernel log:
>>  [ 6492.807917] IP-Config: Complete:
>>  [ 6492.807978]      device=eth0, addr=137.226.167.242, mask=255.255.255.224, gw=137.226.167.225,
>>  [ 6492.808227]      host=cluster2, domain=, nis-domain=(none),
>>  [ 6492.808312]      bootserver=255.255.255.255, rootserver=137.226.167.241, rootpath=
>>  [ 6492.808570] Looking up port of RPC 100003/2 on 137.226.167.241
>>  [ 6493.886014] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx
>>  [ 6493.905840] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
>>  <tag>
>>  [ 6527.827055] rpcbind: server 137.226.167.241 not responding, timed out
>>  [ 6527.827237] Root-NFS: Unable to get nfsd port number from server, using default
>>  [ 6527.827353] Looking up port of RPC 100005/1 on 137.226.167.241
>>  [ 6527.842212] VFS: Mounted root (nfs filesystem) on device 0:15.
>> 
>> 
>>  So I don't think that it's a problem of the hardware between the machines.
>>  There's no reason why I wouldn't see ARP requests from cluster2 that were sent _before_ the <tag>, if there were any. I think cluster2 never sends a request for RPC 100003.
>>  What do you think?
> 
> It agrees with our initial assessment that the first RPC request is failing.  
> The RPC client never gets the request through cluster2's network stack 
> because the NIC hasn't re-initialized when the request is sent.
> 
> It looks like your system does a PXE boot, which provides the IP configuration 
> shown above.  But then the kernel resets the NIC.  During that reset, the kernel 
> is attempting to contact the NFS server to mount the root file system.
> 
> We've set up NFSROOT to use UDP so that it will be relatively immune to 
> these initialization order problems.  The RPC client should be retrying the lost 
> request, but apparently it isn't.  What if you added "retrans=10" 
> to cluster2's mount options?  (on the chance that mount option setting would 
> be copied to the rpcbind client's RPC transport...)
> 
> IMO the correct way to fix this is to provide proper serialization in the 
> networking layer so that RPC requests are not even attempted until the NIC is 
> ready to carry traffic.  That may be a pipe dream though.
> 
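
The retry behaviour described in the quoted text (resend a lost UDP request until a reply arrives) can be sketched in user space as below. This is only an illustration, not the kernel's RPC client: the function name `send_with_retrans`, the doubling backoff, and the 60-second cap are assumptions made for the example, and `retrans` is taken to mean "number of resends after the first attempt".

```python
import socket


def send_with_retrans(sock, payload, addr, timeout=1.0, retrans=10):
    """Send `payload` over UDP and wait for a reply, resending on timeout.

    `sock` is a UDP socket (or any object with settimeout/sendto/recvfrom).
    Returns (reply_bytes, attempts_used).
    Raises TimeoutError if the first try and all `retrans` resends time out.
    """
    wait = timeout
    for attempt in range(1, retrans + 2):  # first try + `retrans` resends
        sock.settimeout(wait)
        sock.sendto(payload, addr)
        try:
            reply, _peer = sock.recvfrom(65535)
            return reply, attempt
        except socket.timeout:
            # Assumed policy for this sketch: double the wait, capped at 60 s.
            wait = min(wait * 2, 60.0)
    raise TimeoutError(f"no reply from {addr} after {retrans + 1} attempts")
```

With a loop like this, a request that is lost while the NIC is still re-initializing would simply be sent again once the link is up, which is the behaviour the quoted paragraph expects from the RPC client.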

I thank you three very much for your help! Now I'm sure that I haven't misconfigured anything...
But I don't see a workaround to get the NFSROOT mounted during start up of a kernel >= 2.6.37.
It would be very sad if no one could use these nice Oracle (Sun) machines because of this bug.

Do you know a kernel developer who might try to write a patch for this problem?
Or do you have another idea what I could do?

Regards,
Lukas
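
The serialization idea quoted above (do not attempt any request until the NIC can carry traffic) can be illustrated in user space by polling the link state first. In the log above, the link comes up roughly one second after the first port lookup starts, so a short wait would have been enough in that run. This is a sketch of the ordering only, not a kernel fix: the helper names `read_carrier` and `wait_for_carrier` and the default timings are made up for the example, and it assumes the standard sysfs file `/sys/class/net/<ifname>/carrier`.

```python
import time
from pathlib import Path


def read_carrier(ifname):
    """Return True if sysfs reports link carrier ("1") on `ifname`."""
    try:
        return Path(f"/sys/class/net/{ifname}/carrier").read_text().strip() == "1"
    except OSError:
        # Interface missing, or the read is refused while the device is down.
        return False


def wait_for_carrier(ifname, timeout=30.0, interval=0.1, probe=read_carrier):
    """Poll until `probe(ifname)` is true; return False if `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(ifname):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would check `wait_for_carrier("eth0")` before sending its first request and give up (or retry later) if it returns False.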

