Re: NFS stops responding

public inbox for linux-nfs@vger.kernel.org

From: Jason Holmes <jholmes@psu.edu>
To: nfs@lists.sourceforge.net
Subject: Re: NFS stops responding
Date: Fri, 01 Oct 2004 11:40:38 -0400
Message-ID: <415D7A76.7000404@psu.edu>
In-Reply-To: <415C5A33.50202@psu.edu>

FYI, I'm beginning to suspect that this is a problem more with the newer 
RedHat kernels than anything else.  I've only had one vanilla kernel NFS 
lockup since I moved the NFS servers to 2.6.8.1 (3 days) and that 
happened right after I did the move, so it could be coincidental.  Back 
when the servers ran RedHat kernels, the RedHat kernel clients never 
locked up whereas the vanilla clients did.  Yesterday I had 4 NFS 
lockups on the same machine running the RedHat 2.4.21-20.ELsmp kernel 
(the one that generated the trace below), but it hasn't locked up since 
I moved it to 2.6.8.1.  I guess I'll know for sure if my lockups don't 
come back for a week or so.
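
For anyone else watching for the same hangs: a minimal sketch, not taken 
from any of the machines in this thread, of one way to spot processes 
stuck in D state without waiting on sysrq.  It assumes a Linux /proc 
layout; the function names are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lpar = line.index("(")
    rpar = line.rindex(")")
    pid = int(line[:lpar].strip())
    comm = line[lpar + 1:rpar]
    state = line[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep (D)."""
    hung = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or stat was unreadable) mid-scan
        if state == "D":
            hung.append((pid, comm))
    return hung


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

A process that shows up here on every run, several minutes apart, is a 
reasonable candidate for a hung NFS read; a single sighting is not, since 
ordinary disk I/O also passes through D state briefly.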

Thanks,

--
Jason Holmes

Jason Holmes wrote:
> Here's a 'sysrq-T' listing for a few hung processes.  Unfortunately, 
> this was on a 2.4.21-20.ELsmp RedHat kernel and not a vanilla kernel 
> (I'll send one of those along as soon as I can get one):
> 
> xauth         D 00000100e2d30370  1312  9600   9599 (NOTLB)
> 
> Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
>        [<ffffffff801420ed>]{___wait_on_page+285}
>        [<ffffffff8014316a>]{do_generic_file_read+1258}
>        [<ffffffff80143770>]{file_read_actor+0}
>        [<ffffffff801438c5>]{generic_file_new_read+165}
>        [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
>        [<ffffffff8015dfd2>]{sys_read+178}
>        [<ffffffff80110177>]{system_call+119}
> 
> bash          D 00000100e2bef130   824  9614      1          9666  9583 (NOTLB)
> 
> Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
>        [<ffffffff801420ed>]{___wait_on_page+285}
>        [<ffffffff8014316a>]{do_generic_file_read+1258}
>        [<ffffffff80143770>]{file_read_actor+0}
>        [<ffffffff801438c5>]{generic_file_new_read+165}
>        [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
>        [<ffffffff8015dfd2>]{sys_read+178}
>        [<ffffffff80110177>]{system_call+119}
> 
> bash          D 00000100db051e28     0  9666      1          9718  9614 (NOTLB)
> 
> Call Trace: [<ffffffff80120d8a>]{io_schedule+42}
>        [<ffffffff80142466>]{__lock_page+294}
>        [<ffffffff801430ca>]{do_generic_file_read+1098}
>        [<ffffffff80143770>]{file_read_actor+0}
>        [<ffffffff801438c5>]{generic_file_new_read+165}
>        [<ffffffffa02ec3a9>]{:nfs:nfs_file_read+217}
>        [<ffffffff8015dfd2>]{sys_read+178}
>        [<ffffffff80110177>]{system_call+119}
> 
> Thanks,
> 
> -- 
> Jason Holmes
> 
> Jason Holmes wrote:
> 
>> I have had similar problems with NFS recently and have yet to figure 
>> out a pattern.  They started around the 2.4.27 time frame, but that 
>> could just be coincidental.  I have 8 NFS servers and several hundred 
>> clients.  Every few days, one of the clients will start hanging 
>> connections to one of its mounts (all of the processes accessing that 
>> mount go into D state and never return - the machine has to be 
>> forcefully rebooted to get rid of them).  While one of the client 
>> machines is hanging on a mount, the other client machines are fine.  
>> Access to the other mounts is fine on the hanging machine.  The 
>> server is fine when this happens and I see no odd messages in the logs.
>>
>> The servers were originally running RedHat Enterprise 3 kernels - I 
>> have also tried 2.6.8.1 and have had the same problem.  Clients have 
>> been 2.4.27, 2.6.8.1, and the latest RedHat kernels.  The network is a 
>> simple private one and there is no packet loss.  I've tried both UDP 
>> and TCP v3 hard mounts.  Exports are synchronous.
>>
>> I'm currently hoping that one of my machines with sysrq enabled will 
>> hang to see if I can possibly get some information out of that that 
>> will shed some light on the situation.  I'd be happy to entertain any 
>> other debugging suggestions on this.  Unfortunately, I haven't been 
>> able to figure out how to force the problem to happen, so I'm at the 
>> mercy of waiting for it to just pop up.
>>
>> Thanks,
>>
>> -- 
>> Jason Holmes
>>
>> Douglas Furlong wrote:
>>
>>> Good morning all.
>>>
>>> Considering the exceedingly fast and speedy response I got yesterday
>>> with regards to my problem accessing edirectory.co.uk I thought I would
>>> try my luck with an NFS problem.
>>>
>>> All our unix systems at work have their home directory mounted via NFS
>>> to allow hot seating (not that they ever use it!).
>>>
>>> I have just recently upgraded to Fedora Core 2, running the most recent
>>> kernel.
>>>
>>> All the workstations are running Fedora Core 2, with the second from
>>> last kernel (due to CIFS/SMB problems in the latest one).
>>>
>>> Unfortunately there are two users whose connection to the NFS server is
>>> dropped and does not seem to want to reconnect. To date I have:
>>>
>>> 1) Replaced both of their PCs
>>> 2) Replaced the switch
>>> 3) Will replace the network cables tomorrow
>>> 4) Tried numerous versions of the kernel, including the testing
>>> kernel from rawhide.
>>> 5) Tried variations in the timeo=x value to see if that will help.
>>>
>>> These lockups vary in time between 30 minutes and 5 hours. Network
>>> connections are not affected by this lock up, I am able to ssh on to the
>>> box (that's how I collected the tcpdump data).
>>>
>>> I also have two Windows PCs on this switch and things appear to be
>>> fine.
>>>
>>> I have 7 or 8 other systems running Linux on the network and NFS
>>> communication is not affected.
>>>
>>> I have increased the number of servers on the NFS server from 8 to 16. I
>>> did this by editing /etc/init.d/nfs (don't think this is of any help).
>>>
>>> I took some tcpdump info on both the client and the server to try and
>>> see if I can work out what is going on. Initially it is not providing me
>>> with much information (but loads of data).
>>>
>>> I have attached two files, one from the client and one from the server.
>>> Main reason for attaching them is due to length of data. I had wanted to
>>> attach them as plain text to simplify access, but at 100k it's a bit too
>>> large.
>>> I didn't want to cut them down too much just in case I removed some
>>> pertinent information :(
>>
>>
>>
>>
>>
>> -------------------------------------------------------
>> This SF.net email is sponsored by: IT Product Guide on ITManagersJournal
>> Use IT products in your business? Tell us what you think of them. Give us
>> Your Opinions, Get Free ThinkGeek Gift Certificates! Click to find out 
>> more
>> http://productguide.itmanagersjournal.com/guidepromo.tmpl
>> _______________________________________________
>> NFS maillist  -  NFS@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nfs
>>
> 
> 
> 





Thread overview: 12+ messages
2004-09-30 13:39 NFS stops responding Douglas Furlong
2004-09-30 16:06 ` Jason Holmes
2004-09-30 19:10   ` Jason Holmes
2004-10-01 15:40     ` Jason Holmes [this message]
2004-10-07 10:56       ` Douglas Furlong
2004-10-13 15:07         ` Jason Holmes
