* NFS client hangs under certain circumstances on SMP machine
@ 2006-02-28 20:35 ` Olivier Croquette
0 siblings, 0 replies; 5+ messages in thread
From: Olivier Croquette @ 2006-02-28 20:35 UTC (permalink / raw)
To: LKML; +Cc: nfs
Hi,
I already sent this message to the NFS mailing list, but got no
response there. Maybe you kernel hackers have an idea?
I have had a strange problem for a few months on some Linux clients.
I have a file server accessed through:
- NFS from Linux clients (autofs, but direct mount causes same effect)
- Samba from Windows clients
This has worked like a charm for several years, but as I said, a
strange problem appeared recently:
I have a directory into which I generate code from Windows (\\server\dir).
I can see it under Linux (/mount/dir), where I can access (compile) the
files.
However, when I regenerate the files under Windows (i.e. I overwrite
the old files) and try to compile them again under Linux, "make"
simply hangs in D state:
# ps aux | grep make
user 7177 0.0 0.0 1984 760 pts/1 D+ 16:13 0:00 make -f myMakefile
The load average goes up by one each time I reproduce this test
(apparently, processes in uninterruptible sleep count toward the load
average).
From then on, none of the following actions unblocks the process:
- stopping or restarting the NFS service on the server
- restarting the server
- restarting autofs on the client
- trying to unmount the NFS mount
If I reboot the client, everything goes back to normal until I repeat
the steps above (i.e. overwriting and compiling).
Typically, "shutdown -r" does not work; I have to use "reboot -f".
There is nothing interesting in /var/log on either the server or the
client.
Versions used on the server:
- SuSE 9.3
- kernel-default-2.6.11.4-21.11
- nfs-utils-1.0.7-3
- samba-3.0.13-1.1
- filesystem: reiserfs
On the client:
- SuSE 9.3
- kernel-smp-2.6.11.4-21.10
- nfs-utils-1.0.7-3
- mounts:
automount on /mount type autofs
(rw,fd=4,pgrp=6529,minproto=2,maxproto=4)
serv:/dir on /mount/dir type nfs (rw,addr=*IP*)
- CPU: P4 with hyper threading (2 virtual CPUs)
Note: maxcpus=0 does not make any difference regarding this issue. I
have not yet been able to test with a kernel compiled without SMP at all.
On the following clients with the very same server, network, and mount
tables I could not reproduce the problem:
- SuSE 9.1
- kernel-default-2.6.5-7.202.7
- nfs-utils-1.0.6-103
- CPU: P4 single core
- SuSE 10.0
- Kernel: 2.6.14.3-default (from kernel.org)
- nfs-utils-1.0.7-13
Any ideas?
It seems to me that this is related to SMP. What do you think?
How can I debug this further?
[parent not found: <44127800.4090905@free.fr>]
[parent not found: <20060313090033.GB12896@suse.de>]
* Re: [NFS] NFS client hangs under certain circumstances on SMP machine
[not found] ` <20060313090033.GB12896@suse.de>
@ 2006-03-13 9:51 ` Hans-Peter Jansen
0 siblings, 0 replies; 5+ messages in thread
From: Hans-Peter Jansen @ 2006-03-13 9:51 UTC (permalink / raw)
To: Olaf Kirch; +Cc: Olivier Croquette, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1139 bytes --]

Hi Olaf,

On Monday, 13 March 2006 10:00, Olaf Kirch wrote:
> Hi Olivier,
>
> On Sat, Mar 11, 2006 at 08:10:56AM +0100, Olivier Croquette wrote:
> > I think the corresponding patch is:
> > +nfs-fix-client-hang-due-to-race-condition.patch
> >
> > I could not find a lot of info about it, however.
>
> Do you have a URL for this patch?

I have attached the (modified) patch that fixed it for me. You should be
able to locate the original LKML post. Note that I disabled the first
hunk, since it doesn't apply to 2.6.11.4-21.11 and isn't necessary,
because you already added the locking for this case ;-).

BTW, the NFS locking in later kernel releases is done quite differently
(much more fine-grained, with atomic bitops, but also much more of a
hassle to apply to the kernel in question, so I gave up on that).

> If you point out the exact patch that you think fixed the problem
> on older kernels, we may consider including it in a future
> update.

I will test before and after the patch with the referenced test program
and let you know. Is there a new 9.3 kernel release already scheduled?

Pete

[-- Attachment #2: NFS-fix-client-hang-due-to-race-condition.diff --]
[-- Type: text/x-diff, Size: 7268 bytes --]

From njw@osdl.org Wed Jul 6 23:27:44 2005
Return-Path: <linux-kernel-owner+hpj=40urpla.net-S262527AbVGFVan@vger.kernel.org>
Received: from mail.lisa.loc ([unix socket]) by tyrex (Cyrus v2.2.12) with LMTPA; Wed, 06 Jul 2005 23:36:38 +0200
X-Sieve: CMU Sieve 2.2
Received: from localhost (localhost [127.0.0.1]) by mail.lisa.loc (Postfix) with ESMTP id B2BF12001B20 for <hp@localhost.lisa.loc>; Wed, 6 Jul 2005 23:36:38 +0200 (CEST)
Delivery-Date: Wed, 06 Jul 2005 23:35:33 +0200
Received: from pop.kundenserver.de [212.227.15.181] by localhost with POP3 (fetchmail-6.2.5) for hp@localhost (single-drop); Wed, 06 Jul 2005 23:36:25 +0200 (CEST)
Received: from [12.107.209.244] (helo=vger.kernel.org) by mxeu8.kundenserver.de with ESMTP (Nemesis), id 0MKt1w-1DqHYK2XiU-00069O for hpj@urpla.net; Wed, 06 Jul 2005 23:35:32 +0200
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S262527AbVGFVan (ORCPT <rfc822;hpj@urpla.net>); Wed, 6 Jul 2005 17:30:43 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S262537AbVGFV1b (ORCPT <rfc822;linux-kernel-outgoing>); Wed, 6 Jul 2005 17:27:31 -0400
Received: from smtp.osdl.org ([65.172.181.4]:60624 "EHLO smtp.osdl.org") by vger.kernel.org with ESMTP id S262522AbVGFVZM (ORCPT <rfc822;linux-kernel@vger.kernel.org>); Wed, 6 Jul 2005 17:25:12 -0400
Received: from shell0.pdx.osdl.net (fw.osdl.org [65.172.181.6]) by smtp.osdl.org (8.12.8/8.12.8) with ESMTP id j66LOtjA000774 (version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO); Wed, 6 Jul 2005 14:24:55 -0700
Received: from nwilson.pdx.osdl.net (nwilson.pdx.osdl.net [10.8.0.89]) by shell0.pdx.osdl.net (8.13.1/8.11.6) with ESMTP id j66LOtda024497; Wed, 6 Jul 2005 14:24:55 -0700
Received: from nwilson.pdx.osdl.net (localhost [127.0.0.1]) by nwilson.pdx.osdl.net (8.13.3/8.13.1) with ESMTP id j66LRjYx013441; Wed, 6 Jul 2005 14:27:45 -0700
Received: (from njw@localhost) by nwilson.pdx.osdl.net (8.13.3/8.13.3/Submit) id j66LRin7013430; Wed, 6 Jul 2005 14:27:44 -0700
Date: Wed, 6 Jul 2005 14:27:44 -0700
From: Nick Wilson <njw@osdl.org>
To: trond.myklebust@fys.uio.no
Cc: akpm@osdl.org, linux-kernel@vger.kernel.org, nfs@lists.sourceforge.net
Subject: [PATCH] NFS: fix client hang due to race condition
Message-ID: <20050706212744.GC20698@njw.pdx.osdl.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.8i
X-MIMEDefang-Filter: osdl$Revision: 1.111 $
X-Scanned-By: MIMEDefang 2.36
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
Envelope-To: hpj@urpla.net
X-Virus-Scanned: amavisd-new at lisa.loc
X-UID: 57427
X-Length: 7015
Status: RO
X-Status: OC
X-KMail-EncryptionState:
X-KMail-SignatureState:
X-KMail-MDN-Sent:

The flags field in struct nfs_inode is protected by the BKL. The
following two code paths (there may be more, but my test program only
hits these two) modify the flags without obtaining the lock:

	nfs_end_data_update
	nfs_release
	nfs_file_release
	__fput
	fput
	filp_close
	sys_close
	syscall_call

	nfs_revalidate_mapping
	nfs_file_write
	do_sync_write
	vfs_write
	sys_write
	syscall_call

Running multiple instances of a simple program [1] that opens, writes
to, and closes NFS mounted files eventually results in the programs
hanging on an SMP system (see kernel .config [3]). I've been testing
this with 100 instances of the program:

$ ./breaknfs 100 &

Usually within 10 minutes, all instances of breaknfs will hang. They
disappear from the output of 'top' and there is no NFS activity between
the client and server.

/proc/*/wchan shows 22 instances of breaknfs are waiting on
nfs_wait_on_inode, and 78 on .text.lock.namei

echo t > /proc/sysrq-trigger output [2] shows 22 instances of breaknfs
similar to this...:

breaknfs S 00100100 5060 5530 5523 5531 5529 (NOTLB)
de0d1e24 00000086 c01178e0 00100100 de0d1de4 00000000 00000000 de0d1e14
de0d1dec c0309513 de0d1e0c c0127c7e 00000000 dfaff020 c140e400 000004c3
b37f50b5 0000003a c140e8c0 de7815b0 de7816d8 dbb5963c dbb59650 de0d0000
Call Trace:
 [<c01eac01>] nfs_wait_on_inode+0x1b1/0x1c0
 [<c01eb2ac>] __nfs_revalidate_inode+0x2cc/0x340
 [<c01e8b1c>] nfs_file_flush+0x8c/0xc0
 [<c0159366>] filp_close+0x56/0x70
 [<c01593e9>] sys_close+0x69/0x90
 [<c0103039>] syscall_call+0x7/0xb

... and 78 similar to this:

breaknfs D 00000310 5060 5523 5466 5524 (NOTLB)
ddcafebc 00000082 c0369810 00000310 ddcaff58 ddcafe90 db975690 00000000
ddcafee0 ddcafe94 c0170a75 ddcaff58 00000000 dfaff020 c140e400 00000178
b2b3096d 0000003a c140e8c0 df839550 df839678 dbb59e70 dbb59e78 00000286
Call Trace:
 [<c03075b3>] __down+0x83/0xe0
 [<c030772e>] __down_failed+0xa/0x10
 [<c016b295>] .text.lock.namei+0xaa/0x1e5
 [<c0158e5d>] filp_open+0x2d/0x50
 [<c01592ad>] sys_open+0x4d/0x80
 [<c0103039>] syscall_call+0x7/0xb

NFS mount options from /proc/mounts:
rw,v3,rsize=32768,wsize=32768,hard,intr,udp,lock,addr=njw

I've reproduced this bug on 2.6.11.10, 2.6.12-mm2, and 2.6.13-rc2.

With my patch against 2.6.13-rc2 below, I ran 100 instances of breaknfs
with this patch for 14 hours and I was unable to get the client to hang.

Thanks,
Nick Wilson

[1] http://developer.osdl.org/njw/nfs-bug/breaknfs.c
[2] http://developer.osdl.org/njw/nfs-bug/alt-sysrq-t.txt
[3] http://developer.osdl.org/njw/nfs-bug/kernel-config

The flags field in struct nfs_inode is protected by the BKL. This patch
fixes a couple places where the lock is not obtained before changing the
flags.

Signed-off-by: Nick Wilson <njw@osdl.org>

won't apply and isn't necessary due to surrounding BKL:
#@@ -1118,7 +1118,9 @@ void nfs_revalidate_mapping(struct inode
# 			nfs_wb_all(inode);
# 		}
# 		invalidate_inode_pages2(mapping);
#+		lock_kernel();
# 		nfsi->flags &= ~NFS_INO_INVALID_DATA;
#+		unlock_kernel();
# 		if (S_ISDIR(inode->i_mode)) {
# 			memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
# 			/* This ensures we revalidate child dentries */
---

 inode.c | 4 ++++
 1 files changed, 4 insertions(+)

--- linux.orig/fs/nfs/inode.c 2005-07-06 11:08:27.000000000 -0700
+++ linux/fs/nfs/inode.c 2005-07-06 11:20:19.000000000 -0700
@@ -1153,10 +1155,12 @@ void nfs_end_data_update(struct inode *i
 	if (!nfs_have_delegation(inode, FMODE_READ)) {
 		/* Mark the attribute cache for revalidation */
+		lock_kernel();
 		nfsi->flags |= NFS_INO_INVALID_ATTR;
 		/* Directories and symlinks: invalidate page cache too */
 		if (S_ISDIR(inode->i_mode) || S_ISLNK(inode->i_mode))
 			nfsi->flags |= NFS_INO_INVALID_DATA;
+		unlock_kernel();
 	}
 	nfsi->cache_change_attribute ++;
 	atomic_dec(&nfsi->data_updates);
_
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[parent not found: <5LjNF-1Q2-7@gated-at.bofh.it>]
* Re: NFS client hangs under certain circumstances on SMP machine
[not found] <5LjNF-1Q2-7@gated-at.bofh.it>
@ 2006-03-11 8:20 ` Olivier Croquette
2006-03-12 4:10 ` Trond Myklebust
0 siblings, 1 reply; 5+ messages in thread
From: Olivier Croquette @ 2006-03-11 8:20 UTC (permalink / raw)
To: linux-kernel

Olivier Croquette wrote:
> However, when I regenerate the file under Windows again (ie. I overwrite
> the old files), and I try to compile the files again under Linux, "make"
> hangs simply in D state:
>
> # ps aux | grep make
> user 7177 0.0 0.0 1984 760 pts/1 D+ 16:13 0:00 make -f myMakefile

I have upgraded to kernel 2.6.15 and have not been able to reproduce the
problem since.

Is that an effect of nfs-fix-client-hang-due-to-race-condition.patch?

* Re: NFS client hangs under certain circumstances on SMP machine
2006-03-11 8:20 ` Olivier Croquette
@ 2006-03-12 4:10 ` Trond Myklebust
0 siblings, 0 replies; 5+ messages in thread
From: Trond Myklebust @ 2006-03-12 4:10 UTC (permalink / raw)
To: Olivier Croquette; +Cc: linux-kernel

On Sat, 2006-03-11 at 09:20 +0100, Olivier Croquette wrote:
> Olivier Croquette wrote:
>
> > However, when I regenerate the file under Windows again (ie. I overwrite
> > the old files), and I try to compile the files again under Linux, "make"
> > hangs simply in D state:
> >
> > # ps aux | grep make
> > user 7177 0.0 0.0 1984 760 pts/1 D+ 16:13 0:00 make -f myMakefile
>
> I have upgraded to kernel 2.6.15 and it could not reproduce the problem
> since.
>
> Is it an effect of nfs-fix-client-hang-due-to-race-condition.patch?

Have you tried backing that patch out to see?

Cheers,
Trond
