From: Stephan Koledin <skoledin@neolinear.com>
To: nfs@lists.sourceforge.net
Subject: Re: NFS lockups with 2.4.18
Date: Tue, 30 Sep 2003 19:08:40 -0400
Message-ID: <3F7A0CF8.4000407@neolinear.com>
In-Reply-To: <3F745FA5.5010907@neolinear.com>
[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]
Hello all.
I think we're finally making some progress with our problem. We've
isolated the application and the system calls causing the lockups and I
performed a bunch of testing today to try and define the causes of the
lockups a little better. The recent 2.4.22 kernel from debian testing is
affected the same as all other kernels we tried. I suppose I could be
missing a more recent patch that fixes things - if so, please let me know.
Here is what we've found out:
- The problem is triggered by a distributed build process running on
multiple Solaris 8 machines. The Sun compiler that we're using issues
F_SETLKW calls to manage its lock files during the build process. Since
the build is distributed across a number of machines, we are seeing
SETLKW attempts on the same file from several machines at the same time.
- We now have some code (attached) and various scripts to reproduce the
problem. It requires some coordination of requests from multiple
machines, but is easily reproducible, given a proper environment. I can
currently trigger the lockup within a couple seconds, even on a machine
with a substantial number of nfsd processes running.
- I am unable to reproduce the problem with Linux clients. Only Solaris
clients appear able to lock up the NFS server, even using the exact same
test code and methodology.
- lockd enters a "D" state, with a WCHAN of "down" when multiple
machines try to get a write lock (using F_SETLKW) on the same file(s).
There must be at least two machines attempting to lock the same file,
and the total number of processes/machines either with locks or waiting
for any locks must be approximately 3-5x the number of nfsd processes.
It really looks like it's triggered when there are 4x outstanding locks
or lock requests, but it's a bit difficult to track given the
distributed nature of the problem.
- For example, if I'm running 4 nfsd processes, the problem is triggered
with around 16 lock requests from different machines. When running a
more typical setup of 32 nfsd, triggering the problem requires a
substantial amount of computing resources. We do see the problem with
fewer than 128 machines requesting locks, but we cannot reproduce the
problem with only one or two machines.
- Once lockd enters a "D" state, the nfsd processes also begin hitting a
"D" state, and no more NFS requests are possible, even ones that do not
require locking.
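
The "D" state described above can be checked from a program as well as from ps. The following is a minimal sketch, assuming a Linux server where /proc/<pid>/stat has the usual "pid (comm) state ..." layout. The function names parse_stat_line and pid_state and the PROCSTATE_MAIN build flag are made up for illustration and are not part of the attached test code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the command name and the one-letter state from a line of
 * /proc/<pid>/stat, which has the form "pid (comm) state ...".
 * Returns 0 on success, -1 if the line does not have that shape.
 */
int parse_stat_line(const char *line, char *comm, size_t commlen, char *state)
{
    const char *lp = strchr(line, '(');
    const char *rp = strrchr(line, ')');  /* comm may itself contain ')' */
    size_t n;

    if (commlen == 0 || lp == NULL || rp == NULL || rp < lp)
        return -1;
    if (rp[1] != ' ' || rp[2] == '\0')
        return -1;
    n = (size_t)(rp - lp - 1);
    if (n >= commlen)
        n = commlen - 1;                  /* truncate long names */
    memcpy(comm, lp + 1, n);
    comm[n] = '\0';
    *state = rp[2];                       /* 'D' is uninterruptible sleep */
    return 0;
}

/* Read /proc/<pid>/stat for one process.  Returns 0 on success, -1 on failure. */
int pid_state(long pid, char *comm, size_t commlen, char *state)
{
    char path[64], line[512];
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%ld/stat", pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(line, sizeof line, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return parse_stat_line(line, comm, commlen, state);
}

#ifdef PROCSTATE_MAIN  /* hypothetical flag: cc -DPROCSTATE_MAIN procstate.c */
int main(int argc, char **argv)
{
    int i;

    for (i = 1; i < argc; i++) {
        char comm[64], state;
        long pid = strtol(argv[i], NULL, 10);

        if (pid_state(pid, comm, sizeof comm, &state) == 0)
            printf("%ld %s %c\n", pid, comm, state);
        else
            fprintf(stderr, "%s: cannot read state\n", argv[i]);
    }
    return 0;
}
#endif
```

Built with the flag and given the pids of lockd and the nfsd processes, this prints one line per pid with the state letter. It only reports the state and does not show the WCHAN.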
Now that the problem is easily reproducible, I would love to test any
suggested solutions/patches. I wish I were more familiar with the
lockd/nfs code, but unfortunately I don't have much kernel coding
experience. Please let me know if there's anything else I can do (or
data I can provide) to help get to the bottom of this issue.
Thanks in advance for any help with this problem. Although it does seem
like a rare issue, it is causing us a great deal of pain in our environment.
-Stephan
--
Stephan B Koledin
Network Systems Developer
http://neolinear.com/
[-- Attachment #2: setlkw.c --]
[-- Type: text/x-csrc, Size: 2043 bytes --]
/*
 * To compile run: gcc setlkw.c
 *             or: cc setlkw.c
 *
 * To lock a file run: ./a.out <filename>
 *
 */
#include <stdio.h>
#include <stdlib.h>   /* getenv(), exit() */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>    /* declares errno; no extern declaration is needed */
/* Return nonzero when the LOCKING_DEBUG environment variable is set. */
static int locking_debug_p(void)
{
    return (getenv("LOCKING_DEBUG") != NULL);
}
/*
 * Open the file (creating and truncating it) and block until a write
 * lock on the whole file is granted.  Returns the fd, or -1 on failure.
 */
int __lock_file(char *filename)
{
    int o_flags;
    int fd;
    struct flock lockstruct;

    o_flags = O_RDWR|O_CREAT|O_TRUNC;
    errno = 0;
    /* Open the file */
    if ((fd = open(filename, o_flags, 0666)) == -1) {
        perror("error opening");
        return -1;
    }
    lockstruct.l_type = F_WRLCK;
    lockstruct.l_whence = SEEK_SET;
    lockstruct.l_start = 0L;
    lockstruct.l_len = 0L;   /* 0 means "to end of file" */
    /* Try and set the lock */
    if (fcntl(fd, F_SETLKW, &lockstruct) == -1) {
        perror("error locking");
        close(fd);
        return -1;
    }
    return fd;
}
/*
 * Release the lock held on locknum and close the descriptor.
 * Returns 0 on success, -1 if either step failed.
 */
int __unlock_file(int locknum)
{
    struct flock lockstruct;
    int status = 0;

    if (locking_debug_p())
        fprintf(stderr, "Unlocking fd %d ...\n", locknum);
    errno = 0;
    lockstruct.l_type = F_UNLCK;
    lockstruct.l_whence = SEEK_SET;
    lockstruct.l_start = 0L;
    lockstruct.l_len = 0L;
    if ((fcntl(locknum, F_SETLKW, &lockstruct)) == -1) {
        perror("error unlocking");
        status = -1;
    }
    if ((close(locknum)) == -1) {
        perror("error closing");
        status = -1;
    }
    return status;
}
void print_usage(const char* progname)
{
    printf("\nUsage:\n %s <file_to_lock> \n", progname);
    printf(" where <file_to_lock> is the file to lock (created if missing, truncated if present)\n");
}
int main(int argc, char **argv)
{
    int result;

    /* Did the user specify the correct number of arguments? */
    if (argc < 2) {
        printf("Error: not enough arguments specified.\n");
        print_usage(argv[0]);
        return 1;
    }
    if (argc > 2) {
        printf("Warning: ignoring all arguments other than the first one.\n");
    }
    result = __lock_file(argv[1]);
    if (result == -1) {
        /* Nothing was locked, so there is nothing to hold or release */
        return 1;
    }
    printf("#");
    fflush(stdout);
    /* We sleep here just to give enough time to start other competing
     * lock attempts on other machines
     */
    sleep(60);
    result = __unlock_file(result);
    printf("!");
    return (result == 0) ? 0 : 1;
}
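
The program above takes one lock per invocation, so the number of waiters per client depends on how many copies are started. As a minimal sketch of one way to raise that count from a single client, the wrapper below forks several processes that all contend for the same whole-file write lock. It is an illustration, not the scripts used in our testing, and the names lock_hold_release and run_contenders and the CONTEND_MAIN build flag are hypothetical. Unlike the attachment it does not truncate the file. Running it on one machine alone is not expected to trigger the lockup, since at least two machines are needed.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Block for a whole-file write lock, hold it, release it.  0 on success. */
static int lock_hold_release(const char *path, unsigned hold_secs)
{
    struct flock fl;
    int fd = open(path, O_RDWR | O_CREAT, 0666);

    if (fd == -1) {
        perror("open");
        return -1;
    }
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                       /* 0 means "to end of file" */
    if (fcntl(fd, F_SETLKW, &fl) == -1) {
        perror("fcntl(F_SETLKW)");
        close(fd);
        return -1;
    }
    sleep(hold_secs);                   /* keep the other contenders waiting */
    fl.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLKW, &fl) == -1) {
        perror("fcntl(F_UNLCK)");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

/*
 * Fork nprocs children that contend for the lock on path and wait for
 * all of them.  Returns how many children locked and unlocked cleanly.
 */
int run_contenders(const char *path, int nprocs, unsigned hold_secs)
{
    int i, status, ok = 0;

    for (i = 0; i < nprocs; i++) {
        pid_t pid = fork();

        if (pid == -1) {
            perror("fork");
            break;
        }
        if (pid == 0)                   /* child: contend, then exit */
            _exit(lock_hold_release(path, hold_secs) == 0 ? 0 : 1);
    }
    while (wait(&status) > 0) {
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}

#ifdef CONTEND_MAIN  /* hypothetical flag: cc -DCONTEND_MAIN contend.c */
int main(int argc, char **argv)
{
    int n;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s <file_to_lock> <nprocs>\n", argv[0]);
        return 1;
    }
    n = atoi(argv[2]);
    return run_contenders(argv[1], n, 60) == n ? 0 : 1;
}
#endif
```

Because fcntl locks belong to the process, each forked child is a separate lock owner, so the children queue behind one another just as separate invocations of the attached program would.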