From: Stephan Koledin <skoledin@neolinear.com>
To: nfs@lists.sourceforge.net
Subject: Re: NFS lockups with 2.4.18
Date: Tue, 30 Sep 2003 19:08:40 -0400
Message-ID: <3F7A0CF8.4000407@neolinear.com>
In-Reply-To: <3F745FA5.5010907@neolinear.com>

[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]

Hello all.

I think we're finally making some progress with our problem. We've
isolated the application and the system calls causing the lockups, and I
ran a series of tests today to pin down the causes of the lockups a
little better. The recent 2.4.22 kernel from Debian testing is affected
in the same way as every other kernel we tried. I suppose I could be
missing a more recent patch that fixes things; if so, please let me know.

Here is what we've found out:

- The problem is triggered by a distributed build process running on
multiple Solaris 8 machines. The Sun compiler that we're using issues
F_SETLKW calls to manage its lock files during the build process. Since
the build is distributed across a number of machines, we are seeing
SETLKW attempts on the same file from several machines at the same time.

- We now have some code (attached) and various scripts to reproduce the
problem. It requires some coordination of requests from multiple
machines, but is easily reproducible given a proper environment. I can
currently trigger the lockup within a couple of seconds, even on a
machine with a substantial number of nfsd processes running.

- I am unable to reproduce the problem with Linux clients. Only Solaris 
clients appear able to lock up the NFS server, even using the exact same 
test code and methodology.

- lockd enters a "D" state, with a WCHAN of "down", when multiple
machines try to get a write lock (using F_SETLKW) on the same file(s).
There must be at least two machines attempting to lock the same file,
and the total number of processes/machines either holding locks or
waiting for locks must be approximately 3-5x the number of nfsd
processes. It really looks like the lockup is triggered once outstanding
locks plus lock requests reach 4x the number of nfsd processes, but that
is a bit difficult to track given the distributed nature of the problem.

- For example, if I'm running 4 nfsd processes, the problem is triggered
with around 16 lock requests from different machines. With a more
typical setup of 32 nfsd processes, triggering the problem requires a
substantial amount of computing resources. We do see the problem with
fewer than 128 machines requesting locks, but we cannot reproduce it
with only one or two machines.

- Once lockd enters a "D" state, the nfsd processes also begin hitting a 
"D" state, and no more NFS requests are possible, even ones that do not 
require locking.
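
For anyone who wants to watch for the "D" state described above, here is
a minimal sketch, assuming a Linux server with /proc mounted and the
usual "pid (comm) state ..." layout of /proc/<pid>/stat. The names
stat_state and pid_state are hypothetical and are not part of the
attached reproducer.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the state letter from one /proc/<pid>/stat line, or '?' if the
 * line does not look like "pid (comm) state ...". The command name may
 * itself contain ')', so we search for the last one.
 */
char stat_state(const char *line)
{
    const char *p = strrchr(line, ')');

    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/*
 * Read /proc/<pid>/stat and return the process state letter
 * ('D' means uninterruptible sleep), or '?' if it cannot be read.
 */
char pid_state(long pid)
{
    char path[64];
    char line[512];
    char state = '?';
    FILE *fp;

    snprintf(path, sizeof path, "/proc/%ld/stat", pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return '?';
    if (fgets(line, sizeof line, fp) != NULL)
        state = stat_state(line);
    fclose(fp);
    return state;
}
```

Calling pid_state() on the pids of lockd and each nfsd in a loop while
the lock requests ramp up should show whether lockd reaches 'D' before
the nfsd processes do.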

Now that the problem is easily reproducible, I would love to test any
suggested solutions/patches. I wish I were more familiar with the
lockd/nfs code, but unfortunately I don't have much kernel coding
experience. Please let me know if there's anything else I can do (or
data I can provide) to help get to the bottom of this issue.

Thanks in advance for any help with this problem. Although it does seem
like a rare issue, it is causing us a great deal of pain in our
environment.

-Stephan

-- 
Stephan B Koledin
Network Systems Developer
http://neolinear.com/

[-- Attachment #2: setlkw.c --]
[-- Type: text/x-csrc, Size: 2043 bytes --]

/*
 * To compile run: gcc setlkw.c
 *	      or:   cc setlkw.c
 *
 * To lock a file run: a.out <filename>
 *
 */

#include <stdio.h>
#include <stdlib.h>	/* exit(), getenv() */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>


/* Debug output is enabled when LOCKING_DEBUG is set in the environment. */
static int locking_debug_p(void)
{
	return (getenv("LOCKING_DEBUG") != NULL);
}

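/*
 * Open (creating or truncating) the file and block until a whole-file
 * write lock is granted. Returns the locked descriptor, or -1 on error.
 */
int __lock_file(char *filename)
{
	int o_flags;
	int fd;
	struct flock lockstruct;

	o_flags = O_RDWR|O_CREAT|O_TRUNC;

	/* Open the file */
	if ((fd = open(filename, o_flags, 0666)) == -1) {
		perror("error opening");
		return -1;
	}

	/* Write lock from offset 0; an l_len of 0 means "to end of file" */
	lockstruct.l_type = F_WRLCK;
	lockstruct.l_whence = SEEK_SET;
	lockstruct.l_start = 0L;
	lockstruct.l_len = 0L;

	/* Try to set the lock, waiting if another process holds it */
	if (fcntl(fd, F_SETLKW, &lockstruct) == -1) {
		perror("error locking");
		close(fd);
		return -1;
	}

	return fd;
}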


/*
 * Release the whole-file lock held on the descriptor and close it.
 * Returns 0 on success, -1 if either step failed.
 */
int __unlock_file(int locknum)
{
	struct flock lockstruct;
	int status = 0;

	if (locking_debug_p())
		fprintf(stderr, "Unlocking fd %d ...\n", locknum);

	lockstruct.l_type = F_UNLCK;
	lockstruct.l_whence = SEEK_SET;
	lockstruct.l_start = 0L;
	lockstruct.l_len = 0L;

	if (fcntl(locknum, F_SETLKW, &lockstruct) == -1) {
		perror("error unlocking");
		status = -1;
	}

	if (close(locknum) == -1) {
		perror("error closing");
		status = -1;
	}

	return status;
}


void print_usage(const char* progname)
{
    printf("\nUsage:\n	 %s  <file_to_lock> \n", progname);
    printf("   where  <file_to_lock> is the name of the file to lock\n");
    printf("   (created if missing, truncated if it already exists)\n");
}


int main(int argc, char **argv)
{
  int result;

  /* Did the user specify the correct number of arguments? */
  if (argc < 2) {
    printf("Error: not enough arguments specified.\n");
    print_usage(argv[0]);
    exit(1);
  }

  if (argc > 2) {
    printf("Warning: ignoring all arguments other than the first one.\n");
  }

  result = __lock_file(argv[1]);
  if (result == -1) {
    /* Open or lock failed; there is nothing to hold or release */
    exit(1);
  }

  /* "#" marks that the lock has been acquired */
  printf("#");
  fflush(stdout);

  /* We sleep here just to give enough time to start other competing
   * lock attempts on other machines
   */
  sleep(60);

  /* "!" marks that the lock has been released */
  result = __unlock_file(result);
  printf("!");

  exit(0);
}
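
The message says the lockup needs roughly 4x as many outstanding lock
requests as nfsd processes, spread over at least two client machines.
The following is a minimal sketch of one way to generate that many
waiters from fewer machines, assuming each client forks several
processes that all block in F_SETLKW on the same file. The functions
waiters_per_machine and spawn_lock_waiters are hypothetical helpers, not
part of setlkw.c, and whether forked waiters on a single client count
the same as separate machines is an assumption that has not been tested.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Lock requests each client should issue so that the total reaches
 * nfsd_count * factor (rounded up). Returns -1 if machines <= 0.
 */
int waiters_per_machine(int nfsd_count, int factor, int machines)
{
    int total = nfsd_count * factor;

    if (machines <= 0)
        return -1;
    return (total + machines - 1) / machines;
}

/*
 * Fork n children. Each opens path itself (fcntl locks are owned per
 * process), blocks in F_SETLKW for a whole-file write lock, holds it
 * for hold_secs seconds, then exits, which releases the lock.
 * Returns the number of children that exited with status 0.
 */
int spawn_lock_waiters(const char *path, int n, unsigned hold_secs)
{
    int i, started = 0, ok = 0, status;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();

        if (pid < 0) {
            perror("fork");
            break;
        }
        if (pid == 0) {
            struct flock fl;
            int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);

            if (fd == -1) {
                perror("open");
                _exit(1);
            }
            fl.l_type = F_WRLCK;
            fl.l_whence = SEEK_SET;
            fl.l_start = 0;
            fl.l_len = 0;   /* 0 means "to end of file" */
            if (fcntl(fd, F_SETLKW, &fl) == -1) {
                perror("fcntl(F_SETLKW)");
                _exit(1);
            }
            sleep(hold_secs);
            _exit(0);
        }
        started++;
    }

    /* Reap every child that was started */
    while (started-- > 0) {
        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

With 4 nfsd processes and two clients, waiters_per_machine(4, 4, 2)
gives 8 waiters per client, matching the "around 16 lock requests"
figure in the message; each client would then call
spawn_lock_waiters() on the shared file with that count.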


Thread overview: 4+ messages
2003-09-24 16:51 NFS lockups with 2.4.18 Stephan Koledin
2003-09-25 19:54 ` Stephan Koledin
2003-09-26 15:47   ` Stephan Koledin
2003-09-30 23:08     ` Stephan Koledin [this message]
