Linux NFS development
From: Stephan Koledin <skoledin@neolinear.com>
To: nfs@lists.sourceforge.net
Subject: Re: NFS lockups with 2.4.18
Date: Tue, 30 Sep 2003 19:08:40 -0400
Message-ID: <3F7A0CF8.4000407@neolinear.com>
In-Reply-To: <3F745FA5.5010907@neolinear.com>

[-- Attachment #1: Type: text/plain, Size: 3001 bytes --]

Hello all.

I think we're finally making some progress with our problem. We've 
isolated the application and the system calls causing the lockups and I 
performed a bunch of testing today to try and define the causes of the 
lockups a little better. The recent 2.4.22 kernel from debian testing is 
affected the same as all other kernels we tried. I suppose I could be 
missing a more recent patch that fixes things - if so, please let me know.

Here is what we've found out:

- The problem is triggered by a distributed build process running on 
multiple Solaris 8 machines. The Sun compiler that we're using issues 
F_SETLKW calls to manage its lock files during the build process. Since 
the build is distributed across a number of machines, we are seeing 
SETLKW attempts on the same file from several machines at the same time.

- We now have some code (attached) and various scripts to reproduce the 
problem. It requires some coordination of requests from multiple 
machines, but is easily reproducible given a proper environment. I can 
currently trigger the lockup within a couple of seconds, even on a 
machine with a substantial number of nfsd processes running.

- I am unable to reproduce the problem with Linux clients. Only Solaris 
clients appear able to lock up the NFS server, even using the exact same 
test code and methodology.

- lockd enters a "D" state, with a WCHAN of "down" when multiple 
machines try to get a write lock (using F_SETLKW) on the same file(s). 
There must be at least two machines attempting to lock the same file, 
and the total number of processes/machines either with locks or waiting 
for any locks must be approximately 3-5x the number of nfsd processes. 
It really looks like it's triggered when there are 4x outstanding locks 
or lock requests, but it's a bit difficult to track given the 
distributed nature of the problem.

- For example, if I'm running 4 nfsd processes, the problem is triggered 
with around 16 lock requests from different machines. With a more 
typical setup of 32 nfsd processes, the same 4x ratio puts the threshold 
near 128 requests, so triggering the problem requires a substantial 
amount of computing resources. We do see the problem with fewer than 128 
machines requesting locks, but we cannot reproduce it with only one or 
two machines.

- Once lockd enters a "D" state, the nfsd processes also begin hitting a 
"D" state, and no more NFS requests are possible, even ones that do not 
require locking.
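
One way to check for the "D" state described above on the Linux server is to read /proc directly. The following is a minimal sketch and not part of the original report. It assumes a Linux /proc layout in which each /proc/<pid>/stat line begins with "pid (comm) state", and it assumes the kernel threads are named "lockd" and "nfsd". The function names are hypothetical.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Extract the one-letter state from a /proc/<pid>/stat line.
 * The command name is wrapped in parentheses and may itself contain
 * spaces or ')' characters, so look for the LAST ')' on the line.
 * Returns '\0' if the line does not look like a stat record. */
char stat_line_state(const char *line)
{
	const char *close = strrchr(line, ')');

	if (close == NULL || close[1] != ' ' || close[2] == '\0')
		return '\0';
	return close[2];
}

/* Copy the command name (text between the first '(' and the last ')')
 * into buf.  Returns 0 on success, -1 on a malformed line. */
int stat_line_comm(const char *line, char *buf, size_t buflen)
{
	const char *open = strchr(line, '(');
	const char *close = strrchr(line, ')');
	size_t len;

	if (open == NULL || close == NULL || close < open || buflen == 0)
		return -1;
	len = (size_t)(close - open - 1);
	if (len >= buflen)
		len = buflen - 1;
	memcpy(buf, open + 1, len);
	buf[len] = '\0';
	return 0;
}

/* Walk /proc and print every lockd or nfsd thread in state 'D'.
 * Returns the number found, or -1 if /proc cannot be opened. */
int count_blocked_nfs_threads(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *ent;
	int blocked = 0;

	if (proc == NULL)
		return -1;
	while ((ent = readdir(proc)) != NULL) {
		char path[288], line[512], comm[64];
		FILE *fp;

		if (!isdigit((unsigned char)ent->d_name[0]))
			continue;	/* not a process directory */
		snprintf(path, sizeof(path), "/proc/%s/stat", ent->d_name);
		fp = fopen(path, "r");
		if (fp == NULL)
			continue;	/* process exited during the scan */
		if (fgets(line, sizeof(line), fp) != NULL &&
		    stat_line_comm(line, comm, sizeof(comm)) == 0 &&
		    (strcmp(comm, "lockd") == 0 || strcmp(comm, "nfsd") == 0) &&
		    stat_line_state(line) == 'D') {
			printf("%s %s is in state D\n", ent->d_name, comm);
			blocked++;
		}
		fclose(fp);
	}
	closedir(proc);
	return blocked;
}
```

Calling count_blocked_nfs_threads() periodically on the server while the lock requests ramp up would show whether lockd blocks first and the nfsd threads follow, as described above.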

Now that the problem is easily reproducible, I would love to test any 
suggested solutions/patches. I wish I was more familiar with the 
lockd/nfs code, but unfortunately I don't have much kernel coding 
experience. Please let me know if there's anything else I can do (or 
data I can provide) to help get to the bottom of this issue.

Thanks in advance for any help with this problem. Although it does seem 
like a rare issue, it is causing us a great deal of pain in our environment.

-Stephan

-- 
Stephan B Koledin
Network Systems Developer
http://neolinear.com/

[-- Attachment #2: setlkw.c --]
[-- Type: text/x-csrc, Size: 2043 bytes --]

/*
 * To compile run: gcc setlkw.c
 *	      or:   cc setlkw.c
 *
 * To lock a file run: ./a.out <filename>
 *
 */

#include <stdio.h>
#include <stdlib.h>	/* exit(), getenv() */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>


/* Debug output is enabled when LOCKING_DEBUG is set in the environment */
static int locking_debug_p(void)
{
	return (getenv("LOCKING_DEBUG") != NULL);
}

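/* Open (creating and truncating) the file and request a blocking write
 * lock on the whole of it.  Returns the descriptor, or -1 if the open
 * failed.  A failed lock request is reported on stderr.
 */
int __lock_file(char *filename)
{
	int o_flags;
	int fd;
	struct flock lockstruct;

	o_flags = O_RDWR|O_CREAT|O_TRUNC;

	errno = 0;

	/* Open the file */
	if ((fd = open(filename, o_flags, 0666)) == -1) {
		perror("error opening");
		return -1;
	}

	/* l_start = 0 with l_len = 0 covers the whole file */
	lockstruct.l_type = F_WRLCK;
	lockstruct.l_whence = SEEK_SET;
	lockstruct.l_start = 0L;
	lockstruct.l_len = 0L;

	/* Try to set the lock; F_SETLKW blocks until it is granted */
	if (fcntl(fd, F_SETLKW, &lockstruct) == -1) {
		perror("error locking");
	}

	return fd;
}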


/* Release the lock held on the descriptor and close it.
 * Always returns 0; failures are reported on stderr.
 */
int __unlock_file(int locknum)
{
	struct flock lockstruct;

	if (locking_debug_p())
		fprintf(stderr, "Unlocking fd %d ...\n", locknum);

	errno = 0;

	lockstruct.l_type = F_UNLCK;
	lockstruct.l_whence = SEEK_SET;
	lockstruct.l_start = 0L;
	lockstruct.l_len = 0L;

	if ((fcntl(locknum, F_SETLKW, &lockstruct)) == -1) {
		perror("error unlocking");
	}

	if ((close(locknum)) == -1) {
		perror("error closing");
	}

	return 0;
}


void print_usage(const char* progname)
{
    printf("\nUsage:\n	 %s  <file_to_lock> \n", progname);
    printf("   where  <file_to_lock> is the file to lock (created if missing, truncated if present)\n");
}


int main(int argc, char **argv)
{
  int result;

  /* Did the user specify the correct number of arguments? */
  if (argc < 2) {
    printf("Error: not enough arguments specified.\n");
    print_usage(argv[0]);
    exit(1);
  }

  if (argc > 2) {
    printf("Warning: ignoring all arguments other than the first one.\n");
  }

  result = __lock_file(argv[1]);
  if (result == -1) {
    /* The open failed, so there is nothing to hold or unlock */
    exit(1);
  }
  printf("#");
  fflush(stdout);

  /* We sleep here just to give enough time to start other competing 
   * lock attempts on other machines 
   */
  sleep(60);

  result = __unlock_file(result);
  printf("!");
  fflush(stdout);

  exit(0);
}
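
The program above takes one lock per invocation, so approaching the roughly 4x-nfsd threshold described in the message takes many invocations. The following is a minimal sketch, not part of the original attachment, of starting several contending lockers from a single client with fork(). The function names and parameters are hypothetical. Since the message reports that one or two machines were not enough, a driver like this would presumably still have to run on several clients at once.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take a blocking whole-file write lock, hold it, then release it.
 * Mirrors the open flags and lock range of the attached program.
 * Returns 0 on success, -1 on any failure. */
int lock_hold_unlock(const char *path, unsigned hold_seconds)
{
	struct flock fl;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);

	if (fd == -1)
		return -1;

	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = 0;
	fl.l_len = 0;		/* 0 means "to end of file" */

	if (fcntl(fd, F_SETLKW, &fl) == -1) {	/* blocks until granted */
		close(fd);
		return -1;
	}

	sleep(hold_seconds);

	fl.l_type = F_UNLCK;
	if (fcntl(fd, F_SETLK, &fl) == -1) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Fork nprocs children that all contend for the same lock.
 * fcntl() locks are owned per process, so the children block each other.
 * Returns the number of children that failed, or -1 if not all of
 * them could be started. */
int run_contending_lockers(const char *path, int nprocs, unsigned hold_seconds)
{
	int started = 0, failures = 0, i, status;

	for (i = 0; i < nprocs; i++) {
		pid_t pid = fork();

		if (pid == -1)
			break;		/* could not start more children */
		if (pid == 0)
			_exit(lock_hold_unlock(path, hold_seconds) == 0 ? 0 : 1);
		started++;
	}

	/* Reap every child that was started */
	for (i = 0; i < started; i++) {
		if (wait(&status) == -1 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) != 0)
			failures++;
	}
	return started == nprocs ? failures : -1;
}
```

For example, run_contending_lockers("/mnt/nfs/test.lock", 8, 60) on each of several clients would queue eight blocking requests per client behind whichever process holds the lock; the path here is only a placeholder.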

