From: Eric Paris <eparis@redhat.com>
To: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: linux-fsdevel@vger.kernel.org, viro@ZenIV.linux.org.uk
Subject: Re: [PATCH 0/9] fsnotify: change locking order
Date: Mon, 01 Aug 2011 16:38:22 -0400 [thread overview]
Message-ID: <4E370EBE.3050100@redhat.com> (raw)
In-Reply-To: <1308065393-29463-1-git-send-email-LinoSanfilippo@gmx.de>
[-- Attachment #1: Type: text/plain, Size: 6034 bytes --]
On 06/14/2011 11:29 AM, Lino Sanfilippo wrote:
> Hi Eric,
>
> After the patch series "decouple mark lock and marks fsobject lock"
> I sent on Feb. 21 this is another attempt to change the locking order
> used in fsnotify from
>
> mark->lock
> group->mark_lock
> inode/mnt->lock
> to
> group->mark_lock
> mark->lock
> inode/mnt->lock
>
> The former series still contained some flaws:
> 1. inodes/mounts that have marks and are not pinned could disappear while
> another thread still wants to access a mark's inode/mount (see https://lkml.org/lkml/2011/3/1/373)
> 2. in the newly introduced function remove_mark_locked() a group could be freed
> while the group's mark mutex is held
>
> Both issues should be fixed with this series.
>
> The reason for changing the locking order is, as you may remember, that there are
> still some races when adding marks to or removing marks from a group
> (see also https://lkml.org/lkml/2010/11/30/189).
>
> So what this series does is change the locking order (PATCH 4) to
>
> group->mark_mutex
> mark->lock
> inode/mnt->lock
>
> Besides this, the group mark_lock is turned into a mutex (PATCH 6), which allows us to
> call functions that may sleep while this lock is held.
>
> At some places there is only a mark and no group available
> (i.e. clear_marks_by_[inode|mount]()), so we first have to get the group from the mark
> to use the group->mark_mutex (PATCH 7).
>
> Since we can't get a group from a mark and use it without the danger that the group is
> unregistered and freed, reference counting for groups is introduced and used
> (PATCHES 1 to 3) for each mark that is added to the group.
> With this, freeing a group no longer depends on the number of marks; the group is
> simply destroyed when the reference counter hits 0.
>
> Since a group ref is now taken for each mark that is added to the group, we first
> have to explicitly call fsnotify_destroy_group() to remove all marks from the group
> and release all group references held by those marks. This will also release the
> - possibly final - ref of the group held by the caller (PATCH 1).
>
> Furthermore we can now take the mark lock with the group mutex held, so we don't need an
> extra "clear list" any more if we clear all marks by group (PATCH 9).
>
> For [add|remove]_mark() locked versions are introduced (PATCH 8) that can be
> used if a caller already holds the lock. Thus we now have the possibility
> to use the group's mark mutex to also synchronize addition/removal of a mark, or to
> replace the dnotify mutex.
> This is not part of these patches. It would be the next step if these patches are
> accepted to be merged.
>
> This is against 2.6.39
I finally built and tested a v3.0 kernel with these patches (I know I'm
SOOOOOO far behind). Not what I hoped for:
> [ 150.937798] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day...
> [ 150.945290] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
> [ 150.946012] IP: [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> [ 150.946012] PGD 2bf9e067 PUD 2bf9f067 PMD 0
> [ 150.946012] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 150.946012] CPU 0
> [ 150.946012] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ext4 jbd2 crc16 joydev ata_piix i2c_piix4 pcspkr uinput ipv6 autofs4 usbhid [last unloaded: scsi_wait_scan]
> [ 150.946012]
> [ 150.946012] Pid: 2764, comm: syscall_thrash Not tainted 3.0.0+ #1 Red Hat KVM
> [ 150.946012] RIP: 0010:[<ffffffff810ffd58>] [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> [ 150.946012] RSP: 0018:ffff88002c2e5df8 EFLAGS: 00010282
> [ 150.946012] RAX: 000000004e370d9f RBX: 0000000000000000 RCX: ffff88003a029438
> [ 150.946012] RDX: 0000000033630a5f RSI: 0000000000000000 RDI: ffff88003491c240
> [ 150.946012] RBP: ffff88002c2e5e08 R08: 0000000000000000 R09: 0000000000000000
> [ 150.946012] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a029428
> [ 150.946012] R13: ffff88003a029428 R14: ffff88003a029428 R15: ffff88003499a610
> [ 150.946012] FS: 00007f5a05420700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
> [ 150.946012] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 150.946012] CR2: 0000000000000070 CR3: 000000002a662000 CR4: 00000000000006f0
> [ 150.946012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 150.946012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 150.946012] Process syscall_thrash (pid: 2764, threadinfo ffff88002c2e4000, task ffff88002bfbc760)
> [ 150.946012] Stack:
> [ 150.946012] ffff88003a029438 ffff88003a029428 ffff88002c2e5e38 ffffffff81102f76
> [ 150.946012] ffff88003a029438 ffff88003a029598 ffffffff8160f9c0 ffff88002c221250
> [ 150.946012] ffff88002c2e5e68 ffffffff8115e9be ffff88002c2e5e68 ffff88003a029438
> [ 150.946012] Call Trace:
> [ 150.946012] [<ffffffff81102f76>] shmem_evict_inode+0x76/0x130
> [ 150.946012] [<ffffffff8115e9be>] evict+0x7e/0x170
> [ 150.946012] [<ffffffff8115ee40>] iput_final+0xd0/0x190
> [ 150.946012] [<ffffffff8115ef33>] iput+0x33/0x40
> [ 150.946012] [<ffffffff81180205>] fsnotify_destroy_mark_locked+0x145/0x160
> [ 150.946012] [<ffffffff81180316>] fsnotify_destroy_mark+0x36/0x50
> [ 150.946012] [<ffffffff81181937>] sys_inotify_rm_watch+0x77/0xd0
> [ 150.946012] [<ffffffff815aca52>] system_call_fastpath+0x16/0x1b
> [ 150.946012] Code: 67 4a 00 b8 e4 ff ff ff eb aa 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 8b 9f 40 05 00 00
> [ 150.946012] 83 7b 70 00 74 1c 4c 8d a3 80 00 00 00 4c 89 e7 e8 d2 5d 4a
> [ 150.946012] RIP [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> [ 150.946012] RSP <ffff88002c2e5df8>
> [ 150.946012] CR2: 0000000000000070
Looks an awful lot like the problem from:
http://www.spinics.net/lists/linux-fsdevel/msg46101.html
Latest stress test program attached.....
-Eric
[-- Attachment #2: syscall_thrash.c --]
[-- Type: text/plain, Size: 9878 bytes --]
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_CORES 2
#define NUM_DATA_DUMPERS 4
#define WATCHER_MULTIPLIER 4
#define NUM_WATCHER_THREADS NUM_CORES
#define NUM_CLOSER_THREADS (NUM_WATCHER_THREADS * 2)
#define NUM_ZERO_CLOSERS 1
#define NUM_FILE_CREATERS 2
#define NUM_INOTIFY_INSTANCES 2
#define TMP_DIR_NAME "/tmp/inotify_syscall_thrash"

struct watcher_struct {
	int inotify_fd;
	int file_num;
};

struct operator_struct {
	int inotify_fd;
};

/* written from the SIGINT handler, so volatile sig_atomic_t */
static volatile sig_atomic_t stopped = 0;
static int high_wd = 0;
static int low_wd = INT_MAX;

void *add_watches(void *ptr);
void *close_watches(void *ptr);
void *zero_close_watches(void *ptr);
void *dump_data(void *ptr);
void *reset_low_wd(void *ptr);
void *create_files(void *ptr);
void *mount_tmpdir(void *ptr);
void sigfunc(int sig_num);

void handle_error(const char *arg)
{
	perror(arg);
	exit(1);
}
int main(void)
{
	int inotify_fd[NUM_INOTIFY_INSTANCES];
	struct watcher_struct *watcher_arg;
	struct operator_struct *operator_arg;
	pthread_t *watchers;
	pthread_t *closers;
	pthread_t *zero_closers;
	pthread_t *data_dumpers;
	pthread_t *file_creaters;
	pthread_t low_wd_reseter;
	pthread_t mounter;
	pthread_attr_t attr;
	int iret;
	void *ret;
	int i, j, k;
	struct sigaction setmask;

	/* close cleanly on ctrl+c */
	sigemptyset(&setmask.sa_mask);
	setmask.sa_handler = sigfunc;
	setmask.sa_flags = 0;
	sigaction(SIGINT, &setmask, (struct sigaction *) NULL);

	/* create the inotify instances and make them O_NONBLOCK */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++) {
		int fd = inotify_init();
		if (fd < 0)
			handle_error("opening inotify_fd");
		iret = fcntl(fd, F_SETFL, O_NONBLOCK);
		if (iret)
			handle_error("setting NONBLOCK");
		inotify_fd[i] = fd;
	}

	/* make sure the directory exists */
	mkdir(TMP_DIR_NAME, S_IRWXU);

	/* set up a pthread attr with a tiny stack */
	iret = pthread_attr_init(&attr);
	if (iret)
		handle_error("pthread_attr_init");
	iret = pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN * 2);
	if (iret)
		handle_error("pthread_attr_setstacksize");

	/* watchers need to know what file to play with, so we need an argument */
	watcher_arg = calloc(NUM_INOTIFY_INSTANCES * NUM_WATCHER_THREADS, sizeof(struct watcher_struct));
	if (!watcher_arg)
		handle_error("allocating watcher_arg");
	operator_arg = calloc(NUM_INOTIFY_INSTANCES, sizeof(struct operator_struct));
	if (!operator_arg)
		handle_error("allocating operator_arg");

	/* allocate the pthread_t's for all of the threads */
	watchers = calloc(NUM_INOTIFY_INSTANCES * NUM_WATCHER_THREADS * WATCHER_MULTIPLIER, sizeof(pthread_t));
	if (!watchers)
		handle_error("allocating watchers");
	closers = calloc(NUM_INOTIFY_INSTANCES * NUM_CLOSER_THREADS, sizeof(pthread_t));
	if (!closers)
		handle_error("allocating closers");
	zero_closers = calloc(NUM_INOTIFY_INSTANCES * NUM_ZERO_CLOSERS, sizeof(pthread_t));
	if (!zero_closers)
		handle_error("allocating zero_closers");
	data_dumpers = calloc(NUM_INOTIFY_INSTANCES * NUM_DATA_DUMPERS, sizeof(pthread_t));
	if (!data_dumpers)
		handle_error("allocating data_dumpers");
	file_creaters = calloc(NUM_FILE_CREATERS, sizeof(*file_creaters));
	if (!file_creaters)
		handle_error("allocating file_creaters");

	/* create a thread that does nothing but reset the low_wd */
	iret = pthread_create(&low_wd_reseter, &attr, reset_low_wd, NULL);
	if (iret)
		handle_error("low_wd_reseter");
	iret = pthread_create(&mounter, &attr, mount_tmpdir, NULL);
	if (iret)
		handle_error("mounter");

	/* create WATCHER_MULTIPLIER threads per file which do nothing
	 * but try to add a watch for each INOTIFY_INSTANCE */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++) {
		for (j = 0; j < NUM_WATCHER_THREADS; j++) {
			watcher_arg[i * NUM_WATCHER_THREADS + j].file_num = j;
			watcher_arg[i * NUM_WATCHER_THREADS + j].inotify_fd = inotify_fd[i];
			for (k = 0; k < WATCHER_MULTIPLIER; k++) {
				iret = pthread_create(&watchers[i * (NUM_WATCHER_THREADS * WATCHER_MULTIPLIER) +
						      (j * WATCHER_MULTIPLIER) + k],
						      &attr, add_watches, &watcher_arg[i * NUM_WATCHER_THREADS + j]);
				if (iret)
					handle_error("creating watcher threads");
			}
		}
	}

	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		operator_arg[i].inotify_fd = inotify_fd[i];

	/* create threads which unlink and then recreate all of the files in question */
	for (i = 0; i < NUM_FILE_CREATERS; i++) {
		iret = pthread_create(&file_creaters[i], &attr, create_files, NULL);
		if (iret)
			handle_error("creating the file creators");
	}

	/* create threads which walk from low_wd to high_wd closing all of the wd's in between */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_CLOSER_THREADS; j++) {
			iret = pthread_create(&closers[i * NUM_CLOSER_THREADS + j], &attr, close_watches, &operator_arg[i]);
			if (iret)
				handle_error("creating the close threads");
		}

	/* create threads which just walk from low_wd to low_wd+3 closing wd's for extra races */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_ZERO_CLOSERS; j++) {
			iret = pthread_create(&zero_closers[i * NUM_ZERO_CLOSERS + j], &attr, zero_close_watches, &operator_arg[i]);
			if (iret)
				handle_error("creating the low closer threads");
		}

	/* create threads which just pull data off of the inotify fd.
	 * use the default attr for a larger stack */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_DATA_DUMPERS; j++) {
			iret = pthread_create(&data_dumpers[i * NUM_DATA_DUMPERS + j], NULL, dump_data, &operator_arg[i]);
			if (iret)
				handle_error("creating threads to dump inotify data");
		}

	/* Wait till threads are complete before main continues. */
	pthread_join(low_wd_reseter, &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_WATCHER_THREADS; j++)
			for (k = 0; k < WATCHER_MULTIPLIER; k++)
				pthread_join(watchers[i * (NUM_WATCHER_THREADS * WATCHER_MULTIPLIER) +
					     (j * WATCHER_MULTIPLIER) + k], &ret);
	for (i = 0; i < NUM_FILE_CREATERS; i++)
		pthread_join(file_creaters[i], &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_CLOSER_THREADS; j++)
			pthread_join(closers[i * NUM_CLOSER_THREADS + j], &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_ZERO_CLOSERS; j++)
			pthread_join(zero_closers[i * NUM_ZERO_CLOSERS + j], &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_DATA_DUMPERS; j++)
			pthread_join(data_dumpers[i * NUM_DATA_DUMPERS + j], &ret);

	/* clean up the tmp dir which should be empty */
	rmdir(TMP_DIR_NAME);
	exit(0);
}
void sigfunc(int sig_num)
{
	if (sig_num == SIGINT)
		stopped = 1;
	else
		printf("Got an unknown signal!\n");
}

/* constantly create and delete all of the files that are being watched */
void *create_files(__attribute__ ((unused)) void *ptr)
{
	char filename[50];
	int i;

	fprintf(stderr, "Starting creator thread\n");
	while (!stopped) {
		for (i = 0; i < NUM_WATCHER_THREADS; i++) {
			int fd;

			snprintf(filename, 50, "%s/%d", TMP_DIR_NAME, i);
			unlink(filename);
			fd = open(filename, O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
			if (fd >= 0)
				close(fd);
		}
		sleep(10);
	}
	/* cleanup all files on exit */
	for (i = 0; i < NUM_WATCHER_THREADS; i++) {
		snprintf(filename, 50, "%s/%d", TMP_DIR_NAME, i);
		unlink(filename);
	}
	return NULL;
}

/* Reset the low_wd so closers can be smart */
void *reset_low_wd(__attribute__ ((unused)) void *ptr)
{
	fprintf(stderr, "Starting low_wd reset thread\n");
	while (!stopped) {
		low_wd = INT_MAX;
		sleep(1);
	}
	return NULL;
}

/* Pull events off the buffer and ignore them */
void *dump_data(void *ptr)
{
	char buf[8096];
	struct operator_struct *operator_arg = ptr;
	int inotify_fd = operator_arg->inotify_fd;
	int ret;

	fprintf(stderr, "Starting inotify data dumper thread\n");
	while (!stopped) {
		ret = read(inotify_fd, buf, sizeof(buf));
		if (ret <= 0)
			pthread_yield();
	}
	return NULL;
}

/* add a watch to a specific file as fast as we can */
void *add_watches(void *ptr)
{
	struct watcher_struct *watcher_arg = ptr;
	int file_num = watcher_arg->file_num;
	int notify_fd = watcher_arg->inotify_fd;
	int ret;
	char filename[50];

	fprintf(stderr, "Creating a watch adder thread, notify_fd=%d filenum=%d\n",
		notify_fd, file_num);
	snprintf(filename, 50, "%s/%d", TMP_DIR_NAME, file_num);
	while (!stopped) {
		ret = inotify_add_watch(notify_fd, filename, IN_ALL_EVENTS);
		if (ret < 0 && errno != ENOENT)
			perror("inotify_add_watch");
		if (ret > high_wd)
			high_wd = ret;
		if (ret < low_wd)
			low_wd = ret;
		pthread_yield();
	}
	return NULL;
}

/* run from low_wd to high_wd closing all watches in between */
void *close_watches(void *ptr)
{
	struct operator_struct *operator_arg = ptr;
	int inotify_fd = operator_arg->inotify_fd;
	int i;

	fprintf(stderr, "Starting a thread to close watchers\n");
	while (!stopped) {
		for (i = low_wd; i < high_wd; i++)
			inotify_rm_watch(inotify_fd, i);
		pthread_yield();
	}
	return NULL;
}

/* run from low_wd to low_wd+3 closing all watches in between just for extra races */
void *zero_close_watches(void *ptr)
{
	struct operator_struct *operator_arg = ptr;
	int inotify_fd = operator_arg->inotify_fd;
	int i;

	while (!stopped) {
		for (i = low_wd; i <= low_wd + 3; i++)
			inotify_rm_watch(inotify_fd, i);
		pthread_yield();
	}
	return NULL;
}

void *mount_tmpdir(__attribute__ ((unused)) void *ptr)
{
	int rc;

	while (!stopped) {
		rc = mount(TMP_DIR_NAME, TMP_DIR_NAME, "tmpfs", MS_MGC_VAL,
			   "rootcontext=\"unconfined_u:object_r:tmp_t:s0\"");
		usleep(100000);
		if (!rc)
			umount2(TMP_DIR_NAME, MNT_DETACH);
		usleep(100000);
	}
	return NULL;
}
2011-06-14 15:29 [PATCH 0/9] fsnotify: change locking order Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 1/9] inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 2/9] fsnotify: introduce fsnotify_get_group() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 3/9] fsnotify: use reference counting for groups Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 4/9] fsnotify: take groups mark_lock before mark lock Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 5/9] fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 6/9] fsnotify: use a mutex instead of a spinlock to protect a groups mark list Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 7/9] fsnotify: pass group to fsnotify_destroy_mark() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 8/9] fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 9/9] fsnotify: dont put marks on temporary list when clearing marks by group Lino Sanfilippo
2011-08-01 20:38 ` Eric Paris [this message]
2011-08-11 23:13 ` [PATCH 0/9] fsnotify: change locking order Lino Sanfilippo
2012-03-20 18:49 ` Luis Henriques
2012-03-20 18:58 ` Eric Paris
2012-03-20 19:05 ` Luis Henriques
2012-03-22 22:14 ` Lino Sanfilippo
2012-03-26 11:27 ` Luis Henriques
2012-03-26 15:12 ` Eric Paris
2012-03-26 15:27 ` Luis Henriques
2012-03-26 22:51 ` Lino Sanfilippo