From: Eric Paris <eparis@redhat.com>
To: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: linux-fsdevel@vger.kernel.org, viro@ZenIV.linux.org.uk
Subject: Re: [PATCH 0/9] fsnotify: change locking order
Date: Mon, 01 Aug 2011 16:38:22 -0400 [thread overview]
Message-ID: <4E370EBE.3050100@redhat.com> (raw)
In-Reply-To: <1308065393-29463-1-git-send-email-LinoSanfilippo@gmx.de>
[-- Attachment #1: Type: text/plain, Size: 6034 bytes --]
On 06/14/2011 11:29 AM, Lino Sanfilippo wrote:
> Hi Eric,
>
> After the patch series "decouple mark lock and marks fsobject lock"
> I sent on Feb. 21 this is another attempt to change the locking order
> used in fsnotify from
>
> mark->lock
> group->mark_lock
> inode/mnt->lock
> to
> group->mark_lock
> mark->lock
> inode/mnt->lock
>
> The former series still contained some flaws:
> 1. inodes/mounts that have marks and are not pinned could disappear while
> another thread still wants to access a mark's inode/mount (see https://lkml.org/lkml/2011/3/1/373)
> 2. in the newly introduced function remove_mark_locked() a group could be freed
> while the group's mark mutex is held
>
> Both issues should be fixed with this series.
>
> The reason for changing the locking order is, as you may remember, that there are
> still some races when adding marks to or removing marks from a group
> (see also https://lkml.org/lkml/2010/11/30/189).
>
> So what this series does is change the locking order (PATCH 4) to
>
> group->mark_mutex
> mark->lock
> inode/mnt->lock
>
> Besides this, the group mark_lock is turned into a mutex (PATCH 6), which allows us to
> call functions that may sleep while this lock is held.
>
> At some places there is only a mark and no group available
> (i.e. clear_marks_by_[inode|mount]()), so we first have to get the group from the mark
> to use the group->mark_mutex (PATCH 7).
>
> Since we can't get a group from a mark and use it without the danger that the group is
> unregistered and freed, reference counting for groups is introduced and used
> (PATCHES 1 to 3) for each mark that is added to the group.
> With this, freeing a group no longer depends on the number of marks; the group is
> simply destroyed when the reference counter hits 0.
>
> Since a group ref is now taken for each mark that is added to the group, we first
> have to explicitly call fsnotify_destroy_group() to remove all marks from the group
> and release all group references held by those marks. This will also release the
> - possibly final - ref of the group held by the caller (PATCH 1).
>
> Furthermore we can now take the mark lock with the group mutex held, so we don't need an
> extra "clear list" any more if we clear all marks by group (PATCH 9).
>
> For [add|remove]_mark() locked versions are introduced (PATCH 8) that can be
> used if a caller already holds the lock. Thus we now have the possibility
> to use the group's mark mutex to also synchronize addition/removal of a mark, or to
> replace the dnotify mutex.
> This is not part of these patches. It would be the next step if these patches are
> accepted to be merged.
>
> This is against 2.6.39
I finally built and tested a v3.0 kernel with these patches (I know I'm
SOOOOOO far behind). Not what I hoped for:
> [ 150.937798] VFS: Busy inodes after unmount of tmpfs. Self-destruct in 5 seconds. Have a nice day...
> [ 150.945290] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
> [ 150.946012] IP: [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> [ 150.946012] PGD 2bf9e067 PUD 2bf9f067 PMD 0
> [ 150.946012] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 150.946012] CPU 0
> [ 150.946012] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ext4 jbd2 crc16 joydev ata_piix i2c_piix4 pcspkr uinput ipv6 autofs4 usbhid [last unloaded: scsi_wait_scan]
> [ 150.946012]
> [ 150.946012] Pid: 2764, comm: syscall_thrash Not tainted 3.0.0+ #1 Red Hat KVM
> [ 150.946012] RIP: 0010:[<ffffffff810ffd58>] [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> [ 150.946012] RSP: 0018:ffff88002c2e5df8 EFLAGS: 00010282
> [ 150.946012] RAX: 000000004e370d9f RBX: 0000000000000000 RCX: ffff88003a029438
> [ 150.946012] RDX: 0000000033630a5f RSI: 0000000000000000 RDI: ffff88003491c240
> [ 150.946012] RBP: ffff88002c2e5e08 R08: 0000000000000000 R09: 0000000000000000
> [ 150.946012] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a029428
> [ 150.946012] R13: ffff88003a029428 R14: ffff88003a029428 R15: ffff88003499a610
> [ 150.946012] FS: 00007f5a05420700(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
> [ 150.946012] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 150.946012] CR2: 0000000000000070 CR3: 000000002a662000 CR4: 00000000000006f0
> [ 150.946012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 150.946012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 150.946012] Process syscall_thrash (pid: 2764, threadinfo ffff88002c2e4000, task ffff88002bfbc760)
> [ 150.946012] Stack:
> [ 150.946012] ffff88003a029438 ffff88003a029428 ffff88002c2e5e38 ffffffff81102f76
> [ 150.946012] ffff88003a029438 ffff88003a029598 ffffffff8160f9c0 ffff88002c221250
> [ 150.946012] ffff88002c2e5e68 ffffffff8115e9be ffff88002c2e5e68 ffff88003a029438
> [ 150.946012] Call Trace:
> [ 150.946012] [<ffffffff81102f76>] shmem_evict_inode+0x76/0x130
> [ 150.946012] [<ffffffff8115e9be>] evict+0x7e/0x170
> [ 150.946012] [<ffffffff8115ee40>] iput_final+0xd0/0x190
> [ 150.946012] [<ffffffff8115ef33>] iput+0x33/0x40
> [ 150.946012] [<ffffffff81180205>] fsnotify_destroy_mark_locked+0x145/0x160
> [ 150.946012] [<ffffffff81180316>] fsnotify_destroy_mark+0x36/0x50
> [ 150.946012] [<ffffffff81181937>] sys_inotify_rm_watch+0x77/0xd0
> [ 150.946012] [<ffffffff815aca52>] system_call_fastpath+0x16/0x1b
> [ 150.946012] Code: 67 4a 00 b8 e4 ff ff ff eb aa 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 8b 9f 40 05 00 00
> [ 150.946012] 83 7b 70 00 74 1c 4c 8d a3 80 00 00 00 4c 89 e7 e8 d2 5d 4a
> [ 150.946012] RIP [<ffffffff810ffd58>] shmem_free_inode+0x18/0x50
> [ 150.946012] RSP <ffff88002c2e5df8>
> [ 150.946012] CR2: 0000000000000070
Looks an awful lot like the problem from:
http://www.spinics.net/lists/linux-fsdevel/msg46101.html
Latest stress test program attached.....
-Eric
[-- Attachment #2: syscall_thrash.c --]
[-- Type: text/plain, Size: 9878 bytes --]
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_CORES 2
#define NUM_DATA_DUMPERS 4
#define WATCHER_MULTIPLIER 4
#define NUM_WATCHER_THREADS NUM_CORES
#define NUM_CLOSER_THREADS (NUM_WATCHER_THREADS * 2)
#define NUM_ZERO_CLOSERS 1
#define NUM_FILE_CREATERS 2
#define NUM_INOTIFY_INSTANCES 2
#define TMP_DIR_NAME "/tmp/inotify_syscall_thrash"

struct watcher_struct {
	int inotify_fd;
	int file_num;
};

struct operator_struct {
	int inotify_fd;
};

/* written from the SIGINT handler, so volatile sig_atomic_t */
static volatile sig_atomic_t stopped = 0;
static int high_wd = 0;
static int low_wd = INT_MAX;

void *add_watches(void *ptr);
void *close_watches(void *ptr);
void *zero_close_watches(void *ptr);
void *dump_data(void *ptr);
void *reset_low_wd(void *ptr);
void *create_files(void *ptr);
void *mount_tmpdir(void *ptr);
void sigfunc(int sig_num);

void handle_error(const char *arg)
{
	perror(arg);
	exit(1);
}
int main(void)
{
	int inotify_fd[NUM_INOTIFY_INSTANCES];
	struct watcher_struct *watcher_arg;
	struct operator_struct *operator_arg;
	pthread_t *watchers;
	pthread_t *closers;
	pthread_t *zero_closers;
	pthread_t *data_dumpers;
	pthread_t *file_creaters;
	pthread_t low_wd_reseter;
	pthread_t mounter;
	pthread_attr_t attr;
	int iret;
	void *ret;
	int i, j, k;
	struct sigaction setmask;

	/* close cleanly on ctrl+c */
	sigemptyset(&setmask.sa_mask);
	setmask.sa_handler = sigfunc;
	setmask.sa_flags = 0;
	sigaction(SIGINT, &setmask, (struct sigaction *) NULL);

	/* create the inotify instances and make them O_NONBLOCK */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++) {
		int fd = inotify_init();
		if (fd < 0)
			handle_error("opening inotify_fd");
		iret = fcntl(fd, F_SETFL, O_NONBLOCK);
		if (iret)
			handle_error("setting NONBLOCK");
		inotify_fd[i] = fd;
	}

	/* make sure the directory exists */
	mkdir(TMP_DIR_NAME, S_IRWXU);

	/* set up a pthread attr with a tiny stack */
	iret = pthread_attr_init(&attr);
	if (iret)
		handle_error("pthread_attr_init");
	iret = pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN * 2);
	if (iret)
		handle_error("pthread_attr_setstacksize");

	/* watchers need to know what file to play with, so we need an argument */
	watcher_arg = calloc(NUM_INOTIFY_INSTANCES * NUM_WATCHER_THREADS, sizeof(struct watcher_struct));
	if (!watcher_arg)
		handle_error("allocating watcher_arg");
	operator_arg = calloc(NUM_INOTIFY_INSTANCES, sizeof(struct operator_struct));
	if (!operator_arg)
		handle_error("allocating operator_arg");

	/* allocate the pthread_t's for all of the threads */
	watchers = calloc(NUM_INOTIFY_INSTANCES * NUM_WATCHER_THREADS * WATCHER_MULTIPLIER, sizeof(pthread_t));
	if (!watchers)
		handle_error("allocating watchers");
	closers = calloc(NUM_INOTIFY_INSTANCES * NUM_CLOSER_THREADS, sizeof(pthread_t));
	if (!closers)
		handle_error("allocating closers");
	zero_closers = calloc(NUM_INOTIFY_INSTANCES * NUM_ZERO_CLOSERS, sizeof(pthread_t));
	if (!zero_closers)
		handle_error("allocating zero_closers");
	data_dumpers = calloc(NUM_INOTIFY_INSTANCES * NUM_DATA_DUMPERS, sizeof(pthread_t));
	if (!data_dumpers)
		handle_error("allocating data_dumpers");
	file_creaters = calloc(NUM_FILE_CREATERS, sizeof(*file_creaters));
	if (!file_creaters)
		handle_error("allocating file_creaters");

	/* create a thread that does nothing but reset the low_wd */
	iret = pthread_create(&low_wd_reseter, &attr, reset_low_wd, NULL);
	if (iret)
		handle_error("low_wd_reseter");
	iret = pthread_create(&mounter, &attr, mount_tmpdir, NULL);
	if (iret)
		handle_error("mounter");

	/* create WATCHER_MULTIPLIER threads per file which do nothing
	 * but try to add a watch for each INOTIFY_INSTANCE */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++) {
		for (j = 0; j < NUM_WATCHER_THREADS; j++) {
			watcher_arg[i * NUM_WATCHER_THREADS + j].file_num = j;
			watcher_arg[i * NUM_WATCHER_THREADS + j].inotify_fd = inotify_fd[i];
			for (k = 0; k < WATCHER_MULTIPLIER; k++) {
				iret = pthread_create(&watchers[i * (NUM_WATCHER_THREADS * WATCHER_MULTIPLIER) +
						      (j * WATCHER_MULTIPLIER) + k],
						      &attr, add_watches, &watcher_arg[i * NUM_WATCHER_THREADS + j]);
				if (iret)
					handle_error("creating watcher threads");
			}
		}
	}

	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		operator_arg[i].inotify_fd = inotify_fd[i];

	/* create threads which unlink and then recreate all of the files in question */
	for (i = 0; i < NUM_FILE_CREATERS; i++) {
		iret = pthread_create(&file_creaters[i], &attr, create_files, NULL);
		if (iret)
			handle_error("creating the file creators");
	}

	/* create threads which walk from low_wd to high_wd closing all of the wd's in between */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_CLOSER_THREADS; j++) {
			iret = pthread_create(&closers[i * NUM_CLOSER_THREADS + j], &attr, close_watches, &operator_arg[i]);
			if (iret)
				handle_error("creating the close threads");
		}

	/* create threads which just walk from low_wd to low_wd+3 closing wd's for extra races */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_ZERO_CLOSERS; j++) {
			iret = pthread_create(&zero_closers[i * NUM_ZERO_CLOSERS + j], &attr, zero_close_watches, &operator_arg[i]);
			if (iret)
				handle_error("creating the low closer threads");
		}

	/* create threads which just pull data off of the inotify fd.
	 * use the default attr for a larger stack */
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_DATA_DUMPERS; j++) {
			iret = pthread_create(&data_dumpers[i * NUM_DATA_DUMPERS + j], NULL, dump_data, &operator_arg[i]);
			if (iret)
				handle_error("creating threads to dump inotify data");
		}

	/* Wait till threads are complete before main continues. */
	pthread_join(low_wd_reseter, &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_WATCHER_THREADS; j++)
			for (k = 0; k < WATCHER_MULTIPLIER; k++)
				pthread_join(watchers[i * (NUM_WATCHER_THREADS * WATCHER_MULTIPLIER) +
					     (j * WATCHER_MULTIPLIER) + k], &ret);
	for (i = 0; i < NUM_FILE_CREATERS; i++)
		pthread_join(file_creaters[i], &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_CLOSER_THREADS; j++)
			pthread_join(closers[i * NUM_CLOSER_THREADS + j], &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_ZERO_CLOSERS; j++)
			pthread_join(zero_closers[i * NUM_ZERO_CLOSERS + j], &ret);
	for (i = 0; i < NUM_INOTIFY_INSTANCES; i++)
		for (j = 0; j < NUM_DATA_DUMPERS; j++)
			pthread_join(data_dumpers[i * NUM_DATA_DUMPERS + j], &ret);

	/* clean up the tmp dir which should be empty */
	rmdir(TMP_DIR_NAME);
	exit(0);
}
void sigfunc(int sig_num)
{
	if (sig_num == SIGINT)
		stopped = 1;
	else
		printf("Got an unknown signal!\n");
}

/* constantly create and delete all of the files that are being watched */
void *create_files(__attribute__ ((unused)) void *ptr)
{
	char filename[50];
	int i;

	fprintf(stderr, "Starting creator thread\n");
	while (!stopped) {
		for (i = 0; i < NUM_WATCHER_THREADS; i++) {
			int fd;

			snprintf(filename, 50, "%s/%d", TMP_DIR_NAME, i);
			unlink(filename);
			fd = open(filename, O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
			if (fd >= 0)
				close(fd);
		}
		sleep(10);
	}
	/* cleanup all files on exit */
	for (i = 0; i < NUM_WATCHER_THREADS; i++) {
		snprintf(filename, 50, "%s/%d", TMP_DIR_NAME, i);
		unlink(filename);
	}
	return NULL;
}

/* Reset the low_wd so closers can be smart */
void *reset_low_wd(__attribute__ ((unused)) void *ptr)
{
	fprintf(stderr, "Starting low_wd reset thread\n");
	while (!stopped) {
		low_wd = INT_MAX;
		sleep(1);
	}
	return NULL;
}

/* Pull events off the buffer and ignore them */
void *dump_data(void *ptr)
{
	char buf[8096];
	struct operator_struct *operator_arg = ptr;
	int inotify_fd = operator_arg->inotify_fd;
	int ret;

	fprintf(stderr, "Starting inotify data dumper thread\n");
	while (!stopped) {
		ret = read(inotify_fd, buf, sizeof(buf));
		if (ret <= 0)
			pthread_yield();
	}
	return NULL;
}

/* add a watch to a specific file as fast as we can */
void *add_watches(void *ptr)
{
	struct watcher_struct *watcher_arg = ptr;
	int file_num = watcher_arg->file_num;
	int notify_fd = watcher_arg->inotify_fd;
	int ret;
	char filename[50];

	fprintf(stderr, "Creating a watch adder thread, notify_fd=%d filenum=%d\n",
		notify_fd, file_num);
	snprintf(filename, 50, "%s/%d", TMP_DIR_NAME, file_num);
	while (!stopped) {
		ret = inotify_add_watch(notify_fd, filename, IN_ALL_EVENTS);
		if (ret < 0 && errno != ENOENT)
			perror("inotify_add_watch");
		if (ret > high_wd)
			high_wd = ret;
		if (ret < low_wd)
			low_wd = ret;
		pthread_yield();
	}
	return NULL;
}

/* run from low_wd to high_wd closing all watches in between */
void *close_watches(void *ptr)
{
	struct operator_struct *operator_arg = ptr;
	int inotify_fd = operator_arg->inotify_fd;
	int i;

	fprintf(stderr, "Starting a thread to close watchers\n");
	while (!stopped) {
		for (i = low_wd; i < high_wd; i++)
			inotify_rm_watch(inotify_fd, i);
		pthread_yield();
	}
	return NULL;
}

/* run from low_wd to low_wd+3 closing all watches in between just for extra races */
void *zero_close_watches(void *ptr)
{
	struct operator_struct *operator_arg = ptr;
	int inotify_fd = operator_arg->inotify_fd;
	int i;

	while (!stopped) {
		for (i = low_wd; i <= low_wd + 3; i++)
			inotify_rm_watch(inotify_fd, i);
		pthread_yield();
	}
	return NULL;
}

void *mount_tmpdir(__attribute__ ((unused)) void *ptr)
{
	int rc;

	while (!stopped) {
		rc = mount(TMP_DIR_NAME, TMP_DIR_NAME, "tmpfs", MS_MGC_VAL,
			   "rootcontext=\"unconfined_u:object_r:tmp_t:s0\"");
		usleep(100000);
		if (!rc)
			umount2(TMP_DIR_NAME, MNT_DETACH);
		usleep(100000);
	}
	return NULL;
}
2011-06-14 15:29 [PATCH 0/9] fsnotify: change locking order Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 1/9] inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 2/9] fsnotify: introduce fsnotify_get_group() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 3/9] fsnotify: use reference counting for groups Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 4/9] fsnotify: take groups mark_lock before mark lock Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 5/9] fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 6/9] fsnotify: use a mutex instead of a spinlock to protect a groups mark list Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 7/9] fsnotify: pass group to fsnotify_destroy_mark() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 8/9] fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark() Lino Sanfilippo
2011-06-14 15:29 ` [PATCH 9/9] fsnotify: dont put marks on temporary list when clearing marks by group Lino Sanfilippo
2011-08-01 20:38 ` Eric Paris [this message]
2011-08-11 23:13 ` [PATCH 0/9] fsnotify: change locking order Lino Sanfilippo
2012-03-20 18:49 ` Luis Henriques
2012-03-20 18:58 ` Eric Paris
2012-03-20 19:05 ` Luis Henriques
2012-03-22 22:14 ` Lino Sanfilippo
2012-03-26 11:27 ` Luis Henriques
2012-03-26 15:12 ` Eric Paris
2012-03-26 15:27 ` Luis Henriques
2012-03-26 22:51 ` Lino Sanfilippo