Race condition in fio atexit code

All of lore.kernel.org
 help / color / mirror / Atom feed

* Race condition in fio atexit code
@ 2012-06-19 23:05 Erik Lattimore
  2012-07-31 19:06 ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Erik Lattimore @ 2012-06-19 23:05 UTC (permalink / raw)
  To: fio

Lately it seems like we've been hitting this more frequently, so I figured I'd file a bug.  Fio starts up a thread running the function disk_thread_main, which periodically calls update_io_ticks, which calls update_io_tick_disk on each entry in a circular linked list. The function disk_thread_main returns when the global variable "threads" is set to null, but it's only checked a couple of times in the loop.

The main thread runs the test and exits, and has registered an atexit handler free_shm. This routine sets "threads" to null and frees up storage, including the storage where the linked list used by update_io_ticks is stored.

Occasionally, somehow, update_io_tick_disk winds up getting called with a null pointer and crashing. The problem may be exacerbated when memory is tight. Here's the backtrace of the core dump:

Program terminated with signal 11, Segmentation fault.
#0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
80              if (!du->users)
(gdb) t apply all bt

Thread 2 (Thread 0x7faab680b700 (LWP 23148)):
#0  0x00007faab58df377 in shmdt () from /lib64/libc.so.6
#1  0x000000000040b98d in free_shm () at init.c:231
#2  0x00007faab583b7f5 in __run_exit_handlers () from /lib64/libc.so.6
#3  0x00007faab583b845 in exit () from /lib64/libc.so.6
#4  0x00007faab5824c3d in __libc_start_main () from /lib64/libc.so.6
#5  0x0000000000408ed9 in _start ()

Thread 1 (Thread 0x7faab32dd700 (LWP 23149)):
#0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
#1  update_io_ticks () at diskutil.c:114
#2  0x000000000043b303 in disk_thread_main (data=<optimized out>) at backend.c:1589
#3  0x00007faab61907b6 in start_thread () from /lib64/libpthread.so.0
#4  0x00007faab58dd9cd in clone () from /lib64/libc.so.6
#5  0x0000000000000000 in ?? ()
(gdb) q

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Race condition in fio atexit code
  2012-06-19 23:05 Race condition in fio atexit code Erik Lattimore
@ 2012-07-31 19:06 ` Jens Axboe
  2012-07-31 19:13   ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2012-07-31 19:06 UTC (permalink / raw)
  To: Erik Lattimore; +Cc: fio

On 2012-06-20 01:05, Erik Lattimore wrote:
> Lately it seems like we've been hitting this more frequently, so I figured I'd file a bug.  Fio starts up a thread running the function disk_thread_main, which periodically calls update_io_ticks, which calls update_io_tick_disk on each entry in a circular linked list. The function disk_thread_main returns when the global variable "threads" is set to null, but it's only checked a couple of times in the loop.
> 
> The main thread runs the test and exits, and has registered an atexit handler free_shm. This routine sets "threads" to null and frees up storage, including the storage where the linked list used by update_io_ticks is stored.
> 
> Occasionally, somehow, update_io_tick_disk winds up getting called with a null pointer and crashing. The problem may be exacerbated when memory is tight. Here's the backtrace of the core dump:
> 
> Program terminated with signal 11, Segmentation fault.
> #0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
> 80              if (!du->users)
> (gdb) t apply all bt
> 
> Thread 2 (Thread 0x7faab680b700 (LWP 23148)):
> #0  0x00007faab58df377 in shmdt () from /lib64/libc.so.6
> #1  0x000000000040b98d in free_shm () at init.c:231
> #2  0x00007faab583b7f5 in __run_exit_handlers () from /lib64/libc.so.6
> #3  0x00007faab583b845 in exit () from /lib64/libc.so.6
> #4  0x00007faab5824c3d in __libc_start_main () from /lib64/libc.so.6
> #5  0x0000000000408ed9 in _start ()
> 
> Thread 1 (Thread 0x7faab32dd700 (LWP 23149)):
> #0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
> #1  update_io_ticks () at diskutil.c:114
> #2  0x000000000043b303 in disk_thread_main (data=<optimized out>) at backend.c:1589
> #3  0x00007faab61907b6 in start_thread () from /lib64/libpthread.so.0
> #4  0x00007faab58dd9cd in clone () from /lib64/libc.so.6
> #5  0x0000000000000000 in ?? ()
> (gdb) q--

This is clearly a race in how the disk util thread is shut down and the
structures freed. I'll take a look at a fix. It would be useful if you
told me how you are hitting this most easily, as I don't recall seeing
it. Would make me more confident in a fix.

Also, are you sure it's threads == NULL, and not the du's themselves
being freed? They are in separate storage. It might be a good idea to
have diskutil.c:free_disk_util() signal and wait for the disk util
thread to shutdown before going further.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Race condition in fio atexit code
  2012-07-31 19:06 ` Jens Axboe
@ 2012-07-31 19:13   ` Jens Axboe
  2012-08-01  7:41     ` Jens Axboe
  0 siblings, 1 reply; 4+ messages in thread
From: Jens Axboe @ 2012-07-31 19:13 UTC (permalink / raw)
  To: Erik Lattimore; +Cc: fio

On 2012-07-31 21:06, Jens Axboe wrote:
> On 2012-06-20 01:05, Erik Lattimore wrote:
>> Lately it seems like we've been hitting this more frequently, so I figured I'd file a bug.  Fio starts up a thread running the function disk_thread_main, which periodically calls update_io_ticks, which calls update_io_tick_disk on each entry in a circular linked list. The function disk_thread_main returns when the global variable "threads" is set to null, but it's only checked a couple of times in the loop.
>>
>> The main thread runs the test and exits, and has registered an atexit handler free_shm. This routine sets "threads" to null and frees up storage, including the storage where the linked list used by update_io_ticks is stored.
>>
>> Occasionally, somehow, update_io_tick_disk winds up getting called with a null pointer and crashing. The problem may be exacerbated when memory is tight. Here's the backtrace of the core dump:
>>
>> Program terminated with signal 11, Segmentation fault.
>> #0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
>> 80              if (!du->users)
>> (gdb) t apply all bt
>>
>> Thread 2 (Thread 0x7faab680b700 (LWP 23148)):
>> #0  0x00007faab58df377 in shmdt () from /lib64/libc.so.6
>> #1  0x000000000040b98d in free_shm () at init.c:231
>> #2  0x00007faab583b7f5 in __run_exit_handlers () from /lib64/libc.so.6
>> #3  0x00007faab583b845 in exit () from /lib64/libc.so.6
>> #4  0x00007faab5824c3d in __libc_start_main () from /lib64/libc.so.6
>> #5  0x0000000000408ed9 in _start ()
>>
>> Thread 1 (Thread 0x7faab32dd700 (LWP 23149)):
>> #0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
>> #1  update_io_ticks () at diskutil.c:114
>> #2  0x000000000043b303 in disk_thread_main (data=<optimized out>) at backend.c:1589
>> #3  0x00007faab61907b6 in start_thread () from /lib64/libpthread.so.0
>> #4  0x00007faab58dd9cd in clone () from /lib64/libc.so.6
>> #5  0x0000000000000000 in ?? ()
>> (gdb) q--
> 
> This is clearly a race in how the disk util thread is shut down and the
> structures freed. I'll take a look at a fix. It would be useful if you
> told me how you are hitting this most easily, as I don't recall seeing
> it. Would make me more confident in a fix.
> 
> Also, are you sure it's threads == NULL, and not the du's themselves
> being freed? They are in separate storage. It might be a good idea to
> have diskutil.c:free_disk_util() signal and wait for the disk util
> thread to shutdown before going further.

Can you reproduce with this patch?

diff --git a/backend.c b/backend.c
index e1dc0ac..03d35df 100644
--- a/backend.c
+++ b/backend.c
@@ -50,6 +50,8 @@
 #include "server.h"
 
 static pthread_t disk_util_thread;
+struct fio_mutex *disk_util_mutex;
+int disk_util_exit = 0;
 static struct fio_mutex *startup_mutex;
 static struct fio_mutex *writeout_mutex;
 static struct flist_head *cgroup_list;
@@ -1592,9 +1594,9 @@ static void *disk_thread_main(void *data)
 {
 	fio_mutex_up(startup_mutex);
 
-	while (threads) {
+	while (threads && !disk_util_exit) {
 		usleep(DISK_UTIL_MSEC * 1000);
-		if (!threads)
+		if (!threads || disk_util_exit)
 			break;
 		update_io_ticks();
 
@@ -1602,6 +1604,7 @@ static void *disk_thread_main(void *data)
 			print_thread_status();
 	}
 
+	fio_mutex_up(disk_util_mutex);
 	return NULL;
 }
 
@@ -1609,15 +1612,19 @@ static int create_disk_util_thread(void)
 {
 	int ret;
 
+	disk_util_mutex = fio_mutex_init(0);
+
 	ret = pthread_create(&disk_util_thread, NULL, disk_thread_main, NULL);
 	if (ret) {
 		log_err("Can't create disk util thread: %s\n", strerror(ret));
+		fio_mutex_remove(disk_util_mutex);
 		return 1;
 	}
 
 	ret = pthread_detach(disk_util_thread);
 	if (ret) {
 		log_err("Can't detatch disk util thread: %s\n", strerror(ret));
+		fio_mutex_remove(disk_util_mutex);
 		return 1;
 	}
 
diff --git a/diskutil.c b/diskutil.c
index feb8852..36b381f 100644
--- a/diskutil.c
+++ b/diskutil.c
@@ -521,6 +521,9 @@ void free_disk_util(void)
 {
 	struct disk_util *du;
 
+	disk_util_exit = 1;
+	fio_mutex_down(disk_util_mutex);
+
 	while (!flist_empty(&disk_list)) {
 		du = flist_entry(disk_list.next, struct disk_util, list);
 		flist_del(&du->list);
@@ -528,6 +531,7 @@ void free_disk_util(void)
 	}
 
 	last_majdev = last_mindev = -1;
+	fio_mutex_remove(disk_util_mutex);
 }
 
 void print_disk_util(struct disk_util_stat *dus, struct disk_util_agg *agg,
diff --git a/diskutil.h b/diskutil.h
index 5a9b079..c7212fa 100644
--- a/diskutil.h
+++ b/diskutil.h
@@ -3,6 +3,9 @@
 
 #define FIO_DU_NAME_SZ		64
 
+extern struct fio_mutex *disk_util_mutex;
+extern int disk_util_exit;
+
 /*
  * Disk utils as read in /sys/block/<dev>/stat
  */
@@ -106,7 +109,10 @@ extern void update_io_ticks(void);
 #else
 #define print_disk_util(dus, agg, terse)
 #define show_disk_util(terse)
-#define free_disk_util()
+static inline void free_disk_util(void)
+{
+	fio_mutex_remove(disk_util_mutex);
+}
 #define init_disk_util(td)
 #define update_io_ticks()
 #endif

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: Race condition in fio atexit code
  2012-07-31 19:13   ` Jens Axboe
@ 2012-08-01  7:41     ` Jens Axboe
  0 siblings, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2012-08-01  7:41 UTC (permalink / raw)
  To: Erik Lattimore; +Cc: fio

On 07/31/2012 09:13 PM, Jens Axboe wrote:
> On 2012-07-31 21:06, Jens Axboe wrote:
>> On 2012-06-20 01:05, Erik Lattimore wrote:
>>> Lately it seems like we've been hitting this more frequently, so I figured I'd file a bug.  Fio starts up a thread running the function disk_thread_main, which periodically calls update_io_ticks, which calls update_io_tick_disk on each entry in a circular linked list. The function disk_thread_main returns when the global variable "threads" is set to null, but it's only checked a couple of times in the loop.
>>>
>>> The main thread runs the test and exits, and has registered an atexit handler free_shm. This routine sets "threads" to null and frees up storage, including the storage where the linked list used by update_io_ticks is stored.
>>>
>>> Occasionally, somehow, update_io_tick_disk winds up getting called with a null pointer and crashing. The problem may be exacerbated when memory is tight. Here's the backtrace of the core dump:
>>>
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
>>> 80              if (!du->users)
>>> (gdb) t apply all bt
>>>
>>> Thread 2 (Thread 0x7faab680b700 (LWP 23148)):
>>> #0  0x00007faab58df377 in shmdt () from /lib64/libc.so.6
>>> #1  0x000000000040b98d in free_shm () at init.c:231
>>> #2  0x00007faab583b7f5 in __run_exit_handlers () from /lib64/libc.so.6
>>> #3  0x00007faab583b845 in exit () from /lib64/libc.so.6
>>> #4  0x00007faab5824c3d in __libc_start_main () from /lib64/libc.so.6
>>> #5  0x0000000000408ed9 in _start ()
>>>
>>> Thread 1 (Thread 0x7faab32dd700 (LWP 23149)):
>>> #0  update_io_tick_disk (du=<optimized out>) at diskutil.c:80
>>> #1  update_io_ticks () at diskutil.c:114
>>> #2  0x000000000043b303 in disk_thread_main (data=<optimized out>) at backend.c:1589
>>> #3  0x00007faab61907b6 in start_thread () from /lib64/libpthread.so.0
>>> #4  0x00007faab58dd9cd in clone () from /lib64/libc.so.6
>>> #5  0x0000000000000000 in ?? ()
>>> (gdb) q--
>>
>> This is clearly a race in how the disk util thread is shut down and the
>> structures freed. I'll take a look at a fix. It would be useful if you
>> told me how you are hitting this most easily, as I don't recall seeing
>> it. Would make me more confident in a fix.
>>
>> Also, are you sure it's threads == NULL, and not the du's themselves
>> being freed? They are in separate storage. It might be a good idea to
>> have diskutil.c:free_disk_util() signal and wait for the disk util
>> thread to shutdown before going further.
> 
> Can you reproduce with this patch?

I've checked in a different fix, doing proper locking around the du list
and ensuring that the disk util thread has exited before freeing the
structures.

http://git.kernel.dk/?p=fio.git;a=commit;h=feb418556a236aa041a625b083d6b99e55d23d74

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-08-01  7:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-06-19 23:05 Race condition in fio atexit code Erik Lattimore
2012-07-31 19:06 ` Jens Axboe
2012-07-31 19:13   ` Jens Axboe
2012-08-01  7:41     ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.