From: Jens Axboe <axboe@kernel.dk>
To: scameron@beardog.cce.hp.com, fio@vger.kernel.org
Subject: Re: fio main thread got stuck over the weekend
Date: Fri, 22 Aug 2014 14:04:34 -0500 [thread overview]
Message-ID: <53F79442.6010500@kernel.dk> (raw)
In-Reply-To: <20140811160418.GG7486@beardog.cce.hp.com>
On 2014-08-11 11:04, scameron@beardog.cce.hp.com wrote:
> On Mon, Aug 11, 2014 at 10:44:23AM -0500, scameron@beardog.cce.hp.com wrote:
>>
>> Hi,
>>
>> I was running an fio job over the weekend and the main thread
>> seems to have hung. This in on RHEL6u5 with kernel
>> 2.6.32-431.20.3.el6.x86_64 but with a highly modified hpsa
>> driver, just in case it matters.
>>
>> [root@msablackburn saperf]# uname -a
>> Linux msablackburn.americas.hpqcorp.net 2.6.32-431.20.3.el6.x86_64 #1 SMP Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
>> [root@msablackburn saperf]# /usr/local/bin/fio --version
>> fio-2.1.11-3-g789d
>>
>> The jobfile looks like this (note the runtime=216000, that's 60 hours)
>> It seems to have hung with about 25 hours remaining.)
>>
>> [root@msablackburn saperf]# cat 2drive_sdr.fio
>> [global]
>> direct=1
>> ioengine=libaio
>> norandommap
>> randrepeat=0
>> bs=4096
>> iodepth=1011
>> #iodepth=96
>> #numjobs=1
>> numjobs=12
>> #numjobs=24
>> # number_ios=1
>> runtime=216000
>> #runtime=54000
>> time_based=1
>> group_reporting
>> thread
>> gtod_reduce=1
>> #iodepth_batch=4
>> #iodepth_batch_complete=4
>> cpus_allowed=0-5
>> cpus_allowed_policy=split
>> rw=randread
>>
>> [4_KiB_RR_drive_r]
>> filename=/dev/sdr
>>
>> [4_KiB_RR_drive_s]
>> filename=/dev/sds
>>
>>
>> Here is all the output so far:
>>
>> ---------------8<------------8<-------------8<-----------------
>> [root@msablackburn saperf]# fio 2drive_sdr.fio
>> 4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
>> ...
>> 4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
>> ...
>> fio-2.1.11-3-g789d
>> Starting 24 threads
>> ^Cbs: 24 (f=24): [r(24)] [1.2% done] [1113MB/0KB/0KB /s] [285K/0/0 iops] [eta 14h:48m:47s]
>> fio: terminating on signal 2
>>
>> 4_KiB_RR_drive_r: (groupid=0, jobs=24): err= 0: pid=7696: Fri Aug 8 16:49:19 2014
>> read : io=881568MB, bw=1309.1MB/s, iops=335341, runt=672989msec
>> cpu : usr=1.75%, sys=23.15%, ctx=1266117, majf=9, minf=835
>> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>> issued : total=r=225681428/w=0/d=0, short=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=1011
>>
>> Run status group 0 (all jobs):
>> READ: io=881568MB, aggrb=1309.1MB/s, minb=1309.1MB/s, maxb=1309.1MB/s, mint=672989msec, maxt=672989msec
>>
>> Disk stats (read/write):
>> sdr: ios=112299535/0, merge=0/0, ticks=13349007/0, in_queue=13054755, util=98.57%
>> sds: ios=113359581/0, merge=0/0, ticks=13710834/0, in_queue=13414931, util=98.71%
>> [root@msablackburn saperf]# vi !$
>> vi 2drive_sdr.fio
>> [root@msablackburn saperf]# fio 2drive_sdr.fio
>> 4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
>> ...
>> 4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1011
>> ...
>> fio-2.1.11-3-g789d
>> Starting 24 threads
>> Jobs: 24 (f=24): [r(24)] [57.1% done] [1408MB/0KB/0KB /s] [361K/0/0 iops] [eta 01d:01h:43m:07s]
>> ---------------8<------------8<-------------8<-----------------
>>
>> And there it sits, the eta does not update.
>>
>>
>> I tried connecting to the process with gdb, and doing "info threads" and "bt":
>>
>> PID of main thread is 8530:
>>
>> root 8530 598 1.0 2159116 170812 pts/0 Sl+ Aug08 23570:56 fio 2drive_sdr.fio
>>
>> ...
>> (gdb) info threads
>> 26 Thread 0x7f65da431700 (LWP 8531) 0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>> 25 Thread 0x7f65d9a30700 (LWP 8532) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 24 Thread 0x7f65d902f700 (LWP 8533) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 23 Thread 0x7f65d862e700 (LWP 8534) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 22 Thread 0x7f65d7c2d700 (LWP 8535) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 21 Thread 0x7f65d722c700 (LWP 8536) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 20 Thread 0x7f65d682b700 (LWP 8537) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 19 Thread 0x7f65d5e2a700 (LWP 8538) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 18 Thread 0x7f65d5429700 (LWP 8539) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 17 Thread 0x7f65d4a28700 (LWP 8540) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 16 Thread 0x7f65d4027700 (LWP 8541) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 15 Thread 0x7f65d3626700 (LWP 8542) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 14 Thread 0x7f65d2c25700 (LWP 8543) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 13 Thread 0x7f65d2222700 (LWP 8544) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 12 Thread 0x7f65d1820700 (LWP 8545) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 11 Thread 0x7f65d0e1e700 (LWP 8546) td_io_queue (td=0x7f65da496410, io_u=0x7f656804d010) at ioengines.c:344
>> 10 Thread 0x7f65d041c700 (LWP 8547) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 9 Thread 0x7f65cfa1a700 (LWP 8548) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 8 Thread 0x7f65cf018700 (LWP 8549) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 7 Thread 0x7f65ce616700 (LWP 8550) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 6 Thread 0x7f65cdc14700 (LWP 8551) fio_libaio_event (td=0x7f65da4b06a8, event=<value optimized out>) at engines/libaio.c:65
>> 5 Thread 0x7f65cd212700 (LWP 8552) __get_next_offset (td=0x7f65da4b5a60) at io_u.c:389
>> 4 Thread 0x7f65cc810700 (LWP 8553) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> 3 Thread 0x7f65cbe0e700 (LWP 8554) td_io_u_unlock (td=0x7f65da4c01d0, min_evts=<value optimized out>, bytes=0x7f65cbe0dd10) at fio.h:619
>> 2 Thread 0x7f65cb40c700 (LWP 8555) 0x0000003c69800667 in io_submit () from /lib64/libaio.so.1
>> * 1 Thread 0x7f65e34c6720 (LWP 8530) 0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>> (gdb) bt
>> #0 0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>> #1 0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
>> #2 0x000000000044afe4 in do_usleep () at backend.c:1727
>> #3 run_threads () at backend.c:1965
>> #4 0x000000000044b84d in fio_backend () at backend.c:2068
>> #5 0x0000003aa4c1ed1d in __libc_start_main () from /lib64/libc.so.6
>> #6 0x0000000000409c39 in _start ()
>> (gdb) detach
>> Detaching from program: /usr/local/bin/fio, process 8530
>> (gdb) quit
>> [root@msablackburn saperf]#
>>
>> I am guessing the 8530 thread is the one that prints out the eta updates?
>> Well, maybe not, I see a call to create_disk_util_thread() in backend.c.
>>
>> i/o still seems to be running, and there are no messages in dmesg about
>> hung i/o or the scsi mid layer getting angry and trying to abort things
>> or anything like that. Seems to be happily cruising along, except that
>> the eta isn't updating, and the job should have finished several hours
>> ago now. I'm leaving it running for now just to try to figure out
>> what happened.
>>
>> Any ideas? Anything else I should poke at?
>
> A bit more info:
>
> thread 26 looks to be the one that's supposed to print the eta info:
>
> (gdb) thread 26
> [Switching to thread 26 (Thread 0x7f65da431700 (LWP 8531))]#0 0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
> (gdb) bt
> #0 0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
> #1 0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
> #2 0x00000000004474a0 in disk_thread_main (data=<value optimized out>) at backend.c:1992
> #3 0x0000003aa50079d1 in start_thread () from /lib64/libpthread.so.0
> #4 0x0000003aa4ce8b5d in clone () from /lib64/libc.so.6
> (gdb)
>
> static void *disk_thread_main(void *data)
> {
> int ret = 0;
>
> fio_mutex_up(startup_mutex);
>
> while (threads && !ret) {
> usleep(DISK_UTIL_MSEC * 1000);
> if (!threads)
> break;
> ret = update_io_ticks();
>
> if (!is_backend)
> print_thread_status();
> }
>
> fio_mutex_up(disk_thread_mutex);
> return NULL;
> }
>
> that usleep() is where it's in nanosleep, I guess.
>
> from eta.c:
>
> void print_thread_status(void)
> {
> struct jobs_eta *je;
> size_t size;
>
> je = get_jobs_eta(0, &size);
> if (je)
> display_thread_status(je);
>
> free(je);
> }
>
> Maybe that je is coming back false? which is
> probably the return value of calc_thread_status() which, well,
> at a glance, I'm not sure what calc_thread_status() is doing.
I'll take a look at this next week, been away at a conference since last
weekend.
--
Jens Axboe
next prev parent reply other threads:[~2014-08-22 19:04 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-08-11 15:44 fio main thread got stuck over the weekend scameron
2014-08-11 16:04 ` scameron
2014-08-22 19:04 ` Jens Axboe [this message]
2014-08-22 19:09 ` scameron
2014-08-22 19:11 ` Jens Axboe
2014-12-12 20:32 ` Elliott, Robert (Server Storage)
2014-12-13 4:49 ` Jens Axboe
2014-12-15 17:33 ` Elliott, Robert (Server Storage)
2014-12-15 17:37 ` Jens Axboe
2014-12-15 19:39 ` Elliott, Robert (Server Storage)
2014-12-15 20:12 ` Jens Axboe
2014-12-15 20:31 ` Jens Axboe
2014-12-15 20:49 ` Jens Axboe
2014-12-16 0:52 ` Elliott, Robert (Server Storage)
2014-12-16 2:51 ` Jens Axboe
2014-12-16 22:43 ` Jens Axboe
2014-12-17 3:52 ` Elliott, Robert (Server Storage)
2014-12-17 5:43 ` Jens Axboe
2014-12-17 16:48 ` Elliott, Robert (Server Storage)
2014-12-17 17:27 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=53F79442.6010500@kernel.dk \
--to=axboe@kernel.dk \
--cc=fio@vger.kernel.org \
--cc=scameron@beardog.cce.hp.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox