Flexible I/O Tester development
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: scameron@beardog.cce.hp.com
Cc: fio@vger.kernel.org
Subject: Re: fio main thread got stuck over the weekend
Date: Fri, 22 Aug 2014 14:11:28 -0500	[thread overview]
Message-ID: <53F795E0.3090806@kernel.dk> (raw)
In-Reply-To: <20140822190924.GQ19666@beardog.cce.hp.com>

On 2014-08-22 14:09, scameron@beardog.cce.hp.com wrote:
> On Fri, Aug 22, 2014 at 02:04:34PM -0500, Jens Axboe wrote:
>> On 2014-08-11 11:04, scameron@beardog.cce.hp.com wrote:
>>> On Mon, Aug 11, 2014 at 10:44:23AM -0500, scameron@beardog.cce.hp.com
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I was running an fio job over the weekend and the main thread
>>>> seems to have hung.  This in on RHEL6u5 with kernel
>>>> 2.6.32-431.20.3.el6.x86_64 but with a highly modified hpsa
>>>> driver, just in case it matters.
>>>>
>>>> [root@msablackburn saperf]# uname -a
>>>> Linux msablackburn.americas.hpqcorp.net 2.6.32-431.20.3.el6.x86_64 #1 SMP
>>>> Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>> [root@msablackburn saperf]# /usr/local/bin/fio --version
>>>> fio-2.1.11-3-g789d
>>>>
>>>> The jobfile looks like this (note the runtime=216000, that's 60 hours)
>>>> It seems to have hung with about 25 hours remaining.)
>>>>
>>>> [root@msablackburn saperf]# cat 2drive_sdr.fio
>>>> [global]
>>>> direct=1
>>>> ioengine=libaio
>>>> norandommap
>>>> randrepeat=0
>>>> bs=4096
>>>> iodepth=1011
>>>> #iodepth=96
>>>> #numjobs=1
>>>> numjobs=12
>>>> #numjobs=24
>>>> # number_ios=1
>>>> runtime=216000
>>>> #runtime=54000
>>>> time_based=1
>>>> group_reporting
>>>> thread
>>>> gtod_reduce=1
>>>> #iodepth_batch=4
>>>> #iodepth_batch_complete=4
>>>> cpus_allowed=0-5
>>>> cpus_allowed_policy=split
>>>> rw=randread
>>>>
>>>> [4_KiB_RR_drive_r]
>>>> filename=/dev/sdr
>>>>
>>>> [4_KiB_RR_drive_s]
>>>> filename=/dev/sds
>>>>
>>>>
>>>> Here is all the output so far:
>>>>
>>>> ---------------8<------------8<-------------8<-----------------
>>>> [root@msablackburn saperf]# fio 2drive_sdr.fio
>>>> 4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> 4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> fio-2.1.11-3-g789d
>>>> Starting 24 threads
>>>> ^Cbs: 24 (f=24): [r(24)] [1.2% done] [1113MB/0KB/0KB /s] [285K/0/0 iops]
>>>> [eta 14h:48m:47s]
>>>> fio: terminating on signal 2
>>>>
>>>> 4_KiB_RR_drive_r: (groupid=0, jobs=24): err= 0: pid=7696: Fri Aug  8
>>>> 16:49:19 2014
>>>>    read : io=881568MB, bw=1309.1MB/s, iops=335341, runt=672989msec
>>>>    cpu          : usr=1.75%, sys=23.15%, ctx=1266117, majf=9, minf=835
>>>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>>    >=64=100.0%
>>>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>       >=64=0.0%
>>>>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>       >=64=0.1%
>>>>       issued    : total=r=225681428/w=0/d=0, short=r=0/w=0/d=0
>>>>       latency   : target=0, window=0, percentile=100.00%, depth=1011
>>>>
>>>> Run status group 0 (all jobs):
>>>>     READ: io=881568MB, aggrb=1309.1MB/s, minb=1309.1MB/s,
>>>>     maxb=1309.1MB/s, mint=672989msec, maxt=672989msec
>>>>
>>>> Disk stats (read/write):
>>>>    sdr: ios=112299535/0, merge=0/0, ticks=13349007/0, in_queue=13054755,
>>>>    util=98.57%
>>>>    sds: ios=113359581/0, merge=0/0, ticks=13710834/0, in_queue=13414931,
>>>>    util=98.71%
>>>> [root@msablackburn saperf]# vi !$
>>>> vi 2drive_sdr.fio
>>>> [root@msablackburn saperf]# fio 2drive_sdr.fio
>>>> 4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> 4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> fio-2.1.11-3-g789d
>>>> Starting 24 threads
>>>> Jobs: 24 (f=24): [r(24)] [57.1% done] [1408MB/0KB/0KB /s] [361K/0/0 iops]
>>>> [eta 01d:01h:43m:07s]
>>>> ---------------8<------------8<-------------8<-----------------
>>>>
>>>> And there it sits, the eta does not update.
>>>>
>>>>
>>>> I tried connecting to the process with gdb, and doing "info threads" and
>>>> "bt":
>>>>
>>>> PID of main thread is 8530:
>>>>
>>>> root      8530  598  1.0 2159116 170812 pts/0  Sl+  Aug08 23570:56 fio
>>>> 2drive_sdr.fio
>>>>
>>>> ...
>>>> (gdb) info threads
>>>>    26 Thread 0x7f65da431700 (LWP 8531)  0x0000003aa4cacced in nanosleep
>>>>    () from /lib64/libc.so.6
>>>>    25 Thread 0x7f65d9a30700 (LWP 8532)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    24 Thread 0x7f65d902f700 (LWP 8533)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    23 Thread 0x7f65d862e700 (LWP 8534)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    22 Thread 0x7f65d7c2d700 (LWP 8535)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    21 Thread 0x7f65d722c700 (LWP 8536)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    20 Thread 0x7f65d682b700 (LWP 8537)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    19 Thread 0x7f65d5e2a700 (LWP 8538)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    18 Thread 0x7f65d5429700 (LWP 8539)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    17 Thread 0x7f65d4a28700 (LWP 8540)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    16 Thread 0x7f65d4027700 (LWP 8541)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    15 Thread 0x7f65d3626700 (LWP 8542)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    14 Thread 0x7f65d2c25700 (LWP 8543)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    13 Thread 0x7f65d2222700 (LWP 8544)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    12 Thread 0x7f65d1820700 (LWP 8545)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    11 Thread 0x7f65d0e1e700 (LWP 8546)  td_io_queue (td=0x7f65da496410,
>>>>    io_u=0x7f656804d010) at ioengines.c:344
>>>>    10 Thread 0x7f65d041c700 (LWP 8547)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    9 Thread 0x7f65cfa1a700 (LWP 8548)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    8 Thread 0x7f65cf018700 (LWP 8549)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    7 Thread 0x7f65ce616700 (LWP 8550)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    6 Thread 0x7f65cdc14700 (LWP 8551)  fio_libaio_event
>>>>    (td=0x7f65da4b06a8, event=<value optimized out>) at engines/libaio.c:65
>>>>    5 Thread 0x7f65cd212700 (LWP 8552)  __get_next_offset
>>>>    (td=0x7f65da4b5a60) at io_u.c:389
>>>>    4 Thread 0x7f65cc810700 (LWP 8553)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    3 Thread 0x7f65cbe0e700 (LWP 8554)  td_io_u_unlock (td=0x7f65da4c01d0,
>>>>    min_evts=<value optimized out>, bytes=0x7f65cbe0dd10) at fio.h:619
>>>>    2 Thread 0x7f65cb40c700 (LWP 8555)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>> * 1 Thread 0x7f65e34c6720 (LWP 8530)  0x0000003aa4cacced in nanosleep ()
>>> >from /lib64/libc.so.6
>>>> (gdb) bt
>>>> #0  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>>>> #1  0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
>>>> #2  0x000000000044afe4 in do_usleep () at backend.c:1727
>>>> #3  run_threads () at backend.c:1965
>>>> #4  0x000000000044b84d in fio_backend () at backend.c:2068
>>>> #5  0x0000003aa4c1ed1d in __libc_start_main () from /lib64/libc.so.6
>>>> #6  0x0000000000409c39 in _start ()
>>>> (gdb) detach
>>>> Detaching from program: /usr/local/bin/fio, process 8530
>>>> (gdb) quit
>>>> [root@msablackburn saperf]#
>>>>
>>>> I am guessing the 8530 thread is the one that prints out the eta updates?
>>>> Well, maybe not, I see a call to create_disk_util_thread() in backend.c.
>>>>
>>>> i/o still seems to be running, and there are no messages in dmesg about
>>>> hung i/o or the scsi mid layer getting angry and trying to abort things
>>>> or anything like that.  Seems to be happily cruising along, except that
>>>> the eta isn't updating, and the job should have finished several hours
>>>> ago now.  I'm leaving it running for now just to try to figure out
>>>> what happened.
>>>>
>>>> Any ideas?  Anything else I should poke at?
>>>
>>> A bit more info:
>>>
>>> thread 26 looks to be the one that's supposed to print the eta info:
>>>
>>> (gdb) thread 26
>>> [Switching to thread 26 (Thread 0x7f65da431700 (LWP 8531))]#0
>>> 0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>>> #1  0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
>>> #2  0x00000000004474a0 in disk_thread_main (data=<value optimized out>) at
>>> backend.c:1992
>>> #3  0x0000003aa50079d1 in start_thread () from /lib64/libpthread.so.0
>>> #4  0x0000003aa4ce8b5d in clone () from /lib64/libc.so.6
>>> (gdb)
>>>
>>> static void *disk_thread_main(void *data)
>>> {
>>>          int ret = 0;
>>>
>>>          fio_mutex_up(startup_mutex);
>>>
>>>          while (threads && !ret) {
>>>                  usleep(DISK_UTIL_MSEC * 1000);
>>>                  if (!threads)
>>>                          break;
>>>                  ret = update_io_ticks();
>>>
>>>                  if (!is_backend)
>>>                          print_thread_status();
>>>          }
>>>
>>>          fio_mutex_up(disk_thread_mutex);
>>>          return NULL;
>>> }
>>>
>>> that usleep() is where it's in nanosleep, I guess.
>>>
>> >from eta.c:
>>>
>>> void print_thread_status(void)
>>> {
>>>          struct jobs_eta *je;
>>>          size_t size;
>>>
>>>          je = get_jobs_eta(0, &size);
>>>          if (je)
>>>                  display_thread_status(je);
>>>
>>>          free(je);
>>> }
>>>
>>> Maybe that je is coming back false?  which is
>>> probably the return value of calc_thread_status() which, well,
>>> at a glance, I'm not sure what calc_thread_status() is doing.
>>
>> I'll take a look at this next week, been away at a conference since last
>> weekend.
>
> Ok.  Meantime, I had to reclaim the machine for testing, so I no longer
> have it just sitting there to debug, and I have not sseen the problem again
> that I know of.

Clearly a hardware issue :-)

-- 
Jens Axboe



  reply	other threads:[~2014-08-22 19:11 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-11 15:44 fio main thread got stuck over the weekend scameron
2014-08-11 16:04 ` scameron
2014-08-22 19:04   ` Jens Axboe
2014-08-22 19:09     ` scameron
2014-08-22 19:11       ` Jens Axboe [this message]
2014-12-12 20:32         ` Elliott, Robert (Server Storage)
2014-12-13  4:49           ` Jens Axboe
2014-12-15 17:33             ` Elliott, Robert (Server Storage)
2014-12-15 17:37               ` Jens Axboe
2014-12-15 19:39                 ` Elliott, Robert (Server Storage)
2014-12-15 20:12                   ` Jens Axboe
2014-12-15 20:31                     ` Jens Axboe
2014-12-15 20:49                       ` Jens Axboe
2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
2014-12-16  2:51                           ` Jens Axboe
2014-12-16 22:43                           ` Jens Axboe
2014-12-17  3:52                             ` Elliott, Robert (Server Storage)
2014-12-17  5:43                               ` Jens Axboe
2014-12-17 16:48                                 ` Elliott, Robert (Server Storage)
2014-12-17 17:27                                   ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53F795E0.3090806@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=fio@vger.kernel.org \
    --cc=scameron@beardog.cce.hp.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox