All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: scameron@beardog.cce.hp.com
Cc: fio@vger.kernel.org
Subject: Re: fio main thread got stuck over the weekend
Date: Fri, 22 Aug 2014 14:11:28 -0500	[thread overview]
Message-ID: <53F795E0.3090806@kernel.dk> (raw)
In-Reply-To: <20140822190924.GQ19666@beardog.cce.hp.com>

On 2014-08-22 14:09, scameron@beardog.cce.hp.com wrote:
> On Fri, Aug 22, 2014 at 02:04:34PM -0500, Jens Axboe wrote:
>> On 2014-08-11 11:04, scameron@beardog.cce.hp.com wrote:
>>> On Mon, Aug 11, 2014 at 10:44:23AM -0500, scameron@beardog.cce.hp.com
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I was running an fio job over the weekend and the main thread
>>>> seems to have hung.  This in on RHEL6u5 with kernel
>>>> 2.6.32-431.20.3.el6.x86_64 but with a highly modified hpsa
>>>> driver, just in case it matters.
>>>>
>>>> [root@msablackburn saperf]# uname -a
>>>> Linux msablackburn.americas.hpqcorp.net 2.6.32-431.20.3.el6.x86_64 #1 SMP
>>>> Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
>>>> [root@msablackburn saperf]# /usr/local/bin/fio --version
>>>> fio-2.1.11-3-g789d
>>>>
>>>> The jobfile looks like this (note the runtime=216000, that's 60 hours)
>>>> It seems to have hung with about 25 hours remaining.)
>>>>
>>>> [root@msablackburn saperf]# cat 2drive_sdr.fio
>>>> [global]
>>>> direct=1
>>>> ioengine=libaio
>>>> norandommap
>>>> randrepeat=0
>>>> bs=4096
>>>> iodepth=1011
>>>> #iodepth=96
>>>> #numjobs=1
>>>> numjobs=12
>>>> #numjobs=24
>>>> # number_ios=1
>>>> runtime=216000
>>>> #runtime=54000
>>>> time_based=1
>>>> group_reporting
>>>> thread
>>>> gtod_reduce=1
>>>> #iodepth_batch=4
>>>> #iodepth_batch_complete=4
>>>> cpus_allowed=0-5
>>>> cpus_allowed_policy=split
>>>> rw=randread
>>>>
>>>> [4_KiB_RR_drive_r]
>>>> filename=/dev/sdr
>>>>
>>>> [4_KiB_RR_drive_s]
>>>> filename=/dev/sds
>>>>
>>>>
>>>> Here is all the output so far:
>>>>
>>>> ---------------8<------------8<-------------8<-----------------
>>>> [root@msablackburn saperf]# fio 2drive_sdr.fio
>>>> 4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> 4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> fio-2.1.11-3-g789d
>>>> Starting 24 threads
>>>> ^Cbs: 24 (f=24): [r(24)] [1.2% done] [1113MB/0KB/0KB /s] [285K/0/0 iops]
>>>> [eta 14h:48m:47s]
>>>> fio: terminating on signal 2
>>>>
>>>> 4_KiB_RR_drive_r: (groupid=0, jobs=24): err= 0: pid=7696: Fri Aug  8
>>>> 16:49:19 2014
>>>>    read : io=881568MB, bw=1309.1MB/s, iops=335341, runt=672989msec
>>>>    cpu          : usr=1.75%, sys=23.15%, ctx=1266117, majf=9, minf=835
>>>>    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>>>>    >=64=100.0%
>>>>       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>       >=64=0.0%
>>>>       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>>>       >=64=0.1%
>>>>       issued    : total=r=225681428/w=0/d=0, short=r=0/w=0/d=0
>>>>       latency   : target=0, window=0, percentile=100.00%, depth=1011
>>>>
>>>> Run status group 0 (all jobs):
>>>>     READ: io=881568MB, aggrb=1309.1MB/s, minb=1309.1MB/s,
>>>>     maxb=1309.1MB/s, mint=672989msec, maxt=672989msec
>>>>
>>>> Disk stats (read/write):
>>>>    sdr: ios=112299535/0, merge=0/0, ticks=13349007/0, in_queue=13054755,
>>>>    util=98.57%
>>>>    sds: ios=113359581/0, merge=0/0, ticks=13710834/0, in_queue=13414931,
>>>>    util=98.71%
>>>> [root@msablackburn saperf]# vi !$
>>>> vi 2drive_sdr.fio
>>>> [root@msablackburn saperf]# fio 2drive_sdr.fio
>>>> 4_KiB_RR_drive_r: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> 4_KiB_RR_drive_s: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=libaio, iodepth=1011
>>>> ...
>>>> fio-2.1.11-3-g789d
>>>> Starting 24 threads
>>>> Jobs: 24 (f=24): [r(24)] [57.1% done] [1408MB/0KB/0KB /s] [361K/0/0 iops]
>>>> [eta 01d:01h:43m:07s]
>>>> ---------------8<------------8<-------------8<-----------------
>>>>
>>>> And there it sits, the eta does not update.
>>>>
>>>>
>>>> I tried connecting to the process with gdb, and doing "info threads" and
>>>> "bt":
>>>>
>>>> PID of main thread is 8530:
>>>>
>>>> root      8530  598  1.0 2159116 170812 pts/0  Sl+  Aug08 23570:56 fio
>>>> 2drive_sdr.fio
>>>>
>>>> ...
>>>> (gdb) info threads
>>>>    26 Thread 0x7f65da431700 (LWP 8531)  0x0000003aa4cacced in nanosleep
>>>>    () from /lib64/libc.so.6
>>>>    25 Thread 0x7f65d9a30700 (LWP 8532)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    24 Thread 0x7f65d902f700 (LWP 8533)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    23 Thread 0x7f65d862e700 (LWP 8534)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    22 Thread 0x7f65d7c2d700 (LWP 8535)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    21 Thread 0x7f65d722c700 (LWP 8536)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    20 Thread 0x7f65d682b700 (LWP 8537)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    19 Thread 0x7f65d5e2a700 (LWP 8538)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    18 Thread 0x7f65d5429700 (LWP 8539)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    17 Thread 0x7f65d4a28700 (LWP 8540)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    16 Thread 0x7f65d4027700 (LWP 8541)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    15 Thread 0x7f65d3626700 (LWP 8542)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    14 Thread 0x7f65d2c25700 (LWP 8543)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    13 Thread 0x7f65d2222700 (LWP 8544)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    12 Thread 0x7f65d1820700 (LWP 8545)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    11 Thread 0x7f65d0e1e700 (LWP 8546)  td_io_queue (td=0x7f65da496410,
>>>>    io_u=0x7f656804d010) at ioengines.c:344
>>>>    10 Thread 0x7f65d041c700 (LWP 8547)  0x0000003c69800667 in io_submit
>>>>    () from /lib64/libaio.so.1
>>>>    9 Thread 0x7f65cfa1a700 (LWP 8548)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    8 Thread 0x7f65cf018700 (LWP 8549)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    7 Thread 0x7f65ce616700 (LWP 8550)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    6 Thread 0x7f65cdc14700 (LWP 8551)  fio_libaio_event
>>>>    (td=0x7f65da4b06a8, event=<value optimized out>) at engines/libaio.c:65
>>>>    5 Thread 0x7f65cd212700 (LWP 8552)  __get_next_offset
>>>>    (td=0x7f65da4b5a60) at io_u.c:389
>>>>    4 Thread 0x7f65cc810700 (LWP 8553)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>>    3 Thread 0x7f65cbe0e700 (LWP 8554)  td_io_u_unlock (td=0x7f65da4c01d0,
>>>>    min_evts=<value optimized out>, bytes=0x7f65cbe0dd10) at fio.h:619
>>>>    2 Thread 0x7f65cb40c700 (LWP 8555)  0x0000003c69800667 in io_submit ()
>>>>    from /lib64/libaio.so.1
>>>> * 1 Thread 0x7f65e34c6720 (LWP 8530)  0x0000003aa4cacced in nanosleep ()
>>> >from /lib64/libc.so.6
>>>> (gdb) bt
>>>> #0  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>>>> #1  0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
>>>> #2  0x000000000044afe4 in do_usleep () at backend.c:1727
>>>> #3  run_threads () at backend.c:1965
>>>> #4  0x000000000044b84d in fio_backend () at backend.c:2068
>>>> #5  0x0000003aa4c1ed1d in __libc_start_main () from /lib64/libc.so.6
>>>> #6  0x0000000000409c39 in _start ()
>>>> (gdb) detach
>>>> Detaching from program: /usr/local/bin/fio, process 8530
>>>> (gdb) quit
>>>> [root@msablackburn saperf]#
>>>>
>>>> I am guessing the 8530 thread is the one that prints out the eta updates?
>>>> Well, maybe not, I see a call to create_disk_util_thread() in backend.c.
>>>>
>>>> i/o still seems to be running, and there are no messages in dmesg about
>>>> hung i/o or the scsi mid layer getting angry and trying to abort things
>>>> or anything like that.  Seems to be happily cruising along, except that
>>>> the eta isn't updating, and the job should have finished several hours
>>>> ago now.  I'm leaving it running for now just to try to figure out
>>>> what happened.
>>>>
>>>> Any ideas?  Anything else I should poke at?
>>>
>>> A bit more info:
>>>
>>> thread 26 looks to be the one that's supposed to print the eta info:
>>>
>>> (gdb) thread 26
>>> [Switching to thread 26 (Thread 0x7f65da431700 (LWP 8531))]#0
>>> 0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x0000003aa4cacced in nanosleep () from /lib64/libc.so.6
>>> #1  0x0000003aa4ce1e44 in usleep () from /lib64/libc.so.6
>>> #2  0x00000000004474a0 in disk_thread_main (data=<value optimized out>) at
>>> backend.c:1992
>>> #3  0x0000003aa50079d1 in start_thread () from /lib64/libpthread.so.0
>>> #4  0x0000003aa4ce8b5d in clone () from /lib64/libc.so.6
>>> (gdb)
>>>
>>> static void *disk_thread_main(void *data)
>>> {
>>>          int ret = 0;
>>>
>>>          fio_mutex_up(startup_mutex);
>>>
>>>          while (threads && !ret) {
>>>                  usleep(DISK_UTIL_MSEC * 1000);
>>>                  if (!threads)
>>>                          break;
>>>                  ret = update_io_ticks();
>>>
>>>                  if (!is_backend)
>>>                          print_thread_status();
>>>          }
>>>
>>>          fio_mutex_up(disk_thread_mutex);
>>>          return NULL;
>>> }
>>>
>>> that usleep() is where it's in nanosleep, I guess.
>>>
>> >from eta.c:
>>>
>>> void print_thread_status(void)
>>> {
>>>          struct jobs_eta *je;
>>>          size_t size;
>>>
>>>          je = get_jobs_eta(0, &size);
>>>          if (je)
>>>                  display_thread_status(je);
>>>
>>>          free(je);
>>> }
>>>
>>> Maybe that je is coming back false?  which is
>>> probably the return value of calc_thread_status() which, well,
>>> at a glance, I'm not sure what calc_thread_status() is doing.
>>
>> I'll take a look at this next week, been away at a conference since last
>> weekend.
>
> Ok.  Meantime, I had to reclaim the machine for testing, so I no longer
> have it just sitting there to debug, and I have not sseen the problem again
> that I know of.

Clearly a hardware issue :-)

-- 
Jens Axboe



  reply	other threads:[~2014-08-22 19:11 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-11 15:44 fio main thread got stuck over the weekend scameron
2014-08-11 16:04 ` scameron
2014-08-22 19:04   ` Jens Axboe
2014-08-22 19:09     ` scameron
2014-08-22 19:11       ` Jens Axboe [this message]
2014-12-12 20:32         ` Elliott, Robert (Server Storage)
2014-12-13  4:49           ` Jens Axboe
2014-12-15 17:33             ` Elliott, Robert (Server Storage)
2014-12-15 17:37               ` Jens Axboe
2014-12-15 19:39                 ` Elliott, Robert (Server Storage)
2014-12-15 20:12                   ` Jens Axboe
2014-12-15 20:31                     ` Jens Axboe
2014-12-15 20:49                       ` Jens Axboe
2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
2014-12-16  2:51                           ` Jens Axboe
2014-12-16 22:43                           ` Jens Axboe
2014-12-17  3:52                             ` Elliott, Robert (Server Storage)
2014-12-17  5:43                               ` Jens Axboe
2014-12-17 16:48                                 ` Elliott, Robert (Server Storage)
2014-12-17 17:27                                   ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53F795E0.3090806@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=fio@vger.kernel.org \
    --cc=scameron@beardog.cce.hp.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.