All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: "Elliott, Robert (Server Storage)" <Elliott@hp.com>,
	"stephenmcameron@gmail.com" <stephenmcameron@gmail.com>
Cc: "fio@vger.kernel.org" <fio@vger.kernel.org>
Subject: Re: fio main thread got stuck over the weekend
Date: Mon, 15 Dec 2014 13:12:24 -0700	[thread overview]
Message-ID: <548F40A8.3010405@kernel.dk> (raw)
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295940F1BF@G4W3202.americas.hpqcorp.net>

On 12/15/2014 12:39 PM, Elliott, Robert (Server Storage) wrote:
> 
> 
>> it's just 18.9 minutes. Total runtime is summed after the fact. The one
>> I think is interesting is the first thread, that is the one that should
>> be printing eta stats. And it seems to be stuck:
>>
>>  >    212 Thread 0x7fa9a086e700 (LWP 6509)  0x0000003974c0b98e in
>> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>
>> can you do a bt on that? This looks like it _might_ be an issue that was
>> fixed recently with stats. Are you running fio with any output or status
>> options?
>>
> 
> No output options.

OK

> (gdb) thread 212
> [Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x0000000000449381 in helper_thread_main (data=<value optimized out>) at backend.c:2127
> #2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
> #3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
> (gdb) select-frame 1
> (gdb) info local
> ts = {tv_sec = 1418658662, tv_nsec = 980074000}
> tv = {tv_sec = 1418658662, tv_usec = 730074}
> ret = <value optimized out>
> (gdb) print helper_exit
> $36 = 0
> (gdb) print nr_thread
> $37 = 224
> (gdb) print nr_process
> $38 = 0

This is the helper thread, this is what will output the ETA etc stats.
There's a signal handler that runs and sets/wakes it to do so. So it
would be interesting if you could see what it is up to, set a break
point in there and check what the control flow ends up being.

It's backend.c:helper_thread_main()

backend.c:sig_show_status() should repeatedly fire and end up waking
this thread.

> 
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7fa9a99b1720 (LWP 6508))]#0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
> #1  0x00000039748e1e64 in usleep () from /lib64/libc.so.6
> #2  0x000000000044d019 in do_usleep () at backend.c:1841
> #3  run_threads () at backend.c:2083
> #4  0x000000000044d8ab in fio_backend () at backend.c:2199
> #5  0x000000397481ed1d in __libc_start_main () from /lib64/libc.so.6
> #6  0x000000000040a199 in _start ()
>  (gdb) select-frame 3
>  (gdb) info local
> td = <value optimized out>
> i = <value optimized out>
> todo = <value optimized out>
> nr_running = 210
> m_rate = 0
> t_rate = 0
> nr_started = <value optimized out>
> spent = <value optimized out>
> (gdb) print thread_number
> $39 = 224
> 
> so run_threads() is probably in this loop waiting for 210 threads to quit:
>         while (nr_running) {
>                 reap_threads(&nr_running, &t_rate, &m_rate);
>                 do_usleep(10000);
>         }

Yes, which is expected. Once all jobs have been started, it'll
continually sit in that loop and wait for them to finish.

> 
> Given:
> 
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> (gdb) print td->error
> No symbol "td" in current context.
> (gdb) bt
> #0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
> #1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
> #2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
> #3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e4015400) at ioengines.c:343
> #4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
> #5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
> #6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
> #7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
> (gdb) select-frame 5
> (gdb) print td->error
> $40 = 0
> (gdb) print td->terminate
> $41 = 0
> (gdb) print td->done
> $42 = 0
> (gdb) print td->o.time_based
> $43 = 1
> (gdb) print o->create_only
> $47 = 0

It's still actively running, correct? Not stuck in io_submit()?

> I guess the threads are not leaving this loop in thread_main(): 
>         while (keep_running(td)) {
>         ...
>         }
> 
> static int keep_running(struct thread_data *td)
> {
>         unsigned long long limit;
> 
>         if (td->done)
>                 return 0;
>         if (td->o.time_based)
>                 return 1;
> ...
> 
> After the runtime has elapsed, should td->done or td->terminate
> get set?

Nope, when the runtime is exceeded, the thread will notice by calling
runtime_exceeded() in do_io(). ->terminate is set if you do ctrl-c for
instance.

Are you seeing evidence that it's running for longer than it should?
From your job file:

> runtime=216000
> time_based=1

it should run for 2.5 days, then exit.

-- 
Jens Axboe



  reply	other threads:[~2014-12-15 20:12 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-11 15:44 fio main thread got stuck over the weekend scameron
2014-08-11 16:04 ` scameron
2014-08-22 19:04   ` Jens Axboe
2014-08-22 19:09     ` scameron
2014-08-22 19:11       ` Jens Axboe
2014-12-12 20:32         ` Elliott, Robert (Server Storage)
2014-12-13  4:49           ` Jens Axboe
2014-12-15 17:33             ` Elliott, Robert (Server Storage)
2014-12-15 17:37               ` Jens Axboe
2014-12-15 19:39                 ` Elliott, Robert (Server Storage)
2014-12-15 20:12                   ` Jens Axboe [this message]
2014-12-15 20:31                     ` Jens Axboe
2014-12-15 20:49                       ` Jens Axboe
2014-12-16  0:52                         ` Elliott, Robert (Server Storage)
2014-12-16  2:51                           ` Jens Axboe
2014-12-16 22:43                           ` Jens Axboe
2014-12-17  3:52                             ` Elliott, Robert (Server Storage)
2014-12-17  5:43                               ` Jens Axboe
2014-12-17 16:48                                 ` Elliott, Robert (Server Storage)
2014-12-17 17:27                                   ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=548F40A8.3010405@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=Elliott@hp.com \
    --cc=fio@vger.kernel.org \
    --cc=stephenmcameron@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.