Flexible I/O Tester development
* core dump / segfault after 48 hour run
@ 2013-09-30 13:04 Roger Sibert
  2013-09-30 16:07 ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Roger Sibert @ 2013-09-30 13:04 UTC (permalink / raw)
  To: FIO

Hello Everyone,

I was looking to use fio to run full-disk writes to an SSD after doing
a secure erase, to see how long it takes before the performance
stabilizes.  After about 48 hours, give or take, I see this on the
screen.

B2-058:~/longtermruntime # ./fio.64bit.static longtermruntime-192h.fio
seqwrite-phase: (g=0): rw=write, bs=512K-512K/512K-512K/512K-512K, ioengine=libaio, iodepth=16
fio-2.1.2-15-gd5603
Starting 1 process
fio: pid=6895, got signal=11ne] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 06d:07h:05m:31s]

seqwrite-phase: (groupid=0, jobs=1): err= 0: pid=6895: Sun Sep 29 03:40:38 2013
    lat (usec) : 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=99.15%
    lat (msec) : 100=0.56%, 250=0.28%, 500=0.01%, 750=0.01%
  cpu          : usr=0.00%, sys=0.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=67108865/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=0KB, aggrb=0KB/s, minb=0KB/s, maxb=0KB/s, mint=144006511329msec, maxt=144006511329msec

Disk stats (read/write):
  sdb: ios=0/67108865, merge=0/0, ticks=0/2354077568, in_queue=2353971492, util=100.00%
fio: file hash not empty on exit

I took a look at one of the core files

B2-057:~/longtermruntime # gdb core core
GNU gdb (GDB) SUSE (7.0-0.4.16)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/root/longtermruntime/core": not in executable format: File format not recognized
Missing separate debuginfo for the main executable file
Try: zypper install -C "debuginfo(build-id)=559375f8a046f376897b4923007bff5b07ecd8d4"
Core was generated by `./fio.64bit.static longtermruntime-216h.fio'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000040a6c9 in ?? ()

Is there anything else I can do with gdb to pull out more debug
information before restarting/retasking these systems?  My gdb skills
aren't that great.


Systems are running SLES 11 SP1 using a statically compiled version of fio
(fio-2.1.2-15-gd5603).

Job file looks similar to this.

; writes 512k verification blocks until the disk is full,
[global]
bs=512k
direct=1
ioengine=libaio
iodepth=16
filename=/dev/sdb   ; or use a full disk, for example /dev/sda
runtime=216h
time_based

[seqwrite-phase]
stonewall
rw=write
fill_device=1
write_bw_log=sdc-iodepth16-seqwrite-bs512k-216h
write_lat_log=sdc-iodepth16-seqwrite-bs512k-216h
write_iops_log=sdc-iodepth16-seqwrite-bs512k-216h

Thanks,
Roger


* Re: core dump / segfault after 48 hour run
  2013-09-30 13:04 core dump / segfault after 48 hour run Roger Sibert
@ 2013-09-30 16:07 ` Jens Axboe
  2013-09-30 16:17   ` Roger Sibert
  2013-09-30 16:20   ` Roger Sibert
  0 siblings, 2 replies; 6+ messages in thread
From: Jens Axboe @ 2013-09-30 16:07 UTC (permalink / raw)
  To: Roger Sibert; +Cc: FIO

I know it's a pain to reproduce (especially after a 48h run), but could
you edit the Makefile, remove the -O3 from OPTFLAGS, run make clean and
make all, and then reproduce? The core files will then be of more use.
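
As a rough illustration of the Makefile edit described above, here is a small Python sketch that drops -O3 from an OPTFLAGS assignment. It assumes the Makefile contains a line of the form "OPTFLAGS= ... -O3 ..."; the helper name strip_o3 is made up for this example, and editing the file by hand works just as well.

```python
import re


def strip_o3(makefile_text: str) -> str:
    """Remove -O3 from any OPTFLAGS assignment, leaving other lines alone."""
    out = []
    for line in makefile_text.splitlines():
        # Only touch lines that assign OPTFLAGS (=, :=, += or ?=).
        if re.match(r"\s*OPTFLAGS\s*[:+?]?=", line):
            # Drop the -O3 token together with the whitespace before it.
            line = re.sub(r"\s*-O3\b", "", line)
        out.append(line)
    trailing = "\n" if makefile_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

After writing the result back to the Makefile, the rebuild is the usual "make clean" followed by "make all".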

For the core files you have now, try and do a 'bt' when you open them so
I can see a backtrace. That might be enough to see what is going on.

-- 
Jens Axboe



* Re: core dump / segfault after 48 hour run
  2013-09-30 16:07 ` Jens Axboe
@ 2013-09-30 16:17   ` Roger Sibert
  2013-09-30 16:20   ` Roger Sibert
  1 sibling, 0 replies; 6+ messages in thread
From: Roger Sibert @ 2013-09-30 16:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: FIO

Hello Jens,

I should be able to do the rebuild and rerun, and have the results in a
few days, give or take.

Here are the results of the bt.

#0  0x000000000040a6c9 in ?? ()
(gdb) bt
#0  0x000000000040a6c9 in ?? ()
#1  0x0000000000891f20 in ?? ()
#2  0x0000000008d2d7f5 in ?? ()
#3  0x00007fe500080000 in ?? ()
#4  0x00007fe500000001 in ?? ()
#5  0x00007fe5e312b000 in ?? ()
#6  0x00000000008929a0 in ?? ()
#7  0x0000000000870eb0 in ?? ()
#8  0x0000000000440b05 in ?? ()
#9  0x00007fff4efcfb60 in ?? ()
#10 0x0000000000000008 in ?? ()
#11 0x000000000010bac4 in ?? ()
#12 0x00000000000a8c7c in ?? ()
#13 0x00007fe5e312b000 in ?? ()
#14 0x00007fe5e312b000 in ?? ()
#15 0x0000000000891f20 in ?? ()
#16 0x00007fe5e312b000 in ?? ()
#17 0x00007fe5e312ffa0 in ?? ()
#18 0x00007fe5e312ffb0 in ?? ()
#19 0x00007fe5e312fcb8 in ?? ()
#20 0x0000000000405385 in ?? ()
#21 0x00000000005388bb in ?? ()
#22 0x00000000005388cb in ?? ()
#23 0x0000000000539fe2 in ?? ()
#24 0x0000000000000000 in ?? ()

Thanks,
Roger



* Re: core dump / segfault after 48 hour run
  2013-09-30 16:07 ` Jens Axboe
  2013-09-30 16:17   ` Roger Sibert
@ 2013-09-30 16:20   ` Roger Sibert
  2013-09-30 18:13     ` Jens Axboe
  1 sibling, 1 reply; 6+ messages in thread
From: Roger Sibert @ 2013-09-30 16:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: FIO

Let me try that again...  My gdb skills may be bad, but that doesn't mean
I shouldn't have recognized that I was missing something.

I changed how I invoked gdb on the core file (passing the executable as
well), which should give you what you were actually asking for.

B2-057:~/longtermruntime # gdb ./fio.64bit.static ./core
GNU gdb (GDB) SUSE (7.0-0.4.16)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/longtermruntime/fio.64bit.static...done.

warning: core file may not match specified executable file.
Core was generated by `./fio.64bit.static longtermruntime-216h.fio'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000040a6c9 in __add_log_sample (iolog=0x872510, val=62, ddir=<value optimized out>, bs=<value optimized out>, t=<value optimized out>) at stat.c:1517
1517    stat.c: No such file or directory.
        in stat.c
(gdb) bt
#0  0x000000000040a6c9 in __add_log_sample (iolog=0x872510, val=62, ddir=<value optimized out>, bs=<value optimized out>, t=<value optimized out>) at stat.c:1517
#1  0x0000000000440b05 in fio_libaio_queued (nr=1, io_us=0x8929a0, td=0x7fe5e312b000) at engines/libaio.c:199
#2  fio_libaio_commit (nr=1, io_us=0x8929a0, td=0x7fe5e312b000) at engines/libaio.c:218
#3  0x0000000000405385 in td_io_commit (td=0x7fe5e312b000) at ioengines.c:379
#4  0x000000000040572a in td_io_queue (td=0x7fe5e312b000, io_u=0x891f20) at ioengines.c:329
#5  0x000000000043692f in do_io (td=0x7fe5e312b000) at backend.c:701
#6  thread_main (td=0x7fe5e312b000) at backend.c:1314
#7  0x0000000000438447 in fork_main (offset=0, shmid=<value optimized out>) at backend.c:1464
#8  run_threads (offset=0, shmid=<value optimized out>) at backend.c:1726
#9  0x000000000043889d in fio_backend () at backend.c:1912
#10 0x00000000004702a4 in __libc_start_main ()
#11 0x0000000000000000 in ?? ()
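
For anyone collecting backtraces from several core files, the invocation above (executable first, core file second) can be scripted. The following is a minimal Python sketch, assuming gdb is on the PATH; the function names are hypothetical, and only the standard gdb options -batch and -ex are used.

```python
import subprocess
from typing import List


def gdb_backtrace_cmd(executable: str, core: str) -> List[str]:
    """Build a gdb command line that prints a backtrace and exits."""
    # -batch: exit once the commands have run; -ex: run one gdb command.
    # The executable must come before the core file so symbols resolve.
    return ["gdb", "-batch", "-ex", "bt", executable, core]


def backtrace(executable: str, core: str) -> str:
    """Run gdb and return its combined stdout/stderr as text."""
    result = subprocess.run(
        gdb_backtrace_cmd(executable, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Calling backtrace("./fio.64bit.static", "./core") would run the same session as shown above, without the interactive prompt.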

Thanks,
Roger



* Re: core dump / segfault after 48 hour run
  2013-09-30 16:20   ` Roger Sibert
@ 2013-09-30 18:13     ` Jens Axboe
  2013-09-30 18:18       ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2013-09-30 18:13 UTC (permalink / raw)
  To: Roger Sibert; +Cc: FIO

OK, that helps a whole lot. My guess is that you ran out of memory.
Currently fio does not flush the log during the run; it just keeps
appending to it in memory and flushes at the end. This is done so as not
to disturb the actual data run, but it does mean that long runs can
gobble up a lot of memory...
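
To get a feel for the numbers, here is a back-of-the-envelope sketch in Python. The write count is taken from the "issued" line of the original report; the 24-byte sample size, and the idea that each of three logs holds one entry per I/O, are illustrative assumptions rather than values taken from the fio source.

```python
# Rough estimate of in-memory log growth for a long fio run.
# ASSUMPTIONS (not from the thread): 24 bytes per sample, one sample per
# I/O per log, and three logs (bw, lat, iops) as named in the job file.

ISSUED_WRITES = 67_108_865   # "issued: total=r=0/w=67108865" in the report
SAMPLE_BYTES = 24            # hypothetical size of one log entry
NUM_LOGS = 3                 # write_bw_log, write_lat_log, write_iops_log


def log_bytes(samples: int, sample_bytes: int = SAMPLE_BYTES,
              num_logs: int = NUM_LOGS) -> int:
    """Bytes needed to hold `samples` entries in each of `num_logs` logs."""
    return samples * sample_bytes * num_logs


def gib(n: int) -> float:
    """Convert a byte count to GiB."""
    return n / 2**30


if __name__ == "__main__":
    total = log_bytes(ISSUED_WRITES)
    print(f"samples per log: {ISSUED_WRITES} (2**26 = {2**26})")
    print(f"estimated log memory: {gib(total):.2f} GiB")
```

Under these assumptions the logs alone come to several GiB. The issued count is also exactly 2**26 + 1, which would fit a failure at the moment a buffer tried to grow past a power-of-two size, though nothing in this thread confirms that. If the fio version in use supports log_avg_msec, averaging samples over a time window is one way to cut the number of entries.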

I will commit something that is a little more defensive so we don't
actually segfault, just stop logging. Then we can look into handling it
better in the future.
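
The "stop logging instead of segfaulting" idea can be modelled in a few lines. This is a Python sketch of the general pattern only, not the actual change to fio: the class and attribute names are hypothetical, and a capacity limit stands in for a failed memory allocation.

```python
# Minimal model of an append-only sample log that grows by doubling and
# stops logging (instead of crashing) once it cannot grow any further.
# All names are hypothetical; this is not fio's code.


class SampleLog:
    def __init__(self, initial_capacity: int = 1024,
                 max_capacity: int = 1 << 20):
        self.capacity = initial_capacity   # slots currently reserved
        self.max_capacity = max_capacity   # stand-in for "allocation fails"
        self.samples = []                  # stored (time_ms, value) pairs
        self.disabled = False              # set once growth has failed

    def _grow(self) -> bool:
        """Double the capacity; return False if that exceeds the limit."""
        new_capacity = self.capacity * 2
        if new_capacity > self.max_capacity:
            return False
        self.capacity = new_capacity
        return True

    def add(self, time_ms: int, value: int) -> bool:
        """Append a sample; return False if logging has been disabled."""
        if self.disabled:
            return False
        if len(self.samples) == self.capacity and not self._grow():
            # Defensive path: keep what we have and stop logging, so the
            # I/O run itself can carry on.
            self.disabled = True
            return False
        self.samples.append((time_ms, value))
        return True
```

The caller can ignore the return value: once growth fails, later samples are silently dropped and the entries already collected remain available for writing out at the end of the run.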

-- 
Jens Axboe



* Re: core dump / segfault after 48 hour run
  2013-09-30 18:13     ` Jens Axboe
@ 2013-09-30 18:18       ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2013-09-30 18:18 UTC (permalink / raw)
  To: Roger Sibert; +Cc: FIO

I committed this:

http://git.kernel.dk/?p=fio.git;a=commit;h=3c568239a319087a965b06bc2ed94d058810100f

to handle the failure a bit more gracefully at least.

-- 
Jens Axboe

