From: "Theodore Ts'o" <tytso@mit.edu>
To: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>,
linux-ext4@vger.kernel.org, fstests@vger.kernel.org,
tarasov@vasily.name
Subject: Re: Test generic/299 stalling forever
Date: Tue, 18 Oct 2016 14:01:07 -0400 [thread overview]
Message-ID: <20161018180107.fscbfm66yidwhey4@thunk.org> (raw)
In-Reply-To: <20161013231923.j2fidfbtzdp66x3t@thunk.org>
Jens, ping?
Have you had a chance to take a look at the stack traces? Is there
any other debugging information I can get for you?
Thanks,
- Ted
On Thu, Oct 13, 2016 at 07:19:23PM -0400, Theodore Ts'o wrote:
> On Wed, Oct 12, 2016 at 08:39:57PM -0600, Jens Axboe wrote:
> >
> > If I just wanted to invoke generic/299 with a setup similar to yours,
> > what should I do? I can attempt to reproduce it.
>
> Well, if you're willing to get a Google Compute Engine account (it
> will come with a $300 free trial credit, which is plenty for this
> exercise), the simplest way to do an exact reproduction would be to
> follow the instructions here:
>
> https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
>
> If you use the current default gce-xfstests image, which is
> xfstests-201610131131 in the xfstests-cloud project, what I run which
> triggers it quite reliably is:
>
> gce-xfstests -C 10 generic/299
>
> On a run that I just did using a 4.8.0+ kernel, it ran 12 times (10
> times with the 4k configuration, and then 2 times with the 1k
> configuration) before it hung.
>
> FSTESTIMG: xfstests-cloud/xfstests-201610131131
> FSTESTVER: e2fsprogs v1.43.3-30-g8df85fb (Sun, 4 Sep 2016 21:32:35 -0400)
> FSTESTVER: fio fio-2.14-27-gafd2cef (Wed, 12 Oct 2016 08:59:25 -0600)
> FSTESTVER: quota 81aca5c (Tue, 12 Jul 2016 16:15:45 +0200)
> FSTESTVER: xfsprogs v4.5.0 (Tue, 15 Mar 2016 15:25:56 +1100)
> FSTESTVER: xfstests-bld de1a881 (Fri, 30 Sep 2016 18:23:15 -0400)
> FSTESTVER: xfstests linux-v3.8-1200-g218a62d (Thu, 29 Sep 2016 23:30:27 -0400)
> FSTESTVER: kernel 4.8.0-ext4-14234-gfb44543 #21 SMP Wed Oct 12 23:40:33 EDT 2016 x86_64
>
> So on a system which was hung in this way, I ran "apt-get update ;
> apt-get install gdb", and uploaded the build tree (plus source code
> for the fio used to build the image) to /root/fio.
>
> root@xfstests-201610131148:~# ps axm -o pid,tid,class,pri,psr,pcpu,stat,wchan:20,comm
> PID TID CLS PRI PSR %CPU STAT WCHAN COMMAND
> ....
> 19781 - - - - 1.6 - - fio
> - 19781 TS 19 0 1.6 Ssl hrtimer_nanosleep -
> - 19800 TS 19 0 0.0 Ssl futex_wait_queue_me -
> - 19801 TS 19 0 0.0 Ssl futex_wait_queue_me -
> - 19802 TS 19 0 0.0 Ssl futex_wait_queue_me -
> - 19803 TS 19 0 0.0 Ssl futex_wait_queue_me -
> ....
> root@xfstests-201610131148:~# gdb /root/fio/fio
> ....
> (gdb) dir /root/fio
> Source directories searched: /root/fio:$cdir:$cwd
> (gdb) attach 19781
> Attaching to program: /root/fio/fio, process 19781
> ....
> (gdb) thread apply all bt
>
> Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> Thread 1 (Thread 0x7fb3de190740 (LWP 19781)):
> #0 0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
> #1 0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
> #2 0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738
> #3 0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268
> #4 0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400
> #5 0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65
>
> (gdb) thread apply all bt full
>
> Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> list = {next = 0x7fb3ddfeced0, prev = 0x7fb3ddfeced0}
> td = 0x7fb3c7590610
> io_u = 0x202fac0
> ret = 0
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309
> __res = <optimized out>
> pd = 0x7fb3ddfed700
> now = <optimized out>
> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500339456, -2782054375790187915, 0, 140410502103136, 140410120428088,
> 140410500339456, 2739195402349996661, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
> data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
> not_first_call = <optimized out>
> pagesize_m1 = <optimized out>
> sp = <optimized out>
> freesize = <optimized out>
> __PRETTY_FUNCTION__ = "start_thread"
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
>
> Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> list = {next = 0x7fb3ddfe4ed0, prev = 0x7fb3ddfe4ed0}
> td = 0x7fb3c7590610
> io_u = 0x203d280
> ret = 0
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309
> __res = <optimized out>
> pd = 0x7fb3ddfe5700
> now = <optimized out>
> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500306688, -2782054375790187915, 0, 140410502103136, 140410120428088,
> 140410500306688, 2739195406644963957, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
> data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
> not_first_call = <optimized out>
> pagesize_m1 = <optimized out>
> sp = <optimized out>
> freesize = <optimized out>
> __PRETTY_FUNCTION__ = "start_thread"
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
>
> Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)):
> ---Type <return> to continue, or q <return> to quit---
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> list = {next = 0x7fb3ddfdced0, prev = 0x7fb3ddfdced0}
> td = 0x7fb3c7590610
> io_u = 0x203cf80
> ret = 0
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309
> __res = <optimized out>
> pd = 0x7fb3ddfdd700
> now = <optimized out>
> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500273920, -2782054375790187915, 0, 140410502103136, 140410120428088,
> 140410500273920, 2739195393760062069, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
> data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
> not_first_call = <optimized out>
> pagesize_m1 = <optimized out>
> sp = <optimized out>
> freesize = <optimized out>
> __PRETTY_FUNCTION__ = "start_thread"
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
>
> Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> list = {next = 0x7fb3ddfd4ed0, prev = 0x7fb3ddfd4ed0}
> td = 0x7fb3c7590610
> io_u = 0x203ccc0
> ret = 0
> #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309
> __res = <optimized out>
> pd = 0x7fb3ddfd5700
> now = <optimized out>
> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500241152, -2782054375790187915, 0, 140410502103136, 140410120428088,
> 140410500241152, 2739195398055029365, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
> data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
> not_first_call = <optimized out>
> pagesize_m1 = <optimized out>
> sp = <optimized out>
> freesize = <optimized out>
> __PRETTY_FUNCTION__ = "start_thread"
> #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
>
> Thread 1 (Thread 0x7fb3de190740 (LWP 19781)):
> #0 0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
> No locals.
> ---Type <return> to continue, or q <return> to quit---
> #1 0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
> ts = {tv_sec = 0, tv_nsec = 1000000}
> #2 0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738
> verify_bytes = 419209216
> elapsed_us = {0, 0, 0}
> td = 0x7fb3c7590610
> sk_out = 0x2039600
> clear_state = 33789440
> #3 0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268
> ret = <optimized out>
> pid = <optimized out>
> this_jobs = 6
> fd = 0x2029720
> map = {0x7fb3c754c000, 0x7fb3c7559ad0, 0x7fb3c75675a0, 0x7fb3c7575070, 0x7fb3c7582b40, 0x7fb3c7590610, 0x0 <repeats 11 times>,
> 0x10000000, 0x0 <repeats 771 times>, 0x7fb3ddf83c1c <check_match+300>, 0x0, 0xd827524, 0x5, 0x7, 0x7fb3de192040,
> 0x7fb3ddf8458e <do_lookup_x+2334>, 0x0, 0x7ffd24b8ea80, 0x7fb3dd2a9d28, 0x7fb3dd2af020, 0x7ffd24b8eb90, 0x3609d4,
> 0x7ffd24b8eb80, 0x0, 0x0, 0x7fb3de1920c0, 0x7fb3de1939d8, 0x406851, 0x7fb3dd2b6d78, 0x401e18, 0x100000000, 0x100000375, 0x0,
> 0x7fb3de1920c0, 0x7ffd24b8ec20, 0x7fb3de19d500, 0x7ffd24b8ec48, 0x7fb3de19d1a8, 0x1,
> 0x7fb3ddf8473d <_dl_lookup_symbol_x+349>, 0x0, 0x7fb3de1920c0, 0x1, 0x0, 0x1, 0x7fb3de19d1a8, 0x0, 0x0, 0x0, 0x0, 0x0,
> 0x7fb3de19d500, 0x7ffd24b8eb90, 0x7ffd24b8eb80, 0xd827524, 0x406851, 0xffffffff, 0x0, 0x7fb3dd2af020, 0x7fb3de1939d8,
> 0x0 <repeats 11 times>, 0x698620 <memcmp@got.plt>, 0x0, 0x7ffd24b8ecf8, 0x7ffd24b8ecf8, 0x3, 0x7ffd24b8f6f0,
> 0x7fb3ddf88ecc <_dl_fixup+508>, 0x9000000000001, 0x0, 0x8ff5c28f5c28f, 0x7fb3dd2af020, 0x140, 0x7ffd24b8f6f0,
> 0x7ffd24b91720, 0x7fb3ddf8f2e5 <_dl_runtime_resolve+53>, 0x0, 0x20262a0, 0x2bc4, 0x7ffd24b92530, 0x7ffd24b8f960,
> 0x8ff5c28f5c200, 0x2026270, 0x7fb3de19d1a8, 0x54, 0x46144b <fio_test_cconv+3595>, 0x0, 0x7ffd24b8ece0,
> 0x0 <repeats 16 times>, 0x2026210, 0x0, 0x2026270, 0x0, 0x1, 0x40000000400, 0x1, 0x0, 0x1, 0x100000001, 0x100000000, 0x0,
> 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x100000001000, 0x1000, 0x0 <repeats 14 times>, 0x1, 0x0, 0x100000000, 0x100000001, 0x0, 0x0,
> 0x0, 0x100000001, 0x400, 0x0 <repeats 87 times>, 0x100000000, 0x100000000, 0x0, 0x89, 0x0 <repeats 20 times>, 0x6400000000,
> 0x6400000064, 0x200000, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1f400000000, 0x1000001f4...}
> this_start = {tv_sec = 0, tv_usec = 0}
> left = <optimized out>
> td = 0x7fb3c7590610
> i = 5
> todo = 6
> nr_running = 0
> m_rate = 0
> t_rate = 0
> nr_started = 6
> spent = <optimized out>
> #4 0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400
> td = <optimized out>
> i = <optimized out>
> #5 0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65
> ret = 1
>
> > > As I said, I can attach to the hanging fio using a gdb and give you
> > > stackdumps for all of the threads if that would be helpful.
> >
> > That would definitely be helpful as well, especially if I cannot
> > reproduce it.
>
> Here you go!
>
> If I kill -9 the fio process, the tests continue to run, and there are
> no kernel complaints, so I'm pretty sure this is a fio issue.
>
> Many thanks!!
>
> - Ted
>
> P.S. In answer to your question from another e-mail message on this
> thread, the block device in volved is a Google Compute Engine
> Persistent Disk device backed using an SSD. To a very *rough*
> approximation, think about something kinda like a qemu qcow image on
> an very fast flash device in terms of performance. I've seen
> the problem before on a 5400 RPM laptop drive, but it's ***much***
> rarer. It reproduces much more easily faster block devices.
>
> P.P.S. And here's the fio job file that it was running:
>
> ###########
> # 299 test fio activity
> # Filenames derived from jobsname and jobid like follows:
> # ..
> [global]
> ioengine=libaio
> bs=128k
> directory=/xt-vdc
> filesize=5368709120
> size=999G
> iodepth=128*1
> continue_on_error=write
> ignore_error=,ENOSPC
> error_dump=0
> create_on_open=1
> fallocate=none
> exitall=1
>
> ## Perform direct aio, to files which may be truncated
> ## by external task
> [direct_aio]
> direct=1
> buffered=0
> numjobs=4
> rw=randwrite
> runtime=100*1
> time_based
>
> # Perform direct aio and verify data
> # This test case should check use-after-free issues
> [aio-dio-verifier]
> numjobs=1
> verify=crc32c-intel
> verify_fatal=1
> verify_dump=1
> verify_backlog=1024
> verify_async=4
> verifysort=1
> direct=1
> bs=4k
> rw=randrw
> filename=aio-dio-verifier
>
> # Perform buffered aio and verify data
> # This test case should check use-after-free issues
> [buffered-aio-verifier]
> numjobs=1
> verify=crc32c-intel
> verify_fatal=1
> verify_dump=1
> verify_backlog=1024
> verify_async=4
> verifysort=1
> direct=0
> buffered=1
> bs=4k
> rw=randrw
> filename=buffered-aio-verifier
next prev parent reply other threads:[~2016-10-18 18:01 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-06-18 15:53 Test generic/299 stalling forever Theodore Ts'o
2015-06-18 16:25 ` Eric Whitney
2015-06-18 23:34 ` Dave Chinner
2015-06-19 2:56 ` Theodore Ts'o
2016-09-29 4:37 ` Theodore Ts'o
2016-10-12 15:46 ` Jens Axboe
2016-10-12 15:46 ` Jens Axboe
2016-10-12 21:14 ` Dave Chinner
2016-10-12 21:19 ` Jens Axboe
2016-10-12 21:19 ` Jens Axboe
2016-10-13 2:15 ` Theodore Ts'o
2016-10-13 2:39 ` Jens Axboe
2016-10-13 2:39 ` Jens Axboe
2016-10-13 23:19 ` Theodore Ts'o
2016-10-18 18:01 ` Theodore Ts'o [this message]
2016-10-19 14:06 ` Jens Axboe
2016-10-19 14:06 ` Jens Axboe
2016-10-19 17:49 ` Jens Axboe
2016-10-19 17:49 ` Jens Axboe
2016-10-19 20:32 ` Theodore Ts'o
2016-10-20 14:22 ` Jens Axboe
2016-10-20 14:22 ` Jens Axboe
2016-10-21 22:15 ` Theodore Ts'o
2016-10-23 2:02 ` Theodore Ts'o
2016-10-23 14:32 ` Jens Axboe
2016-10-23 14:32 ` Jens Axboe
2016-10-23 19:33 ` Theodore Ts'o
2016-10-23 21:24 ` Theodore Ts'o
2016-10-24 1:41 ` Jens Axboe
2016-10-24 1:41 ` Jens Axboe
2016-10-24 3:38 ` Theodore Ts'o
2016-10-24 16:28 ` Jens Axboe
2016-10-24 16:28 ` Jens Axboe
2016-10-25 2:54 ` Theodore Ts'o
2016-10-25 2:59 ` Jens Axboe
2016-10-25 2:59 ` Jens Axboe
2016-10-13 13:08 ` Anatoly Pugachev
2016-10-13 13:36 ` Anatoly Pugachev
2016-10-13 14:28 ` Jens Axboe
2016-10-13 14:28 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20161018180107.fscbfm66yidwhey4@thunk.org \
--to=tytso@mit.edu \
--cc=axboe@fb.com \
--cc=david@fromorbit.com \
--cc=fstests@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=tarasov@vasily.name \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.