* Test generic/299 stalling forever
From: Theodore Ts'o @ 2015-06-18 15:53 UTC
To: linux-ext4, fstests

I've been trying to figure out why generic/299 has occasionally been
stalling forever.  After taking a closer look, it appears the problem
is that the fio process is stalling in userspace.  Looking at the ps
listing, the fio process hasn't run in over six hours, and after
attaching strace to the fio process, it's stalled in a FUTEX_WAIT.

Has anyone else seen this?  I'm using fio 2.2.6, and I have a feeling
that I started seeing this when I started using a newer version of
fio.  So I'm going to try rolling back to an older version of fio and
see if that causes the problem to go away.

					- Ted
* Re: Test generic/299 stalling forever
From: Eric Whitney @ 2015-06-18 16:25 UTC
To: Theodore Ts'o; +Cc: linux-ext4, fstests

* Theodore Ts'o <tytso@mit.edu>:
> I've been trying to figure out why generic/299 has occasionally been
> stalling forever.  After taking a closer look, it appears the problem
> is that the fio process is stalling in userspace.  Looking at the ps
> listing, the fio process hasn't run in over six hours, and using
> attaching strace to the fio process, it's stalled in a FUTUEX_WAIT.
>
> Has anyone else seen this?  I'm using fio 2.2.6, and I have a feeling
> that I started seeing this when I started using a newer version of
> fio.  So I'm going to try roll back to an older version of fio and see
> if that causes the problem to go away.
>

I've not seen that, but have only been running 4.1-rc* kernels
recently.  I can say that I ran 1000 trials of generic/299 on
nojournal using a vanilla 4.1-rc6 kernel on a Pandaboard successfully
with fio 2.2.6.  No failures, and completed in ~61 hours for an
average test duration of ~3.7 minutes, which agrees well with what I
see when sampling the logged output for individual test times.

Eric
* Re: Test generic/299 stalling forever
From: Dave Chinner @ 2015-06-18 23:34 UTC
To: Theodore Ts'o; +Cc: linux-ext4, fstests

On Thu, Jun 18, 2015 at 11:53:37AM -0400, Theodore Ts'o wrote:
> I've been trying to figure out why generic/299 has occasionally been
> stalling forever.  After taking a closer look, it appears the problem
> is that the fio process is stalling in userspace.  Looking at the ps
> listing, the fio process hasn't run in over six hours, and using
> attaching strace to the fio process, it's stalled in a FUTUEX_WAIT.
>
> Has anyone else seen this?  I'm using fio 2.2.6, and I have a feeling
> that I started seeing this when I started using a newer version of
> fio.  So I'm going to try roll back to an older version of fio and see
> if that causes the problem to go away.

I'm running on fio 2.1.3 at the moment and I haven't seen any
problems like this for months.  Keep in mind that fio does tend to
break in strange ways fairly regularly, so I'd suggest an
upgrade/downgrade of fio as your first move.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Test generic/299 stalling forever
From: Theodore Ts'o @ 2015-06-19  2:56 UTC
To: Dave Chinner; +Cc: linux-ext4, fstests

On Fri, Jun 19, 2015 at 09:34:30AM +1000, Dave Chinner wrote:
> On Thu, Jun 18, 2015 at 11:53:37AM -0400, Theodore Ts'o wrote:
> > I've been trying to figure out why generic/299 has occasionally been
> > stalling forever.  After taking a closer look, it appears the problem
> > is that the fio process is stalling in userspace.  Looking at the ps
> > listing, the fio process hasn't run in over six hours, and using
> > attaching strace to the fio process, it's stalled in a FUTUEX_WAIT.
> >
> > Has anyone else seen this?  I'm using fio 2.2.6, and I have a feeling
> > that I started seeing this when I started using a newer version of
> > fio.  So I'm going to try roll back to an older version of fio and see
> > if that causes the problem to go away.
>
> I'm running on fio 2.1.3 at the moment and I havne't seen any
> problems like this for months.  Keep in mind that fio does tend to
> break in strange ways fairly regularly, so I'd suggest an
> upgrade/downgrade of fio as your first move.

Yep, I've rolled back to fio-2.1.13-77-g0f9940a, which is what I had
been using previously for a long time, and the problem with
generic/299 seems to have gone away.  I had hoped that the days when
new releases of fio would introduce random new bugs were a temporary
aberration, but obviously that was a bit over-optimistic.

					- Ted
* Re: Test generic/299 stalling forever
From: Theodore Ts'o @ 2016-09-29  4:37 UTC
To: Dave Chinner; +Cc: linux-ext4, fstests, tarasov, axboe

On Fri, Jun 19, 2015 at 09:34:30AM +1000, Dave Chinner wrote:
> On Thu, Jun 18, 2015 at 11:53:37AM -0400, Theodore Ts'o wrote:
> > I've been trying to figure out why generic/299 has occasionally been
> > stalling forever.  After taking a closer look, it appears the problem
> > is that the fio process is stalling in userspace.  Looking at the ps
> > listing, the fio process hasn't run in over six hours, and using
> > attaching strace to the fio process, it's stalled in a FUTUEX_WAIT.
> >
> > Has anyone else seen this?  I'm using fio 2.2.6, and I have a feeling
> > that I started seeing this when I started using a newer version of
> > fio.  So I'm going to try roll back to an older version of fio and see
> > if that causes the problem to go away.
>
> I'm running on fio 2.1.3 at the moment and I havne't seen any
> problems like this for months.  Keep in mind that fio does tend to
> break in strange ways fairly regularly, so I'd suggest an
> upgrade/downgrade of fio as your first move.

Out of curiosity, Dave, are you still using fio 2.1.3?  I had upgraded
to the latest fio to fix other test breaks, and I'm still seeing the
occasional generic/299 test failure.  In fact, it's been happening
often enough on one of my test platforms[1] that I decided to really
dig down and investigate it, and all of the threads were blocking on
td->verify_cond in fio's verify.c.

It bisected down to this commit:

commit e5437a073e658e8154b9e87bab5c7b3b06ed4255
Author: Vasily Tarasov <tarasov@vasily.name>
Date:   Sun Nov 9 20:22:24 2014 -0700

    Fix for a race when fio prints I/O statistics periodically

    Below is the demonstration for the latest code in git:
    ...

So generic/299 passes reliably with this commit's parent, and it fails
on this commit within a dozen tries or so.  The commit first landed in
fio 2.1.14, so this is consistent with Dave's report a year ago that
he was still using fio 2.1.3.

I haven't had time to do a deep analysis on what fio/verify.c does, or
on the above patch, but the good news is that when fio hangs, it's
just a userspace hang, so I can log into the machine and attach gdb to
the process.  The code in question isn't very well documented, so I'm
sending this out in the hopes that Jens and Vasily might see something
obvious, and because I'm curious whether anyone else has seen this
(since it seems to be a timing-related race in fio, it's likely a file
system independent issue).

Thanks,

					- Ted

[1] When running xfstests in a Google Compute Engine VM with an
SSD-backed Persistent Disk, using an n1-standard-2 machine type with a
recent kernel, testing with ext4, the command "gce-xfstests -C 100
generic/299" will hang within a dozen runs of the test, so -C 100 to
run the test a hundred times was definitely overkill --- in fact, fio
would usually hang after less than a half-dozen runs.

My bisecting technique (using the infrastructure at
https://github.com/tytso/xfstests-bld) was:

	./build-all --fio-only
	make tarball
	gce-xfstests --update-xfstests -C 100 generic/299

and then wait for an hour or so and see whether or not fio was
hanging, and then follow it up with "(cd fio ; git bisect good)" or
"(cd fio ; git bisect bad)" as appropriate.  I was using a Debian
jessie build chroot to compile fio and all of xfstests-bld.
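An aside on the shape of the problem this bisect points at: later in
the thread, a comment quoted from fio's backend.c describes a thread
waiting for the stat mutex while another thread waits on a rusage
semaphore that only the first thread can post.  The sketch below shows
that kind of circular wait in isolation.  It is purely illustrative:
the names (stat_lock, rusage_ack, stats_reporter, worker) are invented
here and are not fio's actual symbols, and, run as-is, the program
deadlocks by design after about two seconds.

/*
 * Illustration only -- hypothetical names, not fio's code.
 * A periodic stats reporter holds a shared lock while waiting for the
 * worker to acknowledge a rusage refresh; the worker, at the end of
 * its run, blocks on that same lock.  Neither side can make progress.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t rusage_ack;               /* worker posts after refreshing its stats */
static volatile int rusage_update_wanted;

static void *stats_reporter(void *arg)
{
	sleep(1);
	pthread_mutex_lock(&stat_lock);    /* serialize periodic stats printing */
	rusage_update_wanted = 1;
	sem_wait(&rusage_ack);             /* wait for the worker's acknowledgement */
	printf("periodic stats\n");
	pthread_mutex_unlock(&stat_lock);
	return NULL;
}

static void *worker(void *arg)
{
	/* The I/O loop would normally notice rusage_update_wanted and post
	 * rusage_ack; here the run is already over, so it never does. */
	sleep(2);
	pthread_mutex_lock(&stat_lock);    /* blocks forever: the reporter holds the
					      lock and is waiting for an ack that
					      only this thread could send */
	printf("final stats\n");
	pthread_mutex_unlock(&stat_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	sem_init(&rusage_ack, 0, 0);
	pthread_create(&a, NULL, stats_reporter, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(b, NULL);             /* never returns: see comments above */
	return 0;
}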
* Re: Test generic/299 stalling forever 2016-09-29 4:37 ` Theodore Ts'o @ 2016-10-12 15:46 ` Jens Axboe 2016-10-12 21:14 ` Dave Chinner 1 sibling, 0 replies; 29+ messages in thread From: Jens Axboe @ 2016-10-12 15:46 UTC (permalink / raw) To: Theodore Ts'o, Dave Chinner; +Cc: linux-ext4, fstests, tarasov On 09/28/2016 10:37 PM, Theodore Ts'o wrote: > On Fri, Jun 19, 2015 at 09:34:30AM +1000, Dave Chinner wrote: >> On Thu, Jun 18, 2015 at 11:53:37AM -0400, Theodore Ts'o wrote: >>> I've been trying to figure out why generic/299 has occasionally been >>> stalling forever. After taking a closer look, it appears the problem >>> is that the fio process is stalling in userspace. Looking at the ps >>> listing, the fio process hasn't run in over six hours, and using >>> attaching strace to the fio process, it's stalled in a FUTUEX_WAIT. >>> >>> Has anyone else seen this? I'm using fio 2.2.6, and I have a feeling >>> that I started seeing this when I started using a newer version of >>> fio. So I'm going to try roll back to an older version of fio and see >>> if that causes the problem to go away. >> >> I'm running on fio 2.1.3 at the moment and I havne't seen any >> problems like this for months. Keep in mind that fio does tend to >> break in strange ways fairly regularly, so I'd suggest an >> upgrade/downgrade of fio as your first move. > > Out of curiosity, Dave, are you still using fio 2.1.3? I had upgraded > to the latest fio to fix other test breaks, and I'm stil seeing the > occasional generic/299 test failure. In fact, it's been happening > often enough on one of my test platforms[1] that I decided to really > dig down and investigate it, and all of the threads were blocking on > td->verify_cond in fio's verify.c. > > It bisected down to this commit: > > commit e5437a073e658e8154b9e87bab5c7b3b06ed4255 > Author: Vasily Tarasov <tarasov@vasily.name> > Date: Sun Nov 9 20:22:24 2014 -0700 > > Fix for a race when fio prints I/O statistics periodically > > Below is the demonstration for the latest code in git: > ... > > So generic/299 passes reliably with this commits parent, and it fails > on this commit within a dozen tries or so. The commit first landed in > fio 2.1.14, so it's consistent with Dave's report a year ago he was > still using fio 2.1.3. > > I haven't had time to do a deep analysis on what fio/verify.c does, or > the above patch, but the good news is that when fio hangs, it's just a > userspace hang, so I can log into machine and attach a gdb to the > process. The code in question isn't very well documented, so I'm > sending this out in the hopes that Jens and Vasily might see something > obvious, and because I'm curious whether anyone else has seen this > (since it seems to be a timing-related race in fio, so it's likely a > file system independent issue). > > Thanks, > > - Ted > > [1] When running xfstests in a Google Compute Engine VM with a > SSD-backed Persistent disk, using a n1-standard-2 machine type with a > recent kernel testing with ext4, the command "gce-xfstests -C 100 > generic/299" will hang within a dozen runs of the test, so -C 100 to > run the test a hundred times was definitely overkill --- in fact > usually in fio would hang after less than a half-dozen runs. 
> > My bisecting technique (using the infrastructure at > https://github.com/tytso/xfstests-bld) was: > > ./build-all --fio-only > make tarball > gce-xfstests --update-xfstests -C 100 generic/299 > > and then wait for an hour or so and see whether or not fio was hanging > or not, and then follow it up with "(cd fio ; git bisect good)" or > "(cd fio ; git bisect bad)" as appropriate. I was using a Debian > jessie build chroot to compile fio and all of xfstests-bld. Sorry for being late to the party here, was away. This issue should be fixed in newer versions. Can you update to current -git and check that it works without hangs? -- Jens Axboe ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-09-29 4:37 ` Theodore Ts'o 2016-10-12 15:46 ` Jens Axboe @ 2016-10-12 21:14 ` Dave Chinner 2016-10-12 21:19 ` Jens Axboe 1 sibling, 1 reply; 29+ messages in thread From: Dave Chinner @ 2016-10-12 21:14 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4, fstests, tarasov, axboe On Thu, Sep 29, 2016 at 12:37:22AM -0400, Theodore Ts'o wrote: > On Fri, Jun 19, 2015 at 09:34:30AM +1000, Dave Chinner wrote: > > On Thu, Jun 18, 2015 at 11:53:37AM -0400, Theodore Ts'o wrote: > > > I've been trying to figure out why generic/299 has occasionally been > > > stalling forever. After taking a closer look, it appears the problem > > > is that the fio process is stalling in userspace. Looking at the ps > > > listing, the fio process hasn't run in over six hours, and using > > > attaching strace to the fio process, it's stalled in a FUTUEX_WAIT. > > > > > > Has anyone else seen this? I'm using fio 2.2.6, and I have a feeling > > > that I started seeing this when I started using a newer version of > > > fio. So I'm going to try roll back to an older version of fio and see > > > if that causes the problem to go away. > > > > I'm running on fio 2.1.3 at the moment and I havne't seen any > > problems like this for months. Keep in mind that fio does tend to > > break in strange ways fairly regularly, so I'd suggest an > > upgrade/downgrade of fio as your first move. > > Out of curiosity, Dave, are you still using fio 2.1.3? I had upgraded No. $ fio -v fio-2.1.11 $ > to the latest fio to fix other test breaks, and I'm stil seeing the > occasional generic/299 test failure. In fact, it's been happening > often enough on one of my test platforms[1] that I decided to really > dig down and investigate it, and all of the threads were blocking on > td->verify_cond in fio's verify.c. > > It bisected down to this commit: > > commit e5437a073e658e8154b9e87bab5c7b3b06ed4255 > Author: Vasily Tarasov <tarasov@vasily.name> > Date: Sun Nov 9 20:22:24 2014 -0700 > > Fix for a race when fio prints I/O statistics periodically > > Below is the demonstration for the latest code in git: > ... > > So generic/299 passes reliably with this commits parent, and it fails > on this commit within a dozen tries or so. The commit first landed in > fio 2.1.14, so it's consistent with Dave's report a year ago he was > still using fio 2.1.3. But I'm still not using a fio recent enough to hit this. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-12 21:14 ` Dave Chinner @ 2016-10-12 21:19 ` Jens Axboe 2016-10-13 2:15 ` Theodore Ts'o 0 siblings, 1 reply; 29+ messages in thread From: Jens Axboe @ 2016-10-12 21:19 UTC (permalink / raw) To: Dave Chinner, Theodore Ts'o; +Cc: linux-ext4, fstests, tarasov On 10/12/2016 03:14 PM, Dave Chinner wrote: > On Thu, Sep 29, 2016 at 12:37:22AM -0400, Theodore Ts'o wrote: >> On Fri, Jun 19, 2015 at 09:34:30AM +1000, Dave Chinner wrote: >>> On Thu, Jun 18, 2015 at 11:53:37AM -0400, Theodore Ts'o wrote: >>>> I've been trying to figure out why generic/299 has occasionally been >>>> stalling forever. After taking a closer look, it appears the problem >>>> is that the fio process is stalling in userspace. Looking at the ps >>>> listing, the fio process hasn't run in over six hours, and using >>>> attaching strace to the fio process, it's stalled in a FUTUEX_WAIT. >>>> >>>> Has anyone else seen this? I'm using fio 2.2.6, and I have a feeling >>>> that I started seeing this when I started using a newer version of >>>> fio. So I'm going to try roll back to an older version of fio and see >>>> if that causes the problem to go away. >>> >>> I'm running on fio 2.1.3 at the moment and I havne't seen any >>> problems like this for months. Keep in mind that fio does tend to >>> break in strange ways fairly regularly, so I'd suggest an >>> upgrade/downgrade of fio as your first move. >> >> Out of curiosity, Dave, are you still using fio 2.1.3? I had upgraded > > No. > > $ fio -v > fio-2.1.11 > $ > >> to the latest fio to fix other test breaks, and I'm stil seeing the >> occasional generic/299 test failure. In fact, it's been happening >> often enough on one of my test platforms[1] that I decided to really >> dig down and investigate it, and all of the threads were blocking on >> td->verify_cond in fio's verify.c. >> >> It bisected down to this commit: >> >> commit e5437a073e658e8154b9e87bab5c7b3b06ed4255 >> Author: Vasily Tarasov <tarasov@vasily.name> >> Date: Sun Nov 9 20:22:24 2014 -0700 >> >> Fix for a race when fio prints I/O statistics periodically >> >> Below is the demonstration for the latest code in git: >> ... >> >> So generic/299 passes reliably with this commits parent, and it fails >> on this commit within a dozen tries or so. The commit first landed in >> fio 2.1.14, so it's consistent with Dave's report a year ago he was >> still using fio 2.1.3. > > But I'm still not using a fio recent enough to hit this. FWIW, this is the commit that fixes it: commit 39d13e67ef1f4b327c68431f8daf033a03920117 Author: Jens Axboe <axboe@fb.com> Date: Fri Aug 26 14:39:30 2016 -0600 backend: check if we need to update rusage stats, if stat_mutex is busy 2.14 and newer should not have the problem, but earlier versions may depending on how old... -- Jens Axboe ^ permalink raw reply [flat|nested] 29+ messages in thread
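The one-line summary of that commit matches the loop quoted from
backend.c later in this thread: rather than blocking outright on a
busy stat_mutex, the worker polls it with a trylock and, between
attempts, services any pending rusage-update request, so that whoever
holds the mutex can finish.  Here is a rough sketch of that pattern,
continuing the hypothetical names from the earlier illustration; it is
not fio's actual implementation.

/* Sketch of the trylock-and-service pattern; hypothetical names. */
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t rusage_ack;
static volatile int rusage_update_wanted;

/* Refresh this thread's rusage numbers if the stats reporter asked. */
static void check_update_rusage_sketch(void)
{
	if (rusage_update_wanted) {
		rusage_update_wanted = 0;
		/* ... call getrusage() and stash the results ... */
		sem_post(&rusage_ack);     /* unblock the reporter */
	}
}

static void publish_final_stats(void)
{
	/*
	 * Don't block: the current holder of stat_lock may be waiting for
	 * *us* to acknowledge a rusage update.  Keep servicing that request
	 * while polling the lock, and proceed only once we actually own it.
	 */
	do {
		check_update_rusage_sketch();
		if (pthread_mutex_trylock(&stat_lock) == 0)
			break;
		usleep(1000);
	} while (1);

	/* ... update the final statistics ... */
	pthread_mutex_unlock(&stat_lock);
}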
* Re: Test generic/299 stalling forever
From: Theodore Ts'o @ 2016-10-13  2:15 UTC
To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov

On Wed, Oct 12, 2016 at 03:19:25PM -0600, Jens Axboe wrote:
>
> FWIW, this is the commit that fixes it:
>
> commit 39d13e67ef1f4b327c68431f8daf033a03920117
> Author: Jens Axboe <axboe@fb.com>
> Date:   Fri Aug 26 14:39:30 2016 -0600
>
>     backend: check if we need to update rusage stats, if stat_mutex is busy
>
> 2.14 and newer should not have the problem, but earlier versions may
> depending on how old...

Unfortunately, I'm still seeing hangs in generic/299 with the latest
version of fio:

	fio	fio-2.14-10-g0a301e9 (Fri, 23 Sep 2016 11:57:00 -0600)

If I use an older fio, it reliably does not hang.  What can I do to
help debug this?

As I said, I can attach to the hanging fio using a gdb and give you
stackdumps for all of the threads if that would be helpful.

					- Ted
* Re: Test generic/299 stalling forever
From: Jens Axboe @ 2016-10-13  2:39 UTC
To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov

On 10/12/2016 08:15 PM, Theodore Ts'o wrote:
> On Wed, Oct 12, 2016 at 03:19:25PM -0600, Jens Axboe wrote:
>>
>> FWIW, this is the commit that fixes it:
>>
>> commit 39d13e67ef1f4b327c68431f8daf033a03920117
>> Author: Jens Axboe <axboe@fb.com>
>> Date:   Fri Aug 26 14:39:30 2016 -0600
>>
>>     backend: check if we need to update rusage stats, if stat_mutex is busy
>>
>> 2.14 and newer should not have the problem, but earlier versions may
>> depending on how old...
>
> Unfortunately, I'm still seeing hangs in generic/299 with the latest
> version of fio:
>
> 	fio	fio-2.14-10-g0a301e9 (Fri, 23 Sep 2016 11:57:00 -0600)

Well, that's no good.

> If I use an older fio, it reliably does not hang.  What can I do to
> help debug this?

If I just wanted to invoke generic/299 with a setup similar to yours,
what should I do?  I can attempt to reproduce it.

> As I said, I can attach to the hanging fio using a gdb and give you
> stackdumps for all of the threads if that would be helpful.

That would definitely be helpful as well, especially if I cannot
reproduce it.

--
Jens Axboe
* Re: Test generic/299 stalling forever 2016-10-13 2:39 ` Jens Axboe @ 2016-10-13 23:19 ` Theodore Ts'o 2016-10-18 18:01 ` Theodore Ts'o 0 siblings, 1 reply; 29+ messages in thread From: Theodore Ts'o @ 2016-10-13 23:19 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On Wed, Oct 12, 2016 at 08:39:57PM -0600, Jens Axboe wrote: > > If I just wanted to invoke generic/299 with a setup similar to yours, > what should I do? I can attempt to reproduce it. Well, if you're willing to get a Google Compute Engine account (it will come with a $300 free trial credit, which is plenty for this exercise), the simplest way to do an exact reproduction would be to follow the instructions here: https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md If you use the current default gce-xfstests image, which is xfstests-201610131131 in the xfstests-cloud project, what I run which triggers it quite reliably is: gce-xfstests -C 10 generic/299 On a run that I just did using a 4.8.0+ kernel, it ran 12 times (10 times with the 4k configuration, and then 2 times with the 1k configuration) before it hung. FSTESTIMG: xfstests-cloud/xfstests-201610131131 FSTESTVER: e2fsprogs v1.43.3-30-g8df85fb (Sun, 4 Sep 2016 21:32:35 -0400) FSTESTVER: fio fio-2.14-27-gafd2cef (Wed, 12 Oct 2016 08:59:25 -0600) FSTESTVER: quota 81aca5c (Tue, 12 Jul 2016 16:15:45 +0200) FSTESTVER: xfsprogs v4.5.0 (Tue, 15 Mar 2016 15:25:56 +1100) FSTESTVER: xfstests-bld de1a881 (Fri, 30 Sep 2016 18:23:15 -0400) FSTESTVER: xfstests linux-v3.8-1200-g218a62d (Thu, 29 Sep 2016 23:30:27 -0400) FSTESTVER: kernel 4.8.0-ext4-14234-gfb44543 #21 SMP Wed Oct 12 23:40:33 EDT 2016 x86_64 So on a system which was hung in this way, I ran "apt-get update ; apt-get install gdb", and uploaded the build tree (plus source code for the fio used to build the image) to /root/fio. root@xfstests-201610131148:~# ps axm -o pid,tid,class,pri,psr,pcpu,stat,wchan:20,comm PID TID CLS PRI PSR %CPU STAT WCHAN COMMAND .... 19781 - - - - 1.6 - - fio - 19781 TS 19 0 1.6 Ssl hrtimer_nanosleep - - 19800 TS 19 0 0.0 Ssl futex_wait_queue_me - - 19801 TS 19 0 0.0 Ssl futex_wait_queue_me - - 19802 TS 19 0 0.0 Ssl futex_wait_queue_me - - 19803 TS 19 0 0.0 Ssl futex_wait_queue_me - .... root@xfstests-201610131148:~# gdb /root/fio/fio .... (gdb) dir /root/fio Source directories searched: /root/fio:$cdir:$cwd (gdb) attach 19781 Attaching to program: /root/fio/fio, process 19781 .... 
(gdb) thread apply all bt Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309 #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309 #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309 #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309 #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 1 (Thread 0x7fb3de190740 (LWP 19781)): #0 0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32 #2 0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738 #3 0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 #4 0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 #5 0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65 (gdb) thread apply all bt full Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 No locals. #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 list = {next = 0x7fb3ddfeced0, prev = 0x7fb3ddfeced0} td = 0x7fb3c7590610 io_u = 0x202fac0 ret = 0 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309 __res = <optimized out> pd = 0x7fb3ddfed700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500339456, -2782054375790187915, 0, 140410502103136, 140410120428088, 140410500339456, 2739195402349996661, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 No locals. Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 No locals. 
#1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 list = {next = 0x7fb3ddfe4ed0, prev = 0x7fb3ddfe4ed0} td = 0x7fb3c7590610 io_u = 0x203d280 ret = 0 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309 __res = <optimized out> pd = 0x7fb3ddfe5700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500306688, -2782054375790187915, 0, 140410502103136, 140410120428088, 140410500306688, 2739195406644963957, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 No locals. Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)): ---Type <return> to continue, or q <return> to quit--- #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 No locals. #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 list = {next = 0x7fb3ddfdced0, prev = 0x7fb3ddfdced0} td = 0x7fb3c7590610 io_u = 0x203cf80 ret = 0 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309 __res = <optimized out> pd = 0x7fb3ddfdd700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500273920, -2782054375790187915, 0, 140410502103136, 140410120428088, 140410500273920, 2739195393760062069, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 No locals. Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 No locals. #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 list = {next = 0x7fb3ddfd4ed0, prev = 0x7fb3ddfd4ed0} td = 0x7fb3c7590610 io_u = 0x203ccc0 ret = 0 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309 __res = <optimized out> pd = 0x7fb3ddfd5700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500241152, -2782054375790187915, 0, 140410502103136, 140410120428088, 140410500241152, 2739195398055029365, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 No locals. Thread 1 (Thread 0x7fb3de190740 (LWP 19781)): #0 0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 No locals. 
---Type <return> to continue, or q <return> to quit--- #1 0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32 ts = {tv_sec = 0, tv_nsec = 1000000} #2 0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738 verify_bytes = 419209216 elapsed_us = {0, 0, 0} td = 0x7fb3c7590610 sk_out = 0x2039600 clear_state = 33789440 #3 0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 ret = <optimized out> pid = <optimized out> this_jobs = 6 fd = 0x2029720 map = {0x7fb3c754c000, 0x7fb3c7559ad0, 0x7fb3c75675a0, 0x7fb3c7575070, 0x7fb3c7582b40, 0x7fb3c7590610, 0x0 <repeats 11 times>, 0x10000000, 0x0 <repeats 771 times>, 0x7fb3ddf83c1c <check_match+300>, 0x0, 0xd827524, 0x5, 0x7, 0x7fb3de192040, 0x7fb3ddf8458e <do_lookup_x+2334>, 0x0, 0x7ffd24b8ea80, 0x7fb3dd2a9d28, 0x7fb3dd2af020, 0x7ffd24b8eb90, 0x3609d4, 0x7ffd24b8eb80, 0x0, 0x0, 0x7fb3de1920c0, 0x7fb3de1939d8, 0x406851, 0x7fb3dd2b6d78, 0x401e18, 0x100000000, 0x100000375, 0x0, 0x7fb3de1920c0, 0x7ffd24b8ec20, 0x7fb3de19d500, 0x7ffd24b8ec48, 0x7fb3de19d1a8, 0x1, 0x7fb3ddf8473d <_dl_lookup_symbol_x+349>, 0x0, 0x7fb3de1920c0, 0x1, 0x0, 0x1, 0x7fb3de19d1a8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7fb3de19d500, 0x7ffd24b8eb90, 0x7ffd24b8eb80, 0xd827524, 0x406851, 0xffffffff, 0x0, 0x7fb3dd2af020, 0x7fb3de1939d8, 0x0 <repeats 11 times>, 0x698620 <memcmp@got.plt>, 0x0, 0x7ffd24b8ecf8, 0x7ffd24b8ecf8, 0x3, 0x7ffd24b8f6f0, 0x7fb3ddf88ecc <_dl_fixup+508>, 0x9000000000001, 0x0, 0x8ff5c28f5c28f, 0x7fb3dd2af020, 0x140, 0x7ffd24b8f6f0, 0x7ffd24b91720, 0x7fb3ddf8f2e5 <_dl_runtime_resolve+53>, 0x0, 0x20262a0, 0x2bc4, 0x7ffd24b92530, 0x7ffd24b8f960, 0x8ff5c28f5c200, 0x2026270, 0x7fb3de19d1a8, 0x54, 0x46144b <fio_test_cconv+3595>, 0x0, 0x7ffd24b8ece0, 0x0 <repeats 16 times>, 0x2026210, 0x0, 0x2026270, 0x0, 0x1, 0x40000000400, 0x1, 0x0, 0x1, 0x100000001, 0x100000000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x100000001000, 0x1000, 0x0 <repeats 14 times>, 0x1, 0x0, 0x100000000, 0x100000001, 0x0, 0x0, 0x0, 0x100000001, 0x400, 0x0 <repeats 87 times>, 0x100000000, 0x100000000, 0x0, 0x89, 0x0 <repeats 20 times>, 0x6400000000, 0x6400000064, 0x200000, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1f400000000, 0x1000001f4...} this_start = {tv_sec = 0, tv_usec = 0} left = <optimized out> td = 0x7fb3c7590610 i = 5 todo = 6 nr_running = 0 m_rate = 0 t_rate = 0 nr_started = 6 spent = <optimized out> #4 0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 td = <optimized out> i = <optimized out> #5 0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65 ret = 1 > > As I said, I can attach to the hanging fio using a gdb and give you > > stackdumps for all of the threads if that would be helpful. > > That would definitely be helpful as well, especially if I cannot > reproduce it. Here you go! If I kill -9 the fio process, the tests continue to run, and there are no kernel complaints, so I'm pretty sure this is a fio issue. Many thanks!! - Ted P.S. In answer to your question from another e-mail message on this thread, the block device in volved is a Google Compute Engine Persistent Disk device backed using an SSD. To a very *rough* approximation, think about something kinda like a qemu qcow image on an very fast flash device in terms of performance. I've seen the problem before on a 5400 RPM laptop drive, but it's ***much*** rarer. It reproduces much more easily faster block devices. P.P.S. 
And here's the fio job file that it was running: ########### # 299 test fio activity # Filenames derived from jobsname and jobid like follows: # .. [global] ioengine=libaio bs=128k directory=/xt-vdc filesize=5368709120 size=999G iodepth=128*1 continue_on_error=write ignore_error=,ENOSPC error_dump=0 create_on_open=1 fallocate=none exitall=1 ## Perform direct aio, to files which may be truncated ## by external task [direct_aio] direct=1 buffered=0 numjobs=4 rw=randwrite runtime=100*1 time_based # Perform direct aio and verify data # This test case should check use-after-free issues [aio-dio-verifier] numjobs=1 verify=crc32c-intel verify_fatal=1 verify_dump=1 verify_backlog=1024 verify_async=4 verifysort=1 direct=1 bs=4k rw=randrw filename=aio-dio-verifier # Perform buffered aio and verify data # This test case should check use-after-free issues [buffered-aio-verifier] numjobs=1 verify=crc32c-intel verify_fatal=1 verify_dump=1 verify_backlog=1024 verify_async=4 verifysort=1 direct=0 buffered=1 bs=4k rw=randrw filename=buffered-aio-verifier ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-13 23:19 ` Theodore Ts'o @ 2016-10-18 18:01 ` Theodore Ts'o 2016-10-19 14:06 ` Jens Axboe 0 siblings, 1 reply; 29+ messages in thread From: Theodore Ts'o @ 2016-10-18 18:01 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov Jens, ping? Have you had a chance to take a look at the stack traces? Is there any other debugging information I can get for you? Thanks, - Ted On Thu, Oct 13, 2016 at 07:19:23PM -0400, Theodore Ts'o wrote: > On Wed, Oct 12, 2016 at 08:39:57PM -0600, Jens Axboe wrote: > > > > If I just wanted to invoke generic/299 with a setup similar to yours, > > what should I do? I can attempt to reproduce it. > > Well, if you're willing to get a Google Compute Engine account (it > will come with a $300 free trial credit, which is plenty for this > exercise), the simplest way to do an exact reproduction would be to > follow the instructions here: > > https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md > > If you use the current default gce-xfstests image, which is > xfstests-201610131131 in the xfstests-cloud project, what I run which > triggers it quite reliably is: > > gce-xfstests -C 10 generic/299 > > On a run that I just did using a 4.8.0+ kernel, it ran 12 times (10 > times with the 4k configuration, and then 2 times with the 1k > configuration) before it hung. > > FSTESTIMG: xfstests-cloud/xfstests-201610131131 > FSTESTVER: e2fsprogs v1.43.3-30-g8df85fb (Sun, 4 Sep 2016 21:32:35 -0400) > FSTESTVER: fio fio-2.14-27-gafd2cef (Wed, 12 Oct 2016 08:59:25 -0600) > FSTESTVER: quota 81aca5c (Tue, 12 Jul 2016 16:15:45 +0200) > FSTESTVER: xfsprogs v4.5.0 (Tue, 15 Mar 2016 15:25:56 +1100) > FSTESTVER: xfstests-bld de1a881 (Fri, 30 Sep 2016 18:23:15 -0400) > FSTESTVER: xfstests linux-v3.8-1200-g218a62d (Thu, 29 Sep 2016 23:30:27 -0400) > FSTESTVER: kernel 4.8.0-ext4-14234-gfb44543 #21 SMP Wed Oct 12 23:40:33 EDT 2016 x86_64 > > So on a system which was hung in this way, I ran "apt-get update ; > apt-get install gdb", and uploaded the build tree (plus source code > for the fio used to build the image) to /root/fio. > > root@xfstests-201610131148:~# ps axm -o pid,tid,class,pri,psr,pcpu,stat,wchan:20,comm > PID TID CLS PRI PSR %CPU STAT WCHAN COMMAND > .... > 19781 - - - - 1.6 - - fio > - 19781 TS 19 0 1.6 Ssl hrtimer_nanosleep - > - 19800 TS 19 0 0.0 Ssl futex_wait_queue_me - > - 19801 TS 19 0 0.0 Ssl futex_wait_queue_me - > - 19802 TS 19 0 0.0 Ssl futex_wait_queue_me - > - 19803 TS 19 0 0.0 Ssl futex_wait_queue_me - > .... > root@xfstests-201610131148:~# gdb /root/fio/fio > .... > (gdb) dir /root/fio > Source directories searched: /root/fio:$cdir:$cwd > (gdb) attach 19781 > Attaching to program: /root/fio/fio, process 19781 > .... 
> (gdb) thread apply all bt > > Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)): > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309 > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > > Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)): > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309 > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > > Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)): > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309 > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > > Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)): > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309 > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > > Thread 1 (Thread 0x7fb3de190740 (LWP 19781)): > #0 0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 > #1 0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32 > #2 0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738 > #3 0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 > #4 0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 > #5 0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65 > > (gdb) thread apply all bt full > > Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)): > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > No locals. > #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > list = {next = 0x7fb3ddfeced0, prev = 0x7fb3ddfeced0} > td = 0x7fb3c7590610 > io_u = 0x202fac0 > ret = 0 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309 > __res = <optimized out> > pd = 0x7fb3ddfed700 > now = <optimized out> > unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500339456, -2782054375790187915, 0, 140410502103136, 140410120428088, > 140410500339456, 2739195402349996661, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, > data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} > not_first_call = <optimized out> > pagesize_m1 = <optimized out> > sp = <optimized out> > freesize = <optimized out> > __PRETTY_FUNCTION__ = "start_thread" > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > No locals. > > Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)): > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > No locals. 
> #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > list = {next = 0x7fb3ddfe4ed0, prev = 0x7fb3ddfe4ed0} > td = 0x7fb3c7590610 > io_u = 0x203d280 > ret = 0 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309 > __res = <optimized out> > pd = 0x7fb3ddfe5700 > now = <optimized out> > unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500306688, -2782054375790187915, 0, 140410502103136, 140410120428088, > 140410500306688, 2739195406644963957, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, > data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} > not_first_call = <optimized out> > pagesize_m1 = <optimized out> > sp = <optimized out> > freesize = <optimized out> > __PRETTY_FUNCTION__ = "start_thread" > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > No locals. > > Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)): > ---Type <return> to continue, or q <return> to quit--- > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > No locals. > #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > list = {next = 0x7fb3ddfdced0, prev = 0x7fb3ddfdced0} > td = 0x7fb3c7590610 > io_u = 0x203cf80 > ret = 0 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309 > __res = <optimized out> > pd = 0x7fb3ddfdd700 > now = <optimized out> > unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500273920, -2782054375790187915, 0, 140410502103136, 140410120428088, > 140410500273920, 2739195393760062069, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, > data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} > not_first_call = <optimized out> > pagesize_m1 = <optimized out> > sp = <optimized out> > freesize = <optimized out> > __PRETTY_FUNCTION__ = "start_thread" > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > No locals. > > Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)): > #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 > No locals. > #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 > list = {next = 0x7fb3ddfd4ed0, prev = 0x7fb3ddfd4ed0} > td = 0x7fb3c7590610 > io_u = 0x203ccc0 > ret = 0 > #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309 > __res = <optimized out> > pd = 0x7fb3ddfd5700 > now = <optimized out> > unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500241152, -2782054375790187915, 0, 140410502103136, 140410120428088, > 140410500241152, 2739195398055029365, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, > data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} > not_first_call = <optimized out> > pagesize_m1 = <optimized out> > sp = <optimized out> > freesize = <optimized out> > __PRETTY_FUNCTION__ = "start_thread" > #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 > No locals. > > Thread 1 (Thread 0x7fb3de190740 (LWP 19781)): > #0 0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 > No locals. 
> ---Type <return> to continue, or q <return> to quit--- > #1 0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32 > ts = {tv_sec = 0, tv_nsec = 1000000} > #2 0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738 > verify_bytes = 419209216 > elapsed_us = {0, 0, 0} > td = 0x7fb3c7590610 > sk_out = 0x2039600 > clear_state = 33789440 > #3 0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 > ret = <optimized out> > pid = <optimized out> > this_jobs = 6 > fd = 0x2029720 > map = {0x7fb3c754c000, 0x7fb3c7559ad0, 0x7fb3c75675a0, 0x7fb3c7575070, 0x7fb3c7582b40, 0x7fb3c7590610, 0x0 <repeats 11 times>, > 0x10000000, 0x0 <repeats 771 times>, 0x7fb3ddf83c1c <check_match+300>, 0x0, 0xd827524, 0x5, 0x7, 0x7fb3de192040, > 0x7fb3ddf8458e <do_lookup_x+2334>, 0x0, 0x7ffd24b8ea80, 0x7fb3dd2a9d28, 0x7fb3dd2af020, 0x7ffd24b8eb90, 0x3609d4, > 0x7ffd24b8eb80, 0x0, 0x0, 0x7fb3de1920c0, 0x7fb3de1939d8, 0x406851, 0x7fb3dd2b6d78, 0x401e18, 0x100000000, 0x100000375, 0x0, > 0x7fb3de1920c0, 0x7ffd24b8ec20, 0x7fb3de19d500, 0x7ffd24b8ec48, 0x7fb3de19d1a8, 0x1, > 0x7fb3ddf8473d <_dl_lookup_symbol_x+349>, 0x0, 0x7fb3de1920c0, 0x1, 0x0, 0x1, 0x7fb3de19d1a8, 0x0, 0x0, 0x0, 0x0, 0x0, > 0x7fb3de19d500, 0x7ffd24b8eb90, 0x7ffd24b8eb80, 0xd827524, 0x406851, 0xffffffff, 0x0, 0x7fb3dd2af020, 0x7fb3de1939d8, > 0x0 <repeats 11 times>, 0x698620 <memcmp@got.plt>, 0x0, 0x7ffd24b8ecf8, 0x7ffd24b8ecf8, 0x3, 0x7ffd24b8f6f0, > 0x7fb3ddf88ecc <_dl_fixup+508>, 0x9000000000001, 0x0, 0x8ff5c28f5c28f, 0x7fb3dd2af020, 0x140, 0x7ffd24b8f6f0, > 0x7ffd24b91720, 0x7fb3ddf8f2e5 <_dl_runtime_resolve+53>, 0x0, 0x20262a0, 0x2bc4, 0x7ffd24b92530, 0x7ffd24b8f960, > 0x8ff5c28f5c200, 0x2026270, 0x7fb3de19d1a8, 0x54, 0x46144b <fio_test_cconv+3595>, 0x0, 0x7ffd24b8ece0, > 0x0 <repeats 16 times>, 0x2026210, 0x0, 0x2026270, 0x0, 0x1, 0x40000000400, 0x1, 0x0, 0x1, 0x100000001, 0x100000000, 0x0, > 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x100000001000, 0x1000, 0x0 <repeats 14 times>, 0x1, 0x0, 0x100000000, 0x100000001, 0x0, 0x0, > 0x0, 0x100000001, 0x400, 0x0 <repeats 87 times>, 0x100000000, 0x100000000, 0x0, 0x89, 0x0 <repeats 20 times>, 0x6400000000, > 0x6400000064, 0x200000, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1f400000000, 0x1000001f4...} > this_start = {tv_sec = 0, tv_usec = 0} > left = <optimized out> > td = 0x7fb3c7590610 > i = 5 > todo = 6 > nr_running = 0 > m_rate = 0 > t_rate = 0 > nr_started = 6 > spent = <optimized out> > #4 0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 > td = <optimized out> > i = <optimized out> > #5 0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65 > ret = 1 > > > > As I said, I can attach to the hanging fio using a gdb and give you > > > stackdumps for all of the threads if that would be helpful. > > > > That would definitely be helpful as well, especially if I cannot > > reproduce it. > > Here you go! > > If I kill -9 the fio process, the tests continue to run, and there are > no kernel complaints, so I'm pretty sure this is a fio issue. > > Many thanks!! > > - Ted > > P.S. In answer to your question from another e-mail message on this > thread, the block device in volved is a Google Compute Engine > Persistent Disk device backed using an SSD. To a very *rough* > approximation, think about something kinda like a qemu qcow image on > an very fast flash device in terms of performance. I've seen > the problem before on a 5400 RPM laptop drive, but it's ***much*** > rarer. 
It reproduces much more easily faster block devices. > > P.P.S. And here's the fio job file that it was running: > > ########### > # 299 test fio activity > # Filenames derived from jobsname and jobid like follows: > # .. > [global] > ioengine=libaio > bs=128k > directory=/xt-vdc > filesize=5368709120 > size=999G > iodepth=128*1 > continue_on_error=write > ignore_error=,ENOSPC > error_dump=0 > create_on_open=1 > fallocate=none > exitall=1 > > ## Perform direct aio, to files which may be truncated > ## by external task > [direct_aio] > direct=1 > buffered=0 > numjobs=4 > rw=randwrite > runtime=100*1 > time_based > > # Perform direct aio and verify data > # This test case should check use-after-free issues > [aio-dio-verifier] > numjobs=1 > verify=crc32c-intel > verify_fatal=1 > verify_dump=1 > verify_backlog=1024 > verify_async=4 > verifysort=1 > direct=1 > bs=4k > rw=randrw > filename=aio-dio-verifier > > # Perform buffered aio and verify data > # This test case should check use-after-free issues > [buffered-aio-verifier] > numjobs=1 > verify=crc32c-intel > verify_fatal=1 > verify_dump=1 > verify_backlog=1024 > verify_async=4 > verifysort=1 > direct=0 > buffered=1 > bs=4k > rw=randrw > filename=buffered-aio-verifier ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever
From: Jens Axboe @ 2016-10-19 14:06 UTC
To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov

On 10/18/2016 12:01 PM, Theodore Ts'o wrote:
> Jens, ping?
>
> Have you had a chance to take a look at the stack traces?  Is there
> any other debugging information I can get for you?

I'll take a look today.  I agree, this definitely looks like a fio
bug.  But not related to the mutex issue for the stat part, all
verifier threads are waiting to be woken up, but the main thread is
done.

I'll see if I can reproduce.  If not, I may ask you to dump more info
from the hung processes.

--
Jens Axboe
* Re: Test generic/299 stalling forever
From: Jens Axboe @ 2016-10-19 17:49 UTC
To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov

On 10/19/2016 08:06 AM, Jens Axboe wrote:
> On 10/18/2016 12:01 PM, Theodore Ts'o wrote:
>> Jens, ping?
>>
>> Have you had a chance to take a look at the stack traces?  Is there
>> any other debugging information I can get for you?
>
> I'll take a look today.  I agree, this definitely looks like a fio
> bug.  But not related to the mutex issue for the stat part, all
> verifier threads are waiting to be woken up, but the main thread is
> done.
>
> I'll see if I can reproduce.  If not, I may ask you to dump more info
> from the hung processes.

Been running for 4 hours now, in a loop, with bs=1k.  I haven't been
able to reproduce.  Before I go through the hassle of doing the google
compute engine thing, can you give me some details of the hardware?

Number of cores/nodes?
Memory size?
Rough speed and size of the device?
Any special mkfs options?

And whatever else might be relevant.

--
Jens Axboe
* Re: Test generic/299 stalling forever
From: Theodore Ts'o @ 2016-10-19 20:32 UTC
To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov

On Wed, Oct 19, 2016 at 11:49:12AM -0600, Jens Axboe wrote:
>
> Number of cores/nodes?
> Memory size?

I'm using a GCE n1-standard-2 VM.  So that's two CPUs and 7680M.

Each virtual CPU is implemented as a single hardware hyper-thread on a
2.3 GHz Intel Xeon E5 v3 (Haswell).  (I was using a GCE zone that has
Haswell processors; different GCE zones may have different processors.
See [1] and [2] for more details.)

[1] https://cloud.google.com/compute/docs/machine-types
[2] https://cloud.google.com/compute/docs/regions-zones/regions-zones

> Rough speed and size of the device?

I'm using a GCE PD backed by an SSD.  To a first approximation, you
can think of it as a KVM qcow file stored on a fast flash device.  I'm
running LVM on the disk, and fio is running on a 5 gig LVM volume.

> Any special mkfs options?

No.  This particular error will trigger on 4k block file systems, 1k
block file systems, 4k file systems with journals disabled, etc.  It's
fairly insensitive to the file system configuration.

> And whatever else might be relevant.

Note that the generic/299 test is running fio in an ENOSPC hitter
configuration, where there is an antagonist thread which is constantly
allocating all of the disk space available and then freeing it all:

	# FSQA Test No. 299
	#
	# AIO/DIO stress test
	# Run random AIO/DIO activity and fallocate/truncate simultaneously
	# Test will operate on huge sparsed files so ENOSPC is expected.

So some of the AIO/DIO operations will be failing with an error, and I
suspect that's very likely relevant to reproducing the failure.

The actual guts of the test from generic/299[1]:

[1] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/tree/tests/generic/299

_workout()
{
	echo ""
	echo "Run fio with random aio-dio pattern"
	echo ""
	cat $fio_config >> $seqres.full
	run_check $FIO_PROG $fio_config &
	pid=$!
	echo "Start fallocate/truncate loop"

	for ((i=0; ; i++))
	do
		for ((k=1; k <= NUM_JOBS; k++))
		do
			$XFS_IO_PROG -f -c "falloc 0 $FILE_SIZE" \
				$SCRATCH_MNT/direct_aio.$k.0 >> $seqres.full 2>&1
		done
		for ((k=1; k <= NUM_JOBS; k++))
		do
			$XFS_IO_PROG -c "truncate 0" \
				$SCRATCH_MNT/direct_aio.$k.0 >> $seqres.full 2>&1
		done
		# Following like will check that pid is still run.
		# Once fio exit we can stop fallocate/truncate loop
		pgrep -f "$FIO_PROG" > /dev/null 2>&1 || break
	done
	wait $pid
}

So what's happening is that generic/299 is looping in the
fallocate/truncate loop until fio exits, and since fio never exits, it
ends up looping forever.

Cheers,

					- Ted
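To make the antagonist's behavior concrete, here is a rough C
rendering of the xfs_io loop above.  The job count, the 5 GiB size,
and the /scratch path are illustrative placeholders standing in for
the test's NUM_JOBS, FILE_SIZE, and $SCRATCH_MNT; the real test also
breaks out of the loop once fio has exited.

/* Rough C equivalent of the fallocate/truncate antagonist loop above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_JOBS	4
#define FILE_SIZE	5368709120LL	/* matches filesize= in the fio job file */

int main(void)
{
	char path[256];
	int k, fd;

	for (;;) {	/* the real test stops once fio has exited */
		/* Allocate blocks to every file, pushing the fs toward ENOSPC... */
		for (k = 1; k <= NUM_JOBS; k++) {
			snprintf(path, sizeof(path), "/scratch/direct_aio.%d.0", k);
			fd = open(path, O_RDWR | O_CREAT, 0644);
			if (fd < 0)
				continue;
			(void) posix_fallocate(fd, 0, FILE_SIZE);  /* ENOSPC is expected */
			close(fd);
		}
		/* ...then free it all again by truncating the files to zero. */
		for (k = 1; k <= NUM_JOBS; k++) {
			snprintf(path, sizeof(path), "/scratch/direct_aio.%d.0", k);
			fd = open(path, O_RDWR);
			if (fd < 0)
				continue;
			(void) ftruncate(fd, 0);
			close(fd);
		}
	}
	return 0;
}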
* Re: Test generic/299 stalling forever 2016-10-19 20:32 ` Theodore Ts'o @ 2016-10-20 14:22 ` Jens Axboe 2016-10-21 22:15 ` Theodore Ts'o 0 siblings, 1 reply; 29+ messages in thread From: Jens Axboe @ 2016-10-20 14:22 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On 10/19/2016 02:32 PM, Theodore Ts'o wrote: > On Wed, Oct 19, 2016 at 11:49:12AM -0600, Jens Axboe wrote: >> >> Number of cores/nodes? >> Memory size? > > I'm using a gce n1-standard-2 VM. So that's two CPU's and 7680M. > > Each CPU is a virtual CPU is implemented as a single hardware > hyper-thread on a 2.3 GHz Intel Xeon E5 v3 (Haswell). (I was using a > GCE zone that has Haswell processors; different GCE zones may have > different processors. See [1] and [2] for more details.) > > [1] https://urldefense.proofpoint.com/v2/url?u=https-3A__cloud.google.com_compute_docs_machine-2Dtypes&d=DQIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=cK1a7KivzZRh1fKQMjSm2A&m=itmAtbDNAiup3d6EW5J8mxTc5VKZZo4z-TaIkfeBJ8o&s=JGVEfRvtOlOiYH_c8NLKuy3FFH8Ap3EGLrhsLV_UdiM&e= > [2] https://urldefense.proofpoint.com/v2/url?u=https-3A__cloud.google.com_compute_docs_regions-2Dzones_regions-2Dzones&d=DQIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=cK1a7KivzZRh1fKQMjSm2A&m=itmAtbDNAiup3d6EW5J8mxTc5VKZZo4z-TaIkfeBJ8o&s=nMlkYsjMNWYhpWmDmQmSFH_bQ6am_PeCfQzhWwFbbag&e= > >> Rough speed and size of the device? > > I'm using a GCE PD backed by a SSD. To a first approximation, you can > think of it as a KVM qcow file stored on a fast flash device. I'm > running LVM on the disk, and the fio is running on a 5 gig LVM volume. > >> Any special mkfs options? > > No. This particular error will trigger on 4k block file systems, 1k > block file systems, 4k file system swith journals disabled, etc. It's > fairly insensitive to the file system configuration. > >> And whatever else might be relevant. > > Note that the generic/299 test is running fio in an an ENOSPC hitter > configuration, where there is an antangonist thread which is > constantly allocating all of the disk space available and then freeing > it all: > > # FSQA Test No. 299 > # > # AIO/DIO stress test > # Run random AIO/DIO activity and fallocate/truncate simultaneously > # Test will operate on huge sparsed files so ENOSPC is expected. > > > So some of the AIO/DIO operations will be failing with an error, and > and I suspect that's very likely relevant to reproducing the failure. > > The actual guts of the test from generic/299[1]: > > [1] https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/tree/tests/generic/299 > > _workout() > { > echo "" > echo "Run fio with random aio-dio pattern" > echo "" > cat $fio_config >> $seqres.full > run_check $FIO_PROG $fio_config & > pid=$! > echo "Start fallocate/truncate loop" > > for ((i=0; ; i++)) > do > for ((k=1; k <= NUM_JOBS; k++)) > do > $XFS_IO_PROG -f -c "falloc 0 $FILE_SIZE" \ > $SCRATCH_MNT/direct_aio.$k.0 >> $seqres.full 2>&1 > done > for ((k=1; k <= NUM_JOBS; k++)) > do > $XFS_IO_PROG -c "truncate 0" \ > $SCRATCH_MNT/direct_aio.$k.0 >> $seqres.full 2>&1 > done > # Following like will check that pid is still run. > # Once fio exit we can stop fallocate/truncate loop > pgrep -f "$FIO_PROG" > /dev/null 2>&1 || break > done > wait $pid > } > > So what's happening is that generic/299 is looping in the > fallocate/truncate loop until fio exits, but since fio never exits, so > it ends up looping forever. I'm setting up the GCE now, I've had the tests running for about 24h now on another test box and haven't been able to trigger any hangs. 
I'll match your setup as closely as I can, hopefully that'll work. -- Jens Axboe ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-20 14:22 ` Jens Axboe @ 2016-10-21 22:15 ` Theodore Ts'o 2016-10-23 2:02 ` Theodore Ts'o 2016-10-23 14:32 ` Jens Axboe 0 siblings, 2 replies; 29+ messages in thread From: Theodore Ts'o @ 2016-10-21 22:15 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On Thu, Oct 20, 2016 at 08:22:00AM -0600, Jens Axboe wrote: > > So what's happening is that generic/299 is looping in the > > fallocate/truncate loop until fio exits, but since fio never exits, so > > it ends up looping forever. > > I'm setting up the GCE now, I've had the tests running for about 24h now > on another test box and haven't been able to trigger any hangs. I'll > match your setup as closely as I can, hopefully that'll work. Any luck reproducing the problem? On Wed, Oct 19, 2016 at 08:06:44AM -0600, Jens Axboe wrote: > > I'll take a look today. I agree, this definitely looks like a fio > bug. But not related to the mutex issue for the stat part, all verifier > threads are waiting to be woken up, but the main thread is done. > I was taking a closer look at this, and it does look like it's related to the stat_mutex. The main thread (according to gdb) seems to be stuck in this loop in backend.c line 1738 (in thread_main): do { check_update_rusage(td); if (!fio_mutex_down_trylock(stat_mutex)) break; usleep(1000); <----- line 1738 } while (1); So it looks like it's not able to grab the stat_mutex. But I can't figure out how the stat_mutex could be down. None of the stack traces seem to show that, and I've looked at all of the places where stat_mutex is taken, and it doesn't look like stat_mutex should ever be down for more than, say, a second? So as a temporary workaround, I'm considering adding a check to see if we stay stuck in this loop for more than a thousand iterations, and if so, print an error to stderr and then call _exit(1), or maybe just break out two levels by jumping to line 1778 at "td_set_runstate(td, TD_FINISHING)" and just give up on the usage statistics (since for xfstests we really don't care about the usage stats). - Ted P.S. I can't see any way this could be happening other than perhaps a pointer error that corrupted stat_mutex. I can't see any way a thread could leave stat_mutex down. WDYT? ^ permalink raw reply [flat|nested] 29+ messages in thread
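For readers following the locking discussion: stat_mutex is not a bare pthread mutex but fio's own counting semaphore, built from a pthread mutex/cond pair plus an integer value (struct fio_mutex in fio's mutex.c; its fields show up in the debugging patch quoted later in this thread). The minimal model below is a simplified sketch rather than fio's actual code, and it illustrates why a semaphore that is "down" leaves nothing for gdb to point at: the last holder already returned from the down operation and is executing somewhere else entirely.

/* build with: cc -pthread sem_model.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified stand-in for fio's struct fio_mutex: a counting semaphore
 * protected by a pthread mutex.  Being "down" just means value == 0; the
 * pthread mutex itself is only held for an instant, so no thread ever
 * shows up in a backtrace as blocked on it.
 */
struct sem_model {
	pthread_mutex_t lock;
	int value;
};

/* Same convention as fio_mutex_down_trylock(): returns false when taken. */
static bool sem_model_trylock(struct sem_model *s)
{
	bool failed = true;

	pthread_mutex_lock(&s->lock);
	if (s->value) {
		s->value--;
		failed = false;
	}
	pthread_mutex_unlock(&s->lock);
	return failed;
}

int main(void)
{
	struct sem_model s = { .lock = PTHREAD_MUTEX_INITIALIZER, .value = 1 };

	printf("trylock #1 failed: %d\n", sem_model_trylock(&s)); /* 0: we took it */
	printf("trylock #2 failed: %d\n", sem_model_trylock(&s)); /* 1: now "down" */
	return 0;
}

This is also why the owner-tracking patch later in the thread has to record the caller explicitly: no pthread state remembers who last decremented value.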
* Re: Test generic/299 stalling forever 2016-10-21 22:15 ` Theodore Ts'o @ 2016-10-23 2:02 ` Theodore Ts'o 2016-10-23 14:32 ` Jens Axboe 1 sibling, 0 replies; 29+ messages in thread From: Theodore Ts'o @ 2016-10-23 2:02 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On Fri, Oct 21, 2016 at 06:15:51PM -0400, Theodore Ts'o wrote: > I was taking a closer look at this, and it does look ike it's related > to the stat_mutex. The main thread (according to gdb) seems to be > stuck in this loop in backend.c line 1738 (in thread_main): > > do { > check_update_rusage(td); > if (!fio_mutex_down_trylock(stat_mutex)) > break; > usleep(1000); <----- line 1738 > } while (1); So I have something very strange to report. I sync'ed up to the latest fio repo, at commit e291cff14e97feb3cf. The problem still manifests with that commit. Given what I've observed with a thread spinning in this do loop, I added this commit: commit 0f2f71f51595f6b708b801f7ae1dc86c5b2f3705 Author: Theodore Ts'o <tytso@mit.edu> Date: Sat Oct 22 10:32:41 2016 -0400 backend: if we can't grab stat_mutex, report a deadlock error and exit Signed-off-by: Theodore Ts'o <tytso@mit.edu> diff --git a/backend.c b/backend.c index fb2a855..093b6a3 100644 --- a/backend.c +++ b/backend.c @@ -1471,6 +1471,7 @@ static void *thread_main(void *data) struct thread_data *td = fd->td; struct thread_options *o = &td->o; struct sk_out *sk_out = fd->sk_out; + int deadlock_loop_cnt; int clear_state; int ret; @@ -1731,11 +1732,17 @@ * the rusage_sem, which would never get upped because * this thread is waiting for the stat mutex. */ + deadlock_loop_cnt = 0; do { check_update_rusage(td); if (!fio_mutex_down_trylock(stat_mutex)) break; usleep(1000); + if (deadlock_loop_cnt++ > 5000) { + log_err("fio seems to be stuck grabbing stat_mutex, forcibly exiting\n"); + td->error = EDEADLOCK; + goto err; + } } while (1); if (td_read(td) && td->io_bytes[DDIR_READ]) With this commit, fio in the generic/299 test no longer hangs. I've tried running it for a very long time, and it no longer reproduces at all. Specifically, the log_err() and the EDEADLOCK error added by the patch aren't triggering, and fio is no longer hanging. So merely adding a loop counter seems to make the problem go away, which makes me wonder if there is some kind of compiler or code generation artifact we're seeing. So I should mention which compiler I'm currently using: % schroot -c jessie64 -- gcc --version gcc (Debian 4.9.2-10) 4.9.2 Anyway, I have a workaround that seems to work for me, and which, even if the deadlock_loop counter fires, will at least stop the test run from hanging. You may or may not want to include this in the fio upstream repo, given that I can't explain why merely checking for the deadlock (or for the inability to grab stat_mutex, anyway) makes the deadlock go away. At least for the purposes of running the test, though, it does seem to be a valid workaround. Cheers, - Ted ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-21 22:15 ` Theodore Ts'o 2016-10-23 2:02 ` Theodore Ts'o @ 2016-10-23 14:32 ` Jens Axboe 2016-10-23 19:33 ` Theodore Ts'o 1 sibling, 1 reply; 29+ messages in thread From: Jens Axboe @ 2016-10-23 14:32 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On 10/21/2016 04:15 PM, Theodore Ts'o wrote: > On Thu, Oct 20, 2016 at 08:22:00AM -0600, Jens Axboe wrote: >>> So what's happening is that generic/299 is looping in the >>> fallocate/truncate loop until fio exits, but since fio never exits, so >>> it ends up looping forever. >> >> I'm setting up the GCE now, I've had the tests running for about 24h now >> on another test box and haven't been able to trigger any hangs. I'll >> match your setup as closely as I can, hopefully that'll work. > > Any luck reproducing the problem? > > On Wed, Oct 19, 2016 at 08:06:44AM -0600, Jens Axboe wrote: >> >> I'll take a look today. I agree, this definitely looks like a fio >> bug. But not related to the mutex issue for the stat part, all verifier >> threads are waiting to be woken up, but the main thread is done. >> > > I was taking a closer look at this, and it does look ike it's related > to the stat_mutex. The main thread (according to gdb) seems to be > stuck in this loop in backend.c line 1738 (in thread_main): > > do { > check_update_rusage(td); > if (!fio_mutex_down_trylock(stat_mutex)) > break; > usleep(1000); <----- line 1738 > } while (1); > > So it looks like it's not able to grab the stat_mutex. But I can't > figure out how the stat_mutex could be down. None of the strack > traces seem to show that, and I've looked at all of the places where > stat_mutex is taken, and it doesn't look like stat_mutex should ever > be down for more than, say, a second? > > So as a temporary workaround, I'm considering adding a check to see if > we stay stuck in this loop for than a thousand times, and if so, print > an error to stderr and then call _exit(1), or maybe just break out two > levels by jumping to line 1778 at "td_set_runstate(td, TD_FINISHING)" > and just give up on the usage statistics (since for xfstests we really > don't care about the usage stats). Very strange. Can you see who the owner is of stat_mutex->lock, that's the pthread_mutex_t they are sleeping on. For now, I'll apply the work-around you sent. I haven't been able to reproduce this, but knowing that it's the stat_mutex will allow me to better make up a test case to hit it. -- Jens Axboe ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-23 14:32 ` Jens Axboe @ 2016-10-23 19:33 ` Theodore Ts'o 2016-10-23 21:24 ` Theodore Ts'o 0 siblings, 1 reply; 29+ messages in thread From: Theodore Ts'o @ 2016-10-23 19:33 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On Sun, Oct 23, 2016 at 08:32:49AM -0600, Jens Axboe wrote: > > Very strange. Can you see who the owner is of stat_mutex->lock, that's > the pthread_mutex_t they are sleeping on. > > For now, I'll apply the work-around you sent. I haven't been able to > reproduce this, but knowing that it's the stat_mutex will allow me to > better make up a test case to hit it. Well, that's the weird thing. I can't see anything that could be holding the stat_mutex. This is from a somewhat older version of fio (fio-2.14-27-gafd2cef), but when things wedged, there were four threads hanging here (with the exact same stack trace): Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241 #2 0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309 #3 0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 ... and the main thread was hanging here: Thread 1 (Thread 0x7fb3de190740 (LWP 19781)): #0 0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32 #2 0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738 #3 0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 #4 0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 #5 0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65 The verify threads were waiting on td->verify_cond, which it looks like the main thread was supposed to wake up in its shutdown sequence when it calls verify_async_exit(), or if there was work to be done via the verify_io_u_async() end_io callback. But the verify threads don't take the stat_mutex, and the only places that take the stat_mutex in stat.c and backend.c don't hold it for long, and I can't see any longjumps or other program flow discontinuities that could have left stat_mutex locked. I'm not an expert on pthreads debugging; is there some way to figure out, a la CONFIG_PROVE_LOCKING, which thread took the mutex, and at what line number? Or would we need to try hacking in some kind of mutex debugging code into fio_mutex_down(), et al.? Cheers, - Ted ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-23 19:33 ` Theodore Ts'o @ 2016-10-23 21:24 ` Theodore Ts'o 2016-10-24 1:41 ` Jens Axboe 2016-10-24 3:38 ` Theodore Ts'o 0 siblings, 2 replies; 29+ messages in thread From: Theodore Ts'o @ 2016-10-23 21:24 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov OK, perhaps this is helpful. I reverted my last patch (so I could repro the problem), and then added the following patch to fio/mutex.c (see below). When it hung, I got the following stack trace on the main thread: (gdb) where #0 0x00007ff8c018cf2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007ff8c01b4fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32 #2 0x000000000045b691 in thread_main (data=0x1bc64000) at backend.c:1738 #3 0x000000000045d623 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 #4 0x000000000045d95d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 #5 0x000000000040cc18 in main (argc=2, argv=0x7fff8446f0d8, envp=<optimized out>) at fio.c:65 .... and it looks like the reason why we can't take the mutex is that somehow some thread managed to call stat_exit(), which calls fio_mutex_down() followed by fio_mutex_remove(): (gdb) p (void *) stat_mutex->owner $4 = (void *) 0x42fd60 <stat_exit+16> How this could have happened, I have no idea. The only call to stat_exit() appears to be at the end of fio_backend(), and the main thread is still in the middle of calling thread_main() which is called by run_threads(). Do you have any thoughts? Let me know if there's something I can try. - Ted diff --git a/mutex.c b/mutex.c index 7580922..26e2955 100644 --- a/mutex.c +++ b/mutex.c @@ -108,6 +108,8 @@ int __fio_mutex_init(struct fio_mutex *mutex, int value) mutex->value = value; mutex->magic = FIO_MUTEX_MAGIC; + if (!value) + mutex->owner = __builtin_return_address(0); ret = mutex_cond_init_pshared(&mutex->lock, &mutex->cond); if (ret) @@ -128,8 +130,11 @@ struct fio_mutex *fio_mutex_init(int value) return NULL; } - if (!__fio_mutex_init(mutex, value)) + if (!__fio_mutex_init(mutex, value)) { + if (!value) + mutex->owner = __builtin_return_address(0); return mutex; + } fio_mutex_remove(mutex); return NULL; @@ -194,7 +199,13 @@ bool fio_mutex_down_trylock(struct fio_mutex *mutex) pthread_mutex_lock(&mutex->lock); if (mutex->value) { + if (mutex->value > 1) { + log_err("Completely bogus mutex value?!? (%d, %p)\n", + mutex->value, mutex->owner); + abort(); + } mutex->value--; + mutex->owner = __builtin_return_address(0); ret = false; } pthread_mutex_unlock(&mutex->lock); @@ -215,6 +226,7 @@ void fio_mutex_down(struct fio_mutex *mutex) } mutex->value--; + mutex->owner = __builtin_return_address(0); pthread_mutex_unlock(&mutex->lock); } @@ -228,6 +240,8 @@ void fio_mutex_up(struct fio_mutex *mutex) read_barrier(); if (!mutex->value && mutex->waiters) do_wake = 1; + if (!mutex->value) + mutex->owner = 0; mutex->value++; pthread_mutex_unlock(&mutex->lock); diff --git a/mutex.h b/mutex.h index 54009ba..ea43845 100644 --- a/mutex.h +++ b/mutex.h @@ -13,6 +13,7 @@ struct fio_mutex { int value; int waiters; int magic; + unsigned long owner; }; struct fio_rwlock { P.P.S. 
As before, the only other threads are the verify threads: (gdb) thread apply all bt Thread 5 (Thread 0x7ff8c0e1a700 (LWP 13147)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x00000000004398cb in verify_async_thread (data=0x7ff8aa3bd610) at verify.c:1241 #2 0x00007ff8c068a0a4 in start_thread (arg=0x7ff8c0e1a700) at pthread_create.c:309 #3 0x00007ff8c01bb62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 4 (Thread 0x7ff8c0e12700 (LWP 13148)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x00000000004398cb in verify_async_thread (data=0x7ff8aa3bd610) at verify.c:1241 #2 0x00007ff8c068a0a4 in start_thread (arg=0x7ff8c0e12700) at pthread_create.c:309 #3 0x00007ff8c01bb62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 3 (Thread 0x7ff8c0e0a700 (LWP 13149)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x00000000004398cb in verify_async_thread (data=0x7ff8aa3bd610) at verify.c:1241 #2 0x00007ff8c068a0a4 in start_thread (arg=0x7ff8c0e0a700) at pthread_create.c:309 #3 0x00007ff8c01bb62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 2 (Thread 0x7ff8c0e02700 (LWP 13150)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x00000000004398cb in verify_async_thread (data=0x7ff8aa3bd610) at verify.c:1241 #2 0x00007ff8c068a0a4 in start_thread (arg=0x7ff8c0e02700) at pthread_create.c:309 #3 0x00007ff8c01bb62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 1 (Thread 0x7ff8c0fbd740 (LWP 13139)): #0 0x00007ff8c018cf2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007ff8c01b4fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32 #2 0x000000000045b691 in thread_main (data=0x1bc64000) at backend.c:1738 #3 0x000000000045d623 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 #4 0x000000000045d95d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 #5 0x000000000040cc18 in main (argc=2, argv=0x7fff8446f0d8, envp=<optimized out>) at fio.c:65 (gdb) ^ permalink raw reply related [flat|nested] 29+ messages in thread
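The owner-tracking patch above leans on GCC's __builtin_return_address(0) to record the call site that last took the semaphore, which gdb then resolves to a symbol (the "<stat_exit+16>" shown above). A standalone sketch of that technique follows; the function names here are made up for illustration, and it assumes GCC or Clang.

/* build with: cc caller_demo.c */
#include <stdio.h>

static void *last_caller;

/* Record the address we were called from, as Ted's patch does for stat_mutex. */
__attribute__((noinline)) static void take_resource(void)
{
	last_caller = __builtin_return_address(0);
}

__attribute__((noinline)) static void some_user(void)
{
	take_resource();
}

int main(void)
{
	some_user();
	/* The saved address can be resolved with gdb ("info symbol <addr>")
	 * or with addr2line -f -e ./a.out <addr>; it points inside some_user(). */
	printf("resource last taken from %p\n", last_caller);
	return 0;
}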
* Re: Test generic/299 stalling forever 2016-10-23 21:24 ` Theodore Ts'o @ 2016-10-24 1:41 ` Jens Axboe 2016-10-24 3:38 ` Theodore Ts'o 1 sibling, 0 replies; 29+ messages in thread From: Jens Axboe @ 2016-10-24 1:41 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On 10/23/2016 03:24 PM, Theodore Ts'o wrote: > OK, perhaps this is helpful. I reverted my last patch (so I could > repro the problem), and then added the following patch to fio/mutex.c > (see below). > > When it hung, I got the following stack trace on the main thread: > > (gdb) where > #0 0x00007ff8c018cf2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81 > #1 0x00007ff8c01b4fb4 in usleep (useconds=<optimized out>) > at ../sysdeps/unix/sysv/linux/usleep.c:32 > #2 0x000000000045b691 in thread_main (data=0x1bc64000) at backend.c:1738 > #3 0x000000000045d623 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268 > #4 0x000000000045d95d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400 > #5 0x000000000040cc18 in main (argc=2, argv=0x7fff8446f0d8, envp=<optimized out>) at fio.c:65 > > .... and it looks like the reason why we can't take the mutex is that > somehow some thread managed to call stat_exit(), which calls > fio_mutex_down() followed by fio_mutex_remove(): > > (gdb) p (void *) stat_mutex->owner > $4 = (void *) 0x42fd60 <stat_exit+16> > > How this could have happened, I have no idea. The only call to > stat_exit() appears to be at the end of fio_backend(), and the main > thread is still in the middle of calling thread_main() which is called > by run_threads(). > > Do you have any thoughts? Let me know if there's something I can try. Wow, that is very odd. That should only be possible if the main thread exits too early, and calls stat_exit() while we still have jobs running. That should trigger other funkiness too. I have no idea how that is happening, but it obviously is since your debugging patch showed that stat_exit() owns the mutex. But this gives me more to work from, very useful, thanks Ted! I'll pound on this tomorrow. -- Jens Axboe ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-23 21:24 ` Theodore Ts'o 2016-10-24 1:41 ` Jens Axboe @ 2016-10-24 3:38 ` Theodore Ts'o 2016-10-24 16:28 ` Jens Axboe 1 sibling, 1 reply; 29+ messages in thread From: Theodore Ts'o @ 2016-10-24 3:38 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov [-- Attachment #1: Type: text/plain, Size: 3558 bytes --] I enabled some more debugging and it's become clearer what's going on. (See attached for the full log.) The main issue seems to be that once one of the fio jobs is done, it kills off the other threads (actually, we're using processes): process 31848 terminate group_id=0 process 31848 setting terminate on direct_aio/31846 process 31848 setting terminate on direct_aio/31848 process 31848 setting terminate on direct_aio/31849 process 31848 setting terminate on direct_aio/31851 process 31848 setting terminate on aio-dio-verifier/31852 process 31848 setting terminate on buffered-aio-verifier/31854 process 31851 pid=31851: runstate RUNNING -> FINISHING process 31851 terminate group_id=0 process 31851 setting terminate on direct_aio/31846 process 31851 setting terminate on direct_aio/31848 process 31851 setting terminate on direct_aio/31849 process 31851 setting terminate on direct_aio/31851 process 31851 setting terminate on aio-dio-verifier/31852 process 31851 setting terminate on buffered-aio-verifier/31854 process 31852 pid=31852: runstate RUNNING -> FINISHING process 31846 pid=31846: runstate RUNNING -> FINISHING ... but one or more of the threads doesn't exit within 60 seconds: fio: job 'direct_aio' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job. process 31794 pid=31849: runstate RUNNING -> REAPED fio: job 'buffered-aio-verifier' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job. process 31794 pid=31854: runstate RUNNING -> REAPED process 31794 terminate group_id=-1 The main thread then prints all of the statistics and calls stat_exit(): stat_exit called by tid: 31794 <---- debugging message which prints gettid() Unfortunately, these processes aren't actually killed; they are marked as reaped, but they are still in the process listing: root@xfstests:~# ps augxww | grep fio root 1585 0.0 0.0 0 0 ? S< 18:45 0:00 [dm_bufio_cache] root 7191 0.0 0.0 12732 2200 pts/1 S+ 23:05 0:00 grep fio root 31849 1.5 0.2 407208 18876 ? Ss 22:36 0:26 /root/xfstests/bin/fio /tmp/31503.fio root 31854 1.2 0.1 398480 10240 ? Ssl 22:36 0:22 /root/xfstests/bin/fio /tmp/31503.fio And if you attach to them with gdb, they are spinning trying to grab the stat_mutex, which they can't get because the main thread has already called stat_exit() and then has exited. So these two threads did eventually return, but some time after 60 seconds had passed, and then they hung waiting for stat_mutex, which they will never get because the main thread has already called stat_exit(). This probably also explains why you had trouble reproducing it. It requires a disk whose performance is variable enough that under heavy load, it might take more than 60 seconds for the direct_aio or buffered-aio-verifier thread to close itself out. And I suspect once the main thread exited, it probably also closed out the debugging channel, so the deadlock detector probably did trip, but somehow we just didn't see the output. So I can imagine some possible fixes. We could make the thread timeout configurable, and/or increase it from 60 seconds to something like 300 seconds. 
We could make stat_exit() a no-op --- after all, if the main thread is exiting, there's no real point to down and then destroy the stat_mutex. And/or we could change the forced reap to send a kill -9 to the thread instead of just marking it as reaped. Cheers, - Ted [-- Attachment #2: 299.full.gz --] [-- Type: application/gzip, Size: 4711 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
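The failure mode Ted describes can be modeled outside fio. The sketch below is illustrative only: it uses plain pthreads where fio actually uses forked processes sharing the semaphore, the timings are arbitrary, and the worker loop is bounded so the demo terminates instead of hanging. The main() here plays the part of fio's main thread: it gives up on a slow worker when the reap timeout expires, takes the stat semaphore one last time on the way out (as stat_exit() does), and never releases it, leaving the straggler stuck in the trylock/usleep loop from thread_main().

/* build with: cc -pthread reap_race_model.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int stat_sem_value = 1;		/* the stat semaphore starts "up" */

/* Same convention as fio_mutex_down_trylock(): false means we took it. */
static bool trylock_down(void)
{
	bool failed = true;

	pthread_mutex_lock(&lock);
	if (stat_sem_value) {
		stat_sem_value--;
		failed = false;
	}
	pthread_mutex_unlock(&lock);
	return failed;
}

static void *slow_worker(void *arg)
{
	(void)arg;
	sleep(2);				/* finishes later than the reap timeout */
	for (int i = 0; i < 200; i++) {		/* fio's loop has no bound at all */
		if (!trylock_down())
			return NULL;		/* would go on to publish final stats */
		usleep(1000);
	}
	printf("worker: stat semaphore never comes back up; fio would spin here forever\n");
	return NULL;
}

int main(void)
{
	pthread_t worker;

	pthread_create(&worker, NULL, slow_worker, NULL);
	sleep(1);		/* "reap timeout" expires, worker is force-marked REAPED */
	trylock_down();		/* stat_exit(): down the semaphore, then tear it down */
	printf("main: reaped worker, ran stat_exit(), nothing will up the semaphore again\n");
	pthread_join(worker, NULL);	/* only so this demo can exit cleanly */
	return 0;
}

Jens's patch further down closes this window from the worker side, by checking td->runstate against TD_EXITED before re-entering that loop, and the larger FIO_REAP_TIMEOUT makes the window much less likely to open in the first place.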
* Re: Test generic/299 stalling forever 2016-10-24 3:38 ` Theodore Ts'o @ 2016-10-24 16:28 ` Jens Axboe 2016-10-25 2:54 ` Theodore Ts'o 0 siblings, 1 reply; 29+ messages in thread From: Jens Axboe @ 2016-10-24 16:28 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On 10/23/2016 09:38 PM, Theodore Ts'o wrote: > I enabled some more debugging and it's become more clear what's going > on. (See attached for the full log). > > The main issue seems to be that once one of fio is done, it kills off > the other threads (actually, we're using processes): > > process 31848 terminate group_id=0 > process 31848 setting terminate on direct_aio/31846 > process 31848 setting terminate on direct_aio/31848 > process 31848 setting terminate on direct_aio/31849 > process 31848 setting terminate on direct_aio/31851 > process 31848 setting terminate on aio-dio-verifier/31852 > process 31848 setting terminate on buffered-aio-verifier/31854 > process 31851 pid=31851: runstate RUNNING -> FINISHING > process 31851 terminate group_id=0 > process 31851 setting terminate on direct_aio/31846 > process 31851 setting terminate on direct_aio/31848 > process 31851 setting terminate on direct_aio/31849 > process 31851 setting terminate on direct_aio/31851 > process 31851 setting terminate on aio-dio-verifier/31852 > process 31851 setting terminate on buffered-aio-verifier/31854 > process 31852 pid=31852: runstate RUNNING -> FINISHING > process 31846 pid=31846: runstate RUNNING -> FINISHING > ... > > but one or more of the threads doesn't exit within 60 seconds: > > fio: job 'direct_aio' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job. > process 31794 pid=31849: runstate RUNNING -> REAPED > fio: job 'buffered-aio-verifier' (state=5) hasn't exited in 60 seconds, it appears to be stuck. Doing forceful exit of this job. > process 31794 pid=31854: runstate RUNNING -> REAPED > process 31794 terminate group_id=-1 > > The main thread then prints all of the statistics, and calls stat_exit(): > > stat_exit called by tid: 31794 <---- debugging message which prints gettid() > > Unfortunately, this process(es) aren't actually, killed, they are > marked as reap, but they are still in the process listing: > > root@xfstests:~# ps augxww | grep fio > root 1585 0.0 0.0 0 0 ? S< 18:45 0:00 [dm_bufio_cache] > root 7191 0.0 0.0 12732 2200 pts/1 S+ 23:05 0:00 grep fio > root 31849 1.5 0.2 407208 18876 ? Ss 22:36 0:26 /root/xfstests/bin/fio /tmp/31503.fio > root 31854 1.2 0.1 398480 10240 ? Ssl 22:36 0:22 /root/xfstests/bin/fio /tmp/31503.fio > > And if you attach to them with a gdb, they are spinning trying to grab > the stat_mutex(), which they can't get because the main thread has > already called stat_exit() and then has exited. So these two threads > did eventually return, but some time after 60 seconds had passed, and > then they hung waiting for stat_mutex(), which they will never get > because the main thread has already called stat_exit(). > > This probably also explains why you had trouble reproducing it. It > requires a disk whose performance is variable enougoh that under heavy > load, it might take more than 60 seconds for the direct_aio or > buffered-aio-verifier thread to close itself out. Good catch! Yes, that could certainly explain why we are stuck on that stat_mutex and why the main thread just gave up on it and ended up in stat_exit() with a thread (or more) still running. 
> And I suspect once the main thread exited, it probably also closed out > the debugging channel so the deadlock detector did probably trip, but > somehow we just didn't see the output. > > So I can imagine some possible fixes. We could make the thread > timeout configurable, and/or increase it from 60 seconds to something like > 300 seconds. We could make stat_exit() a no-op --- after all, if the > main thread is exiting, there's no real point to down and then destroy > the stat_mutex. And/or we could change the forced reap to send a kill > -9 to the thread, and instead of maring it as reaped. We have to clean up - for normal runs, it's not a big deal, but if fio is run as a client/server setup, the backend will persist across runs. If we leak, then that could be a concern. How about the below? Bump the timeout to 5 min, 1 min is a little on the short side, we want normal error handling to be out of the way before that happens. And additionally, break out if we have been marked as reaped/exited, so we avoid grabbing the stat mutex again. diff --git a/backend.c b/backend.c index 093b6a3a290e..f0927abfccb0 100644 --- a/backend.c +++ b/backend.c @@ -1723,6 +1723,14 @@ static void *thread_main(void *data) } } + /* + * If we took too long to shut down, the main thread could + * already consider us reaped/exited. If that happens, break + * out and clean up. + */ + if (td->runstate >= TD_EXITED) + break; + clear_state = 1; /* diff --git a/fio.h b/fio.h index 080842aef4f8..74c1b306af26 100644 --- a/fio.h +++ b/fio.h @@ -588,7 +588,7 @@ extern const char *runstate_to_name(int runstate); * Allow 60 seconds for a job to quit on its own, otherwise reap with * a vengeance. */ -#define FIO_REAP_TIMEOUT 60 +#define FIO_REAP_TIMEOUT 300 #define TERMINATE_ALL (-1U) extern void fio_terminate_threads(unsigned int); -- Jens Axboe ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-24 16:28 ` Jens Axboe @ 2016-10-25 2:54 ` Theodore Ts'o 2016-10-25 2:59 ` Jens Axboe 0 siblings, 1 reply; 29+ messages in thread From: Theodore Ts'o @ 2016-10-25 2:54 UTC (permalink / raw) To: Jens Axboe; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On Mon, Oct 24, 2016 at 10:28:14AM -0600, Jens Axboe wrote: > How about the below? Bump the timeout to 5 min, 1 min is a little on the > short side, we want normal error handling to be out of the way before > that happens. And additionally, break out if we have been marked as > reaped/exited, so we avoid grabbing the stat mutex again. Yep, that works. I tried a test with just the second change: > + /* > + * If we took too long to shut down, the main thread could > + * already consider us reaped/exited. If that happens, break > + * out and clean up. > + */ > + if (td->runstate >= TD_EXITED) > + break; > + And that's sufficient to solve the problem. Increasing the timeout to 5 minutes would also be a good idea, so we can let the worker threads exit cleanly and the reported stats will be completely accurate. Thanks for your help in figuring out this long-standing problem! - Ted ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-25 2:54 ` Theodore Ts'o @ 2016-10-25 2:59 ` Jens Axboe 0 siblings, 0 replies; 29+ messages in thread From: Jens Axboe @ 2016-10-25 2:59 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Dave Chinner, linux-ext4, fstests, tarasov On 10/24/2016 08:54 PM, Theodore Ts'o wrote: > On Mon, Oct 24, 2016 at 10:28:14AM -0600, Jens Axboe wrote: > >> How about the below? Bump the timeout to 5 min, 1 min is a little on the >> short side, we want normal error handling to be out of the way before >> that happens. And additionally, break out if we have been marked as >> reaped/exited, so we avoid grabbing the stat mutex again. > > Yep, that works. I tried a test with just the second change: > >> + /* >> + * If we took too long to shut down, the main thread could >> + * already consider us reaped/exited. If that happens, break >> + * out and clean up. >> + */ >> + if (td->runstate >= TD_EXITED) >> + break; >> + > > And that's sufficient to solve the problem. Yes, it should be, so glad that it is! > Increasing the timeout to 5 minute also would be a good idea, so we > can let the worker threads exit cleanly so the reported stats will be > completely accurate. I made that separate change as well. If the job is stuck in the kernel for some sync operation, we could feasibly be uninterruptible for minutes. So 1 minute is too short in any case, and I'd rather just make this check than send kill signals, since it won't fix the uninterruptible problem. > Thanks for your help in figuring out this long-standing problem! It was easy based on all your info, since I could not reproduce. So thanks for your help! Everything should be committed now, and I'll cut a new release tomorrow so we can hopefully put this behind us. -- Jens Axboe ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-13 2:15 ` Theodore Ts'o 2016-10-13 2:39 ` Jens Axboe @ 2016-10-13 13:08 ` Anatoly Pugachev 2016-10-13 13:36 ` Anatoly Pugachev 1 sibling, 1 reply; 29+ messages in thread From: Anatoly Pugachev @ 2016-10-13 13:08 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Jens Axboe, Dave Chinner, linux-ext4, fstests, tarasov [-- Attachment #1: Type: text/plain, Size: 2859 bytes --] On Wed, Oct 12, 2016 at 10:15:52PM -0400, Theodore Ts'o wrote: > On Wed, Oct 12, 2016 at 03:19:25PM -0600, Jens Axboe wrote: > > > > FWIW, this is the commit that fixes it: > > > > commit 39d13e67ef1f4b327c68431f8daf033a03920117 > > Author: Jens Axboe <axboe@fb.com> > > Date: Fri Aug 26 14:39:30 2016 -0600 > > > > backend: check if we need to update rusage stats, if stat_mutex is busy > > > > 2.14 and newer should not have the problem, but earlier versions may > > depending on how old... > > Unfortunately, I'm still seeing hangs in generic/299 with the latest version of fio: > > fio fio-2.14-10-g0a301e9 (Fri, 23 Sep 2016 11:57:00 -0600) > > If I use an older fio, it reliably does not hang. What can I do to > help debug this? > > As I said, I can attach to the hanging fio using a gdb and give you > stackdumps for all of the threads if that would be helpful. I don't see fio from git hanging on 299 (just a lot of "fallocate: No space left on device" in results/generic/299.full): # fio -v fio-2.14-27-gafd2c # mount tmpfs -t tmpfs -o size=26g /ramdisk/ # fallocate -l 12g /ramdisk/testvol1 && losetup -f /ramdisk/testvol1 # fallocate -l 12g /ramdisk/scratch0 && losetup -f /ramdisk/scratch0 root@ttip:/home/mator/xfstests-dev# cat local.config export TEST_DEV=/dev/loop0 export TEST_DIR=/testvol export SCRATCH_DEV_POOL="/dev/loop1" #export SCRATCH_DEV_POOL="/dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3" export SCRATCH_MNT=/mnt/scratch # mkfs.xfs -f /dev/loop0 meta-data=/dev/loop0 isize=512 agcount=4, agsize=786432 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=0, rmapbt=0 data = bsize=4096 blocks=3145728, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal log bsize=4096 blocks=2560, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 root@ttip:/home/mator/xfstests-dev# ./check generic/299 FSTYP -- xfs (non-debug) PLATFORM -- Linux/sparc64 ttip 4.8.0+ MKFS_OPTIONS -- -f -bsize=4096 /dev/loop1 MOUNT_OPTIONS -- /dev/loop1 /mnt/scratch generic/299 103s Ran: generic/299 Passed all 1 tests kernel logs: $ journalctl -k -e Oct 13 15:35:41 ttip unknown: run fstests generic/299 at 2016-10-13 15:35:41 Oct 13 15:35:41 ttip kernel: XFS (loop1): Unmounting Filesystem Oct 13 15:35:42 ttip kernel: XFS (loop1): Mounting V5 Filesystem Oct 13 15:35:42 ttip kernel: XFS (loop1): Ending clean mount Oct 13 15:37:24 ttip kernel: XFS (loop1): Unmounting Filesystem Oct 13 15:37:25 ttip kernel: XFS (loop0): Unmounting Filesystem results/generic/299.full in attachment [-- Attachment #2: 299.full.gz --] [-- Type: application/x-gzip, Size: 3351 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-13 13:08 ` Anatoly Pugachev @ 2016-10-13 13:36 ` Anatoly Pugachev 2016-10-13 14:28 ` Jens Axboe 0 siblings, 1 reply; 29+ messages in thread From: Anatoly Pugachev @ 2016-10-13 13:36 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Jens Axboe, Dave Chinner, linux-ext4, fstests, tarasov [-- Attachment #1: Type: text/plain, Size: 5362 bytes --] On Thu, Oct 13, 2016 at 04:08:37PM +0300, Anatoly Pugachev wrote: > On Wed, Oct 12, 2016 at 10:15:52PM -0400, Theodore Ts'o wrote: > > On Wed, Oct 12, 2016 at 03:19:25PM -0600, Jens Axboe wrote: > > > > > > FWIW, this is the commit that fixes it: > > > > > > commit 39d13e67ef1f4b327c68431f8daf033a03920117 > > > Author: Jens Axboe <axboe@fb.com> > > > Date: Fri Aug 26 14:39:30 2016 -0600 > > > > > > backend: check if we need to update rusage stats, if stat_mutex is busy > > > > > > 2.14 and newer should not have the problem, but earlier versions may > > > depending on how old... > > > > Unfortunately, I'm still seeing hangs in generic/299 with the latest version of fio: > > > > fio fio-2.14-10-g0a301e9 (Fri, 23 Sep 2016 11:57:00 -0600) > > > > If I use an older fio, it reliably does not hang. What can I do to > > help debug this? > > > > As I said, I can attach to the hanging fio using a gdb and give you > > stackdumps for all of the threads if that would be helpful. I'm sorry, I didn't notice that this is only with ext4; it hangs for me with ext4 as well. # mkfs.ext4 /dev/loop0 mke2fs 1.43.3 (04-Sep-2016) /dev/loop0 contains a xfs file system Proceed anyway? (y,n) y Discarding device blocks: done Creating filesystem with 3145728 4k blocks and 786432 inodes Filesystem UUID: 18abfd64-3395-43c4-9c7d-f2c312e7516d Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208 Allocating group tables: done Writing inode tables: done Creating journal (16384 blocks): done Writing superblocks and filesystem accounting information: done root@ttip:/home/mator/xfstests-dev# ./check generic/299 FSTYP -- ext4 PLATFORM -- Linux/sparc64 ttip 4.8.0+ MKFS_OPTIONS -- /dev/loop1 MOUNT_OPTIONS -- -o acl,user_xattr /dev/loop1 /mnt/scratch generic/299 103s ... 
104s (hangs) on another console: $ pstree -Alap | `-check,219419 ./check generic/299 | `-fsck,221209 -t ext4 -nf /dev/loop1 | `-fsck.ext4,221210 -nf /dev/loop1 fsck.ext4 can't finish its work # strace \-p 221210 strace: Process 221210 attached pread64(4, "\n\363R\0T\1\0\0\0\0\0\0\37\350\"\0\1\0\0\0x+\26\0*\350\"\0\1\0\0\0"..., 4096, 12793810944) = 4096 pread64(4, "\n\363\37\1T\1\0\0\0\0\0\0\260\356\"\0\1\0\0\0\227\256/\0\261\356\"\0\1\0\0\0"..., 4096, 3918262272) = 4096 pread64(4, "\n\363\357\0T\1\0\0\0\0045\254\32\6#\0\1\0\0\0\275\322 \0%\6#\0\1\0\0\0"..., 4096, 974860288) = 4096 pread64(4, "\n\363\226\0T\1\0\0\0\0\0\2\377\31#\0\1\0\0\0\311\257/\0\33\32#\0\1\0\0\0"..., 4096, 3927293952) = 4096 brk(0x10000d08000) = 0x10000d08000 pread64(4, "\n\363\306\0T\1\0\0\0\0\0\2L%#\0\1\0\0\0<3\10\0Y%#\0\1\0\0\0"..., 4096, 6654865408) = 4096 pread64(4, "\n\363\3\0T\1\0\0\0\0\0\0\2124#\0\1\0\0\0\272\263\2\0\2344#\0\1\0\0\0"..., 4096, 3800674304) = 4096 pread64(4, "\n\363\254\0T\1\0\0!\277\301~\3074#\0\1\0\0\0\"\310\16\0\3514#\0\1\0\0\0"..., 4096, 2717888512) = 4096 pread64(4, "\n\363Q\1T\1\0\0\255\255\255\255\307B#\0\1\0\0\0S\231\6\0\313B#\0\1\0\0\0"..., 4096, 4038303744) = 4096 pread64(4, "\n\363/\0T\1\0\0&o\r\201e^#\0\1\0\0\0\233\237\f\0\202^#\0\1\0\0\0"..., 4096, 4184920064) = 4096 pread64(4, "\n\363\244\0T\1\0\0\305\305\305\305\266b#\0\1\0\0\0n]\21\0\322b#\0\1\0\0\0"..., 4096, 6043922432) = 4096 pread64(4, "\n\363\"\1T\1\0\0\0\0\0\2\310q#\0\1\0\0\0 \337\26\0\335q#\0\1\0\0\0"..., 4096, 4194627584) = 4096 pread64(4, "\n\363\376\0T\1\0\0\0\0\0\2F\211#\0\1\0\0\0 \256/\0M\211#\0\1\0\0\0"..., 4096, 6049001472) = 4096 pread64(4, "\n\363]\0T\1\0\0\0\0\0\1\16\235#\0\1\0\0\0\272_\30\0\20\235#\0\1\0\0\0"..., 4096, 6567260160) = 4096 pread64(4, "\n\363\343\0T\1\0\0eeee\317\243#\0\1\0\0\0\264\222\20\0\332\243#\0\1\0\0\0"..., 4096, 6083014656) = 4096 pread64(4, "\n\363\332\0T\1\0\0\36\255\246d\216\264#\0\1\0\0\0.\212\17\0\242\264#\0\1\0\0\0"..., 4096, 8083423232) = 4096 pread64(4, "\n\363\334\0T\1\0\0EEEEW\306#\0\1\0\0\0^\273\20\0g\306#\0\1\0\0\0"..., 4096, 6083252224) = 4096 pread64(4, "\n\363\333\0T\1\0\0\0\0\0\0@\330#\0\1\0\0\0Z\353\23\0N\330#\0\1\0\0\0"..., 4096, 7774924800) = 4096 pread64(4, "\n\363x\0T\1\0\0{{{{H\347#\0\1\0\0\0\232~\27\0M\347#\0\1\0\0\0"..., 4096, 4735361024) = 4096 pread64(4, "\n\363\367\0T\1\0\0\377\377\377\377#\361#\0\1\0\0\0O\t\27\0b\361#\0\1\0\0\0"..., 4096, 4448808960) = 4096 pread64(4, "\n\363\t\0T\1\0\0aaaa\271\2$\0\1\0\0\0\351I\21\0\340\2$\0\1\0\0\0"..., 4096, 5948190720) = 4096 pread64(4, "\n\363\25\0T\1\0\0\0\0\0\0@\3$\0\1\0\0\0t\33.\0D\3$\0\1\0\0\0"..., 4096, 1714593792) = 4096 pread64(4, "\n\363b\0T\1\0\0\0\0\0\0g\4$\0\1\0\0\0\332\366\0\0n\4$\0\1\0\0\0"..., 4096, 1635794944) = 4096 pread64(4, "\n\363)\1T\1\0\0\0\0\0\0%\v$\0\1\0\0\0e\230\17\0003\v$\0\1\0\0\0"..., 4096, 2111643648) = 4096 pread64(4, "\n\363<\0T\1\0\0\0\0\0\0D $\0\1\0\0\0\1\355 \0T $\0\1\0\0\0"..., 4096, 165666816) = 4096 brk(0x10000d2a000) = 0x10000d2a000 pread64(4, "\n\363\n\0T\1\0\0\301\301\301\3012%$\0\1\0\0\0r\37\22\0004%$\0\1\0\0\0"..., 4096, 4871196672) = 4096 pread64(4, "\n\363\225\0T\1\0\0\0\0\0\1(&$\0\1\0\0\0\213\"\22\0@&$\0\1\0\0\0"..., 4096, 5401088000) = 4096 ^Cstrace: Process 221210 detached $ dpkg -l e2fsprogs ii e2fsprogs 1.43.3-1 sparc64 ext2/ext3/ext4 file system utilities 299.full in attachment [-- Attachment #2: 299.full.gz --] [-- Type: application/x-gzip, Size: 3245 bytes --] ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Test generic/299 stalling forever 2016-10-13 13:36 ` Anatoly Pugachev @ 2016-10-13 14:28 ` Jens Axboe 0 siblings, 0 replies; 29+ messages in thread From: Jens Axboe @ 2016-10-13 14:28 UTC (permalink / raw) To: Anatoly Pugachev, Theodore Ts'o Cc: Dave Chinner, linux-ext4, fstests, tarasov On 10/13/2016 07:36 AM, Anatoly Pugachev wrote: > On Thu, Oct 13, 2016 at 04:08:37PM +0300, Anatoly Pugachev wrote: >> On Wed, Oct 12, 2016 at 10:15:52PM -0400, Theodore Ts'o wrote: >>> On Wed, Oct 12, 2016 at 03:19:25PM -0600, Jens Axboe wrote: >>>> >>>> FWIW, this is the commit that fixes it: >>>> >>>> commit 39d13e67ef1f4b327c68431f8daf033a03920117 >>>> Author: Jens Axboe <axboe@fb.com> >>>> Date: Fri Aug 26 14:39:30 2016 -0600 >>>> >>>> backend: check if we need to update rusage stats, if stat_mutex is busy >>>> >>>> 2.14 and newer should not have the problem, but earlier versions may >>>> depending on how old... >>> >>> Unfortunately, I'm still seeing hangs in generic/299 with the latest version of fio: >>> >>> fio fio-2.14-10-g0a301e9 (Fri, 23 Sep 2016 11:57:00 -0600) >>> >>> If I use an older fio, it reliably does not hang. What can I do to >>> help debug this? >>> >>> As I said, I can attach to the hanging fio using a gdb and give you >>> stackdumps for all of the threads if that would be helpful. > > I'm sorry, didn't read that it only with ext4, it hangs for me with ext4 as > well. > > # mkfs.ext4 /dev/loop0 > mke2fs 1.43.3 (04-Sep-2016) > /dev/loop0 contains a xfs file system > Proceed anyway? (y,n) y > Discarding device blocks: done > Creating filesystem with 3145728 4k blocks and 786432 inodes > Filesystem UUID: 18abfd64-3395-43c4-9c7d-f2c312e7516d > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208 > > Allocating group tables: done > Writing inode tables: done > Creating journal (16384 blocks): done > Writing superblocks and filesystem accounting information: done > > root@ttip:/home/mator/xfstests-dev# ./check generic/299 > FSTYP -- ext4 > PLATFORM -- Linux/sparc64 ttip 4.8.0+ > MKFS_OPTIONS -- /dev/loop1 > MOUNT_OPTIONS -- -o acl,user_xattr /dev/loop1 /mnt/scratch > > generic/299 103s ... 
104s > > (hangs) > > on another console: > > $ pstree -Alap > > | `-check,219419 ./check generic/299 > | `-fsck,221209 -t ext4 -nf /dev/loop1 > | `-fsck.ext4,221210 -nf /dev/loop1 > > fsck.ext4 can't finish its work > > # strace \-p 221210 > strace: Process 221210 attached > pread64(4, "\n\363R\0T\1\0\0\0\0\0\0\37\350\"\0\1\0\0\0x+\26\0*\350\"\0\1\0\0\0"..., 4096, 12793810944) = 4096 > pread64(4, "\n\363\37\1T\1\0\0\0\0\0\0\260\356\"\0\1\0\0\0\227\256/\0\261\356\"\0\1\0\0\0"..., 4096, 3918262272) = 4096 > pread64(4, "\n\363\357\0T\1\0\0\0\0045\254\32\6#\0\1\0\0\0\275\322 \0%\6#\0\1\0\0\0"..., 4096, 974860288) = 4096 > pread64(4, "\n\363\226\0T\1\0\0\0\0\0\2\377\31#\0\1\0\0\0\311\257/\0\33\32#\0\1\0\0\0"..., 4096, 3927293952) = 4096 > brk(0x10000d08000) = 0x10000d08000 > pread64(4, "\n\363\306\0T\1\0\0\0\0\0\2L%#\0\1\0\0\0<3\10\0Y%#\0\1\0\0\0"..., 4096, 6654865408) = 4096 > pread64(4, "\n\363\3\0T\1\0\0\0\0\0\0\2124#\0\1\0\0\0\272\263\2\0\2344#\0\1\0\0\0"..., 4096, 3800674304) = 4096 > pread64(4, "\n\363\254\0T\1\0\0!\277\301~\3074#\0\1\0\0\0\"\310\16\0\3514#\0\1\0\0\0"..., 4096, 2717888512) = 4096 > pread64(4, "\n\363Q\1T\1\0\0\255\255\255\255\307B#\0\1\0\0\0S\231\6\0\313B#\0\1\0\0\0"..., 4096, 4038303744) = 4096 > pread64(4, "\n\363/\0T\1\0\0&o\r\201e^#\0\1\0\0\0\233\237\f\0\202^#\0\1\0\0\0"..., 4096, 4184920064) = 4096 > pread64(4, "\n\363\244\0T\1\0\0\305\305\305\305\266b#\0\1\0\0\0n]\21\0\322b#\0\1\0\0\0"..., 4096, 6043922432) = 4096 > pread64(4, "\n\363\"\1T\1\0\0\0\0\0\2\310q#\0\1\0\0\0 \337\26\0\335q#\0\1\0\0\0"..., 4096, 4194627584) = 4096 > pread64(4, "\n\363\376\0T\1\0\0\0\0\0\2F\211#\0\1\0\0\0 \256/\0M\211#\0\1\0\0\0"..., 4096, 6049001472) = 4096 > pread64(4, "\n\363]\0T\1\0\0\0\0\0\1\16\235#\0\1\0\0\0\272_\30\0\20\235#\0\1\0\0\0"..., 4096, 6567260160) = 4096 > pread64(4, "\n\363\343\0T\1\0\0eeee\317\243#\0\1\0\0\0\264\222\20\0\332\243#\0\1\0\0\0"..., 4096, 6083014656) = 4096 > pread64(4, "\n\363\332\0T\1\0\0\36\255\246d\216\264#\0\1\0\0\0.\212\17\0\242\264#\0\1\0\0\0"..., 4096, 8083423232) = 4096 > pread64(4, "\n\363\334\0T\1\0\0EEEEW\306#\0\1\0\0\0^\273\20\0g\306#\0\1\0\0\0"..., 4096, 6083252224) = 4096 > pread64(4, "\n\363\333\0T\1\0\0\0\0\0\0@\330#\0\1\0\0\0Z\353\23\0N\330#\0\1\0\0\0"..., 4096, 7774924800) = 4096 > pread64(4, "\n\363x\0T\1\0\0{{{{H\347#\0\1\0\0\0\232~\27\0M\347#\0\1\0\0\0"..., 4096, 4735361024) = 4096 > pread64(4, "\n\363\367\0T\1\0\0\377\377\377\377#\361#\0\1\0\0\0O\t\27\0b\361#\0\1\0\0\0"..., 4096, 4448808960) = 4096 > pread64(4, "\n\363\t\0T\1\0\0aaaa\271\2$\0\1\0\0\0\351I\21\0\340\2$\0\1\0\0\0"..., 4096, 5948190720) = 4096 > pread64(4, "\n\363\25\0T\1\0\0\0\0\0\0@\3$\0\1\0\0\0t\33.\0D\3$\0\1\0\0\0"..., 4096, 1714593792) = 4096 > pread64(4, "\n\363b\0T\1\0\0\0\0\0\0g\4$\0\1\0\0\0\332\366\0\0n\4$\0\1\0\0\0"..., 4096, 1635794944) = 4096 > pread64(4, "\n\363)\1T\1\0\0\0\0\0\0%\v$\0\1\0\0\0e\230\17\0003\v$\0\1\0\0\0"..., 4096, 2111643648) = 4096 > pread64(4, "\n\363<\0T\1\0\0\0\0\0\0D $\0\1\0\0\0\1\355 \0T $\0\1\0\0\0"..., 4096, 165666816) = 4096 > brk(0x10000d2a000) = 0x10000d2a000 > pread64(4, "\n\363\n\0T\1\0\0\301\301\301\3012%$\0\1\0\0\0r\37\22\0004%$\0\1\0\0\0"..., 4096, 4871196672) = 4096 > pread64(4, "\n\363\225\0T\1\0\0\0\0\0\1(&$\0\1\0\0\0\213\"\22\0@&$\0\1\0\0\0"..., 4096, 5401088000) = 4096 > ^Cstrace: Process 221210 detached > > $ dpkg -l e2fsprogs > ii e2fsprogs 1.43.3-1 sparc64 ext2/ext3/ext4 file system utilities If it's fsck that's hung, then that's a different issue than the one Ted is reporting, where it is fio that is hung. 
I've run your test case 10 times this morning, and I haven't seen any issues. I'll try with two nvme devices and see if that changes the situation for me. Ted, what devices are you using for data/scratch? -- Jens Axboe ^ permalink raw reply [flat|nested] 29+ messages in thread
Thread overview: 29+ messages 2015-06-18 15:53 Test generic/299 stalling forever Theodore Ts'o 2015-06-18 16:25 ` Eric Whitney 2015-06-18 23:34 ` Dave Chinner 2015-06-19 2:56 ` Theodore Ts'o 2016-09-29 4:37 ` Theodore Ts'o 2016-10-12 15:46 ` Jens Axboe 2016-10-12 21:14 ` Dave Chinner 2016-10-12 21:19 ` Jens Axboe 2016-10-13 2:15 ` Theodore Ts'o 2016-10-13 2:39 ` Jens Axboe 2016-10-13 23:19 ` Theodore Ts'o 2016-10-18 18:01 ` Theodore Ts'o 2016-10-19 14:06 ` Jens Axboe 2016-10-19 17:49 ` Jens Axboe 2016-10-19 20:32 ` Theodore Ts'o 2016-10-20 14:22 ` Jens Axboe 2016-10-21 22:15 ` Theodore Ts'o 2016-10-23 2:02 ` Theodore Ts'o 2016-10-23 14:32 ` Jens Axboe 2016-10-23 19:33 ` Theodore Ts'o 2016-10-23 21:24 ` Theodore Ts'o 2016-10-24 1:41 ` Jens Axboe 2016-10-24 3:38 ` Theodore Ts'o 2016-10-24 16:28 ` Jens Axboe 2016-10-25 2:54 ` Theodore Ts'o 2016-10-25 2:59 ` Jens Axboe 2016-10-13 13:08 ` Anatoly Pugachev 2016-10-13 13:36 ` Anatoly Pugachev 2016-10-13 14:28 ` Jens Axboe