* DIO process stuck apparently due to dioread_nolock (3.0)
@ 2011-08-10 10:51 Michael Tokarev
  2011-08-11 11:59 ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Tokarev @ 2011-08-10 10:51 UTC (permalink / raw)
To: linux-ext4

Hello.

For a few days I've been evaluating various storage options. I'm
interested in concurrent direct I/O (oracle rdbms workload).

I noticed that somehow, ext4fs in a mixed read-write test greatly
prefers writes over reads - writes go at full speed while reads are
almost non-existent.

Sandeen on IRC pointed me at the dioread_nolock mount option, which
I tried with great results, if not for one "but".

There's a deadlock somewhere, which I can't trigger "on demand" -
I can't hit the right condition. It has happened twice in a row
already, each time after the same scenario (more about that later).

When it happens, a process doing direct AIO stalls infinitely, with
the following backtrace:

[87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
[87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
[87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
[87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
[87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
[87550.760245] Call Trace:
[87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
[87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
[87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
[87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

At this point, the process in question can't be killed or stopped.
Yes, it's an oracle DB, and I can kill all other processes of this
instance (this one is lgwr, aka log writer), but the stuck process
will continue to be stuck, so it is not an inter-process deadlock.

echo "w" > /proc/sysrq-trigger shows only that process, with the
same stack trace.

This is a 3.0.1 kernel from kernel.org (amd64 arch). The system is
a relatively large box (IBM System x3850 X5). So far, I've seen
this issue twice, and each time in the following scenario:

I copy an oracle database from another machine to a filesystem
mounted with dioread_nolock, and right after the copy completes,
I start the database. And immediately when Oracle opens its DB
("Database opened") I see a stuck lgwr process like the one above.

So I suspect it happens when there are some unwritten files in the
buffer/page cache and some process tries to do direct writes.

I haven't seen this happening without dioread_nolock, but since I
don't have an easy reproducer I can't say this mount option is a
requirement. So far, I was able to trigger it only after a large db
copy; with a small database I created in order to try to reproduce
it, the issue does not happen.

And sure thing, when it happens, the only way to clean up is to
forcibly reboot the machine (echo b > sysrq-trigger).

I'll continue experiments in the hope of finding an easier
reproducer, but the problem is that I have little time left before
the machine in question goes into production. So if anyone has
hints for this issue, please share.. ;)

Thank you!

/mjt

^ permalink raw reply	[flat|nested] 43+ messages in thread
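The I/O path implicated above - io_submit(2) driving an O_DIRECT write
through do_io_submit -> aio_run_iocb -> aio_rw_vect_retry ->
ext4_file_write - can be exercised from userspace with a few lines of
libaio code. The sketch below is illustrative only (it is not Oracle's
I/O code; the file name and sizes are assumptions), but it submits the
same kind of direct asynchronous write that gets stuck here. Build
with: cc -o aiowrite aiowrite.c -laio

/* aiowrite.c - submit one O_DIRECT asynchronous write.  Illustrative
 * only; file name and sizes are made-up. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd;

	fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0) {
		perror("setup");
		return 1;
	}
	memset(buf, 0xab, 4096);	/* O_DIRECT needs an aligned buffer */

	if (io_setup(8, &ctx) < 0) {	/* allow up to 8 in-flight requests */
		perror("io_setup");
		return 1;
	}
	io_prep_pwrite(&cb, fd, buf, 4096, 0);	/* 4k at offset 0: fully aligned */
	if (io_submit(ctx, 1, cbs) != 1) {	/* in the reported hang, the task
						 * blocks inside io_submit itself */
		perror("io_submit");
		return 1;
	}
	io_getevents(ctx, 1, 1, &ev, NULL);	/* wait for the completion */

	io_destroy(ctx);
	close(fd);
	free(buf);
	return 0;
}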
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-10 10:51 DIO process stuck apparently due to dioread_nolock (3.0) Michael Tokarev @ 2011-08-11 11:59 ` Jan Kara 2011-08-11 12:21 ` Michael Tokarev 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2011-08-11 11:59 UTC (permalink / raw) To: Michael Tokarev; +Cc: linux-ext4 Hello, On Wed 10-08-11 14:51:17, Michael Tokarev wrote: > For a few days I'm evaluating various options to use > storage. I'm interested in concurrent direct I/O > (oracle rdbms workload). > > I noticed that somehow, ext4fs in mixed read-write > test greatly prefers writes over reads - writes goes > at full speed while reads are almost non-existent. > > Sandeen on IRC pointed me at dioread_nolock mount > option, which I tried with great results, if not > one "but". > > There's a deadlock somewhere, which I can't trigger > "on demand" - I can't hit the right condition. It > happened twice in a row already, each time after the > same scenario (more about that later). > > When it happens, a process doing direct AIO stalls > infinitely, with the following backtrace: > > [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds. > [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000 > [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0 > [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0 > [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2 > [87550.760245] Call Trace: > [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520 > [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170 > [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140 > [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] > [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 > [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] > [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 > [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 > [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 > [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b Hmm, the stack trace does not quite make sense to me - the part between __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we blocked in ext4_file_write() but I don't see any place there where we would allocate memory. By any chance, are there messages like "Unaligned AIO/DIO on inode ..." in the kernel log? > At this point, the process in question can't be killed or > stopped. Yes it's oracle DB, and I can kill all other processes > of this instance (this one is lgwr, aka log writer), but the stuck > process will continue to be stuck, so it is not an inter-process > deadlock. > > echo "w" > /proc/sysrq-trigger shows only that process, with the > same stack trace. > > This is 3.0.1 kernel from kernel.org (amd64 arch). The system is > a relatively large box (IBM System x3850 X5). So far, I've seen > this issue twice, and each time in the following scenario: > > I copy an oracle database from another machine to filesystem > mounted with dioread_nolock, and right after the copy completes, > I start the database. And immediately when Oracle opens its > DB ("Database opened") I see stuck lgwr process like above. 
> > So I suspect it happens when there are some unwritten files > in buffer/page cache and some process tries to do direct > writes. > > I haven't seen this happening without dioread_nolock, but since > I don't have an easy reproducer I can't say this mount option > is a requiriment. So far, I was able to trigger it only after > large db copy, with small database I created in order to try > to reproduce it the issue does not happen. > > And sure thing, when it happens, the only way to clean up is > to forcible reboot the machine (echo b > sysrq-trigger). > > I'll continue experiments in a hope to find an easier reproducer, > but the problem is that I've little time left before the machine > in question will go into production. So if anyone have hints > for this issue, please share.. ;) Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-11 11:59 ` Jan Kara
@ 2011-08-11 12:21   ` Michael Tokarev
  2011-08-11 14:01     ` Jan Kara
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Tokarev @ 2011-08-11 12:21 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4

11.08.2011 15:59, Jan Kara wrote:
> Hello,
>
> On Wed 10-08-11 14:51:17, Michael Tokarev wrote:
>> For a few days I've been evaluating various storage options.
>> I'm interested in concurrent direct I/O (oracle rdbms workload).
>>
>> I noticed that somehow, ext4fs in a mixed read-write test greatly
>> prefers writes over reads - writes go at full speed while reads
>> are almost non-existent.
>>
>> Sandeen on IRC pointed me at the dioread_nolock mount option,
>> which I tried with great results, if not for one "but".
>>
>> There's a deadlock somewhere, which I can't trigger "on demand" -
>> I can't hit the right condition. It has happened twice in a row
>> already, each time after the same scenario (more about that later).
>>
>> When it happens, a process doing direct AIO stalls infinitely,
>> with the following backtrace:
>>
>> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds.
>> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000
>> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0
>> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0
>> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2
>> [87550.760245] Call Trace:
>> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170
>> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140
>> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
> Hmm, the stack trace does not quite make sense to me - the part between
> __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we
> blocked in ext4_file_write() but I don't see any place there where we would
> allocate memory. By any chance, are there messages like "Unaligned AIO/DIO
> on inode ..." in the kernel log?

Yes, there are warnings about unaligned DIO, referring to this same
process actually. Oracle does a mostly good job of aligning its
writes (usually it does i/o in units of its blocks, which are 4Kb by
default but are set to something larger - like 16Kb - for larger
databases). Except for a few cases, and the lgwr process is one of
them (*) - it writes logfiles using 512b blocks. This is okay for a
raw device with 512-byte blocks, but ext4 expects 4k writes at
minimum.

(*) another case is writing to the control file, which is also done
in 512-byte chunks.

>> At this point, the process in question can't be killed or stopped.
>> Yes, it's an oracle DB, and I can kill all other processes of this
>> instance (this one is lgwr, aka log writer), but the stuck process
>> will continue to be stuck, so it is not an inter-process deadlock.
>>
>> echo "w" > /proc/sysrq-trigger shows only that process, with the
>> same stack trace.

Just for the record, this is from echoing "w" into sysrq-trigger:

[ 4691.585450] SysRq : Show Blocked State
[ 4691.585493] task PC stack pid father
[ 4691.585752] oracle D 0000000000000000 0 23357 1 0x00000004
[ 4691.585804] ffff88127719e080 0000000000000082 ffff881200000000 ffff881279791040
[ 4691.585883] ffff8803caf03fd8 ffff8803caf03fd8 ffff8803caf03fd8 ffff88127719e080
[ 4691.585961] ffffea0046db2510 ffffffff00000000 ffff8820431b24a8 ffffffff810eeda2
[ 4691.586039] Call Trace:
[ 4691.586079] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
[ 4691.586140] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[ 4691.586185] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 4691.586232] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 4691.586275] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 4691.586315] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 4691.586354] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 4691.586398] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

It is almost, but not quite, the same as above.

Thank you,

/mjt

>> This is a 3.0.1 kernel from kernel.org (amd64 arch). The system is
>> a relatively large box (IBM System x3850 X5). So far, I've seen
>> this issue twice, and each time in the following scenario:
>>
>> I copy an oracle database from another machine to a filesystem
>> mounted with dioread_nolock, and right after the copy completes,
>> I start the database. And immediately when Oracle opens its DB
>> ("Database opened") I see a stuck lgwr process like the one above.
>>
>> So I suspect it happens when there are some unwritten files in the
>> buffer/page cache and some process tries to do direct writes.
>>
>> I haven't seen this happening without dioread_nolock, but since I
>> don't have an easy reproducer I can't say this mount option is a
>> requirement. So far, I was able to trigger it only after a large db
>> copy; with a small database I created in order to try to reproduce
>> it, the issue does not happen.
>>
>> And sure thing, when it happens, the only way to clean up is to
>> forcibly reboot the machine (echo b > sysrq-trigger).
>>
>> I'll continue experiments in the hope of finding an easier
>> reproducer, but the problem is that I have little time left before
>> the machine in question goes into production. So if anyone has
>> hints for this issue, please share.. ;)
>
>   Honza

^ permalink raw reply	[flat|nested] 43+ messages in thread
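A side note on the alignment distinction Michael draws above: O_DIRECT
itself is satisfied by 512-byte (sector) alignment, which is why
Oracle's 512-byte redo-log writes are accepted at all, while ext4's
"Unaligned AIO/DIO" warning and its serialization of such writes key
off the filesystem block size instead. Schematically (FS_BLOCK_SIZE
and the function below are illustrative stand-ins, not ext4
identifiers):

#define FS_BLOCK_SIZE 4096	/* block size of the filesystem in question */

/* An AIO write is "unaligned" in this sense when its offset or length
 * is not a multiple of the filesystem block size, even though it
 * meets the sector alignment that O_DIRECT requires: */
static int aio_is_fs_unaligned(long long offset, unsigned long len)
{
	return (offset % FS_BLOCK_SIZE) != 0 || (len % FS_BLOCK_SIZE) != 0;
}

/* aio_is_fs_unaligned(0,   16384) -> 0   typical database-block write   */
/* aio_is_fs_unaligned(512,   512) -> 1   lgwr-style 512-byte redo write */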
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-11 12:21 ` Michael Tokarev @ 2011-08-11 14:01 ` Jan Kara 2011-08-11 20:05 ` Michael Tokarev 0 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2011-08-11 14:01 UTC (permalink / raw) To: Michael Tokarev; +Cc: Jan Kara, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 3367 bytes --] Hello, On Thu 11-08-11 16:21:42, Michael Tokarev wrote: > 11.08.2011 15:59, Jan Kara wrote: > > On Wed 10-08-11 14:51:17, Michael Tokarev wrote: > >> For a few days I'm evaluating various options to use > >> storage. I'm interested in concurrent direct I/O > >> (oracle rdbms workload). > >> > >> I noticed that somehow, ext4fs in mixed read-write > >> test greatly prefers writes over reads - writes goes > >> at full speed while reads are almost non-existent. > >> > >> Sandeen on IRC pointed me at dioread_nolock mount > >> option, which I tried with great results, if not > >> one "but". > >> > >> There's a deadlock somewhere, which I can't trigger > >> "on demand" - I can't hit the right condition. It > >> happened twice in a row already, each time after the > >> same scenario (more about that later). > >> > >> When it happens, a process doing direct AIO stalls > >> infinitely, with the following backtrace: > >> > >> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds. > >> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000 > >> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0 > >> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0 > >> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2 > >> [87550.760245] Call Trace: > >> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520 > >> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170 > >> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140 > >> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] > >> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 > >> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] > >> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 > >> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 > >> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 > >> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b > > Hmm, the stack trace does not quite make sense to me - the part between > > __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we > > blocked in ext4_file_write() but I don't see any place there where we would > > allocate memory. By any chance, are there messages like "Unaligned AIO/DIO > > on inode ..." in the kernel log? > > Yes, there are warnings about unaligned DIO, referring to this same > process actually. Oracle does almost good job at aligning writes > (usually it does i/o by its blocks which are 4Kb by default but > are set to something larger - like 16Kb - for larger database). > Except of a few cases, and lgwr process is one of them (*) - it > writes logfiles using 512b blocks. This is okay for a raw device > with 512bytes blocks, but ext4 expects 4k writes at min. > > (*) another case is writing to control file, which is also done in > 512byte chunks. Ah, OK. I think the code ext4_end_io_nolock() handling the wait queue might be racy. 
The waitqueue_active() check is missing a barrier, I think. Does the
attached (untested) patch fix the issue for you?

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-ext4-Fix-missed-wakeups.patch --]
[-- Type: text/x-patch, Size: 1388 bytes --]

From a5dd84bbe3c55b2717150ac26f8b9011d8f9181f Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 11 Aug 2011 15:57:51 +0200
Subject: [PATCH] ext4: Fix missed wakeups

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c |   15 +++++++++------
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 430c401..34d01d4 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -79,9 +79,11 @@ void ext4_free_io_end(ext4_io_end_t *io)
 		put_io_page(io->pages[i]);
 	io->num_io_pages = 0;
 	wq = ext4_ioend_wq(io->inode);
-	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count) &&
-	    waitqueue_active(wq))
-		wake_up_all(wq);
+	if (atomic_dec_and_test(&EXT4_I(io->inode)->i_ioend_count)) {
+		smp_mb__after_atomic_dec();
+		if (waitqueue_active(wq))
+			wake_up_all(wq);
+	}
 	kmem_cache_free(io_end_cachep, io);
 }
 
@@ -122,9 +124,10 @@ int ext4_end_io_nolock(ext4_io_end_t *io)
 		io->flag &= ~EXT4_IO_END_UNWRITTEN;
 		/* Wake up anyone waiting on unwritten extent conversion */
 		wq = ext4_ioend_wq(io->inode);
-		if (atomic_dec_and_test(&EXT4_I(inode)->i_aiodio_unwritten) &&
-		    waitqueue_active(wq)) {
-			wake_up_all(wq);
+		if (atomic_dec_and_test(&EXT4_I(inode)->i_aiodio_unwritten)) {
+			smp_mb__after_atomic_dec();
+			if (waitqueue_active(wq))
+				wake_up_all(wq);
 		}
 	}
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 43+ messages in thread
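For readers unfamiliar with the pattern Jan suspects here: this is the
classic missed-wakeup race between a waker that tests
waitqueue_active() and a waiter going to sleep. A compressed,
schematic view follows - this is not compilable code; cnt stands for
the per-inode counter and wq for the per-inode ioend wait queue of the
real ext4 code:

/* Waker (I/O completion)              Waiter (wait_event caller)
 *
 *   atomic_dec_and_test(&cnt);          prepare_to_wait(wq, &wait, ...);
 *   ...                                 ...
 *   if (waitqueue_active(wq))           if (atomic_read(&cnt) != 0)
 *           wake_up_all(wq);                    schedule();
 *
 * If the CPU (or compiler) lets the waker's read of the wait-queue
 * head pass its write of the counter, the waker can observe "queue
 * empty" while the waiter, just added to the queue, still observes
 * "cnt != 0".  No wakeup is sent, and the waiter sleeps forever.  The
 * canonical fix is a full barrier between the decrement and the
 * waitqueue_active() test, which is what the patch above inserts:
 */
if (atomic_dec_and_test(&cnt)) {
	smp_mb__after_atomic_dec();	/* order the dec before reading wq */
	if (waitqueue_active(wq))
		wake_up_all(wq);
}

(As the thread goes on to show, this particular patch does not cure
Michael's hang - the counter itself turns out to be broken - but the
barrier pattern is the standard cure for this class of race.)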
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-11 14:01 ` Jan Kara @ 2011-08-11 20:05 ` Michael Tokarev 2011-08-12 2:46 ` Jiaying Zhang 0 siblings, 1 reply; 43+ messages in thread From: Michael Tokarev @ 2011-08-11 20:05 UTC (permalink / raw) To: Jan Kara; +Cc: linux-ext4 11.08.2011 18:01, Jan Kara wrote: [] >>>> When it happens, a process doing direct AIO stalls >>>> infinitely, with the following backtrace: >>>> >>>> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds. >>>> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000 >>>> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0 >>>> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0 >>>> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2 >>>> [87550.760245] Call Trace: >>>> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520 >>>> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170 >>>> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140 >>>> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] >>>> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 >>>> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] >>>> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 >>>> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 >>>> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 >>>> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b >>> Hmm, the stack trace does not quite make sense to me - the part between >>> __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we >>> blocked in ext4_file_write() but I don't see any place there where we would >>> allocate memory. By any chance, are there messages like "Unaligned AIO/DIO >>> on inode ..." in the kernel log? >> >> Yes, there are warnings about unaligned DIO, referring to this same >> process actually. Oracle does almost good job at aligning writes >> (usually it does i/o by its blocks which are 4Kb by default but >> are set to something larger - like 16Kb - for larger database). >> Except of a few cases, and lgwr process is one of them (*) - it >> writes logfiles using 512b blocks. This is okay for a raw device >> with 512bytes blocks, but ext4 expects 4k writes at min. >> >> (*) another case is writing to control file, which is also done in >> 512byte chunks. > Ah, OK. I think the code ext4_end_io_nolock() handling the wait queue > might be racy. waitqueue_active() check is missing a barrier I think. > Does attached (untested) patch fix the issue for you? With this patch applied, I can reproduce the problem easily too: [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor. [ 1469.734114] SysRq : Show Blocked State [ 1469.734157] task PC stack pid father [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000 [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040 [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810 [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2 [ 1469.734760] Call Trace: [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520 [ 1469.734863] [<ffffffffa0123e6d>] ? 
ext4_file_write+0x20d/0x260 [ext4]
[ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

which is exactly the same as the output of the previous "w" into
sysrq-trigger.

So it's something else... :(

Thank you!

/mjt

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-11 20:05 ` Michael Tokarev @ 2011-08-12 2:46 ` Jiaying Zhang 2011-08-12 6:23 ` Michael Tokarev 0 siblings, 1 reply; 43+ messages in thread From: Jiaying Zhang @ 2011-08-12 2:46 UTC (permalink / raw) To: Michael Tokarev; +Cc: Jan Kara, linux-ext4 Hi Michael, Could you try your test with the patch I just posted: http://marc.info/?l=linux-ext4&m=131311627101278&w=2 and see whether it fixes the problem? Thanks! Jiaying On Thu, Aug 11, 2011 at 1:05 PM, Michael Tokarev <mjt@tls.msk.ru> wrote: > 11.08.2011 18:01, Jan Kara wrote: > [] >>>>> When it happens, a process doing direct AIO stalls >>>>> infinitely, with the following backtrace: >>>>> >>>>> [87550.759848] INFO: task oracle:23176 blocked for more than 120 seconds. >>>>> [87550.759892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >>>>> [87550.759955] oracle D 0000000000000000 0 23176 1 0x00000000 >>>>> [87550.760006] ffff8820457b47d0 0000000000000082 ffff880600000000 ffff881278e3f7d0 >>>>> [87550.760085] ffff8806215c1fd8 ffff8806215c1fd8 ffff8806215c1fd8 ffff8820457b47d0 >>>>> [87550.760163] ffffea0010bd7c68 ffffffff00000000 ffff882045512ef8 ffffffff810eeda2 >>>>> [87550.760245] Call Trace: >>>>> [87550.760285] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520 >>>>> [87550.760327] [<ffffffff81111ded>] ? kmem_getpages+0x5d/0x170 >>>>> [87550.760367] [<ffffffff81112e58>] ? ____cache_alloc_node+0x48/0x140 >>>>> [87550.760430] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] >>>>> [87550.760475] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 >>>>> [87550.760523] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] >>>>> [87550.760566] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 >>>>> [87550.760607] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 >>>>> [87550.760646] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 >>>>> [87550.760689] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b >>>> Hmm, the stack trace does not quite make sense to me - the part between >>>> __do_fault and aio_rw_vect_retry is somehow broken. I can imagine we >>>> blocked in ext4_file_write() but I don't see any place there where we would >>>> allocate memory. By any chance, are there messages like "Unaligned AIO/DIO >>>> on inode ..." in the kernel log? >>> >>> Yes, there are warnings about unaligned DIO, referring to this same >>> process actually. Oracle does almost good job at aligning writes >>> (usually it does i/o by its blocks which are 4Kb by default but >>> are set to something larger - like 16Kb - for larger database). >>> Except of a few cases, and lgwr process is one of them (*) - it >>> writes logfiles using 512b blocks. This is okay for a raw device >>> with 512bytes blocks, but ext4 expects 4k writes at min. >>> >>> (*) another case is writing to control file, which is also done in >>> 512byte chunks. >> Ah, OK. I think the code ext4_end_io_nolock() handling the wait queue >> might be racy. waitqueue_active() check is missing a barrier I think. >> Does attached (untested) patch fix the issue for you? > > With this patch applied, I can reproduce the problem easily too: > > [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor. 
> [ 1469.734114] SysRq : Show Blocked State
> [ 1469.734157] task PC stack pid father
> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
> [ 1469.734760] Call Trace:
> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> which is exactly the same as the output of the previous "w" into
> sysrq-trigger.
>
> So it's something else... :(
>
> Thank you!
>
> /mjt

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-12  2:46 ` Jiaying Zhang
@ 2011-08-12  6:23   ` Michael Tokarev
  2011-08-12  7:07     ` Michael Tokarev
  2011-08-12 13:07     ` Jan Kara
  0 siblings, 2 replies; 43+ messages in thread
From: Michael Tokarev @ 2011-08-12 6:23 UTC (permalink / raw)
To: Jiaying Zhang; +Cc: Jan Kara, linux-ext4

12.08.2011 06:46, Jiaying Zhang wrote:
> Hi Michael,
>
> Could you try your test with the patch I just posted:
> http://marc.info/?l=linux-ext4&m=131311627101278&w=2
> and see whether it fixes the problem?

No it does not. I'm now able to trigger it more or less reliably -
I need to decompress a relatively small (about 70Gb) oracle database
and try to start it (this requires a rebuild of the initrd and a
reboot, of course -- the whole thing takes about 15 minutes) - and
I see this:

[ 945.729965] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767175 by oracle; performance will be poor.
[ 960.915602] SysRq : Show Blocked State
[ 960.915650] task PC stack pid father
[ 960.915852] oracle D 0000000000000000 0 4985 1 0x00000000
[ 960.915909] ffff88103e627040 0000000000000082 ffff881000000000 ffff881078e3f7d0
[ 960.917855] ffff88103f88ffd8 ffff88103f88ffd8 ffff88103f88ffd8 ffff88103e627040
[ 960.917953] 0000000001c08400 ffff88203e98c948 ffff88207873e240 ffffffff813527c6
[ 960.918045] Call Trace:
[ 960.918092] [<ffffffff813527c6>] ? printk+0x43/0x48
[ 960.918153] [<ffffffffa01432a8>] ? ext4_msg+0x58/0x60 [ext4]
[ 960.918201] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[ 960.918252] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 960.918301] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 960.918348] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 960.918392] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 960.918436] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 960.918483] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

(Inum 5767175 is one of the oracle redologs).

Jan, I lied to you initially - I didn't even test your first patch,
because I loaded the wrong initrd. With it applied, the situation
still does not improve, and it looks exactly the same as with this
new patch by Jiaying Zhang:

[ 415.288104] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767177 by oracle; performance will be poor.
[ 422.967323] SysRq : Show Blocked State
[ 422.967494] task PC stack pid father
[ 422.967872] oracle D 0000000000000000 0 3743 1 0x00000000
[ 422.968112] ffff88203e5a2810 0000000000000086 ffff882000000000 ffff88103f403080
[ 422.968505] ffff88203eeddfd8 ffff88203eeddfd8 ffff88203eeddfd8 ffff88203e5a2810
[ 422.968895] 0000000001c08400 ffff88103f3db348 ffff88103f2fe800 ffffffff813527c6
[ 422.969287] Call Trace:
[ 422.969397] [<ffffffff813527c6>] ? printk+0x43/0x48
[ 422.969528] [<ffffffffa0143288>] ? ext4_msg+0x58/0x60 [ext4]
[ 422.969643] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
[ 422.969758] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
[ 422.969873] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
[ 422.969985] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
[ 422.970093] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
[ 422.970200] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
[ 422.970312] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

Note that in both these cases I now see a slightly different
backtrace -- both mention ext4_llseek and abort_exclusive_wait,
but the rest is different:

>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
>> [ 1469.734114] SysRq : Show Blocked State
>> [ 1469.734157] task PC stack pid father
>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
>> [ 1469.734760] Call Trace:
>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b

As Jan already pointed out, this place looks bogus, and the same can
be said about the new backtrace. So I wonder if there's some stack
corruption going on there as well.

Btw, does ext4_llseek() look sane here? Note it's called from
io_submit() -- does it _ever_ implement SEEKs?

Maybe some debugging is necessary here?

Thank you!

/mjt

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-12  6:23 ` Michael Tokarev
@ 2011-08-12  7:07   ` Michael Tokarev
  0 siblings, 0 replies; 43+ messages in thread
From: Michael Tokarev @ 2011-08-12 7:07 UTC (permalink / raw)
To: Jiaying Zhang; +Cc: Jan Kara, linux-ext4

12.08.2011 10:23, Michael Tokarev wrote:
> 12.08.2011 06:46, Jiaying Zhang wrote:
>> Hi Michael,
>>
>> Could you try your test with the patch I just posted:
>> http://marc.info/?l=linux-ext4&m=131311627101278&w=2
>> and see whether it fixes the problem?
>
> No it does not. I'm now able to trigger it more or less reliably -
> I need to decompress a relatively small (about 70Gb) oracle database
> and try to start it (this requires a rebuild of the initrd and a
> reboot, of course -- the whole thing takes about 15 minutes) - and
> I see this:

And I forgot to note an important (in my opinion anyway) detail:
I can only trigger the issue after copying the database files, i.e.
with a hot cache. After a reboot, when the files are not in the
cache, I can start this database without any issues whatsoever, and
it passes all our stress tests just fine.

So it appears the problem only happens when the cache is hot - maybe
DIO is fighting with the in-memory cache of the file or some such,
or maybe with a not-yet-written file (since the amount of memory on
this machine is large, the whole database with all its files fits in
memory).

Next time I'll try to ensure only the (problematic and small)
logfiles are hot in the cache.

Thanks,

/mjt

> [ 945.729965] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767175 by oracle; performance will be poor.
> [ 960.915602] SysRq : Show Blocked State
> [ 960.915650] task PC stack pid father
> [ 960.915852] oracle D 0000000000000000 0 4985 1 0x00000000
> [ 960.915909] ffff88103e627040 0000000000000082 ffff881000000000 ffff881078e3f7d0
> [ 960.917855] ffff88103f88ffd8 ffff88103f88ffd8 ffff88103f88ffd8 ffff88103e627040
> [ 960.917953] 0000000001c08400 ffff88203e98c948 ffff88207873e240 ffffffff813527c6
> [ 960.918045] Call Trace:
> [ 960.918092] [<ffffffff813527c6>] ? printk+0x43/0x48
> [ 960.918153] [<ffffffffa01432a8>] ? ext4_msg+0x58/0x60 [ext4]
> [ 960.918201] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 960.918252] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 960.918301] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 960.918348] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 960.918392] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 960.918436] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 960.918483] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> (Inum 5767175 is one of the oracle redologs).
>
> Jan, I lied to you initially - I didn't even test your first patch,
> because I loaded the wrong initrd. With it applied, the situation
> still does not improve, and it looks exactly the same as with this
> new patch by Jiaying Zhang:
>
> [ 415.288104] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767177 by oracle; performance will be poor.
> [ 422.967323] SysRq : Show Blocked State
> [ 422.967494] task PC stack pid father
> [ 422.967872] oracle D 0000000000000000 0 3743 1 0x00000000
> [ 422.968112] ffff88203e5a2810 0000000000000086 ffff882000000000 ffff88103f403080
> [ 422.968505] ffff88203eeddfd8 ffff88203eeddfd8 ffff88203eeddfd8 ffff88203e5a2810
> [ 422.968895] 0000000001c08400 ffff88103f3db348 ffff88103f2fe800 ffffffff813527c6
> [ 422.969287] Call Trace:
> [ 422.969397] [<ffffffff813527c6>] ? printk+0x43/0x48
> [ 422.969528] [<ffffffffa0143288>] ? ext4_msg+0x58/0x60 [ext4]
> [ 422.969643] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> [ 422.969758] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [ 422.969873] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> [ 422.969985] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [ 422.970093] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 422.970200] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 422.970312] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> Note that in both these cases I now see a slightly different
> backtrace -- both mention ext4_llseek and abort_exclusive_wait,
> but the rest is different:
>
>>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
>>> [ 1469.734114] SysRq : Show Blocked State
>>> [ 1469.734157] task PC stack pid father
>>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
>>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
>>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
>>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
>>> [ 1469.734760] Call Trace:
>>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
>>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
>>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
>>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
>>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
>>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
>>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
>>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> As Jan already pointed out, this place looks bogus, and the same can
> be said about the new backtrace. So I wonder if there's some stack
> corruption going on there as well.
>
> Btw, does ext4_llseek() look sane here? Note it's called from
> io_submit() -- does it _ever_ implement SEEKs?
>
> Maybe some debugging is necessary here?
>
> Thank you!
>
> /mjt

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-12 6:23 ` Michael Tokarev 2011-08-12 7:07 ` Michael Tokarev @ 2011-08-12 13:07 ` Jan Kara 2011-08-12 15:55 ` Michael Tokarev 1 sibling, 1 reply; 43+ messages in thread From: Jan Kara @ 2011-08-12 13:07 UTC (permalink / raw) To: Michael Tokarev; +Cc: Jiaying Zhang, Jan Kara, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 5767 bytes --] On Fri 12-08-11 10:23:41, Michael Tokarev wrote: > 12.08.2011 06:46, Jiaying Zhang wrote: > > Hi Michael, > > > > Could you try your test with the patch I just posted: > > http://marc.info/?l=linux-ext4&m=131311627101278&w=2 > > and see whether it fixes the problem? > > No it does not. I'm now able to trigger it more or > less reliable - I need to decompress a relatively > small (about 70Gb) oracle database and try to start > it (this requires a rebuild of initrd and reboot ofcourse -- > whole thing takes about 15 minutes) - and I see this: > > [ 945.729965] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767175 by oracle; performance will be poor. > [ 960.915602] SysRq : Show Blocked State > [ 960.915650] task PC stack pid father > [ 960.915852] oracle D 0000000000000000 0 4985 1 0x00000000 > [ 960.915909] ffff88103e627040 0000000000000082 ffff881000000000 ffff881078e3f7d0 > [ 960.917855] ffff88103f88ffd8 ffff88103f88ffd8 ffff88103f88ffd8 ffff88103e627040 > [ 960.917953] 0000000001c08400 ffff88203e98c948 ffff88207873e240 ffffffff813527c6 > [ 960.918045] Call Trace: > [ 960.918092] [<ffffffff813527c6>] ? printk+0x43/0x48 > [ 960.918153] [<ffffffffa01432a8>] ? ext4_msg+0x58/0x60 [ext4] > [ 960.918201] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] > [ 960.918252] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 > [ 960.918301] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] > [ 960.918348] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 > [ 960.918392] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 > [ 960.918436] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 > [ 960.918483] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b > > (Inum 5767175 is one of oracle redologs). > > Jan, I lied to you initially - I didn't even test your first patch, > because I loaded the wrong initrd. With it applied, situation does > not improve still, and it looks exactly the same as with this new > patch by Jiaying Zhang: > > [ 415.288104] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767177 by oracle; performance will be poor. > [ 422.967323] SysRq : Show Blocked State > [ 422.967494] task PC stack pid father > [ 422.967872] oracle D 0000000000000000 0 3743 1 0x00000000 > [ 422.968112] ffff88203e5a2810 0000000000000086 ffff882000000000 ffff88103f403080 > [ 422.968505] ffff88203eeddfd8 ffff88203eeddfd8 ffff88203eeddfd8 ffff88203e5a2810 > [ 422.968895] 0000000001c08400 ffff88103f3db348 ffff88103f2fe800 ffffffff813527c6 > [ 422.969287] Call Trace: > [ 422.969397] [<ffffffff813527c6>] ? printk+0x43/0x48 > [ 422.969528] [<ffffffffa0143288>] ? ext4_msg+0x58/0x60 [ext4] > [ 422.969643] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] > [ 422.969758] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 > [ 422.969873] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] > [ 422.969985] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 > [ 422.970093] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 > [ 422.970200] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 > [ 422.970312] [<ffffffff81359b52>] ? 
system_call_fastpath+0x16/0x1b
>
> Note that in both these cases I now see a slightly different
> backtrace -- both mention ext4_llseek and abort_exclusive_wait,
> but the rest is different:
>
> >> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
> >> [ 1469.734114] SysRq : Show Blocked State
> >> [ 1469.734157] task PC stack pid father
> >> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000
> >> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
> >> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
> >> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
> >> [ 1469.734760] Call Trace:
> >> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> >> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> >> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> >> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> >> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> >> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> >> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> >> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
>
> As Jan already pointed out, this place looks bogus, and the same can
> be said about the new backtrace. So I wonder if there's some stack
> corruption going on there as well.
  Probably not, just garbage on the stack confuses the stack unwinder.
Note that e.g. ext4_llseek() is at the end of the function, as is
abort_exclusive_wait(), which probably means these functions have
already finished and just left their addresses on the stack. Could you
disassemble your ext4_file_write() function and check where offset
0x20d is? Sadly you won't see where calls are going when ext4 is
compiled as a module (the module is linked when loaded into the
kernel), so it might be easier to compile ext4 into the kernel and
disassemble the function in the vmlinux file.

> Btw, does ext4_llseek() look sane here? Note it's called from
> io_submit() -- does it _ever_ implement SEEKs?
>
> Maybe some debugging is necessary here?
  Yes. The trouble is I'm not completely sure where we are hanging yet.
We should know more from your disassembly. But you can try running with
the attached debug patch - maybe it shows something interesting.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: ext4-debug.diff --]
[-- Type: text/x-patch, Size: 485 bytes --]

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index e4095e9..a897474 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -59,6 +59,8 @@ static void ext4_aiodio_wait(struct inode *inode)
 {
 	wait_queue_head_t *wq = ext4_ioend_wq(inode);
 
+	if (atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) != 0)
+		printk("Going to wait for %lu aios\n", (unsigned long)atomic_read(&EXT4_I(inode)->i_aiodio_unwritten));
 	wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_aiodio_unwritten) == 0));
 }

^ permalink raw reply related	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-12 13:07 ` Jan Kara @ 2011-08-12 15:55 ` Michael Tokarev 2011-08-12 17:01 ` Eric Sandeen 2011-08-12 21:19 ` Jan Kara 0 siblings, 2 replies; 43+ messages in thread From: Michael Tokarev @ 2011-08-12 15:55 UTC (permalink / raw) To: Jan Kara; +Cc: Jiaying Zhang, linux-ext4 12.08.2011 17:07, Jan Kara wrote: [] >>>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor. >>>> [ 1469.734114] SysRq : Show Blocked State >>>> [ 1469.734157] task PC stack pid father >>>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000 >>>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040 >>>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810 >>>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2 >>>> [ 1469.734760] Call Trace: >>>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520 >>>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] >>>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 >>>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] >>>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 >>>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 >>>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 >>>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b [] > Probably not, just garbage on stack confuses the stack unwinder. Note > that e.g. ext4_llseek() is at the end of function as well as > abort_exclusive_wait() which probably means these functions have already > finished and just left their addresses on stack. Could you disasseble > your ext4_file_write() function and check where offset 0x20d is? Sadly you > won't see where calls are going when ext4 is compiled as a module (module > is linked when loaded into kernel) so it might be easier to compile ext4 > into the kernel and disassemble the function in vmlinux file. > >> Btw, does ext4_llseek() look sane here? Note it's called from >> aio_submit() -- does it _ever_ implement SEEKs? >> >> Maybe some debugging is neecessary here? > Yes. The trouble is I'm not completely sure where we are hanging yet. > We should know more from your disassembly. But you can try running with > attached debug patch - maybe it shows something interesting. With ext4 built-in and your patch applied: [ 429.061524] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767181 by oracle; performance will be poor. [ 429.061669] Going to wait for 18446744073709551199 aios [ 437.717942] SysRq : Show Blocked State [ 437.718109] task PC stack pid father [ 437.718528] oracle D 0000000000000000 0 3844 1 0x00000000 [ 437.718767] ffff88203e330080 0000000000000082 0000000000000000 ffff881078e3f7d0 [ 437.719156] ffff88203dc15fd8 ffff88203dc15fd8 ffff88203dc15fd8 ffff88203e330080 [ 437.719546] 0000001e0000000f 0000000000000246 205b88103f806680 3136302e39323420 [ 437.719935] Call Trace: [ 437.720043] [<ffffffff8139b976>] ? printk+0x43/0x48 [ 437.720155] [<ffffffff8118e16d>] ? ext4_file_write+0x21d/0x290 [ 437.720267] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 [ 437.720376] [<ffffffff8118df50>] ? ext4_llseek+0x120/0x120 [ 437.720485] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 [ 437.720593] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 [ 437.720699] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 [ 437.720811] [<ffffffff813a3152>] ? 
system_call_fastpath+0x16/0x1b

Trying to find out how to disassemble things now...
And yes, 18446744073709551199 aios sounds like quite a lot ;)

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 43+ messages in thread
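That impossible count is itself the clue: 18446744073709551199 is what
-417 looks like when printed as an unsigned 64-bit value, which is
exactly what Jan's debug patch does when it casts atomic_read() (an
int) to unsigned long. In other words, the counter was decremented 417
more times than it was incremented - it went negative, as Eric
confirms just below. A trivial userspace demonstration of the
arithmetic (the 417 is read off the message above, not taken from the
kernel):

#include <stdio.h>

int main(void)
{
	int counter = 0;	/* stands in for the atomic_t i_aiodio_unwritten */
	int i;

	/* 417 completions that decrement without a matching increment: */
	for (i = 0; i < 417; i++)
		counter--;

	/* printed the way the debug patch prints it: */
	printf("Going to wait for %lu aios\n", (unsigned long)counter);
	/* on a 64-bit machine: 18446744073709551199, i.e. 2^64 - 417 */
	return 0;
}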
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-12 15:55 ` Michael Tokarev @ 2011-08-12 17:01 ` Eric Sandeen 2011-08-12 17:34 ` Michael Tokarev 2011-08-12 21:19 ` Jan Kara 1 sibling, 1 reply; 43+ messages in thread From: Eric Sandeen @ 2011-08-12 17:01 UTC (permalink / raw) To: Michael Tokarev; +Cc: Jan Kara, Jiaying Zhang, linux-ext4 On 8/12/11 10:55 AM, Michael Tokarev wrote: > 12.08.2011 17:07, Jan Kara wrote: > [] >>>>> [ 76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor. >>>>> [ 1469.734114] SysRq : Show Blocked State >>>>> [ 1469.734157] task PC stack pid father >>>>> [ 1469.734473] oracle D 0000000000000000 0 6146 1 0x00000000 >>>>> [ 1469.734525] ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040 >>>>> [ 1469.734603] ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810 >>>>> [ 1469.734681] ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2 >>>>> [ 1469.734760] Call Trace: >>>>> [ 1469.734800] [<ffffffff810eeda2>] ? __do_fault+0x422/0x520 >>>>> [ 1469.734863] [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4] >>>>> [ 1469.734909] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 >>>>> [ 1469.734956] [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4] >>>>> [ 1469.734999] [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0 >>>>> [ 1469.735039] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160 >>>>> [ 1469.735078] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600 >>>>> [ 1469.735122] [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b > [] >> Probably not, just garbage on stack confuses the stack unwinder. Note >> that e.g. ext4_llseek() is at the end of function as well as >> abort_exclusive_wait() which probably means these functions have already >> finished and just left their addresses on stack. Could you disasseble >> your ext4_file_write() function and check where offset 0x20d is? Sadly you >> won't see where calls are going when ext4 is compiled as a module (module >> is linked when loaded into kernel) so it might be easier to compile ext4 >> into the kernel and disassemble the function in vmlinux file. >> >>> Btw, does ext4_llseek() look sane here? Note it's called from >>> aio_submit() -- does it _ever_ implement SEEKs? >>> >>> Maybe some debugging is neecessary here? >> Yes. The trouble is I'm not completely sure where we are hanging yet. >> We should know more from your disassembly. But you can try running with >> attached debug patch - maybe it shows something interesting. > > With ext4 built-in and your patch applied: > > [ 429.061524] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767181 by oracle; performance will be poor. > [ 429.061669] Going to wait for 18446744073709551199 aios > [ 437.717942] SysRq : Show Blocked State > [ 437.718109] task PC stack pid father > [ 437.718528] oracle D 0000000000000000 0 3844 1 0x00000000 > [ 437.718767] ffff88203e330080 0000000000000082 0000000000000000 ffff881078e3f7d0 > [ 437.719156] ffff88203dc15fd8 ffff88203dc15fd8 ffff88203dc15fd8 ffff88203e330080 > [ 437.719546] 0000001e0000000f 0000000000000246 205b88103f806680 3136302e39323420 > [ 437.719935] Call Trace: > [ 437.720043] [<ffffffff8139b976>] ? printk+0x43/0x48 > [ 437.720155] [<ffffffff8118e16d>] ? ext4_file_write+0x21d/0x290 > [ 437.720267] [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0 > [ 437.720376] [<ffffffff8118df50>] ? ext4_llseek+0x120/0x120 > [ 437.720485] [<ffffffff81162173>] ? 
aio_rw_vect_retry+0x73/0x1d0
> [ 437.720593] [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [ 437.720699] [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [ 437.720811] [<ffffffff813a3152>] ? system_call_fastpath+0x16/0x1b
>
> Trying to find out how to disassemble things now...
> And yes, 18446744073709551199 aios sounds like quite a lot ;)

looks like it went negative.

I see that in one case we set EXT4_IO_END_UNWRITTEN, but don't
increment the counter. We decrement the counter for every
EXT4_IO_END_UNWRITTEN completion, I think.

I'm not quite sure if that was intentional or not, but it might be a
place to start. I haven't thought hard about this, in the middle of
something else right now, but this looks like it's a problem in my
code from that unaligned AIO patch, perhaps...

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e5191f..7366488 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3640,6 +3640,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
 
 	io_end->flag = EXT4_IO_END_UNWRITTEN;
 	inode = io_end->inode;
+	atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
 
 	/* Add the io_end to per-inode completed io list*/
 	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);

> Thanks,
>
> /mjt

^ permalink raw reply related	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-12 17:01 ` Eric Sandeen
@ 2011-08-12 17:34   ` Michael Tokarev
  2011-08-13 16:02     ` Tao Ma
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Tokarev @ 2011-08-12 17:34 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Jan Kara, Jiaying Zhang, linux-ext4

12.08.2011 21:01, Eric Sandeen wrote:
[]
>> And yes, 18446744073709551199 aios sounds like quite a lot ;)
>
> looks like it went negative.
>
> I see that in one case we set EXT4_IO_END_UNWRITTEN, but don't
> increment the counter. We decrement the counter for every
> EXT4_IO_END_UNWRITTEN completion, I think.
>
> I'm not quite sure if that was intentional or not, but it might be a
> place to start. I haven't thought hard about this, in the middle of
> something else right now, but this looks like it's a problem in my
> code from that unaligned AIO patch, perhaps...
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3e5191f..7366488 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3640,6 +3640,7 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>
>  	io_end->flag = EXT4_IO_END_UNWRITTEN;
>  	inode = io_end->inode;
> +	atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
>
>  	/* Add the io_end to per-inode completed io list*/
>  	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);

This patch does not change things (unfortunately I forgot to reapply
the debugging patch posted by Jan, so I don't know how many aiocbs it
wants to wait for). But reverting e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d
(ext4: serialize unaligned asynchronous DIO) helps - the thing goes
fine without that patch.

And about this refcounting - I can't see why in my case the problem
only happens with a hot cache and only with the dioread_nolock mount
option (so far anyway - I wasn't able to trigger it without at least
one of the two conditions).

BTW, ext4_end_io_buffer_write() is declared twice in fs/ext4/inode.c.

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-12 17:34 ` Michael Tokarev
@ 2011-08-13 16:02   ` Tao Ma
  2011-08-14 20:57     ` Michael Tokarev
  2011-08-15 16:08     ` Eric Sandeen
  0 siblings, 2 replies; 43+ messages in thread
From: Tao Ma @ 2011-08-13 16:02 UTC (permalink / raw)
To: linux-ext4; +Cc: mjt, sandeen

From: Tao Ma <boyu.mt@taobao.com>

Hi Michael,
	could you please check whether this patch works for you?

Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.

From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
From: Tao Ma <boyu.mt@taobao.com>
Date: Sat, 13 Aug 2011 23:50:28 +0800
Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.

Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
should be done simultaneously, since ext4_end_io_nolock always clears
the flag and decreases the counter at the same time.

We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN,
so it can go negative and cause some process to wait forever.

Part of the patch came from Eric in his e-mail.
http://marc.info/?l=linux-ext4&m=131316851417460&w=2

Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
---
 fs/ext4/inode.c   |    9 ++++++++-
 fs/ext4/page-io.c |    6 ++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d47264c..40c0b10 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
 		goto out;
 	}
 
-	io_end->flag = EXT4_IO_END_UNWRITTEN;
+	/*
+	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
+	 * but being more careful is always safe for the future change.
+	 */
 	inode = io_end->inode;
+	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+		io_end->flag |= EXT4_IO_END_UNWRITTEN;
+		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+	}
 
 	/* Add the io_end to per-inode completed io list*/
 	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 430c401..78839af 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -334,8 +334,10 @@ submit_and_retry:
 	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
 	    (io_end->pages[io_end->num_io_pages-1] != io_page))
 		goto submit_and_retry;
-	if (buffer_uninit(bh))
-		io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
+	if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
+		io_end->flag |= EXT4_IO_END_UNWRITTEN;
+		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
+	}
 	io->io_end->size += bh->b_size;
 	io->io_next_block++;
 	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-13 16:02 ` Tao Ma @ 2011-08-14 20:57 ` Michael Tokarev 2011-08-14 21:07 ` Michael Tokarev 2011-08-15 16:08 ` Eric Sandeen 1 sibling, 1 reply; 43+ messages in thread From: Michael Tokarev @ 2011-08-14 20:57 UTC (permalink / raw) To: Tao Ma; +Cc: linux-ext4, sandeen 13.08.2011 20:02, Tao Ma wrote: > From: Tao Ma <boyu.mt@taobao.com> > > Hi Michael, > could you please check whether this patch works for you? With this patch applied to 3.0.1 I can't trigger the issue anymore, after several attempts -- the system just works as it should. I'm not sure whether this is the right fix or whether my testcase just isn't as good as it could be... ;) > Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. > From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001 > From: Tao Ma <boyu.mt@taobao.com> > Date: Sat, 13 Aug 2011 23:50:28 +0800 > Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously. > > Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten should be done simultaneously, since ext4_end_io_nolock always clears the flag and decreases the counter at the same time. > > We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN, so it goes negative and causes some process to wait forever. > > Part of the patch came from Eric in his e-mail. > http://marc.info/?l=linux-ext4&m=131316851417460&w=2 > > Reported-by: Michael Tokarev <mjt@tls.msk.ru> > Signed-off-by: Eric Sandeen <sandeen@redhat.com> > Signed-off-by: Tao Ma <boyu.mt@taobao.com> You can add my Tested-By too ;) Thank you! And please excuse me for the long delay replying - I was AFK the whole weekend. /mjt > fs/ext4/inode.c | 9 ++++++++- > fs/ext4/page-io.c | 6 ++++-- > 2 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index d47264c..40c0b10 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate) > goto out; > } > > - io_end->flag = EXT4_IO_END_UNWRITTEN; > + /* > + * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now, > + * but being more careful is always safe for the future change. > + */ > inode = io_end->inode; > + if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) { > + io_end->flag |= EXT4_IO_END_UNWRITTEN; > + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); > + } > > /* Add the io_end to per-inode completed io list*/ > spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags); > diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c > index 430c401..78839af 100644 > --- a/fs/ext4/page-io.c > +++ b/fs/ext4/page-io.c > @@ -334,8 +334,10 @@ submit_and_retry: > if ((io_end->num_io_pages >= MAX_IO_PAGES) && > (io_end->pages[io_end->num_io_pages-1] != io_page)) > goto submit_and_retry; > - if (buffer_uninit(bh)) > - io->io_end->flag |= EXT4_IO_END_UNWRITTEN; > + if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) { > + io_end->flag |= EXT4_IO_END_UNWRITTEN; > + atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten); > + } > io->io_end->size += bh->b_size; > io->io_next_block++; > ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh)); ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-14 20:57 ` Michael Tokarev @ 2011-08-14 21:07 ` Michael Tokarev 2011-08-15 2:36 ` Tao Ma 0 siblings, 1 reply; 43+ messages in thread From: Michael Tokarev @ 2011-08-14 21:07 UTC (permalink / raw) To: Tao Ma; +Cc: linux-ext4, sandeen, Jan Kara 15.08.2011 00:57, Michael Tokarev wrote: > 13.08.2011 20:02, Tao Ma wrote: >> From: Tao Ma <boyu.mt@taobao.com> >> >> Hi Michael, >> could you please check whether this patch works for you? > > With this patch applied to 3.0.1 I can't trigger the issue anymore, > after several attempts -- the system just works as it should. > I'm not sure whether this is the right fix or whether my testcase just isn't > as good as it could be... ;) Well, I found a way to trigger data corruption with this patch applied. I guess it's not the fault of this patch, but some deeper problem instead. The sequence is my usual one: copy an oracle database from another place and start it. When oracle starts doing its direct I/O against its redologs, we hit the problem which is now solved. But now I do the following: I shut down the database, rename the current redologs out of the way and copy them back into place as new files. And start the database again. This time, oracle complains that the redologs contain garbage. I can reboot the machine now, and compare the old (renamed) redologs with the copies - they're indeed different. My guess is that the copy is done from the pagecache - from the old contents of the files, somehow ignoring the (direct) writes performed by the initial database open. But that copy is somehow damaged now too, since even the file identification is now different. Is this new issue something that dioread_nolock is supposed to create? I mean, it isn't entirely clear what it is supposed to do; it looks somewhat hackish, but without it performance is quite bad. Thanks, /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-14 21:07 ` Michael Tokarev @ 2011-08-15 2:36 ` Tao Ma 2011-08-15 8:00 ` Michael Tokarev 0 siblings, 1 reply; 43+ messages in thread From: Tao Ma @ 2011-08-15 2:36 UTC (permalink / raw) To: Michael Tokarev; +Cc: linux-ext4, sandeen, Jan Kara On 08/15/2011 05:07 AM, Michael Tokarev wrote: > 15.08.2011 00:57, Michael Tokarev wrote: >> 13.08.2011 20:02, Tao Ma wrote: >>> From: Tao Ma <boyu.mt@taobao.com> >>> >>> Hi Michael, >>> could you please check whether this patch works for you? >> >> With this patch applied to 3.0.1 I can't trigger the issue anymore, >> after several attempts -- the system just works as it should. >> I'm not sure whether this is the right fix or whether my testcase just isn't >> as good as it could be... ;) Thanks for the test. > > Well, I found a way to trigger data corruption with this patch applied. > I guess it's not the fault of this patch, but some deeper problem instead. > > The sequence is my usual one: copy an oracle database from another place > and start it. When oracle starts doing its direct I/O against its redologs, > we hit the problem which is now solved. But now I do the following: I shut down > the database, rename the current redologs out of the way and copy them back > into place as new files. And start the database again. > > This time, oracle complains that the redologs contain garbage. > I can reboot the machine now, and compare the old (renamed) redologs > with the copies - they're indeed different. > > My guess is that the copy is done from the pagecache - from the old > contents of the files, somehow ignoring the (direct) writes > performed by the initial database open. But that copy is somehow > damaged now too, since even the file identification is now different. > > Is this new issue something that dioread_nolock is supposed to create? > I mean, it isn't entirely clear what it is supposed to do; it looks > somewhat hackish, but without it performance is quite bad. So can I generalize your sequence as below:
1. copy a large file to a new ext4 volume
2. do some direct i/o read/write to this file (bs=512)
3. rename it.
4. cp it back to the original file
5. do direct i/o read/write (bs=512) now and the file is actually corrupted.
You used to hit the problem in step 2, and my patch resolved it. Now you hit a problem in step 5. Right? Thanks Tao ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 2:36 ` Tao Ma @ 2011-08-15 8:00 ` Michael Tokarev 2011-08-15 8:56 ` Michael Tokarev 0 siblings, 1 reply; 43+ messages in thread From: Michael Tokarev @ 2011-08-15 8:00 UTC (permalink / raw) To: Tao Ma; +Cc: linux-ext4, sandeen, Jan Kara 15.08.2011 06:36, Tao Ma wrote: > On 08/15/2011 05:07 AM, Michael Tokarev wrote: [] >> Well, I found a way to trigger data corruption with this patch >> applied. I guess it's not the fault of this patch, but some >> deeper problem instead. >> >> The sequence is my usual one: copy an oracle database from another >> place and start it. When oracle starts doing its direct I/O >> against its redologs, we hit the problem which is now solved. But >> now I do the following: I shut down the database, rename the current >> redologs out of the way and copy them back into place as new files. >> And start the database again. >> >> This time, oracle complains that the redologs contain garbage. >> I can reboot the machine now, and compare the old (renamed) redologs >> with the copies - they're indeed different. >> >> My guess is that the copy is done from the pagecache - from the old >> contents of the files, somehow ignoring the (direct) writes >> performed by the initial database open. But that copy is somehow >> damaged now too, since even the file identification is now different. >> >> Is this new issue something that dioread_nolock is supposed to create? >> I mean, it isn't entirely clear what it is supposed to do; it looks >> somewhat hackish, but without it performance is quite bad. > So can I generalize your sequence as below: > 1. copy a large file to a new ext4 volume > 2. do some direct i/o read/write to this file (bs=512) > 3. rename it. > 4. cp it back to the original file > 5. do direct i/o read/write (bs=512) now and the file is actually corrupted. > > You used to hit the problem in step 2, and my patch resolved it. Now > you hit a problem in step 5. Right?
SQL> shutdown immediate;
 -- shuts down the database cleanly
$ mkdir tmp
$ mv redo* tmp/
$ cp -p tmp/* .
 -- this makes the redolog files hot in the cache, not even written to disk.
SQL> startup
Database mounted.
 -- now open and read our redologs...
 -- at this point, without the patch, it hangs.
ORA-00316: log 1 of thread 1, type in header is not log file
ORA-00312: online log 1 thread 1: '.../redo1.odf'
$ mv -f tmp/* .
SQL> alter database open;
 -- this will try to open the files again and read them again
Database altered.
 -- and now we're fine.
This is my small(ish) testcase so far. Only the redologs need to be in hot cache in order to trigger the issue. This does direct I/O in 512-byte blocks in these redo* files. The rename and a new directory are just to keep the pieces of the database in the right place. There's even more fun. I once managed to get old content in the copied files, but I can't repeat it. I made a copy as before, sync(1)ed everything, started the database - it was ok. Next I shut it down, and rebooted (why drop_caches does not really work is another big question). And now, oracle complains that the redologs contain the previous sequence number. (To clarify: there's a sequence number in each oracle db which is incremented each time something happens with the database, including startup. So on startup, each file in the database gets a new (the same) sequence number.) So it looked like, despite oracle doing direct writes to record the new sequence number, previously cached data got written to the file.
Now I'm not really sure what's going on; it is somewhat inconsistent. Before, it used to hang after the "Database mounted" message, when it tried to write to the redologs -- now that hang is gone. But now I see some apparent data corruption - again, with hot cache only - and I don't actually understand when it happens. I'm trying to narrow it down further. Thank you! /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 8:00 ` Michael Tokarev @ 2011-08-15 8:56 ` Michael Tokarev 2011-08-15 9:03 ` Michael Tokarev 2011-08-15 23:53 ` Jiaying Zhang 0 siblings, 2 replies; 43+ messages in thread From: Michael Tokarev @ 2011-08-15 8:56 UTC (permalink / raw) To: Tao Ma; +Cc: linux-ext4, sandeen, Jan Kara 15.08.2011 12:00, Michael Tokarev wrote: [....] So, it looks like this (starting with a cold cache):
1. rename the redologs and copy them over - this will make a hot copy of the redologs
2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt
3. shut down oracle, start it up again - it will succeed.
If you issue sync(1) between 1 and 2, everything will work. When shutting down, oracle calls fsync(), so that's like sync(1) again. If some time passes between 1 and 2, everything will work too. Without dioread_nolock I can't trigger the problem no matter how hard I try. A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work.
$ cp -p redo1.odf temp
$ dd if=temp of=foo iflag=direct count=20
Now, the first 512 bytes of "foo" will contain all zeros, while the beginning of redo1.odf is _not_ zeros. Again, without dioread_nolock it works as expected. And the most important note: without the patch there's no data corruption like that. But instead, there is the lockup... ;) Thank you, /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
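A hypothetical C equivalent of the cp+dd reproducer above may be useful to readers who want to script this; it is a sketch under stated assumptions (512-byte logical sectors, the hypothetical file name "temp"), not a guaranteed trigger - as noted later in the thread, a single run rarely hits the race, so looping over fresh files raises the odds:

    /* Write data through the page cache, then immediately read it
     * back with O_DIRECT and compare; on an affected setup the
     * direct read can return stale zeros.
     * Build: cc -std=gnu11 repro.c */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char wbuf[512];
        void *rbuf = NULL;
        int fd;

        memset(wbuf, 0xab, sizeof(wbuf));

        /* buffered write, like "cp -p redo1.odf temp" */
        fd = open("temp", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0 || write(fd, wbuf, sizeof(wbuf)) != (ssize_t)sizeof(wbuf)) {
            perror("buffered write");
            return 1;
        }
        close(fd);

        /* direct read, like "dd if=temp of=foo iflag=direct" */
        if (posix_memalign(&rbuf, 512, 512))
            return 1;
        fd = open("temp", O_RDONLY | O_DIRECT);
        if (fd < 0 || read(fd, rbuf, 512) != 512) {
            perror("direct read");
            return 1;
        }
        close(fd);

        if (memcmp(wbuf, rbuf, 512))
            puts("MISMATCH: direct read saw stale data");
        else
            puts("ok: direct read saw the buffered write");
        free(rbuf);
        return 0;
    }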
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 8:56 ` Michael Tokarev @ 2011-08-15 9:03 ` Michael Tokarev 2011-08-15 10:28 ` Tao Ma 0 siblings, 1 reply; 43+ messages in thread From: Michael Tokarev @ 2011-08-15 9:03 UTC (permalink / raw) To: Tao Ma; +Cc: linux-ext4, sandeen, Jan Kara 15.08.2011 12:56, Michael Tokarev wrote: > 15.08.2011 12:00, Michael Tokarev wrote: > [....] > > So, it looks like this (starting with a cold cache): > > 1. rename the redologs and copy them over - this will make a hot copy of the redologs > 2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt > 3. shut down oracle, start it up again - it will succeed. > > If you issue sync(1) between 1 and 2, everything will work. > When shutting down, oracle calls fsync(), so that's like sync(1) again. > > If some time passes between 1 and 2, everything will work too. > > Without dioread_nolock I can't trigger the problem no matter how hard I try. > > A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. > > $ cp -p redo1.odf temp > $ dd if=temp of=foo iflag=direct count=20 > > Now, the first 512 bytes of "foo" will contain all zeros, while > the beginning of redo1.odf is _not_ zeros. > > Again, without dioread_nolock it works as expected. > > And the most important note: without the patch there's no > data corruption like that. But instead, there is the > lockup... ;) Actually I can reproduce this data corruption without the patch too, just not that easily. The Oracle testcase (with copying the redologs over) does that nicely. So that's a separate bug which was here before. > Thank you, > > /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 9:03 ` Michael Tokarev @ 2011-08-15 10:28 ` Tao Ma 0 siblings, 0 replies; 43+ messages in thread From: Tao Ma @ 2011-08-15 10:28 UTC (permalink / raw) To: Michael Tokarev; +Cc: linux-ext4, sandeen, Jan Kara On 08/15/2011 05:03 PM, Michael Tokarev wrote: > 15.08.2011 12:56, Michael Tokarev wrote: >> 15.08.2011 12:00, Michael Tokarev wrote: >> [....] >> >> So, it looks like this (starting with a cold cache): >> >> 1. rename the redologs and copy them over - this will make a hot copy of the redologs >> 2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt >> 3. shut down oracle, start it up again - it will succeed. >> >> If you issue sync(1) between 1 and 2, everything will work. >> When shutting down, oracle calls fsync(), so that's like sync(1) again. >> >> If some time passes between 1 and 2, everything will work too. >> >> Without dioread_nolock I can't trigger the problem no matter how hard I try. >> >> A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. >> >> $ cp -p redo1.odf temp >> $ dd if=temp of=foo iflag=direct count=20 >> >> Now, the first 512 bytes of "foo" will contain all zeros, while >> the beginning of redo1.odf is _not_ zeros. >> >> Again, without dioread_nolock it works as expected. >> >> And the most important note: without the patch there's no >> data corruption like that. But instead, there is the >> lockup... ;) > > Actually I can reproduce this data corruption without the > patch too, just not that easily. The Oracle testcase (with > copying the redologs over) does that nicely. So that's a > separate bug which was here before. Cool, thanks for the test. BTW, I can reproduce the bug with
$ cp -p redo1.odf temp
$ dd if=temp of=foo iflag=direct count=20
It's not that easy, but I did encounter it once in more than 20 tries; I hope I can get something out soon. Thanks Tao ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 8:56 ` Michael Tokarev 2011-08-15 9:03 ` Michael Tokarev @ 2011-08-15 23:53 ` Jiaying Zhang 2011-08-16 4:15 ` Tao Ma ` (2 more replies) 1 sibling, 3 replies; 43+ messages in thread From: Jiaying Zhang @ 2011-08-15 23:53 UTC (permalink / raw) To: Michael Tokarev; +Cc: Tao Ma, linux-ext4, sandeen, Jan Kara Hi Michael, On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: > 15.08.2011 12:00, Michael Tokarev wrote: > [....] > > So, it looks like this (starting with a cold cache): > > 1. rename the redologs and copy them over - this will make a hot copy of the redologs > 2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt > 3. shut down oracle, start it up again - it will succeed. > > If you issue sync(1) between 1 and 2, everything will work. > When shutting down, oracle calls fsync(), so that's like sync(1) again. > > If some time passes between 1 and 2, everything will work too. > > Without dioread_nolock I can't trigger the problem no matter how hard I try. > > A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. > > $ cp -p redo1.odf temp > $ dd if=temp of=foo iflag=direct count=20 Isn't this the expected behavior here? When doing 'cp -p redo1.odf temp', data is copied to temp through a buffered write, but there is no guarantee of when the data will actually be written to disk. Then with 'dd if=temp of=foo iflag=direct count=20', data is read directly from disk. Very likely, the written data hasn't been flushed to disk yet, so ext4 returns zeros in this case. Jiaying > > Now, the first 512 bytes of "foo" will contain all zeros, while > the beginning of redo1.odf is _not_ zeros. > > Again, without dioread_nolock it works as expected. > > And the most important note: without the patch there's no > data corruption like that. But instead, there is the > lockup... ;) > > Thank you, > > /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 23:53 ` Jiaying Zhang @ 2011-08-16 4:15 ` Tao Ma 2011-08-16 8:38 ` Michael Tokarev 2011-08-16 13:53 ` Jan Kara 2 siblings, 0 replies; 43+ messages in thread From: Tao Ma @ 2011-08-16 4:15 UTC (permalink / raw) To: Jiaying Zhang; +Cc: Michael Tokarev, linux-ext4, sandeen, Jan Kara Hi Jiaying, On 08/16/2011 07:53 AM, Jiaying Zhang wrote: > Hi Michael, > > On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: >> 15.08.2011 12:00, Michael Tokarev wrote: >> [....] >> >> So, it looks like this (starting with a cold cache): >> >> 1. rename the redologs and copy them over - this will make a hot copy of the redologs >> 2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt >> 3. shut down oracle, start it up again - it will succeed. >> >> If you issue sync(1) between 1 and 2, everything will work. >> When shutting down, oracle calls fsync(), so that's like sync(1) again. >> >> If some time passes between 1 and 2, everything will work too. >> >> Without dioread_nolock I can't trigger the problem no matter how hard I try. >> >> A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. >> >> $ cp -p redo1.odf temp >> $ dd if=temp of=foo iflag=direct count=20 > Isn't this the expected behavior here? When doing > 'cp -p redo1.odf temp', data is copied to temp through > a buffered write, but there is no guarantee of when the data will > actually be written to disk. Then with 'dd if=temp of=foo > iflag=direct count=20', data is read directly from disk. > Very likely, the written data hasn't been flushed to disk > yet, so ext4 returns zeros in this case. Sorry, but that doesn't sound correct to me. Say we use a buffered write to a file and then a direct i/o read: what we expect (or at least what Michael expects) is that we read the updated data, not the stale data. I have thought of a tiny race window in ext4 here, but I need to do some tests to verify it and then fix it. Thanks Tao ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 23:53 ` Jiaying Zhang 2011-08-16 4:15 ` Tao Ma @ 2011-08-16 8:38 ` Michael Tokarev 2011-08-16 13:53 ` Jan Kara 2 siblings, 0 replies; 43+ messages in thread From: Michael Tokarev @ 2011-08-16 8:38 UTC (permalink / raw) To: Jiaying Zhang; +Cc: Tao Ma, linux-ext4, sandeen, Jan Kara 16.08.2011 03:53, Jiaying Zhang wrote: > Hi Michael, > > On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: [] >> A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. >> >> $ cp -p redo1.odf temp >> $ dd if=temp of=foo iflag=direct count=20 > Isn't this the expected behavior here? When doing > 'cp -p redo1.odf temp', data is copied to temp through > a buffered write, but there is no guarantee of when the data will > actually be written to disk. Then with 'dd if=temp of=foo > iflag=direct count=20', data is read directly from disk. > Very likely, the written data hasn't been flushed to disk > yet, so ext4 returns zeros in this case. The problem is 3-faced (at least ;) First of all, it is _not_ expected behaviour. When you think about it, maybe it becomes "more expected", but at first it looks like something Really Wrong (tm). It could be made "more expected" by mentioning, in various manpages and whatnot, all the possible consequences of mixing direct and buffered I/O. So far that hasn't been done. I can understand (and sort of expect), say, a buffered write being invisible to a concurrent direct read while the two are going on at the same time. But here, the file has been closed and re-opened between the writes and the reads. I agree that it's difficult to keep the two pieces - direct and buffered I/O - in sync; there were numerous efforts to synchronize them, with varying success and usually a huge amount of work. Maybe if it had been noted initially that direct I/O _is_ incompatible with buffered I/O, things wouldn't be so bad now. Next, this problem does not happen without the mentioned dioread_nolock option (which - as far as I can see - is supposed to become the default (or only) way to handle this in the future). I can't trigger any of the issues described in this thread without dioread_nolock. So that makes this yet another "corner case" somehow (like the famous non-fs-block-aligned direct write past the end of file, or like mmapped I/O mixed with direct I/O, and so on), but since most other such corner cases are fixed now, this one just needs to be fixed too. And 3rd, this is a race condition: it does not happen all the time, or even most of the time; it happens "sometimes", which makes it more like a bug than not. Thanks, /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-15 23:53 ` Jiaying Zhang 2011-08-16 4:15 ` Tao Ma 2011-08-16 8:38 ` Michael Tokarev @ 2011-08-16 13:53 ` Jan Kara 2011-08-16 15:03 ` Tao Ma 2 siblings, 1 reply; 43+ messages in thread From: Jan Kara @ 2011-08-16 13:53 UTC (permalink / raw) To: Jiaying Zhang; +Cc: Michael Tokarev, Tao Ma, linux-ext4, sandeen, Jan Kara On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: > On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: > > 15.08.2011 12:00, Michael Tokarev wrote: > > [....] > > > > So, it looks like this (starting with a cold cache): > > > > 1. rename the redologs and copy them over - this will make a hot copy of the redologs > > 2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt > > 3. shut down oracle, start it up again - it will succeed. > > > > If you issue sync(1) between 1 and 2, everything will work. > > When shutting down, oracle calls fsync(), so that's like sync(1) again. > > > > If some time passes between 1 and 2, everything will work too. > > > > Without dioread_nolock I can't trigger the problem no matter how hard I try. > > > > A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. > > > > $ cp -p redo1.odf temp > > $ dd if=temp of=foo iflag=direct count=20 > Isn't this the expected behavior here? When doing > 'cp -p redo1.odf temp', data is copied to temp through > a buffered write, but there is no guarantee of when the data will > actually be written to disk. Then with 'dd if=temp of=foo > iflag=direct count=20', data is read directly from disk. > Very likely, the written data hasn't been flushed to disk > yet, so ext4 returns zeros in this case. No it's not. Buffered and direct IO is supposed to work correctly (although not fast) together. In this particular case we take care to flush dirty data from the page cache before performing a direct IO read... But something is broken in this path obviously. I don't have time to dig into this in detail now but what seems to be the problem is that with the dioread_nolock option, we don't acquire i_mutex for direct IO reads anymore. Thus these reads can compete with ext4_end_io_nolock() called from ext4_end_io_work() (this is called under i_mutex so without dioread_nolock the race cannot happen). Hmm, the new writepages code seems to be broken in combination with direct IO. Direct IO code expects that when filemap_write_and_wait() finishes, data is on disk, but with the new bio submission code this is not true, because we clear the PageWriteback bit (which is what filemap_fdatawait() waits for) in ext4_end_io_buffer_write() but do the extent conversion only after that in the convert workqueue. So the race seems to be there all the time, just without dioread_nolock it's much smaller. Fixing this is going to be non-trivial - I'm not sure we can really move clearing of the PageWriteback bit to the conversion workqueue. I think we already tried that once but it caused deadlocks for some reason... > > Now, the first 512 bytes of "foo" will contain all zeros, while > > the beginning of redo1.odf is _not_ zeros. > > > > Again, without dioread_nolock it works as expected. > > > > And the most important note: without the patch there's no > > data corruption like that. But instead, there is the > > lockup... ;)
Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 43+ messages in thread
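To make the window Jan describes easier to visualize, here is a toy userspace model (pthreads; the names are illustrative stand-ins echoing the kernel symbols, and the interleaving is forced with a sleep - this is a sketch of the ordering, not ext4 code). The "end_io" side clears the writeback flag before the conversion work runs, so a reader that only waits for writeback can look at the extent while it is still unwritten:

    /* Build: cc -std=gnu11 -pthread race.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int page_writeback = 1;  /* set while data IO is in flight */
    static atomic_int extent_converted;    /* set later by the "workqueue" */

    static void *end_io_then_convert(void *arg)
    {
        (void)arg;
        /* like ext4_end_io_buffer_write(): writeback is cleared first... */
        atomic_store(&page_writeback, 0);
        usleep(1000);                      /* workqueue scheduling delay */
        /* ...and the extent conversion only happens afterwards */
        atomic_store(&extent_converted, 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, end_io_then_convert, NULL);

        /* like the direct IO read path: filemap_write_and_wait()
         * only waits for the writeback flag to clear... */
        while (atomic_load(&page_writeback))
            ;
        /* ...then looks at the extent, which may still be unwritten: */
        puts(atomic_load(&extent_converted)
             ? "read initialized data"
             : "read an unwritten extent -> zeros");
        pthread_join(t, NULL);
        return 0;
    }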
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-16 13:53 ` Jan Kara @ 2011-08-16 15:03 ` Tao Ma 2011-08-16 21:32 ` Jiaying Zhang 0 siblings, 1 reply; 43+ messages in thread From: Tao Ma @ 2011-08-16 15:03 UTC (permalink / raw) To: Jan Kara; +Cc: Jiaying Zhang, Michael Tokarev, linux-ext4, sandeen On 08/16/2011 09:53 PM, Jan Kara wrote: > On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: >> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: >>> 15.08.2011 12:00, Michael Tokarev wrote: >>> [....] >>> >>> So, it looks like this (starting with a cold cache): >>> >>> 1. rename the redologs and copy them over - this will make a hot copy of the redologs >>> 2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt >>> 3. shut down oracle, start it up again - it will succeed. >>> >>> If you issue sync(1) between 1 and 2, everything will work. >>> When shutting down, oracle calls fsync(), so that's like sync(1) again. >>> >>> If some time passes between 1 and 2, everything will work too. >>> >>> Without dioread_nolock I can't trigger the problem no matter how hard I try. >>> >>> A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. >>> >>> $ cp -p redo1.odf temp >>> $ dd if=temp of=foo iflag=direct count=20 >> Isn't this the expected behavior here? When doing >> 'cp -p redo1.odf temp', data is copied to temp through >> a buffered write, but there is no guarantee of when the data will >> actually be written to disk. Then with 'dd if=temp of=foo >> iflag=direct count=20', data is read directly from disk. >> Very likely, the written data hasn't been flushed to disk >> yet, so ext4 returns zeros in this case. > No it's not. Buffered and direct IO is supposed to work correctly > (although not fast) together. In this particular case we take care to flush > dirty data from the page cache before performing a direct IO read... But > something is broken in this path obviously. > > I don't have time to dig into this in detail now but what seems to be the > problem is that with the dioread_nolock option, we don't acquire i_mutex for > direct IO reads anymore. Thus these reads can compete with > ext4_end_io_nolock() called from ext4_end_io_work() (this is called under > i_mutex so without dioread_nolock the race cannot happen). > > Hmm, the new writepages code seems to be broken in combination with direct > IO. Direct IO code expects that when filemap_write_and_wait() finishes, > data is on disk, but with the new bio submission code this is not true, because > we clear the PageWriteback bit (which is what filemap_fdatawait() waits for) in > ext4_end_io_buffer_write() but do the extent conversion only after that in > the convert workqueue. So the race seems to be there all the time, just without > dioread_nolock it's much smaller. You are absolutely right. The real problem is that ext4_direct_IO begins to work *after* we clear the page writeback flag and *before* we convert the unwritten extent to a valid state. Some of my traces do show that. I am working on it now. > > Fixing this is going to be non-trivial - I'm not sure we can really move > clearing of the PageWriteback bit to the conversion workqueue. I think we already > tried that once but it caused deadlocks for some reason... I just did what you described, and yes, I met with another problem and am trying to resolve it now. Once it is OK, I will send out the patch. Thanks Tao ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-16 15:03 ` Tao Ma @ 2011-08-16 21:32 ` Jiaying Zhang 2011-08-16 22:28 ` Michael Tokarev 2011-08-16 23:59 ` Dave Chinner 0 siblings, 2 replies; 43+ messages in thread From: Jiaying Zhang @ 2011-08-16 21:32 UTC (permalink / raw) To: Tao Ma; +Cc: Jan Kara, Michael Tokarev, linux-ext4, sandeen On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote: > On 08/16/2011 09:53 PM, Jan Kara wrote: >> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: >>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: >>>> 15.08.2011 12:00, Michael Tokarev wrote: >>>> [....] >>>> >>>> So, it looks like this (starting with a cold cache): >>>> >>>> 1. rename the redologs and copy them over - this will make a hot copy of the redologs >>>> 2. start up oracle - it will complain that the redologs aren't redologs, the header is corrupt >>>> 3. shut down oracle, start it up again - it will succeed. >>>> >>>> If you issue sync(1) between 1 and 2, everything will work. >>>> When shutting down, oracle calls fsync(), so that's like sync(1) again. >>>> >>>> If some time passes between 1 and 2, everything will work too. >>>> >>>> Without dioread_nolock I can't trigger the problem no matter how hard I try. >>>> >>>> A smaller test case. I used the redo1.odf file (one of the redologs) as a test file; any will work. >>>> >>>> $ cp -p redo1.odf temp >>>> $ dd if=temp of=foo iflag=direct count=20 >>> Isn't this the expected behavior here? When doing >>> 'cp -p redo1.odf temp', data is copied to temp through >>> a buffered write, but there is no guarantee of when the data will >>> actually be written to disk. Then with 'dd if=temp of=foo >>> iflag=direct count=20', data is read directly from disk. >>> Very likely, the written data hasn't been flushed to disk >>> yet, so ext4 returns zeros in this case. >> No it's not. Buffered and direct IO is supposed to work correctly >> (although not fast) together. In this particular case we take care to flush >> dirty data from the page cache before performing a direct IO read... But >> something is broken in this path obviously. I see. Thanks a lot for the explanation. I wonder whether the following patch will solve the problem:
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 6c27111..ca90d73 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 	}

 retry:
-	if (rw == READ && ext4_should_dioread_nolock(inode))
+	if (rw == READ && ext4_should_dioread_nolock(inode)) {
+		if (unlikely(!list_empty(&ei->i_completed_io_list))) {
+			mutex_lock(&inode->i_mutex);
+			ext4_flush_completed_IO(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
 		ret = __blockdev_direct_IO(rw, iocb, inode,
 				 inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
 				 ext4_get_block, NULL, NULL, 0);
-	else {
+	} else {
 		ret = blockdev_direct_IO(rw, iocb, inode,
 				 inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
I tested the patch a little bit and it seems to resolve the race on dioread_nolock in my case. Michael, I would very much appreciate it if you could try this patch with your test case and see whether it works. >> >> I don't have time to dig into this in detail now but what seems to be the >> problem is that with the dioread_nolock option, we don't acquire i_mutex for >> direct IO reads anymore. Thus these reads can compete with >> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under >> i_mutex so without dioread_nolock the race cannot happen).
>> >> Hmm, the new writepages code seems to be broken in combination with direct >> IO. Direct IO code expects that when filemap_write_and_wait() finishes, >> data is on disk, but with the new bio submission code this is not true, because >> we clear the PageWriteback bit (which is what filemap_fdatawait() waits for) in >> ext4_end_io_buffer_write() but do the extent conversion only after that in >> the convert workqueue. So the race seems to be there all the time, just without >> dioread_nolock it's much smaller. I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled. So I think we should be ok with the default mount options. Jiaying > You are absolutely right. The real problem is that ext4_direct_IO > begins to work *after* we clear the page writeback flag and *before* we > convert the unwritten extent to a valid state. Some of my traces do show > that. I am working on it now. >> >> Fixing this is going to be non-trivial - I'm not sure we can really move >> clearing of the PageWriteback bit to the conversion workqueue. I think we already >> tried that once but it caused deadlocks for some reason... > I just did what you described, and yes, I met with another problem and > am trying to resolve it now. Once it is OK, I will send out the patch. > > Thanks > Tao ^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-16 21:32 ` Jiaying Zhang @ 2011-08-16 22:28 ` Michael Tokarev 2011-08-16 23:07 ` Jiaying Zhang 2011-08-16 23:59 ` Dave Chinner 1 sibling, 1 reply; 43+ messages in thread From: Michael Tokarev @ 2011-08-16 22:28 UTC (permalink / raw) To: Jiaying Zhang; +Cc: Tao Ma, Jan Kara, linux-ext4, sandeen 17.08.2011 01:32, Jiaying Zhang wrote: > On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote: >> On 08/16/2011 09:53 PM, Jan Kara wrote: >>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: >>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: >>>>> 15.08.2011 12:00, Michael Tokarev wrote: >>>>> [....] >>>>> >>>>> So, it looks like this (starting with cold cache): >>>>> >>>>> 1. rename the redologs and copy them over - this will >>>>> make a hot copy of redologs >>>>> 2. startup oracle - it will complain that the redologs aren't >>>>> redologs, the header is corrupt >>>>> 3. shut down oracle, start it up again - it will succeed. >>>>> >>>>> If between 1 and 2 you'll issue sync(1) everything will work. >>>>> When shutting down, oracle calls fsync(), so that's like >>>>> sync(1) again. >>>>> >>>>> If there will be some time between 1. and 2., everything >>>>> will work too. >>>>> >>>>> Without dioread_nolock I can't trigger the problem no matter >>>>> how I tried. >>>>> >>>>> >>>>> A smaller test case. I used redo1.odf file (one of the >>>>> redologs) as a test file, any will work. >>>>> >>>>> $ cp -p redo1.odf temp >>>>> $ dd if=temp of=foo iflag=direct count=20 >>>> Isn't this the expected behavior here? When doing >>>> 'cp -p redo1.odf temp', data is copied to temp through >>>> buffer write, but there is no guarantee when data will be >>>> actually written to disk. Then with 'dd if=temp of=foo >>>> iflag=direct count=20', data is read directly from disk. >>>> Very likely, the written data hasn't been flushed to disk >>>> yet so ext4 returns zero in this case. >>> No it's not. Buffered and direct IO is supposed to work correctly >>> (although not fast) together. In this particular case we take care to flush >>> dirty data from page cache before performing direct IO read... But >>> something is broken in this path obviously. > I see. Thanks a lot for the explanation. > > I wonder whether the following patch will solve the problem: > > diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c > index 6c27111..ca90d73 100644 > --- a/fs/ext4/indirect.c > +++ b/fs/ext4/indirect.c It is in inode.c in 3.0, not in indirect.c (there's no such file there). But it applies (after de-MIMEfying it) - well, almost with no mods... ;) > @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb, > } > > retry: > - if (rw == READ && ext4_should_dioread_nolock(inode)) > + if (rw == READ && ext4_should_dioread_nolock(inode)) { > + if (unlikely(!list_empty(&ei->i_completed_io_list))) { > + mutex_lock(&inode->i_mutex); > + ext4_flush_completed_IO(inode); > + mutex_unlock(&inode->i_mutex); > + } > ret = __blockdev_direct_IO(rw, iocb, inode, > inode->i_sb->s_bdev, iov, > offset, nr_segs, > ext4_get_block, NULL, NULL, 0); > - else { > + } else { > ret = blockdev_direct_IO(rw, iocb, inode, > inode->i_sb->s_bdev, iov, > offset, nr_segs, > > I tested the patch a little bit and it seems to resolve the race > on dioread_nolock in my case. Michael, I would very appreciate > if you can try this patch with your test case and see whether it works. So I tried it just now. I tried hard, several times. 
No joy: I don't see the "corruption" anymore, neither with the simple dd case nor with the oracle version. I added a printk into the new if-unlikely path, and it triggers several times - almost always when I run my "oracle test-case", starting the database after freshly copying the redologs. So it appears you've hit the right place there. At least for this test case ;) Not sure if that's exactly the right way to go - maybe it's possible to have some optimisation in there, in case the completed list is not empty but does not overlap with our I/O range (see the sketch below), but that's probably overkill, dunno. >>> Hmm, the new writepages code seems to be broken in combination with direct >>> IO. Direct IO code expects that when filemap_write_and_wait() finishes, >>> data is on disk, but with the new bio submission code this is not true, because >>> we clear the PageWriteback bit (which is what filemap_fdatawait() waits for) in >>> ext4_end_io_buffer_write() but do the extent conversion only after that in >>> the convert workqueue. So the race seems to be there all the time, just without >>> dioread_nolock it's much smaller. > I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled. > So I think we should be ok with the default mount options. Am I right that the intention was to enable dioread_nolock by default eventually, or even to keep only its behaviour, without mount options? And thank you -- all -- for your work and support! /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
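For what it's worth, the optimisation mjt hints at would amount to a range-overlap test against the pending io_end list before taking i_mutex. A hypothetical sketch with simplified types - not a proposed patch, and the struct is invented for illustration:

    /* Flush only when a pending io_end overlaps the range the direct
     * read is about to touch.  Build: cc -std=gnu11 overlap.c */
    #include <stdio.h>

    struct pending_io { long long offset, size; };

    /* half-open interval test:
     * [off, off+len) vs [io->offset, io->offset + io->size) */
    static int overlaps(const struct pending_io *io,
                        long long off, long long len)
    {
        return io->offset < off + len && off < io->offset + io->size;
    }

    int main(void)
    {
        struct pending_io io = { 4096, 512 };
        printf("%d %d\n",
               overlaps(&io, 0, 4096),    /* 0: read ends where io starts */
               overlaps(&io, 4096, 512)); /* 1: ranges intersect */
        return 0;
    }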
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-16 22:28 ` Michael Tokarev @ 2011-08-16 23:07 ` Jiaying Zhang 2011-08-17 17:02 ` Ted Ts'o 0 siblings, 1 reply; 43+ messages in thread From: Jiaying Zhang @ 2011-08-16 23:07 UTC (permalink / raw) To: Michael Tokarev; +Cc: Tao Ma, Jan Kara, linux-ext4, sandeen On Tue, Aug 16, 2011 at 3:28 PM, Michael Tokarev <mjt@tls.msk.ru> wrote: > 17.08.2011 01:32, Jiaying Zhang wrote: >> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote: >>> On 08/16/2011 09:53 PM, Jan Kara wrote: >>>> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote: >>>>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote: >>>>>> 15.08.2011 12:00, Michael Tokarev wrote: >>>>>> [....] >>>>>> >>>>>> So, it looks like this (starting with cold cache): >>>>>> >>>>>> 1. rename the redologs and copy them over - this will >>>>>> make a hot copy of redologs >>>>>> 2. startup oracle - it will complain that the redologs aren't >>>>>> redologs, the header is corrupt >>>>>> 3. shut down oracle, start it up again - it will succeed. >>>>>> >>>>>> If between 1 and 2 you'll issue sync(1) everything will work. >>>>>> When shutting down, oracle calls fsync(), so that's like >>>>>> sync(1) again. >>>>>> >>>>>> If there will be some time between 1. and 2., everything >>>>>> will work too. >>>>>> >>>>>> Without dioread_nolock I can't trigger the problem no matter >>>>>> how I tried. >>>>>> >>>>>> >>>>>> A smaller test case. I used redo1.odf file (one of the >>>>>> redologs) as a test file, any will work. >>>>>> >>>>>> $ cp -p redo1.odf temp >>>>>> $ dd if=temp of=foo iflag=direct count=20 >>>>> Isn't this the expected behavior here? When doing >>>>> 'cp -p redo1.odf temp', data is copied to temp through >>>>> buffer write, but there is no guarantee when data will be >>>>> actually written to disk. Then with 'dd if=temp of=foo >>>>> iflag=direct count=20', data is read directly from disk. >>>>> Very likely, the written data hasn't been flushed to disk >>>>> yet so ext4 returns zero in this case. >>>> No it's not. Buffered and direct IO is supposed to work correctly >>>> (although not fast) together. In this particular case we take care to flush >>>> dirty data from page cache before performing direct IO read... But >>>> something is broken in this path obviously. >> I see. Thanks a lot for the explanation. >> >> I wonder whether the following patch will solve the problem: >> >> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c >> index 6c27111..ca90d73 100644 >> --- a/fs/ext4/indirect.c >> +++ b/fs/ext4/indirect.c > > It is in inode.c in 3.0, not in indirect.c (there's no such file there). > But it applies (after de-MIMEfying it) - well, almost with no mods... ;) > >> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb, >> } >> >> retry: >> - if (rw == READ && ext4_should_dioread_nolock(inode)) >> + if (rw == READ && ext4_should_dioread_nolock(inode)) { >> + if (unlikely(!list_empty(&ei->i_completed_io_list))) { >> + mutex_lock(&inode->i_mutex); >> + ext4_flush_completed_IO(inode); >> + mutex_unlock(&inode->i_mutex); >> + } >> ret = __blockdev_direct_IO(rw, iocb, inode, >> inode->i_sb->s_bdev, iov, >> offset, nr_segs, >> ext4_get_block, NULL, NULL, 0); >> - else { >> + } else { >> ret = blockdev_direct_IO(rw, iocb, inode, >> inode->i_sb->s_bdev, iov, >> offset, nr_segs, >> >> I tested the patch a little bit and it seems to resolve the race >> on dioread_nolock in my case. 
>> Michael, I would very much appreciate it if you could try this patch with your test case and see whether it works. > > So I tried it just now. I tried hard, several times. No joy: I don't > see the "corruption" anymore, neither with the simple dd case nor with > the oracle version. I added a printk into the new if-unlikely path, and > it triggers several times - almost always when I run my "oracle > test-case", starting the database after freshly copying the redologs. > > So it appears you've hit the right place there. At least for this > test case ;) Not sure if that's exactly the right way to go - maybe > it's possible to have some optimisation in there, in case the completed > list is not empty but does not overlap with our I/O range, but that's > probably overkill, dunno. > >>>> Hmm, the new writepages code seems to be broken in combination with direct >>>> IO. Direct IO code expects that when filemap_write_and_wait() finishes, >>>> data is on disk, but with the new bio submission code this is not true, because >>>> we clear the PageWriteback bit (which is what filemap_fdatawait() waits for) in >>>> ext4_end_io_buffer_write() but do the extent conversion only after that in >>>> the convert workqueue. So the race seems to be there all the time, just without >>>> dioread_nolock it's much smaller. > >> I think ext4_end_io_buffer_write() is only called with dioread_nolock enabled. >> So I think we should be ok with the default mount options. > > Am I right that the intention was to enable dioread_nolock by default > eventually, or even to keep only its behaviour, without mount options? Good question. At the time when we checked in the code, we wanted to be careful that it didn't introduce data corruptions that would affect normal workloads. Apparently, the downside is that this code path doesn't get good test coverage. Maybe it is time to reconsider enabling this feature by default. I guess we still want to guard it with a mount option given that it doesn't work with certain configurations, like data=journal mode and small block sizes. Jiaying > > And thank you -- all -- for your work and support! > > /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-16 23:07 ` Jiaying Zhang @ 2011-08-17 17:02 ` Ted Ts'o 2011-08-18 6:49 ` Michael Tokarev 0 siblings, 1 reply; 43+ messages in thread From: Ted Ts'o @ 2011-08-17 17:02 UTC (permalink / raw) To: Jiaying Zhang; +Cc: Michael Tokarev, Tao Ma, Jan Kara, linux-ext4, sandeen On Tue, Aug 16, 2011 at 04:07:38PM -0700, Jiaying Zhang wrote: > Good question. At the time when we checked in the code, we wanted to be > careful that it didn't introduce data corruptions that would affect normal > workloads. Apparently, the downside is that this code path doesn't get > good test coverage. Maybe it is time to reconsider enabling this feature > by default. I guess we still want to guard it with a mount option given that > it doesn't work with certain configurations, like data=journal mode and small > block sizes. What I'd like to do long-term here is to change things so that instead of instantiating the extent as uninitialized, writing the data, and then doing the uninit->init conversion, we write the data and then instantiate the extent as initialized. This would also allow us to get rid of data=ordered mode. And we should make it work for fs block size != page size. It means that we need a way of adding this sort of information to an in-memory extent cache which isn't saved to disk until the data is written. We've also talked about adding information about whether an extent is subject to delalloc as well, so we don't have to grovel through the page cache and look at individual buffers attached to the pages. And there are folks who have been experimenting with an in-memory extent tree cache to speed access to fast PCIe-attached flash. It seems to me that if we're careful a single solution should be able to solve all of these problems... - Ted ^ permalink raw reply [flat|nested] 43+ messages in thread
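To make Ted's idea a bit more concrete, here is a purely illustrative sketch of what one entry in such an in-memory extent cache might carry. None of these names exist in ext4, and a real implementation would index extents with a tree rather than a list; the point is only that "unwritten" or "delalloc" state would live here until the data is safely on disk, and only then would the extent be published as initialized:

    /* Build: cc -std=c11 extcache.c */
    #include <stdint.h>
    #include <stdio.h>

    enum extent_state {
        EXTENT_DELALLOC,     /* space reserved, not yet allocated */
        EXTENT_UNWRITTEN,    /* allocated, data IO / conversion pending */
        EXTENT_WRITTEN,      /* initialized; safe to expose to direct IO */
    };

    struct cached_extent {
        uint32_t lblk;               /* first logical block */
        uint32_t len;                /* length in blocks */
        uint64_t pblk;               /* first physical block, if allocated */
        enum extent_state state;     /* flushed to the on-disk tree only
                                        once it reaches EXTENT_WRITTEN */
        struct cached_extent *next;  /* per-inode list, for brevity */
    };

    int main(void)
    {
        struct cached_extent e = { 0, 8, 34816, EXTENT_UNWRITTEN, 0 };
        e.state = EXTENT_WRITTEN;    /* flipped after the data IO completes */
        printf("extent %u+%u -> %llu, state %d\n",
               e.lblk, e.len, (unsigned long long)e.pblk, (int)e.state);
        return 0;
    }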
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-17 17:02 ` Ted Ts'o @ 2011-08-18 6:49 ` Michael Tokarev 2011-08-18 18:54 ` Jiaying Zhang 2011-08-19 3:18 ` Tao Ma 0 siblings, 2 replies; 43+ messages in thread From: Michael Tokarev @ 2011-08-18 6:49 UTC (permalink / raw) To: Ted Ts'o; +Cc: Jiaying Zhang, Tao Ma, Jan Kara, linux-ext4, sandeen 17.08.2011 21:02, Ted Ts'o wrote: [] > What I'd like to do long-term here is to change things so that instead > of instantiating the extent as uninitialized, writing the data, and then > doing the uninit->init conversion, we write the data and then > instantiate the extent as initialized. This would also > allow us to get rid of data=ordered mode. And we should make it work > for fs block size != page size. > > It means that we need a way of adding this sort of information to an > in-memory extent cache which isn't saved to disk until the data is > written. We've also talked about adding information about whether > an extent is subject to delalloc as well, so we don't have to grovel > through the page cache and look at individual buffers attached to the > pages. And there are folks who have been experimenting with an > in-memory extent tree cache to speed access to fast PCIe-attached > flash. > > It seems to me that if we're careful a single solution should be able > to solve all of these problems... What about the current situation - what do you think: should it be ignored for now, bearing in mind that dioread_nolock isn't used often (but it gives a _serious_ difference in read speed), or should we, short term, fix this very case, which has real-life impact already, while implementing a long-term solution? Thank you! /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-18 6:49 ` Michael Tokarev @ 2011-08-18 18:54 ` Jiaying Zhang 2011-08-19 3:20 ` Tao Ma 0 siblings, 1 reply; 43+ messages in thread From: Jiaying Zhang @ 2011-08-18 18:54 UTC (permalink / raw) To: Michael Tokarev; +Cc: Ted Ts'o, Tao Ma, Jan Kara, linux-ext4, sandeen On Wed, Aug 17, 2011 at 11:49 PM, Michael Tokarev <mjt@tls.msk.ru> wrote: > 17.08.2011 21:02, Ted Ts'o wrote: > [] >> What I'd like to do long-term here is to change things so that instead >> of instantiating the extent as uninitialized, writing the data, and then >> doing the uninit->init conversion, we write the data and then >> instantiate the extent as initialized. This would also >> allow us to get rid of data=ordered mode. And we should make it work >> for fs block size != page size. >> >> It means that we need a way of adding this sort of information to an >> in-memory extent cache which isn't saved to disk until the data is >> written. We've also talked about adding information about whether >> an extent is subject to delalloc as well, so we don't have to grovel >> through the page cache and look at individual buffers attached to the >> pages. And there are folks who have been experimenting with an >> in-memory extent tree cache to speed access to fast PCIe-attached >> flash. >> >> It seems to me that if we're careful a single solution should be able >> to solve all of these problems... > > What about the current situation - what do you think: should it be ignored > for now, bearing in mind that dioread_nolock isn't used often (but it > gives a _serious_ difference in read speed), or should we, short term, fix this > very case, which has real-life impact already, while implementing a > long-term solution? I plan to send my patch as a bandaid fix. It doesn't solve the fundamental problem but I think it helps close the race you saw in your test. In the long term, I agree that we should think about implementing an extent tree cache and using it to hold pending uninitialized-to-initialized extent conversions. Jiaying > > Thank you! > > /mjt ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-18 18:54 ` Jiaying Zhang @ 2011-08-19 3:20 ` Tao Ma 0 siblings, 0 replies; 43+ messages in thread From: Tao Ma @ 2011-08-19 3:20 UTC (permalink / raw) To: Jiaying Zhang Cc: Michael Tokarev, Ted Ts'o, Jan Kara, linux-ext4, sandeen Hi Ted and Jiaying, On 08/19/2011 02:54 AM, Jiaying Zhang wrote: > On Wed, Aug 17, 2011 at 11:49 PM, Michael Tokarev <mjt@tls.msk.ru> wrote: >> 17.08.2011 21:02, Ted Ts'o wrote: >> [] >>> What I'd like to do long-term here is to change things so that instead >>> of instantiating the extent as uninitialized, writing the data, and then >>> doing the uninit->init conversion, we write the data and then >>> instantiate the extent as initialized. This would also >>> allow us to get rid of data=ordered mode. And we should make it work >>> for fs block size != page size. >>> >>> It means that we need a way of adding this sort of information to an >>> in-memory extent cache which isn't saved to disk until the data is >>> written. We've also talked about adding information about whether >>> an extent is subject to delalloc as well, so we don't have to grovel >>> through the page cache and look at individual buffers attached to the >>> pages. And there are folks who have been experimenting with an >>> in-memory extent tree cache to speed access to fast PCIe-attached >>> flash. >>> >>> It seems to me that if we're careful a single solution should be able >>> to solve all of these problems... >> >> What about the current situation - what do you think: should it be ignored >> for now, bearing in mind that dioread_nolock isn't used often (but it >> gives a _serious_ difference in read speed), or should we, short term, fix this >> very case, which has real-life impact already, while implementing a >> long-term solution? > I plan to send my patch as a bandaid fix. It doesn't solve the fundamental > problem but I think it helps close the race you saw in your test. In the long > term, I agree that we should think about implementing an extent tree cache > and using it to hold pending uninitialized-to-initialized extent conversions. Does Google have plans to do this soon? We use a large amount of direct reads, and we can arrange some resources to try to work it out. Thanks Tao ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0) 2011-08-18 6:49 ` Michael Tokarev 2011-08-18 18:54 ` Jiaying Zhang @ 2011-08-19 3:18 ` Tao Ma 2011-08-19 7:05 ` Michael Tokarev 1 sibling, 1 reply; 43+ messages in thread From: Tao Ma @ 2011-08-19 3:18 UTC (permalink / raw) To: Michael Tokarev Cc: Ted Ts'o, Jiaying Zhang, Jan Kara, linux-ext4, sandeen Hi Michael, On 08/18/2011 02:49 PM, Michael Tokarev wrote: > 17.08.2011 21:02, Ted Ts'o wrote: > [] >> What I'd like to do long-term here is to change things so that instead >> of instantiating the extent as uninitialized, writing the data, and then >> doing the uninit->init conversion, we write the data and then >> instantiate the extent as initialized. This would also >> allow us to get rid of data=ordered mode. And we should make it work >> for fs block size != page size. >> >> It means that we need a way of adding this sort of information to an >> in-memory extent cache which isn't saved to disk until the data is >> written. We've also talked about adding information about whether >> an extent is subject to delalloc as well, so we don't have to grovel >> through the page cache and look at individual buffers attached to the >> pages. And there are folks who have been experimenting with an >> in-memory extent tree cache to speed access to fast PCIe-attached >> flash. >> >> It seems to me that if we're careful a single solution should be able >> to solve all of these problems... > > What about the current situation - what do you think: should it be ignored > for now, bearing in mind that dioread_nolock isn't used often (but it > gives a _serious_ difference in read speed), or should we, short term, fix this > very case, which has real-life impact already, while implementing a > long-term solution? So could you please share with us how you test, and your test results with/without dioread_nolock? A quick test with fio and an Intel SSD doesn't show much improvement here. We are based on RHEL6, dioread_nolock isn't there yet, and a large number of our production systems use direct reads and buffered writes. So if your test proves to be promising, I guess our company can arrange some resources to try to work it out. Thanks Tao ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-19  3:18 ` Tao Ma
@ 2011-08-19  7:05   ` Michael Tokarev
  2011-08-19 17:55     ` Jiaying Zhang
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Tokarev @ 2011-08-19 7:05 UTC (permalink / raw)
To: Tao Ma; +Cc: Ted Ts'o, Jiaying Zhang, Jan Kara, linux-ext4, sandeen

On 19.08.2011 07:18, Tao Ma wrote:
> Hi Michael,
> On 08/18/2011 02:49 PM, Michael Tokarev wrote:
[]
>> What about the current situation, what do you think - should it be
>> ignored for now, having in mind that dioread_nolock isn't used often
>> (but it gives a _serious_ difference in read speed), or should we fix
>> this very case, which already has real-life impact, short-term, while
>> implementing a long-term solution?
> So could you please share with us how you test and your test results
> with/without dioread_nolock? A quick test with fio and an Intel SSD
> doesn't show much improvement here.

I have used my home-grown quick-n-dirty microbenchmark for years to
measure I/O subsystem performance. Here are the results from a 3.0
kernel on some Hitachi NAS (FC, on Brocade adaptors), with a 14-drive
RAID10 array.

The numbers are all megabytes/sec transferred (read or written), summed
for all threads. The leftmost column is the block size; the next column
is the number of concurrent threads of the same type. The remaining
columns are the tests: linear read, random read, linear write, random
write, and concurrent random read and write.

For a raw device:

BlkSz Trd  linRd  rndRd  linWr  rndWr   rndR/W
   4k   1   18.3    0.8   14.5    9.6    0.1/  9.1
        4           2.5           9.4    0.4/  8.4
       32          10.0           9.3    4.7/  5.4
  16k   1   59.4    2.5   49.9   35.7    0.3/ 34.7
        4          10.3          36.1    1.5/ 31.4
       32          38.5          36.2   17.5/ 20.4
  64k   1  118.4    9.1  136.0  106.5    1.1/105.8
        4          37.7         108.5    4.7/102.6
       32         153.0         108.5   57.9/ 73.3
 128k   1  125.9   16.5  138.8  125.8    1.1/125.6
        4          68.7         128.7    6.3/122.8
       32         277.0         128.7   70.3/ 98.6
1024k   1   89.9   81.2  138.9  134.4    5.0/132.3
        4         254.7         137.6   19.2/127.1
       32         390.7         137.5  117.2/ 90.1

For ext4fs, 1Tb file, default mount options:

BlkSz Trd  linRd  rndRd  linWr  rndWr   rndR/W
   4k   1   15.7    0.6   15.4    9.4    0.0/  9.0
        4           2.6           9.3    0.0/  8.9
       32          10.0           9.3    0.0/  8.9
  16k   1   47.6    2.5   53.2   34.6    0.1/ 33.6
        4          10.2          34.6    0.0/ 33.5
       32          39.9          34.8    0.1/ 33.6
  64k   1  100.5    9.0  137.0  106.2    0.2/105.8
        4          37.8         107.8    0.1/106.1
       32         153.9         107.8    0.2/105.9
 128k   1  115.4   16.3  138.6  125.2    0.3/125.3
        4          68.8         127.8    0.2/125.6
       32         274.6         127.8    0.2/126.2
1024k   1  124.5   54.2  138.9  133.6    1.0/133.3
        4         159.5         136.6    0.2/134.3
       32         349.7         136.5    0.3/133.6

And for a 1Tb file on ext4fs with dioread_nolock:

BlkSz Trd  linRd  rndRd  linWr  rndWr   rndR/W
   4k   1   15.7    0.6   14.6    9.4    0.1/  9.0
        4           2.6           9.4    0.3/  8.6
       32          10.0           9.4    4.5/  5.3
  16k   1   50.9    2.4   56.7   36.0    0.3/ 35.2
        4          10.1          36.4    1.5/ 34.6
       32          38.7          36.4   17.3/ 21.0
  64k   1   95.2    8.9  136.5  106.8    1.0/106.3
        4          37.7         108.4    5.2/103.3
       32         152.7         108.6   57.4/ 74.0
 128k   1  115.1   16.3  138.8  125.8    1.2/126.4
        4          68.9         128.5    5.7/124.0
       32         276.1         128.6   70.8/ 98.5
1024k   1  128.5   81.9  138.9  134.4    5.1/132.3
        4         253.4         137.4   19.1/126.8
       32         385.1         137.4  111.7/ 92.3

These are the complete test results. The first four result columns are
nearly identical across the three runs; the difference is in the last
column.

Here they are together:

BlkSz Trd      Raw        Ext4nolock    Ext4dflt
   4k   1    0.1/  9.1     0.1/  9.0    0.0/  9.0
        4    0.4/  8.4     0.3/  8.6    0.0/  8.9
       32    4.7/  5.4     4.5/  5.3    0.0/  8.9
  16k   1    0.3/ 34.7     0.3/ 35.2    0.1/ 33.6
        4    1.5/ 31.4     1.5/ 34.6    0.0/ 33.5
       32   17.5/ 20.4    17.3/ 21.0    0.1/ 33.6
  64k   1    1.1/105.8     1.0/106.3    0.2/105.8
        4    4.7/102.6     5.2/103.3    0.1/106.1
       32   57.9/ 73.3    57.4/ 74.0    0.2/105.9
 128k   1    1.1/125.6     1.2/126.4    0.3/125.3
        4    6.3/122.8     5.7/124.0    0.2/125.6
       32   70.3/ 98.6    70.8/ 98.5    0.2/126.2
1024k   1    5.0/132.3     5.1/132.3    1.0/133.3
        4   19.2/127.1    19.1/126.8    0.2/134.3
       32  117.2/ 90.1   111.7/ 92.3    0.3/133.6

Ext4 with dioread_nolock (middle column) behaves close to the raw
device. But default ext4 greatly prefers writes over reads; reads are
almost non-existent.

This is, again, more or less a microbenchmark. Where it comes from is my
attempt to simulate an (Oracle) database workload (many years ago, when
the larger and now-standard benchmarks weren't (freely) available). And
there, on a busy DB, the difference is quite well-visible. In short, any
writer makes all readers wait. Once we start writing something, all
users immediately notice. With dioread_nolock they don't complain
anymore.

There's some more background to all this. Right now I'm evaluating a new
machine for our current database. The old hardware had 2Gb of RAM, so it
was under _significant_ memory pressure, and lots of stuff couldn't be
cached. The new machine has 128Gb of RAM, which will ensure that all the
important stuff is in cache. So the effect of this read/write imbalance
will be much less visible.

For example, we have a dictionary (several tables) with addresses -
towns, streets, even buildings. When operators enter customer
information they search these dictionaries. With the current 2Gb of
memory these dictionaries can't be kept in memory, so they get read from
disk again every time someone enters customer information, and that is
what the operators do all the time. So no doubt disk access is very
important here.

On the new hardware, obviously, all these dictionaries will be in memory
after the first access, so even if all reads have to wait until any
write completes, it won't be as dramatic as it is now.

That is to say - maybe I'm really paying too much attention to the wrong
problem. So far, on the new machine, I don't see an actual noticeable
difference between dioread_nolock and without that option.

(BTW, I found no way to remount a filesystem to EXclude that option; I
have to umount and mount it again in order to switch from using
dioread_nolock to not using it. Is there a way?)

Thanks,

/mjt

> We are based on RHEL6, where dioread_nolock isn't available yet, and a
> large number of our production systems use direct reads and buffered
> writes. So if your test proves promising, I guess our company can
> arrange some resources to try to work it out.
>
> Thanks
> Tao
^ permalink raw reply	[flat|nested] 43+ messages in thread
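[For readers who want to reproduce this kind of workload: the sketch below is a stand-in for the kind of test Michael describes, not his actual tool, which was not posted and which used AIO via io_submit. This version simply runs one random reader and one random writer thread against a preallocated file opened with O_DIRECT (e.g. created first with dd if=/dev/zero of=testfile bs=1M count=4096) and reports MB/s for each; file name, sizes, and run time are illustrative.]

	/* diomix.c - rough mixed random read/write O_DIRECT tester.
	 * Build: cc -O2 -o diomix diomix.c -lpthread */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <unistd.h>

	#define BLKSZ (128 * 1024)       /* block size under test */
	#define SPAN  (4ULL << 30)       /* file region to hit: 4 GiB */
	#define SECS  10                 /* run time */

	static int fd;
	static volatile int stop;

	static void *worker(void *arg)   /* arg != NULL => writer */
	{
		void *buf;
		unsigned long long done = 0;
		unsigned seed = (unsigned)(uintptr_t)arg ^ (unsigned)time(NULL);

		if (posix_memalign(&buf, 4096, BLKSZ))  /* O_DIRECT alignment */
			return NULL;
		memset(buf, 0x55, BLKSZ);
		while (!stop) {
			off_t off = (off_t)(rand_r(&seed) % (SPAN / BLKSZ)) * BLKSZ;
			ssize_t r = arg ? pwrite(fd, buf, BLKSZ, off)
					: pread(fd, buf, BLKSZ, off);
			if (r != BLKSZ)
				break;
			done += BLKSZ;
		}
		printf("%s: %.1f MB/s\n", arg ? "write" : "read",
		       done / 1048576.0 / SECS);
		return NULL;
	}

	int main(int argc, char **argv)
	{
		pthread_t rd, wr;

		fd = open(argc > 1 ? argv[1] : "testfile", O_RDWR | O_DIRECT);
		if (fd < 0) { perror("open"); return 1; }
		pthread_create(&rd, NULL, worker, (void *)0);
		pthread_create(&wr, NULL, worker, (void *)1);
		sleep(SECS);
		stop = 1;
		pthread_join(rd, NULL);
		pthread_join(wr, NULL);
		return 0;
	}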
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-19  7:05 ` Michael Tokarev
@ 2011-08-19 17:55   ` Jiaying Zhang
  0 siblings, 0 replies; 43+ messages in thread
From: Jiaying Zhang @ 2011-08-19 17:55 UTC (permalink / raw)
To: Michael Tokarev; +Cc: Tao Ma, Ted Ts'o, Jan Kara, linux-ext4, sandeen

On Fri, Aug 19, 2011 at 12:05 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> On 19.08.2011 07:18, Tao Ma wrote:
>> So could you please share with us how you test and your test results
>> with/without dioread_nolock? A quick test with fio and an Intel SSD
>> doesn't show much improvement here.
[]
> (BTW, I found no way to remount a filesystem to EXclude that option; I
> have to umount and mount it again in order to switch from using
> dioread_nolock to not using it. Is there a way?)
I think the command to do this is:

mount -o remount,dioread_lock /dev/xxx <mountpoint>

Now looking at this, I guess it is not very intuitive that the option to
turn off dioread_nolock is dioread_lock instead of nodioread_nolock, but
nodioread_nolock does look ugly. Maybe we should try to support both
ways.

Jiaying

>
> Thanks,
>
> /mjt
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-16 21:32 ` Jiaying Zhang
  2011-08-16 22:28 ` Michael Tokarev
@ 2011-08-16 23:59   ` Dave Chinner
  2011-08-17  0:08     ` Jiaying Zhang
  1 sibling, 1 reply; 43+ messages in thread
From: Dave Chinner @ 2011-08-16 23:59 UTC (permalink / raw)
To: Jiaying Zhang; +Cc: Tao Ma, Jan Kara, Michael Tokarev, linux-ext4, sandeen

On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
> > On 08/16/2011 09:53 PM, Jan Kara wrote:
> I wonder whether the following patch will solve the problem:
>
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 6c27111..ca90d73 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -800,12 +800,17 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  	}
>
>  retry:
> -	if (rw == READ && ext4_should_dioread_nolock(inode))
> +	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> +		if (unlikely(!list_empty(&ei->i_completed_io_list))) {
> +			mutex_lock(&inode->i_mutex);
> +			ext4_flush_completed_IO(inode);
> +			mutex_unlock(&inode->i_mutex);
> +		}
>  		ret = __blockdev_direct_IO(rw, iocb, inode,
>  				 inode->i_sb->s_bdev, iov,
>  				 offset, nr_segs,
>  				 ext4_get_block, NULL, NULL, 0);
> -	else {
> +	} else {
>  		ret = blockdev_direct_IO(rw, iocb, inode,
>  				 inode->i_sb->s_bdev, iov,
>  				 offset, nr_segs,
>
> I tested the patch a little bit and it seems to resolve the race
> on dioread_nolock in my case. Michael, I would very much appreciate
> it if you could try this patch with your test case and see whether
> it works.

Just my 2c worth here: this is a data corruption bug so the root
cause needs to be fixed. The above patch does not address the root
cause.

> > You are absolutely right. The real problem is that ext4_direct_IO
> > begins to work *after* we clear the page writeback flag and *before* we
> > convert the unwritten extent to a valid state. Some of my traces do show
> > that. I am working on it now.

And that's the root cause - think about what that means for a
minute. It means that extent conversion can race with anything that
requires IO to complete first, e.g. truncate or fsync. It can then
race with other subsequent operations, which can have even nastier
effects. IOWs, there is a data-corruption landmine just sitting
there waiting for the next person to trip over it.

Fix the root cause, don't put band-aids over the symptoms.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 43+ messages in thread
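[To make Dave's point concrete, the ordering at issue can be sketched like this - schematic pseudo-code in comment form, not actual ext4 code:]

	/*
	 * Buggy ordering in the 3.0 dioread_nolock writeback path:
	 *
	 *     write_data_to_unwritten_extent();
	 *     end_page_writeback(page);       // (1) writeback waiters wake up
	 *     queue_conversion_work(io_end);  // (2) uninit->init runs later
	 *
	 * Safe ordering (what XFS does, per the later discussion):
	 *
	 *     write_data_to_unwritten_extent();
	 *     queue_conversion_work(io_end);  // worker converts the extent...
	 *     // ...and the worker calls end_page_writeback() afterwards
	 *
	 * Between (1) and (2), a truncate, fsync, or dioread_nolock direct
	 * read can see "writeback done" while the extent is still marked
	 * unwritten, and e.g. read zeroes for data that was in fact written.
	 */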
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-16 23:59 ` Dave Chinner
@ 2011-08-17  0:08   ` Jiaying Zhang
  2011-08-17  2:22     ` Tao Ma
  2011-08-17  9:04     ` Jan Kara
  0 siblings, 2 replies; 43+ messages in thread
From: Jiaying Zhang @ 2011-08-17 0:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: Tao Ma, Jan Kara, Michael Tokarev, linux-ext4, sandeen

On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
>> On Tue, Aug 16, 2011 at 8:03 AM, Tao Ma <tm@tao.ma> wrote:
>> > On 08/16/2011 09:53 PM, Jan Kara wrote:
>> I wonder whether the following patch will solve the problem:
>> []
>>
>> I tested the patch a little bit and it seems to resolve the race
>> on dioread_nolock in my case. Michael, I would very much appreciate
>> it if you could try this patch with your test case and see whether
>> it works.
>
> Just my 2c worth here: this is a data corruption bug so the root
> cause needs to be fixed. The above patch does not address the root
> cause.
>
>> > You are absolutely right. The real problem is that ext4_direct_IO
>> > begins to work *after* we clear the page writeback flag and *before* we
>> > convert the unwritten extent to a valid state. Some of my traces do show
>> > that. I am working on it now.
>
> And that's the root cause - think about what that means for a
> minute. It means that extent conversion can race with anything that
> requires IO to complete first, e.g. truncate or fsync. It can then
> race with other subsequent operations, which can have even nastier
> effects. IOWs, there is a data-corruption landmine just sitting
> there waiting for the next person to trip over it.
You are right that extent conversion can race with truncate and fsync
as well. That is why we already need to call ext4_flush_completed_IO()
in those places as well. I agree this is a little nasty and there can be
some other corner cases that we haven't covered. The problem is we
cannot do extent conversion at end_io time. I haven't thought of
a better approach to deal with these races. I am curious how xfs deals
with this problem.

Jiaying

>
> Fix the root cause, don't put band-aids over the symptoms.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-17  0:08 ` Jiaying Zhang
@ 2011-08-17  2:22   ` Tao Ma
  0 siblings, 0 replies; 43+ messages in thread
From: Tao Ma @ 2011-08-17 2:22 UTC (permalink / raw)
To: Jiaying Zhang
Cc: Dave Chinner, Jan Kara, Michael Tokarev, linux-ext4, sandeen

Hi Jiaying,

On 08/17/2011 08:08 AM, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
>>> I wonder whether the following patch will solve the problem:
>>> []
>>
>> Just my 2c worth here: this is a data corruption bug so the root
>> cause needs to be fixed. The above patch does not address the root
>> cause.
[]
> You are right that extent conversion can race with truncate and fsync
> as well. That is why we already need to call ext4_flush_completed_IO()
> in those places as well. I agree this is a little nasty and there can be
> some other corner cases that we haven't covered. The problem is we
> cannot do extent conversion at end_io time. I haven't thought of
> a better approach to deal with these races. I am curious how xfs deals
> with this problem.
I agree with Dave that we may need to figure out a better way to do
this. What's more, your patch has another side effect: with concurrent
direct reads and buffered writes, the direct read is affected even when
the two don't overlap. Do you have any test data on the performance
regression?

Thanks
Tao

>
>> Fix the root cause, don't put band-aids over the symptoms.
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@fromorbit.com
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-17  0:08 ` Jiaying Zhang
  2011-08-17  2:22 ` Tao Ma
@ 2011-08-17  9:04   ` Jan Kara
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Kara @ 2011-08-17 9:04 UTC (permalink / raw)
To: Jiaying Zhang
Cc: Dave Chinner, Tao Ma, Jan Kara, Michael Tokarev, linux-ext4, sandeen

On Tue 16-08-11 17:08:42, Jiaying Zhang wrote:
> On Tue, Aug 16, 2011 at 4:59 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Aug 16, 2011 at 02:32:12PM -0700, Jiaying Zhang wrote:
> >> I wonder whether the following patch will solve the problem:
> >> []
> >
> > Just my 2c worth here: this is a data corruption bug so the root
> > cause needs to be fixed. The above patch does not address the root
> > cause.
> >
> >> > You are absolutely right. The real problem is that ext4_direct_IO
> >> > begins to work *after* we clear the page writeback flag and *before* we
> >> > convert the unwritten extent to a valid state. Some of my traces do show
> >> > that. I am working on it now.
> >
> > And that's the root cause - think about what that means for a
> > minute. It means that extent conversion can race with anything that
> > requires IO to complete first, e.g. truncate or fsync. It can then
> > race with other subsequent operations, which can have even nastier
> > effects. IOWs, there is a data-corruption landmine just sitting
> > there waiting for the next person to trip over it.
> You are right that extent conversion can race with truncate and fsync
> as well. That is why we already need to call ext4_flush_completed_IO()
> in those places as well. I agree this is a little nasty and there can be
> some other corner cases that we haven't covered.
  Exactly. I agree with Dave here that it is asking for serious trouble
to clear the PageWriteback bit before really completing the IO.

> The problem is we cannot do extent conversion at end_io time. I haven't
> thought of a better approach to deal with these races. I am curious how
> xfs deals with this problem.
  Well, XFS cannot do extent conversion in end_io for AIO+DIO either. So
it clears the PageWriteback bit only after the extent conversion has
happened in the worker thread. ext4 has problems with this (deadlocks)
because of the unlucky locking of the extent tree using i_mutex.

So I believe we have to find better locking for the extent tree so that
ext4 can clear the PageWriteback bit from the worker thread...

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 43+ messages in thread
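[A simplified sketch of the ordering Jan describes - all helper and workqueue names here are invented for illustration, locking and error handling are deliberately ignored, and this is not the actual ext4 or XFS implementation:]

	/* The bio completion handler may run in irq context, so it must not
	 * touch the extent tree; it only queues work.  The worker converts
	 * the extent FIRST and clears PageWriteback LAST, so anything that
	 * waits on writeback can never observe a still-unwritten extent. */

	static void dio_unwritten_end_bio(struct bio *bio, int error)
	{
		ext4_io_end_t *io_end = bio->bi_private;

		/* note: no end_page_writeback() here - just defer */
		queue_work(ext4_conversion_wq, &io_end->work);
	}

	static void dio_unwritten_work(struct work_struct *work)
	{
		ext4_io_end_t *io_end = container_of(work, ext4_io_end_t, work);
		int i;

		/* 1. make the extent valid on disk; may sleep and takes the
		 *    extent-tree locks, hence it cannot run in irq context */
		ext4_convert_unwritten_extents(io_end->inode, io_end->offset,
					       io_end->size);

		/* 2. only now declare the writeback really finished */
		for (i = 0; i < io_end->num_io_pages; i++)
			end_page_writeback(io_end->pages[i]->p_page);
	}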
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-13 16:02 ` Tao Ma
  2011-08-14 20:57 ` Michael Tokarev
@ 2011-08-15 16:08   ` Eric Sandeen
  2011-08-16  4:12     ` Tao Ma
  2011-08-16  6:15     ` Tao Ma
  1 sibling, 2 replies; 43+ messages in thread
From: Eric Sandeen @ 2011-08-15 16:08 UTC (permalink / raw)
To: Tao Ma; +Cc: linux-ext4, mjt

On 8/13/11 11:02 AM, Tao Ma wrote:
> From: Tao Ma <boyu.mt@taobao.com>
>
> Hi Michael,
> could you please check whether this patch works for you?

Thanks, this is the patch I meant to send, I guess ;) I think I missed
the "|=" case before :(

Given that every place we set it requires a test and an increment if not
set, maybe a helper to do all that is in order?

-Eric

>
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
> From 4f1d797973da8b5cf79f58cc1e4d22efb9d46d19 Mon Sep 17 00:00:00 2001
> From: Tao Ma <boyu.mt@taobao.com>
> Date: Sat, 13 Aug 2011 23:50:28 +0800
> Subject: [PATCH] ext4: Set unwritten flag and increase i_aiodio_unwritten simultaneously.
>
> Setting the EXT4_IO_END_UNWRITTEN flag and increasing i_aiodio_unwritten
> should be done simultaneously, since ext4_end_io_nolock always clears the
> flag and decreases the counter at the same time.
>
> We don't increase i_aiodio_unwritten when setting EXT4_IO_END_UNWRITTEN,
> so it can go negative and cause some process to wait forever.
>
> Part of the patch came from Eric in his e-mail.
> http://marc.info/?l=linux-ext4&m=131316851417460&w=2
>
> Reported-by: Michael Tokarev <mjt@tls.msk.ru>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
> ---
>  fs/ext4/inode.c   |  9 ++++++++-
>  fs/ext4/page-io.c |  6 ++++--
>  2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d47264c..40c0b10 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2668,8 +2668,15 @@ static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
>  		goto out;
>  	}
>
> -	io_end->flag = EXT4_IO_END_UNWRITTEN;
> +	/*
> +	 * It may be over-defensive here to check EXT4_IO_END_UNWRITTEN now,
> +	 * but being more careful is always safe for future changes.
> +	 */
>  	inode = io_end->inode;
> +	if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> +	}
>
>  	/* Add the io_end to per-inode completed io list*/
>  	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 430c401..78839af 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -334,8 +334,10 @@ submit_and_retry:
>  	if ((io_end->num_io_pages >= MAX_IO_PAGES) &&
>  	    (io_end->pages[io_end->num_io_pages-1] != io_page))
>  		goto submit_and_retry;
> -	if (buffer_uninit(bh))
> -		io->io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +	if (buffer_uninit(bh) && !(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
> +		io_end->flag |= EXT4_IO_END_UNWRITTEN;
> +		atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
> +	}
>  	io->io_end->size += bh->b_size;
>  	io->io_next_block++;
>  	ret = bio_add_page(io->io_bio, bh->b_page, bh->b_size, bh_offset(bh));
^ permalink raw reply	[flat|nested] 43+ messages in thread
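[The helper Eric suggests would presumably look something like the following - the name and placement are guesses, it simply centralizes the test-and-increment:]

	/* Sketch of the suggested helper: set the unwritten flag and bump
	 * the in-flight counter in one place, so the two can never get out
	 * of sync again.  Name and location are hypothetical. */
	static inline void ext4_set_io_unwritten_flag(struct inode *inode,
						      ext4_io_end_t *io_end)
	{
		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN)) {
			io_end->flag |= EXT4_IO_END_UNWRITTEN;
			atomic_inc(&EXT4_I(inode)->i_aiodio_unwritten);
		}
	}

Both call sites in the patch above would then reduce to something like "if (buffer_uninit(bh)) ext4_set_io_unwritten_flag(inode, io_end);", making it impossible to set the flag without bumping the counter.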
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-15 16:08 ` Eric Sandeen
@ 2011-08-16  4:12   ` Tao Ma
  0 siblings, 0 replies; 43+ messages in thread
From: Tao Ma @ 2011-08-16 4:12 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-ext4, mjt

On 08/16/2011 12:08 AM, Eric Sandeen wrote:
> On 8/13/11 11:02 AM, Tao Ma wrote:
>> From: Tao Ma <boyu.mt@taobao.com>
>>
>> Hi Michael,
>> could you please check whether this patch works for you?
>
> Thanks, this is the patch I meant to send, I guess ;) I think I missed
> the "|=" case before :(
>
> Given that every place we set it requires a test and an increment if not
> set, maybe a helper to do all that is in order?
Sure, I can work out a patch like that. Thanks for the review.

Thanks
Tao

>
> -Eric
>
>> []
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-15 16:08 ` Eric Sandeen
  2011-08-16  4:12 ` Tao Ma
@ 2011-08-16  6:15   ` Tao Ma
  1 sibling, 0 replies; 43+ messages in thread
From: Tao Ma @ 2011-08-16 6:15 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-ext4, mjt

Hi Eric,

On 08/16/2011 12:08 AM, Eric Sandeen wrote:
> On 8/13/11 11:02 AM, Tao Ma wrote:
>> From: Tao Ma <boyu.mt@taobao.com>
>>
>> Hi Michael,
>> could you please check whether this patch works for you?
>
> Thanks, this is the patch I meant to send, I guess ;) I think I missed
> the "|=" case before :(
>
> Given that every place we set it requires a test and an increment if not
> set, maybe a helper to do all that is in order?
I have decided to leave it to the next merge window. So I will re-send
this patch as a fix and then create another patch for the helper
generalization.

Thanks
Tao

>
> -Eric
>
>> []
^ permalink raw reply	[flat|nested] 43+ messages in thread
* Re: DIO process stuck apparently due to dioread_nolock (3.0)
  2011-08-12 15:55 ` Michael Tokarev
  2011-08-12 17:01 ` Eric Sandeen
@ 2011-08-12 21:19   ` Jan Kara
  1 sibling, 0 replies; 43+ messages in thread
From: Jan Kara @ 2011-08-12 21:19 UTC (permalink / raw)
To: Michael Tokarev; +Cc: Jan Kara, Jiaying Zhang, linux-ext4

On Fri 12-08-11 19:55:00, Michael Tokarev wrote:
> 12.08.2011 17:07, Jan Kara wrote:
> []
> >>>> [   76.982985] EXT4-fs (dm-1): Unaligned AIO/DIO on inode 3407879 by oracle; performance will be poor.
> >>>> [ 1469.734114] SysRq : Show Blocked State
> >>>> [ 1469.734157]   task                        PC stack   pid father
> >>>> [ 1469.734473] oracle          D 0000000000000000     0  6146      1 0x00000000
> >>>> [ 1469.734525]  ffff88103f604810 0000000000000082 ffff881000000000 ffff881079791040
> >>>> [ 1469.734603]  ffff880432c19fd8 ffff880432c19fd8 ffff880432c19fd8 ffff88103f604810
> >>>> [ 1469.734681]  ffffea000ec13590 ffffffff00000000 ffff881438c8dad8 ffffffff810eeda2
> >>>> [ 1469.734760] Call Trace:
> >>>> [ 1469.734800]  [<ffffffff810eeda2>] ? __do_fault+0x422/0x520
> >>>> [ 1469.734863]  [<ffffffffa0123e6d>] ? ext4_file_write+0x20d/0x260 [ext4]
> >>>> [ 1469.734909]  [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> >>>> [ 1469.734956]  [<ffffffffa0123c60>] ? ext4_llseek+0x120/0x120 [ext4]
> >>>> [ 1469.734999]  [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> >>>> [ 1469.735039]  [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> >>>> [ 1469.735078]  [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> >>>> [ 1469.735122]  [<ffffffff81359b52>] ? system_call_fastpath+0x16/0x1b
> []
> > Probably not, just garbage on the stack confuses the stack unwinder. Note
> > that e.g. ext4_llseek() is at the end of the function, as well as
> > abort_exclusive_wait(), which probably means these functions have already
> > finished and just left their addresses on the stack. Could you disassemble
> > your ext4_file_write() function and check where offset 0x20d is? Sadly you
> > won't see where calls are going when ext4 is compiled as a module (a module
> > is linked when loaded into the kernel), so it might be easier to compile
> > ext4 into the kernel and disassemble the function in the vmlinux file.
> >
> >> Btw, does ext4_llseek() look sane here? Note it's called from
> >> aio_submit() -- does it _ever_ implement SEEKs?
> >>
> >> Maybe some debugging is necessary here?
> > Yes. The trouble is I'm not completely sure where we are hanging yet.
> > We should know more from your disassembly. But you can try running with
> > the attached debug patch - maybe it shows something interesting.
>
> With ext4 built-in and your patch applied:
>
> [  429.061524] EXT4-fs (sda11): Unaligned AIO/DIO on inode 5767181 by oracle; performance will be poor.
> [  429.061669] Going to wait for 18446744073709551199 aios
> [  437.717942] SysRq : Show Blocked State
> [  437.718109]   task                        PC stack   pid father
> [  437.718528] oracle          D 0000000000000000     0  3844      1 0x00000000
> [  437.718767]  ffff88203e330080 0000000000000082 0000000000000000 ffff881078e3f7d0
> [  437.719156]  ffff88203dc15fd8 ffff88203dc15fd8 ffff88203dc15fd8 ffff88203e330080
> [  437.719546]  0000001e0000000f 0000000000000246 205b88103f806680 3136302e39323420
> [  437.719935] Call Trace:
> [  437.720043]  [<ffffffff8139b976>] ? printk+0x43/0x48
> [  437.720155]  [<ffffffff8118e16d>] ? ext4_file_write+0x21d/0x290
> [  437.720267]  [<ffffffff8106aee0>] ? abort_exclusive_wait+0xb0/0xb0
> [  437.720376]  [<ffffffff8118df50>] ? ext4_llseek+0x120/0x120
> [  437.720485]  [<ffffffff81162173>] ? aio_rw_vect_retry+0x73/0x1d0
> [  437.720593]  [<ffffffff8116302f>] ? aio_run_iocb+0x5f/0x160
> [  437.720699]  [<ffffffff81164258>] ? do_io_submit+0x4f8/0x600
> [  437.720811]  [<ffffffff813a3152>] ? system_call_fastpath+0x16/0x1b
>
> Trying to find how to disassemble things now...
> And yes, 18446744073709551199 aios sounds like quite a lot ;)
  I think you don't have to. My stab-in-the-dark patch is showing that we
underflow the number of outstanding aiodio calls - we think there are -416
of these calls outstanding. Thus we wait indefinitely. So now we just have
to find where the accounting is wrong.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 43+ messages in thread
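[The number in Michael's log is itself the tell: 18446744073709551199 = 2^64 - 417, i.e. a small negative value printed through an unsigned format. A tiny userspace demonstration, assuming a 64-bit long as on amd64:]

	#include <stdio.h>

	int main(void)
	{
		long counter = -417;  /* an underflowed in-flight counter */
		/* prints 18446744073709551199 on an LP64 system */
		printf("%lu\n", (unsigned long)counter);
		return 0;
	}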
end of thread, other threads:[~2011-08-19 17:55 UTC | newest]

Thread overview: 43+ messages

2011-08-10 10:51 DIO process stuck apparently due to dioread_nolock (3.0) Michael Tokarev
2011-08-11 11:59 ` Jan Kara
2011-08-11 12:21 ` Michael Tokarev
2011-08-11 14:01 ` Jan Kara
2011-08-11 20:05 ` Michael Tokarev
2011-08-12  2:46 ` Jiaying Zhang
2011-08-12  6:23 ` Michael Tokarev
2011-08-12  7:07 ` Michael Tokarev
2011-08-12 13:07 ` Jan Kara
2011-08-12 15:55 ` Michael Tokarev
2011-08-12 17:01 ` Eric Sandeen
2011-08-12 17:34 ` Michael Tokarev
2011-08-13 16:02 ` Tao Ma
2011-08-14 20:57 ` Michael Tokarev
2011-08-14 21:07 ` Michael Tokarev
2011-08-15  2:36 ` Tao Ma
2011-08-15  8:00 ` Michael Tokarev
2011-08-15  8:56 ` Michael Tokarev
2011-08-15  9:03 ` Michael Tokarev
2011-08-15 10:28 ` Tao Ma
2011-08-15 23:53 ` Jiaying Zhang
2011-08-16  4:15 ` Tao Ma
2011-08-16  8:38 ` Michael Tokarev
2011-08-16 13:53 ` Jan Kara
2011-08-16 15:03 ` Tao Ma
2011-08-16 21:32 ` Jiaying Zhang
2011-08-16 22:28 ` Michael Tokarev
2011-08-16 23:07 ` Jiaying Zhang
2011-08-17 17:02 ` Ted Ts'o
2011-08-18  6:49 ` Michael Tokarev
2011-08-18 18:54 ` Jiaying Zhang
2011-08-19  3:20 ` Tao Ma
2011-08-19  3:18 ` Tao Ma
2011-08-19  7:05 ` Michael Tokarev
2011-08-19 17:55 ` Jiaying Zhang
2011-08-16 23:59 ` Dave Chinner
2011-08-17  0:08 ` Jiaying Zhang
2011-08-17  2:22 ` Tao Ma
2011-08-17  9:04 ` Jan Kara
2011-08-15 16:08 ` Eric Sandeen
2011-08-16  4:12 ` Tao Ma
2011-08-16  6:15 ` Tao Ma
2011-08-12 21:19 ` Jan Kara