* Fwd: Ext4 bug with fallocate
From: Fredrik Andersson @ 2009-10-18 15:47 UTC
To: linux-ext4

Hi, I'd like to report what I'm fairly certain is an ext4 bug. I hope
this is the right place to do so.

My program creates a big file (around 30 GB) with posix_fallocate (to
utilize extents), fills it with data and uses ftruncate to crop it to
its final size (usually somewhere between 20 and 25 GB).

The problem is that in around 5% of the cases, the program locks up
completely in a syscall. The process thus cannot be killed even with
kill -9, and only a reboot will clear it.

Here are the contents of my /proc/PID/syscall:

76 0xee4a80 0x486d6aaf8 0x36390113f8 0x7fffc63cd350 0xecc050
0x7fffc63cd3f0 0x7fffc63cd5c8 0x36380e0cc7

Syscall 76 is, as far as I can tell, truncate on x86_64, so the process
must be stuck in the truncate call that crops the file.

The file is on a software RAID 0 mount with two disks, handled by
mdadm. I have reported the problem to the md people, but they insist it
is an ext4 problem. I have also tried closing the file and opening it
again prior to the ftruncate. No change.

There are no other strange phenomena whatsoever with ext4; this problem
only arises in this particular situation. I'm running Fedora on an
x86_64 system. I have tried this on several kernel versions, most
recently 2.6.31.1. It has persisted ever since the kernel that
originally shipped with Fedora 11.

Is this a bug/known problem? Thankful for any insights!

/Fredrik
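[For reference, a minimal C sketch of the allocate/fill/crop sequence the
report describes; the path, sizes and error handling are illustrative
assumptions, not the original program's source.]

    /*
     * Sketch of the reported pattern: preallocate, fill, then crop.
     * Path and sizes are made-up placeholders.
     * Build: gcc -D_FILE_OFFSET_BITS=64 -o prealloc prealloc.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t alloc_size = 30LL << 30; /* ~30 GB preallocated */
        const off_t final_size = 22LL << 30; /* cropped to ~22 GB */

        int fd = open("/mnt/raid0/bigfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Preallocate up front so ext4 lays the file out in large extents. */
        int err = posix_fallocate(fd, 0, alloc_size);
        if (err != 0) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            return 1;
        }

        /* ... fill the file with data here ... */

        /* Crop to the final size; this is the call that reportedly hangs. */
        if (ftruncate(fd, final_size) != 0) {
            perror("ftruncate");
            return 1;
        }

        return close(fd);
    }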
* Re: Fwd: Ext4 bug with fallocate
From: Eric Sandeen @ 2009-10-18 15:57 UTC
To: Fredrik Andersson; Cc: linux-ext4

Fredrik Andersson wrote:
> Hi, I'd like to report what I'm fairly certain is an ext4 bug. I hope
> this is the right place to do so.
>
> My program creates a big file (around 30 GB) with posix_fallocate (to
> utilize extents), fills it with data and uses ftruncate to crop it to
> its final size (usually somewhere between 20 and 25 GB).
>
> The problem is that in around 5% of the cases, the program locks up
> completely in a syscall. The process thus cannot be killed even with
> kill -9, and only a reboot will clear it.

Does echo w > /proc/sysrq-trigger (this dumps blocked processes; or use
echo t for all processes) show you where the stuck threads are?

-Eric
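[The same dump can also be requested from a program rather than the
shell; a trivial sketch, assuming root, with the output appearing in the
kernel log / dmesg:]

    /* Equivalent of `echo w > /proc/sysrq-trigger`, done from C. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/proc/sysrq-trigger", O_WRONLY);
        if (fd < 0) {
            perror("open /proc/sysrq-trigger");
            return 1;
        }
        if (write(fd, "w", 1) != 1) {  /* "t" would dump all tasks */
            perror("write");
            return 1;
        }
        return close(fd);
    }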
* Re: Fwd: Ext4 bug with fallocate
From: Fredrik Andersson @ 2009-10-19 9:49 UTC
To: Eric Sandeen, linux-ext4

Hi, here is the data for this process:

[5958816.744013] drdbmake      D ffff88021e4c7800     0 27019  13796
[5958816.744013]  ffff8801d1bcda88 0000000000000082 ffff8801f4ce9bf0 ffff8801678b1380
[5958816.744013]  0000000000010e80 000000000000c748 ffff8800404963c0 ffffffff81526360
[5958816.744013]  ffff880040496730 00000000f4ce9bf0 000000025819cebe 0000000000000282
[5958816.744013] Call Trace:
[5958816.744013]  [<ffffffff813a9639>] schedule+0x9/0x20
[5958816.744013]  [<ffffffff81177ea5>] start_this_handle+0x365/0x5d0
[5958816.744013]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[5958816.744013]  [<ffffffff811781ce>] jbd2_journal_restart+0xbe/0x150
[5958816.744013]  [<ffffffff8116243d>] ext4_ext_truncate+0x6dd/0xa20
[5958816.744013]  [<ffffffff81095b3b>] ? find_get_pages+0x3b/0xf0
[5958816.744013]  [<ffffffff81150a78>] ext4_truncate+0x198/0x680
[5958816.744013]  [<ffffffff810ac984>] ? unmap_mapping_range+0x74/0x280
[5958816.744013]  [<ffffffff811772c0>] ? jbd2_journal_stop+0x1e0/0x360
[5958816.744013]  [<ffffffff810acd25>] vmtruncate+0xa5/0x110
[5958816.744013]  [<ffffffff810dda10>] inode_setattr+0x30/0x180
[5958816.744013]  [<ffffffff8114d073>] ext4_setattr+0x173/0x310
[5958816.744013]  [<ffffffff810ddc79>] notify_change+0x119/0x330
[5958816.744013]  [<ffffffff810c6df3>] do_truncate+0x63/0x90
[5958816.744013]  [<ffffffff810d0cc3>] ? get_write_access+0x23/0x60
[5958816.744013]  [<ffffffff810c70cb>] sys_truncate+0x17b/0x180
[5958816.744013]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b

Don't know if this has anything to do with it, but I also noticed that
another process of mine, which is working just fine, is executing a
suspicious-looking function called raid0_unplug. It operates on the
same raid0/ext4 filesystem as the hung process. I include the call
trace for it here too:

[5958816.744013] nodeserv      D ffff880167bd7ca8     0 17900  13796
[5958816.744013]  ffff880167bd7bf8 0000000000000082 ffff88002800a588 ffff88021e5b56e0
[5958816.744013]  0000000000010e80 000000000000c748 ffff880100664020 ffffffff81526360
[5958816.744013]  ffff880100664390 000000008119bd17 000000026327bfa9 0000000000000002
[5958816.744013] Call Trace:
[5958816.744013]  [<ffffffffa0039291>] ? raid0_unplug+0x51/0x70 [raid0]
[5958816.744013]  [<ffffffff813a9639>] schedule+0x9/0x20
[5958816.744013]  [<ffffffff813a9687>] io_schedule+0x37/0x50
[5958816.744013]  [<ffffffff81095e35>] sync_page+0x35/0x60
[5958816.744013]  [<ffffffff81095e69>] sync_page_killable+0x9/0x50
[5958816.744013]  [<ffffffff813a99d2>] __wait_on_bit_lock+0x52/0xb0
[5958816.744013]  [<ffffffff81095e60>] ? sync_page_killable+0x0/0x50
[5958816.744013]  [<ffffffff81095d74>] __lock_page_killable+0x64/0x70
[5958816.744013]  [<ffffffff8105b940>] ? wake_bit_function+0x0/0x40
[5958816.744013]  [<ffffffff81095c0b>] ? find_get_page+0x1b/0xb0
[5958816.744013]  [<ffffffff81097908>] generic_file_aio_read+0x3b8/0x6b0
[5958816.744013]  [<ffffffff810c7dc1>] do_sync_read+0xf1/0x140
[5958816.744013]  [<ffffffff8106a5e8>] ? do_futex+0xb8/0xb20
[5958816.744013]  [<ffffffff813ab78f>] ? _spin_unlock_irqrestore+0x2f/0x40
[5958816.744013]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[5958816.744013]  [<ffffffff8105bc73>] ? add_wait_queue+0x43/0x60
[5958816.744013]  [<ffffffff81062a6c>] ? getnstimeofday+0x5c/0xf0
[5958816.744013]  [<ffffffff810c85b8>] vfs_read+0xc8/0x170
[5958816.744013]  [<ffffffff810c86fa>] sys_pread64+0x9a/0xa0
[5958816.744013]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b

Hope this makes sense to someone, and please let me know if there is
more info I can provide.

/Fredrik

On Sun, Oct 18, 2009 at 5:57 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> Fredrik Andersson wrote:
>> [...]
>
> Does echo w > /proc/sysrq-trigger (this dumps blocked processes; or
> use echo t for all processes) show you where the stuck threads are?
>
> -Eric
* Re: Fwd: Ext4 bug with fallocate
From: Fredrik Andersson @ 2009-10-20 16:49 UTC
To: Eric Sandeen, linux-ext4

I found the following post to the ext4 list. This seems to fit my
experienced problems pretty exactly:

http://osdir.com/ml/linux-ext4/2009-08/msg00184.html

Is it the same problem?

/Fredrik

On Mon, Oct 19, 2009 at 11:49 AM, Fredrik Andersson <nablaman@gmail.com> wrote:
> Hi, here is the data for this process:
>
> [drdbmake and nodeserv call traces quoted in full, snipped here]
* Re: Fwd: Ext4 bug with fallocate
From: Mingming @ 2009-10-21 0:24 UTC
To: Fredrik Andersson; Cc: Eric Sandeen, linux-ext4

On Tue, 2009-10-20 at 18:49 +0200, Fredrik Andersson wrote:
> I found the following post to the ext4 list. This seems to fit my
> experienced problems pretty exactly:
>
> http://osdir.com/ml/linux-ext4/2009-08/msg00184.html
>
> Is it the same problem?

The link you provided is about a race between restarting a transaction
from truncate and another process doing something like block allocation
to the same file. Do you have other threads allocating blocks while you
are truncating?

> On Mon, Oct 19, 2009 at 11:49 AM, Fredrik Andersson <nablaman@gmail.com> wrote:
> > Hi, here is the data for this process:
> >
> > [drdbmake call trace snipped]
> >
> > Don't know if this has anything to do with it, but I also noticed
> > that another process of mine, which is working just fine, is
> > executing a suspicious-looking function called raid0_unplug. It
> > operates on the same raid0/ext4 filesystem as the hung process.
> >
> > [nodeserv call trace ending in sys_pread64 snipped]

This stack suggests this thread is doing IO that never comes back.

> > Hope this makes sense to someone, and please let me know if there
> > is more info I can provide.
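[To make the question concrete: a hypothetical sketch of "another
thread allocating blocks while truncating". This is not the reporter's
program, which as stated below is single-threaded; the file name and
sizes are made up.]

    /*
     * One thread extends a file (forcing ext4 block allocation) while
     * the main thread truncates it, the combination the linked report
     * says can race over a journal transaction restart.
     * Build: gcc -pthread -o race race.c
     */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int fd;

    static void *allocator(void *unused)
    {
        char buf[4096] = { 0 };

        /* Extending writes make ext4 allocate blocks on the inode. */
        for (off_t off = 0; off < (1LL << 30); off += (off_t)sizeof(buf))
            if (pwrite(fd, buf, sizeof(buf), off) < 0)
                break;
        return NULL;
    }

    int main(void)
    {
        fd = open("testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        pthread_t t;
        pthread_create(&t, NULL, allocator, NULL);

        /* Truncate concurrently with the allocating writes. */
        usleep(100 * 1000);
        if (ftruncate(fd, 64L << 20) != 0)
            perror("ftruncate");

        pthread_join(t, NULL);
        return close(fd);
    }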
* Re: Fwd: Ext4 bug with fallocate
From: Fredrik Andersson @ 2009-10-22 7:37 UTC
To: Mingming, linux-ext4

> The link you provided is about a race between restarting a
> transaction from truncate and another process doing something like
> block allocation to the same file. Do you have other threads
> allocating blocks while you are truncating?

No, there is only a single thread in this program. No other thread or
process is accessing the same file or file descriptor.

/Fredrik
* Re: Fwd: Ext4 bug with fallocate
From: Fredrik Andersson @ 2009-10-26 10:43 UTC
To: linux-ext4

I had the same error again, on another machine. I'm posting the call
stacks of these hung processes too. Hope someone can figure this out.

[6515054.873144] INFO: task pdflush:269 blocked for more than 120 seconds.
[6515054.873148] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6515054.873151] pdflush       D ffff8801f4cf0ad4     0   269      2
[6515054.873156]  ffff88021e551960 0000000000000046 ffff88021e63d6e0 ffff88007ddd1540
[6515054.873161]  0000000000010e80 000000000000c748 ffff88021fa2c020 ffff88021f843900
[6515054.873166]  ffff88021fa2c390 0000000100000008 00000002844bfcb2 0080001000800013
[6515054.873170] Call Trace:
[6515054.873180]  [<ffffffff813a9639>] schedule+0x9/0x20
[6515054.873184]  [<ffffffff813ab2d5>] __down_read+0x85/0xb7
[6515054.873187]  [<ffffffff813aa565>] down_read+0x25/0x30
[6515054.873192]  [<ffffffff8114e6f2>] ext4_get_blocks_wrap+0x52/0x290
[6515054.873196]  [<ffffffff8114f601>] mpage_da_map_blocks+0xd1/0x7a0
[6515054.873202]  [<ffffffff8109d12c>] ? write_cache_pages+0x2fc/0x470
[6515054.873205]  [<ffffffff81150330>] ? __mpage_da_writepage+0x0/0x180
[6515054.873209]  [<ffffffff8114ffdb>] ext4_da_writepages+0x30b/0x590
[6515054.873213]  [<ffffffff8103d91b>] ? finish_task_switch+0x5b/0xd0
[6515054.873217]  [<ffffffff8109d2f8>] do_writepages+0x28/0x50
[6515054.873222]  [<ffffffff810e60fb>] __writeback_single_inode+0x9b/0x3e0
[6515054.873226]  [<ffffffff8104f830>] ? process_timeout+0x0/0x10
[6515054.873230]  [<ffffffff813a8827>] ? io_schedule_timeout+0x37/0x50
[6515054.873234]  [<ffffffff810e68ec>] generic_sync_sb_inodes+0x36c/0x440
[6515054.873237]  [<ffffffff810e6b8e>] writeback_inodes+0x5e/0x100
[6515054.873241]  [<ffffffff8109d4ac>] wb_kupdate+0xbc/0x140
[6515054.873244]  [<ffffffff8109e6a8>] pdflush+0x118/0x220
[6515054.873248]  [<ffffffff8109d3f0>] ? wb_kupdate+0x0/0x140
[6515054.873251]  [<ffffffff8109e590>] ? pdflush+0x0/0x220
[6515054.873255]  [<ffffffff8105b4c6>] kthread+0x56/0x90
[6515054.873259]  [<ffffffff8100cf6a>] child_rip+0xa/0x20
[6515054.873263]  [<ffffffff8100c969>] ? restore_args+0x0/0x30
[6515054.873266]  [<ffffffff8105b470>] ? kthread+0x0/0x90
[6515054.873270]  [<ffffffff8100cf60>] ? child_rip+0x0/0x20
[6515054.873275] INFO: task kjournald2:1334 blocked for more than 120 seconds.
[6515054.873277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6515054.873279] kjournald2    D ffff88021e731e28     0  1334      2
[6515054.873283]  ffff88021e731d10 0000000000000046 ffff88021e731ca0 ffffffff813a9b3b
[6515054.873288]  0000000000010e80 000000000000c748 ffff88021f9c5580 ffff88021f843900
[6515054.873292]  ffff88021f9c58f0 0000000100000282 00000002844c0480 0000000000000282
[6515054.873296] Call Trace:
[6515054.873300]  [<ffffffff813a9b3b>] ? __wait_on_bit+0x7b/0x90
[6515054.873303]  [<ffffffff813a9639>] schedule+0x9/0x20
[6515054.873308]  [<ffffffff81178ebe>] jbd2_journal_commit_transaction+0x1be/0x1810
[6515054.873312]  [<ffffffff81040cd6>] ? dequeue_task_fair+0x256/0x260
[6515054.873315]  [<ffffffff8103d91b>] ? finish_task_switch+0x5b/0xd0
[6515054.873320]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6515054.873324]  [<ffffffff8117ea8a>] kjournald2+0x11a/0x370
[6515054.873327]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6515054.873330]  [<ffffffff8117e970>] ? kjournald2+0x0/0x370
[6515054.873333]  [<ffffffff8117e970>] ? kjournald2+0x0/0x370
[6515054.873337]  [<ffffffff8105b4c6>] kthread+0x56/0x90
[6515054.873340]  [<ffffffff8100cf6a>] child_rip+0xa/0x20
[6515054.873344]  [<ffffffff8100c969>] ? restore_args+0x0/0x30
[6515054.873347]  [<ffffffff8105b470>] ? kthread+0x0/0x90
[6515054.873351]  [<ffffffff8100cf60>] ? child_rip+0x0/0x20
[6515054.873361] INFO: task drdbmake:3919 blocked for more than 120 seconds.
[6515054.873363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6515054.873365] drdbmake      D ffff88021e56f000     0  3919  27107
[6515054.873369]  ffff88021eb4da88 0000000000000082 ffff8801f4cf0ae8 ffff880008ba03f0
[6515054.873374]  0000000000010e80 000000000000c748 ffff8801049063c0 ffff88021f843900
[6515054.873378]  ffff880104906730 00000001f4cf0ae8 00000002844c0480 0000000000000282
[6515054.873382] Call Trace:
[6515054.873386]  [<ffffffff813a9639>] schedule+0x9/0x20
[6515054.873390]  [<ffffffff81177ea5>] start_this_handle+0x365/0x5d0
[6515054.873394]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6515054.873397]  [<ffffffff811781ce>] jbd2_journal_restart+0xbe/0x150
[6515054.873402]  [<ffffffff8116243d>] ext4_ext_truncate+0x6dd/0xa20
[6515054.873406]  [<ffffffff81095b3b>] ? find_get_pages+0x3b/0xf0
[6515054.873409]  [<ffffffff81150a78>] ext4_truncate+0x198/0x680
[6515054.873414]  [<ffffffff810ac984>] ? unmap_mapping_range+0x74/0x280
[6515054.873418]  [<ffffffff811772c0>] ? jbd2_journal_stop+0x1e0/0x360
[6515054.873422]  [<ffffffff810acd25>] vmtruncate+0xa5/0x110
[6515054.873426]  [<ffffffff810dda10>] inode_setattr+0x30/0x180
[6515054.873430]  [<ffffffff8114d073>] ext4_setattr+0x173/0x310
[6515054.873433]  [<ffffffff810ddc79>] notify_change+0x119/0x330
[6515054.873438]  [<ffffffff810c6df3>] do_truncate+0x63/0x90
[6515054.873441]  [<ffffffff810d0cc3>] ? get_write_access+0x23/0x60
[6515054.873445]  [<ffffffff810c70cb>] sys_truncate+0x17b/0x180
[6515054.873448]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b

[The same pdflush, kjournald2 and drdbmake reports repeat verbatim at
6515174 and 6515294, every 120 seconds; the 6515294 pass additionally
reports receptd blocked:]

[6515294.873418] INFO: task receptd:5709 blocked for more than 120 seconds.
[6515294.873423] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[6515294.873428] receptd       D 000000000000002a     0  5709   2046
[6515294.873436]  ffff88009cce3c78 0000000000000086 ffff88009cce3be8 ffffffff81152dd3
[6515294.873447]  0000000000010e80 000000000000c748 ffff880104901c80 ffffffff81526360
[6515294.873457]  ffff880104901ff0 00000000810db183 00000002844fcb53 0000000000000286
[6515294.873468] Call Trace:
[6515294.873473]  [<ffffffff81152dd3>] ? dx_release+0x23/0x50
[6515294.873480]  [<ffffffff813a9639>] schedule+0x9/0x20
[6515294.873488]  [<ffffffff81177ea5>] start_this_handle+0x365/0x5d0
[6515294.873495]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6515294.873502]  [<ffffffff81178306>] jbd2_journal_start+0xa6/0xf0
[6515294.873510]  [<ffffffff811593c5>] ext4_journal_start_sb+0x55/0x90
[6515294.873517]  [<ffffffff81155441>] ext4_create+0x81/0x130
[6515294.873524]  [<ffffffff810d23aa>] vfs_create+0x9a/0xb0
[6515294.873531]  [<ffffffff810d522e>] do_filp_open+0x85e/0x990
[6515294.873538]  [<ffffffff810d1d16>] ? getname+0x36/0x200
[6515294.873545]  [<ffffffff810deada>] ? alloc_fd+0x4a/0x140
[6515294.873552]  [<ffffffff810c5dfb>] do_sys_open+0x7b/0x110
[6515294.873558]  [<ffffffff810c5ebb>] sys_open+0x1b/0x20
[6515294.873565]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b
* Re: Fwd: Ext4 bug with fallocate
From: Eric Sandeen @ 2009-10-21 2:20 UTC
To: Fredrik Andersson; Cc: linux-ext4

Fredrik Andersson wrote:
> Hi, here is the data for this process:
> [...]

Including all of the processes in D state (everything reported by
sysrq-w) would probably be most helpful.

Feel free to file an ext4 bug on bugzilla.kernel.org w/ this
information, too, so it doesn't get lost in busy schedules ...

Thanks,
-Eric
* Re: Fwd: Ext4 bug with fallocate
From: Fredrik Andersson @ 2009-10-21 9:08 UTC
To: Eric Sandeen, linux-ext4

Alright, here is the complete sysrq-w data. I will file a bug report
too. My programs are drdbmake, nodeserv and receptd.

[6130704.521614] SysRq : Show Blocked State
[6130704.521696]   task                        PC stack   pid father
[6130704.521702] pdflush       D ffff8801f4ce9bdc     0   269      2
[6130704.521707]  ffff88021e551960 0000000000000046 ffff88021e5b56e0 ffff8800d3506a80
[6130704.521712]  0000000000010e80 000000000000c748 ffff88021fa24740 ffff88021f843900
[6130704.521717]  ffff88021fa24ab0 0000000100000008 000000025819ca1c 0080001000800013
[6130704.521721] Call Trace:
[6130704.521731]  [<ffffffff813a9639>] schedule+0x9/0x20
[6130704.521735]  [<ffffffff813ab2d5>] __down_read+0x85/0xb7
[6130704.521738]  [<ffffffff813aa565>] down_read+0x25/0x30
[6130704.521743]  [<ffffffff8114e6f2>] ext4_get_blocks_wrap+0x52/0x290
[6130704.521747]  [<ffffffff8114f601>] mpage_da_map_blocks+0xd1/0x7a0
[6130704.521752]  [<ffffffff8109d12c>] ? write_cache_pages+0x2fc/0x470
[6130704.521756]  [<ffffffff81150330>] ? __mpage_da_writepage+0x0/0x180
[6130704.521760]  [<ffffffff8114ffdb>] ext4_da_writepages+0x30b/0x590
[6130704.521764]  [<ffffffff8103d91b>] ? finish_task_switch+0x5b/0xd0
[6130704.521768]  [<ffffffff8109d2f8>] do_writepages+0x28/0x50
[6130704.521773]  [<ffffffff810e60fb>] __writeback_single_inode+0x9b/0x3e0
[6130704.521777]  [<ffffffff8104f830>] ? process_timeout+0x0/0x10
[6130704.521781]  [<ffffffff813a8827>] ? io_schedule_timeout+0x37/0x50
[6130704.521785]  [<ffffffff810e68ec>] generic_sync_sb_inodes+0x36c/0x440
[6130704.521789]  [<ffffffff810e6b8e>] writeback_inodes+0x5e/0x100
[6130704.521792]  [<ffffffff8109d4ac>] wb_kupdate+0xbc/0x140
[6130704.521796]  [<ffffffff8109e6a8>] pdflush+0x118/0x220
[6130704.521799]  [<ffffffff8109d3f0>] ? wb_kupdate+0x0/0x140
[6130704.521802]  [<ffffffff8109e590>] ? pdflush+0x0/0x220
[6130704.521806]  [<ffffffff8105b4c6>] kthread+0x56/0x90
[6130704.521811]  [<ffffffff8100cf6a>] child_rip+0xa/0x20
[6130704.521814]  [<ffffffff8100c969>] ? restore_args+0x0/0x30
[6130704.521818]  [<ffffffff8105b470>] ? kthread+0x0/0x90
[6130704.521821]  [<ffffffff8100cf60>] ? child_rip+0x0/0x20
[6130704.521826] kjournald2    D ffff88021f921e28     0  1334      2
[6130704.521830]  ffff88021f921d10 0000000000000046 ffff88021f48c890 0000000000000000
[6130704.521835]  0000000000010e80 000000000000c748 ffff88021e5d8000 ffff88021f844740
[6130704.521839]  ffff88021e5d8370 0000000100000282 0000000000000000 0000000000000282
[6130704.521844] Call Trace:
[6130704.521847]  [<ffffffff813a9639>] schedule+0x9/0x20
[6130704.521853]  [<ffffffff81178ebe>] jbd2_journal_commit_transaction+0x1be/0x1810
[6130704.521856]  [<ffffffff81040cd6>] ? dequeue_task_fair+0x256/0x260
[6130704.521860]  [<ffffffff8103d91b>] ? finish_task_switch+0x5b/0xd0
[6130704.521864]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6130704.521868]  [<ffffffff8117ea8a>] kjournald2+0x11a/0x370
[6130704.521872]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6130704.521875]  [<ffffffff8117e970>] ? kjournald2+0x0/0x370
[6130704.521878]  [<ffffffff8117e970>] ? kjournald2+0x0/0x370
[6130704.521881]  [<ffffffff8105b4c6>] kthread+0x56/0x90
[6130704.521885]  [<ffffffff8100cf6a>] child_rip+0xa/0x20
[6130704.521888]  [<ffffffff8100c969>] ? restore_args+0x0/0x30
[6130704.521892]  [<ffffffff8105b470>] ? kthread+0x0/0x90
[6130704.521895]  [<ffffffff8100cf60>] ? child_rip+0x0/0x20
[6130704.521901] receptd       D 000000000000002a     0 28746   2063
[6130704.521905]  ffff8801d1b39c78 0000000000000082 ffff8801d1b39be8 ffffffff81152dd3
[6130704.521910]  0000000000010e80 000000000000c748 ffff88010014ce60 ffffffff81526360
[6130704.521914]  ffff88010014d1d0 00000000810db183 000000025821c45c 0000000000000286
[6130704.521919] Call Trace:
[6130704.521922]  [<ffffffff81152dd3>] ? dx_release+0x23/0x50
[6130704.521926]  [<ffffffff813a9639>] schedule+0x9/0x20
[6130704.521930]  [<ffffffff81177ea5>] start_this_handle+0x365/0x5d0
[6130704.521934]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6130704.521937]  [<ffffffff81178306>] jbd2_journal_start+0xa6/0xf0
[6130704.521942]  [<ffffffff811593c5>] ext4_journal_start_sb+0x55/0x90
[6130704.521945]  [<ffffffff81155441>] ext4_create+0x81/0x130
[6130704.521949]  [<ffffffff810d23aa>] vfs_create+0x9a/0xb0
[6130704.521952]  [<ffffffff810d522e>] do_filp_open+0x85e/0x990
[6130704.521956]  [<ffffffff810d1d16>] ? getname+0x36/0x200
[6130704.521960]  [<ffffffff810deada>] ? alloc_fd+0x4a/0x140
[6130704.521964]  [<ffffffff810c5dfb>] do_sys_open+0x7b/0x110
[6130704.521967]  [<ffffffff810c5ebb>] sys_open+0x1b/0x20
[6130704.521971]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b

[receptd 10643 and receptd 3658 are blocked with this same
jbd2_journal_start/ext4_create call trace.]

[6130704.522112] receptd       D ffff8801f4d04ba4     0  4716   2063
[6130704.522116]  ffff880001f43e58 0000000000000082 ffffffff81095e00 ffff880001f43de8
[6130704.522120]  0000000000010e80 000000000000c748 ffff880100149c80 ffffffff81526360
[6130704.522125]  ffff880100149ff0 0000000000060d51 000000025fd791a8 0000000000060d51
[6130704.522128] Call Trace:
[6130704.522128]  [<ffffffff81095e00>] ? sync_page+0x0/0x60
[6130704.522128]  [<ffffffff810b9c48>] ? delete_from_swap_cache+0x48/0x60
[6130704.522128]  [<ffffffff810ba70b>] ? swap_info_get+0x6b/0xf0
[6130704.522128]  [<ffffffff813a9d7b>] __mutex_lock_killable_slowpath+0xeb/0x1c0
[6130704.522128]  [<ffffffff810d6c90>] ? filldir+0x0/0xe0
[6130704.522128]  [<ffffffff813a9ea8>] mutex_lock_killable+0x58/0x70
[6130704.522128]  [<ffffffff810d6ed2>] vfs_readdir+0x72/0xc0
[6130704.522128]  [<ffffffff810d707b>] sys_getdents+0x8b/0xf0
[6130704.522128]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b

[receptd 18596, 569, 15225, 29613 and 4444 are blocked with the same
mutex_lock_killable/vfs_readdir call trace.]

[6130704.522128] drdbmake      D ffff88021e4c7800     0 27019  13796
[6130704.522128]  ffff8801d1bcda88 0000000000000082 ffff8801f4ce9bf0 ffff8801678b1380
[6130704.522128]  0000000000010e80 000000000000c748 ffff8800404963c0 ffffffff81526360
[6130704.522128]  ffff880040496730 00000000f4ce9bf0 000000025819cebe 0000000000000282
[6130704.522128] Call Trace:
[6130704.522128]  [<ffffffff813a9639>] schedule+0x9/0x20
[6130704.522128]  [<ffffffff81177ea5>] start_this_handle+0x365/0x5d0
[6130704.522128]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6130704.522128]  [<ffffffff811781ce>] jbd2_journal_restart+0xbe/0x150
[6130704.522128]  [<ffffffff8116243d>] ext4_ext_truncate+0x6dd/0xa20
[6130704.522128]  [<ffffffff81095b3b>] ? find_get_pages+0x3b/0xf0
[6130704.522128]  [<ffffffff81150a78>] ext4_truncate+0x198/0x680
[6130704.522128]  [<ffffffff810ac984>] ? unmap_mapping_range+0x74/0x280
[6130704.522128]  [<ffffffff811772c0>] ? jbd2_journal_stop+0x1e0/0x360
[6130704.522128]  [<ffffffff810acd25>] vmtruncate+0xa5/0x110
[6130704.522128]  [<ffffffff810dda10>] inode_setattr+0x30/0x180
[6130704.522128]  [<ffffffff8114d073>] ext4_setattr+0x173/0x310
[6130704.522128]  [<ffffffff810ddc79>] notify_change+0x119/0x330
[6130704.522128]  [<ffffffff810c6df3>] do_truncate+0x63/0x90
[6130704.522128]  [<ffffffff810d0cc3>] ? get_write_access+0x23/0x60
[6130704.522128]  [<ffffffff810c70cb>] sys_truncate+0x17b/0x180
[6130704.522128]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b
[6130704.522128] nodeserv      D ffff88021eaedca8     0 17888  13796
[6130704.522128]  ffff88021eaedbf8 0000000000000082 ffff8800280261e8 ffff88021e5b56e0
[6130704.522128]  0000000000010e80 000000000000c748 ffff880100148e40 ffffffff81526360
[6130704.522128]  ffff8801001491b0 000000008119bd17 000000026d668c1b 0000000000000002
[6130704.522128] Call Trace:
[6130704.522128]  [<ffffffffa0039291>] ? raid0_unplug+0x51/0x70 [raid0]
[6130704.522128]  [<ffffffff813a9639>] schedule+0x9/0x20
[6130704.522128]  [<ffffffff813a9687>] io_schedule+0x37/0x50
[6130704.522128]  [<ffffffff81095e35>] sync_page+0x35/0x60
[6130704.522128]  [<ffffffff81095e69>] sync_page_killable+0x9/0x50
[6130704.522128]  [<ffffffff813a99d2>] __wait_on_bit_lock+0x52/0xb0
[6130704.522128]  [<ffffffff81095e60>] ? sync_page_killable+0x0/0x50
[6130704.522128]  [<ffffffff81095d74>] __lock_page_killable+0x64/0x70
[6130704.522128]  [<ffffffff8105b940>] ? wake_bit_function+0x0/0x40
[6130704.522128]  [<ffffffff81095c0b>] ? find_get_page+0x1b/0xb0
[6130704.522128]  [<ffffffff81097908>] generic_file_aio_read+0x3b8/0x6b0
[6130704.522128]  [<ffffffff810c7dc1>] do_sync_read+0xf1/0x140
[6130704.522128]  [<ffffffff8106a5e8>] ? do_futex+0xb8/0xb20
[6130704.522128]  [<ffffffff813ab78f>] ? _spin_unlock_irqrestore+0x2f/0x40
[6130704.522128]  [<ffffffff8105b900>] ? autoremove_wake_function+0x0/0x40
[6130704.522128]  [<ffffffff8105bc73>] ? add_wait_queue+0x43/0x60
[6130704.522128]  [<ffffffff81062a6c>] ? getnstimeofday+0x5c/0xf0
[6130704.522128]  [<ffffffff8105f049>] ? ktime_get_ts+0x59/0x60
[6130704.522128]  [<ffffffff810c85b8>] vfs_read+0xc8/0x170
[6130704.522128]  [<ffffffff810c86fa>] sys_pread64+0x9a/0xa0
[6130704.522128]  [<ffffffff8100bfab>] system_call_fastpath+0x16/0x1b
[6130704.522128] Sched Debug Version: v0.09, 2.6.30.4 #1
[6130704.522128] now at 6130704522.576956 msecs
[6130704.522128]   .jiffies                         : 10425371817
[6130704.522128]   .sysctl_sched_latency            : 40.000000
[6130704.522128]   .sysctl_sched_min_granularity    : 8.000000
[6130704.522128]   .sysctl_sched_wakeup_granularity : 10.000000
[6130704.522128]   .sysctl_sched_child_runs_first   : 0.000001
[6130704.522128]   .sysctl_sched_features           : 113917
[6130704.522128]
[6130704.522128] cpu#0, 1866.760 MHz
[6130704.522128]   .nr_running          : 0
[6130704.522128]   .load                : 0
[6130704.522128]   .nr_switches         : 1088197559
[6130704.522128]   .nr_load_updates     : 1494088224
[6130704.522128]   .nr_uninterruptible  : -10143689
[6130704.522128]   .next_balance        : 10425.371927
[6130704.522128]   .curr->pid           : 0
[6130704.522128]   .clock               : 6130704522.001795
[6130704.522128]   .cpu_load[0]         : 0
[6130704.522128]   .cpu_load[1]         : 732
[6130704.522128]   .cpu_load[2]         : 1307
[6130704.522128]   .cpu_load[3]         : 1628
[6130704.522128]   .cpu_load[4]         : 2055
[6130704.522128]   .yld_count           : 62066
[6130704.522128]   .sched_switch        : 0
[6130704.522128]   .sched_count         : 1162815210
[6130704.522128]   .sched_goidle        : 486300434
[6130704.522128]   .ttwu_count          : 639575934
[6130704.522128]   .ttwu_local          : 567060963
[6130704.522128]   .bkl_count           : 643
[6130704.522128]
[6130704.522128] cfs_rq[0]:
[6130704.522128]   .exec_clock          : 1105652808.029192
[6130704.522128]   .MIN_vruntime        : 0.000001
[6130704.522128]   .min_vruntime        : 7063664391.526226
[6130704.522128]   .max_vruntime        : 0.000001
[6130704.522128]   .spread              : 0.000000
[6130704.522128]   .spread0             : 0.000000
[6130704.522128]   .nr_running          : 0
[6130704.522128]   .load                : 0
[6130704.522128]   .nr_spread_over      : 8978
[6130704.522128]
[6130704.522128] rt_rq[0]:
[6130704.522128]   .rt_nr_running       : 0
[6130704.522128]   .rt_throttled        : 0
[6130704.522128]   .rt_time             : 0.000000
[6130704.522128]   .rt_runtime          : 950.000000
[6130704.522128]
[6130704.522128] runnable tasks:
[6130704.522128]             task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
[6130704.522128] ----------------------------------------------------------------------------------------------------------
[6130704.522128]
[6130704.522128] cpu#1, 1866.760 MHz
[6130704.522128]   .nr_running          : 1
[6130704.522128]   .load                : 1024
[6130704.522128]   .nr_switches         : 818082742
[6130704.522128]   .nr_load_updates     : 1586872071
[6130704.522128]   .nr_uninterruptible  : 10143702
[6130704.522128]   .next_balance        : 10425.371857
[6130704.522128]   .curr->pid           : 11137
[6130704.522128]   .clock               : 6130704521.459918
[6130704.522128]   .cpu_load[0]         : 1024
[6130704.522128]   .cpu_load[1]         : 514
[6130704.522128]   .cpu_load[2]         : 318
[6130704.522128]   .cpu_load[3]         : 299
[6130704.522128]   .cpu_load[4]         : 261
[6130704.522128]   .yld_count           : 283011
[6130704.522128]   .sched_switch        : 0
[6130704.522128]   .sched_count         : 947252346
[6130704.522128]   .sched_goidle        : 479113257
[6130704.522128]   .ttwu_count          : 422473135
[6130704.522128]   .ttwu_local          : 316765436
[6130704.522128]   .bkl_count           : 10
[6130704.522128]
[6130704.522128] cfs_rq[1]:
[6130704.522128]   .exec_clock          : 478641403.088793
[6130704.522128]   .MIN_vruntime        : 0.000001
[6130704.522128]   .min_vruntime        : 8051549362.514115
[6130704.522128]   .max_vruntime        : 0.000001
[6130704.522128]   .spread              : 0.000000
[6130704.522128]   .spread0             : 987884970.987889
[6130704.522128]   .nr_running          : 1
[6130704.522128]   .load                : 1024
[6130704.522128]   .nr_spread_over      : 11703
[6130704.522128]
[6130704.522128] rt_rq[1]:
[6130704.522128]   .rt_nr_running       : 0
[6130704.522128]   .rt_throttled        : 0
[6130704.522128]   .rt_time             : 0.000000
[6130704.522128]   .rt_runtime          : 950.000000
[6130704.522128]
[6130704.522128] runnable tasks:
[6130704.522128]             task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
[6130704.522128] ----------------------------------------------------------------------------------------------------------
[6130704.522128] R           bash 11137  8051549322.514115       272   120  8051549322.514115        35.937180 689280132.211960

Thanks,
Fredrik

On Wed, Oct 21, 2009 at 4:20 AM, Eric Sandeen <sandeen@redhat.com> wrote:
> Fredrik Andersson wrote:
>> Hi, here is the data for this process:
>> [...]
>
> Including all of the processes in D state (everything reported by
> sysrq-w) would probably be most helpful.
>
> Feel free to file an ext4 bug on bugzilla.kernel.org w/ this
> information, too, so it doesn't get lost in busy schedules ...
>
> Thanks,
> -Eric
The process can thus not be killed even with >>>> kill -9, and a reboot is all that will do. >>> >>> does echo w > /proc/sysrq-trigger (this does sleeping processes; or use >>> echo t for all processes) show you where the stuck threads are? >>> >>> -Eric >>> > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Fwd: Ext4 bug with fallocate
  2009-10-18 15:47 ` Fwd: Ext4 bug with f Fredrik Andersson
  2009-10-18 15:57 ` Fwd: Ext4 bug with fallocate Eric Sandeen
@ 2009-10-27  4:42 ` Eric Sandeen
  2009-10-27  8:17   ` Fredrik Andersson
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2009-10-27 4:42 UTC (permalink / raw)
To: Fredrik Andersson; +Cc: linux-ext4

Fredrik Andersson wrote:
> Hi, I'd like to report what I'm fairly certain is an ext4 bug. I hope
> this is the right place to do so.
>
> My program creates a big file (around 30 GB) with posix_fallocate (to
> utilize extents), fills it with data and uses ftruncate to crop it to
> its final size (usually somewhere between 20 and 25 GB).
> The problem is that in around 5% of the cases, the program locks up
> completely in a syscall. The process can thus not be killed even with
> kill -9, and a reboot is all that will do.

Not trying to ignore you, but I'm trying to sort out some corruption
issues that I'm afraid I have to rank higher than a hang for now...

From the traces, things are hung up on the i_data_sem; several pdflushes
want it for read in ext4_get_blocks_wrap, and the drdbmake processes
presumably have it held for write in ext4_ext_truncate. I'm not
immediately seeing the deadlock tonight, but then it's getting late...

Is the application that's triggering this something you can share?

Thanks,
-Eric

^ permalink raw reply [flat|nested] 14+ messages in thread
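The situation Eric describes -- a truncate holding the inode's i_data_sem for
write while writeback threads queue behind it for read -- is the classic
writer-blocks-readers shape of an rw_semaphore. A userspace pthreads analogy
may help readers picture it; this is purely illustrative and simplified, the
real ext4/jbd2 interaction is more involved (compile with cc -pthread):

/* Illustrative analogy only: one thread holds the lock for write (like
 * the truncate path holding i_data_sem) and stalls on something else
 * (here, sleep() stands in for waiting on journal credits), while a
 * reader (like pdflush in ext4_get_blocks_wrap) queues behind it. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t i_data_sem = PTHREAD_RWLOCK_INITIALIZER;

static void *truncator(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&i_data_sem);   /* writer takes the lock    */
	sleep(5);                             /* blocked on something else */
	pthread_rwlock_unlock(&i_data_sem);
	return NULL;
}

static void *flusher(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&i_data_sem);   /* reader must wait here    */
	puts("flusher finally got the lock");
	pthread_rwlock_unlock(&i_data_sem);
	return NULL;
}

int main(void)
{
	pthread_t wr, rd;
	pthread_create(&wr, NULL, truncator, NULL);
	sleep(1);                             /* let the writer win the race */
	pthread_create(&rd, NULL, flusher, NULL);
	pthread_join(wr, NULL);
	pthread_join(rd, NULL);
	return 0;
}

In the kernel case the writer never wakes up, because the journal space it
waits for can only be freed once the blocked readers make progress.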
* Re: Fwd: Ext4 bug with fallocate
  2009-10-27  4:42 ` Eric Sandeen
@ 2009-10-27  8:17   ` Fredrik Andersson
  2009-10-27 13:56     ` Eric Sandeen
  0 siblings, 1 reply; 14+ messages in thread
From: Fredrik Andersson @ 2009-10-27 8:17 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-ext4

> Not trying to ignore you, but I'm trying to sort out some corruption
> issues that I'm afraid I have to rank higher than a hang for now...

I understand. Good to know someone is thinking about it anyway :-)

> From the traces, things are hung up on the i_data_sem; several pdflushes
> want it for read in ext4_get_blocks_wrap, and the drdbmake processes
> presumably have it held for write in ext4_ext_truncate. I'm not
> immediately seeing the deadlock tonight, but then it's getting late...
>
> Is the application that's triggering this something you can share?

Sorry, I can't provide source code for these applications. But drdbmake
does what is described above: it preallocates a very big chunk on disk
and then truncates the file to its final length once it's done.

Please let me know if you need more details.

/Fredrik
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Fwd: Ext4 bug with fallocate
  2009-10-27  8:17 ` Fredrik Andersson
@ 2009-10-27 13:56   ` Eric Sandeen
  2009-10-27 15:29     ` Fredrik Andersson
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Sandeen @ 2009-10-27 13:56 UTC (permalink / raw)
To: Fredrik Andersson; +Cc: linux-ext4

Fredrik Andersson wrote:
>> Not trying to ignore you, but I'm trying to sort out some corruption
>> issues that I'm afraid I have to rank higher than a hang for now...
>
> I understand. Good to know someone is thinking about it anyway :-)
>
>> From the traces, things are hung up on the i_data_sem; several pdflushes
>> want it for read in ext4_get_blocks_wrap, and the drdbmake processes
>> presumably have it held for write in ext4_ext_truncate. I'm not
>> immediately seeing the deadlock tonight, but then it's getting late...
>>
>> Is the application that's triggering this something you can share?
>
> Sorry, I can't provide source code for these applications.

Ok, I understand.

> But drdbmake does what is described above: it preallocates a very big
> chunk on disk and then truncates the file to its final length once
> it's done.

To try to emulate it: how does the program write into the preallocated
space; large or small IOs? Sequential streaming? mmap writes? It may not
be relevant, but it would be nice to match the workload as closely as
possible.

Thanks,
-Eric

> Please let me know if you need more details.
>
> /Fredrik

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Fwd: Ext4 bug with fallocate
  2009-10-27 13:56 ` Eric Sandeen
@ 2009-10-27 15:29   ` Fredrik Andersson
  2009-10-27 15:37     ` Eric Sandeen
  0 siblings, 1 reply; 14+ messages in thread
From: Fredrik Andersson @ 2009-10-27 15:29 UTC (permalink / raw)
To: Eric Sandeen, linux-ext4

> To try to emulate it: how does the program write into the preallocated
> space; large or small IOs? Sequential streaming? mmap writes? It may not
> be relevant, but it would be nice to match the workload as closely as
> possible.

The big file is written sequentially by the drdbmake process using stdio
buffered I/O (with a setvbuf buffer of about 4K). No mmap. The file is
regenerated from an earlier version of the same file: we preallocate a
new file that is 25% bigger than the previous version, both to allow for
more data than the previous file held and to take advantage of extents
in ext4. We then read the previous file sequentially, update some entries
here and there, and rewrite it sequentially into the new, fallocated file.

There is a single instance of random I/O: once the whole new file has
been written, we seek back to the start to write a fixed-size header. We
then ftruncate the file to its proper size.

No process reads concurrently from the file that is being written. There
is, however, another process, nodeserv, that does random reads from the
"previous" file (the one drdbmake is sequentially reading).

The deadlock is always in the final ftruncate. It does not help to close
the file and reopen it before the ftruncate call.

/Fredrik
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 14+ messages in thread
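The sequence Fredrik describes condenses into a small reproducer. A minimal
sketch follows; the file name, data pattern, and 4 GB size are placeholders
rather than values from drdbmake, which works on ~30 GB files:

/*
 * Minimal sketch of the reported workload, for reproduction attempts
 * only.  Sizes and names are illustrative.  Build with
 * -D_FILE_OFFSET_BITS=64 on 32-bit systems.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const off_t final_size = 4LL << 30;          /* ~4 GB target      */
	const off_t prealloc   = final_size * 5 / 4; /* 25% overshoot     */
	char header[512] = { 0 };                    /* fixed-size header */
	char buf[4096];
	off_t written;
	FILE *f;

	f = fopen("bigfile.new", "w+");
	if (!f)
		return 1;
	setvbuf(f, NULL, _IOFBF, 4096);              /* ~4K stdio buffer  */

	/* preallocate 25% more than we expect to need */
	if (posix_fallocate(fileno(f), 0, prealloc))
		return 1;

	/* placeholder header, then stream the data sequentially */
	fwrite(header, 1, sizeof(header), f);
	memset(buf, 'x', sizeof(buf));
	for (written = sizeof(header); written < final_size;
	     written += sizeof(buf))
		fwrite(buf, 1, sizeof(buf), f);

	/* the one random write: seek back and fill in the real header */
	fseeko(f, 0, SEEK_SET);
	fwrite(header, 1, sizeof(header), f);
	fflush(f);

	/* crop to the final size -- this is where drdbmake hangs */
	if (ftruncate(fileno(f), written))
		return 1;
	return fclose(f);
}

Running something like this in a loop on a raid0-backed ext4 mount, with a
second process doing random reads from another large file, approximates the
setup described in the thread.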
* Re: Fwd: Ext4 bug with fallocate
  2009-10-27 15:29 ` Fredrik Andersson
@ 2009-10-27 15:37   ` Eric Sandeen
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Sandeen @ 2009-10-27 15:37 UTC (permalink / raw)
To: Fredrik Andersson; +Cc: linux-ext4

Fredrik Andersson wrote:
>> To try to emulate it: how does the program write into the preallocated
>> space; large or small IOs? Sequential streaming? mmap writes?
>
> The big file is written sequentially by the drdbmake process using stdio
> buffered I/O (with a setvbuf buffer of about 4K). No mmap. The file is
> regenerated from an earlier version of the same file: we preallocate a
> new file that is 25% bigger than the previous version, both to allow for
> more data than the previous file held and to take advantage of extents
> in ext4.

FWIW, you do not need to preallocate to get extents. Preallocation
fundamentally only guarantees that the space is available (somewhere),
though in practice it can lead to more contiguous allocation of that
space, since it's all done up front ...

> We then read the previous file sequentially, update some entries here
> and there, and rewrite it sequentially into the new, fallocated file.
> There is a single instance of random I/O: once the whole new file has
> been written, we seek back to the start to write a fixed-size header. We
> then ftruncate the file to its proper size.
> No process reads concurrently from the file that is being written. There
> is, however, another process, nodeserv, that does random reads from the
> "previous" file (the one drdbmake is sequentially reading).
> The deadlock is always in the final ftruncate. It does not help to close
> the file and reopen it before the ftruncate call.

Thanks. If I find time to think hard enough about the backtraces you
sent, it'll probably be obvious, but the complete description of your
workload is helpful.

Just out of curiosity, have you verified that the deadlock doesn't occur
if you skip the preallocation? I wonder about a fake test where you
simply write a bit extra and truncate that back.

-Eric

> /Fredrik

^ permalink raw reply [flat|nested] 14+ messages in thread
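The control test Eric suggests needs only a small change to the earlier
reproducer sketch: skip posix_fallocate entirely, write a bit past the final
size, and crop back. A sketch under the same assumptions -- the overshoot
amount here is arbitrary, not taken from the thread:

/*
 * Sketch of the suggested control test: no preallocation at all, just
 * write somewhat past the final size and truncate back.  If the hang
 * still reproduces, fallocate itself is not the trigger.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_and_crop(const char *path, off_t final_size, off_t extra)
{
	char buf[4096];
	off_t written = 0;
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	memset(buf, 'x', sizeof(buf));
	while (written < final_size + extra) {      /* overshoot the target */
		if (fwrite(buf, 1, sizeof(buf), f) != sizeof(buf)) {
			fclose(f);
			return -1;
		}
		written += sizeof(buf);
	}
	if (fflush(f) || ftruncate(fileno(f), final_size)) {
		fclose(f);
		return -1;
	}
	return fclose(f);                           /* crop done, close up  */
}

int main(void)
{
	/* 1 GB final size with a 256 MB overshoot, both arbitrary */
	return write_and_crop("bigfile.test", 1LL << 30, 256LL << 20) ? 1 : 0;
}

Comparing this against the fallocate version isolates whether the hang
depends on truncating away preallocated-but-unwritten extents or on
truncate under this workload in general.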
end of thread, other threads:[~2009-10-27 15:37 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <a125a870910180836i66781e1ey8113e7f8569f1fad@mail.gmail.com>
2009-10-18 15:47 ` Fwd: Ext4 bug with f Fredrik Andersson
2009-10-18 15:57 ` Fwd: Ext4 bug with fallocate Eric Sandeen
2009-10-19 9:49 ` Fredrik Andersson
2009-10-20 16:49 ` Fredrik Andersson
2009-10-21 0:24 ` Mingming
2009-10-22 7:37 ` Fredrik Andersson
2009-10-26 10:43 ` Fredrik Andersson
2009-10-21 2:20 ` Eric Sandeen
2009-10-21 9:08 ` Fredrik Andersson
2009-10-27 4:42 ` Eric Sandeen
2009-10-27 8:17 ` Fredrik Andersson
2009-10-27 13:56 ` Eric Sandeen
2009-10-27 15:29 ` Fredrik Andersson
2009-10-27 15:37 ` Eric Sandeen