* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Nikolay Borisov @ 2015-10-09  8:03 UTC
To: Hillf Danton
Cc: 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, vbabka, gilad, mgorman, linux-mm, Marian Marinov

On 10/09/2015 10:37 AM, Hillf Danton wrote:
>>>> @@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
>>>>  			if (bio->bi_error)
>>>>  				buffer_io_error(bh);
>>>>  		} while ((bh = bh->b_this_page) != head);
>>>> -	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>>>>  	local_irq_restore(flags);
>>>
>>> What if it takes 100ms to unlock after IRQ restored?
>>
>> I'm not sure I understand in what direction you are going? Care to
>> elaborate?
>>
> Your change introduces an extra time cost the lock waiter has to pay in
> the case that an irq happens before the lock is released.

[CC filesystem and mm people. For reference the thread starts here:
http://thread.gmane.org/gmane.linux.kernel/2056996 ]

Right, I see what you mean and it's a good point, but when doing the
patches I was striving for correctness and starting a discussion, hence
the RFC. In any case I'd personally always choose correctness over
performance ;).

As I'm not an fs/ext4 expert, I have added the relevant parties (please
use reply-all from now on so that the thread is not cut in the middle),
who will be able to say whether the impact is going to be that big. I
guess worrying about this in this particular code path is prudent, as
writeback sounds like a heavily used path.

Maybe the problem should be approached from a different angle, e.g.
drain_all_pages and its reliance on the fact that the IPI will always be
delivered in some finite amount of time? But what if a CPU with disabled
interrupts is waiting on the task issuing the IPI?

>>>> +	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
>>>>  	if (!under_io) {
>>>>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
>>>>  		if (ctx)
>>>> --
>>>> 2.5.0
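For reference, a minimal sketch of the two orderings under discussion
(following the ext4_finish_bio() code quoted above; the loop walking the
page's buffer ring is elided):

	/* Mainline ordering: drop the bit lock, then re-enable IRQs. */
	local_irq_save(flags);
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
	/* ... walk the buffer ring of the page ... */
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
	local_irq_restore(flags);

	/* RFC ordering: re-enable IRQs first, then drop the bit lock. */
	local_irq_save(flags);
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
	/* ... walk the buffer ring of the page ... */
	local_irq_restore(flags);
	/*
	 * An interrupt arriving here is serviced while BH_Uptodate_Lock
	 * is still held, so another CPU spinning in bit_spin_lock() on
	 * this b_state word keeps spinning for the whole duration of the
	 * handler -- the extra waiter latency Hillf is pointing out.
	 */
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);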
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Jan Kara @ 2015-10-12 13:40 UTC
To: Nikolay Borisov
Cc: Hillf Danton, 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, vbabka, gilad, mgorman, linux-mm, Marian Marinov

On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
> On 10/09/2015 10:37 AM, Hillf Danton wrote:
> [...]
> > Your change introduces an extra time cost the lock waiter has to pay in
> > the case that an irq happens before the lock is released.
>
> [CC filesystem and mm people. For reference the thread starts here:
> http://thread.gmane.org/gmane.linux.kernel/2056996 ]
>
> Right, I see what you mean and it's a good point, but when doing the
> patches I was striving for correctness and starting a discussion, hence
> the RFC. In any case I'd personally always choose correctness over
> performance ;).
>
> [...]
>
> Maybe the problem should be approached from a different angle, e.g.
> drain_all_pages and its reliance on the fact that the IPI will always be
> delivered in some finite amount of time? But what if a CPU with disabled
> interrupts is waiting on the task issuing the IPI?

So I have looked through your patch and also the original report (the
thread starts here: https://lkml.org/lkml/2015/10/8/341) and IMHO one
question hasn't been properly answered yet: Who is holding
BH_Uptodate_Lock we are spinning on? You have suggested in
https://lkml.org/lkml/2015/10/8/464 that it was the
__block_write_full_page_endio() call, but that cannot really be the case.
BH_Uptodate_Lock is used only in IO completion handlers -
end_buffer_async_read, end_buffer_async_write, ext4_finish_bio. So there
really should be some end_io function running on some other CPU which
holds BH_Uptodate_Lock for that buffer.

BTW: I suppose the filesystem uses a 4k blocksize, doesn't it?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Nikolay Borisov @ 2015-10-12 14:51 UTC
To: Jan Kara
Cc: Hillf Danton, 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, vbabka, gilad, mgorman, linux-mm, Marian Marinov

Hello and thanks for the reply,

On 10/12/2015 04:40 PM, Jan Kara wrote:
> On Fri 09-10-15 11:03:30, Nikolay Borisov wrote:
> [...]
>
> So I have looked through your patch and also the original report (the
> thread starts here: https://lkml.org/lkml/2015/10/8/341) and IMHO one
> question hasn't been properly answered yet: Who is holding
> BH_Uptodate_Lock we are spinning on? You have suggested in
> https://lkml.org/lkml/2015/10/8/464 that it was the
> __block_write_full_page_endio() call, but that cannot really be the case.
> BH_Uptodate_Lock is used only in IO completion handlers -
> end_buffer_async_read, end_buffer_async_write, ext4_finish_bio. So there
> really should be some end_io function running on some other CPU which
> holds BH_Uptodate_Lock for that buffer.

I did check all the call traces of the current processes on the machine
at the time of the hard lockup, and none of the 3 functions you mentioned
were in any of the call chains. But while I was looking at the code of
end_buffer_async_write I saw it mentioned in the comments that those
completion handlers were called from __block_write_full_page_endio, so
that's what pointed my attention to that function. But you are right
that it doesn't take the BH lock.

Furthermore, the fact that the BH_Async_Write flag is set points me in
the direction that end_buffer_async_write should have been executing,
but as I said, issuing "bt" for all the tasks didn't show this function.
I'm beginning to wonder if it's possible that a single-bit memory error
has crept up, but this still seems like a long shot...

Btw I think in any case the spin_lock patch is wrong, as this code can
be called from within softirq context and we do want to be interrupt
safe at that point.

> BTW: I suppose the filesystem uses a 4k blocksize, doesn't it?

Unfortunately I cannot tell you with 100% certainty, since on this
server there are multiple block devices with blocksize either 1k or 4k.
So it is one of these. If you know a way to extract this information
from a vmcore file I'd be happy to do it.
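To make the softirq point concrete, here is a sketch of the classic
self-deadlock that interrupt-unsafe locking invites in this path (the
call site is hypothetical, but the completion path really does run from
the block softirq, as the traces later in the thread show):

	/* Process context, taking the lock WITHOUT disabling IRQs: */
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
	/*
	 * If an interrupt fires here and the block softirq runs on this
	 * same CPU, it can call ext4_finish_bio(), which does
	 * bit_spin_lock() on the very same b_state word.  The softirq
	 * then spins forever and the interrupted lock holder below it
	 * never runs again: a single-CPU self-deadlock.
	 */
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);

This is why the completion handlers pair the bit lock with
local_irq_save()/local_irq_restore().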
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Jan Kara @ 2015-10-13  8:15 UTC
To: Nikolay Borisov
Cc: Jan Kara, Hillf Danton, 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, vbabka, gilad, mgorman, linux-mm, Marian Marinov

On Mon 12-10-15 17:51:07, Nikolay Borisov wrote:
> [...]
>
> I did check all the call traces of the current processes on the machine
> at the time of the hard lockup, and none of the 3 functions you mentioned
> were in any of the call chains. But while I was looking at the code of
> end_buffer_async_write I saw it mentioned in the comments that those
> completion handlers were called from __block_write_full_page_endio, so
> that's what pointed my attention to that function. But you are right
> that it doesn't take the BH lock.
>
> Furthermore, the fact that the BH_Async_Write flag is set points me in
> the direction that end_buffer_async_write should have been executing,
> but as I said, issuing "bt" for all the tasks didn't show this function.

Actually ext4_bio_write_page() also sets BH_Async_Write, so that seems
like a more likely place where that flag got set, since
ext4_finish_bio() was then handling IO completion.

> I'm beginning to wonder if it's possible that a single-bit memory error
> has crept up, but this still seems like a long shot...

Yup. Possible but a long shot. Is the problem reproducible in any way?

> Btw I think in any case the spin_lock patch is wrong, as this code can
> be called from within softirq context and we do want to be interrupt
> safe at that point.

Agreed, that patch is definitely wrong.

> > BTW: I suppose the filesystem uses a 4k blocksize, doesn't it?
>
> Unfortunately I cannot tell you with 100% certainty, since on this
> server there are multiple block devices with blocksize either 1k or 4k.
> So it is one of these. If you know a way to extract this information
> from a vmcore file I'd be happy to do it.

Well, if you have a crashdump, then bh->b_size is the block size. So just
check that for the bh we are spinning on.

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
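(As an aside, with a vmcore loaded into the crash(8) utility, something
along the lines of

	crash> struct buffer_head.b_size ffff8800xxxxxxxx

should print just that member; the address here is a placeholder for the
bh being spun on.)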
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Nikolay Borisov @ 2015-10-13 10:37 UTC
To: Jan Kara
Cc: Hillf Danton, 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, vbabka, gilad, mgorman, linux-mm, Marian Marinov

On 10/13/2015 11:15 AM, Jan Kara wrote:
> On Mon 12-10-15 17:51:07, Nikolay Borisov wrote:
> [...]
>
>> I'm beginning to wonder if it's possible that a single-bit memory error
>> has crept up, but this still seems like a long shot...
>
> Yup. Possible but a long shot. Is the problem reproducible in any way?

Okay, I rule out a hardware issue, since a different server today
experienced the same hard lockup. One thing which looks suspicious to me
are the repetitions of bio_endio/clone_endio:

Oct 13 03:16:54 10.80.5.48 Call Trace:
Oct 13 03:16:54 10.80.5.48 <NMI>
Oct 13 03:16:54 10.80.5.48 [<ffffffff81651631>] dump_stack+0x58/0x7f
Oct 13 03:16:54 10.80.5.48 [<ffffffff81089a6c>] warn_slowpath_common+0x8c/0xc0
Oct 13 03:16:54 10.80.5.48 [<ffffffff81089b56>] warn_slowpath_fmt+0x46/0x50
Oct 13 03:16:54 10.80.5.48 [<ffffffff811015f8>] watchdog_overflow_callback+0x98/0xc0
Oct 13 03:16:54 10.80.5.48 [<ffffffff81132d0c>] __perf_event_overflow+0x9c/0x250
Oct 13 03:16:54 10.80.5.48 [<ffffffff81133664>] perf_event_overflow+0x14/0x20
Oct 13 03:16:54 10.80.5.48 [<ffffffff81061796>] intel_pmu_handle_irq+0x1d6/0x3e0
Oct 13 03:16:54 10.80.5.48 [<ffffffff8105b4c4>] perf_event_nmi_handler+0x34/0x60
Oct 13 03:16:54 10.80.5.48 [<ffffffff8104c152>] nmi_handle+0xa2/0x1a0
Oct 13 03:16:54 10.80.5.48 [<ffffffff8104c3b4>] do_nmi+0x164/0x430
Oct 13 03:16:54 10.80.5.48 [<ffffffff81656e2e>] end_repeat_nmi+0x1a/0x1e
Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
Oct 13 03:16:54 10.80.5.48 [<ffffffff8125be19>] ? ext4_finish_bio+0x279/0x2a0
Oct 13 03:16:54 10.80.5.48 <<EOE>>
Oct 13 03:16:54 10.80.5.48 <IRQ>
Oct 13 03:16:54 10.80.5.48 [<ffffffff8125c2c8>] ext4_end_bio+0xc8/0x120
Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
Oct 13 03:16:54 10.80.5.48 [<ffffffff81546781>] dec_pending+0x1c1/0x360
Oct 13 03:16:54 10.80.5.48 [<ffffffff81546996>] clone_endio+0x76/0xa0
Oct 13 03:16:54 10.80.5.48 [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
Oct 13 03:16:54 10.80.5.48 [<ffffffff812fad2b>] blk_update_request+0x21b/0x450
Oct 13 03:16:54 10.80.5.48 [<ffffffff810e7797>] ? generic_exec_single+0xa7/0xb0
Oct 13 03:16:54 10.80.5.48 [<ffffffff812faf87>] blk_update_bidi_request+0x27/0xb0
Oct 13 03:16:54 10.80.5.48 [<ffffffff810e7817>] ? __smp_call_function_single+0x77/0x120
Oct 13 03:16:54 10.80.5.48 [<ffffffff812fcc7f>] blk_end_bidi_request+0x2f/0x80
Oct 13 03:16:54 10.80.5.48 [<ffffffff812fcd20>] blk_end_request+0x10/0x20
Oct 13 03:16:54 10.80.5.48 [<ffffffff813fdc1c>] scsi_io_completion+0xbc/0x620
Oct 13 03:16:54 10.80.5.48 [<ffffffff813f57f9>] scsi_finish_command+0xc9/0x130
Oct 13 03:16:54 10.80.5.48 [<ffffffff813fe2e7>] scsi_softirq_done+0x147/0x170
Oct 13 03:16:54 10.80.5.48 [<ffffffff813035ad>] blk_done_softirq+0x7d/0x90
Oct 13 03:16:54 10.80.5.48 [<ffffffff8108ed87>] __do_softirq+0x137/0x2e0
Oct 13 03:16:54 10.80.5.48 [<ffffffff81658a0c>] call_softirq+0x1c/0x30
Oct 13 03:16:54 10.80.5.48 [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
Oct 13 03:16:54 10.80.5.48 [<ffffffff8108e925>] irq_exit+0x95/0xa0
Oct 13 03:16:54 10.80.5.48 [<ffffffff81658f76>] do_IRQ+0x66/0xe0
Oct 13 03:16:54 10.80.5.48 [<ffffffff816567ef>] common_interrupt+0x6f/0x6f
Oct 13 03:16:54 10.80.5.48 <EOI>
Oct 13 03:16:54 10.80.5.48 [<ffffffff81656836>] ? retint_swapgs+0xe/0x13
Oct 13 03:16:54 10.80.5.48 ---[ end trace 4a0584a583c66b92 ]---

Doing addr2line on ffffffff8125c2c8 shows
/home/projects/linux-stable/fs/ext4/page-io.c:335, which for me is the
last bio_put in ext4_end_bio. However, the ? addresses right at the
beginning of the NMI stack (ffffffff8125be19) map to the inner loop in
bit_spin_lock:

	} while (test_bit(bitnum, addr));

and this is in line with my initial bug report.

Unfortunately I wasn't able to acquire a crashdump, since the machine
hard-locked way too fast.

On a slightly different note, is it possible to panic the machine via
NMIs? Since if all the CPUs are hard locked up, they cannot process
sysrq interrupts?

> [...]
>
> Well, if you have a crashdump, then bh->b_size is the block size. So just
> check that for the bh we are spinning on.

Turns out in my original email the bh->b_size was shown:
b_size = 0x400 == 1k. So the filesystem is not 4k but 1k.
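(For completeness, the addr2line step above corresponds to an invocation
along the lines of

	addr2line -e vmlinux -f -i ffffffff8125be19

where -f prints the enclosing function and -i walks inlined frames --
bit_spin_lock() is inlined into ext4_finish_bio(), which is why the
address resolves to the test_bit() spin loop.)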
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Jan Kara @ 2015-10-13 13:14 UTC
To: Nikolay Borisov
Cc: Jan Kara, Hillf Danton, 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, vbabka, gilad, mgorman, linux-mm, Marian Marinov

On Tue 13-10-15 13:37:16, Nikolay Borisov wrote:
> On 10/13/2015 11:15 AM, Jan Kara wrote:
> [...]
>
> Okay, I rule out a hardware issue, since a different server today
> experienced the same hard lockup. One thing which looks suspicious to me
> are the repetitions of bio_endio/clone_endio:
>
> [trace trimmed -- see above]
>
> Doing addr2line on ffffffff8125c2c8 shows
> /home/projects/linux-stable/fs/ext4/page-io.c:335, which for me is the
> last bio_put in ext4_end_bio. However, the ? addresses right at the
> beginning of the NMI stack (ffffffff8125be19) map to the inner loop in
> bit_spin_lock:
>
> 	} while (test_bit(bitnum, addr));
>
> and this is in line with my initial bug report.

OK.

> Unfortunately I wasn't able to acquire a crashdump, since the machine
> hard-locked way too fast.
>
> On a slightly different note, is it possible to panic the machine via
> NMIs? Since if all the CPUs are hard locked up, they cannot process
> sysrq interrupts?

Certainly it's possible to do that - the easiest way is actually to use

	nmi_watchdog=panic

Then panic will automatically trigger when the watchdog fires.

> [...]
>
> Turns out in my original email the bh->b_size was shown:
> b_size = 0x400 == 1k. So the filesystem is not 4k but 1k.

OK, then I have a theory. We can manipulate bh->b_state in a non-atomic
manner in _ext4_get_block(). If we happen to do that on the first buffer
in a page while IO completes on another buffer in the same page, we could
in theory mess up and miss the clearing of the BH_Uptodate_Lock flag. Can
you try whether the attached patch fixes your problem?

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment: 0001-ext4-Fix-bh-b_state-corruption.patch --]

From 4437fcab09fdbac2136f4fbd1dd7530ac0ec5b3a Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.com>
Date: Tue, 13 Oct 2015 14:59:54 +0200
Subject: [PATCH] ext4: Fix bh->b_state corruption

ext4 can update bh->b_state non-atomically in _ext4_get_block() and
ext4_da_get_block_prep().
Usually this is fine since bh is just a temporary storage for mapping
information on the stack, but in some cases it can be a fully living bh
attached to a page. In such a case, a non-atomic update of bh->b_state
can race with an atomic update which then gets lost. Usually when we are
mapping a bh and thus updating bh->b_state non-atomically, nobody else
touches the bh and so things work out fine, but there is one case to
especially worry about: ext4_finish_bio() uses BH_Uptodate_Lock on the
first bh in the page to synchronize handling of PageWriteback state. So
when blocksize < pagesize, we can be atomically modifying bh->b_state of
a buffer that actually isn't under IO and thus can race e.g. with
delalloc trying to map that buffer. The result is that we can mistakenly
set / clear the BH_Uptodate_Lock bit, resulting in the corruption of
PageWriteback state or a missed unlock of BH_Uptodate_Lock.

Fix the problem by always updating bh->b_state bits atomically.

Signed-off-by: Jan Kara <jack@suse.com>
---
 fs/ext4/inode.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 612fbcf76b5c..77604002ae75 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -657,6 +657,24 @@ has_zeroout:
 	return retval;
 }
 
+/*
+ * Update EXT4_MAP_FLAGS in bh->b_state atomically according to 'flags'. This
+ * is ugly but once we get rid of using bh as a container for mapping
+ * information to pass to / from get_block functions, this can go away.
+ */
+static void ext4_update_bh_state(struct buffer_head *bh, unsigned long flags)
+{
+	int i;
+
+	for (i = 0; i < BITS_PER_LONG; i++)
+		if ((1 << i) & EXT4_MAP_FLAGS) {
+			if (flags & (1 << i))
+				set_bit(i, &bh->b_state);
+			else
+				clear_bit(i, &bh->b_state);
+		}
+}
+
 /* Maximum number of blocks we map for direct IO at once. */
 #define DIO_MAX_BLOCKS 4096
 
@@ -693,7 +711,7 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
 		ext4_io_end_t *io_end = ext4_inode_aio(inode);
 
 		map_bh(bh, inode->i_sb, map.m_pblk);
-		bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+		ext4_update_bh_state(bh, map.m_flags);
 		if (IS_DAX(inode) && buffer_unwritten(bh)) {
 			/*
 			 * dgc: I suspect unwritten conversion on ext4+DAX is
@@ -1637,7 +1655,7 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
 		return ret;
 
 	map_bh(bh, inode->i_sb, map.m_pblk);
-	bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
+	ext4_update_bh_state(bh, map.m_flags);
 
 	if (buffer_unwritten(bh)) {
 		/* A delayed write to unwritten bh should be marked
-- 
2.1.4
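Spelling the theory out, here is a sketch of the lost-update interleaving
the patch is meant to close (1k blocks, so several buffers share a page
and the first buffer's b_state word carries BH_Uptodate_Lock for all of
them; 'old' is a local introduced for the sketch):

	/* CPU0: ext4_finish_bio(), completing IO on some buffer of the
	 * page; the lock always lives on the page's first bh ("head"). */
	bit_spin_lock(BH_Uptodate_Lock, &head->b_state);

	/* CPU1: _ext4_get_block() mapping head itself; the non-atomic
	 * RMW starts by reading b_state with the lock bit still SET: */
	old = head->b_state;

	/* CPU0: done, atomically clears the lock bit: */
	bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);

	/* CPU1: writes back the stale value, re-setting the lock bit: */
	head->b_state = (old & ~EXT4_MAP_FLAGS) | map.m_flags;

	/*
	 * BH_Uptodate_Lock is now set with no owner.  The next IO
	 * completion on this page spins in bit_spin_lock() forever --
	 * matching the observed hard lockup.  The patch closes this by
	 * touching only the EXT4_MAP_FLAGS bits, one atomic op at a time.
	 */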
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Nikolay Borisov @ 2015-10-14  9:02 UTC
To: Jan Kara
Cc: Hillf Danton, 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, vbabka, gilad, mgorman, linux-mm, Marian Marinov

On 10/13/2015 04:14 PM, Jan Kara wrote:
> On Tue 13-10-15 13:37:16, Nikolay Borisov wrote:
> [...]
>
> OK, then I have a theory. We can manipulate bh->b_state in a non-atomic
> manner in _ext4_get_block(). If we happen to do that on the first buffer
> in a page while IO completes on another buffer in the same page, we could
> in theory mess up and miss the clearing of the BH_Uptodate_Lock flag. Can
> you try whether the attached patch fixes your problem?

I will try the patch, however it might take some time to report back,
since scheduling reboots on the live servers is going to be tricky, and
unfortunately at the moment I cannot reproduce the issue on demand.
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI

From: Nikolay Borisov @ 2015-10-16  8:08 UTC
To: Jan Kara
Cc: 'linux-kernel', Theodore Ts'o, Andreas Dilger, linux-fsdevel, SiteGround Operations, Marian Marinov

On 10/13/2015 04:14 PM, Jan Kara wrote:
> On Tue 13-10-15 13:37:16, Nikolay Borisov wrote:
> [...]
>
> OK, then I have a theory. We can manipulate bh->b_state in a non-atomic
> manner in _ext4_get_block(). If we happen to do that on the first buffer
> in a page while IO completes on another buffer in the same page, we could
> in theory mess up and miss the clearing of the BH_Uptodate_Lock flag. Can
> you try whether the attached patch fixes your problem?

I just want to make sure I have fully understood the problem. The way I
understand what you have said is that since the blocksize is 1k, we
potentially need up to 4 buffer heads to map everything in a page. But
as you mentioned, the locking in ext4_finish_bio is done only on the
first buffer_head in this set of bhs.
At the same time in _ext4_get_block bh->b_state is modified non-atomically as you said, and this can happen in one of those 4bh used to map the whole page? Correct? If my reasoning is correct then I see no reason why this corruption couldn't happen even with blocksize == pagesize since ext4_get_block's non-atomic modification of the bh->b_state could still race with ext4_finish_bio's atomic modification (even though ext4_finish_bio would just work on a single buffer head)? Sorry if the question seems stupid but I just want to fully wrap my head around the internals of it. Thanks > > Honza > ^ permalink raw reply [flat|nested] 11+ messages in thread
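To make the suspected interleaving concrete, here is a minimal userspace
analogue of the race being discussed. It is illustrative only: LOCK_BIT and
MAP_BITS are made-up stand-ins for BH_Uptodate_Lock and the buffer mapping
flags, it assumes GCC's __atomic builtins and pthreads, and it deliberately
contains the data race it demonstrates.

/* race_sketch.c -- userspace analogue of the suspected bug, not kernel
 * code.  Thread A does a plain read-modify-write on a shared state word
 * (like _ext4_get_block() on bh->b_state); thread B sets and clears a
 * lock bit in the same word atomically (like bit_spin_lock/unlock on
 * BH_Uptodate_Lock).  A's stale store can resurrect B's cleared bit.
 * Build: gcc -O2 -pthread race_sketch.c
 */
#include <pthread.h>
#include <stdio.h>

#define LOCK_BIT  (1UL << 0)    /* stands in for BH_Uptodate_Lock */
#define MAP_BITS  (0xfUL << 1)  /* stands in for the mapping flags */

static unsigned long state;

static void *mapper(void *unused)
{
	for (unsigned long i = 0; i < 100000000UL; i++)
		/* plain RMW: load, modify, store -- not atomic */
		state = (state & ~MAP_BITS) | ((i << 1) & MAP_BITS);
	return NULL;
}

static void *completer(void *unused)
{
	for (unsigned long i = 0; i < 100000000UL; i++) {
		__atomic_fetch_or(&state, LOCK_BIT, __ATOMIC_ACQUIRE);
		__atomic_fetch_and(&state, ~LOCK_BIT, __ATOMIC_RELEASE);
		/* a stale store from mapper() can bring the bit back */
		if (__atomic_load_n(&state, __ATOMIC_RELAXED) & LOCK_BIT) {
			printf("lock bit resurrected by a plain store\n");
			return NULL;
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, mapper, NULL);
	pthread_create(&b, NULL, completer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

The converse is equally possible: the plain store can wipe the bit while it
is held, which matches the observed symptom of spinning forever on a
BH_Uptodate_Lock bit that nobody will ever clear.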
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI
  2015-10-16  8:08 ` Nikolay Borisov
@ 2015-10-16 12:51   ` Jan Kara
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2015-10-16 12:51 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Jan Kara, 'linux-kernel', Theodore Ts'o, Andreas Dilger,
	linux-fsdevel, SiteGround Operations, Marian Marinov

On Fri 16-10-15 11:08:46, Nikolay Borisov wrote:
> On 10/13/2015 04:14 PM, Jan Kara wrote:
> >> Turns out in my original email the bh->b_size was shown:
> >> b_size = 0x400 == 1k. So the filesystem is not 4k but 1k.
> >
> > OK, then I have a theory. We can manipulate bh->b_state in a non-atomic
> > manner in _ext4_get_block(). If we happen to do that on the first buffer in
> > a page while IO completes on another buffer in the same page, we could in
> > theory mess up and miss clearing of the BH_Uptodate_Lock flag. Can you try
> > whether the attached patch fixes your problem?
>
> I just want to make sure I have fully understood the problem. So the way
> I understand what you have said is that since the blocksize is 1k, this
> potentially means we might need up to 4 buffer heads to map everything
> in a page. But as you mentioned, the locking in ext4_finish_bio is done
> only on the first buffer_head in this set of bh. At the same time, in
> _ext4_get_block bh->b_state is modified non-atomically as you said, and
> this can happen on one of those 4 bh used to map the whole page? Correct?
>
> If my reasoning is correct then I see no reason why this corruption
> couldn't happen even with blocksize == pagesize, since ext4_get_block's
> non-atomic modification of bh->b_state could still race with
> ext4_finish_bio's atomic modification (even though ext4_finish_bio would
> just work on a single buffer head)?

Well, the reason why this race is not possible with 4k blocks is that we
map blocks in a page (i.e., call _ext4_get_block()) only under the page
lock and only when the buffer isn't already mapped. And in that case IO
completion cannot happen (for IO to run, the buffer must be mapped). So
it's mostly pure luck, but the race is not possible...

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread
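A rough sketch of the ordering this argument relies on, for blocksize ==
pagesize. This is an illustrative pseudo-flow under the assumptions stated
in the thread, not the actual ext4 call chain; the function names are the
generic kernel helpers, and the branch structure is simplified.

/* Illustrative pseudo-flow, not real kernel code. */
lock_page(page);
bh = page_buffers(page);     /* exactly one bh when blocksize == pagesize */
if (!buffer_mapped(bh))
	ext4_get_block(inode, iblock, bh, 0);   /* the plain b_state RMW */
unlock_page(page);

/*
 * IO is only ever submitted against a mapped buffer, so if an end_io
 * handler is running, the buffer was already mapped and the branch above
 * is skipped: the plain RMW and the atomic bit ops on the same b_state
 * word can never overlap.  With 1k blocks on 4k pages, however, the
 * first buffer of a page can be in the middle of being mapped while IO
 * completes on an already-mapped neighbour, and ext4_finish_bio() takes
 * BH_Uptodate_Lock in head->b_state -- the very word the mapping path is
 * doing its plain read-modify-write on.
 */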
* [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI
@ 2015-10-08 15:31 Nikolay Borisov
[not found] ` <CAOtvUMcrhq3epOPCEciMGq53S6rTyURAKEWhQ=NwrkF95aJ+_Q@mail.gmail.com>
0 siblings, 1 reply; 11+ messages in thread
From: Nikolay Borisov @ 2015-10-08 15:31 UTC (permalink / raw)
To: tytso, adilger.kernel, viro, linux-kernel, linux-fsdevel; +Cc: operations, mm
Currently, when bios are finished in ext4_finish_bio, this is done by
first disabling interrupts and then acquiring a bit_spin_lock.
However, those buffer heads might be under async write, and as such the
wait on the bit_spin_lock can leave the CPU spinning with interrupts
disabled for an arbitrary period of time. If in the meantime there is
demand for memory which cannot otherwise be satisfied, the allocator
might have to resort to draining the per-cpu page lists, like so:
PID: 31111 TASK: ffff881cbb2fb870 CPU: 2 COMMAND: "kworker/u96:0"
#0 [ffff881fffa46dc0] crash_nmi_callback at ffffffff8106f24e
#1 [ffff881fffa46de0] nmi_handle at ffffffff8104c152
#2 [ffff881fffa46e70] do_nmi at ffffffff8104c3b4
#3 [ffff881fffa46ef0] end_repeat_nmi at ffffffff81656e2e
[exception RIP: smp_call_function_many+577]
RIP: ffffffff810e7f81 RSP: ffff880d35b815c8 RFLAGS: 00000202
RAX: 0000000000000017 RBX: ffffffff81142690 RCX: 0000000000000017
RDX: ffff883fff375478 RSI: 0000000000000040 RDI: 0000000000000040
RBP: ffff880d35b81628 R8: ffff881fffa51ec8 R9: 0000000000000000
R10: 0000000000000000 R11: ffffffff812943f3 R12: 0000000000000000
R13: ffff881fffa51ec0 R14: ffff881fffa51ec8 R15: 0000000000011f00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#4 [ffff880d35b815c8] smp_call_function_many at ffffffff810e7f81
#5 [ffff880d35b81630] on_each_cpu_mask at ffffffff810e801c
#6 [ffff880d35b81660] drain_all_pages at ffffffff81140178
#7 [ffff880d35b81690] __alloc_pages_nodemask at ffffffff8114310b
#8 [ffff880d35b81810] alloc_pages_current at ffffffff81181c5e
#9 [ffff880d35b81860] new_slab at ffffffff81188305
However, this will never return, since on_each_cpu_mask is called with
its last argument set to 1, i.e. wait until the IPI handler has been
invoked on every cpu. Additionally, there may be another thread on whose
completion ext4_finish_bio depends, e.g.:
PID: 34220 TASK: ffff883937660810 CPU: 44 COMMAND: "kworker/u98:39"
#0 [ffff88209d5b10b8] __schedule at ffffffff81653d5a
#1 [ffff88209d5b1150] schedule at ffffffff816542f9
#2 [ffff88209d5b1160] schedule_preempt_disabled at ffffffff81654686
#3 [ffff88209d5b1180] __mutex_lock_slowpath at ffffffff816521eb
#4 [ffff88209d5b1200] mutex_lock at ffffffff816522d1
#5 [ffff88209d5b1220] new_read at ffffffffa0152a7e [dm_bufio]
#6 [ffff88209d5b1280] dm_bufio_get at ffffffffa0152ba6 [dm_bufio]
#7 [ffff88209d5b1290] dm_bm_read_try_lock at ffffffffa015c878 [dm_persistent_data]
#8 [ffff88209d5b12e0] dm_tm_read_lock at ffffffffa015f7ad [dm_persistent_data]
#9 [ffff88209d5b12f0] bn_read_lock at ffffffffa016281b [dm_persistent_data]
If this second thread in turn depends on the original allocation
succeeding, a hard lockup occurs: ext4_finish_bio waits for
block_write_full_page to complete, which depends on the original memory
allocation succeeding, which in turn depends on the IPI executing on
every core. For completeness, here is what the call stack of the hung
ext4_finish_bio looks like:
[427160.405277] NMI backtrace for cpu 23
[427160.405279] CPU: 23 PID: 4611 Comm: kworker/u98:7 Tainted: G W 3.12.47-clouder1 #1
[427160.405281] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
[427160.405285] Workqueue: writeback bdi_writeback_workfn (flush-252:148)
[427160.405286] task: ffff8825aa819830 ti: ffff882b19180000 task.ti: ffff882b19180000
[427160.405290] RIP: 0010:[<ffffffff8125be13>] [<ffffffff8125be13>] ext4_finish_bio+0x273/0x2a0
[427160.405291] RSP: 0000:ffff883fff3639b0 EFLAGS: 00000002
[427160.405292] RAX: ffff882b19180000 RBX: ffff883f67480a80 RCX: 0000000000000110
[427160.405292] RDX: ffff882b19180000 RSI: 0000000000000000 RDI: ffff883f67480a80
[427160.405293] RBP: ffff883fff363a70 R08: 0000000000014b80 R09: ffff881fff454f00
[427160.405294] R10: ffffea00473214c0 R11: ffffffff8113bfd7 R12: ffff880826272138
[427160.405295] R13: 0000000000000000 R14: 0000000000000000 R15: ffffea00aeaea400
[427160.405296] FS: 0000000000000000(0000) GS:ffff883fff360000(0000) knlGS:0000000000000000
[427160.405296] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[427160.405297] CR2: 0000003c5b009c24 CR3: 0000000001c0b000 CR4: 00000000001407e0
[427160.405297] Stack:
[427160.405305] 0000000000000000 ffffffff8203f230 ffff883fff363a00 ffff882b19180000
[427160.405312] ffff882b19180000 ffff882b19180000 00000400018e0af8 ffff882b19180000
[427160.405319] ffff883f67480a80 0000000000000000 0000000000000202 00000000d219e720
[427160.405320] Call Trace:
[427160.405324] <IRQ>
[427160.405327] [<ffffffff8125c2c8>] ext4_end_bio+0xc8/0x120
[427160.405335] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405341] [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405345] [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405350] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405353] [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405358] [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405362] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405365] [<ffffffff81546781>] dec_pending+0x1c1/0x360
[427160.405369] [<ffffffff81546996>] clone_endio+0x76/0xa0
[427160.405373] [<ffffffff811dbf1d>] bio_endio+0x1d/0x40
[427160.405380] [<ffffffff812fad2b>] blk_update_request+0x21b/0x450
[427160.405385] [<ffffffff812faf87>] blk_update_bidi_request+0x27/0xb0
[427160.405389] [<ffffffff812fcc7f>] blk_end_bidi_request+0x2f/0x80
[427160.405392] [<ffffffff812fcd20>] blk_end_request+0x10/0x20
[427160.405400] [<ffffffff813fdc1c>] scsi_io_completion+0xbc/0x620
[427160.405404] [<ffffffff813f57f9>] scsi_finish_command+0xc9/0x130
[427160.405408] [<ffffffff813fe2e7>] scsi_softirq_done+0x147/0x170
[427160.405413] [<ffffffff813035ad>] blk_done_softirq+0x7d/0x90
[427160.405418] [<ffffffff8108ed87>] __do_softirq+0x137/0x2e0
[427160.405422] [<ffffffff81658a0c>] call_softirq+0x1c/0x30
[427160.405427] [<ffffffff8104a35d>] do_softirq+0x8d/0xc0
[427160.405428] [<ffffffff8108e925>] irq_exit+0x95/0xa0
[427160.405431] [<ffffffff8106f755>] smp_call_function_single_interrupt+0x35/0x40
[427160.405434] [<ffffffff8165826f>] call_function_single_interrupt+0x6f/0x80
[427160.405436] <EOI>
[427160.405438] [<ffffffff813276e6>] ? memcpy+0x6/0x110
[427160.405440] [<ffffffff811dc6d6>] ? __bio_clone+0x26/0x70
[427160.405442] [<ffffffff81546db9>] __clone_and_map_data_bio+0x139/0x160
[427160.405445] [<ffffffff815471cd>] __split_and_process_bio+0x3ed/0x490
[427160.405447] [<ffffffff815473a6>] dm_request+0x136/0x1e0
[427160.405449] [<ffffffff812fbe0a>] generic_make_request+0xca/0x100
[427160.405451] [<ffffffff812fbeb9>] submit_bio+0x79/0x160
[427160.405453] [<ffffffff81144c3d>] ? account_page_writeback+0x2d/0x40
[427160.405455] [<ffffffff81144dbd>] ? __test_set_page_writeback+0x16d/0x1f0
[427160.405457] [<ffffffff8125b7a9>] ext4_io_submit+0x29/0x50
[427160.405459] [<ffffffff8125b8fb>] ext4_bio_write_page+0x12b/0x2f0
[427160.405461] [<ffffffff81252fe8>] mpage_submit_page+0x68/0x90
[427160.405463] [<ffffffff81253100>] mpage_process_page_bufs+0xf0/0x110
[427160.405465] [<ffffffff81254a80>] mpage_prepare_extent_to_map+0x210/0x310
[427160.405468] [<ffffffff8125a911>] ? ext4_writepages+0x361/0xc60
[427160.405472] [<ffffffff81283c09>] ? __ext4_journal_start_sb+0x79/0x110
[427160.405474] [<ffffffff8125a948>] ext4_writepages+0x398/0xc60
[427160.405477] [<ffffffff812fd358>] ? blk_finish_plug+0x18/0x50
[427160.405479] [<ffffffff81146b40>] do_writepages+0x20/0x40
[427160.405483] [<ffffffff811cec79>] __writeback_single_inode+0x49/0x2b0
[427160.405487] [<ffffffff810aeeef>] ? wake_up_bit+0x2f/0x40
[427160.405488] [<ffffffff811cfdee>] writeback_sb_inodes+0x2de/0x540
[427160.405492] [<ffffffff811a6e65>] ? put_super+0x25/0x50
[427160.405494] [<ffffffff811d00ee>] __writeback_inodes_wb+0x9e/0xd0
[427160.405495] [<ffffffff811d035b>] wb_writeback+0x23b/0x340
[427160.405497] [<ffffffff811d04f9>] wb_do_writeback+0x99/0x230
[427160.405500] [<ffffffff810a40f1>] ? set_worker_desc+0x81/0x90
[427160.405503] [<ffffffff810c7a6a>] ? dequeue_task_fair+0x36a/0x4c0
[427160.405505] [<ffffffff811d0bf8>] bdi_writeback_workfn+0x88/0x260
[427160.405509] [<ffffffff810bbb3e>] ? finish_task_switch+0x4e/0xe0
[427160.405511] [<ffffffff81653dac>] ? __schedule+0x2dc/0x760
[427160.405514] [<ffffffff810a61e5>] process_one_work+0x195/0x550
[427160.405517] [<ffffffff810a848a>] worker_thread+0x13a/0x430
[427160.405519] [<ffffffff810a8350>] ? manage_workers+0x2c0/0x2c0
[427160.405521] [<ffffffff810ae48e>] kthread+0xce/0xe0
[427160.405523] [<ffffffff810ae3c0>] ? kthread_freezable_should_stop+0x80/0x80
[427160.405525] [<ffffffff816571c8>] ret_from_fork+0x58/0x90
[427160.405527] [<ffffffff810ae3c0>] ? kthread_freezable_should_stop+0x80/0x80
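To summarize the traces above, the dependency cycle looks roughly like this.
This is a sketch in comment form; it condenses the stacks shown earlier and
is not the exact call chains.

/*
 * CPU A, allocator (PID 31111 above):
 *     drain_all_pages()
 *         on_each_cpu_mask(..., wait = 1)
 *             sends the page-draining IPI, then spins until every
 *             CPU has acknowledged it
 *
 * CPU B, writeback end_io (PID 4611 above):
 *     local_irq_save(flags);                    // IRQs now off
 *     bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
 *         // the bit is held by a third thread (see PID 34220's stack,
 *         // blocked on a dm_bufio mutex) that ultimately waits on
 *         // CPU A's allocation, so CPU B spins forever -- and with
 *         // interrupts disabled it never services CPU A's IPI
 *
 * A waits on B's IPI acknowledgement; B waits, with IRQs off, on a lock
 * whose holder waits on A: a hard lockup on every CPU involved.
 */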
To fix the situation, this patch changes the order in which the
bit_spin_lock is taken and interrupts are disabled. The expected
effect is that even while a core is spinning on the bit lock it will
have its interrupts enabled and will thus be able to respond to IPIs.
This would eventually allow a memory allocation that requires draining
of the per-cpu pages to succeed.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
fs/ext4/page-io.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 84ba4d2..095331b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -96,8 +96,8 @@ static void ext4_finish_bio(struct bio *bio)
* We check all buffers in the page under BH_Uptodate_Lock
* to avoid races with other end io clearing async_write flags
*/
- local_irq_save(flags);
bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
+ local_irq_save(flags);
do {
if (bh_offset(bh) < bio_start ||
bh_offset(bh) + bh->b_size > bio_end) {
@@ -109,8 +109,8 @@ static void ext4_finish_bio(struct bio *bio)
if (bio->bi_error)
buffer_io_error(bh);
} while ((bh = bh->b_this_page) != head);
- bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
local_irq_restore(flags);
+ bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
if (!under_io) {
#ifdef CONFIG_EXT4_FS_ENCRYPTION
if (ctx)
--
2.5.0
^ permalink raw reply related	[flat|nested] 11+ messages in thread

[parent not found: <CAOtvUMcrhq3epOPCEciMGq53S6rTyURAKEWhQ=NwrkF95aJ+_Q@mail.gmail.com>]
* Re: [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI
  [not found] ` <CAOtvUMcrhq3epOPCEciMGq53S6rTyURAKEWhQ=NwrkF95aJ+_Q@mail.gmail.com>
@ 2015-10-09  8:50   ` Nikolay Borisov
  0 siblings, 0 replies; 11+ messages in thread
From: Nikolay Borisov @ 2015-10-09 8:50 UTC (permalink / raw)
  To: Gilad Ben-Yossef
  Cc: mm, linux-fsdevel, viro, adilger.kernel, linux-kernel, operations, tytso

On 10/09/2015 11:41 AM, Gilad Ben-Yossef wrote:
> On Oct 8, 2015 18:31, "Nikolay Borisov" <kernel@kyup.com> wrote:
>>
>> Currently, when bios are finished in ext4_finish_bio, this is done by
>> first disabling interrupts and then acquiring a bit_spin_lock.
> ...
>>
>> To fix the situation, this patch changes the order in which the
>> bit_spin_lock is taken and interrupts are disabled. The expected
>> effect is that even while a core is spinning on the bit lock it will
>> have its interrupts enabled and will thus be able to respond to IPIs.
>
> Are you sure this spin lock is never taken from interrupt context?

Good point. I think this fix is indeed wrong, as this lock is actually
taken from softirq context, as evident from the thread which prompted me
to write the patch in the first place:
http://permalink.gmane.org/gmane.linux.kernel/2056730

> If it does, an interrupt occurring after the lock is taken and before
> interrupts are disabled can deadlock.
>
> Gilad

^ permalink raw reply	[flat|nested] 11+ messages in thread
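For reference, a sketch of the single-CPU deadlock being pointed out here,
assuming the reordered locking from this RFC. It is illustrative comment-form
pseudocode, not actual kernel code.

/*
 * With the RFC's ordering, on a single CPU:
 *
 *   task context:
 *       bit_spin_lock(BH_Uptodate_Lock, &head->b_state);  // IRQs still on
 *       ... an interrupt arrives here ...
 *
 *   softirq on the same CPU (block IO completion):
 *       ext4_finish_bio()
 *           bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
 *           // spins forever: the holder is the task this softirq
 *           // interrupted, and that task cannot run again until the
 *           // softirq returns -- a single-CPU self-deadlock
 *
 *   task context (never reached):
 *       local_irq_save(flags);
 *
 * Hence interrupts must be disabled before taking any lock that can
 * also be acquired from (soft)irq context.
 */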
end of thread, other threads:[~2015-10-16 12:51 UTC | newest]
Thread overview: 11+ messages
-- links below jump to the message on this page --
[not found] <062501d10262$d40d0a50$7c271ef0$@alibaba-inc.com>
[not found] ` <56176C10.8040709@kyup.com>
[not found] ` <062801d10265$5a749fc0$0f5ddf40$@alibaba-inc.com>
2015-10-09 8:03 ` [RFC PATCH 1/2] ext4: Fix possible deadlock with local interrupts disabled and page-draining IPI Nikolay Borisov
2015-10-12 13:40 ` Jan Kara
2015-10-12 14:51 ` Nikolay Borisov
2015-10-13 8:15 ` Jan Kara
2015-10-13 10:37 ` Nikolay Borisov
2015-10-13 13:14 ` Jan Kara
2015-10-14 9:02 ` Nikolay Borisov
2015-10-16 8:08 ` Nikolay Borisov
2015-10-16 12:51 ` Jan Kara
2015-10-08 15:31 Nikolay Borisov
[not found] ` <CAOtvUMcrhq3epOPCEciMGq53S6rTyURAKEWhQ=NwrkF95aJ+_Q@mail.gmail.com>
2015-10-09 8:50 ` Nikolay Borisov