* BUG: write-lock lockup
@ 2006-06-12 19:53 Charles C. Bennett, Jr.
  2006-06-17 17:07 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Charles C. Bennett, Jr. @ 2006-06-12 19:53 UTC (permalink / raw)
To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 622 bytes --]

Hi All -

I'm seeing write-lock lockups... this is happening with the Fedora
kernels from at least as far back as their 2.6.15-1.1833_FC4 (2.6.15.5)
and as recently as their 2.6.16-1.2111_FC4 (2.6.16.17).

It's a beefy box:
Gateway 9515R - Intel ICH5/ICH5R
Two dual-core 3.0 GHz Xeons, 4 GB RAM
All disks via Emulex LP101-H (thor), switched fabric to Hitachi WMS SAN
storage.

The lockups are not process-specific. Running mke2fs on large
filesystems seems to make it happen sooner rather than later.

I can fish out any other data you need, run tests, etc.

Thanks,
ccb
--
Charles C. Bennett, Jr.  <ccb@acm.org>

[-- Attachment #2: lockup --]
[-- Type: text/plain, Size: 16256 bytes --]

Jun 9 20:43:40 localhost kernel: BUG: write-lock lockup on CPU#6, mkfs.ext2/3258, f799c864 (Not tainted)
Jun 9 20:43:40 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun 9 20:43:40 localhost kernel: [<c015532c>] shrink_list+0x197/0x45d [<c01557a3>] shrink_cache+0xe7/0x29b
Jun 9 20:43:40 localhost kernel: [<c014d955>] bad_range+0x22/0x2f [<c014dfe3>] __rmqueue+0xd1/0x156
Jun 9 20:43:40 localhost kernel: [<c0155db9>] shrink_zone+0x89/0xd8 [<c0155e6e>] shrink_caches+0x66/0x74
Jun 9 20:43:40 localhost kernel: [<c0155f29>] try_to_free_pages+0xad/0x1b7 [<c014e8a5>] __alloc_pages+0x136/0x2ec
Jun 9 20:43:40 localhost kernel: [<c016ca3e>] __block_commit_write+0x83/0x8f [<c014be23>] generic_file_buffered_write+0x16b/0x655
Jun 9 20:43:40 localhost kernel: [<c012a04b>] current_fs_time+0x5a/0x75 [<c012a04b>] current_fs_time+0x5a/0x75
Jun 9 20:43:40 localhost kernel: [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9
Jun 9 20:43:40 localhost kernel: [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92 [<c01719e0>] blkdev_file_write+0x0/0x24
Jun 9 20:43:40 localhost kernel: [<c014c99c>] generic_file_write_nolock+0x85/0x9f [<c0139196>] autoremove_wake_function+0x0/0x37
Jun 9 20:43:40 localhost kernel: [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a
Jun 9 20:43:40 localhost kernel: [<c0169b55>] sys_write+0x41/0x6a [<c0104035>] syscall_call+0x7/0xb
Jun 9 21:02:58 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3259, f799c864 (Not tainted)
Jun 9 21:02:58 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun 9 21:02:58 localhost kernel: [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6
Jun 9 21:02:58 localhost kernel: [<c014a09c>] end_page_writeback+0x76/0x84 [<c016b2c9>] end_buffer_async_write+0xbf/0x12a
Jun 9 21:02:58 localhost kernel: [<c01f1167>] memmove+0x24/0x2d [<c014d292>] mempool_free+0x3a/0x73
Jun 9 21:02:58 localhost kernel: [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f
Jun 9 21:02:58 localhost kernel: [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f [<c016f307>] bio_endio+0x3e/0x69
Jun 9 21:02:58 localhost kernel: [<c01e2e88>] __end_that_request_first+0x101/0x235 [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]
Jun 9 21:02:58 localhost kernel: [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod] [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod]
Jun 9 21:02:58 localhost kernel: [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod] [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod]
Jun 9 21:02:58 localhost kernel: [<c012a1f2>] __do_softirq+0x72/0xdc [<c0106383>] do_softirq+0x4b/0x4f
Jun 9 21:02:58 localhost kernel: =======================
Jun 9 21:02:58 localhost kernel: [<c0106265>] do_IRQ+0x55/0x86 [<c0104a7a>] common_interrupt+0x1a/0x20
Jun 9 21:03:05 localhost kernel: [<c014e387>] free_hot_cold_page+0xd5/0x126 [<c014eaf4>]
__pagevec_free+0x1f/0x2e Jun 9 21:03:08 localhost kernel: [<c0154371>] __pagevec_release_nonlru+0x29/0x8b [<c01553aa>] shrink_list+0x215/0x45d Jun 9 21:03:09 localhost kernel: [<c01557a3>] shrink_cache+0xe7/0x29b [<c0155db9>] shrink_zone+0x89/0xd8 Jun 9 21:03:09 localhost kernel: [<c0155e6e>] shrink_caches+0x66/0x74 [<c0155f29>] try_to_free_pages+0xad/0x1b7 Jun 9 21:03:09 localhost kernel: [<c014e8a5>] __alloc_pages+0x136/0x2ec [<c016ca3e>] __block_commit_write+0x83/0x8f Jun 9 21:03:10 localhost kernel: [<c014be23>] generic_file_buffered_write+0x16b/0x655 [<c012a04b>] current_fs_time+0x5a/0x75 Jun 9 21:03:11 localhost kernel: [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9 Jun 9 21:03:11 localhost kernel: [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92 [<c01719e0>] blkdev_file_write+0x0/0x24 Jun 9 21:03:12 localhost kernel: [<c014c99c>] generic_file_write_nolock+0x85/0x9f [<c0139196>] autoremove_wake_function+0x0/0x37 Jun 9 21:03:12 localhost kernel: [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a Jun 9 21:03:12 localhost kernel: [<c0169b55>] sys_write+0x41/0x6a [<c0104035>] syscall_call+0x7/0xb Jun 9 21:28:07 localhost mgetty[3159]: failed dev=ttyS0, pid=3159, login time out Jun 9 21:42:31 localhost kernel: BUG: write-lock lockup on CPU#3, mkfs.ext2/3257, f5e0ef24 (Not tainted) Jun 9 21:42:31 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3259, f5e0ef24 (Not tainted) Jun 9 21:42:31 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c Jun 9 21:42:31 localhost kernel: [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6 Jun 9 21:42:31 localhost kernel: [<c014a09c>] end_page_writeback+0x76/0x84 [<c016b2c9>] end_buffer_async_write+0xbf/0x12a Jun 9 21:42:31 localhost kernel: [<c014d292>] mempool_free+0x3a/0x73 [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f Jun 9 21:42:31 localhost kernel: 
[<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f Jun 9 21:42:31 localhost kernel: [<c016f307>] bio_endio+0x3e/0x69 [<c0120685>] __wake_up+0x32/0x43 Jun 9 21:42:31 localhost kernel: [<c01e2e88>] __end_that_request_first+0x101/0x235 [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod] Jun 9 21:42:31 localhost kernel: [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod] [<f896681c>] e1000_clean_tx_irq+0x76/0x62d [e1000]alhost kernel: [<c016f307>] bio_endio+0x3e/0x69 [<c0120685>] __wake_up+0x32/0x43 Jun 9 21:42:48 localhost kernel: [<c01e2e88>] __end_that_request_first+0x101/0x235 [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod] Jun 9 21:42:48 localhost kernel: [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod] [<f896681c>] e1000_clean_tx_irq+0x76/0x62d [e1000] Jun 9 21:42:48 localhost kernel: [<c01391ab>] autoremove_wake_function+0x15/0x37 [<f8833c97>] sd_rw_intr+0x70/04 (Not tainted) Jun 9 23:34:03 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c Jun 9 23:34:03 localhost kernel: [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015060d>] test_clear_page_dirty+0x2a/0xa9 Jun 9 23:34:03 localhost kernel: [<c016df8f>] try_to_free_buffers+0_file_aio_write_nolock+0x2a3/0x4d9 [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92 Jun 10 00:53:26 localhost kernel: [<c01719e0>] blkdev_file_write+0x0/0x24 [<c014c99c>] generic_file_write_nolock+0x85/0x9f Jun 10 00:53:30 localhost kernel: [<c0139196>] autoremove_wake_function+0x0/0x37 [<c0171a00>] blkdev_file_write+0x20/0x24 Jun 10 00:53:33 localhost kernel: [<c01699f2>] vfs_write+0xa2/0x15a [<c0169b55>] sys_write+0x41/0x6a Jun 10 00:53:36 localhost kernel: [<c0104035>] syscall_call+0x7/0xb Jun 10 01:03:29 localhost kernel: BUG: write-lock lockup on CPU#4, mkfs.exthe+0xe7/0x29b [<c01f21d7>] Jun 10 01:05:21 localhost kernel: __write_lock_debug+0xb4/0xdd [<c0152aac>] cache_alloc_refill+0xfd/0x27c [<c01f223f>] _raw_write_lock+0x3f/0x7c Jun 10 01:05:21 
localhost kernel: [<c0155db9>] [<c032bd15>] shrink_zone+0x89/0xd8 Jun 10 01:05:21 localhost kernel: [<c0155e6e>] shrink_caches+0x66/0x74 _write_lock_irqsave+0x9/0xd [<c0155f29>] [<c015070d>] try_to_free_pages+0xad/0x1b7 test_clear_page_writeback+0x2d/0xa6 Jun 10 01:05:21 localhost kernel: Jun 10 01:05:21 localhost kernel: [<c014a09c3a/0x73 [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f Jun 10 01:06:14 localhost kernel: [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f Jun 10 01:06:17 localhost kernel: [<c016f307] bio_endio+0x3e/0x69 [<c01e2e88>] __end_that_request_first+0x101/0x235 Jun 10 01:06:20 localhost kernel: [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod] [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod] Jun 10 01:06:21 localhost kernel: [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod] [<c0120633>] __wake_up_common+0x39/0x59 Jun 10 01:06:22 localhost kernel: [<c0120685>] __wake_up+0x32/0x43 [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod] Jun 10 01:06:24 localhost kernel: [<c0106265>] do_IRQ+0x55/0x86 [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod] Jun 10 01:06:25 localhost kernel: [<c012a1f2>] __do_softirq+0x72/0xdc [<c0106383>] do_softirq+0x4b/0x4f Jun 10 01:06:25 localhost kernel: ======================= Jun 10 01:06:26 localhost kernel: [<c0104b08>] apic_timer_interrupt+0x1c/0x24 [<c0155554>] shrink_list+0x3bf/0x45d Jun 10 01:06:27 localhost kernel: [<c01557a3>] shrink_cache+0xe7/0x29b [<c014fdb0>] get_writeback_state+0x30/0x35 Jun 10 01:06:27 localhost kernel: [<c0195d7e>] mb_cache_shrink_fn+0x182/0x1c4 [<c0155db9>] shrink_zone+0x89/0xd8 Jun 10 01:06:27 localhost kernel: [<c015622a>] balance_pgdat+0x1f7/0x3d6 [<c01564e5>] kswapd+0xdc/0x127 Jun 10 01:06:27 localhost kernel: [<c0139196>] autoremove_wake_function+0x0/0x37 [<c0156409>] kswapd+0x0/0x127 Jun 10 01:06:27 localhost kernel: [<c010243d>] kernel_thread_helper+0x5/0xb Jun 10 01:09:40 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3258, f513cf24 (Not 
tainted) Jun 10 01:09:40 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c Jun 10 01:09:40 localhost kernel: [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6 Jun 10 01:09:40 localhost kernel: [<c014a09c>] end_page_writeback+0x76/0x84 [<c016b2c9>] end_buffer_async_write+0xbf/0x12a Jun 10 01:09:40 localhost kernel: [<c014d292>] mempool_free+0x3a/0x73 [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f Jun 10 01:09:40 localhost kernel: [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f Jun 10 01:09:41 localhost kernel: [<c016f307>] bio_endio+0x3e/0x69 [<c01e2e88>] __end_that_request_first+0x101/0x235 Jun 10 01:09:41 localhost kernel: [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod] [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod] Jun 10 01:09:41 localhost kernel: [<c0104b08>] apic_timer_interrupt+0x1c/0x24 [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod] Jun 10 01:09:41 localhost kernel: [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod] [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod] Jun 10 01:09:41 localhost kernel: [<c012a1f2>] __do_softirq+0x72/0xdc [<c0106383>] do_softirq+0x4b/0x4f Jun 10 01:09:41 localhost kernel: ======================= Jun 10 01:09:41 localhost kernel: [<c0106265>] do_IRQ+0x55/0x86 [<c0104a7a>] common_interrupt+0x1a/0x20 Jun 10 01:09:45 localhost kernel: [<c0150517>] __set_page_dirty_nobuffers+0x71/0xb2 [<c016ca3e>] __block_commit_write+0x83/0x8f Jun 10 01:09:46 localhost kernel: [<c016d154>] block_commit_write+0x15/0x1c [<c0170dc6>] blkdev_commit_write+0x0/0xd Jun 10 01:09:46 localhost kernel: [<c014bfc8>] generic_file_buffered_write+0x310/0x655 [<c012a04b>] current_fs_time+0x5a/0x75 Jun 10 01:09:47 localhost kernel: [<c012a04b>] current_fs_time+0x5a/0x75 [<c0182cd8>] inode_update_time+0x2d/0x99 Jun 10 01:09:47 localhost kernel: [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9 [<c014c822>] 
generic_file_aio_write_nolock+0x3c/0x92 Jun 10 01:09:47 localhost kernel: [<c01719e0>] blkdev_file_write+0x0/0x24 [<c014c99c>] generic_file_write_nolock+0x85/0x9f Jun 10 01:09:47 localhost kernel: [<c0139196>] autoremove_wake_function+0x0/0x37 [<c0171a00>] blkdev_file_write+0x20/0x24 Jun 10 01:09:47 localhost kernel: [<c01699f2>] vfs_write+0xa2/0x15a [<c0169b55>] sys_write+0x41/0x6a Jun 10 01:09:48 localhost kernel: [<c0104035>] syscall_call+0x7/0xb Jun 10 01:15:55 localhost kernel: BUG: write-lock lockup on CPU#3, kswapd0/482, f69efaa4 (Not tainted) [restart] Jun 10 07:12:16 localhost kernel: BUG: write-lock lockup on CPU#5, mkfs.ext2/3107, f650ace4 (Not tainted) Jun 10 07:12:16 localhost kernel: BUG: write-lock lockup on CPU#2, mkfs.ext2/3108, f650ace4 (Not tainted) Jun 10 07:12:16 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c Jun 10 07:12:16 localhost kernel: [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6 Jun 10 07:12:16 localhost kernel: [<c014a09c>] end_page_writeback+0x76/0x84 [<c016b2c9>] end_buffer_async_write+0xbf/0x12a Jun 10 07:12:16 localhost kernel: [<c01f1167>] memmove+0x24/0x2d [<c014d292>] mempool_free+0x3a/0x73 Jun 10 07:12:16 localhost kernel: [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f Jun 10 07:12:16 localhost kernel: [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f [<c016f307>] bio_endio+0x3e/0x69 Jun 10 07:12:16 localhost kernel: [<c01e2e88>] __end_that_request_first+0x101/0x235 [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod] Jun 10 07:12:16 localhost kernel: [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod] [<c011e8fa>] try_to_wake_up+0x6e/0x3ec Jun 10 07:12:16 localhost kernel: [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod] [<c0120633>] __wake_up_common+0x39/0x59 Jun 10 07:12:16 localhost kernel: [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod] [<c0106265>] do_IRQ+0x55/0x86 Jun 10 07:12:16 localhost kernel: 
[<c0134a95>] __queue_work+0x45/0x54 [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod] Jun 10 07:12:19 localhost kernel: [<c012a1f2>] __do_softirq+0x72/0xdc [<c0106383>] do_softirq+0x4b/0x4f Jun 10 07:12:20 localhost kernel: ======================= Jun 10 07:12:20 localhost kernel: [<c0104b08>] apic_timer_interrupt+0x1c/0x24 [<c0154471>] __pagevec_lru_add+0x9e/0xbb Jun 10 07:12:20 localhost kernel: [<c014c0a1>] generic_file_buffered_write+0x3e9/0x655 [<c012a04b>] current_fs_time+0x5a/0x75 Jun 10 07:12:21 localhost kernel: [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9 Jun 10 07:12:21 localhost kernel: [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92 [<c01719e0>] blkdev_file_write+0x0/0x24 Jun 10 07:12:21 localhost kernel: [<c014c99c>] generic_file_write_nolock+0x85/0x9f [<c0139196>]BUG: write-lock lockup on CPU#4, safte-monitor/2750, f650ace4 (Not tainted) Jun 10 07:12:21 localhost kernel: autoremove_wake_function+0x0/0x37 Jun 10 07:12:22 localhost kernel: [<c0171a00>] [<c01f21d7>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a __write_lock_debug+0xb4/0xdd Jun 10 07:12:22 localhost kernel: [<c0169b55>] [<c01f223f>] sys_write+0x41/0x6a _raw_write_lock+0x3f/0x7c [<c0104035>] Jun 10 07:12:22 localhost kernel: [<c032bd15>] syscall_call+0x7/0xb Jun 10 07:12:22 localhost kernel: _write_lock_irqsave+0x9/0xd [<c015060d>] test_clear_page_dirty+0x2a/0xa9 Jun 10 07:12:22 localhost kernel: [<c016df8f>] try_to_free_buffers+0x6b/0x84 [<c0155567>] shrink_list+0x3d2/0x45d Jun 10 07:12:22 localhost kernel: [<c01c5a2a>] avc_has_perm_noaudit+0x24/0xce [<c014dede>] prep_new_page+0x41/0x75 Jun 10 07:12:23 localhost kernel: [<c01c5b22>] avc_has_perm+0x4e/0x5c [<c01557a3>] shrink_cache+0xe7/0x29b Jun 10 07:12:23 localhost kernel: [<c0155db9>] shrink_zone+0x89/0xd8 [<c0155e6e>] shrink_caches+0x66/0x74 Jun 10 07:12:23 localhost kernel: [<c0155f29>] try_to_free_pages+0xad/0x1b7 [<c014e8a5>] __alloc_pages+0x136/0x2ec Jun 10 
07:12:23 localhost kernel: [<c014ea79>] __get_free_pages+0x1e/0x34 [<c017ba85>] __pollwait+0x64/0x97 Jun 10 07:12:23 localhost kernel: [<c02f0644>] tcp_poll+0x15/0x183 [<c02c0db0>] sock_poll+0xf/0x11 Jun 10 07:12:23 localhost kernel: [<c017bd30>] do_select+0x177/0x303 [<c01483af>] audit_syscall_entry+0x10a/0x15d Jun 10 07:12:24 localhost kernel: [<c017ba21>] __pollwait+0x0/0x97 [<c017c0ba>] sys_select+0x1e6/0x38f Jun 10 07:12:24 localhost kernel: [<c01084cd>] do_syscall_trace+0x1ac/0x1c4 [<c0104035>] syscall_call+0x7/0xb Jun 10 07:12:24 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c Jun 10 07:12:24 localhost kernel: [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: BUG: write-lock lockup
  2006-06-12 19:53 BUG: write-lock lockup Charles C. Bennett, Jr.
@ 2006-06-17 17:07 ` Andrew Morton
  2006-06-19  7:02   ` [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-17 17:07 UTC (permalink / raw)
To: Charles C. Bennett, Jr.; +Cc: linux-kernel, Ingo Molnar

On Mon, 12 Jun 2006 15:53:43 -0400
"Charles C. Bennett, Jr." <ccb@acm.org> wrote:

> Hi All -
>
> I'm seeing write-lock lockups... this is happening with
> the Fedora kernels from at least as far back as their 2.6.15-1.1833_FC4
> (2.6.15.5) and as recently as their 2.6.16-1.2111_FC4 (2.6.16.17).
>
> It's a beefy box:
> Gateway 9515R - Intel ICH5/ICH5R
> Two dual-core 3.0 GHz Xeons, 4 GB RAM
> All disks via Emulex LP101-H (thor), switched fabric to
> Hitachi WMS SAN storage.
>
> The lockups are not process-specific. Running mke2fs on large
> filesystems seems to make it happen sooner rather than later.

Apart from these messages, does the machine otherwise work OK?

> I can fish out any other data you need, run tests, etc.
> > > Jun 9 20:43:40 localhost kernel: BUG: write-lock lockup on CPU#6, mkfs.ext2/3258, f799c864 (Not tainted) > Jun 9 20:43:40 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c > Jun 9 20:43:40 localhost kernel: [<c015532c>] shrink_list+0x197/0x45d [<c01557a3>] shrink_cache+0xe7/0x29b > Jun 9 20:43:40 localhost kernel: [<c014d955>] bad_range+0x22/0x2f [<c014dfe3>] __rmqueue+0xd1/0x156 > Jun 9 20:43:40 localhost kernel: [<c0155db9>] shrink_zone+0x89/0xd8 [<c0155e6e>] shrink_caches+0x66/0x74 > Jun 9 20:43:40 localhost kernel: [<c0155f29>] try_to_free_pages+0xad/0x1b7 [<c014e8a5>] __alloc_pages+0x136/0x2ec > Jun 9 20:43:40 localhost kernel: [<c016ca3e>] __block_commit_write+0x83/0x8f [<c014be23>] generic_file_buffered_write+0x16b/0x655 > Jun 9 20:43:40 localhost kernel: [<c012a04b>] current_fs_time+0x5a/0x75 [<c012a04b>] current_fs_time+0x5a/0x75 > Jun 9 20:43:40 localhost kernel: [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9 > Jun 9 20:43:40 localhost kernel: [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92 [<c01719e0>] blkdev_file_write+0x0/0x24 > Jun 9 20:43:40 localhost kernel: [<c014c99c>] generic_file_write_nolock+0x85/0x9f [<c0139196>] autoremove_wake_function+0x0/0x37 > Jun 9 20:43:40 localhost kernel: [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a > Jun 9 20:43:40 localhost kernel: [<c0169b55>] sys_write+0x41/0x6a [<c0104035>] syscall_call+0x7/0xb > Jun 9 21:02:58 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3259, f799c864 (Not tainted) > Jun 9 21:02:58 localhost kernel: [<c01f21d7>] __write_lock_debug+0xb4/0xdd [<c01f223f>] _raw_write_lock+0x3f/0x7c > Jun 9 21:02:58 localhost kernel: [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6 > Jun 9 21:02:58 localhost kernel: [<c014a09c>] end_page_writeback+0x76/0x84 [<c016b2c9>] end_buffer_async_write+0xbf/0x12a > Jun 9 21:02:58 
localhost kernel: [<c01f1167>] memmove+0x24/0x2d [<c014d292>] mempool_free+0x3a/0x73 > Jun 9 21:02:58 localhost kernel: [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f > Jun 9 21:02:58 localhost kernel: [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f [<c016f307>] bio_endio+0x3e/0x69 > Jun 9 21:02:58 localhost kernel: [<c01e2e88>] __end_that_request_first+0x101/0x235 [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod] > Jun 9 21:02:58 localhost kernel: [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod] [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod] > Jun 9 21:02:58 localhost kernel: [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod] [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod] > Jun 9 21:02:58 localhost kernel: [<c012a1f2>] __do_softirq+0x72/0xdc [<c0106383>] do_softirq+0x4b/0x4f > Jun 9 21:02:58 localhost kernel: ======================= > Jun 9 21:02:58 localhost kernel: [<c0106265>] do_IRQ+0x55/0x86 [<c0104a7a>] common_interrupt+0x1a/0x20 > Jun 9 21:03:05 localhost kernel: [<c014e387>] free_hot_cold_page+0xd5/0x126 [<c014eaf4>] __pagevec_free+0x1f/0x2e > Jun 9 21:03:08 localhost kernel: [<c0154371>] __pagevec_release_nonlru+0x29/0x8b [<c01553aa>] shrink_list+0x215/0x45d > Jun 9 21:03:09 localhost kernel: [<c01557a3>] shrink_cache+0xe7/0x29b [<c0155db9>] shrink_zone+0x89/0xd8 > Jun 9 21:03:09 localhost kernel: [<c0155e6e>] shrink_caches+0x66/0x74 [<c0155f29>] try_to_free_pages+0xad/0x1b7 > Jun 9 21:03:09 localhost kernel: [<c014e8a5>] __alloc_pages+0x136/0x2ec [<c016ca3e>] __block_commit_write+0x83/0x8f > Jun 9 21:03:10 localhost kernel: [<c014be23>] generic_file_buffered_write+0x16b/0x655 [<c012a04b>] current_fs_time+0x5a/0x75 > Jun 9 21:03:11 localhost kernel: [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9 > Jun 9 21:03:11 localhost kernel: [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92 [<c01719e0>] blkdev_file_write+0x0/0x24 > Jun 9 21:03:12 localhost kernel: [<c014c99c>] 
generic_file_write_nolock+0x85/0x9f [<c0139196>] autoremove_wake_function+0x0/0x37
> Jun 9 21:03:12 localhost kernel: [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a
> Jun 9 21:03:12 localhost kernel: [<c0169b55>] sys_write+0x41/0x6a [<c0104035>] syscall_call+0x7/0xb

Well. Obviously __write_lock_debug():print_once was meant to have static
scope (__read_mostly, too). But that's a cosmetic thing.

I'm suspecting that the debug code has simply gone wrong here - that
there's such a lot of read_lock() traffic happening with this workload
that the debug version of write_lock() simply isn't able to take the
lock.

This might fix it, but it'll break the timing calculations in that loop.

--- devel/lib/spinlock_debug.c~a	2006-06-17 03:08:35.000000000 -0700
+++ devel-akpm/lib/spinlock_debug.c	2006-06-17 03:08:56.000000000 -0700
@@ -219,7 +219,7 @@ static void __write_lock_debug(rwlock_t
 		for (i = 0; i < loops_per_jiffy * HZ; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
+			cpu_relax();
 		}
 		/* lockup suspected: */
 		if (print_once) {
_

(I'm a little surprised that RH-FC ships with this debug option enabled.
It's good for kernel development, but not optimal for users...)

Ingo, we have two bugs in there - some assist and thought would be
appreciated, please.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-17 17:07 ` Andrew Morton
@ 2006-06-19  7:02   ` Ingo Molnar
  2006-06-19  7:59     ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19 7:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Charles C. Bennett, Jr., linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> Well. Obviously __write_lock_debug():print_once was meant to have
> static scope (__read_mostly, too). But that's a cosmetic thing.

no - this is an infinite loop we must not break out of, and we only want
to print the warning once per infinite loop. But we don't want to shut
up these messages globally (so that we get a backtrace on all CPUs that
are looping, etc.).

> I'm suspecting that the debug code has simply gone wrong here - that
> there's such a lot of read_lock() traffic happening with this workload
> that the debug version of write_lock() simply isn't able to take the
> lock.

i'd go for the patch below - the current 1-second timeout might be both
inadequate for really high loads, and it might also be unrobust when
there's some miscalibration of loops_per_jiffy - turning a 1-sec timeout
into a half-sec timeout or so.

Builds/boots here on 32-bit and 64-bit.

	Ingo

-----------
Subject: increase spinlock-debug looping timeouts from 1 sec to 1 min
From: Ingo Molnar <mingo@elte.hu>

increase the loop timeouts from 1 second to 60 seconds. This should also
handle the occasional case of loops_per_jiffy miscalibration, and it
should handle false positives due to high workloads.

(also make sure the right-hand side of the comparison does not overflow
on 32 bits, and make the messages indicate that we don't know for a fact
whether this is due to a lockup or due to some extreme load.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>

---
 lib/spinlock_debug.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -104,7 +104,7 @@ static void __spin_lock_debug(spinlock_t
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * 60 * HZ; i++) {
 			if (__raw_spin_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -112,7 +112,7 @@ static void __spin_lock_debug(spinlock_t
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: spinlock lockup on CPU#%d, "
+			printk(KERN_EMERG "BUG: possible spinlock lockup on CPU#%d, "
 					"%s/%d, %p\n",
 				raw_smp_processor_id(), current->comm,
 				current->pid, lock);
@@ -169,7 +169,7 @@ static void __read_lock_debug(rwlock_t *
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * 60 * HZ; i++) {
 			if (__raw_read_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -177,7 +177,7 @@ static void __read_lock_debug(rwlock_t *
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: read-lock lockup on CPU#%d, "
+			printk(KERN_EMERG "BUG: possible read-lock lockup on CPU#%d, "
 					"%s/%d, %p\n",
 				raw_smp_processor_id(), current->comm,
 				current->pid, lock);
@@ -242,7 +242,7 @@ static void __write_lock_debug(rwlock_t
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * 60 * HZ; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -250,7 +250,7 @@ static void __write_lock_debug(rwlock_t
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: write-lock lockup on CPU#%d, "
+			printk(KERN_EMERG "BUG: possible write-lock lockup on CPU#%d, "
 					"%s/%d, %p\n",
 				raw_smp_processor_id(), current->comm,
 				current->pid, lock);

^ permalink raw
reply [flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  7:02 ` [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min Ingo Molnar
@ 2006-06-19  7:59   ` Andrew Morton
  2006-06-19  8:12     ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-19 7:59 UTC (permalink / raw)
To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 09:02:29 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> Subject: increase spinlock-debug looping timeouts from 1 sec to 1 min

But it's broken. In the non-debug case we subtract RW_LOCK_BIAS so we
know that the writer will get the lock when all readers vacate. But in
the CONFIG_DEBUG_SPINLOCK case we don't do that, with the result that
taking a write_lock can take over a second.

A much, much better fix (which involves visiting all architectures)
would be to subtract RW_LOCK_BIAS and _then_ wait for a second.

OK, it's only debug code. But RH (for one) have decided to ship zillions
of kernels with this debug code turned on.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  7:59 ` Andrew Morton
@ 2006-06-19  8:12   ` Ingo Molnar
  2006-06-19  8:21     ` Ingo Molnar
  2006-06-19  8:32     ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19 8:12 UTC (permalink / raw)
To: Andrew Morton; +Cc: ccb, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 19 Jun 2006 09:02:29 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
>
> > Subject: increase spinlock-debug looping timeouts from 1 sec to 1
> > min
>
> But it's broken. In the non-debug case we subtract RW_LOCK_BIAS so we
> know that the writer will get the lock when all readers vacate. But
> in the CONFIG_DEBUG_SPINLOCK case we don't do that, with the result
> that taking a write_lock can take over a second.
>
> A much, much better fix (which involves visiting all architectures)
> would be to subtract RW_LOCK_BIAS and _then_ wait for a second.

no. Write-locks are unfair too, and there's no guarantee that writers
are listened to. That's why nested read_lock() is valid, while nested
down_read() is invalid.

Take a look at arch/i386/kernel/semaphore.c: __write_lock_failed() just
adds back the RW_LOCK_BIAS and retries in a loop. There's no difference
from an open-coded write_trylock() loop - unless i'm missing something
fundamental.

> OK, it's only debug code. But RH (for one) have decided to ship
> zillions of kernels with this debug code turned on.

yes - Fedora enables most of the transparent kernel debugging options
(slab, lock debugging) in rawhide. The current list is:

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_SLAB_LEAK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_HIGHMEM=y
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_RODATA=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:12 ` Ingo Molnar
@ 2006-06-19  8:21   ` Ingo Molnar
  2006-06-19  8:32   ` Andrew Morton
  1 sibling, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19 8:21 UTC (permalink / raw)
To: Andrew Morton; +Cc: ccb, linux-kernel

* Ingo Molnar <mingo@elte.hu> wrote:

> no. Write-locks are unfair too, and there's no guarantee that writers
> are listened to. That's why nested read_lock() is valid, while nested
> down_read() is invalid.
>
> Take a look at arch/i386/kernel/semaphore.c: __write_lock_failed()
> just adds back the RW_LOCK_BIAS and retries in a loop. There's no
> difference from an open-coded write_trylock() loop - unless i'm
> missing something fundamental.

did i ever mention that i find rwlocks evil, inefficient and bug-prone,
and that we should get rid of them? :-)

(Most rwlock users can be converted to straight spinlocks just fine, but
there are a couple of places that rely on read-lock nesting. The
hardest-to-fix offenders are nested rcu_read_locks() in the netfilter
code. I gave up converting them to saner locking; PREEMPT_RCU works it
around in the -rt tree, by not being rwlock-based.)

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:12 ` Ingo Molnar
  2006-06-19  8:21   ` Ingo Molnar
@ 2006-06-19  8:32   ` Andrew Morton
  2006-06-19  8:35     ` Ingo Molnar
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-19 8:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 10:12:52 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * Andrew Morton <akpm@osdl.org> wrote:
>
> > On Mon, 19 Jun 2006 09:02:29 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> >
> > > Subject: increase spinlock-debug looping timeouts from 1 sec to 1
> > > min
> >
> > But it's broken. In the non-debug case we subtract RW_LOCK_BIAS so we
> > know that the writer will get the lock when all readers vacate. But
> > in the CONFIG_DEBUG_SPINLOCK case we don't do that, with the result
> > that taking a write_lock can take over a second.
> >
> > A much, much better fix (which involves visiting all architectures)
> > would be to subtract RW_LOCK_BIAS and _then_ wait for a second.
>
> no. Write-locks are unfair too, and there's no guarantee that writers
> are listened to. That's why nested read_lock() is valid, while nested
> down_read() is invalid.
>
> Take a look at arch/i386/kernel/semaphore.c: __write_lock_failed()
> just adds back the RW_LOCK_BIAS and retries in a loop. There's no
> difference from an open-coded write_trylock() loop - unless i'm
> missing something fundamental.

OK. That sucks. A sufficiently large machine with the right mix of
latencies will get hit by the NMI watchdog in write_lock_irq().

But presumably the situation is much worse with CONFIG_DEBUG_SPINLOCK
because of that __delay().

So how about we remove the __delay() - which is wrong anyway, because
loops_per_jiffy isn't calculated with a write_trylock() in the loop,
which means we're getting scarily close to the NMI watchdog at present.

Instead, calculate a custom loops_per_jiffy for this purpose in
lib/spinlock_debug.c?
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:32 ` Andrew Morton
@ 2006-06-19  8:35 ` Ingo Molnar
  2006-06-19  9:13 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19  8:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: ccb, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> OK.  That sucks.  A sufficiently large machine with the right mix of
> latencies will get hit by the NMI watchdog in write_lock_irq().
>
> But presumably the situation is much worse with CONFIG_DEBUG_SPINLOCK
> because of that __delay().
>
> So how about we remove the __delay() (which is wrong anyway, because
> loops_per_jiffy isn't calculated with a write_trylock() in the loop
> (which means we're getting scarily close to the NMI watchdog at
> present)).
>
> Instead, calculate a custom loops_per_jiffy for this purpose in
> lib/spinlock_debug.c?

hm, that would be yet another calibration loop with the potential to be
wrong (and which would slow down the bootup process). If loops_per_jiffy
is wrong then our timings are toast anyway.

I think increasing the timeout to 60 secs ought to be enough - 1 sec was
a bit too close to valid delays and i can imagine really high loads
causing 1 sec delays (especially if something like SysRq-T is holding
the tasklist_lock for long).

The write_trylock + __delay in the loop is not a problem or a bug, as
the trylock will at most _increase_ the delay - and our goal is to not
have a false positive, not to be absolutely accurate about the
measurement here.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:35 ` Ingo Molnar
@ 2006-06-19  9:13 ` Andrew Morton
  2006-06-19 11:39 ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-19  9:13 UTC (permalink / raw)
To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 10:35:18 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * Andrew Morton <akpm@osdl.org> wrote:
>
> > OK.  That sucks.  A sufficiently large machine with the right mix of
> > latencies will get hit by the NMI watchdog in write_lock_irq().
> >
> > But presumably the situation is much worse with CONFIG_DEBUG_SPINLOCK
> > because of that __delay().
> >
> > So how about we remove the __delay() (which is wrong anyway, because
> > loops_per_jiffy isn't calculated with a write_trylock() in the loop
> > (which means we're getting scarily close to the NMI watchdog at
> > present)).
> >
> > Instead, calculate a custom loops_per_jiffy for this purpose in
> > lib/spinlock_debug.c?
>
> hm, that would be yet another calibration loop with the potential to be
> wrong (and which would slow down the bootup process).

We should be able to do it in two jiffies, max.

> If loops_per_jiffy is wrong then our timings are toast anyway.

loops_per_jiffy measures how long we can run __delay(), which is
basically two rdtsc's (x86_64).  But in __write_lock_debug() we're also
doing a failed write_trylock(), so it'll take more than a second.
Perhaps a lot more, depending on relative speeds, and whether the lock
is exclusive-owned in another CPU (it often will be).

... which means that calculating loops_per_jiffy_for_write_lock_debug
will be hard - need to have N other CPUs simultaneously hammering the
lock for reading (!)

Maybe we just shouldn't enable this code if the CPU doesn't have an
rdtsc-like thingy we can get at.

> I think increasing the timeout to 60 secs ought to be enough - 1 sec was
> a bit too close to valid delays and i can imagine really high loads
> causing 1 sec delays (especially if something like SysRq-T is holding
> the tasklist_lock for long).

No way can _this_ workload be taking the lock for anything like that
amount of time.  We have

- one CPU taking the lock for reading, looking to see if a page is in
  pagecache.

- the same CPU taking the lock for writing, adding pages to pagecache.
  This is infrequent on average because we need to modify the contents
  of each page.

- one CPU perhaps taking the lock for writing running page reclaim at
  the same average rate as the rate of pagecache consumption.  Also
  slow.

Each of these code paths are really short, and relatively infrequent.
There's just no way that a write_lock on tree_lock should be taking a
second (more than a second, but we don't know how much more).

Surely something is busted.

> The write_trylock + __delay in the loop is not a problem or a bug, as
> the trylock will at most _increase_ the delay - and our goal is to not
> have a false positive, not to be absolutely accurate about the
> measurement here.

Precisely.  We have delays of over a second (but we don't know how much
more than a second).  Let's say two seconds.  The NMI watchdog timeout
is, what?  Five seconds?  That's getting too close.  The result will be
a total system crash.  And RH are shipping this.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  9:13 ` Andrew Morton
@ 2006-06-19 11:39 ` Ingo Molnar
  2006-06-19 19:55 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19 11:39 UTC (permalink / raw)
To: Andrew Morton; +Cc: ccb, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> > The write_trylock + __delay in the loop is not a problem or a bug, as
> > the trylock will at most _increase_ the delay - and our goal is to not
> > have a false positive, not to be absolutely accurate about the
> > measurement here.
>
> Precisely.  We have delays of over a second (but we don't know how
> much more than a second).  Let's say two seconds.  The NMI watchdog
> timeout is, what?  Five seconds?

i dont see the problem. We'll have tried that lock hundreds of thousands
of times before this happens. The NMI watchdog will only trigger if we
do this with IRQs disabled. And it's not like the normal
__write_lock_failed codepath would be any different: for heavily
contended workloads the overhead is likely in the cacheline bouncing,
not in the __delay().

> That's getting too close.  The result will be a total system crash.
> And RH are shipping this.

I dont see a connection. Pretty much the only thing the loop condition
impacts is the condition under which we print out a 'i think we
deadlocked' message.

Have i missed your point perhaps?

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19 11:39 ` Ingo Molnar
@ 2006-06-19 19:55 ` Andrew Morton
  2006-06-20  8:06 ` Arjan van de Ven
  2006-06-20  8:40 ` [patch] fix spinlock-debug looping Ingo Molnar
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2006-06-19 19:55 UTC (permalink / raw)
To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 13:39:44 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * Andrew Morton <akpm@osdl.org> wrote:
>
> > > The write_trylock + __delay in the loop is not a problem or a bug, as
> > > the trylock will at most _increase_ the delay - and our goal is to not
> > > have a false positive, not to be absolutely accurate about the
> > > measurement here.
> >
> > Precisely.  We have delays of over a second (but we don't know how
> > much more than a second).  Let's say two seconds.  The NMI watchdog
> > timeout is, what?  Five seconds?
>
> i dont see the problem.

It's taking over a second to acquire a write_lock.  A lock which is
unlikely to be held for more than a microsecond anywhere.  That's
really bad, isn't it?  Being on the edge of an NMI watchdog induced
system crash is bad, too.

> We'll have tried that lock hundreds of thousands
> of times before this happens. The NMI watchdog will only trigger if we
> do this with IRQs disabled.

tree_lock uses write_lock_irq().

> And it's not like the normal
> __write_lock_failed codepath would be any different: for heavily
> contended workloads the overhead is likely in the cacheline bouncing,
> not in the __delay().

Yes, it might also happen with !CONFIG_DEBUG_SPINLOCK.  We need to find
out if that's so and if so, why.

> > That's getting too close.  The result will be a total system crash.
> > And RH are shipping this.
>
> I dont see a connection. Pretty much the only thing the loop condition
> impacts is the condition under which we print out a 'i think we
> deadlocked' message.

I'm assuming that the additional delay in the debug code has worsened
the situation.

> Have i missed your point perhaps?

I get that impression ;)  If it takes 1-2 seconds to get this lock then
it can take five seconds.  a) that's just gross and b) the NMI watchdog
will nuke the box.

Why is it taking so long to get the lock?  Does it happen in non-debug
mode?  What do we do about it?

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19 19:55 ` Andrew Morton
@ 2006-06-20  8:06 ` Arjan van de Ven
  2006-06-20  8:40 ` [patch] fix spinlock-debug looping Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Arjan van de Ven @ 2006-06-20  8:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ingo Molnar, ccb, linux-kernel

On Mon, 2006-06-19 at 12:55 -0700, Andrew Morton wrote:

> I get that impression ;)  If it takes 1-2 seconds to get this lock then it
> can take five seconds.  a) that's just gross and b) the NMI watchdog will
> nuke the box.
>
> Why is it taking so long to get the lock?

on a high level, this is a "feature" of unfair read/write locks in
general. Just have enough processors and a lock that can be taken for
read as a result from a userspace action, and BLAM you have this as a
user triggerable condition... which is effectively a DoS security
issue.

> What do we do about it?

read/write spinlocks are effectively silly. There is NO scalability
advantage in practice (well, there would be if you have really long
hold times, which you don't for spinlocks; rwsems are obviously
different there).

This may sound counter intuitive.... but nowadays the scalability cost
for taking a lock is in the cache miss for bouncing the cacheline of
the lock to your cpu for write. (once this is done the actual locked
cycle tends to be free or near free) In this cache bounce aspect,
read/write spinlocks are no better than normal spinlocks. In fact, in
the ">1 concurrent readers" case, they are *worse* than normal
spinlocks!

So I'm entirely on Ingo's side with the "we should get rid of rw
spinlocks as much as we can" proposal; they make no sense whatsoever
for scalability, the only reason they are useful is for recursion...

(well there is a border line case with the rare writer in user context
but the normal users for read in irq context, but even there the
scalability is horrible and this should consider rcu or per-cpu data
realistically)

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [patch] fix spinlock-debug looping
  2006-06-19 19:55 ` Andrew Morton
  2006-06-20  8:06 ` Arjan van de Ven
@ 2006-06-20  8:40 ` Ingo Molnar
  2006-06-20  8:52 ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-20  8:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: ccb, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> > > > The write_trylock + __delay in the loop is not a problem or a bug, as
> > > > the trylock will at most _increase_ the delay - and our goal is to not
> > > > have a false positive, not to be absolutely accurate about the
> > > > measurement here.
> > >
> > > Precisely.  We have delays of over a second (but we don't know how
> > > much more than a second).  Let's say two seconds.  The NMI watchdog
> > > timeout is, what?  Five seconds?
> >
> > i dont see the problem.
>
> It's taking over a second to acquire a write_lock.  A lock which is
> unlikely to be held for more than a microsecond anywhere.  That's
> really bad, isn't it?  Being on the edge of an NMI watchdog induced
> system crash is bad, too.

i obviously agree that any such crash is a serious problem, but is it
caused by the spinlock-debugging code? I doubt it is, unless __delay()
is seriously buggered.

in any case, to move this problem forward i'd suggest we go with the
patch below in -mm and wait for feedback. It fixes a potential overflow
in the comparison (if HZ*lpj overflows 32-bits). That needs a really
fast box to run the 32-bit kernel though, so i doubt this is the cause
of the problems. In any case, this change makes it easier to increase
the looping timeout from 1 second to 10 seconds later on or so - at
which point the overflow can happen for real and must be handled.

	Ingo

------------
Subject: fix spinlock-debug looping
From: Ingo Molnar <mingo@elte.hu>

make sure the right hand side of the comparison does not overflow on
32-bits. Also print out more info when detecting a lockup, so that we
see how many times the code tried (and failed) to get the lock.
Signed-off-by: Ingo Molnar <mingo@elte.hu>

---
 lib/spinlock_debug.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -104,7 +104,7 @@ static void __spin_lock_debug(spinlock_t
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * HZ; i++) {
 			if (__raw_spin_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -112,10 +112,10 @@ static void __spin_lock_debug(spinlock_t
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: spinlock lockup on CPU#%d, "
-					"%s/%d, %p\n",
+			printk(KERN_EMERG "BUG: possible spinlock lockup on CPU#%d, "
+					"%s/%d, %p [%Ld/%ld]\n",
 				raw_smp_processor_id(), current->comm,
-				current->pid, lock);
+				current->pid, lock, i, loops_per_jiffy);
 			dump_stack();
 		}
 	}
@@ -169,7 +169,7 @@ static void __read_lock_debug(rwlock_t *
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * HZ; i++) {
 			if (__raw_read_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -177,10 +177,10 @@ static void __read_lock_debug(rwlock_t *
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: read-lock lockup on CPU#%d, "
-					"%s/%d, %p\n",
+			printk(KERN_EMERG "BUG: possible read-lock lockup on CPU#%d, "
+					"%s/%d, %p [%Ld/%ld]\n",
 				raw_smp_processor_id(), current->comm,
-				current->pid, lock);
+				current->pid, lock, i, loops_per_jiffy);
 			dump_stack();
 		}
 	}
@@ -242,7 +242,7 @@ static void __write_lock_debug(rwlock_t
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * HZ; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -250,10 +250,10 @@ static void __write_lock_debug(rwlock_t
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: write-lock lockup on CPU#%d, "
-					"%s/%d, %p\n",
+			printk(KERN_EMERG "BUG: possible write-lock lockup on CPU#%d, "
+					"%s/%d, %p [%Ld/%ld]\n",
 				raw_smp_processor_id(), current->comm,
-				current->pid, lock);
+				current->pid, lock, i, loops_per_jiffy);
 			dump_stack();
 		}
 	}

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] fix spinlock-debug looping
  2006-06-20  8:40 ` [patch] fix spinlock-debug looping Ingo Molnar
@ 2006-06-20  8:52 ` Andrew Morton
  2006-06-20  9:15 ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-20  8:52 UTC (permalink / raw)
To: Ingo Molnar; +Cc: ccb, linux-kernel

On Tue, 20 Jun 2006 10:40:01 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> i obviously agree that any such crash is a serious problem, but is it
> caused by the spinlock-debugging code?

Judging from Dave Olson <olson@unixfolk.com>'s report: no.  He's
getting hit by NMI watchdog expiry on write_lock(tree_lock) in a
!CONFIG_DEBUG_SPINLOCK kernel.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] fix spinlock-debug looping
  2006-06-20  8:52 ` Andrew Morton
@ 2006-06-20  9:15 ` Ingo Molnar
  2006-06-20  9:32 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-20  9:15 UTC (permalink / raw)
To: Andrew Morton; +Cc: ccb, linux-kernel

* Andrew Morton <akpm@osdl.org> wrote:

> On Tue, 20 Jun 2006 10:40:01 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
>
> > i obviously agree that any such crash is a serious problem, but is
> > it caused by the spinlock-debugging code?
>
> Judging from Dave Olson <olson@unixfolk.com>'s report: no.  He's
> getting hit by NMI watchdog expiry on write_lock(tree_lock) in a
> !CONFIG_DEBUG_SPINLOCK kernel.

hm, that means 5 seconds of looping with irqs off? That's really
insane.

Is there any definitive testcase or testsystem where we could try a
simple tree_lock rwlock -> spinlock conversion? Spinlocks are alot
fairer. Or as a simple experiment, s/read_lock/write_lock, as the patch
below (against rc6-mm2) does. This is phase #1, if it works out we can
switch tree_lock to a spinlock. [write_lock()s are roughly as fair to
each other as spinlocks - it's a bit more expensive but not
significantly] Builds & boots fine here.
	Ingo

---
 drivers/mtd/devices/block2mtd.c        |    8 ++++----
 fs/reiser4/plugin/file/cryptcompress.c |    8 ++++----
 fs/reiser4/plugin/file/file.c          |   12 ++++++------
 mm/filemap.c                           |   32 ++++++++++++++++----------------
 mm/page-writeback.c                    |    4 ++--
 mm/readahead.c                         |   22 +++++++++++-----------
 mm/swap_prefetch.c                     |    4 ++--
 7 files changed, 45 insertions(+), 45 deletions(-)

Index: linux/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux.orig/drivers/mtd/devices/block2mtd.c
+++ linux/drivers/mtd/devices/block2mtd.c
@@ -59,7 +59,7 @@ static void cache_readahead(struct addre
 
 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < PAGE_READAHEAD; i++) {
 		pagei = index + i;
 		if (pagei > end_index) {
@@ -71,16 +71,16 @@ static void cache_readahead(struct addre
 			break;
 		if (page)
 			continue;
-		read_unlock_irq(&mapping->tree_lock);
+		write_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
+		write_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = pagei;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	if (ret)
 		read_cache_pages(mapping, &page_pool, filler, NULL);
 }
Index: linux/fs/reiser4/plugin/file/cryptcompress.c
===================================================================
--- linux.orig/fs/reiser4/plugin/file/cryptcompress.c
+++ linux/fs/reiser4/plugin/file/cryptcompress.c
@@ -3413,7 +3413,7 @@ static void clear_moved_tag_cluster(stru
 {
 	int i;
 	void * ret;
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < clust->nr_pages; i++) {
 		assert("edward-1438", clust->pages[i] != NULL);
 		ret = radix_tree_tag_clear(&mapping->page_tree,
@@ -3421,7 +3421,7 @@ static void clear_moved_tag_cluster(stru
 				PAGECACHE_TAG_REISER4_MOVED);
 		assert("edward-1439", ret == clust->pages[i]);
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 }
 
 /* Capture an anonymous pager cluster. (Page cluser is
@@ -3446,11 +3446,11 @@ capture_page_cluster(reiser4_cluster_t *
 	if (unlikely(result)) {
 		/* set cleared tag back, so it will be possible to
 		   capture it again later */
-		read_lock_irq(&inode->i_mapping->tree_lock);
+		write_lock_irq(&inode->i_mapping->tree_lock);
 		radix_tree_tag_set(&inode->i_mapping->page_tree,
 				   clust_to_pg(clust->index, inode),
 				   PAGECACHE_TAG_REISER4_MOVED);
-		read_unlock_irq(&inode->i_mapping->tree_lock);
+		write_unlock_irq(&inode->i_mapping->tree_lock);
 		release_cluster_pages_and_jnode(clust);
 	}
Index: linux/fs/reiser4/plugin/file/file.c
===================================================================
--- linux.orig/fs/reiser4/plugin/file/file.c
+++ linux/fs/reiser4/plugin/file/file.c
@@ -832,9 +832,9 @@ static int has_anonymous_pages(struct in
 {
 	int result;
 
-	read_lock_irq(&inode->i_mapping->tree_lock);
+	write_lock_irq(&inode->i_mapping->tree_lock);
 	result = radix_tree_tagged(&inode->i_mapping->page_tree, PAGECACHE_TAG_REISER4_MOVED);
-	read_unlock_irq(&inode->i_mapping->tree_lock);
+	write_unlock_irq(&inode->i_mapping->tree_lock);
 	return result;
 }
 
@@ -1124,7 +1124,7 @@ static int sync_page_list(struct inode *
 	mapping = inode->i_mapping;
 	from = 0;
 	result = 0;
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	while (result == 0) {
 		struct page *page;
 
@@ -1138,17 +1138,17 @@ static int sync_page_list(struct inode *
 		/* page may not leave radix tree because it is
 		   protected from truncating by inode->i_mutex locked
 		   by sys_fsync */
 		page_cache_get(page);
-		read_unlock_irq(&mapping->tree_lock);
+		write_unlock_irq(&mapping->tree_lock);
 
 		from = page->index + 1;
 
 		result = sync_page(page);
 
 		page_cache_release(page);
-		read_lock_irq(&mapping->tree_lock);
+		write_lock_irq(&mapping->tree_lock);
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	return result;
 }
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c
+++ linux/mm/filemap.c
@@ -568,9 +568,9 @@ int probe_page(struct address_space *map
 {
 	int exists;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	exists = __probe_page(mapping, offset);
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return exists;
 }
@@ -583,11 +583,11 @@ struct page * find_get_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page)
 		page_cache_get(page);
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return page;
 }
@@ -600,11 +600,11 @@ struct page *find_trylock_page(struct ad
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page && TestSetPageLocked(page))
 		page = NULL;
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return page;
 }
@@ -626,15 +626,15 @@ struct page *find_lock_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
-			read_unlock_irq(&mapping->tree_lock);
+			write_unlock_irq(&mapping->tree_lock);
 			__lock_page(page);
-			read_lock_irq(&mapping->tree_lock);
+			write_lock_irq(&mapping->tree_lock);
 
 			/* Has the page been truncated while we slept? */
 			if (unlikely(page->mapping != mapping ||
@@ -645,7 +645,7 @@ repeat:
 		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return page;
 }
@@ -718,12 +718,12 @@ unsigned find_get_pages(struct address_s
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 EXPORT_SYMBOL(find_get_pages);
@@ -746,7 +746,7 @@ unsigned find_get_pages_contig(struct ad
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
@@ -756,7 +756,7 @@ unsigned find_get_pages_contig(struct ad
 		page_cache_get(pages[i]);
 		index++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return i;
 }
@@ -770,14 +770,14 @@ unsigned find_get_pages_tag(struct addre
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages,
 				*index, nr_pages, tag);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 EXPORT_SYMBOL(find_get_pages_tag);
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -800,9 +800,9 @@ int mapping_tagged(struct address_space
 	unsigned long flags;
 	int ret;
 
-	read_lock_irqsave(&mapping->tree_lock, flags);
+	write_lock_irqsave(&mapping->tree_lock, flags);
 	ret = radix_tree_tagged(&mapping->page_tree, tag);
-	read_unlock_irqrestore(&mapping->tree_lock, flags);
+	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL(mapping_tagged);
Index: linux/mm/readahead.c
===================================================================
--- linux.orig/mm/readahead.c
+++ linux/mm/readahead.c
@@ -398,7 +398,7 @@ __do_page_cache_readahead(struct address
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
 
@@ -409,10 +409,10 @@ __do_page_cache_readahead(struct address
 		if (page)
 			continue;
 
-		read_unlock_irq(&mapping->tree_lock);
+		write_unlock_irq(&mapping->tree_lock);
 		cond_resched();
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
+		write_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
@@ -421,7 +421,7 @@ __do_page_cache_readahead(struct address
 			SetPageReadahead(page);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not
@@ -1318,7 +1318,7 @@ static pgoff_t find_segtail(struct addre
 	pgoff_t ra_index;
 
 	cond_resched();
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
 #ifdef DEBUG_READAHEAD_RADIXTREE
 	BUG_ON(!__probe_page(mapping, index));
@@ -1330,7 +1330,7 @@ static pgoff_t find_segtail(struct addre
 	if (ra_index != ~0UL && ra_index - index < max_scan)
 		WARN_ON(__probe_page(mapping, ra_index));
 #endif
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	if (ra_index <= index + max_scan)
 		return ra_index;
@@ -1353,13 +1353,13 @@ static pgoff_t find_segtail_backward(str
 	 * Poor man's radix_tree_scan_data_backward() implementation.
	 * Acceptable because max_scan won't be large.
 	 */
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (; origin - index < max_scan;)
 		if (__probe_page(mapping, --index)) {
-			read_unlock_irq(&mapping->tree_lock);
+			write_unlock_irq(&mapping->tree_lock);
 			return index + 1;
 		}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	return 0;
 }
@@ -1410,7 +1410,7 @@ static unsigned long query_page_cache_se
 	 * The count here determines ra_size.
 	 */
 	cond_resched();
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	index = radix_tree_scan_hole_backward(&mapping->page_tree,
 						offset - 1, ra_max);
 #ifdef DEBUG_READAHEAD_RADIXTREE
@@ -1452,7 +1452,7 @@ static unsigned long query_page_cache_se
 		break;
 
 out_unlock:
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	/*
	 * For sequential read that extends from index 0, the counted value
Index: linux/mm/swap_prefetch.c
===================================================================
--- linux.orig/mm/swap_prefetch.c
+++ linux/mm/swap_prefetch.c
@@ -189,10 +189,10 @@ static enum trickle_return trickle_swap_
 	enum trickle_return ret = TRICKLE_FAILED;
 	struct page *page;
 
-	read_lock_irq(&swapper_space.tree_lock);
+	write_lock_irq(&swapper_space.tree_lock);
 	/* Entry may already exist */
 	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
-	read_unlock_irq(&swapper_space.tree_lock);
+	write_unlock_irq(&swapper_space.tree_lock);
 
 	if (page) {
 		remove_from_swapped_list(entry.val);
		goto out;

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] fix spinlock-debug looping
  2006-06-20  9:15 ` Ingo Molnar
@ 2006-06-20  9:32 ` Andrew Morton
  2006-06-20  9:34 ` Ingo Molnar
  2006-06-20 16:02 ` Dave Olson
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2006-06-20  9:32 UTC (permalink / raw)
To: Ingo Molnar; +Cc: ccb, linux-kernel, Dave Olson

On Tue, 20 Jun 2006 11:15:05 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * Andrew Morton <akpm@osdl.org> wrote:
>
> > On Tue, 20 Jun 2006 10:40:01 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> >
> > > i obviously agree that any such crash is a serious problem, but is
> > > it caused by the spinlock-debugging code?
> >
> > Judging from Dave Olson <olson@unixfolk.com>'s report: no.  He's
> > getting hit by NMI watchdog expiry on write_lock(tree_lock) in a
> > !CONFIG_DEBUG_SPINLOCK kernel.
>
> hm, that means 5 seconds of looping with irqs off?

Yup.

> That's really insane.

Yup.

> Is there any definitive testcase or testsystem where we could try a
> simple tree_lock rwlock -> spinlock conversion?

Not that I'm aware of.  I just tried three CPUs doing
fadvise(FADV_WILLNEED, 1GB) with the fourth CPU trying to write the
file, but it didn't lock up as I expected.

> Spinlocks are alot fairer.  Or as a simple experiment,
> s/read_lock/write_lock, as the patch below (against rc6-mm2) does.
> This is phase #1, if it works out we can switch tree_lock to a
> spinlock.  [write_lock()s are roughly as fair to each other as
> spinlocks - it's a bit more expensive but not significantly]  Builds
> & boots fine here.

tree_lock was initially an rwlock.  Then we made it a spinlock.  Then we
made it an rwlock.  We change the dang thing so often we should make it
a macro ;)

Let's just make it a spinlock and be done with it.  Hopefully Dave or
ccb@acm.org (?) will be able to test it.  I was planning on doing a
patch tomorrowish.

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] fix spinlock-debug looping
  2006-06-20  9:32 ` Andrew Morton
@ 2006-06-20  9:34 ` Ingo Molnar
  0 siblings, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2006-06-20  9:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: ccb, linux-kernel, Dave Olson

* Andrew Morton <akpm@osdl.org> wrote:

> > Spinlocks are alot fairer.  Or as a simple experiment,
> > s/read_lock/write_lock, as the patch below (against rc6-mm2) does.
> > This is phase #1, if it works out we can switch tree_lock to a
> > spinlock.  [write_lock()s are roughly as fair to each other as
> > spinlocks - it's a bit more expensive but not significantly]  Builds
> > & boots fine here.
>
> tree_lock was initially an rwlock.  Then we made it a spinlock.  Then
> we made it an rwlock.  We change the dang thing so often we should
> make it a macro ;)

ha! In -rt we can change types of locks by changing the type definition
and the declaration only ;-) [It makes for some confusing reading though
if done without restraint]

> Let's just make it a spinlock and be done with it.  Hopefully Dave or
> ccb@acm.org (?) will be able to test it.  I was planning on doing a
> patch tomorrowish.

ok. Until that happens the patch i sent can be used for testing.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [patch] fix spinlock-debug looping
  2006-06-20  9:32 ` Andrew Morton
  2006-06-20  9:34 ` Ingo Molnar
@ 2006-06-20 16:02 ` Dave Olson
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Olson @ 2006-06-20 16:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: Ingo Molnar, ccb, linux-kernel

On Tue, 20 Jun 2006, Andrew Morton wrote:

| Let's just make it a spinlock and be done with it.  Hopefully Dave or
| ccb@acm.org (?) will be able to test it.  I was planning on doing a patch
| tomorrowish.

Our test case is extremely reproducible in about a minute, so I should
be able to give a pretty conclusive answer as to whether the patch works
for our case.

Dave Olson
olson@unixfolk.com         http://www.unixfolk.com/dave

^ permalink raw reply	[flat|nested] 18+ messages in thread
end of thread, other threads:[~2006-06-20 16:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-12 19:53 BUG: write-lock lockup Charles C. Bennett, Jr.
2006-06-17 17:07 ` Andrew Morton
2006-06-19  7:02   ` [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min Ingo Molnar
2006-06-19  7:59     ` Andrew Morton
2006-06-19  8:12       ` Ingo Molnar
2006-06-19  8:21         ` Ingo Molnar
2006-06-19  8:32         ` Andrew Morton
2006-06-19  8:35           ` Ingo Molnar
2006-06-19  9:13             ` Andrew Morton
2006-06-19 11:39               ` Ingo Molnar
2006-06-19 19:55                 ` Andrew Morton
2006-06-20  8:06                   ` Arjan van de Ven
2006-06-20  8:40                   ` [patch] fix spinlock-debug looping Ingo Molnar
2006-06-20  8:52                     ` Andrew Morton
2006-06-20  9:15                       ` Ingo Molnar
2006-06-20  9:32                         ` Andrew Morton
2006-06-20  9:34                           ` Ingo Molnar
2006-06-20 16:02                           ` Dave Olson