public inbox for linux-kernel@vger.kernel.org
* BUG: write-lock lockup
@ 2006-06-12 19:53 Charles C. Bennett, Jr.
  2006-06-17 17:07 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Charles C. Bennett, Jr. @ 2006-06-12 19:53 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 622 bytes --]


Hi All -

	I'm seeing write-lock lockups...  this is happening with
the Fedora kernels from at least as far back as their 2.6.15-1.1833_FC4
(2.6.15.5) and as recently as their 2.6.16-1.2111_FC4 (2.6.16.17).

It's a beefy box:
 Gateway 9515R - Intel ICH5/ICH5R
 Two dual-core 3.0 GHz Xeons, 4GB RAM
 All disks via Emulex LP101-H (thor), switched fabric to
  Hitachi WMS SAN storage.

The lockups are not process-specific.  Running mke2fs on large
filesystems seems to make it happen sooner rather than later.

I can fish out any other data you need, run tests, etc.

Thanks,
ccb


-- 
Charles C. Bennett, Jr. <ccb@acm.org>

[-- Attachment #2: lockup --]
[-- Type: text/plain, Size: 16256 bytes --]


Jun  9 20:43:40 localhost kernel: BUG: write-lock lockup on CPU#6, mkfs.ext2/3258, f799c864 (Not tainted)
Jun  9 20:43:40 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun  9 20:43:40 localhost kernel:  [<c015532c>] shrink_list+0x197/0x45d     [<c01557a3>] shrink_cache+0xe7/0x29b
Jun  9 20:43:40 localhost kernel:  [<c014d955>] bad_range+0x22/0x2f     [<c014dfe3>] __rmqueue+0xd1/0x156
Jun  9 20:43:40 localhost kernel:  [<c0155db9>] shrink_zone+0x89/0xd8     [<c0155e6e>] shrink_caches+0x66/0x74
Jun  9 20:43:40 localhost kernel:  [<c0155f29>] try_to_free_pages+0xad/0x1b7  [<c014e8a5>] __alloc_pages+0x136/0x2ec
Jun  9 20:43:40 localhost kernel:  [<c016ca3e>] __block_commit_write+0x83/0x8f    [<c014be23>] generic_file_buffered_write+0x16b/0x655
Jun  9 20:43:40 localhost kernel:  [<c012a04b>] current_fs_time+0x5a/0x75     [<c012a04b>] current_fs_time+0x5a/0x75
Jun  9 20:43:40 localhost kernel:  [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9
Jun  9 20:43:40 localhost kernel:  [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92     [<c01719e0>] blkdev_file_write+0x0/0x24
Jun  9 20:43:40 localhost kernel:  [<c014c99c>] generic_file_write_nolock+0x85/0x9f     [<c0139196>] autoremove_wake_function+0x0/0x37
Jun  9 20:43:40 localhost kernel:  [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a
Jun  9 20:43:40 localhost kernel:  [<c0169b55>] sys_write+0x41/0x6a     [<c0104035>] syscall_call+0x7/0xb
Jun  9 21:02:58 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3259, f799c864 (Not tainted)
Jun  9 21:02:58 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun  9 21:02:58 localhost kernel:  [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6
Jun  9 21:02:58 localhost kernel:  [<c014a09c>] end_page_writeback+0x76/0x84  [<c016b2c9>] end_buffer_async_write+0xbf/0x12a
Jun  9 21:02:58 localhost kernel:  [<c01f1167>] memmove+0x24/0x2d     [<c014d292>] mempool_free+0x3a/0x73
Jun  9 21:02:58 localhost kernel:  [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f
Jun  9 21:02:58 localhost kernel:  [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f  [<c016f307>] bio_endio+0x3e/0x69
Jun  9 21:02:58 localhost kernel:  [<c01e2e88>] __end_that_request_first+0x101/0x235     [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]
Jun  9 21:02:58 localhost kernel:  [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod]     [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod]
Jun  9 21:02:58 localhost kernel:  [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod]     [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod]
Jun  9 21:02:58 localhost kernel:  [<c012a1f2>] __do_softirq+0x72/0xdc     [<c0106383>] do_softirq+0x4b/0x4f
Jun  9 21:02:58 localhost kernel:  =======================
Jun  9 21:02:58 localhost kernel:  [<c0106265>] do_IRQ+0x55/0x86     [<c0104a7a>] common_interrupt+0x1a/0x20
Jun  9 21:03:05 localhost kernel:  [<c014e387>] free_hot_cold_page+0xd5/0x126   [<c014eaf4>] __pagevec_free+0x1f/0x2e
Jun  9 21:03:08 localhost kernel:  [<c0154371>] __pagevec_release_nonlru+0x29/0x8b     [<c01553aa>] shrink_list+0x215/0x45d
Jun  9 21:03:09 localhost kernel:  [<c01557a3>] shrink_cache+0xe7/0x29b     [<c0155db9>] shrink_zone+0x89/0xd8
Jun  9 21:03:09 localhost kernel:  [<c0155e6e>] shrink_caches+0x66/0x74     [<c0155f29>] try_to_free_pages+0xad/0x1b7
Jun  9 21:03:09 localhost kernel:  [<c014e8a5>] __alloc_pages+0x136/0x2ec     [<c016ca3e>] __block_commit_write+0x83/0x8f
Jun  9 21:03:10 localhost kernel:  [<c014be23>] generic_file_buffered_write+0x16b/0x655     [<c012a04b>] current_fs_time+0x5a/0x75
Jun  9 21:03:11 localhost kernel:  [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9
Jun  9 21:03:11 localhost kernel:  [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92     [<c01719e0>] blkdev_file_write+0x0/0x24
Jun  9 21:03:12 localhost kernel:  [<c014c99c>] generic_file_write_nolock+0x85/0x9f     [<c0139196>] autoremove_wake_function+0x0/0x37
Jun  9 21:03:12 localhost kernel:  [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a
Jun  9 21:03:12 localhost kernel:  [<c0169b55>] sys_write+0x41/0x6a     [<c0104035>] syscall_call+0x7/0xb
Jun  9 21:28:07 localhost mgetty[3159]: failed dev=ttyS0, pid=3159, login time out
Jun  9 21:42:31 localhost kernel: BUG: write-lock lockup on CPU#3, mkfs.ext2/3257, f5e0ef24 (Not tainted)
Jun  9 21:42:31 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3259, f5e0ef24 (Not tainted)
Jun  9 21:42:31 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun  9 21:42:31 localhost kernel:  [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6
Jun  9 21:42:31 localhost kernel:  [<c014a09c>] end_page_writeback+0x76/0x84  [<c016b2c9>] end_buffer_async_write+0xbf/0x12a
Jun  9 21:42:31 localhost kernel:  [<c014d292>] mempool_free+0x3a/0x73     [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f
Jun  9 21:42:31 localhost kernel:  [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f
Jun  9 21:42:31 localhost kernel:  [<c016f307>] bio_endio+0x3e/0x69     [<c0120685>] __wake_up+0x32/0x43
Jun  9 21:42:31 localhost kernel:  [<c01e2e88>] __end_that_request_first+0x101/0x235     [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]
Jun  9 21:42:31 localhost kernel:  [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod]     [<f896681c>] e1000_clean_tx_irq+0x76/0x62d [e1000]alhost kernel:  [<c016f307>] bio_endio+0x3e/0x69     [<c0120685>] __wake_up+0x32/0x43
Jun  9 21:42:48 localhost kernel:  [<c01e2e88>] __end_that_request_first+0x101/0x235     [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]
Jun  9 21:42:48 localhost kernel:  [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod]     [<f896681c>] e1000_clean_tx_irq+0x76/0x62d [e1000]
Jun  9 21:42:48 localhost kernel:  [<c01391ab>] autoremove_wake_function+0x15/0x37     [<f8833c97>] sd_rw_intr+0x70/04 (Not tainted)
Jun  9 23:34:03 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun  9 23:34:03 localhost kernel:  [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015060d>] test_clear_page_dirty+0x2a/0xa9
Jun  9 23:34:03 localhost kernel:  [<c016df8f>] try_to_free_buffers+0_file_aio_write_nolock+0x2a3/0x4d9     [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92
Jun 10 00:53:26 localhost kernel:  [<c01719e0>] blkdev_file_write+0x0/0x24     [<c014c99c>] generic_file_write_nolock+0x85/0x9f
Jun 10 00:53:30 localhost kernel:  [<c0139196>] autoremove_wake_function+0x0/0x37     [<c0171a00>] blkdev_file_write+0x20/0x24
Jun 10 00:53:33 localhost kernel:  [<c01699f2>] vfs_write+0xa2/0x15a     [<c0169b55>] sys_write+0x41/0x6a
Jun 10 00:53:36 localhost kernel:  [<c0104035>] syscall_call+0x7/0xb
Jun 10 01:03:29 localhost kernel: BUG: write-lock lockup on CPU#4, mkfs.exthe+0xe7/0x29b [<c01f21d7>]
Jun 10 01:05:21 localhost kernel:  __write_lock_debug+0xb4/0xdd [<c0152aac>]  cache_alloc_refill+0xfd/0x27c [<c01f223f>]     _raw_write_lock+0x3f/0x7c
Jun 10 01:05:21 localhost kernel:  [<c0155db9>] [<c032bd15>] shrink_zone+0x89/0xd8
Jun 10 01:05:21 localhost kernel:  [<c0155e6e>] shrink_caches+0x66/0x74 _write_lock_irqsave+0x9/0xd         [<c0155f29>] [<c015070d>] try_to_free_pages+0xad/0x1b7 test_clear_page_writeback+0x2d/0xa6
Jun 10 01:05:21 localhost kernel:
Jun 10 01:05:21 localhost kernel:  [<c014a09c3a/0x73     [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f
Jun 10 01:06:14 localhost kernel:  [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f
Jun 10 01:06:17 localhost kernel:  [<c016f307>] bio_endio+0x3e/0x69     [<c01e2e88>] __end_that_request_first+0x101/0x235
Jun 10 01:06:20 localhost kernel:  [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]     [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod]
Jun 10 01:06:21 localhost kernel:  [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod]    [<c0120633>] __wake_up_common+0x39/0x59
Jun 10 01:06:22 localhost kernel:  [<c0120685>] __wake_up+0x32/0x43     [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod]
Jun 10 01:06:24 localhost kernel:  [<c0106265>] do_IRQ+0x55/0x86     [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod]
Jun 10 01:06:25 localhost kernel:  [<c012a1f2>] __do_softirq+0x72/0xdc     [<c0106383>] do_softirq+0x4b/0x4f
Jun 10 01:06:25 localhost kernel:  =======================
Jun 10 01:06:26 localhost kernel:  [<c0104b08>] apic_timer_interrupt+0x1c/0x24    [<c0155554>] shrink_list+0x3bf/0x45d
Jun 10 01:06:27 localhost kernel:  [<c01557a3>] shrink_cache+0xe7/0x29b     [<c014fdb0>] get_writeback_state+0x30/0x35
Jun 10 01:06:27 localhost kernel:  [<c0195d7e>] mb_cache_shrink_fn+0x182/0x1c4    [<c0155db9>] shrink_zone+0x89/0xd8
Jun 10 01:06:27 localhost kernel:  [<c015622a>] balance_pgdat+0x1f7/0x3d6     [<c01564e5>] kswapd+0xdc/0x127
Jun 10 01:06:27 localhost kernel:  [<c0139196>] autoremove_wake_function+0x0/0x37     [<c0156409>] kswapd+0x0/0x127
Jun 10 01:06:27 localhost kernel:  [<c010243d>] kernel_thread_helper+0x5/0xb
Jun 10 01:09:40 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3258, f513cf24 (Not tainted)
Jun 10 01:09:40 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun 10 01:09:40 localhost kernel:  [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6
Jun 10 01:09:40 localhost kernel:  [<c014a09c>] end_page_writeback+0x76/0x84  [<c016b2c9>] end_buffer_async_write+0xbf/0x12a
Jun 10 01:09:40 localhost kernel:  [<c014d292>] mempool_free+0x3a/0x73     [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f
Jun 10 01:09:40 localhost kernel:  [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f
Jun 10 01:09:41 localhost kernel:  [<c016f307>] bio_endio+0x3e/0x69     [<c01e2e88>] __end_that_request_first+0x101/0x235
Jun 10 01:09:41 localhost kernel:  [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]     [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod]
Jun 10 01:09:41 localhost kernel:  [<c0104b08>] apic_timer_interrupt+0x1c/0x24    [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod]
Jun 10 01:09:41 localhost kernel:  [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod]     [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod]
Jun 10 01:09:41 localhost kernel:  [<c012a1f2>] __do_softirq+0x72/0xdc     [<c0106383>] do_softirq+0x4b/0x4f
Jun 10 01:09:41 localhost kernel:  =======================
Jun 10 01:09:41 localhost kernel:  [<c0106265>] do_IRQ+0x55/0x86     [<c0104a7a>] common_interrupt+0x1a/0x20
Jun 10 01:09:45 localhost kernel:  [<c0150517>] __set_page_dirty_nobuffers+0x71/0xb2     [<c016ca3e>] __block_commit_write+0x83/0x8f
Jun 10 01:09:46 localhost kernel:  [<c016d154>] block_commit_write+0x15/0x1c  [<c0170dc6>] blkdev_commit_write+0x0/0xd
Jun 10 01:09:46 localhost kernel:  [<c014bfc8>] generic_file_buffered_write+0x310/0x655     [<c012a04b>] current_fs_time+0x5a/0x75
Jun 10 01:09:47 localhost kernel:  [<c012a04b>] current_fs_time+0x5a/0x75     [<c0182cd8>] inode_update_time+0x2d/0x99
Jun 10 01:09:47 localhost kernel:  [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9     [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92
Jun 10 01:09:47 localhost kernel:  [<c01719e0>] blkdev_file_write+0x0/0x24     [<c014c99c>] generic_file_write_nolock+0x85/0x9f
Jun 10 01:09:47 localhost kernel:  [<c0139196>] autoremove_wake_function+0x0/0x37     [<c0171a00>] blkdev_file_write+0x20/0x24
Jun 10 01:09:47 localhost kernel:  [<c01699f2>] vfs_write+0xa2/0x15a     [<c0169b55>] sys_write+0x41/0x6a
Jun 10 01:09:48 localhost kernel:  [<c0104035>] syscall_call+0x7/0xb
Jun 10 01:15:55 localhost kernel: BUG: write-lock lockup on CPU#3, kswapd0/482, f69efaa4 (Not tainted)

[restart]


Jun 10 07:12:16 localhost kernel: BUG: write-lock lockup on CPU#5, mkfs.ext2/3107, f650ace4 (Not tainted)
Jun 10 07:12:16 localhost kernel: BUG: write-lock lockup on CPU#2, mkfs.ext2/3108, f650ace4 (Not tainted)
Jun 10 07:12:16 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun 10 07:12:16 localhost kernel:  [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6
Jun 10 07:12:16 localhost kernel:  [<c014a09c>] end_page_writeback+0x76/0x84  [<c016b2c9>] end_buffer_async_write+0xbf/0x12a
Jun 10 07:12:16 localhost kernel:  [<c01f1167>] memmove+0x24/0x2d     [<c014d292>] mempool_free+0x3a/0x73
Jun 10 07:12:16 localhost kernel:  [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f
Jun 10 07:12:16 localhost kernel:  [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f  [<c016f307>] bio_endio+0x3e/0x69
Jun 10 07:12:16 localhost kernel:  [<c01e2e88>] __end_that_request_first+0x101/0x235     [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]
Jun 10 07:12:16 localhost kernel:  [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod]     [<c011e8fa>] try_to_wake_up+0x6e/0x3ec
Jun 10 07:12:16 localhost kernel:  [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod]    [<c0120633>] __wake_up_common+0x39/0x59
Jun 10 07:12:16 localhost kernel:  [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod]     [<c0106265>] do_IRQ+0x55/0x86
Jun 10 07:12:16 localhost kernel:  [<c0134a95>] __queue_work+0x45/0x54     [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod]
Jun 10 07:12:19 localhost kernel:  [<c012a1f2>] __do_softirq+0x72/0xdc     [<c0106383>] do_softirq+0x4b/0x4f
Jun 10 07:12:20 localhost kernel:  =======================
Jun 10 07:12:20 localhost kernel:  [<c0104b08>] apic_timer_interrupt+0x1c/0x24    [<c0154471>] __pagevec_lru_add+0x9e/0xbb
Jun 10 07:12:20 localhost kernel:  [<c014c0a1>] generic_file_buffered_write+0x3e9/0x655     [<c012a04b>] current_fs_time+0x5a/0x75
Jun 10 07:12:21 localhost kernel:  [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9
Jun 10 07:12:21 localhost kernel:  [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92     [<c01719e0>] blkdev_file_write+0x0/0x24
Jun 10 07:12:21 localhost kernel:  [<c014c99c>] generic_file_write_nolock+0x85/0x9f     [<c0139196>]BUG: write-lock lockup on CPU#4, safte-monitor/2750, f650ace4 (Not tainted)
Jun 10 07:12:21 localhost kernel:  autoremove_wake_function+0x0/0x37
Jun 10 07:12:22 localhost kernel:  [<c0171a00>] [<c01f21d7>] blkdev_file_write+0x20/0x24     [<c01699f2>] vfs_write+0xa2/0x15a __write_lock_debug+0xb4/0xdd
Jun 10 07:12:22 localhost kernel:      [<c0169b55>] [<c01f223f>] sys_write+0x41/0x6a     _raw_write_lock+0x3f/0x7c [<c0104035>]
Jun 10 07:12:22 localhost kernel:  [<c032bd15>] syscall_call+0x7/0xb
Jun 10 07:12:22 localhost kernel:  _write_lock_irqsave+0x9/0xd     [<c015060d>] test_clear_page_dirty+0x2a/0xa9
Jun 10 07:12:22 localhost kernel:  [<c016df8f>] try_to_free_buffers+0x6b/0x84   [<c0155567>] shrink_list+0x3d2/0x45d
Jun 10 07:12:22 localhost kernel:  [<c01c5a2a>] avc_has_perm_noaudit+0x24/0xce    [<c014dede>] prep_new_page+0x41/0x75
Jun 10 07:12:23 localhost kernel:  [<c01c5b22>] avc_has_perm+0x4e/0x5c     [<c01557a3>] shrink_cache+0xe7/0x29b
Jun 10 07:12:23 localhost kernel:  [<c0155db9>] shrink_zone+0x89/0xd8     [<c0155e6e>] shrink_caches+0x66/0x74
Jun 10 07:12:23 localhost kernel:  [<c0155f29>] try_to_free_pages+0xad/0x1b7  [<c014e8a5>] __alloc_pages+0x136/0x2ec
Jun 10 07:12:23 localhost kernel:  [<c014ea79>] __get_free_pages+0x1e/0x34     [<c017ba85>] __pollwait+0x64/0x97
Jun 10 07:12:23 localhost kernel:  [<c02f0644>] tcp_poll+0x15/0x183     [<c02c0db0>] sock_poll+0xf/0x11
Jun 10 07:12:23 localhost kernel:  [<c017bd30>] do_select+0x177/0x303     [<c01483af>] audit_syscall_entry+0x10a/0x15d
Jun 10 07:12:24 localhost kernel:  [<c017ba21>] __pollwait+0x0/0x97     [<c017c0ba>] sys_select+0x1e6/0x38f
Jun 10 07:12:24 localhost kernel:  [<c01084cd>] do_syscall_trace+0x1ac/0x1c4  [<c0104035>] syscall_call+0x7/0xb
Jun 10 07:12:24 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
Jun 10 07:12:24 localhost kernel:  [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear


* Re: BUG: write-lock lockup
  2006-06-12 19:53 BUG: write-lock lockup Charles C. Bennett, Jr.
@ 2006-06-17 17:07 ` Andrew Morton
  2006-06-19  7:02   ` [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-17 17:07 UTC (permalink / raw)
  To: Charles C. Bennett, Jr.; +Cc: linux-kernel, Ingo Molnar

On Mon, 12 Jun 2006 15:53:43 -0400
"Charles C. Bennett, Jr." <ccb@acm.org> wrote:

> 
> Hi All -
> 
> 	I'm seeing write-lock lockups...  this is happening with
> the Fedora kernels from at least as far back as their 2.6.15-1.1833_FC4
> (2.6.15.5) and as recently as their 2.6.16-1.2111_FC4 (2.6.16.17).
> 
> It's a beefy box:
>  Gateway 9515R - Intel ICH5/ICH5R
>  Two dual-core 3.0 GHz Xeons, 4GB RAM
>  All disks via Emulex LP101-H (thor), switched fabric to
>   Hitachi WMS SAN storage.
> 
> The lockups are not process-specific.  Running mke2fs on large
> filesystems seems to make it happen sooner rather than later.

Apart from these messages, does the machine otherwise work OK?

> I can fish out any other data you need, run tests, etc.
>
>
> Jun  9 20:43:40 localhost kernel: BUG: write-lock lockup on CPU#6, mkfs.ext2/3258, f799c864 (Not tainted)
> Jun  9 20:43:40 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
> Jun  9 20:43:40 localhost kernel:  [<c015532c>] shrink_list+0x197/0x45d     [<c01557a3>] shrink_cache+0xe7/0x29b
> Jun  9 20:43:40 localhost kernel:  [<c014d955>] bad_range+0x22/0x2f     [<c014dfe3>] __rmqueue+0xd1/0x156
> Jun  9 20:43:40 localhost kernel:  [<c0155db9>] shrink_zone+0x89/0xd8     [<c0155e6e>] shrink_caches+0x66/0x74
> Jun  9 20:43:40 localhost kernel:  [<c0155f29>] try_to_free_pages+0xad/0x1b7  [<c014e8a5>] __alloc_pages+0x136/0x2ec
> Jun  9 20:43:40 localhost kernel:  [<c016ca3e>] __block_commit_write+0x83/0x8f    [<c014be23>] generic_file_buffered_write+0x16b/0x655
> Jun  9 20:43:40 localhost kernel:  [<c012a04b>] current_fs_time+0x5a/0x75     [<c012a04b>] current_fs_time+0x5a/0x75
> Jun  9 20:43:40 localhost kernel:  [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9
> Jun  9 20:43:40 localhost kernel:  [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92     [<c01719e0>] blkdev_file_write+0x0/0x24
> Jun  9 20:43:40 localhost kernel:  [<c014c99c>] generic_file_write_nolock+0x85/0x9f     [<c0139196>] autoremove_wake_function+0x0/0x37
> Jun  9 20:43:40 localhost kernel:  [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a
> Jun  9 20:43:40 localhost kernel:  [<c0169b55>] sys_write+0x41/0x6a     [<c0104035>] syscall_call+0x7/0xb
> Jun  9 21:02:58 localhost kernel: BUG: write-lock lockup on CPU#1, mkfs.ext2/3259, f799c864 (Not tainted)
> Jun  9 21:02:58 localhost kernel:  [<c01f21d7>] __write_lock_debug+0xb4/0xdd  [<c01f223f>] _raw_write_lock+0x3f/0x7c
> Jun  9 21:02:58 localhost kernel:  [<c032bd15>] _write_lock_irqsave+0x9/0xd [<c015070d>] test_clear_page_writeback+0x2d/0xa6
> Jun  9 21:02:58 localhost kernel:  [<c014a09c>] end_page_writeback+0x76/0x84  [<c016b2c9>] end_buffer_async_write+0xbf/0x12a
> Jun  9 21:02:58 localhost kernel:  [<c01f1167>] memmove+0x24/0x2d     [<c014d292>] mempool_free+0x3a/0x73
> Jun  9 21:02:58 localhost kernel:  [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f [<c016db6a>] end_bio_bh_io_sync+0x0/0x4f
> Jun  9 21:02:58 localhost kernel:  [<c016db8d>] end_bio_bh_io_sync+0x23/0x4f  [<c016f307>] bio_endio+0x3e/0x69
> Jun  9 21:02:58 localhost kernel:  [<c01e2e88>] __end_that_request_first+0x101/0x235     [<f885c8af>] scsi_end_request+0x1b/0xb0 [scsi_mod]
> Jun  9 21:02:58 localhost kernel:  [<f885cbd5>] scsi_io_completion+0x151/0x4d2 [scsi_mod]     [<f8833c97>] sd_rw_intr+0x70/0x3c1 [sd_mod]
> Jun  9 21:02:58 localhost kernel:  [<f8857cad>] scsi_finish_command+0x82/0xd0 [scsi_mod]     [<f8857b8e>] scsi_softirq+0xc0/0x137 [scsi_mod]
> Jun  9 21:02:58 localhost kernel:  [<c012a1f2>] __do_softirq+0x72/0xdc     [<c0106383>] do_softirq+0x4b/0x4f
> Jun  9 21:02:58 localhost kernel:  =======================
> Jun  9 21:02:58 localhost kernel:  [<c0106265>] do_IRQ+0x55/0x86     [<c0104a7a>] common_interrupt+0x1a/0x20
> Jun  9 21:03:05 localhost kernel:  [<c014e387>] free_hot_cold_page+0xd5/0x126   [<c014eaf4>] __pagevec_free+0x1f/0x2e
> Jun  9 21:03:08 localhost kernel:  [<c0154371>] __pagevec_release_nonlru+0x29/0x8b     [<c01553aa>] shrink_list+0x215/0x45d
> Jun  9 21:03:09 localhost kernel:  [<c01557a3>] shrink_cache+0xe7/0x29b     [<c0155db9>] shrink_zone+0x89/0xd8
> Jun  9 21:03:09 localhost kernel:  [<c0155e6e>] shrink_caches+0x66/0x74     [<c0155f29>] try_to_free_pages+0xad/0x1b7
> Jun  9 21:03:09 localhost kernel:  [<c014e8a5>] __alloc_pages+0x136/0x2ec     [<c016ca3e>] __block_commit_write+0x83/0x8f
> Jun  9 21:03:10 localhost kernel:  [<c014be23>] generic_file_buffered_write+0x16b/0x655     [<c012a04b>] current_fs_time+0x5a/0x75
> Jun  9 21:03:11 localhost kernel:  [<c0182cd8>] inode_update_time+0x2d/0x99 [<c014c5b0>] __generic_file_aio_write_nolock+0x2a3/0x4d9
> Jun  9 21:03:11 localhost kernel:  [<c014c822>] generic_file_aio_write_nolock+0x3c/0x92     [<c01719e0>] blkdev_file_write+0x0/0x24
> Jun  9 21:03:12 localhost kernel:  [<c014c99c>] generic_file_write_nolock+0x85/0x9f     [<c0139196>] autoremove_wake_function+0x0/0x37
> Jun  9 21:03:12 localhost kernel:  [<c0171a00>] blkdev_file_write+0x20/0x24 [<c01699f2>] vfs_write+0xa2/0x15a
> Jun  9 21:03:12 localhost kernel:  [<c0169b55>] sys_write+0x41/0x6a     [<c0104035>] syscall_call+0x7/0xb

Well.  Obviously __write_lock_debug():print_once was meant to have static
scope (__read_mostly, too).  But that's a cosmetic thing.

I'm suspecting that the debug code has simply gone wrong here - that
there's such a lot of read_lock() traffic happening with this workload that
the debug version of write_lock() simply isn't able to take the lock.

This might fix it, but it'll break the timing calculations in that loop.

--- devel/lib/spinlock_debug.c~a	2006-06-17 03:08:35.000000000 -0700
+++ devel-akpm/lib/spinlock_debug.c	2006-06-17 03:08:56.000000000 -0700
@@ -219,7 +219,7 @@ static void __write_lock_debug(rwlock_t 
 		for (i = 0; i < loops_per_jiffy * HZ; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
-			__delay(1);
+			cpu_relax();
 		}
 		/* lockup suspected: */
 		if (print_once) {
_
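Andrew's starvation theory can be illustrated with a small userspace model of the i386 rwlock counter. This is a deliberate simplification with invented `model_*` names, not the kernel's code; the convention modeled is that an unlocked rwlock holds RW_LOCK_BIAS and each reader subtracts one. If a fresh reader is always inside the lock whenever the writer polls, a trylock-based write path never observes the full bias and never succeeds:

```c
#include <assert.h>

/* Simplified model: RW_LOCK_BIAS when unlocked, one less per reader,
 * zero when write-locked.  Single-threaded, so "interleaving" is just
 * the order of the calls below. */
#define RW_LOCK_BIAS 0x01000000

typedef struct { long count; } model_rwlock_t;

static int model_read_trylock(model_rwlock_t *l)
{
    if (l->count > 0) {             /* no writer holds the lock */
        l->count--;
        return 1;
    }
    return 0;
}

static void model_read_unlock(model_rwlock_t *l)
{
    l->count++;
}

static int model_write_trylock(model_rwlock_t *l)
{
    if (l->count == RW_LOCK_BIAS) { /* needs the full bias: no readers */
        l->count -= RW_LOCK_BIAS;
        return 1;
    }
    return 0;
}

/* Heavy read traffic: before every writer poll a new reader has already
 * slipped in, so the count never shows the full bias and the debug-style
 * trylock loop fails every single time. */
static int writer_starves(int attempts)
{
    model_rwlock_t l = { RW_LOCK_BIAS };
    int failed = 0, i;

    for (i = 0; i < attempts; i++) {
        model_read_trylock(&l);      /* reader arrives first... */
        if (!model_write_trylock(&l))
            failed++;                /* ...so the writer loses */
        model_read_unlock(&l);       /* reader leaves, too late */
    }
    return failed == attempts;
}
```

With no reader inside, the same model takes the write lock on the first try, so the failures really are contention, not a modeling bug.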


(I'm a little surprised that RH-FC ships with this debug option enabled.  It's
good for kernel development, but not optimal for users...)

Ingo, we have two bugs in there - some assistance and thought would be
appreciated, please.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-17 17:07 ` Andrew Morton
@ 2006-06-19  7:02   ` Ingo Molnar
  2006-06-19  7:59     ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19  7:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Charles C. Bennett, Jr., linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> Well.  Obviously __write_lock_debug():print_once was meant to have 
> static scope (__read_mostly, too).  But that's a cosmetic thing.

No - this is an infinite loop we must not break out of, and we only want
to print the warning once per infinite loop. But we don't want to shut up
these messages globally (so that we get a backtrace on all CPUs that
are looping, etc.).

> I'm suspecting that the debug code has simply gone wrong here - that 
> there's such a lot of read_lock() traffic happening with this workload 
> that the debug version of write_lock() simply isn't able to take the 
> lock.

I'd go for the patch below - the current 1 second timeout might be both
inadequate for really high loads, and it might also be fragile when
there's some miscalibration of loops_per_jiffy - turning a 1 sec
timeout into a half-sec timeout or so. Builds/boots here on 32-bit and
64-bit.

	Ingo

-----------
Subject: increase spinlock-debug looping timeouts from 1 sec to 1 min
From: Ingo Molnar <mingo@elte.hu>

Increase the loop timeouts from 1 second to 60 seconds. This should also
handle the occasional case of loops_per_jiffy miscalibration, and it
should handle false positives due to high workloads.

(Also make sure the right-hand side of the comparison does not overflow
on 32 bits, and make the messages indicate that we don't know for a fact
whether this is due to a lockup or due to some extreme load.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 lib/spinlock_debug.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -104,7 +104,7 @@ static void __spin_lock_debug(spinlock_t
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * 60 * HZ; i++) {
 			if (__raw_spin_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -112,7 +112,7 @@ static void __spin_lock_debug(spinlock_t
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: spinlock lockup on CPU#%d, "
+			printk(KERN_EMERG "BUG: possible spinlock lockup on CPU#%d, "
 					"%s/%d, %p\n",
 				raw_smp_processor_id(), current->comm,
 				current->pid, lock);
@@ -169,7 +169,7 @@ static void __read_lock_debug(rwlock_t *
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * 60 * HZ; i++) {
 			if (__raw_read_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -177,7 +177,7 @@ static void __read_lock_debug(rwlock_t *
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: read-lock lockup on CPU#%d, "
+			printk(KERN_EMERG "BUG: possible read-lock lockup on CPU#%d, "
 					"%s/%d, %p\n",
 				raw_smp_processor_id(), current->comm,
 				current->pid, lock);
@@ -242,7 +242,7 @@ static void __write_lock_debug(rwlock_t 
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * 60 * HZ; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -250,7 +250,7 @@ static void __write_lock_debug(rwlock_t 
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: write-lock lockup on CPU#%d, "
+			printk(KERN_EMERG "BUG: possible write-lock lockup on CPU#%d, "
 					"%s/%d, %p\n",
 				raw_smp_processor_id(), current->comm,
 				current->pid, lock);
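
The 32-bit overflow that the changelog's (u64) cast guards against can be seen concretely in userspace. The loops_per_jiffy value below is invented but plausible, and HZ=1000 is assumed (HZ is config-dependent):

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000                /* assumed; HZ is config-dependent */

/* Invented but plausible calibration value; the kernel's loops_per_jiffy
 * is an unsigned long, which is 32 bits wide on i386. */
static const uint32_t loops_per_jiffy = 2000000;

/* Without the cast the product is computed in 32-bit arithmetic, as it
 * would be on i386, and silently wraps modulo 2^32. */
static uint32_t timeout_without_cast(void)
{
    return loops_per_jiffy * 60 * HZ;
}

/* With the (u64)-style cast from the patch the full product survives. */
static uint64_t timeout_with_cast(void)
{
    return (uint64_t)loops_per_jiffy * 60 * HZ;
}
```

Here the true bound is 120,000,000,000 iterations; the uncast 32-bit product wraps to something unrelated, so the "60 second" loop would not mean what it says.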


* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  7:02   ` [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min Ingo Molnar
@ 2006-06-19  7:59     ` Andrew Morton
  2006-06-19  8:12       ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-19  7:59 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 09:02:29 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> Subject: increase spinlock-debug looping timeouts from 1 sec to 1 min

But it's broken.  In the non-debug case we subtract RW_LOCK_BIAS so we know
that the writer will get the lock when all readers vacate.  But in the
CONFIG_DEBUG_SPINLOCK case we don't do that, with the result that taking a
write_lock can take over a second.

A much, much better fix (which involves visiting all architectures) would
be to subtract RW_LOCK_BIAS and _then_ wait for a second.

OK, it's only debug code.  But RH (for one) have decided to ship zillions
of kernels with this debug code turned on.
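
A sketch of that subtract-first scheme, in a toy single-threaded counter model (illustrative only; the `model_*` names and the polling stand-in for real spinning are invented): subtracting RW_LOCK_BIAS up front publishes the writer's claim, newly arriving readers back off, and the count simply drains to zero as the pre-existing readers leave.

```c
#include <assert.h>

#define RW_LOCK_BIAS 0x01000000

typedef struct { long count; } model_rwlock_t;

static int model_read_trylock(model_rwlock_t *l)
{
    if (l->count > 0) {   /* fails once a writer has claimed the bias */
        l->count--;
        return 1;
    }
    return 0;
}

static void model_read_unlock(model_rwlock_t *l)
{
    l->count++;
}

/* Subtract-first writer: claim the bias immediately, then poll until
 * the pre-existing readers have drained.  Returns how many polls that
 * took; each poll also checks that a newly arriving reader is refused. */
static int biased_write_lock_polls(int readers)
{
    model_rwlock_t l = { RW_LOCK_BIAS };
    int polls = 0, i;

    for (i = 0; i < readers; i++)        /* readers already inside */
        model_read_trylock(&l);

    l.count -= RW_LOCK_BIAS;             /* count is now -readers */
    while (l.count != 0) {
        assert(!model_read_trylock(&l)); /* newcomers are locked out */
        model_read_unlock(&l);           /* one old reader departs */
        polls++;
    }
    return polls;                        /* lock held: count == 0 */
}
```

In this model the writer's wait is bounded by the readers already inside the lock, rather than being unbounded under a steady stream of new readers.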


* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  7:59     ` Andrew Morton
@ 2006-06-19  8:12       ` Ingo Molnar
  2006-06-19  8:21         ` Ingo Molnar
  2006-06-19  8:32         ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19  8:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ccb, linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> On Mon, 19 Jun 2006 09:02:29 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > Subject: increase spinlock-debug looping timeouts from 1 sec to 1 
> > min
> 
> But it's broken.  In the non-debug case we subtract RW_LOCK_BIAS so we 
> know that the writer will get the lock when all readers vacate.  But 
> in the CONFIG_DEBUG_SPINLOCK case we don't do that, with the result 
> that taking a write_lock can take over a second.
> 
> A much, much better fix (which involves visiting all architectures) 
> would be to subtract RW_LOCK_BIAS and _then_ wait for a second.

No. Write-locks are unfair too, and there's no guarantee that writers are
ever serviced. That's why nested read_lock() is valid, while nested
down_read() is invalid.

Take a look at arch/i386/kernel/semaphore.c: __write_lock_failed() just
adds back the RW_LOCK_BIAS and retries in a loop. There's no difference
from an open-coded write_trylock loop - unless I'm missing something
fundamental.
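
The equivalence being claimed here can be spelled out in a toy counter model (a paraphrase of the scheme, not the real i386 assembly; the names are invented): for any starting count, the subtract/add-back path and an open-coded write_trylock() return the same result and leave the counter in the same state.

```c
#include <assert.h>

/* Toy model of the i386 rwlock counter: RW_LOCK_BIAS when free,
 * one less per reader, zero when write-locked. */
#define RW_LOCK_BIAS 0x01000000

typedef struct { long count; } model_rwlock_t;

/* Paraphrase of the subtract / __write_lock_failed path: speculatively
 * grab the whole bias; if readers were inside, give it back and report
 * failure so the caller retries from scratch. */
static int failed_path_write_attempt(model_rwlock_t *l)
{
    l->count -= RW_LOCK_BIAS;
    if (l->count != 0) {
        l->count += RW_LOCK_BIAS;   /* the add-back-and-retry step */
        return 0;
    }
    return 1;
}

/* An open-coded write_trylock: succeed only if the full bias is there. */
static int trylock_write_attempt(model_rwlock_t *l)
{
    if (l->count != RW_LOCK_BIAS)
        return 0;
    l->count -= RW_LOCK_BIAS;
    return 1;
}
```

Both attempts observably behave the same, so the failed-path retry confers no priority on the writer: it re-queues behind whatever readers arrive next, exactly like a trylock loop.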

> OK, it's only debug code.  But RH (for one) have decided to ship 
> zillions of kernels with this debug code turned on.

Yes - Fedora enables most of the transparent kernel debugging options
(slab, lock debugging) in rawhide. The current list is:

 CONFIG_DEBUG_KERNEL=y
 CONFIG_DEBUG_SLAB=y
 CONFIG_DEBUG_SLAB_LEAK=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_DEBUG_SPINLOCK=y
 CONFIG_DEBUG_SPINLOCK_SLEEP=y
 CONFIG_DEBUG_HIGHMEM=y
 CONFIG_DEBUG_BUGVERBOSE=y
 CONFIG_DEBUG_INFO=y
 CONFIG_DEBUG_FS=y
 CONFIG_DEBUG_VM=y
 CONFIG_DEBUG_STACKOVERFLOW=y
 CONFIG_DEBUG_STACK_USAGE=y
 CONFIG_DEBUG_RODATA=y
 CONFIG_KEYS_DEBUG_PROC_KEYS=y

	Ingo


* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:12       ` Ingo Molnar
@ 2006-06-19  8:21         ` Ingo Molnar
  2006-06-19  8:32         ` Andrew Morton
  1 sibling, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19  8:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ccb, linux-kernel


* Ingo Molnar <mingo@elte.hu> wrote:

> no. Write-locks are unfair too, and there's no guarantee that writes 
> are listened to. That's why nested read_lock() is valid, while nested 
> down_read() is invalid.
> 
> Take a look at arch/i386/kernel/semaphore.c, __write_lock_failed() 
> just adds back the RW_LOCK_BIAS and retries in a loop. There's no 
> difference to an open-coded write_trylock loop - unless i'm missing 
> something fundamental.

did i ever mention that i find rwlocks evil, inefficient and bug-prone, 
and that we should get rid of them? :-)

(Most rwlock users can be converted to straight spinlocks just fine, but 
there are a couple of places that rely on read-lock nesting. The 
hardest-to-fix offenders are nested rcu_read_locks() in the netfilter 
code. I gave up converting them to saner locking; PREEMPT_RCU works 
around it in the -rt tree, by not being rwlock based.)

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:12       ` Ingo Molnar
  2006-06-19  8:21         ` Ingo Molnar
@ 2006-06-19  8:32         ` Andrew Morton
  2006-06-19  8:35           ` Ingo Molnar
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-19  8:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 10:12:52 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > On Mon, 19 Jun 2006 09:02:29 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > Subject: increase spinlock-debug looping timeouts from 1 sec to 1 
> > > min
> > 
> > But it's broken.  In the non-debug case we subtract RW_LOCK_BIAS so we 
> > know that the writer will get the lock when all readers vacate.  But 
> > in the CONFIG_DEBUG_SPINLOCK case we don't do that, with the result 
> > that taking a write_lock can take over a second.
> > 
> > A much, much better fix (which involves visiting all architectures) 
> > would be to subtract RW_LOCK_BIAS and _then_ wait for a second.
> 
> no. Write-locks are unfair too, and there's no guarantee that writes are 
> listened to. That's why nested read_lock() is valid, while nested 
> down_read() is invalid.
> 
> Take a look at arch/i386/kernel/semaphore.c, __write_lock_failed() just 
> adds back the RW_LOCK_BIAS and retries in a loop. There's no difference 
> to an open-coded write_trylock loop - unless i'm missing something 
> fundamental.

OK.  That sucks.  A sufficiently large machine with the right mix of
latencies will get hit by the NMI watchdog in write_lock_irq().

But presumably the situation is much worse with CONFIG_DEBUG_SPINLOCK
because of that __delay().

So how about we remove the __delay()?  It's wrong anyway, because
loops_per_jiffy isn't calculated with a write_trylock() in the loop -
which means we're getting scarily close to the NMI watchdog at present.

Instead, calculate a custom loops_per_jiffy for this purpose in
lib/spinlock_debug.c?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:32         ` Andrew Morton
@ 2006-06-19  8:35           ` Ingo Molnar
  2006-06-19  9:13             ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19  8:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ccb, linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> OK.  That sucks.  A sufficiently large machine with the right mix of 
> latencies will get hit by the NMI watchdog in write_lock_irq().
> 
> But presumably the situation is much worse with CONFIG_DEBUG_SPINLOCK 
> because of that __delay().
> 
> So how about we remove the __delay() (which is wrong anyway, because 
> loops_per_jiffy isn't calculated with a write_trylock() in the loop 
> (which means we're getting scarily close to the NMI watchdog at 
> present)).
> 
> Instead, calculate a custom loops_per_jiffy for this purpose in 
> lib/spinlock_debug.c?

hm, that would be yet another calibration loop with the potential to be 
wrong (and which would slow down the bootup process). If loops_per_jiffy 
is wrong then our timings are toast anyway.

I think increasing the timeout to 60 secs ought to be enough - 1 sec was 
a bit too close to valid delays and i can imagine really high loads 
causing 1 sec delays (especially if something like SysRq-T is holding 
the tasklist_lock for long).

The write_trylock + __delay in the loop is not a problem or a bug, as 
the trylock will at most _increase_ the delay - and our goal is to not 
have a false positive, not to be absolutely accurate about the 
measurement here.

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  8:35           ` Ingo Molnar
@ 2006-06-19  9:13             ` Andrew Morton
  2006-06-19 11:39               ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-19  9:13 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 10:35:18 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > OK.  That sucks.  A sufficiently large machine with the right mix of 
> > latencies will get hit by the NMI watchdog in write_lock_irq().
> > 
> > But presumably the situation is much worse with CONFIG_DEBUG_SPINLOCK 
> > because of that __delay().
> > 
> > So how about we remove the __delay() (which is wrong anyway, because 
> > loops_per_jiffy isn't calculated with a write_trylock() in the loop 
> > (which means we're getting scarily close to the NMI watchdog at 
> > present)).
> > 
> > Instead, calculate a custom loops_per_jiffy for this purpose in 
> > lib/spinlock_debug.c?
> 
> hm, that would be yet another calibration loop with the potential to be 
> wrong (and which would slow down the bootup process).

We should be able to do it in two jiffies, max.

> If loops_per_jiffy 
> is wrong then our timings are toast anyway.

loops_per_jiffy measures how long we can run __delay(), which is basically
two rdtsc's (x86_64).  But in __write_lock_debug() we're also doing a
failed write_trylock(), so it'll take more than a second.  Perhaps a lot
more, depending on relative speeds, and whether the lock is exclusive-owned
in another CPU (it often will be).

...  which means that calculating loops_per_jiffy_for_write_lock_debug will
be hard - we'd need N other CPUs simultaneously hammering the lock for
reading (!)

Maybe we just shouldn't enable this code if the CPU doesn't have an
rdtsc-like thingy we can get at.

> I think increasing the timeout to 60 secs ought to be enough - 1 sec was 
> a bit too close to valid delays and i can imagine really high loads 
> causing 1 sec delays (especially if something like SysRq-T is holding 
> the tasklist_lock for long).

No way can _this_ workload be taking the lock for anything like that amount
of time.  We have

- one CPU taking the lock for reading, looking to see if a page is in
  pagecache.

- the same CPU taking the lock for writing, adding pages to pagecache. 
  This is infrequent on average because we need to modify the contents of
  each page.

- one CPU perhaps taking the lock for writing running page reclaim at the
  same average rate as the rate of pagecache consumption.  Also slow.

Each of these code paths is really short, and relatively infrequent. 
There's just no way that a write_lock on tree_lock should be taking a
second (more than a second, but we don't know how much more).

Surely something is busted.

> The write_trylock + __delay in the loop is not a problem or a bug, as 
> the trylock will at most _increase_ the delay - and our goal is to not 
> have a false positive, not to be absolutely accurate about the 
> measurement here.

Precisely.  We have delays of over a second (but we don't know how much
more than a second).  Let's say two seconds.  The NMI watchdog timeout is,
what?  Five seconds?

That's getting too close.  The result will be a total system crash.  And RH
are shipping this.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19  9:13             ` Andrew Morton
@ 2006-06-19 11:39               ` Ingo Molnar
  2006-06-19 19:55                 ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-19 11:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ccb, linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> > The write_trylock + __delay in the loop is not a problem or a bug, as 
> > the trylock will at most _increase_ the delay - and our goal is to not 
> > have a false positive, not to be absolutely accurate about the 
> > measurement here.
> 
> Precisely.  We have delays of over a second (but we don't know how 
> much more than a second).  Let's say two seconds.  The NMI watchdog 
> timeout is, what?  Five seconds?

i dont see the problem. We'll have tried that lock hundreds of thousands 
of times before this happens. The NMI watchdog will only trigger if we 
do this with IRQs disabled. And it's not like the normal 
__write_lock_failed codepath would be any different: for heavily 
contended workloads the overhead is likely in the cacheline bouncing, 
not in the __delay().

> That's getting too close.  The result will be a total system crash.  
> And RH are shipping this.

I dont see a connection. Pretty much the only thing the loop condition 
impacts is the condition under which we print out a 'i think we 
deadlocked' message. Have i missed your point perhaps?

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19 11:39               ` Ingo Molnar
@ 2006-06-19 19:55                 ` Andrew Morton
  2006-06-20  8:06                   ` Arjan van de Ven
  2006-06-20  8:40                   ` [patch] fix spinlock-debug looping Ingo Molnar
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2006-06-19 19:55 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ccb, linux-kernel

On Mon, 19 Jun 2006 13:39:44 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > > The write_trylock + __delay in the loop is not a problem or a bug, as 
> > > the trylock will at most _increase_ the delay - and our goal is to not 
> > > have a false positive, not to be absolutely accurate about the 
> > > measurement here.
> > 
> > Precisely.  We have delays of over a second (but we don't know how 
> > much more than a second).  Let's say two seconds.  The NMI watchdog 
> > timeout is, what?  Five seconds?
> 
> i dont see the problem.

It's taking over a second to acquire a write_lock.  A lock which is
unlikely to be held for more than a microsecond anywhere.  That's really
bad, isn't it?  Being on the edge of an NMI watchdog induced system crash
is bad, too.

> We'll have tried that lock hundreds of thousands 
> of times before this happens. The NMI watchdog will only trigger if we 
> do this with IRQs disabled.

tree_lock uses write_lock_irq().

> And it's not like the normal 
> __write_lock_failed codepath would be any different: for heavily 
> contended workloads the overhead is likely in the cacheline bouncing, 
> not in the __delay().

Yes, it might also happen with !CONFIG_DEBUG_SPINLOCK.  We need to find out
if that's so and if so, why.

> > That's getting too close.  The result will be a total system crash.  
> > And RH are shipping this.
> 
> I dont see a connection. Pretty much the only thing the loop condition 
> impacts is the condition under which we print out a 'i think we 
> deadlocked' message.

I'm assuming that the additional delay in the debug code has worsened the
situation.

> Have i missed your point perhaps?

I get that impression ;) If it takes 1-2 seconds to get this lock then it
can take five seconds.  a) that's just gross and b) the NMI watchdog will
nuke the box.

Why is it taking so long to get the lock?

Does it happen in non-debug mode?

What do we do about it?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min
  2006-06-19 19:55                 ` Andrew Morton
@ 2006-06-20  8:06                   ` Arjan van de Ven
  2006-06-20  8:40                   ` [patch] fix spinlock-debug looping Ingo Molnar
  1 sibling, 0 replies; 18+ messages in thread
From: Arjan van de Ven @ 2006-06-20  8:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, ccb, linux-kernel

On Mon, 2006-06-19 at 12:55 -0700, Andrew Morton wrote:

> I get that impression ;) If it takes 1-2 seconds to get this lock then it
> can take five seconds.  a) that's just gross and b) the NMI watchdog will
> nuke the box.
> 
> Why is it taking so long to get the lock?

on a high level, this is a "feature" of unfair read/write locks in
general. Just have enough processors and a lock that can be taken for
read as a result of a userspace action, and BLAM - you have a
user-triggerable condition... which is effectively a DoS security issue.

> What do we do about it?

read/write spinlocks are effectively silly. There is NO scalability
advantage in practice (well, there would be if you had really long hold
times, which you don't for spinlocks; rwsems are obviously different
there). This may sound counterintuitive... but nowadays the
scalability cost for taking a lock is in the cache miss for bouncing the
cacheline of the lock to your cpu for write (once this is done, the
actual locked cycle tends to be free or near-free).

In this cache-bounce respect, read/write spinlocks are no better than
normal spinlocks. In fact, in the ">1 concurrent readers" case, they are
*worse* than normal spinlocks!

So I'm entirely on Ingo's side with the "we should get rid of rw
spinlocks as much as we can" proposal; they make no sense whatsoever for
scalability - the only reason they are useful is for recursion...

(well, there is a borderline case with a rare writer in user context
but the normal users taking the lock for read in irq context; even
there the scalability is horrible, and realistically those cases should
use RCU or per-cpu data)




^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch] fix spinlock-debug looping
  2006-06-19 19:55                 ` Andrew Morton
  2006-06-20  8:06                   ` Arjan van de Ven
@ 2006-06-20  8:40                   ` Ingo Molnar
  2006-06-20  8:52                     ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-20  8:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ccb, linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> > > > The write_trylock + __delay in the loop is not a problem or a bug, as 
> > > > the trylock will at most _increase_ the delay - and our goal is to not 
> > > > have a false positive, not to be absolutely accurate about the 
> > > > measurement here.
> > > 
> > > Precisely.  We have delays of over a second (but we don't know how 
> > > much more than a second).  Let's say two seconds.  The NMI watchdog 
> > > timeout is, what?  Five seconds?
> > 
> > i dont see the problem.
> 
> It's taking over a second to acquire a write_lock.  A lock which is 
> unlikely to be held for more than a microsecond anywhere.  That's 
> really bad, isn't it?  Being on the edge of an NMI watchdog induced 
> system crash is bad, too.

i obviously agree that any such crash is a serious problem, but is it 
caused by the spinlock-debugging code? I doubt it is, unless __delay() 
is seriously buggered.

in any case, to move this problem forward i'd suggest we go with the 
patch below in -mm and wait for feedback. It fixes a potential overflow 
in the comparison (if HZ*lpj overflows 32 bits). That needs a really 
fast box running a 32-bit kernel though, so i doubt this is the cause 
of the problems. In any case, this change makes it easier to increase 
the looping timeout from 1 second to 10 seconds or so later on - at 
which point the overflow can happen for real and must be handled.

	Ingo

------------
Subject: fix spinlock-debug looping
From: Ingo Molnar <mingo@elte.hu>

make sure the right-hand side of the comparison does not overflow
on 32-bit kernels. Also print out more info when detecting a lockup,
so that we see how many times the code tried (and failed) to get the
lock.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 lib/spinlock_debug.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

Index: linux/lib/spinlock_debug.c
===================================================================
--- linux.orig/lib/spinlock_debug.c
+++ linux/lib/spinlock_debug.c
@@ -104,7 +104,7 @@ static void __spin_lock_debug(spinlock_t
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * HZ; i++) {
 			if (__raw_spin_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -112,10 +112,10 @@ static void __spin_lock_debug(spinlock_t
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: spinlock lockup on CPU#%d, "
-					"%s/%d, %p\n",
+			printk(KERN_EMERG "BUG: possible spinlock lockup on CPU#%d, "
+					"%s/%d, %p [%Ld/%ld]\n",
 				raw_smp_processor_id(), current->comm,
-				current->pid, lock);
+				current->pid, lock, i, loops_per_jiffy);
 			dump_stack();
 		}
 	}
@@ -169,7 +169,7 @@ static void __read_lock_debug(rwlock_t *
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * HZ; i++) {
 			if (__raw_read_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -177,10 +177,10 @@ static void __read_lock_debug(rwlock_t *
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: read-lock lockup on CPU#%d, "
-					"%s/%d, %p\n",
+			printk(KERN_EMERG "BUG: possible read-lock lockup on CPU#%d, "
+					"%s/%d, %p [%Ld/%ld]\n",
 				raw_smp_processor_id(), current->comm,
-				current->pid, lock);
+				current->pid, lock, i, loops_per_jiffy);
 			dump_stack();
 		}
 	}
@@ -242,7 +242,7 @@ static void __write_lock_debug(rwlock_t 
 	u64 i;
 
 	for (;;) {
-		for (i = 0; i < loops_per_jiffy * HZ; i++) {
+		for (i = 0; i < (u64)loops_per_jiffy * HZ; i++) {
 			if (__raw_write_trylock(&lock->raw_lock))
 				return;
 			__delay(1);
@@ -250,10 +250,10 @@ static void __write_lock_debug(rwlock_t 
 		/* lockup suspected: */
 		if (print_once) {
 			print_once = 0;
-			printk(KERN_EMERG "BUG: write-lock lockup on CPU#%d, "
-					"%s/%d, %p\n",
+			printk(KERN_EMERG "BUG: possible write-lock lockup on CPU#%d, "
+					"%s/%d, %p [%Ld/%ld]\n",
 				raw_smp_processor_id(), current->comm,
-				current->pid, lock);
+				current->pid, lock, i, loops_per_jiffy);
 			dump_stack();
 		}
 	}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] fix spinlock-debug looping
  2006-06-20  8:40                   ` [patch] fix spinlock-debug looping Ingo Molnar
@ 2006-06-20  8:52                     ` Andrew Morton
  2006-06-20  9:15                       ` Ingo Molnar
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2006-06-20  8:52 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ccb, linux-kernel

On Tue, 20 Jun 2006 10:40:01 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> i obviously agree that any such crash is a serious problem, but is it 
> caused by the spinlock-debugging code?

Judging from Dave Olson <olson@unixfolk.com>'s report: no.  He's getting
hit by NMI watchdog expiry on write_lock(tree_lock) in a !CONFIG_DEBUG_SPINLOCK
kernel.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] fix spinlock-debug looping
  2006-06-20  8:52                     ` Andrew Morton
@ 2006-06-20  9:15                       ` Ingo Molnar
  2006-06-20  9:32                         ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Ingo Molnar @ 2006-06-20  9:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ccb, linux-kernel


* Andrew Morton <akpm@osdl.org> wrote:

> On Tue, 20 Jun 2006 10:40:01 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> 
> > i obviously agree that any such crash is a serious problem, but is 
> > it caused by the spinlock-debugging code?
> 
> Judging from Dave Olson <olson@unixfolk.com>'s report: no.  He's 
> getting hit by NMI watchdog expiry on write_lock(tree_lock) in a 
> !CONFIG_DEBUG_SPINLOCK kernel.

hm, that means 5 seconds of looping with irqs off? That's really insane. 
Is there any definitive testcase or testsystem where we could try a 
simple tree_lock rwlock -> spinlock conversion? Spinlocks are a lot 
fairer. Or, as a simple experiment, s/read_lock/write_lock, as the patch 
below (against rc6-mm2) does. This is phase #1; if it works out we can 
switch tree_lock to a spinlock. [write_lock()s are roughly as fair to 
each other as spinlocks - a bit more expensive, but not 
significantly.] Builds & boots fine here.

	Ingo

---
 drivers/mtd/devices/block2mtd.c        |    8 ++++----
 fs/reiser4/plugin/file/cryptcompress.c |    8 ++++----
 fs/reiser4/plugin/file/file.c          |   12 ++++++------
 mm/filemap.c                           |   32 ++++++++++++++++----------------
 mm/page-writeback.c                    |    4 ++--
 mm/readahead.c                         |   22 +++++++++++-----------
 mm/swap_prefetch.c                     |    4 ++--
 7 files changed, 45 insertions(+), 45 deletions(-)

Index: linux/drivers/mtd/devices/block2mtd.c
===================================================================
--- linux.orig/drivers/mtd/devices/block2mtd.c
+++ linux/drivers/mtd/devices/block2mtd.c
@@ -59,7 +59,7 @@ static void cache_readahead(struct addre
 
 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < PAGE_READAHEAD; i++) {
 		pagei = index + i;
 		if (pagei > end_index) {
@@ -71,16 +71,16 @@ static void cache_readahead(struct addre
 			break;
 		if (page)
 			continue;
-		read_unlock_irq(&mapping->tree_lock);
+		write_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
+		write_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = pagei;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	if (ret)
 		read_cache_pages(mapping, &page_pool, filler, NULL);
 }
Index: linux/fs/reiser4/plugin/file/cryptcompress.c
===================================================================
--- linux.orig/fs/reiser4/plugin/file/cryptcompress.c
+++ linux/fs/reiser4/plugin/file/cryptcompress.c
@@ -3413,7 +3413,7 @@ static void clear_moved_tag_cluster(stru
 {
 	int i;
 	void * ret;
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (i = 0; i < clust->nr_pages; i++) {
 		assert("edward-1438", clust->pages[i] != NULL);
 		ret = radix_tree_tag_clear(&mapping->page_tree,
@@ -3421,7 +3421,7 @@ static void clear_moved_tag_cluster(stru
 					   PAGECACHE_TAG_REISER4_MOVED);
 		assert("edward-1439", ret == clust->pages[i]);
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 }
 
 /* Capture an anonymous pager cluster. (Page cluser is
@@ -3446,11 +3446,11 @@ capture_page_cluster(reiser4_cluster_t *
 	if (unlikely(result)) {
 		/* set cleared tag back, so it will be
 		   possible to capture it again later */
-		read_lock_irq(&inode->i_mapping->tree_lock);
+		write_lock_irq(&inode->i_mapping->tree_lock);
 		radix_tree_tag_set(&inode->i_mapping->page_tree,
 				   clust_to_pg(clust->index, inode),
 				   PAGECACHE_TAG_REISER4_MOVED);
-		read_unlock_irq(&inode->i_mapping->tree_lock);
+		write_unlock_irq(&inode->i_mapping->tree_lock);
 
 		release_cluster_pages_and_jnode(clust);
 	}
Index: linux/fs/reiser4/plugin/file/file.c
===================================================================
--- linux.orig/fs/reiser4/plugin/file/file.c
+++ linux/fs/reiser4/plugin/file/file.c
@@ -832,9 +832,9 @@ static int has_anonymous_pages(struct in
 {
 	int result;
 
-	read_lock_irq(&inode->i_mapping->tree_lock);
+	write_lock_irq(&inode->i_mapping->tree_lock);
 	result = radix_tree_tagged(&inode->i_mapping->page_tree, PAGECACHE_TAG_REISER4_MOVED);
-	read_unlock_irq(&inode->i_mapping->tree_lock);
+	write_unlock_irq(&inode->i_mapping->tree_lock);
 	return result;
 }
 
@@ -1124,7 +1124,7 @@ static int sync_page_list(struct inode *
 	mapping = inode->i_mapping;
 	from = 0;
 	result = 0;
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	while (result == 0) {
 		struct page *page;
 
@@ -1138,17 +1138,17 @@ static int sync_page_list(struct inode *
 		/* page may not leave radix tree because it is protected from truncating by inode->i_mutex locked by
 		   sys_fsync */
 		page_cache_get(page);
-		read_unlock_irq(&mapping->tree_lock);
+		write_unlock_irq(&mapping->tree_lock);
 
 		from = page->index + 1;
 
 		result = sync_page(page);
 
 		page_cache_release(page);
-		read_lock_irq(&mapping->tree_lock);
+		write_lock_irq(&mapping->tree_lock);
 	}
 
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return result;
 }
 
Index: linux/mm/filemap.c
===================================================================
--- linux.orig/mm/filemap.c
+++ linux/mm/filemap.c
@@ -568,9 +568,9 @@ int probe_page(struct address_space *map
 {
 	int exists;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	exists = __probe_page(mapping, offset);
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	return exists;
 }
@@ -583,11 +583,11 @@ struct page * find_get_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page)
 		page_cache_get(page);
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 
@@ -600,11 +600,11 @@ struct page *find_trylock_page(struct ad
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page && TestSetPageLocked(page))
 		page = NULL;
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 
@@ -626,15 +626,15 @@ struct page *find_lock_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
-			read_unlock_irq(&mapping->tree_lock);
+			write_unlock_irq(&mapping->tree_lock);
 			__lock_page(page);
-			read_lock_irq(&mapping->tree_lock);
+			write_lock_irq(&mapping->tree_lock);
 
 			/* Has the page been truncated while we slept? */
 			if (unlikely(page->mapping != mapping ||
@@ -645,7 +645,7 @@ repeat:
 			}
 		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return page;
 }
 
@@ -718,12 +718,12 @@ unsigned find_get_pages(struct address_s
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 EXPORT_SYMBOL(find_get_pages);
@@ -746,7 +746,7 @@ unsigned find_get_pages_contig(struct ad
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, index, nr_pages);
 	for (i = 0; i < ret; i++) {
@@ -756,7 +756,7 @@ unsigned find_get_pages_contig(struct ad
 		page_cache_get(pages[i]);
 		index++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return i;
 }
 
@@ -770,14 +770,14 @@ unsigned find_get_pages_tag(struct addre
 	unsigned int i;
 	unsigned int ret;
 
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
 	for (i = 0; i < ret; i++)
 		page_cache_get(pages[i]);
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
 EXPORT_SYMBOL(find_get_pages_tag);
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -800,9 +800,9 @@ int mapping_tagged(struct address_space 
 	unsigned long flags;
 	int ret;
 
-	read_lock_irqsave(&mapping->tree_lock, flags);
+	write_lock_irqsave(&mapping->tree_lock, flags);
 	ret = radix_tree_tagged(&mapping->page_tree, tag);
-	read_unlock_irqrestore(&mapping->tree_lock, flags);
+	write_unlock_irqrestore(&mapping->tree_lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL(mapping_tagged);
Index: linux/mm/readahead.c
===================================================================
--- linux.orig/mm/readahead.c
+++ linux/mm/readahead.c
@@ -398,7 +398,7 @@ __do_page_cache_readahead(struct address
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
 		
@@ -409,10 +409,10 @@ __do_page_cache_readahead(struct address
 		if (page)
 			continue;
 
-		read_unlock_irq(&mapping->tree_lock);
+		write_unlock_irq(&mapping->tree_lock);
 		cond_resched();
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
+		write_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
@@ -421,7 +421,7 @@ __do_page_cache_readahead(struct address
 			SetPageReadahead(page);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not
@@ -1318,7 +1318,7 @@ static pgoff_t find_segtail(struct addre
 	pgoff_t ra_index;
 
 	cond_resched();
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	ra_index = radix_tree_scan_hole(&mapping->page_tree, index, max_scan);
 #ifdef DEBUG_READAHEAD_RADIXTREE
 	BUG_ON(!__probe_page(mapping, index));
@@ -1330,7 +1330,7 @@ static pgoff_t find_segtail(struct addre
 	if (ra_index != ~0UL && ra_index - index < max_scan)
 		WARN_ON(__probe_page(mapping, ra_index));
 #endif
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	if (ra_index <= index + max_scan)
 		return ra_index;
@@ -1353,13 +1353,13 @@ static pgoff_t find_segtail_backward(str
 	 * Poor man's radix_tree_scan_data_backward() implementation.
 	 * Acceptable because max_scan won't be large.
 	 */
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	for (; origin - index < max_scan;)
 		if (__probe_page(mapping, --index)) {
-			read_unlock_irq(&mapping->tree_lock);
+			write_unlock_irq(&mapping->tree_lock);
 			return index + 1;
 		}
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	return 0;
 }
@@ -1410,7 +1410,7 @@ static unsigned long query_page_cache_se
 	 * The count here determines ra_size.
 	 */
 	cond_resched();
-	read_lock_irq(&mapping->tree_lock);
+	write_lock_irq(&mapping->tree_lock);
 	index = radix_tree_scan_hole_backward(&mapping->page_tree,
 							offset - 1, ra_max);
 #ifdef DEBUG_READAHEAD_RADIXTREE
@@ -1452,7 +1452,7 @@ static unsigned long query_page_cache_se
 			break;
 
 out_unlock:
-	read_unlock_irq(&mapping->tree_lock);
+	write_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 *  For sequential read that extends from index 0, the counted value
Index: linux/mm/swap_prefetch.c
===================================================================
--- linux.orig/mm/swap_prefetch.c
+++ linux/mm/swap_prefetch.c
@@ -189,10 +189,10 @@ static enum trickle_return trickle_swap_
 	enum trickle_return ret = TRICKLE_FAILED;
 	struct page *page;
 
-	read_lock_irq(&swapper_space.tree_lock);
+	write_lock_irq(&swapper_space.tree_lock);
 	/* Entry may already exist */
 	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
-	read_unlock_irq(&swapper_space.tree_lock);
+	write_unlock_irq(&swapper_space.tree_lock);
 	if (page) {
 		remove_from_swapped_list(entry.val);
 		goto out;

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch] fix spinlock-debug looping
  2006-06-20  9:15                       ` Ingo Molnar
@ 2006-06-20  9:32                         ` Andrew Morton
  2006-06-20  9:34                           ` Ingo Molnar
  2006-06-20 16:02                           ` Dave Olson
  0 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2006-06-20  9:32 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ccb, linux-kernel, Dave Olson

On Tue, 20 Jun 2006 11:15:05 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> * Andrew Morton <akpm@osdl.org> wrote:
> 
> > On Tue, 20 Jun 2006 10:40:01 +0200
> > Ingo Molnar <mingo@elte.hu> wrote:
> > 
> > > i obviously agree that any such crash is a serious problem, but is 
> > > it caused by the spinlock-debugging code?
> > 
> > Judging from Dave Olson <olson@unixfolk.com>'s report: no.  He's 
> > getting hit by NMI watchdog expiry on write_lock(tree_lock) in a 
> > !CONFIG_DEBUG_SPINLOCK kernel.
> 
> hm, that means 5 seconds of looping with irqs off?

Yup.

> That's really insane. 

Yup.

> Is there any definitive testcase or testsystem where we could try a 
> simple tree_lock rwlock -> spinlock conversion?

Not that I'm aware of.  I just tried three CPUs doing
fadvise(FADV_WILLNEED, 1GB) with the fourth CPU trying to write the file,
but it didn't lock up as I expected.

> Spinlocks are a lot 
> fairer. Or as a simple experiment, s/read_lock/write_lock, as the patch 
> below (against rc6-mm2) does. This is phase #1, if it works out we can 
> switch tree_lock to a spinlock. [write_lock()s are roughly as fair to 
> each other as spinlocks - it's a bit more expensive but not 
> significantly] Builds & boots fine here.

tree_lock was initially an rwlock.  Then we made it a spinlock.  Then we
made it an rwlock.  We change the dang thing so often we should make it a
macro ;)

Let's just make it a spinlock and be done with it.  Hopefully Dave or
ccb@acm.org (?) will be able to test it.  I was planning on doing a patch
tomorrowish.


* Re: [patch] fix spinlock-debug looping
  2006-06-20  9:32                         ` Andrew Morton
@ 2006-06-20  9:34                           ` Ingo Molnar
  2006-06-20 16:02                           ` Dave Olson
  1 sibling, 0 replies; 18+ messages in thread
From: Ingo Molnar @ 2006-06-20  9:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ccb, linux-kernel, Dave Olson


* Andrew Morton <akpm@osdl.org> wrote:

> > Spinlocks are a lot fairer. Or as a simple experiment, 
> > s/read_lock/write_lock, as the patch below (against rc6-mm2) does. 
> > This is phase #1, if it works out we can switch tree_lock to a 
> > spinlock. [write_lock()s are roughly as fair to each other as 
> > spinlocks - it's a bit more expensive but not significantly] Builds 
> > & boots fine here.
> 
> tree_lock was initially an rwlock.  Then we made it a spinlock.  Then 
> we made it an rwlock.  We change the dang thing so often we should 
> make it a macro ;)

ha! In -rt we can change types of locks by changing the type definition 
and the declaration only ;-) [It makes for some confusing reading though 
if done without restraint]

> Let's just make it a spinlock and be done with it.  Hopefully Dave or 
> ccb@acm.org (?) will be able to test it.  I was planning on doing a 
> patch tomorrowish.

ok. Until that happens the patch i sent can be used for testing.

	Ingo


* Re: [patch] fix spinlock-debug looping
  2006-06-20  9:32                         ` Andrew Morton
  2006-06-20  9:34                           ` Ingo Molnar
@ 2006-06-20 16:02                           ` Dave Olson
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Olson @ 2006-06-20 16:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ingo Molnar, ccb, linux-kernel

On Tue, 20 Jun 2006, Andrew Morton wrote:
| Let's just make it a spinlock and be done with it.  Hopefully Dave or
| ccb@acm.org (?) will be able to test it.  I was planning on doing a patch
| tomorrowish.

Our test case is extremely reproducible in about a minute, so
I should be able to give a pretty conclusive answer as to whether
the patch works for our case.

Dave Olson
olson@unixfolk.com
http://www.unixfolk.com/dave


end of thread, other threads:[~2006-06-20 16:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-06-12 19:53 BUG: write-lock lockup Charles C. Bennett, Jr.
2006-06-17 17:07 ` Andrew Morton
2006-06-19  7:02   ` [patch] increase spinlock-debug looping timeouts from 1 sec to 1 min Ingo Molnar
2006-06-19  7:59     ` Andrew Morton
2006-06-19  8:12       ` Ingo Molnar
2006-06-19  8:21         ` Ingo Molnar
2006-06-19  8:32         ` Andrew Morton
2006-06-19  8:35           ` Ingo Molnar
2006-06-19  9:13             ` Andrew Morton
2006-06-19 11:39               ` Ingo Molnar
2006-06-19 19:55                 ` Andrew Morton
2006-06-20  8:06                   ` Arjan van de Ven
2006-06-20  8:40                   ` [patch] fix spinlock-debug looping Ingo Molnar
2006-06-20  8:52                     ` Andrew Morton
2006-06-20  9:15                       ` Ingo Molnar
2006-06-20  9:32                         ` Andrew Morton
2006-06-20  9:34                           ` Ingo Molnar
2006-06-20 16:02                           ` Dave Olson
