From mboxrd@z Thu Jan 1 00:00:00 1970
From: Tregaron Bayly
Subject: Kernel panic after hot remove in raid1d
Date: Thu, 28 Mar 2013 19:33:56 -0600
Message-ID: <1364520836.2613.205.camel@148>
Reply-To: tbayly@bluehost.com
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-15"
Content-Transfer-Encoding: 7bit
Return-path:
Sender: linux-raid-owner@vger.kernel.org
To: linux-raid
List-Id: linux-raid.ids

We have around 50 boxes running kernel 2.6.32-220.23.1.el6.x86_64
(mdadm version 3.2.5-4) with RAID1 arrays built out of iscsi mounts -
primarily mounted as backup disks. Last night as backups kicked off to
use the mirror 21 of them panicked with this stack (or very close to
it):

Call Trace:
 [] ? panic+0x78/0x143
 [] ? oops_end+0xe4/0x100
 [] ? no_context+0xe4/0x100
 [] ? find_busiest_group+0x244/0x9f0
 [] ? __bad_area_nosemaphore+0x125/0x1e0
 [] ? bad_area_no_semaphore+0x13/0x20
 [] ? __do_page_fault+0x31d/0x480
 [] ? __switch_to+0x2c2/0x320
 [] ? thread_return+0x4e/0x76e
 [] ? do_page_fault+0x3e/0xa0
 [] ? page_fault+0x25/0x30
 [] ? bitmap_unplug+0x22f/0x250
 [] ? md_check_recovery+0x4d/0x6d0
 [] ? flush_pending_writes+0x6a/0xc0 [raid1]
 [] ? raid1d+0x8d/0x1050 [raid1]
 [] ? schedule_timeout+0x215/0x2e0
 [] ? md_thread+0x116/0x150
 [] ? autoremove_wake_function+0x0/0x40
 [] ? md_thread+0x0/0x150
 [] ? kthread+0x96/0xa0
 [] ? child_rip+0xa/0x20
 [] ? kthread+0x0/0xa0
 [] ? child_rip+0x0/0x20

Unfortunately I don't have console logs for what happened immediately
preceding it, but it seems safe to assume based on bitmap_unplug and
the synchronized nature of the panic we lost communication to one of
the iscsi targets.

Today playing around in my lab I was able to trigger it by doing:

    mdadm --manage /dev/md/bigcarve --fail /dev/dm-0
    mdadm --manage /dev/md/bigcarve --remove /dev/dm-0

and then doing an rm in the filesystem, but I can't duplicate it at
will. I'd love to move to a 3.4 kernel but unfortunately I need a
little more to go on than a personal gut feeling to get the move
approved.

I realize it's a long shot, but does anyone have any insight into what
may have gone awry here and what could be done to address it? Changes
in recovery / bitmaps / hot remove in later kernels?

Thanks in advance,

Tregaron