Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754234Ab1IGBa4 (ORCPT ); Tue, 6 Sep 2011 21:30:56 -0400
Received: from cantor2.suse.de ([195.135.220.15]:46977 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753812Ab1IGBat (ORCPT ); Tue, 6 Sep 2011 21:30:49 -0400
Date: Wed, 7 Sep 2011 11:30:38 +1000
From: NeilBrown
To: Anthony DeRobertis
Cc: "Linux-kernel mailing list"
Subject: Re: Hard lockup in 3.0.3 with Oracle & mdraid check
Message-ID: <20110907113038.1fed2304@notabene.brown>
In-Reply-To: <4E6634B2.6030204@metrics.net>
References: <4E6634B2.6030204@metrics.net>
X-Mailer: Claws Mail 3.7.9 (GTK+ 2.22.1; x86_64-unknown-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 06 Sep 2011 10:56:50 -0400 Anthony DeRobertis wrote:

> [please CC me on replies; I'm not subscribed to linux-kernel]
>
> This is a dual-Xeon E5649, 96GiB RAM, Linux 3.0.3 (+ debian patches, + 1
> local, see bottom of message) running Oracle 11gR2. Oracle is running
> with the data on raw devices, those devices being LVM logical volumes.
> There are several volume groups, each one having one PV. The PV is a
> mdraid RAID10. The underlying disks are 22 SAS, 2 SATA. Most of Oracle's
> IO is going to the SAS disks (the SATA ones are used for archive log).
>
> This had been running fine for several days, until cron fired off a
> check of the raid partitions. Less than a minute after the Oracle ones
> started being checked, the machine died (md0 is /boot, 125 holds
> archive, 126 holds redo, 127 holds data + index)
>
> The machine has ECC memory, and I have seen some corrected errors
> reported (but no uncorrected errors).
>
> [290510.837380] md: data-check of RAID array md0
> [290510.841778] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290510.847723] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290510.857568] md: using 128k window, over a total of 123840k.
> [290510.858748] md: delaying data-check of md125 until md0 has finished (they share one or more physical units)
> [290510.879077] md: delaying data-check of md126 until md0 has finished (they share one or more physical units)
> [290510.879085] md: delaying data-check of md125 until md0 has finished (they share one or more physical units)
> [290510.899702] md: delaying data-check of md127 until md0 has finished (they share one or more physical units)
> [290573.795960] md: md0: data-check done.
> [290573.937561] md: data-check of RAID array md126
> [290573.937564] md: data-check of RAID array md125
> [290573.937567] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290573.937568] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290573.937577] md: using 128k window, over a total of 976635768k.
> [290573.937899] md: data-check of RAID array md127
> [290573.937912] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290573.937914] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290573.937949] md: using 128k window, over a total of 572485632k.
> [290573.994395] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [290574.000317] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
> [290574.010139] md: using 128k window, over a total of 143121408k.
> [290624.606210] ------------[ cut here ]------------
> [290624.610918] WARNING: at /home/anthony-ldap/linux/linux-2.6-3.0.0/debian/build/source_amd64_none/kernel/watchdog.c:240 watchdog_overflow_callback+0x96/0xa1()
> [290624.624974] Hardware name: X8DT6
> [290624.628282] Watchdog detected hard LOCKUP on cpu 0
> [290624.632978] Modules linked in: ext2 loop snd_pcm snd_timer snd usbhid hid uhci_hcd soundcore ioatdma ahci snd_page_alloc ehci_hcd tpm_tis libahci i7core_edac tpm psmouse ses ghes dca i2c_i801 tpm_bios hed evdev edac_core libata usbcore serio_raw pcspkr e1000e i2c_core enclosure processor thermal_sys button ext4 mbcache jbd2 crc16 dm_mod raid10 raid1 md_mod shpchp pci_hotplug sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class scsi_mod
> [290624.672916] Pid: 3944, comm: oracle Not tainted 3.0.0-1-amd64 #1
> [290624.679004] Call Trace:
> [290624.681535] [] ? warn_slowpath_common+0x78/0x8c
> [290624.688422] [] ? warn_slowpath_fmt+0x45/0x4a
> [290624.694426] [] ? watchdog_overflow_callback+0x96/0xa1
> [290624.701209] [] ? __perf_event_overflow+0x101/0x198
> [290624.707734] [] ? intel_pmu_enable_all+0x9d/0x144
> [290624.714082] [] ? intel_pmu_handle_irq+0x40e/0x481
> [290624.720521] [] ? perf_event_nmi_handler+0x39/0x82
> [290624.726954] [] ? notifier_call_chain+0x2e/0x5b
> [290624.733123] [] ? notify_die+0x2d/0x32
> [290624.738518] [] ? do_nmi+0x63/0x206
> [290624.743647] [] ? nmi+0x20/0x30
> [290624.748430] [] ? try_to_wake_up+0x73/0x18c
> [290624.754259] <> [] ? __wake_up_common+0x41/0x78
> [290624.761598] [] ? __wake_up+0x35/0x46
> [290624.766909] [] ? raid_end_bio_io+0x30/0x76 [raid10]
> [290624.773519] [] ? raid10_end_write_request+0xdc/0xbe5 [raid10]
> [290624.780995] [] ? is_swiotlb_buffer+0x26/0x31
> [290624.786998] [] ? blk_update_request+0x1a6/0x35d
> [290624.793261] [] ? blk_update_bidi_request+0x11/0x5b
> [290624.799782] [] ? blk_end_bidi_request+0x19/0x55
> [290624.806043] [] ? scsi_io_completion+0x1d0/0x48e [scsi_mod]
> [290624.813257] [] ? mpt2sas_base_free_smid+0x118/0x20e [mpt2sas]
> [290624.820731] [] ? blk_done_softirq+0x6b/0x78
> [290624.826647] [] ? __do_softirq+0xc4/0x1a0
> [290624.832304] [] ? handle_irq_event_percpu+0x166/0x184
> [290624.839001] [] ? call_softirq+0x1c/0x30
> [290624.844571] [] ? do_softirq+0x3f/0x79
> [290624.849966] [] ? irq_exit+0x44/0xb5
> [290624.855183] [] ? do_IRQ+0x94/0xaa
> [290624.860232] [] ? common_interrupt+0x13/0x13
> [290624.866140] [] ? scsi_request_fn+0x468/0x49d [scsi_mod]
> [290624.873723] [] ? scsi_request_fn+0x191/0x49d [scsi_mod]
> [290624.880679] [] ? blk_flush_plug_list+0x194/0x1d1
> [290624.887030] [] ? schedule+0x243/0x61a
> [290624.892425] [] ? wait_barrier+0x8e/0xc7 [raid10]
> [290624.898774] [] ? try_to_wake_up+0x18c/0x18c
> [290624.904691] [] ? make_request+0x17b/0x4fb [raid10]
> [290624.911214] [] ? ll_new_hw_segment+0x1b/0x7e
> [290624.917221] [] ? md_make_request+0xc6/0x1c1 [md_mod]
> [290624.923913] [] ? generic_make_request+0x2cb/0x341
> [290624.930345] [] ? dm_get_live_table+0x35/0x3d [dm_mod]
> [290624.937124] [] ? submit_bio+0xda/0xf8
> [290624.942516] [] ? set_page_dirty_lock+0x21/0x29
> [290624.948691] [] ? dio_bio_submit+0x6c/0x8a
> [290624.954433] [] ? dio_send_cur_page+0x6e/0x93
> [290624.960427] [] ? submit_page_section+0xb5/0x135
> [290624.966683] [] ? __blockdev_direct_IO+0x670/0x8ed
> [290624.973123] [] ? scsi_dma_map+0x7a/0x93 [scsi_mod]
> [290624.979643] [] ? blkdev_direct_IO+0x4e/0x53
> [290624.985559] [] ? blkdev_get_block+0x5b/0x5b
> [290624.991478] [] ? generic_file_aio_read+0xed/0x5c3
> [290624.997914] [] ? virt_to_slab+0x9/0x3c
> [290625.003396] [] ? lock_page_killable+0x2c/0x2c
> [290625.009480] [] ? aio_rw_vect_retry+0x7d/0x180
> [290625.015567] [] ? aio_run_iocb+0x6b/0x132
> [290625.021215] [] ? do_io_submit+0x419/0x4c8
> [290625.026951] [] ? system_call_fastpath+0x16/0x1b
> [290625.033212] ---[ end trace 3a6a21008417f262 ]---

The implication is that something in this stack is looping endlessly.
Unfortunately it is not clear what - and there are several different
subsystems there that could be blamed.

If this happens again then comparing the new trace with the old could be
very informative - it would point the finger at the highest item in the
stack which is common to both.

Without that extra information - it is just guesswork to go forward.

NeilBrown

> The log keeps on going for a while, with more "rcu_sched_state detected
> stalls on CPUs/tasks: { 0}" messages. It kept spewing them, doing no
> useful work, until I reset the machine. Log (from a serial console) is
> 87M total. 12M compressed. I've placed it at:
>
>
>
> The extra patch just gets rid of REQ_FLUSH/FUA in mdraid, hence the
> 'eatmydata' in the kernel version. This makes the Oracle workload an
> order of magnitude faster. Here it is:
>
> Index: linux-2.6-3.0.0/drivers/md/md.c
> ===================================================================
> --- linux-2.6-3.0.0.orig/drivers/md/md.c	2011-08-31 15:23:46.474036586 -0400
> +++ linux-2.6-3.0.0/drivers/md/md.c	2011-08-31 15:24:01.294007938 -0400
> @@ -4361,7 +4361,7 @@
>  	disk->fops = &md_fops;
>  	disk->private_data = mddev;
>  	disk->queue = mddev->queue;
> -	blk_queue_flush(mddev->queue, REQ_FLUSH | REQ_FUA);
> +	blk_queue_flush(mddev->queue, 0); /* omm nom nom, data tasty! */
>  	/* Allow extended partitions.  This makes the
>  	 * 'mdp' device redundant, but we can't really
>  	 * remove it now.
>
>
>
> [please CC me on replies; I'm not subscribed to linux-kernel]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/