From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ren Mingxin Subject: Re: [PATCHv3 0/9] New EH command timeout handler Date: Mon, 15 Jul 2013 14:05:04 +0800 Message-ID: <51E39110.1080004@cn.fujitsu.com> References: <1372688671-85639-1-git-send-email-hare@suse.de> <51DF82B9.8030406@cn.fujitsu.com> <51DF9DB0.7080502@suse.de> <51DFD3D9.4080306@cn.fujitsu.com> <51DFDA12.4080905@suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from cn.fujitsu.com ([222.73.24.84]:27571 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1753908Ab3GOGBJ (ORCPT ); Mon, 15 Jul 2013 02:01:09 -0400 In-Reply-To: <51DFDA12.4080905@suse.de> Sender: linux-scsi-owner@vger.kernel.org List-Id: linux-scsi@vger.kernel.org To: Hannes Reinecke Cc: James Bottomley , linux-scsi@vger.kernel.org, Ewan Milne , Bart van Assche , Joern Engel , James Smart , Roland Dreier Hi, Hannes: On 07/12/2013 06:27 PM, Hannes Reinecke wrote: > On 07/12/2013 12:00 PM, Ren Mingxin wrote: >> On 07/12/2013 02:09 PM, Hannes Reinecke wrote: >>> On 07/12/2013 06:14 AM, Ren Mingxin wrote: >>>> On 07/01/2013 10:24 PM, Hannes Reinecke wrote: >>>>> With the original SCSI EH I got: >>>>> # time dd if=/dev/zero of=/dev/dm-2 bs=4k count=4k oflag=direct >>>>> 4096+0 records in >>>>> 4096+0 records out >>>>> 16777216 bytes (17 MB) copied, 142.652 s, 118 kB/s >>>>> >>>>> real 2m22.657s >>>>> user 0m0.013s >>>>> sys 0m0.145s >>>>> >>>>> With this patchset I got: >>>>> # time dd if=/dev/zero of=/dev/dm-2 bs=4k count=4k oflag=direct >>>>> 4096+0 records in >>>>> 4096+0 records out >>>>> 16777216 bytes (17 MB) copied, 52.1579 s, 322 kB/s >>>>> >>>>> real 0m52.163s >>>>> user 0m0.012s >>>>> sys 0m0.145s >>>>> >>>>> Test was to disable RSCN on the target port, disable the >>>>> target port, and then start the 'dd' command as indicated. >>>> >>>> Do you mean disabling RSCN/port is enough? I'm afraid I couldn't >>>> reproduce the problem by your steps. Both with and without your >>>> patchset are the same 'dd' result: 27s. Please let me know where I >>>> neglected or mistook: >>>> >>>> 1) I made a dm-multipath target 'dm-0' whose grouping policy was >>>> failover; >>>> 2) Disable RSCN/port via brocade fc switch: >>>> SW300:root> portcfg rscnsupr 15 --enable; portDisable 15 >>>> 3) Start the 'dd' command: >>>> # time dd if=/dev/zero of=/dev/dm-0 bs=4k count=4k oflag=direct >>>> dd: writing `/dev/sde': Input/output error >>>> 1+0 records in >>>> 0+0 records out >>>> 0 bytes (0 B) copied, 27.8588 s, 0.0 kB/s >>>> >>>> real 0m27.860s >>>> user 0m0.001s >>>> sys 0m0.000s >>> >>> You are aware that you have to disable RSCNs on the _target_ port, >>> right? >>> Disabling RSCNs on the _initiator_ ports is a well-tested case, and >>> the one which actually makes sense (and is even implemented in >>> QLogic switches). >>> Disabling RSCNs for the _target_ port, OTOH, has a very questionable >>> nature (hence QLogic switches don't even allow you to do this). >> >> You're right. By disabling RSCNs on target port, I've reproduced this >> problem. Thank you so much. But I've encountered the bug I said >> before. I'll test again with your new patchset once you send. >> > > Could you check with the attached patch? That should convert it to > delayed_work and avoid this issue. Unfortunately, the login prompt couldn't be entered in and BUGs were printed ceaselessly while os booting with this patch. The BUGs are like below: BUG: scheduling while atomic: swapper/0/0/0x10000100 Modules linked in: mptsas(F+) mptscsih(F) mptbase(F) scsi_transport_sas(F) CPU: 0 PID: 0 Comm: swapper/0 Tainted: GF 3.10.0hannes+ #10 Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB-8GDIMM-CN, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 0000000000000000 ffff88047ee03b68 ffffffff8153ada4 ffff88047ee03b78 ffffffff8107389d ffff88047ee03c08 ffffffff8153ca26 ffffffff81a01fd8 0000000000012d00 ffffffff81a00010 0000000000012d00 0000000000012d00 Call Trace: [] dump_stack+0x19/0x1d [] __schedule_bug+0x4d/0x60 [] __schedule+0x646/0x6f0 [] __cond_resched+0x2a/0x40 [] _cond_resched+0x30/0x40 [] start_flush_work+0x2c/0x140 [] flush_work+0x1a/0x40 [] ? try_to_grab_pending+0x109/0x190 [] __cancel_work_timer+0x7e/0x110 [] cancel_delayed_work_sync+0x13/0x20 [] scsi_put_command+0x65/0xa0 [] scsi_next_command+0x3a/0x60 [] scsi_end_request+0xab/0xb0 [] scsi_io_completion+0x9f/0x670 [] scsi_finish_command+0xd4/0x140 [] scsi_softirq_done+0x147/0x170 [] blk_done_softirq+0x74/0x90 [] __do_softirq+0xef/0x260 [] irq_exit+0xb5/0xc0 [] do_IRQ+0x66/0xe0 [] common_interrupt+0x6a/0x6a [] ? clockevents_notify+0x52/0x150 [] ? cpuidle_enter_state+0x53/0xd0 [] ? cpuidle_enter_state+0x4f/0xd0 [] cpuidle_idle_call+0xcf/0x160 [] arch_cpu_idle+0xe/0x30 [] cpu_idle_loop+0x65/0x1f0 [] cpu_startup_entry+0x70/0x80 [] rest_init+0x77/0x80 [] start_kernel+0x41a/0x427 [] ? repair_env_string+0x5b/0x5b [] x86_64_start_reservations+0x2a/0x2c [] x86_64_start_kernel+0x12f/0x136 If there is any info I havn't expatiated, please let me know. Thanks, Ren