From mboxrd@z Thu Jan 1 00:00:00 1970
From: Harri Olin
Subject: Re: sata_mv, io stucks
Date: Sun, 16 Nov 2008 01:41:54 +0200
Message-ID: <491F5E42.8010906@gmail.com>
References: <48F88449.1000704@ngs.ru> <49003B9C.1010303@ngs.ru> <4900A12F.3030307@rtr.ca> <491EE84B.1010600@gmail.com> <491F4096.9090701@rtr.ca>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from gw02.mail.saunalahti.fi ([195.197.172.116]:46118 "EHLO gw02.mail.saunalahti.fi" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751346AbYKOXmF (ORCPT ); Sat, 15 Nov 2008 18:42:05 -0500
In-Reply-To: <491F4096.9090701@rtr.ca>
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: Mark Lord
Cc: linux-ide@vger.kernel.org

Mark Lord wrote:
> Harri Olin wrote:
>> Mark Lord wrote:
>>>>> Two marvell controllers, 16 disks, software raid10, IO stucks on
>>>>> different disks, kernel 2.6.26.5.
>>>>> With default ubuntu's 8.04 2.6.24 kernel the problem can not be
>>>>> repeated
>>>>>
>>>>>
>>>>> [ 289.851609] ata11.00: exception Emask 0x0 SAct 0x1 SErr 0x0
>>>>> action 0x6 frozen
>>>>> [ 289.851695] ata11.00: cmd 61/08:00:60:1e:bf/00:00:01:00:00/40
>>>>> tag 0 ncq 4096 out
>>>>> [ 289.851697] res 40/00:00:00:00:00/00:00:00:00:00/00
>>>>> Emask 0x4 (timeout)
>>>>> [ 289.851774] ata11.00: status: { DRDY }
>>>>> [ 289.851834] ata11: hard resetting link
>>>>> [ 290.649259] ata11: SATA link up 3.0 Gbps (SStatus 123 SControl
>>>>> 300)
>>>>> [ 290.749239] ata11.00: max_sectors limited to 256 for NCQ
>>>>> [ 290.809189] ata11.00: max_sectors limited to 256 for NCQ
>>>>> [ 290.809194] ata11.00: configured for UDMA/133
>>>>> [ 290.809200] ata11: EH complete
>>>>> [ 290.809242] sd 10:0:0:0: [sdk] 1953525168 512-byte hardware
>>>>> sectors (1000205 MB)
>>>>> [ 290.809258] sd 10:0:0:0: [sdk] Write Protect is off
>>>>> [ 290.809263] sd 10:0:0:0: [sdk] Mode Sense: 00 3a 00 00
>>>>> [ 290.809286] sd 10:0:0:0: [sdk] Write cache: enabled, read
>>>>> cache: enabled, doesn't support DPO or FUA
>>> ...
>>>
>>> I've just returned here from a month holiday in Italy,
>>> and I'll have a look at this and other sata_mv issues
>>> next week or so.
>>
>> I ran git-bisect on it and it returned
>> a3718c1f230240361ed92d3e53342df0ff7efa8c as first bad commit. Also
>> verified by hand that patching it on working tree breaks it.
> Looking at later kernels (after the commit in question), I see that
> the code was further fixed to remove some possible races and stuff,
> but that's still just 2.6.26.5, which you guys see failures on.
>
> So here's some instrumentation to help us figure it out.
> Please apply and report back once it triggers again.
> Thanks.

I have to take back that bisect: just a couple of minutes ago it
happened again with the last 'good' kernel from the bisect. The stalls
are just much less frequent there. I also noticed that current kernels
are much better too.

pre-..0ff7efa8c: only once after 6 hours of testing
post-..0ff7efa8c: one hd stalled while the filesystem was mounting.
  Before boot was complete, 3 stalls. Also at shutdown the kernel hung
  at "Synchronizing SCSI cache" for a while.
2.6.27: once in 5 minutes or so under heavy load

When some hd/port stalls, the other ports still work fine.
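
To compare stall frequency between kernels less by eye, the timeout reports could be counted per port from saved dmesg output. This is only a rough sketch: the function name count_timeouts and both regexes are my own assumptions based on the log format quoted in this thread, not anything provided by libata or sata_mv.

```python
import re
from collections import Counter

# Matches the first line of an EH report, e.g.
# "ata11.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen"
EXCEPTION_RE = re.compile(r"(ata\d+)(?:\.\d+)?: exception Emask")
# Matches the result line of a timed-out command, e.g.
# "res 40/00:... Emask 0x4 (timeout)"
TIMEOUT_RE = re.compile(r"Emask 0x[0-9a-f]+ \(timeout\)")


def count_timeouts(log_text):
    """Count EH reports that end in a timeout, grouped by ATA port.

    Hypothetical helper: assumes each report starts with an
    "exception Emask" line and that its "res ... (timeout)" line
    follows on a later line of its own, as in unwrapped dmesg output.
    """
    counts = Counter()
    current_port = None
    for line in log_text.splitlines():
        match = EXCEPTION_RE.search(line)
        if match:
            current_port = match.group(1)
            continue
        if current_port and TIMEOUT_RE.search(line):
            counts[current_port] += 1
            # Count each report once, even if several tags timed out.
            current_port = None
    return counts


if __name__ == "__main__":
    import sys

    for port, hits in sorted(count_timeouts(sys.stdin.read()).items()):
        print(port, hits)
```

Feeding it `dmesg` output from each test kernel after a run of the same length should give numbers that are easier to compare than "once in 5 minutes or so".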

I applied your patch on 2.6.27.1, but it produced no extra output:

ata14.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata14.00: cmd 61/08:00:3f:52:54/00:00:57:00:00/40 tag 0 ncq 4096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata14.00: status: { DRDY }
ata14: hard resetting link
ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata14.00: max_sectors limited to 256 for NCQ
ata14.00: max_sectors limited to 256 for NCQ
ata14.00: configured for UDMA/133
ata14: EH complete
sd 13:0:0:0: [sdh] 1465149168 512-byte hardware sectors (750156 MB)
sd 13:0:0:0: [sdh] Write Protect is off
sd 13:0:0:0: [sdh] Mode Sense: 00 3a 00 00
sd 13:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Do I have to enable something somewhere else too?

I also compiled and patched the linux-2.6-stable tree from git, but it
just panicked after a stall instead of recovering. I'm currently trying
to reproduce that on a second computer where I can capture the panic.

--
Harri.
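
For reading the "cmd" lines in these reports, here is a small sketch that splits the taskfile into fields and rebuilds the LBA and sector count. It assumes the field order libata's error handler prints (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and that for NCQ commands (0x60 READ FPDMA QUEUED, 0x61 WRITE FPDMA QUEUED) the sector count lives in the feature fields and the tag in the upper bits of nsect. The name decode_taskfile is hypothetical.

```python
import re

# Assumed layout of the libata "cmd" line:
# cmd CC/FF:NS:L0:L1:L2/HF:HN:H0:H1:H2/DD
TASKFILE_RE = re.compile(
    r"cmd\s+([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)

NCQ_COMMANDS = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_taskfile(line):
    """Return command, LBA, sector count and tag for an NCQ "cmd" line.

    Hypothetical helper. Returns None if the line has no taskfile or
    the command is not one of the two NCQ commands handled here.
    """
    match = TASKFILE_RE.search(line)
    if match is None:
        return None
    command = int(match.group(1), 16)
    if command not in NCQ_COMMANDS:
        return None
    feature, nsect, lbal, lbam, lbah = (
        int(byte, 16) for byte in match.group(2).split(":")
    )
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(byte, 16) for byte in match.group(3).split(":")
    )
    # 48-bit LBA: high-order bytes first, then the low-order bytes.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": NCQ_COMMANDS[command],
        "lba": lba,
        "sectors": (hob_feature << 8) | feature,
        "tag": nsect >> 3,
    }


if __name__ == "__main__":
    line = "ata14.00: cmd 61/08:00:3f:52:54/00:00:57:00:00/40 tag 0 ncq 4096 out"
    print(decode_taskfile(line))
```

Under those assumptions the ata14 command above is an 8-sector write, which agrees with the "ncq 4096 out" that libata prints for it (8 x 512 bytes).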