From mboxrd@z Thu Jan 1 00:00:00 1970
From: James Bottomley
Subject: Re: [PATCH] mvsas: fix default can_queue
Date: Wed, 05 Mar 2008 15:02:40 -0600
Message-ID: <1204750960.3047.67.camel@localhost.localdomain>
References: <1204308113.4003.45.camel@localhost.localdomain>
 <1204504945.3069.30.camel@localhost.localdomain>
 <6b2481670803030017h43da68bcxd78a6142f8f5c6bb@mail.gmail.com>
 <1204556371.3043.7.camel@localhost.localdomain>
 <1204682849.3091.95.camel@localhost.localdomain>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Return-path:
Received: from accolon.hansenpartnership.com ([76.243.235.52]:53935
 "EHLO accolon.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1755031AbYCEVCo (ORCPT );
 Wed, 5 Mar 2008 16:02:44 -0500
In-Reply-To: <1204682849.3091.95.camel@localhost.localdomain>
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Ke Wei
Cc: linux-scsi, jgarzik

On Tue, 2008-03-04 at 20:07 -0600, James Bottomley wrote:
> On Mon, 2008-03-03 at 08:59 -0600, James Bottomley wrote:
> > On Mon, 2008-03-03 at 16:17 +0800, Ke Wei wrote:
> > > On Mon, Mar 3, 2008 at 8:42 AM, James Bottomley
> > > wrote:
> > > >
> > > > On Fri, 2008-02-29 at 12:01 -0600, James Bottomley wrote:
> > > > > I noticed that the current marvell sas driver wasn't performing very
> > > > > well. It turns out that it's setting can_queue not in the SCSI host,
> > > > > but in its own internal data structure, meaning it's always operating
> > > > > with a global queue depth of one. This patch raises it to what the code
> > > > > seemed to be intending ... although I think can_queue should be
> > > > > MVS_CHIP_SLOT_SZ - 1 (without the divide by two)?
> > > > >
> > > > > The good news is that with this change, I'm getting a respectable
> > > > > throughput on the fio hammer test; plus zapping random phy resets across
> > > > > the disk triggers error handler recovery correctly (so far).
> > > > >
> > > > > I'm having less happy results with a SATAPI DVD ... it looks like the
> > > > > initial IDENTIFY goes across just fine, but that we stall on the other
> > > > > SCSI commands ... I'm still investigating this one.
> > > >
> > > > Actually, I've run into another problem with this patch applied. It
> > > > looks like NCQ fails with ATA disks. What I see is that I/O goes fine
> > > > until I get more than one command outstanding to the device, then the
> > > > device stops responding. I can keep the I/O flowing if I clamp the
> > > > device queue depth at 1. SAS disks seem to be fine ... I can get
> > > > multiple outstanding commands to them correctly serviced.
> > >
> > > Yes, I have to say that testing failed when I plugged in SATA and SAS
> > > disks. Sometimes "insmod mvsas" will cause the system to hang.
> > > It only looks good if can_queue is set to 1. I will investigate this case.
> >
> > Thanks. For the NCQ case, it does look like turning NCQ off makes the
> > disk work fine, so I'd suspect some issue with NCQ handling.
> >
> > > > I'm having less happy results with a SATAPI DVD ... it looks like the
> > > > initial IDENTIFY goes across just fine, but that we stall on the other
> > > > SCSI commands ... I'm still investigating this one.
> > >
> > > I think we need to set BLIST_NOREPORTLUN or some other flags (see
> > > scsi_devinfo.h) for some new ATAPI devices. When calling
> > > scsi_report_lun_scan, it will bypass the REPORT_LUNS command.
> >
> > It doesn't seem to be anything the DVD does ... it works fine with the
> > aic94xx controller doing SATAPI (it sends the correct reply to REPORT
> > LUNS). It looks like the first hang comes at around the second or third
> > Test Unit Ready.
> >
> > Traces seem to show IDENTIFY_PACKET, INQUIRY, INQUIRY, TUR, TUR (hang)
> > and then every following command hangs, but I'll try to instrument more
> > accurate tracing.
>
> OK, I instrumented more ... you're right, the first failing command is
> REPORT_LUNS.
> The failure isn't because the DVD doesn't accept the
> command, but because it gets errored and we fail to report back the
> error data.
>
> What I see is the mvsas driver returning RXQ_ERR, so the device is
> trying to terminate the transaction with an error code. Unfortunately,
> when it sees this code, mvsas does nothing at all, leaving the request
> to time out and be aborted (even though it already finished).
>
> I can plumb it in ... it looks like what we should also be doing is
> calling mvs_slot_complete(), but this still isn't quite correct ... it
> just sets SAM_STAT_CHECK_COND ... it needs to collect the ATA error
> code somehow.

Just by way of update, the slot is completing with RXQ_ERR set, but
RXQ_DONE clear. The mvs_err_info field has TFILE_ERR set (the only set
bit) and MVS_INT_STAT_SRS is zero.

I assume the slot processing has halted, and that we need to collect the
task file error registers and resume it somehow, but how?

James
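
For illustration only, the decision described above (a slot that comes
back with RXQ_ERR set and RXQ_DONE clear still has to be completed, and
a TFILE_ERR in the error info means the ATA error still has to be
collected) might be sketched roughly as below. This is not the mvsas
code. The flag names come from this thread, but the bit positions, the
enum and the function classify_rx_entry() are hypothetical, chosen only
so the sketch compiles in user space.

```c
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. The real values
 * are defined in the mvsas driver and are not given in this thread. */
#define RXQ_DONE   (1u << 17)
#define RXQ_ERR    (1u << 19)
#define TFILE_ERR  (1u << 31)

/* Hypothetical outcomes for one RX ring entry. */
enum slot_action {
	SLOT_IGNORE,            /* neither done nor errored: nothing to do */
	SLOT_COMPLETE_OK,       /* normal completion */
	SLOT_COMPLETE_ERR,      /* complete now, report a generic error */
	SLOT_COMPLETE_ATA_ERR,  /* complete now, task file error must be read */
};

/*
 * Decide what to do with one RX ring entry.
 *
 * The error bit is checked first, so an entry with RXQ_ERR set is
 * completed even when RXQ_DONE is clear, instead of being left to
 * time out. err_info stands in for the mvs_err_info field.
 */
static enum slot_action classify_rx_entry(uint32_t rx_desc, uint32_t err_info)
{
	if (rx_desc & RXQ_ERR) {
		if (err_info & TFILE_ERR)
			return SLOT_COMPLETE_ATA_ERR;
		return SLOT_COMPLETE_ERR;
	}
	if (rx_desc & RXQ_DONE)
		return SLOT_COMPLETE_OK;
	return SLOT_IGNORE;
}
```

With the state reported above (RXQ_ERR set, RXQ_DONE clear, TFILE_ERR
the only error bit), this sketch picks the SLOT_COMPLETE_ATA_ERR
branch. How the task file registers are then read and how slot
processing is resumed is exactly the open question, and the sketch does
not attempt to answer it.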