* [PATCH] mvsas: fix default can_queue @ 2008-02-29 18:01 James Bottomley 2008-03-03 0:42 ` James Bottomley 0 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2008-02-29 18:01 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi I noticed that the current marvell sas driver wasn't performing very well. It turns out that it's setting can_queue not in the SCSI host, but in its own internal data structure, meaning it's always operating with a global queue depth of one. This patch raises it to what the code seemed to be intending ... although I think can_queue should be MVS_CHIP_SLOT_SZ - 1 (without the divide by two)? The good news is that with this change, I'm getting a respectable throughput on the fio hammer test; plus zapping random phy resets across the disk triggers error handler recovery correctly (so far). I'm having less happy results with a SATAPI DVD ... it looks like the initial IDENTIFY goes across just fine, but that we stall on the other SCSI commands ... I'm still investigating this one. James --- diff --git a/drivers/scsi/mvsas.c b/drivers/scsi/mvsas.c index 5ec0665..a128f67 100755 --- a/drivers/scsi/mvsas.c +++ b/drivers/scsi/mvsas.c @@ -618,7 +618,6 @@ struct mvs_info { struct mvs_phy phy[MVS_MAX_PHYS]; struct mvs_port port[MVS_MAX_PHYS]; - u32 can_queue; /* per adapter */ u32 tag_out; /*Get*/ u32 tag_in; /*Give*/ }; @@ -2150,7 +2149,7 @@ static struct mvs_info *__devinit mvs_alloc(struct pci_dev *pdev, mvi->sas.num_phys = chip->n_phy; mvi->sas.lldd_max_execute_num = MVS_CHIP_SLOT_SZ - 1; mvi->sas.lldd_queue_size = MVS_QUEUE_SIZE; - mvi->can_queue = (MVS_CHIP_SLOT_SZ >> 1) - 1; + mvi->shost->can_queue = (MVS_CHIP_SLOT_SZ >> 1) - 1; mvi->sas.lldd_ha = mvi; mvi->sas.core.shost = mvi->shost; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-02-29 18:01 [PATCH] mvsas: fix default can_queue James Bottomley @ 2008-03-03 0:42 ` James Bottomley 2008-03-03 8:17 ` Ke Wei 0 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2008-03-03 0:42 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi On Fri, 2008-02-29 at 12:01 -0600, James Bottomley wrote: > I noticed that the current marvell sas driver wasn't performing very > well. It turns out that it's setting can_queue not in the SCSI host, > but in its own internal data structure, meaning it's always operating > with a global queue depth of one. This patch raises it to what the code > seemed to be intending ... although I think can_queue should be > MVS_CHIP_SLOT_SZ - 1 (without the divide by two)? > > The good news is that with this change, I'm getting a respectable > throughput on the fio hammer test; plus zapping random phy resets across > the disk triggers error handler recovery correctly (so far). > > I'm having less happy results with a SATAPI DVD ... it looks like the > initial IDENTIFY goes across just fine, but that we stall on the other > SCSI commands ... I'm still investigating this one. Actually, I've run into another problem with this patch applied. It looks like NCQ fails with ATA disks. What I see is that I/O goes fine until I get more than one command outstanding to the device, then the device stops responding. I can keep the I/O flowing if I clamp the device queue depth at 1. SAS disks seem to be fine ... I can get multiple outstanding commands to them correctly serviced. James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-03 0:42 ` James Bottomley @ 2008-03-03 8:17 ` Ke Wei 2008-03-03 14:59 ` James Bottomley 0 siblings, 1 reply; 14+ messages in thread From: Ke Wei @ 2008-03-03 8:17 UTC (permalink / raw) To: James Bottomley; +Cc: linux-scsi On Mon, Mar 3, 2008 at 8:42 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > On Fri, 2008-02-29 at 12:01 -0600, James Bottomley wrote: > > I noticed that the current marvell sas driver wasn't performing very > > well. It turns out that it's setting can_queue not in the SCSI host, > > but in its own internal data structure, meaning it's always operating > > with a global queue depth of one. This patch raises it to what the code > > seemed to be intending ... although I think can_queue should be > > MVS_CHIP_SLOT_SZ - 1 (without the divide by two)? > > > > The good news is that with this change, I'm getting a respectable > > throughput on the fio hammer test; plus zapping random phy resets across > > the disk triggers error handler recovery correctly (so far). > > > > I'm having less happy results with a SATAPI DVD ... it looks like the > > initial IDENTIFY goes across just fine, but that we stall on the other > > SCSI commands ... I'm still investigating this one. > > Actually, I've run into another problem with this patch applied. It > looks like NCQ fails with ATA disks. What I see is that I/O goes fine > until I get more than one command outstanding to the device, then the > device stops responding. I can keep the I/O flowing if I clamp the > device queue depth at 1. SAS disks seem to be fine ... I can get > multiple outstanding commands to them correctly serviced. Yes, I have to say that testing failed when I plugged SATA and SAS disk. Sometimes "insmod mvsas" will cause the system to hang. Only look good if can_queue is set to 1. I will investigate this case. > I'm having less happy results with a SATAPI DVD ... it looks like the > initial IDENTIFY goes across just fine, but that we stall on the other > SCSI commands ... I'm still investigating this one. I think we need set BLIST_NOREPORTLUN or some other flags (see scsi_devinfo.h) about new some ATAPI device.When calling scsi_report_lun_scan , it will bypass REPORT_LUNS command. -- Best Regards, Ke Wei ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-03 8:17 ` Ke Wei @ 2008-03-03 14:59 ` James Bottomley 2008-03-03 16:40 ` James Bottomley 2008-03-05 2:07 ` James Bottomley 0 siblings, 2 replies; 14+ messages in thread From: James Bottomley @ 2008-03-03 14:59 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi On Mon, 2008-03-03 at 16:17 +0800, Ke Wei wrote: > On Mon, Mar 3, 2008 at 8:42 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > On Fri, 2008-02-29 at 12:01 -0600, James Bottomley wrote: > > > I noticed that the current marvell sas driver wasn't performing very > > > well. It turns out that it's setting can_queue not in the SCSI host, > > > but in its own internal data structure, meaning it's always operating > > > with a global queue depth of one. This patch raises it to what the code > > > seemed to be intending ... although I think can_queue should be > > > MVS_CHIP_SLOT_SZ - 1 (without the divide by two)? > > > > > > The good news is that with this change, I'm getting a respectable > > > throughput on the fio hammer test; plus zapping random phy resets across > > > the disk triggers error handler recovery correctly (so far). > > > > > > I'm having less happy results with a SATAPI DVD ... it looks like the > > > initial IDENTIFY goes across just fine, but that we stall on the other > > > SCSI commands ... I'm still investigating this one. > > > > Actually, I've run into another problem with this patch applied. It > > looks like NCQ fails with ATA disks. What I see is that I/O goes fine > > until I get more than one command outstanding to the device, then the > > device stops responding. I can keep the I/O flowing if I clamp the > > device queue depth at 1. SAS disks seem to be fine ... I can get > > multiple outstanding commands to them correctly serviced. > > Yes, I have to say that testing failed when I plugged SATA and SAS > disk. Sometimes "insmod mvsas" will cause the system to hang. > Only look good if can_queue is set to 1. I will investigate this case. Thanks. For the NCQ case, it does look like turning NCQ off makes the disk work fine, so I'd suspect some issue with NCQ handling. > > I'm having less happy results with a SATAPI DVD ... it looks like the > > initial IDENTIFY goes across just fine, but that we stall on the other > > SCSI commands ... I'm still investigating this one. > > I think we need set BLIST_NOREPORTLUN or some other flags (see > scsi_devinfo.h) about new some ATAPI device.When calling > scsi_report_lun_scan , it will bypass REPORT_LUNS command. It doesn't seem to be anything the DVD does ... it works fine with the aic94xx controller doing SATAPI (it sends the correct reply to REPORT LUNS). It looks like the first hang comes at around the second or third Test Unit Ready. Traces seem to show IDENTIFY_PACKET, INQUIRY, INQUIRY, TUR, TUR (hang) and then every following command hangs, but I'll try to instrument more accurate tracing. James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-03 14:59 ` James Bottomley @ 2008-03-03 16:40 ` James Bottomley 2008-03-05 2:07 ` James Bottomley 1 sibling, 0 replies; 14+ messages in thread From: James Bottomley @ 2008-03-03 16:40 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi On Mon, 2008-03-03 at 08:59 -0600, James Bottomley wrote: > On Mon, 2008-03-03 at 16:17 +0800, Ke Wei wrote: > > On Mon, Mar 3, 2008 at 8:42 AM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > On Fri, 2008-02-29 at 12:01 -0600, James Bottomley wrote: > > > > I noticed that the current marvell sas driver wasn't performing very > > > > well. It turns out that it's setting can_queue not in the SCSI host, > > > > but in its own internal data structure, meaning it's always operating > > > > with a global queue depth of one. This patch raises it to what the code > > > > seemed to be intending ... although I think can_queue should be > > > > MVS_CHIP_SLOT_SZ - 1 (without the divide by two)? > > > > > > > > The good news is that with this change, I'm getting a respectable > > > > throughput on the fio hammer test; plus zapping random phy resets across > > > > the disk triggers error handler recovery correctly (so far). > > > > > > > > I'm having less happy results with a SATAPI DVD ... it looks like the > > > > initial IDENTIFY goes across just fine, but that we stall on the other > > > > SCSI commands ... I'm still investigating this one. > > > > > > Actually, I've run into another problem with this patch applied. It > > > looks like NCQ fails with ATA disks. What I see is that I/O goes fine > > > until I get more than one command outstanding to the device, then the > > > device stops responding. I can keep the I/O flowing if I clamp the > > > device queue depth at 1. SAS disks seem to be fine ... I can get > > > multiple outstanding commands to them correctly serviced. > > > > Yes, I have to say that testing failed when I plugged SATA and SAS > > disk. Sometimes "insmod mvsas" will cause the system to hang. > > Only look good if can_queue is set to 1. I will investigate this case. > > Thanks. For the NCQ case, it does look like turning NCQ off makes the > disk work fine, so I'd suspect some issue with NCQ handling. As a work around until the NCQ issue is fixed, this will allow can_queue to be raised while still forcing all ATA devices to the non-NCQ case. Also, can can_queue be set to MVS_CHIP_SLOT_SZ - 1? It seems like a reasonable default. James --- diff --git a/drivers/scsi/mvsas.c b/drivers/scsi/mvsas.c index 5ec0665..425231f 100755 --- a/drivers/scsi/mvsas.c +++ b/drivers/scsi/mvsas.c @@ -37,6 +37,8 @@ #include <linux/dma-mapping.h> #include <linux/ctype.h> #include <scsi/libsas.h> +#include <scsi/scsi_tcq.h> +#include <scsi/sas_ata.h> #include <asm/io.h> #define DRV_NAME "mvsas" @@ -654,12 +655,31 @@ static const struct mvs_chip_info mvs_chips[] = { [chip_6480] = { 8, 32, 10 }, }; +static int mvsas_slave_configure(struct scsi_device *sdev) +{ + struct domain_device *dev = sdev_to_domain_dev(sdev); + int ret = sas_slave_configure(sdev); + + if (ret) + return ret; + + if (dev_is_sata(dev)) { + struct ata_port *ap = dev->sata_dev.ap; + struct ata_device *adev = ap->link.device; + + /* clamp at no NCQ for the time being */ + adev->flags |= ATA_DFLAG_NCQ_OFF; + scsi_adjust_queue_depth(sdev, MSG_SIMPLE_TAG, 1); + } + return 0; +} + static struct scsi_host_template mvs_sht = { .module = THIS_MODULE, .name = DRV_NAME, .queuecommand = sas_queuecommand, .target_alloc = sas_target_alloc, - .slave_configure = sas_slave_configure, + .slave_configure = mvsas_slave_configure, .slave_destroy = sas_slave_destroy, .scan_finished = mvs_scan_finished, .scan_start = mvs_scan_start, ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-03 14:59 ` James Bottomley 2008-03-03 16:40 ` James Bottomley @ 2008-03-05 2:07 ` James Bottomley 2008-03-05 21:02 ` James Bottomley 1 sibling, 1 reply; 14+ messages in thread From: James Bottomley @ 2008-03-05 2:07 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi On Mon, 2008-03-03 at 08:59 -0600, James Bottomley wrote: > On Mon, 2008-03-03 at 16:17 +0800, Ke Wei wrote: > > On Mon, Mar 3, 2008 at 8:42 AM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > On Fri, 2008-02-29 at 12:01 -0600, James Bottomley wrote: > > > > I noticed that the current marvell sas driver wasn't performing very > > > > well. It turns out that it's setting can_queue not in the SCSI host, > > > > but in its own internal data structure, meaning it's always operating > > > > with a global queue depth of one. This patch raises it to what the code > > > > seemed to be intending ... although I think can_queue should be > > > > MVS_CHIP_SLOT_SZ - 1 (without the divide by two)? > > > > > > > > The good news is that with this change, I'm getting a respectable > > > > throughput on the fio hammer test; plus zapping random phy resets across > > > > the disk triggers error handler recovery correctly (so far). > > > > > > > > I'm having less happy results with a SATAPI DVD ... it looks like the > > > > initial IDENTIFY goes across just fine, but that we stall on the other > > > > SCSI commands ... I'm still investigating this one. > > > > > > Actually, I've run into another problem with this patch applied. It > > > looks like NCQ fails with ATA disks. What I see is that I/O goes fine > > > until I get more than one command outstanding to the device, then the > > > device stops responding. I can keep the I/O flowing if I clamp the > > > device queue depth at 1. SAS disks seem to be fine ... I can get > > > multiple outstanding commands to them correctly serviced. > > > > Yes, I have to say that testing failed when I plugged SATA and SAS > > disk. Sometimes "insmod mvsas" will cause the system to hang. > > Only look good if can_queue is set to 1. I will investigate this case. > > Thanks. For the NCQ case, it does look like turning NCQ off makes the > disk work fine, so I'd suspect some issue with NCQ handling. > > > > I'm having less happy results with a SATAPI DVD ... it looks like the > > > initial IDENTIFY goes across just fine, but that we stall on the other > > > SCSI commands ... I'm still investigating this one. > > > > I think we need set BLIST_NOREPORTLUN or some other flags (see > > scsi_devinfo.h) about new some ATAPI device.When calling > > scsi_report_lun_scan , it will bypass REPORT_LUNS command. > > It doesn't seem to be anything the DVD does ... it works fine with the > aic94xx controller doing SATAPI (it sends the correct reply to REPORT > LUNS). It looks like the first hang comes at around the second or third > Test Unit Ready. > > Traces seem to show IDENTIFY_PACKET, INQUIRY, INQUIRY, TUR, TUR (hang) > and then every following command hangs, but I'll try to instrument more > accurate tracing. OK, I instrumented more ... you're right, the first failing command is REPORT_LUNS. The failure isn't because the DVD doesn't accept the command, but because it gets errored and we fail to report back the error data. What I see is the mvsas driver returning RXQ_ERR, so the device is trying to terminate the transaction with an error code. Unfortunately, when it sees this code, mvsas does nothing at all, leaving the request to time out and be aborted (even through it already finished). I can plumb it in ... it looks like we should also be doing is calling mvs_slot_complete(), but this still isn't quite correct ... it just sets SAM_STAT_CHECK_COND ... it needs to collect the ATA error code somehow. James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-05 2:07 ` James Bottomley @ 2008-03-05 21:02 ` James Bottomley 2008-03-06 14:46 ` Ke Wei 0 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2008-03-05 21:02 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi, jgarzik On Tue, 2008-03-04 at 20:07 -0600, James Bottomley wrote: > On Mon, 2008-03-03 at 08:59 -0600, James Bottomley wrote: > > On Mon, 2008-03-03 at 16:17 +0800, Ke Wei wrote: > > > On Mon, Mar 3, 2008 at 8:42 AM, James Bottomley > > > <James.Bottomley@hansenpartnership.com> wrote: > > > > > > > > On Fri, 2008-02-29 at 12:01 -0600, James Bottomley wrote: > > > > > I noticed that the current marvell sas driver wasn't performing very > > > > > well. It turns out that it's setting can_queue not in the SCSI host, > > > > > but in its own internal data structure, meaning it's always operating > > > > > with a global queue depth of one. This patch raises it to what the code > > > > > seemed to be intending ... although I think can_queue should be > > > > > MVS_CHIP_SLOT_SZ - 1 (without the divide by two)? > > > > > > > > > > The good news is that with this change, I'm getting a respectable > > > > > throughput on the fio hammer test; plus zapping random phy resets across > > > > > the disk triggers error handler recovery correctly (so far). > > > > > > > > > > I'm having less happy results with a SATAPI DVD ... it looks like the > > > > > initial IDENTIFY goes across just fine, but that we stall on the other > > > > > SCSI commands ... I'm still investigating this one. > > > > > > > > Actually, I've run into another problem with this patch applied. It > > > > looks like NCQ fails with ATA disks. What I see is that I/O goes fine > > > > until I get more than one command outstanding to the device, then the > > > > device stops responding. I can keep the I/O flowing if I clamp the > > > > device queue depth at 1. SAS disks seem to be fine ... I can get > > > > multiple outstanding commands to them correctly serviced. > > > > > > Yes, I have to say that testing failed when I plugged SATA and SAS > > > disk. Sometimes "insmod mvsas" will cause the system to hang. > > > Only look good if can_queue is set to 1. I will investigate this case. > > > > Thanks. For the NCQ case, it does look like turning NCQ off makes the > > disk work fine, so I'd suspect some issue with NCQ handling. > > > > > > I'm having less happy results with a SATAPI DVD ... it looks like the > > > > initial IDENTIFY goes across just fine, but that we stall on the other > > > > SCSI commands ... I'm still investigating this one. > > > > > > I think we need set BLIST_NOREPORTLUN or some other flags (see > > > scsi_devinfo.h) about new some ATAPI device.When calling > > > scsi_report_lun_scan , it will bypass REPORT_LUNS command. > > > > It doesn't seem to be anything the DVD does ... it works fine with the > > aic94xx controller doing SATAPI (it sends the correct reply to REPORT > > LUNS). It looks like the first hang comes at around the second or third > > Test Unit Ready. > > > > Traces seem to show IDENTIFY_PACKET, INQUIRY, INQUIRY, TUR, TUR (hang) > > and then every following command hangs, but I'll try to instrument more > > accurate tracing. > > OK, I instrumented more ... you're right, the first failing command is > REPORT_LUNS. The failure isn't because the DVD doesn't accept the > command, but because it gets errored and we fail to report back the > error data. > > What I see is the mvsas driver returning RXQ_ERR, so the device is > trying to terminate the transaction with an error code. Unfortunately, > when it sees this code, mvsas does nothing at all, leaving the request > to time out and be aborted (even through it already finished). > > I can plumb it in ... it looks like we should also be doing is calling > mvs_slot_complete(), but this still isn't quite correct ... it just sets > SAM_STAT_CHECK_COND ... it needs to collect the ATA error code somehow. Just by way of update, the slot is completing with RXQ_ERR set, but RXQ_DONE clear. The mvs_err_info field has TFILE_ERR set (the only set bit) and MVS_INT_STAT_SRS is zero. I assume the slot processing has halted, and that we need to collect the task file error registers and resume it somehow, but how? James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-05 21:02 ` James Bottomley @ 2008-03-06 14:46 ` Ke Wei 2008-03-06 15:52 ` James Bottomley 0 siblings, 1 reply; 14+ messages in thread From: Ke Wei @ 2008-03-06 14:46 UTC (permalink / raw) To: James Bottomley; +Cc: linux-scsi, jgarzik On Thu, Mar 6, 2008 at 5:02 AM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: 1. > > > > OK, I instrumented more ... you're right, the first failing command is > > REPORT_LUNS. The failure isn't because the DVD doesn't accept the > > command, but because it gets errored and we fail to report back the > > error data. First, I noticed that scsi_probe_and_add_lun function has changed since v2.6.17-git4. It will cause bflagsp to miss BLIST_NOREPORTLUN flag after calling scsi_add_lun. On the other hand, Even though mvsas driver can succeed in reporting back the error data, scsi subsystem also will force device to reset because of wrong REPORTLUN status. But Error again, then reset. Maybe I may write simulation codes for this command. 2.6.17-git4: /** * scsi_probe_and_add_lun - probe a LUN, if a LUN is found add it @@ -880,10 +906,12 @@ static int scsi_probe_and_add_lun(struct if (scsi_probe_lun(sdev, result, result_len, &bflags)) goto out_free_result; + if (bflagsp) + *bflagsp = bflags; /* * result contains valid SCSI INQUIRY data. */ - if ((result[0] >> 5) == 3) { + if (((result[0] >> 5) == 3) && !(bflags & BLIST_ATTACH_PQ3)) { /* * For a Peripheral qualifier 3 (011b), the SCSI * spec says: The device server is not capable of @@ -894,9 +922,22 @@ static int scsi_probe_and_add_lun(struct * logical disk configured at sdev->lun, but there * is a target id responding. */ - SCSI_LOG_SCAN_BUS(3, printk(KERN_INFO - "scsi scan: peripheral qualifier of 3," - " no device added\n")); + SCSI_LOG_SCAN_BUS(2, sdev_printk(KERN_INFO, sdev, "scsi scan:" + " peripheral qualifier of 3, device not" + " added\n")) + if (lun == 0) { + SCSI_LOG_SCAN_BUS(1, { + unsigned char vend[9]; + unsigned char mod[17]; + + sdev_printk(KERN_INFO, sdev, + "scsi scan: consider passing scsi_mod." + "dev_flags=%s:%s:0x240 or 0x800240\n", + scsi_inq_str(vend, result, 8, 16), + scsi_inq_str(mod, result, 16, 32)); + }); + } + res = SCSI_SCAN_TARGET_PRESENT; goto out_free_result; } @@ -920,8 +961,6 @@ static int scsi_probe_and_add_lun(struct sdev->lockable = 0; scsi_unlock_floptical(sdev, result); } - if (bflagsp) - *bflagsp = bflags; } 2. > As a work around until the NCQ issue is fixed, this will allow can_queue > to be raised while still forcing all ATA devices to the non-NCQ case. > > Also, can can_queue be set to MVS_CHIP_SLOT_SZ - 1? It seems like a > reasonable default. I also test failed. By far, can_queue only be set to 1. I will keep on investigating. 3. > The good news is that with this change, I'm getting a respectable > throughput on the fio hammer test; plus zapping random phy resets across > the disk triggers error handler recovery correctly (so far). I fixed it because of a register mistake. And I'm testing for the follow patch. -- Best Regards, Ke Wei ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-06 14:46 ` Ke Wei @ 2008-03-06 15:52 ` James Bottomley 2008-03-06 17:44 ` James Bottomley 0 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2008-03-06 15:52 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi, jgarzik On Thu, 2008-03-06 at 22:46 +0800, Ke Wei wrote: > On Thu, Mar 6, 2008 at 5:02 AM, James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > 1. > > > > > > OK, I instrumented more ... you're right, the first failing command is > > > REPORT_LUNS. The failure isn't because the DVD doesn't accept the > > > command, but because it gets errored and we fail to report back the > > > error data. > > First, I noticed that scsi_probe_and_add_lun function has changed > since v2.6.17-git4. > It will cause bflagsp to miss BLIST_NOREPORTLUN flag after calling scsi_add_lun. > On the other hand, Even though mvsas driver can succeed in reporting > back the error data, scsi subsystem also will force device to reset > because of wrong REPORTLUN status. But Error again, then reset. > Maybe I may write simulation codes for this command. But this isn't the problem. I'll look into the mid layer changes and see if we have a problem. However, this device does respond correctly (with an error) to REPORT_LUNS (however, at the moment, for me, the current setup is a great test of ATA error handling). When I have it on the aic94xx it terminates the command and sends back a Register D2H FIS with error = 0x54 (sense key ILLEGAL REQUEST, ABRT) status = 0x51 (DRDY, SERV, CHK) So the device does a correct termination for an unsupported command. The problem seems to be that mvsas isn't terminating and picking up the register D2H fis properly. Which, in turn seems to be connected with the way it's not really processing a RXQ_ERR return. This is a perfectly proper termination, and the scan routine will proceed normally without invoking any error handling (it's aware of the conflict between SPC and MMC regarding REPORT_LUNS). James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-06 15:52 ` James Bottomley @ 2008-03-06 17:44 ` James Bottomley 2008-03-06 17:59 ` Jeff Garzik 0 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2008-03-06 17:44 UTC (permalink / raw) To: Ke Wei; +Cc: linux-scsi, jgarzik On Thu, 2008-03-06 at 09:52 -0600, James Bottomley wrote: > On Thu, 2008-03-06 at 22:46 +0800, Ke Wei wrote: > > On Thu, Mar 6, 2008 at 5:02 AM, James Bottomley > > <James.Bottomley@hansenpartnership.com> wrote: > > > > 1. > > > > > > > > OK, I instrumented more ... you're right, the first failing command is > > > > REPORT_LUNS. The failure isn't because the DVD doesn't accept the > > > > command, but because it gets errored and we fail to report back the > > > > error data. > > > > First, I noticed that scsi_probe_and_add_lun function has changed > > since v2.6.17-git4. > > It will cause bflagsp to miss BLIST_NOREPORTLUN flag after calling scsi_add_lun. > > On the other hand, Even though mvsas driver can succeed in reporting > > back the error data, scsi subsystem also will force device to reset > > because of wrong REPORTLUN status. But Error again, then reset. > > Maybe I may write simulation codes for this command. > > But this isn't the problem. > > I'll look into the mid layer changes and see if we have a problem. > However, this device does respond correctly (with an error) to > REPORT_LUNS (however, at the moment, for me, the current setup is a > great test of ATA error handling). > > When I have it on the aic94xx it terminates the command and sends back a > Register D2H FIS with > > error = 0x54 (sense key ILLEGAL REQUEST, ABRT) > status = 0x51 (DRDY, SERV, CHK) > > So the device does a correct termination for an unsupported command. > The problem seems to be that mvsas isn't terminating and picking up the > register D2H fis properly. Which, in turn seems to be connected with > the way it's not really processing a RXQ_ERR return. > > This is a perfectly proper termination, and the scan routine will > proceed normally without invoking any error handling (it's aware of the > conflict between SPC and MMC regarding REPORT_LUNS). Actually, I looked in your mvi->rx_fis area where the D2H FIS should be and there seems to be a correct one collected. The problem is that if the slot returns RXQ_ERR without any other flags set (other than the slot number), the driver does nothing. Shouldn't the slot be completed and the D2H FIS returned so we can then do a request sense? James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-06 17:44 ` James Bottomley @ 2008-03-06 17:59 ` Jeff Garzik 2008-03-07 10:50 ` Ke Wei 0 siblings, 1 reply; 14+ messages in thread From: Jeff Garzik @ 2008-03-06 17:59 UTC (permalink / raw) To: James Bottomley; +Cc: Ke Wei, linux-scsi James Bottomley wrote: > On Thu, 2008-03-06 at 09:52 -0600, James Bottomley wrote: >> On Thu, 2008-03-06 at 22:46 +0800, Ke Wei wrote: >>> On Thu, Mar 6, 2008 at 5:02 AM, James Bottomley >>> <James.Bottomley@hansenpartnership.com> wrote: >>> >>> 1. >>>>> OK, I instrumented more ... you're right, the first failing command is >>>>> REPORT_LUNS. The failure isn't because the DVD doesn't accept the >>>>> command, but because it gets errored and we fail to report back the >>>>> error data. >>> First, I noticed that scsi_probe_and_add_lun function has changed >>> since v2.6.17-git4. >>> It will cause bflagsp to miss BLIST_NOREPORTLUN flag after calling scsi_add_lun. >>> On the other hand, Even though mvsas driver can succeed in reporting >>> back the error data, scsi subsystem also will force device to reset >>> because of wrong REPORTLUN status. But Error again, then reset. >>> Maybe I may write simulation codes for this command. >> But this isn't the problem. >> >> I'll look into the mid layer changes and see if we have a problem. >> However, this device does respond correctly (with an error) to >> REPORT_LUNS (however, at the moment, for me, the current setup is a >> great test of ATA error handling). >> >> When I have it on the aic94xx it terminates the command and sends back a >> Register D2H FIS with >> >> error = 0x54 (sense key ILLEGAL REQUEST, ABRT) >> status = 0x51 (DRDY, SERV, CHK) >> >> So the device does a correct termination for an unsupported command. >> The problem seems to be that mvsas isn't terminating and picking up the >> register D2H fis properly. Which, in turn seems to be connected with >> the way it's not really processing a RXQ_ERR return. >> >> This is a perfectly proper termination, and the scan routine will >> proceed normally without invoking any error handling (it's aware of the >> conflict between SPC and MMC regarding REPORT_LUNS). > > Actually, I looked in your mvi->rx_fis area where the D2H FIS should be > and there seems to be a correct one collected. The problem is that if > the slot returns RXQ_ERR without any other flags set (other than the > slot number), the driver does nothing. Shouldn't the slot be completed > and the D2H FIS returned so we can then do a request sense? That's a good question... Ke? I wrote that slot completion code according to the docs, which seemed to indicate that nothing associated with a slot could be touched until RXQ_DONE was set. It seemed analagous to the 'OWN' bit of a DMA ring found in many other hardwares, where software should not touch anything related to the descriptor if that bit is not set. Jeff ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-06 17:59 ` Jeff Garzik @ 2008-03-07 10:50 ` Ke Wei 2008-03-07 15:03 ` James Bottomley 0 siblings, 1 reply; 14+ messages in thread From: Ke Wei @ 2008-03-07 10:50 UTC (permalink / raw) To: Jeff Garzik; +Cc: James Bottomley, linux-scsi On a system with many SAS targets, It appears possible that a scsi_cmnd can time out without ever making it to the SAS LLDD or at the same time that a completion is occurring. in file sas_scsi_host.c : /* Queue up, Direct Mode or Task Collector Mode. */ if (sas_ha->lldd_max_execute_num < 2) res = i->dft->lldd_execute_task(task, 1, GFP_ATOMIC); else res = sas_queue_up(task); If I set lldd_max_execute_num above 1, I find that libsas couldn't queue a task to the SAS LLDD sometimes. System will always report: not at initiator: EH_RESET_TIMER. Is queue_thread pending? I will keep investigating. -- Best Regards, Ke Wei ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-07 10:50 ` Ke Wei @ 2008-03-07 15:03 ` James Bottomley 2008-03-07 15:31 ` Ke Wei 0 siblings, 1 reply; 14+ messages in thread From: James Bottomley @ 2008-03-07 15:03 UTC (permalink / raw) To: Ke Wei; +Cc: Jeff Garzik, linux-scsi On Fri, 2008-03-07 at 18:50 +0800, Ke Wei wrote: > On a system with many SAS targets, It appears possible that a > scsi_cmnd can time out without ever making it to the SAS LLDD or at > the same time that a completion is occurring. > in file sas_scsi_host.c : > /* Queue up, Direct Mode or Task Collector Mode. */ > if (sas_ha->lldd_max_execute_num < 2) > res = i->dft->lldd_execute_task(task, > 1, GFP_ATOMIC); > else > res = sas_queue_up(task); > If I set lldd_max_execute_num above 1, I find that libsas couldn't > queue a task to the SAS LLDD sometimes. > System will always report: not at initiator: EH_RESET_TIMER. > Is queue_thread pending? I will keep investigating. Oh, actually, that's the old task collector mode. I keep meaning to strip it out of libsas ... this provides a good excuse. Basically the only valid case is lldd_max_execute_num == 1 since, as you found, anything above that doesn't work. Are you still seeing timeouts in the lldd_max_execute_num == 1 case? And if so, is that still with can_queue == 1? James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] mvsas: fix default can_queue 2008-03-07 15:03 ` James Bottomley @ 2008-03-07 15:31 ` Ke Wei 0 siblings, 0 replies; 14+ messages in thread From: Ke Wei @ 2008-03-07 15:31 UTC (permalink / raw) To: James Bottomley; +Cc: Jeff Garzik, linux-scsi On Fri, Mar 7, 2008 at 11:03 PM, James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > > Are you still seeing timeouts in the lldd_max_execute_num == 1 case? > And if so, is that still with can_queue == 1? > Yes , I set lldd_max_execute_num = 1 & can_queue = 1. By far, no timeout event. > James > > > -- Best Regards, Ke Wei ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2008-03-07 15:31 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-02-29 18:01 [PATCH] mvsas: fix default can_queue James Bottomley 2008-03-03 0:42 ` James Bottomley 2008-03-03 8:17 ` Ke Wei 2008-03-03 14:59 ` James Bottomley 2008-03-03 16:40 ` James Bottomley 2008-03-05 2:07 ` James Bottomley 2008-03-05 21:02 ` James Bottomley 2008-03-06 14:46 ` Ke Wei 2008-03-06 15:52 ` James Bottomley 2008-03-06 17:44 ` James Bottomley 2008-03-06 17:59 ` Jeff Garzik 2008-03-07 10:50 ` Ke Wei 2008-03-07 15:03 ` James Bottomley 2008-03-07 15:31 ` Ke Wei
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).