From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 100DBC433EF for ; Thu, 17 Feb 2022 10:47:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237832AbiBQKri (ORCPT ); Thu, 17 Feb 2022 05:47:38 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:36080 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232911AbiBQKrg (ORCPT ); Thu, 17 Feb 2022 05:47:36 -0500 Received: from esa2.hgst.iphmx.com (esa2.hgst.iphmx.com [68.232.143.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9DB161DF25B for ; Thu, 17 Feb 2022 02:47:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1645094840; x=1676630840; h=message-id:date:mime-version:subject:to:references:from: in-reply-to:content-transfer-encoding; bh=XV79uhJiPhZdpb+nOlyMWLX0rnZ0u4Ex5TEAbrH+6Vc=; b=dcVA+UKMwkHU5kyqqnJGfyQHFw/KPNuhWEGgBdp3jSYm+Ra4WazU3y6X P5+fGQrRqgSiDkGr53vWYMqFriXo/XbHrl+eRxWOatlI8OfVyHt8/K169 60Q7dkafQLI6NinHFCPkuhgTjijU7CEr833aUlDHE3bH7Z1/E81RtSNW9 35m1STpVNTZ/ae+3/jBbNXRb/VLVGmSkPJYpY0G1vL5p835sZEjXiwb9H MYVJbyXWpfnKyt+F0PUaekyVil2alnX2KHByGT1ck9mUhH8rAGXm34fvJ 4N4Mvj/hYM3mtWVBXNvbqcqe6w7oiuihR8wpQsw9TjSnPd3RvBBDLVrC2 w==; X-IronPort-AV: E=Sophos;i="5.88,375,1635177600"; d="scan'208";a="297293166" Received: from uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 17 Feb 2022 18:47:19 +0800 IronPort-SDR: 28jCyJSRyUmBHifiMikGh1K/cucO+SKAlAenVCizowtjYCkYFOYabcIoEoO+ahWZXVA8jnuZbG 6Rt4j0HY0r9PqJfYDPj177d8CsuF9OF+gH7+eyJhjMkBCN3dE7M1YkNdVYAg/v08iZHwmx24og wjwTzhaGPVV1u5d9GGHhO+1LhPpjbKgTP6aBPD9a2Cd7lvQhJEh8iGIHVe4HvGsp93Wp8iuVQK 135yPpJrsKjyv8s4wko33JzTpEtVd5NsEC9xPoG1feHx7C9mEd1qePNQb7dx22BijB6AQfhFS/ vYB6ksTP85K4XoiQNdLlKjQS Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Feb 2022 02:20:05 -0800 IronPort-SDR: /tkJXDj4PlqOtuNgn5apTF8cZLoTHh0d+Cvkj4LWLGO6Rv8HCwJUxIC96JwnpWnnvZoZ6MLzZo IqSPxtmQqZ+Tqg6PJeqSUhd0mj7HfXgsTXbzIkpQzK3eQQjiayVK5W7I7jnBfAT+DB7Yu9sBUh +JAYc8BjlEEZJt8G3EOmF67CaeWyVDi9Zo1E1ObukvrWe2g6oQMSQLhtsxfhFynFKZzoVkI+ar TVhvs1fopRHb5cAskdueu/u0y+b+urWAMYk9nCC4crdj4qajsVZdpw0gZXtTMth35p3B1sIT2I CzE= WDCIronportException: Internal Received: from usg-ed-osssrv.wdc.com ([10.3.10.180]) by uls-op-cesaip02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Feb 2022 02:47:20 -0800 Received: from usg-ed-osssrv.wdc.com (usg-ed-osssrv.wdc.com [127.0.0.1]) by usg-ed-osssrv.wdc.com (Postfix) with ESMTP id 4Jzs3C0hlxz1SVnx for ; Thu, 17 Feb 2022 02:47:19 -0800 (PST) Authentication-Results: usg-ed-osssrv.wdc.com (amavisd-new); dkim=pass reason="pass (just generated, assumed good)" header.d=opensource.wdc.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d= opensource.wdc.com; h=content-transfer-encoding:content-type :in-reply-to:organization:from:references:to:content-language :subject:user-agent:mime-version:date:message-id; s=dkim; t= 1645094838; x=1647686839; bh=XV79uhJiPhZdpb+nOlyMWLX0rnZ0u4Ex5TE AbrH+6Vc=; b=MF8UvvG/f2zsqJ5RAzlPjToKkCIrfFoPP+XZaIr4RlYvPVXN5rC dS69Xd3f5L1T+bdlXaoto5Ac0/9S0xgwB82w6Bcb4WVdUoKY/RSBhzllUXP0gc3J Ltd1ucMYUXou9CQwpcoBP+3F7VDtTbWLoxKKzVuL9vA3dx8AYmKwbIapv1zKmGJV PCI5MOhSXLGrmDfaY4aa/mg0Kh38qGolaQlT3NhMNVLeepKXjdyozETkyboEMjh0 /t+JjE6tuzC+AeBuW/LyE4y2FbpIy2iJmYrUGZHSakeG2yS2oiRxg6tbLVqh3z9n fvA6bfzjyOLQdNyr/LYOR624hCq+cJ5x1mA== X-Virus-Scanned: amavisd-new at usg-ed-osssrv.wdc.com Received: from usg-ed-osssrv.wdc.com ([127.0.0.1]) by usg-ed-osssrv.wdc.com (usg-ed-osssrv.wdc.com [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id ywxtP5NIF99P for ; Thu, 17 Feb 2022 02:47:18 -0800 (PST) Received: from [10.225.163.77] (unknown [10.225.163.77]) by usg-ed-osssrv.wdc.com (Postfix) with ESMTPSA id 4Jzs391zQrz1Rwrw; Thu, 17 Feb 2022 02:47:17 -0800 (PST) Message-ID: Date: Thu, 17 Feb 2022 19:47:15 +0900 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0 Subject: Re: [PATCH v3 27/31] scsi: pm8001: Cleanup pm8001_queue_command() Content-Language: en-US To: John Garry , linux-scsi@vger.kernel.org, "Martin K . Petersen" , Xiang Chen , Jason Yan References: <20220214021747.4976-1-damien.lemoal@opensource.wdc.com> <20220214021747.4976-28-damien.lemoal@opensource.wdc.com> <51d7c127-f975-14e9-a036-c37416ed8679@huawei.com> <32efd519-3485-ee34-84e2-34a0d8c27e17@opensource.wdc.com> <38090771-ad24-1b20-9b79-f7f89d42ea66@huawei.com> <37df3c92-c28e-72d4-76d8-33356829af5a@opensource.wdc.com> <5a5481af-e975-c6fb-2d48-961769eae551@huawei.com> <9c22abeb-1b22-4613-66bc-276aaa4a639c@opensource.wdc.com> <712a4702-8534-fad2-2679-cc5cf62e4a9e@huawei.com> From: Damien Le Moal Organization: Western Digital Research In-Reply-To: <712a4702-8534-fad2-2679-cc5cf62e4a9e@huawei.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org On 2/17/22 18:23, John Garry wrote: > On 17/02/2022 00:12, Damien Le Moal wrote: >>>>> I'll have a look at it. And that is on mainline or mkp-scsi staging, and >>>>> not your patchset. >>>> Are you saying that my patches suppresses the above ? This is submission >>>> path and the dma code seems to complain about alignment... So bad buffer >>>> addresses ? >>> Your series does not suppress it. It doesn't occur often, so I need to >>> check more. >>> >>> I think the issue is that we call dma_map_sg() twice, i.e. ccb never >>> unmapped. >> That would be a big issue indeed. We could add a flag to CCBs to track >> the buf_prd DMA mapping state and BUG_ON() when ccb free function is >> called with the buffer still mapped. That should allow catching this >> infrequent problem ? >> > > I figured out what is happening here and it does not help solve the > mystery of my hang. > > Here's the steps: > a. scsi_cmnd times out > b. scsi error handling kicks in > c. libsas attempts to abort the task, which fails > d. libsas then tries IT nexus reset, which passes > - libsas assumes the scsi_cmnd has completed with failure > e. error handling concludes > f. scsi midlayer then retries the same scsi_cmnd > g. since we did not "free" associated ccb earlier or dma unmap at d., > the dma unmap on the same scsi_cmnd causes the warn > > So the LLD should really free resources and dma unmap at point IT nexus > reset completes, but it doesn't. I think in certain conditions dma map > should not be done twice. > > Anyway, that can be fixed, but I still have the hang :( I guess (a) (cmd timeout) is only the symptom of the hang ? That is, the hang is causing the timeout ? It may be good to turn on scsi trace to see if the command was only partially done, or not at all, or if it is a non-data command. And speaking of errors, I am currently testing v4 of my series and noticed some weird things in the error handling. E.g., with one of the test executing a report zones command with an LBA out of range, I see this: [23962.027105] pm80xx0:: mpi_sata_event 2788:SATA EVENT 0x23 [23962.036099] pm80xx0:: pm80xx_send_read_log 1863:Executing read log end All good: this is IO_XFER_ERROR_ABORTED_NCQ_MODE. And the read log is to get the drive queue out of error state. [23962.046101] pm80xx0:: mpi_sata_event 2788:SATA EVENT 0x26 This is IO_XFER_ERROR_UNEXPECTED_PHASE. No clue what this is doing. [23962.054947] pm80xx0:: mpi_sata_event 2805:task or dev null Why ? [23962.063865] pm80xx0:: pm80xx_send_abort_all 1796:Executing abort task end All queued commands are aborted when the read log completes. Again, per ATA NCQ specs, all good. After that, normal (some useless) messages. [23962.063964] pm80xx0:: mpi_sata_completion 2395:IO failed device_id 16388 status 0x1 tag 1 [23962.084587] pm80xx0:: mpi_sata_completion 2430:SAS Address of IO Failure Drive:50010860002f5657 [23962.095526] sas: sas_ata_task_done: SAS error 0x8d [23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1 [23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0 [23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED [23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11 ncq dma 512 in [23962.163307] res 43/10:00:00:00:00/00:00:00:00:00/00 Emask 0x480 (invalid argument) Good here, correct error... [23962.182879] ata24.00: status: { DRDY SENSE ERR } [23962.189100] ata24.00: error: { IDNF } ... but I need to look at this in libata. Getting the same with AHCI. [23962.215456] ata24.00: configured for UDMA/133 [23962.221469] ata24: EH complete [23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1 That event 0x26 and the "task or dev null" are obscure. No clue, but they look wrong. Overall, this driver is by default way too verbose I think. I would prefer to reduce the above to something like: [23962.095526] sas: sas_ata_task_done: SAS error 0x8d [23962.139470] sas: Enter sas_scsi_recover_host busy: 1 failed: 1 [23962.147897] ata24.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0 [23962.156253] ata24.00: failed command: RECEIVE FPDMA QUEUED [23962.163307] ata24.00: cmd 65/01:00:00:00:0c/00:02:23:01:00/40 tag 11 ncq dma 512 in [23962.163307] res 43/10:00:00:00:00/00:00:00:00:00/00 Emask 0x480 (invalid argument) [23962.182879] ata24.00: status: { DRDY SENSE ERR } [23962.189100] ata24.00: error: { IDNF } [23962.215456] ata24.00: configured for UDMA/133 [23962.221469] ata24: EH complete [23962.226056] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1 That is, by default, remove most FAIL messages, changing them to DEV level. The same test with a SAS drive is totally silent, as it should be, since the command error is not a fatal one. But not touching this for now. I first want the series to be queued. -- Damien Le Moal Western Digital Research