From: bugzilla-daemon@kernel.org
To: linux-scsi@vger.kernel.org
Subject: [Bug 217599] Adaptec 71605z hangs with aacraid: Host adapter abort request after update to linux 6.4.0
Date: Sat, 16 Dec 2023 05:35:26 +0000
X-Bugzilla-Product: SCSI Drivers
X-Bugzilla-Component: AACRAID
X-Bugzilla-Severity: high
X-Bugzilla-Status: NEW
X-Bugzilla-Who: encore2097@hotmail.com

https://bugzilla.kernel.org/show_bug.cgi?id=217599

--- Comment #48 from encore2097@hotmail.com ---
Hi Sagar,

I'm using a setup with 10 SATA disks in HBA mode, running a zfs raidz2
filesystem (akin to raid-6). This is a single-CPU system, so I don't believe
the CPU count is the main issue here, although it's likely related.

From examining the logs, doing some research, and drawing on my experience,
it seems that timeouts and queues are the primary culprits. My suspicion is
that under heavy load there's an overflow somewhere in the stack (could be in
the kernel driver, firmware, or hardware), causing I/O requests to get lost
and time out. After a series of these timeouts, the driver raises an error
and resets the adapter.

I stumbled upon threads dating back to around 2017 where users faced similar
issues (check this one:
https://forum.proxmox.com/threads/pve-5-1-aacraid-scsi-hang.38259/). One
suggested fix was to extend the disk timeout window for waiting on I/O.
However, the current kernel (set at 60s) has already doubled the previous
value of 30s, so I think the timeout is not the root cause, though it is
probably related.

I'm not sure how other users have their physical disks connected to their
controllers, but I reliably see this issue with my 10-disk setup. My
recommendation would be to increase the number of disks attached to the
controller and stress test it with simultaneous sequential and random I/O,
using tools like dd and fio at the same time.

My specific use case involves a file server and database with multiple users.
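
For anyone who wants to script the stress test suggested above, here is a rough
Python sketch, not something I have run against this controller. It only builds
(and prints) one sequential dd reader and one random-read fio job per disk, so
they can be launched at the same time. The device names, block sizes, queue
depth, and runtime are all assumptions for illustration; adjust them to your
setup. Everything is read-only (fio's --readonly safety check, dd writing to
/dev/null), but raw-device access still needs root and is best done on a test
system.

```python
import subprocess


def build_commands(devices, runtime_s=300):
    """Build one sequential dd reader and one random fio reader per device.

    All values (block sizes, iodepth, runtime) are illustrative assumptions.
    """
    dd_cmds = [
        # Sequential direct reads from the raw device, discarded to /dev/null.
        ["dd", f"if={dev}", "of=/dev/null", "bs=1M", "iflag=direct"]
        for dev in devices
    ]
    fio_cmds = [
        [
            "fio",
            f"--name=randread-{dev.rsplit('/', 1)[-1]}",
            f"--filename={dev}",
            "--rw=randread",      # random reads only
            "--bs=4k",
            "--direct=1",         # bypass the page cache
            "--ioengine=libaio",
            "--iodepth=32",
            f"--runtime={runtime_s}",
            "--time_based",
            "--readonly",         # fio safety check: refuse to write
        ]
        for dev in devices
    ]
    return dd_cmds, fio_cmds


def run_concurrently(cmds):
    """Start every command at once and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Hypothetical device names for a 10-disk setup.
    devices = [f"/dev/sd{letter}" for letter in "bcdefghijk"]
    dd_cmds, fio_cmds = build_commands(devices)
    for cmd in dd_cmds + fio_cmds:
        print(" ".join(cmd))
    # To actually run the load: run_concurrently(dd_cmds + fio_cmds)
```

Watching dmesg for "Host adapter abort request" while this runs should show
whether the mixed load triggers the resets.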
I consistently observe the adapter aborting requests and resetting a few
minutes after boot, when the file server and database applications start and
warm up their caches (cache size is approximately 120GB in RAM).

Upon further investigation, I found that anyone experiencing this issue could
gather more information by adding a dump_stack() call to aac_eh_abort, around
line 713 of linux/latest/source/drivers/scsi/aacraid/linit.c (refer to this:
https://stackoverflow.com/questions/32557040/how-to-get-stack-trace-at-various-points-in-kernel-device-driver-code).

Unfortunately, because the downtime was unacceptable, I had to switch my
system back to a different HBA, and I have no spare systems to test with.

Best regards.