* tcm_loop and aborted TMRs @ 2022-11-11 16:20 Maurizio Lombardi 2022-11-11 16:21 ` Maurizio Lombardi 2022-11-11 21:18 ` Mike Christie 0 siblings, 2 replies; 15+ messages in thread From: Maurizio Lombardi @ 2022-11-11 16:20 UTC (permalink / raw) To: Bodo Stroesser; +Cc: Mike Christie, target-devel Hello Bodo, Mike, Some of our customers reported that the tcm_loop module is unable to handle aborted TMRs, resulting in kernel hangs. I noticed that Bodo submitted a patch some time ago (our customers confirm it works), Mike instead proposed to revert commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". The discussion unfortunately died out without reaching a conclusion. Personally, I think that if the handling of aborted TMRs was working before the "Use system workqueues" commit then this should be considered as a regression and the commit should be reverted. Maurizio ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-11 16:20 tcm_loop and aborted TMRs Maurizio Lombardi @ 2022-11-11 16:21 ` Maurizio Lombardi 2022-11-11 21:18 ` Mike Christie 1 sibling, 0 replies; 15+ messages in thread From: Maurizio Lombardi @ 2022-11-11 16:21 UTC (permalink / raw) To: Bodo Stroesser; +Cc: Mike Christie, target-devel pá 11. 11. 2022 v 17:20 odesílatel Maurizio Lombardi <mlombard@redhat.com> napsal: > > I noticed that Bodo submitted a patch some time ago (our customers > confirm it works), > Mike instead proposed to revert > commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". > Link to the patch submitted by Bodo: https://patchwork.kernel.org/project/target-devel/patch/20200715160403.12578-1-bstroesser@ts.fujitsu.com/#23512015 Thanks, Maurizio ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-11 16:20 tcm_loop and aborted TMRs Maurizio Lombardi 2022-11-11 16:21 ` Maurizio Lombardi @ 2022-11-11 21:18 ` Mike Christie 2022-11-12 13:59 ` Bodo Stroesser 1 sibling, 1 reply; 15+ messages in thread From: Mike Christie @ 2022-11-11 21:18 UTC (permalink / raw) To: Maurizio Lombardi, Bodo Stroesser; +Cc: target-devel On 11/11/22 10:20 AM, Maurizio Lombardi wrote: > Hello Bodo, Mike, > > Some of our customers reported that the tcm_loop module is unable > to handle aborted TMRs, resulting in kernel hangs. > > I noticed that Bodo submitted a patch some time ago (our customers > confirm it works), > Mike instead proposed to revert > commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". > > The discussion unfortunately died out without reaching a conclusion. > > Personally, I think that if the handling of aborted TMRs was working > before the "Use system workqueues" commit then this should be > considered as a regression and the commit should be reverted. > I'm fine with reverting it because multiple drivers are affected. I had talked to Bart offlist back then and he was also ok since we couldn't fix up the drivers. I think Bodo and I tried to convert qla, but it was difficult without marvell's help (we both pinged them but didn't hear back) because from what I could tell we needed to send some hw/fw commands to perform cleanup to fully handle that case. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-11 21:18 ` Mike Christie @ 2022-11-12 13:59 ` Bodo Stroesser 2022-11-12 21:46 ` michael.christie 2022-11-21 13:35 ` Dmitry Bogdanov 0 siblings, 2 replies; 15+ messages in thread From: Bodo Stroesser @ 2022-11-12 13:59 UTC (permalink / raw) To: Mike Christie, Maurizio Lombardi; +Cc: target-devel Hello Mike, Maurizio, Even if we couldn't yet find a method to fix handling of aborted TMRs in the core or in all fabric drivers, I still think that keeping the parallel handling of TMRs would be fine. Tcmu offers a TMR notification mechanism to make userspace aware of ABORT or RESET_LUN. So userspace can try to break cmd handling and thus speed up TMR response. If we serialize TMR handling, then the notifications are also serialized and thus lose some of their power. But maybe I have a new (?) idea of how to fix handling of aborted TMRs in fabric drivers: 1) Modify core to not call target_put_sess_cmd, no matter whether SCF_ACK_REF is set. 2) Modify fabric drivers to handle an aborted TMR just like a normal TMR response. This means, e.g. qla2xxx would send a normal response for the Abort. This exactly is what happens when serializing TMRs, because in that case despite of the RESET_LUN the core always calls queue_tm_rsp callback instead of aborted_task callback. So to initiators we would show the 'old' behavior, while internally keeping the parallel processing of TMRs. If fabric driver maintainers don't like that approach, they can change their drivers to correctly kill aborted TMRs. What do you think? Bodo On 11.11.22 22:18, Mike Christie wrote: > On 11/11/22 10:20 AM, Maurizio Lombardi wrote: >> Hello Bodo, Mike, >> >> Some of our customers reported that the tcm_loop module is unable >> to handle aborted TMRs, resulting in kernel hangs. >> >> I noticed that Bodo submitted a patch some time ago (our customers >> confirm it works), >> Mike instead proposed to revert >> commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". >> >> The discussion unfortunately died out without reaching a conclusion. >> >> Personally, I think that if the handling of aborted TMRs was working >> before the "Use system workqueues" commit then this should be >> considered as a regression and the commit should be reverted. >> > > I'm fine with reverting it because multiple drivers are affected. I had > talked to Bart offlist back then and he was also ok since we couldn't > fix up the drivers. > > I think Bodo and I tried to convert qla, but it was difficult without > marvell's help (we both pinged them but didn't hear back) because from > what I could tell we needed to send some hw/fw commands to perform cleanup > to fully handle that case. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-12 13:59 ` Bodo Stroesser @ 2022-11-12 21:46 ` michael.christie 2022-11-19 11:42 ` Bodo Stroesser 2022-11-21 13:35 ` Dmitry Bogdanov 1 sibling, 1 reply; 15+ messages in thread From: michael.christie @ 2022-11-12 21:46 UTC (permalink / raw) To: Bodo Stroesser, Maurizio Lombardi; +Cc: target-devel On 11/12/22 7:59 AM, Bodo Stroesser wrote: > Hello Mike, Maurizio, > > Even if we couldn't yet find a method to fix handling of aborted > TMRs in the core or in all fabric drivers, I still think that keeping > the parallel handling of TMRs would be fine. > > Tcmu offers a TMR notification mechanism to make userspace aware > of ABORT or RESET_LUN. So userspace can try to break cmd handling > and thus speed up TMR response. If we serialize TMR handling, then > the notifications are also serialized and thus lose some of their > power. > > But maybe I have a new (?) idea of how to fix handling of aborted > TMRs in fabric drivers: > 1) Modify core to not call target_put_sess_cmd, no matter whether > SCF_ACK_REF is set. > 2) Modify fabric drivers to handle an aborted TMR just like a > normal TMR response. This means, e.g. qla2xxx would send a > normal response for the Abort. This exactly is what happens > when serializing TMRs, because in that case despite of the > RESET_LUN the core always calls queue_tm_rsp callback instead > of aborted_task callback. > > So to initiators we would show the 'old' behavior, while internally > keeping the parallel processing of TMRs. > > If fabric driver maintainers don't like that approach, they can > change their drivers to correctly kill aborted TMRs. > > What do you think? > I'm fine with doing it in parallel. However, the issue is we have real users hitting it now and we have to fix all the drivers because it's a regression. So if your idea is going take a while then we should revert now and then do your idea whenever it's ready. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-12 21:46 ` michael.christie @ 2022-11-19 11:42 ` Bodo Stroesser 0 siblings, 0 replies; 15+ messages in thread From: Bodo Stroesser @ 2022-11-19 11:42 UTC (permalink / raw) To: michael.christie, Maurizio Lombardi; +Cc: target-devel On 12.11.22 22:46, michael.christie@oracle.com wrote: > On 11/12/22 7:59 AM, Bodo Stroesser wrote: >> Hello Mike, Maurizio, >> >> Even if we couldn't yet find a method to fix handling of aborted >> TMRs in the core or in all fabric drivers, I still think that keeping >> the parallel handling of TMRs would be fine. >> >> Tcmu offers a TMR notification mechanism to make userspace aware >> of ABORT or RESET_LUN. So userspace can try to break cmd handling >> and thus speed up TMR response. If we serialize TMR handling, then >> the notifications are also serialized and thus lose some of their >> power. >> >> But maybe I have a new (?) idea of how to fix handling of aborted >> TMRs in fabric drivers: >> 1) Modify core to not call target_put_sess_cmd, no matter whether >> SCF_ACK_REF is set. >> 2) Modify fabric drivers to handle an aborted TMR just like a >> normal TMR response. This means, e.g. qla2xxx would send a >> normal response for the Abort. This exactly is what happens >> when serializing TMRs, because in that case despite of the >> RESET_LUN the core always calls queue_tm_rsp callback instead >> of aborted_task callback. >> >> So to initiators we would show the 'old' behavior, while internally >> keeping the parallel processing of TMRs. >> >> If fabric driver maintainers don't like that approach, they can >> change their drivers to correctly kill aborted TMRs. >> >> What do you think? >> > > I'm fine with doing it in parallel. However, the issue is we have real > users hitting it now and we have to fix all the drivers because it's a > regression. So if your idea is going take a while then we should revert > now and then do your idea whenever it's ready. I agree. Even if my old patch fixes the issue for tcm_loop users, it does not make sense to apply it, since the new idea would lead to reverting parts of it. And with the patch we would still take the risk of users running into trouble with fabrics other than tcm_loop. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-12 13:59 ` Bodo Stroesser 2022-11-12 21:46 ` michael.christie @ 2022-11-21 13:35 ` Dmitry Bogdanov 2022-11-21 19:25 ` Mike Christie 1 sibling, 1 reply; 15+ messages in thread From: Dmitry Bogdanov @ 2022-11-21 13:35 UTC (permalink / raw) To: Bodo Stroesser; +Cc: Mike Christie, Maurizio Lombardi, target-devel Hi Bodo, On Sat, Nov 12, 2022 at 02:59:48PM +0100, Bodo Stroesser wrote: > > Hello Mike, Maurizio, > > Even if we couldn't yet find a method to fix handling of aborted > TMRs in the core or in all fabric drivers, I still think that keeping > the parallel handling of TMRs would be fine. > > Tcmu offers a TMR notification mechanism to make userspace aware > of ABORT or RESET_LUN. So userspace can try to break cmd handling > and thus speed up TMR response. If we serialize TMR handling, then > the notifications are also serialized and thus lose some of their > power. > > But maybe I have a new (?) idea of how to fix handling of aborted > TMRs in fabric drivers: > 1) Modify core to not call target_put_sess_cmd, no matter whether > SCF_ACK_REF is set. > 2) Modify fabric drivers to handle an aborted TMR just like a > normal TMR response. This means, e.g. qla2xxx would send a > normal response for the Abort. This exactly is what happens > when serializing TMRs, because in that case despite of the > RESET_LUN the core always calls queue_tm_rsp callback instead > of aborted_task callback. I am not sure for all initiators, but usually it sends LUN_RESET only after ABORT has been timed out. There is no reason to send a response on the ABORT that was actually aborted already internally on the initiator side. > So to initiators we would show the 'old' behavior, while internally > keeping the parallel processing of TMRs. > > If fabric driver maintainers don't like that approach, they can > change their drivers to correctly kill aborted TMRs. > > What do you think? > I will vote to your old patch. qla2xxx was fixed long time ago. Other fabric drivers have not that issue too. Only tcm_loop and xen still do not adapted for parallel TMRs. I think it's a not good idea to revert 2 years old patch. It was a reason for some other patches (that fixes issues thanks to parallel TMR handling). > > On 11.11.22 22:18, Mike Christie wrote: > > On 11/11/22 10:20 AM, Maurizio Lombardi wrote: > > > Hello Bodo, Mike, > > > > > > Some of our customers reported that the tcm_loop module is unable > > > to handle aborted TMRs, resulting in kernel hangs. > > > > > > I noticed that Bodo submitted a patch some time ago (our customers > > > confirm it works), > > > Mike instead proposed to revert > > > commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". > > > > > > The discussion unfortunately died out without reaching a conclusion. > > > > > > Personally, I think that if the handling of aborted TMRs was working > > > before the "Use system workqueues" commit then this should be > > > considered as a regression and the commit should be reverted. > > > > > > > I'm fine with reverting it because multiple drivers are affected. I had > > talked to Bart offlist back then and he was also ok since we couldn't > > fix up the drivers. > > > > I think Bodo and I tried to convert qla, but it was difficult without > > marvell's help (we both pinged them but didn't hear back) because from > > what I could tell we needed to send some hw/fw commands to perform cleanup > > to fully handle that case. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-21 13:35 ` Dmitry Bogdanov @ 2022-11-21 19:25 ` Mike Christie 2022-11-25 9:34 ` Dmitry Bogdanov 0 siblings, 1 reply; 15+ messages in thread From: Mike Christie @ 2022-11-21 19:25 UTC (permalink / raw) To: Dmitry Bogdanov, Bodo Stroesser; +Cc: Maurizio Lombardi, target-devel On 11/21/22 7:35 AM, Dmitry Bogdanov wrote: >> > I will vote to your old patch. qla2xxx was fixed long time ago. What is the qla fix? I think we still leak. In commit commit 605e74025f953b995a3a241ead43bde71c1c99b5 Author: Mike Christie <michael.christie@oracle.com> Date: Sun Nov 1 12:59:31 2020 -0600 scsi: qla2xxx: Move sess cmd list/lock to driver when I changed the locking I had added the check: static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd) { struct qla_tgt_cmd *cmd; unsigned long flags; if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) return; because tmrs are not on the sess_cmd_list that's accessed the next line down. We don't crash as a result, but I think we need to add code to send the cleanup command to the FW. Bodo and I were working on that part, but someone with more qla experience needed to work on it so it could be properly tested. We didn't hear back from the qla engineers so progress had stalled. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-21 19:25 ` Mike Christie @ 2022-11-25 9:34 ` Dmitry Bogdanov 2022-11-27 18:59 ` Mike Christie 0 siblings, 1 reply; 15+ messages in thread From: Dmitry Bogdanov @ 2022-11-25 9:34 UTC (permalink / raw) To: Mike Christie; +Cc: Bodo Stroesser, Maurizio Lombardi, target-devel On Mon, Nov 21, 2022 at 01:25:55PM -0600, Mike Christie wrote: > > On 11/21/22 7:35 AM, Dmitry Bogdanov wrote: > >> > > I will vote to your old patch. qla2xxx was fixed long time ago. > > What is the qla fix? I think we still leak. In commit > > commit 605e74025f953b995a3a241ead43bde71c1c99b5 > Author: Mike Christie <michael.christie@oracle.com> > Date: Sun Nov 1 12:59:31 2020 -0600 > > scsi: qla2xxx: Move sess cmd list/lock to driver > > when I changed the locking I had added the check: > > static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd) > { > struct qla_tgt_cmd *cmd; > unsigned long flags; > > if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) > return; > Yes, I was thinking about that commit. > because tmrs are not on the sess_cmd_list that's accessed the > next line down. We don't crash as a result, but I think we need > to add code to send the cleanup command to the FW. Bodo and I > were working on that part, but someone with more qla experience > needed to work on it so it could be properly tested. We didn't > hear back from the qla engineers so progress had stalled. > Yes, you are right, FW expects some response on every ABORT IOCB to clear its resources. I can prepare the patch for qla2xxx. But still, I do not see a sense of new Bodo's solution. Calling target_put_sess_cmd due to SCF_ACK_REF forbids any async long term work in fabric drivers in aborted_task callback. It is intentionally. qla2xxx does not require to wait for the completion of all requests to FW. All terminate exchange requests are such. SAM-5 states that an aborted(due to LUN_RESET) TMR should not be responded to initiator: 7.11 Task management function lifetime The task management function shall exist until: a) the task manager sends a service response for the task management function; b) an I_T nexus loss (see 6.3.4); * c) a logical unit reset (see 6.3.3); d) a hard reset (see 6.3.2); e) power loss expected (see 6.3.5); or f) a power on condition (see 6.3.1) BR, Dmitry ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-25 9:34 ` Dmitry Bogdanov @ 2022-11-27 18:59 ` Mike Christie 2022-12-01 14:15 ` Bodo Stroesser 0 siblings, 1 reply; 15+ messages in thread From: Mike Christie @ 2022-11-27 18:59 UTC (permalink / raw) To: Dmitry Bogdanov; +Cc: Bodo Stroesser, Maurizio Lombardi, target-devel On 11/25/22 3:34 AM, Dmitry Bogdanov wrote: > On Mon, Nov 21, 2022 at 01:25:55PM -0600, Mike Christie wrote: >> >> On 11/21/22 7:35 AM, Dmitry Bogdanov wrote: >>>> >>> I will vote to your old patch. qla2xxx was fixed long time ago. >> >> What is the qla fix? I think we still leak. In commit >> >> commit 605e74025f953b995a3a241ead43bde71c1c99b5 >> Author: Mike Christie <michael.christie@oracle.com> >> Date: Sun Nov 1 12:59:31 2020 -0600 >> >> scsi: qla2xxx: Move sess cmd list/lock to driver >> >> when I changed the locking I had added the check: >> >> static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd) >> { >> struct qla_tgt_cmd *cmd; >> unsigned long flags; >> >> if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) >> return; >> > Yes, I was thinking about that commit. >> because tmrs are not on the sess_cmd_list that's accessed the >> next line down. We don't crash as a result, but I think we need >> to add code to send the cleanup command to the FW. Bodo and I >> were working on that part, but someone with more qla experience >> needed to work on it so it could be properly tested. We didn't >> hear back from the qla engineers so progress had stalled. >> > Yes, you are right, FW expects some response on every ABORT IOCB > to clear its resources. > > I can prepare the patch for qla2xxx. That would be awesome. > > But still, I do not see a sense of new Bodo's solution. Drivers are crashing in the aborted_task callout. His idea was a workaround/hack so we would call the queue_tm_rsp callout instead of aborted_ask like we did before which would avoid the crashes but allow us to keep the async behavior. To the initiator's it would work like before where it looks like a race where the abort response is received right after it has sent the lun reset. His assumption is that initiators handled that before so it would continue to work. If you fix qla, then we can easily fix loop and xen and we can do the correct behavior and keep the async feature. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-11-27 18:59 ` Mike Christie @ 2022-12-01 14:15 ` Bodo Stroesser 2022-12-08 3:45 ` Mike Christie 0 siblings, 1 reply; 15+ messages in thread From: Bodo Stroesser @ 2022-12-01 14:15 UTC (permalink / raw) To: Mike Christie, Dmitry Bogdanov; +Cc: Maurizio Lombardi, target-devel On 27.11.22 19:59, Mike Christie wrote: > On 11/25/22 3:34 AM, Dmitry Bogdanov wrote: >> On Mon, Nov 21, 2022 at 01:25:55PM -0600, Mike Christie wrote: >>> >>> On 11/21/22 7:35 AM, Dmitry Bogdanov wrote: >>>>> >>>> I will vote to your old patch. qla2xxx was fixed long time ago. >>> >>> What is the qla fix? I think we still leak. In commit >>> >>> commit 605e74025f953b995a3a241ead43bde71c1c99b5 >>> Author: Mike Christie <michael.christie@oracle.com> >>> Date: Sun Nov 1 12:59:31 2020 -0600 >>> >>> scsi: qla2xxx: Move sess cmd list/lock to driver >>> >>> when I changed the locking I had added the check: >>> >>> static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd) >>> { >>> struct qla_tgt_cmd *cmd; >>> unsigned long flags; >>> >>> if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) >>> return; >>> >> Yes, I was thinking about that commit. >>> because tmrs are not on the sess_cmd_list that's accessed the >>> next line down. We don't crash as a result, but I think we need >>> to add code to send the cleanup command to the FW. Bodo and I >>> were working on that part, but someone with more qla experience >>> needed to work on it so it could be properly tested. We didn't >>> hear back from the qla engineers so progress had stalled. >>> >> Yes, you are right, FW expects some response on every ABORT IOCB >> to clear its resources. >> >> I can prepare the patch for qla2xxx. > > That would be awesome. > >> >> But still, I do not see a sense of new Bodo's solution. > > Drivers are crashing in the aborted_task callout. His idea was a > workaround/hack so we would call the queue_tm_rsp callout instead of > aborted_ask like we did before which would avoid the crashes but > allow us to keep the async behavior. > Not exactly. In case we are not able to fix a driver, I wanted to change drivers's aborted_task callout to use its internal queue_tm_rsp routines in case of aborted TMRs. But that would work only if we changed the core to not call target_put_sess_cmd, which again would lead to changes in currently working drivers. Really a hack only. > To the initiator's it would work like before where it looks like a > race where the abort response is received right after it has sent > the lun reset. His assumption is that initiators handled that before > so it would continue to work. > > If you fix qla, then we can easily fix loop and xen and we can do > the correct behavior and keep the async feature. > Are we sure qla, loop and xen are the only drivers that handle aborted TMRs incorrectly? ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-12-01 14:15 ` Bodo Stroesser @ 2022-12-08 3:45 ` Mike Christie 2024-07-24 13:42 ` Gao Xiang 0 siblings, 1 reply; 15+ messages in thread From: Mike Christie @ 2022-12-08 3:45 UTC (permalink / raw) To: Bodo Stroesser, Dmitry Bogdanov; +Cc: Maurizio Lombardi, target-devel On 12/1/22 8:15 AM, Bodo Stroesser wrote: > Are we sure qla, loop and xen are the only drivers that handle aborted > TMRs incorrectly? I'm not sure now. When we looked at this before I was only checking for crashes, but didn't check if there could be issues like where the driver needed to do some cleanup in their aborted_task callout but hasn't been doing it. For example ibmvscsi's aborted_task callout won't crash because the fields it references are ok for a IO or tmr se_cmds. It doesn't do vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok there. However, I didn't look at the driver to see if maybe it did need to do some cleanup in the aborted_task callout and we just haven't been doing it. Same for the other drivers. I only checked if aborted_task would crash. Also we have a new driver efct, so we need to review that as well. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2022-12-08 3:45 ` Mike Christie @ 2024-07-24 13:42 ` Gao Xiang 2024-07-30 21:40 ` michael.christie 0 siblings, 1 reply; 15+ messages in thread From: Gao Xiang @ 2024-07-24 13:42 UTC (permalink / raw) To: Mike Christie, Bodo Stroesser, Dmitry Bogdanov Cc: Maurizio Lombardi, target-devel Hi all, On 2022/12/8 11:45, Mike Christie wrote: > On 12/1/22 8:15 AM, Bodo Stroesser wrote: >> Are we sure qla, loop and xen are the only drivers that handle aborted >> TMRs incorrectly? > > I'm not sure now. When we looked at this before I was only checking > for crashes, but didn't check if there could be issues like where > the driver needed to do some cleanup in their aborted_task callout > but hasn't been doing it. > > For example ibmvscsi's aborted_task callout won't crash because the > fields it references are ok for a IO or tmr se_cmds. It doesn't do > vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task > callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok > there. However, I didn't look at the driver to see if maybe it did need > to do some cleanup in the aborted_task callout and we just haven't > been doing it. > > Same for the other drivers. I only checked if aborted_task would crash. > Also we have a new driver efct, so we need to review that as well. Sorry I have very little knowledge of TCMU, but currently we have some call traces stuck as below: [811824.868078] task:kworker/u256:1 state:D stack: 0 pid:213661 ppid: 2 flags:0x00004000 [811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler [811824.868085] Call Trace: [811824.868091] __schedule+0x1ac/0x480 [811824.868092] schedule+0x46/0xb0 [811824.868095] schedule_timeout+0xe5/0x130 [811824.868110] ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod] [811824.868112] ? __prepare_to_swait+0x4f/0x70 [811824.868114] wait_for_completion+0x71/0xc0 [811824.868118] tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop] [811824.868120] tcm_loop_abort_task+0x3d/0x50 [tcm_loop] [811824.868121] scmd_eh_abort_handler+0x7b/0x210 [811824.868124] process_one_work+0x1a8/0x340 [811824.868125] worker_thread+0x49/0x2f0 [811824.868126] ? rescuer_thread+0x350/0x350 [811824.868127] kthread+0x118/0x140 [811824.868129] ? __kthread_bind_mask+0x60/0x60 [811824.868131] ret_from_fork+0x1f/0x30 [811824.868166] task:kworker/121:2 state:D stack: 0 pid:242954 ppid: 2 flags:0x00004000 [811824.868172] Workqueue: events target_tmr_work [target_core_mod] [811824.868172] Call Trace: [811824.868174] __schedule+0x1ac/0x480 [811824.868175] schedule+0x46/0xb0 [811824.868176] schedule_timeout+0xe5/0x130 [811824.868177] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 [811824.868178] ? __prepare_to_swait+0x4f/0x70 [811824.868179] wait_for_completion+0x71/0xc0 [811824.868184] target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod] [811824.868192] core_tmr_abort_task.cold+0x187/0x21a [target_core_mod] [811824.868198] target_tmr_work+0xa3/0xf0 [target_core_mod] [811824.868200] process_one_work+0x1a8/0x340 [811824.868201] worker_thread+0x49/0x2f0 [811824.868202] ? rescuer_thread+0x350/0x350 [811824.868202] kthread+0x118/0x140 [811824.868203] ? __kthread_bind_mask+0x60/0x60 [811824.868204] ret_from_fork+0x1f/0x30 I'm not sure how to recover from this state. Is it resolved upstream? Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2024-07-24 13:42 ` Gao Xiang @ 2024-07-30 21:40 ` michael.christie 2024-07-31 13:49 ` Gao Xiang 0 siblings, 1 reply; 15+ messages in thread From: michael.christie @ 2024-07-30 21:40 UTC (permalink / raw) To: Gao Xiang, Bodo Stroesser, Dmitry Bogdanov Cc: Maurizio Lombardi, target-devel On 7/24/24 8:42 AM, Gao Xiang wrote: > Hi all, > > On 2022/12/8 11:45, Mike Christie wrote: >> On 12/1/22 8:15 AM, Bodo Stroesser wrote: >>> Are we sure qla, loop and xen are the only drivers that handle aborted >>> TMRs incorrectly? >> >> I'm not sure now. When we looked at this before I was only checking >> for crashes, but didn't check if there could be issues like where >> the driver needed to do some cleanup in their aborted_task callout >> but hasn't been doing it. >> >> For example ibmvscsi's aborted_task callout won't crash because the >> fields it references are ok for a IO or tmr se_cmds. It doesn't do >> vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task >> callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok >> there. However, I didn't look at the driver to see if maybe it did need >> to do some cleanup in the aborted_task callout and we just haven't >> been doing it. >> >> Same for the other drivers. I only checked if aborted_task would crash. >> Also we have a new driver efct, so we need to review that as well. > > > Sorry I have very little knowledge of TCMU, but currently we have > some call traces stuck as below: > > [811824.868078] task:kworker/u256:1 state:D stack: 0 pid:213661 > ppid: 2 flags:0x00004000 > [811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler > [811824.868085] Call Trace: > [811824.868091] __schedule+0x1ac/0x480 > [811824.868092] schedule+0x46/0xb0 > [811824.868095] schedule_timeout+0xe5/0x130 > [811824.868110] ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod] > [811824.868112] ? __prepare_to_swait+0x4f/0x70 > [811824.868114] wait_for_completion+0x71/0xc0 > [811824.868118] tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop] > [811824.868120] tcm_loop_abort_task+0x3d/0x50 [tcm_loop] > [811824.868121] scmd_eh_abort_handler+0x7b/0x210 > [811824.868124] process_one_work+0x1a8/0x340 > [811824.868125] worker_thread+0x49/0x2f0 > [811824.868126] ? rescuer_thread+0x350/0x350 > [811824.868127] kthread+0x118/0x140 > [811824.868129] ? __kthread_bind_mask+0x60/0x60 > [811824.868131] ret_from_fork+0x1f/0x30 > [811824.868166] task:kworker/121:2 state:D stack: 0 pid:242954 > ppid: 2 flags:0x00004000 > [811824.868172] Workqueue: events target_tmr_work [target_core_mod] > [811824.868172] Call Trace: > [811824.868174] __schedule+0x1ac/0x480 > [811824.868175] schedule+0x46/0xb0 > [811824.868176] schedule_timeout+0xe5/0x130 > [811824.868177] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 > [811824.868178] ? __prepare_to_swait+0x4f/0x70 > [811824.868179] wait_for_completion+0x71/0xc0 > [811824.868184] target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod] > [811824.868192] core_tmr_abort_task.cold+0x187/0x21a [target_core_mod] > [811824.868198] target_tmr_work+0xa3/0xf0 [target_core_mod] > [811824.868200] process_one_work+0x1a8/0x340 > [811824.868201] worker_thread+0x49/0x2f0 > [811824.868202] ? rescuer_thread+0x350/0x350 > [811824.868202] kthread+0x118/0x140 > [811824.868203] ? __kthread_bind_mask+0x60/0x60 > [811824.868204] ret_from_fork+0x1f/0x30 > > I'm not sure how to recover from this state. Is it resolved upstream? > It's not. I think the safest thing is to just revert the patch which caused this commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". because we don't currently have the resources to fix qla. Do you want to send the patch? If not, I can send it. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: tcm_loop and aborted TMRs 2024-07-30 21:40 ` michael.christie @ 2024-07-31 13:49 ` Gao Xiang 0 siblings, 0 replies; 15+ messages in thread From: Gao Xiang @ 2024-07-31 13:49 UTC (permalink / raw) To: michael.christie Cc: Maurizio Lombardi, target-devel, Bodo Stroesser, Dmitry Bogdanov Hi Michael, On 2024/7/31 05:40, michael.christie@oracle.com wrote: > On 7/24/24 8:42 AM, Gao Xiang wrote: >> Hi all, >> >> On 2022/12/8 11:45, Mike Christie wrote: >>> On 12/1/22 8:15 AM, Bodo Stroesser wrote: >>>> Are we sure qla, loop and xen are the only drivers that handle aborted >>>> TMRs incorrectly? >>> >>> I'm not sure now. When we looked at this before I was only checking >>> for crashes, but didn't check if there could be issues like where >>> the driver needed to do some cleanup in their aborted_task callout >>> but hasn't been doing it. >>> >>> For example ibmvscsi's aborted_task callout won't crash because the >>> fields it references are ok for a IO or tmr se_cmds. It doesn't do >>> vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task >>> callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok >>> there. However, I didn't look at the driver to see if maybe it did need >>> to do some cleanup in the aborted_task callout and we just haven't >>> been doing it. >>> >>> Same for the other drivers. I only checked if aborted_task would crash. >>> Also we have a new driver efct, so we need to review that as well. >> >> >> Sorry I have very little knowledge of TCMU, but currently we have >> some call traces stuck as below: >> >> [811824.868078] task:kworker/u256:1 state:D stack: 0 pid:213661 >> ppid: 2 flags:0x00004000 >> [811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler >> [811824.868085] Call Trace: >> [811824.868091] __schedule+0x1ac/0x480 >> [811824.868092] schedule+0x46/0xb0 >> [811824.868095] schedule_timeout+0xe5/0x130 >> [811824.868110] ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod] >> [811824.868112] ? __prepare_to_swait+0x4f/0x70 >> [811824.868114] wait_for_completion+0x71/0xc0 >> [811824.868118] tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop] >> [811824.868120] tcm_loop_abort_task+0x3d/0x50 [tcm_loop] >> [811824.868121] scmd_eh_abort_handler+0x7b/0x210 >> [811824.868124] process_one_work+0x1a8/0x340 >> [811824.868125] worker_thread+0x49/0x2f0 >> [811824.868126] ? rescuer_thread+0x350/0x350 >> [811824.868127] kthread+0x118/0x140 >> [811824.868129] ? __kthread_bind_mask+0x60/0x60 >> [811824.868131] ret_from_fork+0x1f/0x30 >> [811824.868166] task:kworker/121:2 state:D stack: 0 pid:242954 >> ppid: 2 flags:0x00004000 >> [811824.868172] Workqueue: events target_tmr_work [target_core_mod] >> [811824.868172] Call Trace: >> [811824.868174] __schedule+0x1ac/0x480 >> [811824.868175] schedule+0x46/0xb0 >> [811824.868176] schedule_timeout+0xe5/0x130 >> [811824.868177] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 >> [811824.868178] ? __prepare_to_swait+0x4f/0x70 >> [811824.868179] wait_for_completion+0x71/0xc0 >> [811824.868184] target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod] >> [811824.868192] core_tmr_abort_task.cold+0x187/0x21a [target_core_mod] >> [811824.868198] target_tmr_work+0xa3/0xf0 [target_core_mod] >> [811824.868200] process_one_work+0x1a8/0x340 >> [811824.868201] worker_thread+0x49/0x2f0 >> [811824.868202] ? rescuer_thread+0x350/0x350 >> [811824.868202] kthread+0x118/0x140 >> [811824.868203] ? __kthread_bind_mask+0x60/0x60 >> [811824.868204] ret_from_fork+0x1f/0x30 >> >> I'm not sure how to recover from this state. Is it resolved upstream? >> > > It's not. > > I think the safest thing is to just revert the patch which caused this > > commit db5b21a24e01d354 "scsi: target/core: Use system workqueues for TMF". > > because we don't currently have the resources to fix qla. > > Do you want to send the patch? If not, I can send it. Thanks for the reply. I have very little knowledge and no time on this, just confirm its current status. But yeah, it'd be better to resolve, anyway. Thanks, Gao Xiang ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-07-31 13:49 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-11-11 16:20 tcm_loop and aborted TMRs Maurizio Lombardi 2022-11-11 16:21 ` Maurizio Lombardi 2022-11-11 21:18 ` Mike Christie 2022-11-12 13:59 ` Bodo Stroesser 2022-11-12 21:46 ` michael.christie 2022-11-19 11:42 ` Bodo Stroesser 2022-11-21 13:35 ` Dmitry Bogdanov 2022-11-21 19:25 ` Mike Christie 2022-11-25 9:34 ` Dmitry Bogdanov 2022-11-27 18:59 ` Mike Christie 2022-12-01 14:15 ` Bodo Stroesser 2022-12-08 3:45 ` Mike Christie 2024-07-24 13:42 ` Gao Xiang 2024-07-30 21:40 ` michael.christie 2024-07-31 13:49 ` Gao Xiang
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.