tcm_loop and aborted TMRs

All of lore.kernel.org
 help / color / mirror / Atom feed

* tcm_loop and aborted TMRs
@ 2022-11-11 16:20 Maurizio Lombardi
  2022-11-11 16:21 ` Maurizio Lombardi
  2022-11-11 21:18 ` Mike Christie
  0 siblings, 2 replies; 15+ messages in thread
From: Maurizio Lombardi @ 2022-11-11 16:20 UTC (permalink / raw)
  To: Bodo Stroesser; +Cc: Mike Christie, target-devel

Hello Bodo, Mike,

Some of our customers reported that the tcm_loop module is unable
to handle aborted TMRs, resulting in kernel hangs.

I noticed that Bodo submitted a patch some time ago (our customers
confirm it works),
Mike instead proposed to revert
commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".

The discussion unfortunately died out without reaching a conclusion.

Personally, I think that if the handling of aborted TMRs was working
before the "Use system workqueues" commit then this should be
considered as a regression and the commit should be reverted.

Maurizio

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-11 16:20 tcm_loop and aborted TMRs Maurizio Lombardi
@ 2022-11-11 16:21 ` Maurizio Lombardi
  2022-11-11 21:18 ` Mike Christie
  1 sibling, 0 replies; 15+ messages in thread
From: Maurizio Lombardi @ 2022-11-11 16:21 UTC (permalink / raw)
  To: Bodo Stroesser; +Cc: Mike Christie, target-devel

pá 11. 11. 2022 v 17:20 odesílatel Maurizio Lombardi
<mlombard@redhat.com> napsal:
>
> I noticed that Bodo submitted a patch some time ago (our customers
> confirm it works),
> Mike instead proposed to revert
> commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".
>

Link to the patch submitted by Bodo:
https://patchwork.kernel.org/project/target-devel/patch/20200715160403.12578-1-bstroesser@ts.fujitsu.com/#23512015

Thanks,
Maurizio


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-11 16:20 tcm_loop and aborted TMRs Maurizio Lombardi
  2022-11-11 16:21 ` Maurizio Lombardi
@ 2022-11-11 21:18 ` Mike Christie
  2022-11-12 13:59   ` Bodo Stroesser
  1 sibling, 1 reply; 15+ messages in thread
From: Mike Christie @ 2022-11-11 21:18 UTC (permalink / raw)
  To: Maurizio Lombardi, Bodo Stroesser; +Cc: target-devel

On 11/11/22 10:20 AM, Maurizio Lombardi wrote:
> Hello Bodo, Mike,
> 
> Some of our customers reported that the tcm_loop module is unable
> to handle aborted TMRs, resulting in kernel hangs.
> 
> I noticed that Bodo submitted a patch some time ago (our customers
> confirm it works),
> Mike instead proposed to revert
> commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".
> 
> The discussion unfortunately died out without reaching a conclusion.
> 
> Personally, I think that if the handling of aborted TMRs was working
> before the "Use system workqueues" commit then this should be
> considered as a regression and the commit should be reverted.
> 

I'm fine with reverting it because multiple drivers are affected. I had
talked to Bart offlist back then and he was also ok since we couldn't
fix up the drivers.

I think Bodo and I tried to convert qla, but it was difficult without
marvell's help (we both pinged them but didn't hear back) because from
what I could tell we needed to send some hw/fw commands to perform cleanup
to fully handle that case.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-11 21:18 ` Mike Christie
@ 2022-11-12 13:59   ` Bodo Stroesser
  2022-11-12 21:46     ` michael.christie
  2022-11-21 13:35     ` Dmitry Bogdanov
  0 siblings, 2 replies; 15+ messages in thread
From: Bodo Stroesser @ 2022-11-12 13:59 UTC (permalink / raw)
  To: Mike Christie, Maurizio Lombardi; +Cc: target-devel

Hello Mike, Maurizio,

Even if we couldn't yet find a method to fix handling of aborted
TMRs in the core or in all fabric drivers, I still think that keeping
the parallel handling of TMRs would be fine.

Tcmu offers a TMR notification mechanism to make userspace aware
of ABORT or RESET_LUN. So userspace can try to break cmd handling
and thus speed up TMR response. If we serialize TMR handling, then
the notifications are also serialized and thus lose some of their
power.

But maybe I have a new (?) idea of how to fix handling of aborted
TMRs in fabric drivers:
1) Modify core to not call target_put_sess_cmd, no matter whether
    SCF_ACK_REF is set.
2) Modify fabric drivers to handle an aborted TMR just like a
    normal TMR response. This means, e.g. qla2xxx would send a
    normal response for the Abort. This exactly is what happens
    when serializing TMRs, because in that case despite of the
    RESET_LUN the core always calls queue_tm_rsp callback instead
    of aborted_task callback.

So to initiators we would show the 'old' behavior, while internally
keeping the parallel processing of TMRs.

If fabric driver maintainers don't like that approach, they can
change their drivers to correctly kill aborted TMRs.

What do you think?

Bodo
  

On 11.11.22 22:18, Mike Christie wrote:
> On 11/11/22 10:20 AM, Maurizio Lombardi wrote:
>> Hello Bodo, Mike,
>>
>> Some of our customers reported that the tcm_loop module is unable
>> to handle aborted TMRs, resulting in kernel hangs.
>>
>> I noticed that Bodo submitted a patch some time ago (our customers
>> confirm it works),
>> Mike instead proposed to revert
>> commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".
>>
>> The discussion unfortunately died out without reaching a conclusion.
>>
>> Personally, I think that if the handling of aborted TMRs was working
>> before the "Use system workqueues" commit then this should be
>> considered as a regression and the commit should be reverted.
>>
> 
> I'm fine with reverting it because multiple drivers are affected. I had
> talked to Bart offlist back then and he was also ok since we couldn't
> fix up the drivers.
> 
> I think Bodo and I tried to convert qla, but it was difficult without
> marvell's help (we both pinged them but didn't hear back) because from
> what I could tell we needed to send some hw/fw commands to perform cleanup
> to fully handle that case.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-12 13:59   ` Bodo Stroesser
@ 2022-11-12 21:46     ` michael.christie
  2022-11-19 11:42       ` Bodo Stroesser
  2022-11-21 13:35     ` Dmitry Bogdanov
  1 sibling, 1 reply; 15+ messages in thread
From: michael.christie @ 2022-11-12 21:46 UTC (permalink / raw)
  To: Bodo Stroesser, Maurizio Lombardi; +Cc: target-devel

On 11/12/22 7:59 AM, Bodo Stroesser wrote:
> Hello Mike, Maurizio,
> 
> Even if we couldn't yet find a method to fix handling of aborted
> TMRs in the core or in all fabric drivers, I still think that keeping
> the parallel handling of TMRs would be fine.
> 
> Tcmu offers a TMR notification mechanism to make userspace aware
> of ABORT or RESET_LUN. So userspace can try to break cmd handling
> and thus speed up TMR response. If we serialize TMR handling, then
> the notifications are also serialized and thus lose some of their
> power.
> 
> But maybe I have a new (?) idea of how to fix handling of aborted
> TMRs in fabric drivers:
> 1) Modify core to not call target_put_sess_cmd, no matter whether
>    SCF_ACK_REF is set.
> 2) Modify fabric drivers to handle an aborted TMR just like a
>    normal TMR response. This means, e.g. qla2xxx would send a
>    normal response for the Abort. This exactly is what happens
>    when serializing TMRs, because in that case despite of the
>    RESET_LUN the core always calls queue_tm_rsp callback instead
>    of aborted_task callback.
> 
> So to initiators we would show the 'old' behavior, while internally
> keeping the parallel processing of TMRs.
> 
> If fabric driver maintainers don't like that approach, they can
> change their drivers to correctly kill aborted TMRs.
> 
> What do you think?
> 

I'm fine with doing it in parallel. However, the issue is we have real
users hitting it now and we have to fix all the drivers because it's a
regression. So if your idea is going take a while then we should revert 
now and then do your idea whenever it's ready.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-12 21:46     ` michael.christie
@ 2022-11-19 11:42       ` Bodo Stroesser
  0 siblings, 0 replies; 15+ messages in thread
From: Bodo Stroesser @ 2022-11-19 11:42 UTC (permalink / raw)
  To: michael.christie, Maurizio Lombardi; +Cc: target-devel

On 12.11.22 22:46, michael.christie@oracle.com wrote:
> On 11/12/22 7:59 AM, Bodo Stroesser wrote:
>> Hello Mike, Maurizio,
>>
>> Even if we couldn't yet find a method to fix handling of aborted
>> TMRs in the core or in all fabric drivers, I still think that keeping
>> the parallel handling of TMRs would be fine.
>>
>> Tcmu offers a TMR notification mechanism to make userspace aware
>> of ABORT or RESET_LUN. So userspace can try to break cmd handling
>> and thus speed up TMR response. If we serialize TMR handling, then
>> the notifications are also serialized and thus lose some of their
>> power.
>>
>> But maybe I have a new (?) idea of how to fix handling of aborted
>> TMRs in fabric drivers:
>> 1) Modify core to not call target_put_sess_cmd, no matter whether
>>     SCF_ACK_REF is set.
>> 2) Modify fabric drivers to handle an aborted TMR just like a
>>     normal TMR response. This means, e.g. qla2xxx would send a
>>     normal response for the Abort. This exactly is what happens
>>     when serializing TMRs, because in that case despite of the
>>     RESET_LUN the core always calls queue_tm_rsp callback instead
>>     of aborted_task callback.
>>
>> So to initiators we would show the 'old' behavior, while internally
>> keeping the parallel processing of TMRs.
>>
>> If fabric driver maintainers don't like that approach, they can
>> change their drivers to correctly kill aborted TMRs.
>>
>> What do you think?
>>
> 
> I'm fine with doing it in parallel. However, the issue is we have real
> users hitting it now and we have to fix all the drivers because it's a
> regression. So if your idea is going take a while then we should revert
> now and then do your idea whenever it's ready.

I agree.

Even if my old patch fixes the issue for tcm_loop users, it does not
make sense to apply it, since the new idea would lead to reverting
parts of it. And with the patch we would still take the risk of users
running into trouble with fabrics other than tcm_loop.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-12 13:59   ` Bodo Stroesser
  2022-11-12 21:46     ` michael.christie
@ 2022-11-21 13:35     ` Dmitry Bogdanov
  2022-11-21 19:25       ` Mike Christie
  1 sibling, 1 reply; 15+ messages in thread
From: Dmitry Bogdanov @ 2022-11-21 13:35 UTC (permalink / raw)
  To: Bodo Stroesser; +Cc: Mike Christie, Maurizio Lombardi, target-devel


Hi Bodo,

On Sat, Nov 12, 2022 at 02:59:48PM +0100, Bodo Stroesser wrote:
> 
> Hello Mike, Maurizio,
> 
> Even if we couldn't yet find a method to fix handling of aborted
> TMRs in the core or in all fabric drivers, I still think that keeping
> the parallel handling of TMRs would be fine.
> 
> Tcmu offers a TMR notification mechanism to make userspace aware
> of ABORT or RESET_LUN. So userspace can try to break cmd handling
> and thus speed up TMR response. If we serialize TMR handling, then
> the notifications are also serialized and thus lose some of their
> power.
> 
> But maybe I have a new (?) idea of how to fix handling of aborted
> TMRs in fabric drivers:
> 1) Modify core to not call target_put_sess_cmd, no matter whether
>    SCF_ACK_REF is set.
> 2) Modify fabric drivers to handle an aborted TMR just like a
>    normal TMR response. This means, e.g. qla2xxx would send a
>    normal response for the Abort. This exactly is what happens
>    when serializing TMRs, because in that case despite of the
>    RESET_LUN the core always calls queue_tm_rsp callback instead
>    of aborted_task callback.
I am not sure for all initiators, but usually it sends LUN_RESET only
after ABORT has been timed out. There is no reason to send a response on
the ABORT that was actually aborted already internally on the initiator
side.
 
> So to initiators we would show the 'old' behavior, while internally
> keeping the parallel processing of TMRs.
> 
> If fabric driver maintainers don't like that approach, they can
> change their drivers to correctly kill aborted TMRs.
> 
> What do you think?
> 
I will vote to your old patch. qla2xxx was fixed long time ago. Other
fabric drivers have not that issue too. Only tcm_loop and xen still do
not adapted for parallel TMRs.

I think it's a not good idea to revert 2 years old patch. It was a
reason for some other patches (that fixes issues thanks to parallel TMR
handling).

> 
> On 11.11.22 22:18, Mike Christie wrote:
> > On 11/11/22 10:20 AM, Maurizio Lombardi wrote:
> > > Hello Bodo, Mike,
> > > 
> > > Some of our customers reported that the tcm_loop module is unable
> > > to handle aborted TMRs, resulting in kernel hangs.
> > > 
> > > I noticed that Bodo submitted a patch some time ago (our customers
> > > confirm it works),
> > > Mike instead proposed to revert
> > > commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".
> > > 
> > > The discussion unfortunately died out without reaching a conclusion.
> > > 
> > > Personally, I think that if the handling of aborted TMRs was working
> > > before the "Use system workqueues" commit then this should be
> > > considered as a regression and the commit should be reverted.
> > > 
> > 
> > I'm fine with reverting it because multiple drivers are affected. I had
> > talked to Bart offlist back then and he was also ok since we couldn't
> > fix up the drivers.
> > 
> > I think Bodo and I tried to convert qla, but it was difficult without
> > marvell's help (we both pinged them but didn't hear back) because from
> > what I could tell we needed to send some hw/fw commands to perform cleanup
> > to fully handle that case.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-21 13:35     ` Dmitry Bogdanov
@ 2022-11-21 19:25       ` Mike Christie
  2022-11-25  9:34         ` Dmitry Bogdanov
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Christie @ 2022-11-21 19:25 UTC (permalink / raw)
  To: Dmitry Bogdanov, Bodo Stroesser; +Cc: Maurizio Lombardi, target-devel

On 11/21/22 7:35 AM, Dmitry Bogdanov wrote:
>>
> I will vote to your old patch. qla2xxx was fixed long time ago.

What is the qla fix? I think we still leak. In commit

commit 605e74025f953b995a3a241ead43bde71c1c99b5
Author: Mike Christie <michael.christie@oracle.com>
Date:   Sun Nov 1 12:59:31 2020 -0600

    scsi: qla2xxx: Move sess cmd list/lock to driver

when I changed the locking I had added the check:

static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd)
{
        struct qla_tgt_cmd *cmd;
        unsigned long flags;

        if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB)
                return;

because tmrs are not on the sess_cmd_list that's accessed the
next line down. We don't crash as a result, but I think we need
to add code to send the cleanup command to the FW. Bodo and I
were working on that part, but someone with more qla experience
needed to work on it so it could be properly tested. We didn't
hear back from the qla engineers so progress had stalled.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-21 19:25       ` Mike Christie
@ 2022-11-25  9:34         ` Dmitry Bogdanov
  2022-11-27 18:59           ` Mike Christie
  0 siblings, 1 reply; 15+ messages in thread
From: Dmitry Bogdanov @ 2022-11-25  9:34 UTC (permalink / raw)
  To: Mike Christie; +Cc: Bodo Stroesser, Maurizio Lombardi, target-devel

On Mon, Nov 21, 2022 at 01:25:55PM -0600, Mike Christie wrote:
> 
> On 11/21/22 7:35 AM, Dmitry Bogdanov wrote:
> >>
> > I will vote to your old patch. qla2xxx was fixed long time ago.
> 
> What is the qla fix? I think we still leak. In commit
> 
> commit 605e74025f953b995a3a241ead43bde71c1c99b5
> Author: Mike Christie <michael.christie@oracle.com>
> Date:   Sun Nov 1 12:59:31 2020 -0600
> 
>     scsi: qla2xxx: Move sess cmd list/lock to driver
> 
> when I changed the locking I had added the check:
> 
> static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd)
> {
>         struct qla_tgt_cmd *cmd;
>         unsigned long flags;
> 
>         if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB)
>                 return;
> 
Yes, I was thinking about that commit.
> because tmrs are not on the sess_cmd_list that's accessed the
> next line down. We don't crash as a result, but I think we need
> to add code to send the cleanup command to the FW. Bodo and I
> were working on that part, but someone with more qla experience
> needed to work on it so it could be properly tested. We didn't
> hear back from the qla engineers so progress had stalled.
> 
Yes, you are right, FW expects some response on every ABORT IOCB
to clear its resources.

I can prepare the patch for qla2xxx.

But still, I do not see a sense of new Bodo's solution.
Calling target_put_sess_cmd due to SCF_ACK_REF forbids any async long
term work in fabric drivers in aborted_task callback. It is
intentionally. qla2xxx does not require to wait for the completion of
all requests to FW. All terminate exchange requests are such.

SAM-5 states that an aborted(due to LUN_RESET) TMR should not be
responded to initiator:

7.11 Task management function lifetime
 The task management function shall exist until:
  a) the task manager sends a service response for the task management function;
  b) an I_T nexus loss (see 6.3.4);
* c) a logical unit reset (see 6.3.3);
  d) a hard reset (see 6.3.2);
  e) power loss expected (see 6.3.5); or
  f) a power on condition (see 6.3.1)


BR,
 Dmitry


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-25  9:34         ` Dmitry Bogdanov
@ 2022-11-27 18:59           ` Mike Christie
  2022-12-01 14:15             ` Bodo Stroesser
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Christie @ 2022-11-27 18:59 UTC (permalink / raw)
  To: Dmitry Bogdanov; +Cc: Bodo Stroesser, Maurizio Lombardi, target-devel

On 11/25/22 3:34 AM, Dmitry Bogdanov wrote:
> On Mon, Nov 21, 2022 at 01:25:55PM -0600, Mike Christie wrote:
>>
>> On 11/21/22 7:35 AM, Dmitry Bogdanov wrote:
>>>>
>>> I will vote to your old patch. qla2xxx was fixed long time ago.
>>
>> What is the qla fix? I think we still leak. In commit
>>
>> commit 605e74025f953b995a3a241ead43bde71c1c99b5
>> Author: Mike Christie <michael.christie@oracle.com>
>> Date:   Sun Nov 1 12:59:31 2020 -0600
>>
>>     scsi: qla2xxx: Move sess cmd list/lock to driver
>>
>> when I changed the locking I had added the check:
>>
>> static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd)
>> {
>>         struct qla_tgt_cmd *cmd;
>>         unsigned long flags;
>>
>>         if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB)
>>                 return;
>>
> Yes, I was thinking about that commit.
>> because tmrs are not on the sess_cmd_list that's accessed the
>> next line down. We don't crash as a result, but I think we need
>> to add code to send the cleanup command to the FW. Bodo and I
>> were working on that part, but someone with more qla experience
>> needed to work on it so it could be properly tested. We didn't
>> hear back from the qla engineers so progress had stalled.
>>
> Yes, you are right, FW expects some response on every ABORT IOCB
> to clear its resources.
> 
> I can prepare the patch for qla2xxx.

That would be awesome.

> 
> But still, I do not see a sense of new Bodo's solution.

Drivers are crashing in the aborted_task callout. His idea was a
workaround/hack so we would call the queue_tm_rsp callout instead of
aborted_ask like we did before which would avoid the crashes but
allow us to keep the async behavior.

To the initiator's it would work like before where it looks like a
race where the abort response is received right after it has sent
the lun reset. His assumption is that initiators handled that before
so it would continue to work.

If you fix qla, then we can easily fix loop and xen and we can do
the correct behavior and keep the async feature.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-11-27 18:59           ` Mike Christie
@ 2022-12-01 14:15             ` Bodo Stroesser
  2022-12-08  3:45               ` Mike Christie
  0 siblings, 1 reply; 15+ messages in thread
From: Bodo Stroesser @ 2022-12-01 14:15 UTC (permalink / raw)
  To: Mike Christie, Dmitry Bogdanov; +Cc: Maurizio Lombardi, target-devel

On 27.11.22 19:59, Mike Christie wrote:
> On 11/25/22 3:34 AM, Dmitry Bogdanov wrote:
>> On Mon, Nov 21, 2022 at 01:25:55PM -0600, Mike Christie wrote:
>>>
>>> On 11/21/22 7:35 AM, Dmitry Bogdanov wrote:
>>>>>
>>>> I will vote to your old patch. qla2xxx was fixed long time ago.
>>>
>>> What is the qla fix? I think we still leak. In commit
>>>
>>> commit 605e74025f953b995a3a241ead43bde71c1c99b5
>>> Author: Mike Christie <michael.christie@oracle.com>
>>> Date:   Sun Nov 1 12:59:31 2020 -0600
>>>
>>>      scsi: qla2xxx: Move sess cmd list/lock to driver
>>>
>>> when I changed the locking I had added the check:
>>>
>>> static void tcm_qla2xxx_aborted_task(struct se_cmd *se_cmd)
>>> {
>>>          struct qla_tgt_cmd *cmd;
>>>          unsigned long flags;
>>>
>>>          if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB)
>>>                  return;
>>>
>> Yes, I was thinking about that commit.
>>> because tmrs are not on the sess_cmd_list that's accessed the
>>> next line down. We don't crash as a result, but I think we need
>>> to add code to send the cleanup command to the FW. Bodo and I
>>> were working on that part, but someone with more qla experience
>>> needed to work on it so it could be properly tested. We didn't
>>> hear back from the qla engineers so progress had stalled.
>>>
>> Yes, you are right, FW expects some response on every ABORT IOCB
>> to clear its resources.
>>
>> I can prepare the patch for qla2xxx.
> 
> That would be awesome.
> 
>>
>> But still, I do not see a sense of new Bodo's solution.
> 
> Drivers are crashing in the aborted_task callout. His idea was a
> workaround/hack so we would call the queue_tm_rsp callout instead of
> aborted_ask like we did before which would avoid the crashes but
> allow us to keep the async behavior.
> 

Not exactly. In case we are not able to fix a driver, I wanted to
change drivers's aborted_task callout to use its internal queue_tm_rsp
routines in case of aborted TMRs.
But that would work only if we changed the core to not call
target_put_sess_cmd, which again would lead to changes in currently
working drivers. Really a hack only.

> To the initiator's it would work like before where it looks like a
> race where the abort response is received right after it has sent
> the lun reset. His assumption is that initiators handled that before
> so it would continue to work.
> 
> If you fix qla, then we can easily fix loop and xen and we can do
> the correct behavior and keep the async feature.
> 

Are we sure qla, loop and xen are the only drivers that handle aborted
TMRs incorrectly?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-12-01 14:15             ` Bodo Stroesser
@ 2022-12-08  3:45               ` Mike Christie
  2024-07-24 13:42                 ` Gao Xiang
  0 siblings, 1 reply; 15+ messages in thread
From: Mike Christie @ 2022-12-08  3:45 UTC (permalink / raw)
  To: Bodo Stroesser, Dmitry Bogdanov; +Cc: Maurizio Lombardi, target-devel

On 12/1/22 8:15 AM, Bodo Stroesser wrote:
> Are we sure qla, loop and xen are the only drivers that handle aborted
> TMRs incorrectly?

I'm not sure now. When we looked at this before I was only checking
for crashes, but didn't check if there could be issues like where
the driver needed to do some cleanup in their aborted_task callout
but hasn't been doing it.

For example ibmvscsi's aborted_task callout won't crash because the
fields it references are ok for a IO or tmr se_cmds. It doesn't do
vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task
callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok
there. However, I didn't look at the driver to see if maybe it did need
to do some cleanup in the aborted_task callout and we just haven't
been doing it.

Same for the other drivers. I only checked if aborted_task would crash.
Also we have a new driver efct, so we need to review that as well.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2022-12-08  3:45               ` Mike Christie
@ 2024-07-24 13:42                 ` Gao Xiang
  2024-07-30 21:40                   ` michael.christie
  0 siblings, 1 reply; 15+ messages in thread
From: Gao Xiang @ 2024-07-24 13:42 UTC (permalink / raw)
  To: Mike Christie, Bodo Stroesser, Dmitry Bogdanov
  Cc: Maurizio Lombardi, target-devel

Hi all,

On 2022/12/8 11:45, Mike Christie wrote:
> On 12/1/22 8:15 AM, Bodo Stroesser wrote:
>> Are we sure qla, loop and xen are the only drivers that handle aborted
>> TMRs incorrectly?
> 
> I'm not sure now. When we looked at this before I was only checking
> for crashes, but didn't check if there could be issues like where
> the driver needed to do some cleanup in their aborted_task callout
> but hasn't been doing it.
> 
> For example ibmvscsi's aborted_task callout won't crash because the
> fields it references are ok for a IO or tmr se_cmds. It doesn't do
> vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task
> callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok
> there. However, I didn't look at the driver to see if maybe it did need
> to do some cleanup in the aborted_task callout and we just haven't
> been doing it.
> 
> Same for the other drivers. I only checked if aborted_task would crash.
> Also we have a new driver efct, so we need to review that as well.


Sorry I have very little knowledge of TCMU, but currently we have
some call traces stuck as below:

[811824.868078] task:kworker/u256:1  state:D stack:    0 pid:213661 ppid:     2 flags:0x00004000
[811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler
[811824.868085] Call Trace:
[811824.868091]  __schedule+0x1ac/0x480
[811824.868092]  schedule+0x46/0xb0
[811824.868095]  schedule_timeout+0xe5/0x130
[811824.868110]  ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod]
[811824.868112]  ? __prepare_to_swait+0x4f/0x70
[811824.868114]  wait_for_completion+0x71/0xc0
[811824.868118]  tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop]
[811824.868120]  tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
[811824.868121]  scmd_eh_abort_handler+0x7b/0x210
[811824.868124]  process_one_work+0x1a8/0x340
[811824.868125]  worker_thread+0x49/0x2f0
[811824.868126]  ? rescuer_thread+0x350/0x350
[811824.868127]  kthread+0x118/0x140
[811824.868129]  ? __kthread_bind_mask+0x60/0x60
[811824.868131]  ret_from_fork+0x1f/0x30
[811824.868166] task:kworker/121:2   state:D stack:    0 pid:242954 ppid:     2 flags:0x00004000
[811824.868172] Workqueue: events target_tmr_work [target_core_mod]
[811824.868172] Call Trace:
[811824.868174]  __schedule+0x1ac/0x480
[811824.868175]  schedule+0x46/0xb0
[811824.868176]  schedule_timeout+0xe5/0x130
[811824.868177]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
[811824.868178]  ? __prepare_to_swait+0x4f/0x70
[811824.868179]  wait_for_completion+0x71/0xc0
[811824.868184]  target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod]
[811824.868192]  core_tmr_abort_task.cold+0x187/0x21a [target_core_mod]
[811824.868198]  target_tmr_work+0xa3/0xf0 [target_core_mod]
[811824.868200]  process_one_work+0x1a8/0x340
[811824.868201]  worker_thread+0x49/0x2f0
[811824.868202]  ? rescuer_thread+0x350/0x350
[811824.868202]  kthread+0x118/0x140
[811824.868203]  ? __kthread_bind_mask+0x60/0x60
[811824.868204]  ret_from_fork+0x1f/0x30

I'm not sure how to recover from this state.  Is it resolved upstream?

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2024-07-24 13:42                 ` Gao Xiang
@ 2024-07-30 21:40                   ` michael.christie
  2024-07-31 13:49                     ` Gao Xiang
  0 siblings, 1 reply; 15+ messages in thread
From: michael.christie @ 2024-07-30 21:40 UTC (permalink / raw)
  To: Gao Xiang, Bodo Stroesser, Dmitry Bogdanov
  Cc: Maurizio Lombardi, target-devel

On 7/24/24 8:42 AM, Gao Xiang wrote:
> Hi all,
> 
> On 2022/12/8 11:45, Mike Christie wrote:
>> On 12/1/22 8:15 AM, Bodo Stroesser wrote:
>>> Are we sure qla, loop and xen are the only drivers that handle aborted
>>> TMRs incorrectly?
>>
>> I'm not sure now. When we looked at this before I was only checking
>> for crashes, but didn't check if there could be issues like where
>> the driver needed to do some cleanup in their aborted_task callout
>> but hasn't been doing it.
>>
>> For example ibmvscsi's aborted_task callout won't crash because the
>> fields it references are ok for a IO or tmr se_cmds. It doesn't do
>> vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task
>> callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok
>> there. However, I didn't look at the driver to see if maybe it did need
>> to do some cleanup in the aborted_task callout and we just haven't
>> been doing it.
>>
>> Same for the other drivers. I only checked if aborted_task would crash.
>> Also we have a new driver efct, so we need to review that as well.
> 
> 
> Sorry I have very little knowledge of TCMU, but currently we have
> some call traces stuck as below:
> 
> [811824.868078] task:kworker/u256:1  state:D stack:    0 pid:213661
> ppid:     2 flags:0x00004000
> [811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler
> [811824.868085] Call Trace:
> [811824.868091]  __schedule+0x1ac/0x480
> [811824.868092]  schedule+0x46/0xb0
> [811824.868095]  schedule_timeout+0xe5/0x130
> [811824.868110]  ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod]
> [811824.868112]  ? __prepare_to_swait+0x4f/0x70
> [811824.868114]  wait_for_completion+0x71/0xc0
> [811824.868118]  tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop]
> [811824.868120]  tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
> [811824.868121]  scmd_eh_abort_handler+0x7b/0x210
> [811824.868124]  process_one_work+0x1a8/0x340
> [811824.868125]  worker_thread+0x49/0x2f0
> [811824.868126]  ? rescuer_thread+0x350/0x350
> [811824.868127]  kthread+0x118/0x140
> [811824.868129]  ? __kthread_bind_mask+0x60/0x60
> [811824.868131]  ret_from_fork+0x1f/0x30
> [811824.868166] task:kworker/121:2   state:D stack:    0 pid:242954
> ppid:     2 flags:0x00004000
> [811824.868172] Workqueue: events target_tmr_work [target_core_mod]
> [811824.868172] Call Trace:
> [811824.868174]  __schedule+0x1ac/0x480
> [811824.868175]  schedule+0x46/0xb0
> [811824.868176]  schedule_timeout+0xe5/0x130
> [811824.868177]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
> [811824.868178]  ? __prepare_to_swait+0x4f/0x70
> [811824.868179]  wait_for_completion+0x71/0xc0
> [811824.868184]  target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod]
> [811824.868192]  core_tmr_abort_task.cold+0x187/0x21a [target_core_mod]
> [811824.868198]  target_tmr_work+0xa3/0xf0 [target_core_mod]
> [811824.868200]  process_one_work+0x1a8/0x340
> [811824.868201]  worker_thread+0x49/0x2f0
> [811824.868202]  ? rescuer_thread+0x350/0x350
> [811824.868202]  kthread+0x118/0x140
> [811824.868203]  ? __kthread_bind_mask+0x60/0x60
> [811824.868204]  ret_from_fork+0x1f/0x30
> 
> I'm not sure how to recover from this state.  Is it resolved upstream?
> 

It's not.

I think the safest thing is to just revert the patch which caused this

commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".

because we don't currently have the resources to fix qla.

Do you want to send the patch? If not, I can send it.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: tcm_loop and aborted TMRs
  2024-07-30 21:40                   ` michael.christie
@ 2024-07-31 13:49                     ` Gao Xiang
  0 siblings, 0 replies; 15+ messages in thread
From: Gao Xiang @ 2024-07-31 13:49 UTC (permalink / raw)
  To: michael.christie
  Cc: Maurizio Lombardi, target-devel, Bodo Stroesser, Dmitry Bogdanov

Hi Michael,

On 2024/7/31 05:40, michael.christie@oracle.com wrote:
> On 7/24/24 8:42 AM, Gao Xiang wrote:
>> Hi all,
>>
>> On 2022/12/8 11:45, Mike Christie wrote:
>>> On 12/1/22 8:15 AM, Bodo Stroesser wrote:
>>>> Are we sure qla, loop and xen are the only drivers that handle aborted
>>>> TMRs incorrectly?
>>>
>>> I'm not sure now. When we looked at this before I was only checking
>>> for crashes, but didn't check if there could be issues like where
>>> the driver needed to do some cleanup in their aborted_task callout
>>> but hasn't been doing it.
>>>
>>> For example ibmvscsi's aborted_task callout won't crash because the
>>> fields it references are ok for a IO or tmr se_cmds. It doesn't do
>>> vio_iu(iue)->srp.tsk_mgmt or vio_iu(iue)->srp.cmd in the aborted_task
>>> callout and just accesses the se_cmd and ibmvscsis_cmd. So we are ok
>>> there. However, I didn't look at the driver to see if maybe it did need
>>> to do some cleanup in the aborted_task callout and we just haven't
>>> been doing it.
>>>
>>> Same for the other drivers. I only checked if aborted_task would crash.
>>> Also we have a new driver efct, so we need to review that as well.
>>
>>
>> Sorry I have very little knowledge of TCMU, but currently we have
>> some call traces stuck as below:
>>
>> [811824.868078] task:kworker/u256:1  state:D stack:    0 pid:213661
>> ppid:     2 flags:0x00004000
>> [811824.868084] Workqueue: scsi_tmf_24 scmd_eh_abort_handler
>> [811824.868085] Call Trace:
>> [811824.868091]  __schedule+0x1ac/0x480
>> [811824.868092]  schedule+0x46/0xb0
>> [811824.868095]  schedule_timeout+0xe5/0x130
>> [811824.868110]  ? transport_generic_handle_tmr+0xb9/0xd0 [target_core_mod]
>> [811824.868112]  ? __prepare_to_swait+0x4f/0x70
>> [811824.868114]  wait_for_completion+0x71/0xc0
>> [811824.868118]  tcm_loop_issue_tmr+0xbb/0x100 [tcm_loop]
>> [811824.868120]  tcm_loop_abort_task+0x3d/0x50 [tcm_loop]
>> [811824.868121]  scmd_eh_abort_handler+0x7b/0x210
>> [811824.868124]  process_one_work+0x1a8/0x340
>> [811824.868125]  worker_thread+0x49/0x2f0
>> [811824.868126]  ? rescuer_thread+0x350/0x350
>> [811824.868127]  kthread+0x118/0x140
>> [811824.868129]  ? __kthread_bind_mask+0x60/0x60
>> [811824.868131]  ret_from_fork+0x1f/0x30
>> [811824.868166] task:kworker/121:2   state:D stack:    0 pid:242954
>> ppid:     2 flags:0x00004000
>> [811824.868172] Workqueue: events target_tmr_work [target_core_mod]
>> [811824.868172] Call Trace:
>> [811824.868174]  __schedule+0x1ac/0x480
>> [811824.868175]  schedule+0x46/0xb0
>> [811824.868176]  schedule_timeout+0xe5/0x130
>> [811824.868177]  ? asm_sysvec_apic_timer_interrupt+0x12/0x20
>> [811824.868178]  ? __prepare_to_swait+0x4f/0x70
>> [811824.868179]  wait_for_completion+0x71/0xc0
>> [811824.868184]  target_put_cmd_and_wait+0x5d/0xb0 [target_core_mod]
>> [811824.868192]  core_tmr_abort_task.cold+0x187/0x21a [target_core_mod]
>> [811824.868198]  target_tmr_work+0xa3/0xf0 [target_core_mod]
>> [811824.868200]  process_one_work+0x1a8/0x340
>> [811824.868201]  worker_thread+0x49/0x2f0
>> [811824.868202]  ? rescuer_thread+0x350/0x350
>> [811824.868202]  kthread+0x118/0x140
>> [811824.868203]  ? __kthread_bind_mask+0x60/0x60
>> [811824.868204]  ret_from_fork+0x1f/0x30
>>
>> I'm not sure how to recover from this state.  Is it resolved upstream?
>>
> 
> It's not.
> 
> I think the safest thing is to just revert the patch which caused this
> 
> commit db5b21a24e01d354  "scsi: target/core: Use system workqueues for TMF".
> 
> because we don't currently have the resources to fix qla.
> 
> Do you want to send the patch? If not, I can send it.

Thanks for the reply.

I have very little knowledge and no time on this, just confirm its
current status.  But yeah, it'd be better to resolve, anyway.

Thanks,
Gao Xiang

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-07-31 13:49 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-11-11 16:20 tcm_loop and aborted TMRs Maurizio Lombardi
2022-11-11 16:21 ` Maurizio Lombardi
2022-11-11 21:18 ` Mike Christie
2022-11-12 13:59   ` Bodo Stroesser
2022-11-12 21:46     ` michael.christie
2022-11-19 11:42       ` Bodo Stroesser
2022-11-21 13:35     ` Dmitry Bogdanov
2022-11-21 19:25       ` Mike Christie
2022-11-25  9:34         ` Dmitry Bogdanov
2022-11-27 18:59           ` Mike Christie
2022-12-01 14:15             ` Bodo Stroesser
2022-12-08  3:45               ` Mike Christie
2024-07-24 13:42                 ` Gao Xiang
2024-07-30 21:40                   ` michael.christie
2024-07-31 13:49                     ` Gao Xiang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.