* SPC-3 persistent reservations race condition
@ 2018-11-13 15:17 Martin Svec
  2018-11-20 18:31 ` David Disseldorp
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Martin Svec @ 2018-11-13 15:17 UTC (permalink / raw)
  To: target-devel

Hello,

during Hyper-V Cluster Validation tests, we hit a bug in the SPC-3 persistent reservations
implementation where the LIO iSCSI target started to loop indefinitely while freeing
a PR registration.

According to our investigation, there's the following race between the pr_reg->pr_res_holders
reference counter and dev->dev_reservation_lock:

(1) Thread A enters core_scsi3_pro_preempt() and locks dev->dev_reservation_lock.

(2) Thread B enters core_scsi3_pro_preempt() and increments pr_reg->pr_res_holders
    via core_scsi3_locate_pr_reg().

(3) Thread B waits for dev->dev_reservation_lock which is held by thread A.

(4) Thread A goes through the registration list and calls __core_scsi3_free_registration()
    to remove all registrations.

(5) When __core_scsi3_free_registration() is called for the pr_reg referenced by thread B
    in step 2, it ends up waiting indefinitely for pr_reg->pr_res_holders to drop to 0.

That is, thread A waits for thread B to decrease pr_res_holders and thread B waits
for thread A to release dev->dev_reservation_lock. 
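
The interleaving in steps (1) to (5) can be replayed in a small user-space model.
This is only a sketch under stated assumptions, not the kernel code: all model_*
names are hypothetical stand-ins, the lock is modeled as an atomic flag, the
steps run in a fixed order on one thread, and the indefinite wait is replaced by
a bounded poll so that the program terminates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins; only the field names follow the report. */
struct model_dev {
	atomic_bool dev_reservation_lock;	/* true = held */
};

struct model_pr_reg {
	atomic_int pr_res_holders;		/* reference count */
};

/* Single lock attempt; returns true if the caller now owns the lock. */
static bool model_trylock(struct model_dev *dev)
{
	return !atomic_exchange(&dev->dev_reservation_lock, true);
}

static void model_unlock(struct model_dev *dev)
{
	atomic_store(&dev->dev_reservation_lock, false);
}

/* Stand-in for the reference taken by core_scsi3_locate_pr_reg(). */
static void model_locate_pr_reg(struct model_pr_reg *reg)
{
	atomic_fetch_add(&reg->pr_res_holders, 1);
}

/*
 * Stand-in for the wait in __core_scsi3_free_registration(), bounded so
 * the model terminates. Returns true if the count reached zero in time.
 */
static bool model_free_registration(struct model_pr_reg *reg, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (atomic_load(&reg->pr_res_holders) == 0)
			return true;
	}
	return false;
}

/*
 * Replays the reported ordering. With b_takes_ref == false (a control
 * case invented for this model) thread B holds no reference and thread A
 * can finish. Returns true if the circular wait is reached.
 */
static bool model_replay_race(bool b_takes_ref)
{
	struct model_dev dev;
	struct model_pr_reg reg;

	atomic_init(&dev.dev_reservation_lock, false);
	atomic_init(&reg.pr_res_holders, 0);

	bool a_has_lock = model_trylock(&dev);			/* step 1 */
	if (b_takes_ref)
		model_locate_pr_reg(&reg);			/* step 2 */
	bool b_has_lock = model_trylock(&dev);			/* step 3 */
	bool a_freed = model_free_registration(&reg, 1000);	/* steps 4, 5 */

	if (a_has_lock)
		model_unlock(&dev);

	/* A holds the lock B needs; B holds the reference A waits on. */
	return a_has_lock && !b_has_lock && !a_freed;
}
```

Calling model_replay_race(true) reports the circular wait, while
model_replay_race(false) does not, which matches the description: the problem
needs thread B to hold a reference while it is blocked on the lock.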

We're able to reproduce this deadlock on LTS kernels 4.14.78 and 4.9.x, but the bug
seems to have been there from the beginning. I also guess that the same deadlock can occur in
core_scsi3_emulate_pro_clear().

Any ideas how to fix the locking?

Martin

* Re: SPC-3 persistent reservations race condition
  2018-11-13 15:17 SPC-3 persistent reservations race condition Martin Svec
@ 2018-11-20 18:31 ` David Disseldorp
  2018-11-21 13:16 ` Martin Svec
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Disseldorp @ 2018-11-20 18:31 UTC (permalink / raw)
  To: target-devel

Hi Martin,

On Tue, 13 Nov 2018 16:17:36 +0100, Martin Svec wrote:

> Hello,
> 
> during Hyper-V Cluster Validation tests, we hit a bug in SPC-3 persistent reservations
> implementation where LIO iSCSI target started to loop indefinitely while freeing
> a PR registration.

Any chance you could capture and share a network trace for the specific
test that triggers the deadlock? Having something to turn into a
libiscsi regression test would be helpful here.

Cheers, David

* Re: SPC-3 persistent reservations race condition
  2018-11-13 15:17 SPC-3 persistent reservations race condition Martin Svec
  2018-11-20 18:31 ` David Disseldorp
@ 2018-11-21 13:16 ` Martin Svec
  2018-11-21 13:33 ` David Disseldorp
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Martin Svec @ 2018-11-21 13:16 UTC (permalink / raw)
  To: target-devel

Hi David,

On 20.11.2018 at 19:31, David Disseldorp wrote:
> Hi Martin,
>
> On Tue, 13 Nov 2018 16:17:36 +0100, Martin Svec wrote:
>
>> Hello,
>>
>> during Hyper-V Cluster Validation tests, we hit a bug in SPC-3 persistent reservations
>> implementation where LIO iSCSI target started to loop indefinitely while freeing
>> a PR registration.
> Any chance you could capture and share a network trace for the specific
> test that triggers the deadlock. Having something to turn into a
> libiscsi regression test would be helpful here.

Thanks for your response, I'll try to capture it tomorrow. Also, the target has two multipathed NPs
now, so I can try to utilize only one to see if it makes a difference.

Martin

* Re: SPC-3 persistent reservations race condition
  2018-11-13 15:17 SPC-3 persistent reservations race condition Martin Svec
  2018-11-20 18:31 ` David Disseldorp
  2018-11-21 13:16 ` Martin Svec
@ 2018-11-21 13:33 ` David Disseldorp
  2018-11-22 18:47 ` Martin Svec
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: David Disseldorp @ 2018-11-21 13:33 UTC (permalink / raw)
  To: target-devel

On Wed, 21 Nov 2018 14:16:05 +0100, Martin Svec wrote:

> >> during Hyper-V Cluster Validation tests, we hit a bug in SPC-3 persistent reservations
> >> implementation where LIO iSCSI target started to loop indefinitely while freeing
> >> a PR registration.  
> > Any chance you could capture and share a network trace for the specific
> > test that triggers the deadlock. Having something to turn into a
> > libiscsi regression test would be helpful here.  
> 
> Thanks for your response, I'll try to capture it tomorrow. Also, the target has two multipathed NPs
> now, so I can try to utilize only one to see if it makes a difference.

Thanks.
FWIW, given your preliminary analysis it looks as though concurrent MPIO
PR requests will be needed to trigger the race.

Cheers, David

* Re: SPC-3 persistent reservations race condition
  2018-11-13 15:17 SPC-3 persistent reservations race condition Martin Svec
                   ` (2 preceding siblings ...)
  2018-11-21 13:33 ` David Disseldorp
@ 2018-11-22 18:47 ` Martin Svec
  2018-11-25 23:10 ` David Disseldorp
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Martin Svec @ 2018-11-22 18:47 UTC (permalink / raw)
  To: target-devel

Hi David,

On 20.11.2018 at 19:31, David Disseldorp wrote:
> Hi Martin,
>
> On Tue, 13 Nov 2018 16:17:36 +0100, Martin Svec wrote:
>
>> Hello,
>>
>> during Hyper-V Cluster Validation tests, we hit a bug in SPC-3 persistent reservations
>> implementation where LIO iSCSI target started to loop indefinitely while freeing
>> a PR registration.
> Any chance you could capture and share a network trace for the specific
> test that triggers the deadlock. Having something to turn into a
> libiscsi regression test would be helpful here.

Please download the trace from https://www.maatts.eu/lio-pr.pcap. It's a merge of captures from the two
target SAN interfaces (10.22.1.208, 10.22.2.208). The deadlock occurred at 18:57:17 CET. I guess it
was caused by packets 5979 and 5980:

5979    62.915577    10.22.102.66    10.22.1.208    iSCSI    126    SCSI: Persistent Reserve Out
LUN: 0x01 SCSI: Data Out LUN: 0x01 (Persistent Reserve Out Request Data)
5980    62.915722    10.22.102.52    10.22.2.208    iSCSI    126    SCSI: Persistent Reserve Out
LUN: 0x01 SCSI: Data Out LUN: 0x01 (Persistent Reserve Out Request Data)

Martin

* Re: SPC-3 persistent reservations race condition
  2018-11-13 15:17 SPC-3 persistent reservations race condition Martin Svec
                   ` (3 preceding siblings ...)
  2018-11-22 18:47 ` Martin Svec
@ 2018-11-25 23:10 ` David Disseldorp
  2018-11-29 13:40 ` Martin Svec
  2018-11-29 13:57 ` David Disseldorp
  6 siblings, 0 replies; 8+ messages in thread
From: David Disseldorp @ 2018-11-25 23:10 UTC (permalink / raw)
  To: target-devel

Hi Martin,

On Thu, 22 Nov 2018 19:47:19 +0100, Martin Svec wrote:
...
> Please download the trace from https://www.maatts.eu/lio-pr.pcap. It's a merge of captures of two
> target SAN interfaces (10.22.1.208, 10.22.2.208). The deadlock occurred at 18:57:17 CET. I guess it
> was caused by packets 5979 and 5980:
> 
> 5979    62.915577    10.22.102.66    10.22.1.208    iSCSI    126    SCSI: Persistent Reserve Out
> LUN: 0x01 SCSI: Data Out LUN: 0x01 (Persistent Reserve Out Request Data)
> 5980    62.915722    10.22.102.52    10.22.2.208    iSCSI    126    SCSI: Persistent Reserve Out
> LUN: 0x01 SCSI: Data Out LUN: 0x01 (Persistent Reserve Out Request Data)

I've attempted to trigger this via a new libiscsi test published at:
https://github.com/ddiss/libiscsi/
branch: test_mpio_async_prout_preempt

I haven't had any luck so far against mainline kernel, but am interested
to hear whether you're able to trigger the deadlock against your target.
It can be run via:

# make
./test-tool/iscsi-test-cu -V --dataloss --test=SCSI.MultipathIO.ProutPreemptAsync \
	iscsi://$TARGET_IP/$IQN/$LUN \
	iscsi://$TARGET_IP/$IQN/$LUN

You can use different or matching portal IPs for the two iSCSI target
URIs above.

Cheers, David

* Re: SPC-3 persistent reservations race condition
  2018-11-13 15:17 SPC-3 persistent reservations race condition Martin Svec
                   ` (4 preceding siblings ...)
  2018-11-25 23:10 ` David Disseldorp
@ 2018-11-29 13:40 ` Martin Svec
  2018-11-29 13:57 ` David Disseldorp
  6 siblings, 0 replies; 8+ messages in thread
From: Martin Svec @ 2018-11-29 13:40 UTC (permalink / raw)
  To: target-devel

Hi David,

On 26.11.2018 at 0:10, David Disseldorp wrote:
> Hi Martin,
>
> On Thu, 22 Nov 2018 19:47:19 +0100, Martin Svec wrote:
> ...
>> Please download the trace from https://www.maatts.eu/lio-pr.pcap. It's a merge of captures of two
>> target SAN interfaces (10.22.1.208, 10.22.2.208). The deadlock occurred at 18:57:17 CET. I guess it
>> was caused by packets 5979 and 5980:
>>
>> 5979    62.915577    10.22.102.66    10.22.1.208    iSCSI    126    SCSI: Persistent Reserve Out
>> LUN: 0x01 SCSI: Data Out LUN: 0x01 (Persistent Reserve Out Request Data)
>> 5980    62.915722    10.22.102.52    10.22.2.208    iSCSI    126    SCSI: Persistent Reserve Out
>> LUN: 0x01 SCSI: Data Out LUN: 0x01 (Persistent Reserve Out Request Data)
> I've attempted to trigger this via a new libiscsi test published at:
> https://github.com/ddiss/libiscsi/
> branch: test_mpio_async_prout_preempt
>
> I haven't had any luck so far against mainline kernel, but am interested
> to hear whether you're able to trigger the deadlock against your target.
> It can be run via:
>
> # make
> ./test-tool/iscsi-test-cu -V --dataloss --test=SCSI.MultipathIO.ProutPreemptAsync \
> 	iscsi://$TARGET_IP/$IQN/$LUN \
> 	iscsi://$TARGET_IP/$IQN/$LUN
>
> You can use different or matching portal IPs for the two iSCSI target
> URIs above.
>
> Cheers, David

Unfortunately, I'm not able to reproduce the bug with the libiscsi test either. I ran multiple
concurrent tests in an infinite loop with no luck. However, the description of the deadlock
should be clear enough to understand the root cause.

Martin

* Re: SPC-3 persistent reservations race condition
  2018-11-13 15:17 SPC-3 persistent reservations race condition Martin Svec
                   ` (5 preceding siblings ...)
  2018-11-29 13:40 ` Martin Svec
@ 2018-11-29 13:57 ` David Disseldorp
  6 siblings, 0 replies; 8+ messages in thread
From: David Disseldorp @ 2018-11-29 13:57 UTC (permalink / raw)
  To: target-devel

On Thu, 29 Nov 2018 14:40:34 +0100, Martin Svec wrote:

> Unfortunately, I'm not able to reproduce the bug with the libiscsi test either. I ran multiple
> concurrent tests in an infinite loop with no luck.

Thanks for trying.

> However, the description of the deadlock
> should be clear enough to understand the root cause.

Indeed, it'll just take a bit longer.

Cheers, David
