public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* Re: CPU lock-ups with 4.12.0+ kernels related to usb_storage
       [not found] <2b6dc4b0-71c8-28cb-383d-c01a9e896c7a@internode.on.net>
@ 2017-07-12 16:10 ` Alan Stern
  2017-07-12 16:52   ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Alan Stern @ 2017-07-12 16:10 UTC (permalink / raw)
  To: Arthur Marsh
  Cc: Christoph Hellwig, Matthew Dharm, USB list, SCSI development list

On Thu, 13 Jul 2017, Arthur Marsh wrote:

> Thanks for the advice, I've enabled CONFIG_FRAME_POINTER and managed to 
> catch 1 more log triggered by issuing a
> 
> blkid
> 
> command after inserting a USB stick.
> 
> The problem is that I can't be certain of triggering the problem when 
> testing kernels using git-bisect, so my git-bisect efforts are not 
> guaranteed to get a result.

> Jul 12 18:51:00 localhost kernel: [ 1206.802306] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> Jul 12 18:51:00 localhost kernel: [ 1206.802308] Modules linked in: cmac arc4 ecb md4 nls_utf8 cifs ccm dns_resolver fscache bnep bluetooth hmac drbg ansi_cprng ecdh_generic nfc rfkill cpufreq_userspace cpufreq_conservative snd_hrtimer cpufreq_powersave binfmt_misc fuse snd_emu10k1_synth snd_emux_synth snd_seq_midi_emul snd_seq_virmidi snd_seq_midi_event snd_seq max6650 ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport ir_lirc_codec lirc_dev rtl2832_sdr videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media fc0012 rtl2832 i2c_mux dvb_usb_rtl28xxu dvb_usb_v2 amdkfd dvb_core rc_core edac_mce_amd radeon kvm_amd kvm irqbypass wmi_bmof snd_hda_codec_hdmi snd_emu10k1 snd_hda_intel ttm sg drm_km
 s_helper snd_hda_codec snd_util_mem snd_hda_core
> Jul 12 18:51:00 localhost kernel: [ 1206.802336]  snd_ac97_codec snd_rawmidi snd_seq_device snd_hwdep snd_pcm_oss drm snd_mixer_oss ac97_bus snd_pcm pcspkr evdev k10temp i2c_algo_bit serio_raw snd_timer emu10k1_gp snd gameport sp5100_tco soundcore button acpi_cpufreq asus_atk0110 wmi ext4 mbcache crc16 jbd2 crc32c_generic btrfs raid6_pq xor uas usb_storage sr_mod cdrom sd_mod hid_generic usbhid hid ata_generic ohci_pci ahci firewire_ohci libahci pata_atiixp firewire_core crc_itu_t i2c_piix4 ehci_pci ohci_hcd libata ehci_hcd usbcore scsi_mod r8169 mii
> Jul 12 18:51:00 localhost kernel: [ 1206.802359] CPU: 1 PID: 142 Comm: usb-storage Not tainted 4.12.0+ #2751
> Jul 12 18:51:00 localhost kernel: [ 1206.802359] Hardware name: System manufacturer System Product Name/M3A78 PRO, BIOS 1701    01/27/2011
> Jul 12 18:51:00 localhost kernel: [ 1206.802361] task: ffff927c610eef80 task.stack: ffffba9501b78000
> Jul 12 18:51:00 localhost kernel: [ 1206.802366] RIP: 0010:native_queued_spin_lock_slowpath+0x159/0x1c0
> Jul 12 18:51:00 localhost kernel: [ 1206.802366] RSP: 0018:ffffba9501b7bdc8 EFLAGS: 00000002
> Jul 12 18:51:00 localhost kernel: [ 1206.802368] RAX: 0000000000000101 RBX: ffff927c662f3030 RCX: 0000000000000001
> Jul 12 18:51:00 localhost kernel: [ 1206.802368] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffff927c662f3030
> Jul 12 18:51:00 localhost kernel: [ 1206.802369] RBP: ffffba9501b7bde0 R08: 0000000000000101 R09: 0000000000000100
> Jul 12 18:51:00 localhost kernel: [ 1206.802370] R10: 0000000000000000 R11: 7fffffffffffffff R12: 0000000000000082
> Jul 12 18:51:00 localhost kernel: [ 1206.802370] R13: 0000000000002003 R14: 0000000000001d4c R15: ffff927c610eef80
> Jul 12 18:51:00 localhost kernel: [ 1206.802371] FS:  0000000000000000(0000) GS:ffff927c6fc40000(0000) knlGS:0000000000000000
> Jul 12 18:51:00 localhost kernel: [ 1206.802372] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul 12 18:51:00 localhost kernel: [ 1206.802373] CR2: 00000030007995d8 CR3: 000000017b144000 CR4: 00000000000006e0
> Jul 12 18:51:00 localhost kernel: [ 1206.802374] Call Trace:
> Jul 12 18:51:00 localhost kernel: [ 1206.802378]  ? _raw_spin_lock_irqsave+0x59/0x68
> Jul 12 18:51:00 localhost kernel: [ 1206.802389]  scsi_eh_scmd_add+0x3f/0x150 [scsi_mod]
> Jul 12 18:51:00 localhost kernel: [ 1206.802393]  scsi_softirq_done+0xb5/0x150 [scsi_mod]
> Jul 12 18:51:00 localhost kernel: [ 1206.802396]  __blk_mq_complete_request+0xd4/0x160
> Jul 12 18:51:00 localhost kernel: [ 1206.802397]  blk_mq_complete_request+0x18/0x20
> Jul 12 18:51:00 localhost kernel: [ 1206.802402]  scsi_mq_done+0x21/0x80 [scsi_mod]
> Jul 12 18:51:00 localhost kernel: [ 1206.802405]  usb_stor_control_thread+0xf4/0x250 [usb_storage]
> Jul 12 18:51:00 localhost kernel: [ 1206.802407]  kthread+0x125/0x140
> Jul 12 18:51:00 localhost kernel: [ 1206.802409]  ? fill_inquiry_response+0x20/0x20 [usb_storage]
> Jul 12 18:51:00 localhost kernel: [ 1206.802410]  ? kthread_create_on_node+0x70/0x70
> Jul 12 18:51:00 localhost kernel: [ 1206.802412]  ret_from_fork+0x25/0x30
> Jul 12 18:51:00 localhost kernel: [ 1206.802413] Code: c2 74 a2 89 c2 89 d0 66 31 c0 39 c6 74 ea 4d 85 c9 c6 07 01 74 21 41 c7 41 08 01 00 00 00 eb 85 41 ff c9 75 04 f3 c3 f3 90 8b 07 <84> c0 75 f8 66 c7 07 01 00 c3 f3 90 4c 8b 09 4d 85 c9 74 f6 eb 

This is pretty conclusive.  The problem comes about because
usb_stor_control_thread() calls scsi_mq_done() while holding
shost->host_lock, and then scsi_eh_scmd_add() tries to acquire that
same lock.

I don't know why this didn't show up in earlier kernels.  I guess some
element of the call chain listed above must be new in 4.12.

Christoph, what's the best way to fix this?  Should usb-storage release
the host lock before issuing the ->scsi_done callback?  If so, does
that change need to be applied to any kernels before 4.12?

Alan Stern

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CPU lock-ups with 4.12.0+ kernels related to usb_storage
  2017-07-12 16:10 ` CPU lock-ups with 4.12.0+ kernels related to usb_storage Alan Stern
@ 2017-07-12 16:52   ` Christoph Hellwig
  2017-07-13 17:00     ` Alan Stern
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2017-07-12 16:52 UTC (permalink / raw)
  To: Alan Stern
  Cc: Arthur Marsh, Christoph Hellwig, Matthew Dharm, USB list,
	SCSI development list, Jens Axboe

On Wed, Jul 12, 2017 at 12:10:02PM -0400, Alan Stern wrote:
> This is pretty conclusive.  The problem comes about because
> usb_stor_control_thread() calls scsi_mq_done() while holding
> shost->host_lock, and then scsi_eh_scmd_add() tries to acquire that
> same lock.
> 
> I don't know why this didn't show up in earlier kernels.  I guess some
> element of the call chain listed above must be new in 4.12.
> 
> Christoph, what's the best way to fix this?  Should usb-storage release
> the host lock before issuing the ->scsi_done callback?  If so, does
> that change need to be applied to any kernels before 4.12?

4.12 switched to blk-mq by default, and while the old code used
a softirq for completions, which is always a difference context
the blk-mq code might execute in the same context it's called in.

So yes, for that we'd need to drop host_lock.  But I wonder how
many more of these are lingering somewhere and if we can find
another workaround.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CPU lock-ups with 4.12.0+ kernels related to usb_storage
  2017-07-12 16:52   ` Christoph Hellwig
@ 2017-07-13 17:00     ` Alan Stern
       [not found]       ` <Pine.LNX.4.44L0.1707131258190.1571-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Alan Stern @ 2017-07-13 17:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Arthur Marsh, Matthew Dharm, USB list, SCSI development list,
	Jens Axboe

On Wed, 12 Jul 2017, Christoph Hellwig wrote:

> On Wed, Jul 12, 2017 at 12:10:02PM -0400, Alan Stern wrote:
> > This is pretty conclusive.  The problem comes about because
> > usb_stor_control_thread() calls scsi_mq_done() while holding
> > shost->host_lock, and then scsi_eh_scmd_add() tries to acquire that
> > same lock.
> > 
> > I don't know why this didn't show up in earlier kernels.  I guess some
> > element of the call chain listed above must be new in 4.12.
> > 
> > Christoph, what's the best way to fix this?  Should usb-storage release
> > the host lock before issuing the ->scsi_done callback?  If so, does
> > that change need to be applied to any kernels before 4.12?
> 
> 4.12 switched to blk-mq by default, and while the old code used
> a softirq for completions, which is always a difference context
> the blk-mq code might execute in the same context it's called in.
> 
> So yes, for that we'd need to drop host_lock.  But I wonder how
> many more of these are lingering somewhere and if we can find
> another workaround.

All right.  In the meantime, changing usb-storage won't hurt.
Arthur, can you test the patch below?

Alan Stern



Index: usb-4.x/drivers/usb/storage/usb.c
===================================================================
--- usb-4.x.orig/drivers/usb/storage/usb.c
+++ usb-4.x/drivers/usb/storage/usb.c
@@ -315,6 +315,7 @@ static int usb_stor_control_thread(void
 {
 	struct us_data *us = (struct us_data *)__us;
 	struct Scsi_Host *host = us_to_host(us);
+	struct scsi_cmnd *srb;
 
 	for (;;) {
 		usb_stor_dbg(us, "*** thread sleeping\n");
@@ -330,6 +331,7 @@ static int usb_stor_control_thread(void
 		scsi_lock(host);
 
 		/* When we are called with no command pending, we're done */
+		srb = us->srb;
 		if (us->srb == NULL) {
 			scsi_unlock(host);
 			mutex_unlock(&us->dev_mutex);
@@ -398,14 +400,11 @@ static int usb_stor_control_thread(void
 		/* lock access to the state */
 		scsi_lock(host);
 
-		/* indicate that the command is done */
-		if (us->srb->result != DID_ABORT << 16) {
-			usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
-				     us->srb->result);
-			us->srb->scsi_done(us->srb);
-		} else {
+		/* was the command aborted? */
+		if (us->srb->result == DID_ABORT << 16) {
 SkipForAbort:
 			usb_stor_dbg(us, "scsi command aborted\n");
+			srb = NULL;	/* Don't call srb->scsi_done() */
 		}
 
 		/*
@@ -429,6 +428,13 @@ SkipForAbort:
 
 		/* unlock the device pointers */
 		mutex_unlock(&us->dev_mutex);
+
+		/* now that the locks are released, notify the SCSI core */
+		if (srb) {
+			usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
+					srb->result);
+			srb->scsi_done(srb);
+		}
 	} /* for (;;) */
 
 	/* Wait until we are told to stop */

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CPU lock-ups with 4.12.0+ kernels related to usb_storage
       [not found]       ` <Pine.LNX.4.44L0.1707131258190.1571-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
@ 2017-07-13 18:48         ` Arthur Marsh
       [not found]           ` <02c6c52e-d520-6d84-81e7-0b49dfae0ec9-CkBdp7X+a1oIQCUVoCVjmQ@public.gmane.org>
  2017-07-26 15:47         ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Arthur Marsh @ 2017-07-13 18:48 UTC (permalink / raw)
  To: Alan Stern, Christoph Hellwig
  Cc: Matthew Dharm, USB list, SCSI development list, Jens Axboe



Alan Stern wrote on 14/07/17 02:30:

> All right.  In the meantime, changing usb-storage won't hurt.
> Arthur, can you test the patch below?
> 
> Alan Stern
> 
> 
> 
> Index: usb-4.x/drivers/usb/storage/usb.c
> ===================================================================
> --- usb-4.x.orig/drivers/usb/storage/usb.c
> +++ usb-4.x/drivers/usb/storage/usb.c
> @@ -315,6 +315,7 @@ static int usb_stor_control_thread(void
>   {
>   	struct us_data *us = (struct us_data *)__us;
>   	struct Scsi_Host *host = us_to_host(us);
> +	struct scsi_cmnd *srb;
>   
>   	for (;;) {
>   		usb_stor_dbg(us, "*** thread sleeping\n");
> @@ -330,6 +331,7 @@ static int usb_stor_control_thread(void
>   		scsi_lock(host);
>   
>   		/* When we are called with no command pending, we're done */
> +		srb = us->srb;
>   		if (us->srb == NULL) {
>   			scsi_unlock(host);
>   			mutex_unlock(&us->dev_mutex);
> @@ -398,14 +400,11 @@ static int usb_stor_control_thread(void
>   		/* lock access to the state */
>   		scsi_lock(host);
>   
> -		/* indicate that the command is done */
> -		if (us->srb->result != DID_ABORT << 16) {
> -			usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
> -				     us->srb->result);
> -			us->srb->scsi_done(us->srb);
> -		} else {
> +		/* was the command aborted? */
> +		if (us->srb->result == DID_ABORT << 16) {
>   SkipForAbort:
>   			usb_stor_dbg(us, "scsi command aborted\n");
> +			srb = NULL;	/* Don't call srb->scsi_done() */
>   		}
>   
>   		/*
> @@ -429,6 +428,13 @@ SkipForAbort:
>   
>   		/* unlock the device pointers */
>   		mutex_unlock(&us->dev_mutex);
> +
> +		/* now that the locks are released, notify the SCSI core */
> +		if (srb) {
> +			usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
> +					srb->result);
> +			srb->scsi_done(srb);
> +		}
>   	} /* for (;;) */
>   
>   	/* Wait until we are told to stop */
> 
> 

Thanks for the patch!

I have applied it and am running the resulting kernel.

As I didn't have a reproducible way to trigger the problem on demand, 
I'll just have to see that there isn't a lock-up that looks related over 
the next several days.

Arthur.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CPU lock-ups with 4.12.0+ kernels related to usb_storage
       [not found]           ` <02c6c52e-d520-6d84-81e7-0b49dfae0ec9-CkBdp7X+a1oIQCUVoCVjmQ@public.gmane.org>
@ 2017-07-17 20:15             ` Arthur Marsh
  0 siblings, 0 replies; 7+ messages in thread
From: Arthur Marsh @ 2017-07-17 20:15 UTC (permalink / raw)
  To: Alan Stern, Christoph Hellwig
  Cc: Matthew Dharm, USB list, SCSI development list, Jens Axboe

Arthur Marsh wrote on 14/07/17 04:18:
> 
> 
> Alan Stern wrote on 14/07/17 02:30:
> 
>> All right.  In the meantime, changing usb-storage won't hurt.
>> Arthur, can you test the patch below?
>>
>> Alan Stern
>>
>>
>>
>> Index: usb-4.x/drivers/usb/storage/usb.c
>> ===================================================================
>> --- usb-4.x.orig/drivers/usb/storage/usb.c
>> +++ usb-4.x/drivers/usb/storage/usb.c
>> @@ -315,6 +315,7 @@ static int usb_stor_control_thread(void
>>   {
>>       struct us_data *us = (struct us_data *)__us;
>>       struct Scsi_Host *host = us_to_host(us);
>> +    struct scsi_cmnd *srb;
>>       for (;;) {
>>           usb_stor_dbg(us, "*** thread sleeping\n");
>> @@ -330,6 +331,7 @@ static int usb_stor_control_thread(void
>>           scsi_lock(host);
>>           /* When we are called with no command pending, we're done */
>> +        srb = us->srb;
>>           if (us->srb == NULL) {
>>               scsi_unlock(host);
>>               mutex_unlock(&us->dev_mutex);
>> @@ -398,14 +400,11 @@ static int usb_stor_control_thread(void
>>           /* lock access to the state */
>>           scsi_lock(host);
>> -        /* indicate that the command is done */
>> -        if (us->srb->result != DID_ABORT << 16) {
>> -            usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
>> -                     us->srb->result);
>> -            us->srb->scsi_done(us->srb);
>> -        } else {
>> +        /* was the command aborted? */
>> +        if (us->srb->result == DID_ABORT << 16) {
>>   SkipForAbort:
>>               usb_stor_dbg(us, "scsi command aborted\n");
>> +            srb = NULL;    /* Don't call srb->scsi_done() */
>>           }
>>           /*
>> @@ -429,6 +428,13 @@ SkipForAbort:
>>           /* unlock the device pointers */
>>           mutex_unlock(&us->dev_mutex);
>> +
>> +        /* now that the locks are released, notify the SCSI core */
>> +        if (srb) {
>> +            usb_stor_dbg(us, "scsi cmd done, result=0x%x\n",
>> +                    srb->result);
>> +            srb->scsi_done(srb);
>> +        }
>>       } /* for (;;) */
>>       /* Wait until we are told to stop */

Hi, just to confirm no further lock-ups occurred in the last 4 days with 
this patch applied.

Arthur.
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CPU lock-ups with 4.12.0+ kernels related to usb_storage
       [not found]       ` <Pine.LNX.4.44L0.1707131258190.1571-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
  2017-07-13 18:48         ` Arthur Marsh
@ 2017-07-26 15:47         ` Christoph Hellwig
  2017-07-26 15:50           ` Alan Stern
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2017-07-26 15:47 UTC (permalink / raw)
  To: Alan Stern
  Cc: Christoph Hellwig, Arthur Marsh, Matthew Dharm, USB list,
	SCSI development list, Jens Axboe

On Thu, Jul 13, 2017 at 01:00:29PM -0400, Alan Stern wrote:
> All right.  In the meantime, changing usb-storage won't hurt.
> Arthur, can you test the patch below?

Do you plan to send this fix to Greg for 4.13-rc?
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CPU lock-ups with 4.12.0+ kernels related to usb_storage
  2017-07-26 15:47         ` Christoph Hellwig
@ 2017-07-26 15:50           ` Alan Stern
  0 siblings, 0 replies; 7+ messages in thread
From: Alan Stern @ 2017-07-26 15:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Arthur Marsh, Matthew Dharm, USB list, SCSI development list,
	Jens Axboe

On Wed, 26 Jul 2017, Christoph Hellwig wrote:

> On Thu, Jul 13, 2017 at 01:00:29PM -0400, Alan Stern wrote:
> > All right.  In the meantime, changing usb-storage won't hurt.
> > Arthur, can you test the patch below?
> 
> Do you plan to send this fix to Greg for 4.13-rc?

I just submitted it.

Alan Stern

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-07-26 15:50 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <2b6dc4b0-71c8-28cb-383d-c01a9e896c7a@internode.on.net>
2017-07-12 16:10 ` CPU lock-ups with 4.12.0+ kernels related to usb_storage Alan Stern
2017-07-12 16:52   ` Christoph Hellwig
2017-07-13 17:00     ` Alan Stern
     [not found]       ` <Pine.LNX.4.44L0.1707131258190.1571-100000-IYeN2dnnYyZXsRXLowluHWD2FQJk+8+b@public.gmane.org>
2017-07-13 18:48         ` Arthur Marsh
     [not found]           ` <02c6c52e-d520-6d84-81e7-0b49dfae0ec9-CkBdp7X+a1oIQCUVoCVjmQ@public.gmane.org>
2017-07-17 20:15             ` Arthur Marsh
2017-07-26 15:47         ` Christoph Hellwig
2017-07-26 15:50           ` Alan Stern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox