linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: dm + blk-mq soft lockup complaint
       [not found]                 ` <54B52B73.7040807@sandisk.com>
@ 2015-01-13 16:20                   ` Mike Snitzer
  2015-01-14  9:16                     ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Mike Snitzer @ 2015-01-13 16:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On Tue, Jan 13 2015 at  9:28am -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/13/15 15:18, Mike Snitzer wrote:
> > On Tue, Jan 13 2015 at  7:29am -0500,
> > Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> >> However, I hit another issue while running I/O on top of a multipath
> >> device (on a kernel with lockdep and SLUB memory poisoning enabled):
> >>
> >> NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [kdmwork-253:0:3116]
> >> CPU: 7 PID: 3116 Comm: kdmwork-253:0 Tainted: G        W      3.19.0-rc4-debug+ #1
> >> Call Trace:
> >>  [<ffffffff8118e4be>] kmem_cache_alloc+0x28e/0x2c0
> >>  [<ffffffff81346aca>] alloc_iova_mem+0x1a/0x20
> >>  [<ffffffff81342c8e>] alloc_iova+0x2e/0x250
> >>  [<ffffffff81344b65>] intel_alloc_iova+0x95/0xd0
> >>  [<ffffffff81348a15>] intel_map_sg+0xc5/0x260
> >>  [<ffffffffa07e0661>] srp_queuecommand+0xa11/0xc30 [ib_srp]
> >>  [<ffffffffa001698e>] scsi_dispatch_cmd+0xde/0x5a0 [scsi_mod]
> >>  [<ffffffffa0017480>] scsi_queue_rq+0x630/0x700 [scsi_mod]
> >>  [<ffffffff8125683d>] __blk_mq_run_hw_queue+0x1dd/0x370
> >>  [<ffffffff81256aae>] blk_mq_alloc_request+0xde/0x150
> >>  [<ffffffff8124bade>] blk_get_request+0x2e/0xe0
> >>  [<ffffffffa07ebd0f>] __multipath_map.isra.15+0x1cf/0x210 [dm_multipath]
> >>  [<ffffffffa07ebd6a>] multipath_clone_and_map+0x1a/0x20 [dm_multipath]
> >>  [<ffffffffa044abb5>] map_tio_request+0x1d5/0x3a0 [dm_mod]
> >>  [<ffffffff81075d16>] kthread_worker_fn+0x86/0x1b0
> >>  [<ffffffff81075c0f>] kthread+0xef/0x110
> >>  [<ffffffff814db42c>] ret_from_fork+0x7c/0xb0
> > 
> > Unfortunate.  Is this still with a 16MB backing device or is it real
> > hardware?  Can you share the workload so that myself and/or Keith could
> > try to reproduce?
>  
> Hello Mike,
> 
> This is still with a 16MB RAM disk as backing device. The fio job I
> used to trigger this was as follows:
> 
> dev=/dev/sdc
> fio --bs=4K --ioengine=libaio --rw=randread --buffered=0 --numjobs=12   \
>     --iodepth=128 --iodepth_batch=64 --iodepth_batch_complete=64        \
>     --thread --norandommap --loops=$((2**31)) --runtime=60              \
>     --group_reporting --gtod_reduce=1 --name=$dev --filename=$dev       \
>     --invalidate=1

OK, I assume you specified the mpath device for the test that failed.

This test works fine on my 100MB scsi_debug device with 4 paths exported
over virtio-blk to a guest that assembles the mpath device.

Could be a hang that is unique to scsi-mq.

Any chance you'd be willing to provide a HOWTO for setting up your
SRP/iscsi configuration?

Are you carrying any related changes that are not upstream?  (I can hunt
down the email in this thread where you describe your kernel tree...)

I'll try to reproduce but this info could be useful to others that are
more scsi-mq inclined who might need to chase this too.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-13 16:20                   ` dm + blk-mq soft lockup complaint Mike Snitzer
@ 2015-01-14  9:16                     ` Bart Van Assche
  2015-01-14 18:59                       ` Mike Snitzer
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2015-01-14  9:16 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 2555 bytes --]

On 01/13/15 17:21, Mike Snitzer wrote:
> OK, I assume you specified the mpath device for the test that failed.

Yes, of course ...

> This test works fine on my 100MB scsi_debug device with 4 paths exported
> over virtio-blk to a guest that assembles the mpath device.
> 
> Could be a hang that is unique to scsi-mq.
> 
> Any chance you'd be willing to provide a HOWTO for setting up your
> SRP/iscsi configuration?
> 
> Are you carrying any related changes that are not upstream?  (I can hunt
> down the email in this thread where you describe your kernel tree...)
> 
> I'll try to reproduce but this info could be useful to others that are
> more scsi-mq inclined who might need to chase this too.

The four patches I had used in my tests at the initiator side and that
are not yet in v3.19-rc4 have been attached to this e-mail (I have not
yet had the time to post all of these patches for review).

This is how my I had configured the initiator system:
* If the version of the srptools package supplied by your distro is
lower than 1.0.2, build and install the latest version from the source
code available at git://git.openfabrics.org/~bvanassche/srptools.git/.git.
* Install the latest version of lsscsi
(http://sg.danny.cz/scsi/lsscsi.html). This version has SRP transport
support but is not yet in any distro AFAIK.
* Build and install a kernel >= v3.19-rc4 that includes the dm patches
at the start of this e-mail thread.
* Check whether the IB links are up (should display "State: Active"):
ibstat | grep State:
* Spread completion interrupts statically over CPU cores, e.g. via the
attached script (spread-mlx4-ib-interrupts).
* Check whether the SRP target system is visible from the SRP initiator
system - the command below should print at least one line:
ibsrpdm -c
* Enable blk-mq:
echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
* Configure the SRP kernel module parameters as follows:
echo 'options ib_srp cmd_sg_entries=255 dev_loss_tmo=60 ch_count=6' >
/etc/modprobe.d/ib_srp.conf
* Unload and reload the SRP initiator kernel module to apply these
parameters:
rmmod ib_srp; modprobe ib_srp
* Start srpd and wait until SRP login has finished:
systemctl start srpd
while ! lsscsi -t | grep -q srp:; do sleep 1; done
* Start multipathd and check the table it has built:
systemctl start multipathd
dmsetup table /dev/dm-0
* Set the I/O scheduler to noop, disable add_random and set rq_affinity
to 2 for all SRP and dm block devices.
* Run the I/O load of your preference.

Please let me know if you need any further information.

Bart.

[-- Attachment #2: 0001-e1000-Avoid-that-e1000_netpoll-triggers-a-kernel-war.patch --]
[-- Type: text/x-patch, Size: 4609 bytes --]

>From 664b7adce6c09b9c939b4983f7f32b7539497ef4 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Fri, 2 Jan 2015 14:52:07 +0100
Subject: [PATCH 1/4] e1000: Avoid that e1000_netpoll() triggers a kernel
 warning

console_cont_flush(), which is called by console_unlock(), calls
call_console_drivers() and hence also the netconsole function
write_msg() with local interrupts disabled. This means that it is
not allowed to call disable_irq() from inside a netpoll callback
function. Hence eliminate the disable_irq() / enable_irq() pair
from the e1000 netpoll function. This patch avoids that the e1000
networking driver triggers the following complaint:

BUG: sleeping function called from invalid context at kernel/irq/manage.c:104

Call Trace:
 [<ffffffff814d1ec5>] dump_stack+0x4c/0x65
 [<ffffffff8107bcc5>] ___might_sleep+0x175/0x230
 [<ffffffff8107bdba>] __might_sleep+0x3a/0xa0
 [<ffffffff810a78c8>] synchronize_irq+0x38/0xa0
 [<ffffffff810a7a20>] disable_irq+0x20/0x30
 [<ffffffffa04b4442>] e1000_netpoll+0x102/0x130 [e1000e]
 [<ffffffff813ffff2>] netpoll_poll_dev+0x72/0x350
 [<ffffffff81400489>] netpoll_send_skb_on_dev+0x1b9/0x2b0
 [<ffffffff81400842>] netpoll_send_udp+0x2c2/0x430
 [<ffffffffa058187f>] write_msg+0xcf/0x120 [netconsole]
 [<ffffffff810a4682>] call_console_drivers.constprop.25+0xc2/0x250
 [<ffffffff810a5588>] console_unlock+0x328/0x4c0
 [<ffffffff810a59f0>] vprintk_emit+0x2d0/0x570
 [<ffffffff810a5def>] vprintk_default+0x1f/0x30
 [<ffffffff814cf680>] printk+0x46/0x48

See also "[RFC PATCH net-next 00/11] net: remove disable_irq() from
->ndo_poll_controller" (http://thread.gmane.org/gmane.linux.network/342096).

See also patch "sched/wait: Add might_sleep() checks" (kernel v3.19-rc1;
commit e22b886a8a43).

Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 drivers/net/ethernet/intel/e1000/e1000.h      |  5 +++++
 drivers/net/ethernet/intel/e1000/e1000_main.c | 27 ++++++++++++++++++++++-----
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index 6970710..d85d19f 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -323,6 +323,11 @@ struct e1000_adapter {
 	struct delayed_work watchdog_task;
 	struct delayed_work fifo_stall_task;
 	struct delayed_work phy_info_task;
+
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	/* Used to serialize e1000 interrupts and the e1000 netpoll callback. */
+	spinlock_t netpoll_lock;
+#endif
 };
 
 enum e1000_state_t {
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 83140cb..e5866f1 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -1313,6 +1313,9 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
 	e1000_irq_disable(adapter);
 
 	spin_lock_init(&adapter->stats_lock);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	spin_lock_init(&adapter->netpoll_lock);
+#endif
 
 	set_bit(__E1000_DOWN, &adapter->flags);
 
@@ -3747,10 +3750,8 @@ void e1000_update_stats(struct e1000_adapter *adapter)
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr(int irq, void *data)
+static irqreturn_t __e1000_intr(int irq, struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
 	u32 icr = er32(ICR);
 
@@ -3792,6 +3793,24 @@ static irqreturn_t e1000_intr(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr(int irq, void *data)
+{
+	struct net_device *netdev = data;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	irqreturn_t ret;
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	unsigned long flags;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr(irq, adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+#else
+	ret = __e1000_intr(irq, adapter);
+#endif
+
+	return ret;
+}
+
 /**
  * e1000_clean - NAPI Rx polling callback
  * @adapter: board private structure
@@ -5216,9 +5235,7 @@ static void e1000_netpoll(struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 
-	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev);
-	enable_irq(adapter->pdev->irq);
 }
 #endif
 
-- 
2.1.2


[-- Attachment #3: 0002-e1000e-Avoid-that-e1000_netpoll-triggers-a-kernel-wa.patch --]
[-- Type: text/x-patch, Size: 5267 bytes --]

>From af2c97b3882a73f9b5a098e6aca322efb341ce6d Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Mon, 5 Jan 2015 11:40:23 +0100
Subject: [PATCH 2/4] e1000e: Avoid that e1000_netpoll() triggers a kernel
 warning

---
 drivers/net/ethernet/intel/e1000e/e1000.h  |  5 ++
 drivers/net/ethernet/intel/e1000e/netdev.c | 73 ++++++++++++++++++++++++------
 2 files changed, 65 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index 7785240..e89b80f 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -344,6 +344,11 @@ struct e1000_adapter {
 	struct ptp_clock_info ptp_clock_info;
 
 	u16 eee_advert;
+
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	/* Used to serialize e1000 interrupts and the e1000 netpoll callback. */
+	spinlock_t netpoll_lock;
+#endif
 };
 
 struct e1000_info {
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index e14fd85..c6d0ffb 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1761,11 +1761,10 @@ static void e1000e_downshift_workaround(struct work_struct *work)
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
+static irqreturn_t __e1000_intr_msi(struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
+	struct net_device *netdev = adapter->netdev;
 	u32 icr = er32(ICR);
 
 	/* read ICR disables interrupts using IAM */
@@ -1823,16 +1822,32 @@ static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
+{
+	struct e1000_adapter *adapter = netdev_priv(data);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr_msi(adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+
+	return ret;
+#else
+	return __e1000_intr_msi(adapter);
+#endif
+}
+
 /**
  * e1000_intr - Interrupt Handler
  * @irq: interrupt number
  * @data: pointer to a network interface device structure
  **/
-static irqreturn_t e1000_intr(int __always_unused irq, void *data)
+static irqreturn_t __e1000_intr(struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
+	struct net_device *netdev = adapter->netdev;
 	u32 rctl, icr = er32(ICR);
 
 	if (!icr || test_bit(__E1000_DOWN, &adapter->state))
@@ -1903,6 +1918,23 @@ static irqreturn_t e1000_intr(int __always_unused irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr(int __always_unused irq, void *data)
+{
+	struct e1000_adapter *adapter = netdev_priv(data);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr(adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+
+	return ret;
+#else
+	return __e1000_intr(adapter);
+#endif
+}
+
 static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 {
 	struct net_device *netdev = data;
@@ -4180,6 +4212,9 @@ static int e1000_sw_init(struct e1000_adapter *adapter)
 	adapter->rx_ring_count = E1000_DEFAULT_RXD;
 
 	spin_lock_init(&adapter->stats64_lock);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	spin_lock_init(&adapter->netpoll_lock);
+#endif
 
 	e1000e_set_interrupt_capability(adapter);
 
@@ -6437,10 +6472,9 @@ static void e1000_shutdown(struct pci_dev *pdev)
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
 
-static irqreturn_t e1000_intr_msix(int __always_unused irq, void *data)
+static irqreturn_t __e1000_intr_msix(struct e1000_adapter *adapter)
 {
-	struct net_device *netdev = data;
-	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct net_device *netdev = adapter->netdev;
 
 	if (adapter->msix_entries) {
 		int vector, msix_irq;
@@ -6467,6 +6501,23 @@ static irqreturn_t e1000_intr_msix(int __always_unused irq, void *data)
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t e1000_intr_msix(int __always_unused irq, void *data)
+{
+	struct e1000_adapter *adapter = netdev_priv(data);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+	int ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&adapter->netpoll_lock, flags);
+	ret = __e1000_intr_msix(adapter);
+	spin_unlock_irqrestore(&adapter->netpoll_lock, flags);
+
+	return ret;
+#else
+	return __e1000_intr_msix(adapter);
+#endif
+}
+
 /**
  * e1000_netpoll
  * @netdev: network interface device structure
@@ -6484,14 +6535,10 @@ static void e1000_netpoll(struct net_device *netdev)
 		e1000_intr_msix(adapter->pdev->irq, netdev);
 		break;
 	case E1000E_INT_MODE_MSI:
-		disable_irq(adapter->pdev->irq);
 		e1000_intr_msi(adapter->pdev->irq, netdev);
-		enable_irq(adapter->pdev->irq);
 		break;
 	default:		/* E1000E_INT_MODE_LEGACY */
-		disable_irq(adapter->pdev->irq);
 		e1000_intr(adapter->pdev->irq, netdev);
-		enable_irq(adapter->pdev->irq);
 		break;
 	}
 }
-- 
2.1.2


[-- Attachment #4: 0003-Avoid-that-sd_shutdown-triggers-a-kernel-warning.patch --]
[-- Type: text/x-patch, Size: 10717 bytes --]

>From 54e10ead0a0e98bdf39a631c65c0a585211ffa22 Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Mon, 5 Jan 2015 10:51:13 +0100
Subject: [PATCH 3/4] Avoid that sd_shutdown() triggers a kernel warning

Since kernel v3.19-rc1 module_refcount() returns 1 instead of 0
when called from inside module_exit(). This breaks the
module_refcount() test in scsi_device_put() and hence causes the
following kernel warning to be reported when unloading the ib_srp
kernel module:

WARNING: CPU: 5 PID: 228 at kernel/module.c:954 module_put+0x207/0x220()

Call Trace:
 [<ffffffff814d1fcf>] dump_stack+0x4c/0x65
 [<ffffffff81053ada>] warn_slowpath_common+0x8a/0xc0
 [<ffffffff81053bca>] warn_slowpath_null+0x1a/0x20
 [<ffffffff810d0507>] module_put+0x207/0x220
 [<ffffffffa000bea8>] scsi_device_put+0x48/0x50 [scsi_mod]
 [<ffffffffa03676d2>] scsi_disk_put+0x32/0x50 [sd_mod]
 [<ffffffffa0368d4c>] sd_shutdown+0x8c/0x150 [sd_mod]
 [<ffffffffa0368e79>] sd_remove+0x69/0xc0 [sd_mod]
 [<ffffffff813457ef>] __device_release_driver+0x7f/0xf0
 [<ffffffff81345885>] device_release_driver+0x25/0x40
 [<ffffffff81345134>] bus_remove_device+0x124/0x1b0
 [<ffffffff8134189e>] device_del+0x13e/0x250
 [<ffffffffa001cdcd>] __scsi_remove_device+0xcd/0xe0 [scsi_mod]
 [<ffffffffa001b39f>] scsi_forget_host+0x6f/0x80 [scsi_mod]
 [<ffffffffa000d5f6>] scsi_remove_host+0x86/0x140 [scsi_mod]
 [<ffffffffa07d5c0b>] srp_remove_work+0x9b/0x210 [ib_srp]
 [<ffffffff8106fd28>] process_one_work+0x1d8/0x780
 [<ffffffff810703eb>] worker_thread+0x11b/0x4a0
 [<ffffffff81075a6f>] kthread+0xef/0x110
 [<ffffffff814dad6c>] ret_from_fork+0x7c/0xb0

See also patch "module: Remove stop_machine from module unloading"
(Masami Hiramatsu; commit e513cc1c07e2; kernel v3.19-rc1).

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/scsi.c        | 63 ++++++++++++++++++++++++++++++++--------------
 drivers/scsi/sd.c          | 44 +++++++++++++++++---------------
 include/scsi/scsi_device.h |  2 ++
 3 files changed, 70 insertions(+), 39 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index e028854..2cae46b 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -973,30 +973,63 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 EXPORT_SYMBOL(scsi_report_opcode);
 
 /**
- * scsi_device_get  -  get an additional reference to a scsi_device
+ * scsi_dev_get - get an additional reference to a scsi_device
  * @sdev:	device to get a reference to
+ * @get_lld:    whether or not to increase the LLD kernel module refcount
  *
- * Description: Gets a reference to the scsi_device and increments the use count
- * of the underlying LLDD module.  You must hold host_lock of the
- * parent Scsi_Host or already have a reference when calling this.
+ * Description: Gets a reference to the scsi_device and optionally increments
+ * the use count of the associated LLDD module. You must hold host_lock of
+ * the parent Scsi_Host or already have a reference when calling this.
  */
-int scsi_device_get(struct scsi_device *sdev)
+int scsi_dev_get(struct scsi_device *sdev, bool get_lld)
 {
 	if (sdev->sdev_state == SDEV_DEL)
 		return -ENXIO;
 	if (!get_device(&sdev->sdev_gendev))
 		return -ENXIO;
-	/* We can fail this if we're doing SCSI operations
-	 * from module exit (like cache flush) */
-	try_module_get(sdev->host->hostt->module);
+	/* Can fail if invoked during module exit (like cache flush) */
+	if (get_lld && !try_module_get(sdev->host->hostt->module)) {
+		put_device(&sdev->sdev_gendev);
+		return -ENXIO;
+	}
 
 	return 0;
 }
+EXPORT_SYMBOL(scsi_dev_get);
+
+/**
+ * scsi_dev_put - release a reference to a scsi_device
+ * @sdev:	device to release a reference on
+ * @put_lld:    whether or not to decrease the LLD kernel module refcount
+ *
+ * Description: Release a reference to the scsi_device. The device is freed
+ * once the last user vanishes.
+ */
+void scsi_dev_put(struct scsi_device *sdev, bool put_lld)
+{
+	if (put_lld)
+		module_put(sdev->host->hostt->module);
+	put_device(&sdev->sdev_gendev);
+}
+EXPORT_SYMBOL(scsi_dev_put);
+
+/**
+ * scsi_device_get - get an additional reference to a scsi_device
+ * @sdev:	device to get a reference to
+ *
+ * Description: Gets a reference to the scsi_device and increments the use count
+ * of the underlying LLDD module.  You must hold host_lock of the
+ * parent Scsi_Host or already have a reference when calling this.
+ */
+int scsi_device_get(struct scsi_device *sdev)
+{
+	return scsi_dev_get(sdev, true);
+}
 EXPORT_SYMBOL(scsi_device_get);
 
 /**
- * scsi_device_put  -  release a reference to a scsi_device
- * @sdev:	device to release a reference on.
+ * scsi_device_put - release a reference to a scsi_device
+ * @sdev:	device to release a reference on
  *
  * Description: Release a reference to the scsi_device and decrements the use
  * count of the underlying LLDD module.  The device is freed once the last
@@ -1004,15 +1037,7 @@ EXPORT_SYMBOL(scsi_device_get);
  */
 void scsi_device_put(struct scsi_device *sdev)
 {
-#ifdef CONFIG_MODULE_UNLOAD
-	struct module *module = sdev->host->hostt->module;
-
-	/* The module refcount will be zero if scsi_device_get()
-	 * was called from a module removal routine */
-	if (module && module_refcount(module) != 0)
-		module_put(module);
-#endif
-	put_device(&sdev->sdev_gendev);
+	scsi_dev_put(sdev, true);
 }
 EXPORT_SYMBOL(scsi_device_put);
 
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 3995169..bd641a8 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -564,13 +564,13 @@ static int sd_major(int major_idx)
 	}
 }
 
-static struct scsi_disk *__scsi_disk_get(struct gendisk *disk)
+static struct scsi_disk *__scsi_disk_get(struct gendisk *disk, bool get_lld)
 {
 	struct scsi_disk *sdkp = NULL;
 
 	if (disk->private_data) {
 		sdkp = scsi_disk(disk);
-		if (scsi_device_get(sdkp->device) == 0)
+		if (scsi_dev_get(sdkp->device, get_lld) == 0)
 			get_device(&sdkp->dev);
 		else
 			sdkp = NULL;
@@ -578,35 +578,36 @@ static struct scsi_disk *__scsi_disk_get(struct gendisk *disk)
 	return sdkp;
 }
 
-static struct scsi_disk *scsi_disk_get(struct gendisk *disk)
+static struct scsi_disk *scsi_disk_get(struct gendisk *disk, bool get_lld)
 {
 	struct scsi_disk *sdkp;
 
 	mutex_lock(&sd_ref_mutex);
-	sdkp = __scsi_disk_get(disk);
+	sdkp = __scsi_disk_get(disk, get_lld);
 	mutex_unlock(&sd_ref_mutex);
 	return sdkp;
 }
 
-static struct scsi_disk *scsi_disk_get_from_dev(struct device *dev)
+static struct scsi_disk *scsi_disk_get_from_dev(struct device *dev,
+						bool get_lld)
 {
 	struct scsi_disk *sdkp;
 
 	mutex_lock(&sd_ref_mutex);
 	sdkp = dev_get_drvdata(dev);
 	if (sdkp)
-		sdkp = __scsi_disk_get(sdkp->disk);
+		sdkp = __scsi_disk_get(sdkp->disk, get_lld);
 	mutex_unlock(&sd_ref_mutex);
 	return sdkp;
 }
 
-static void scsi_disk_put(struct scsi_disk *sdkp)
+static void scsi_disk_put(struct scsi_disk *sdkp, bool put_lld)
 {
 	struct scsi_device *sdev = sdkp->device;
 
 	mutex_lock(&sd_ref_mutex);
 	put_device(&sdkp->dev);
-	scsi_device_put(sdev);
+	scsi_dev_put(sdev, put_lld);
 	mutex_unlock(&sd_ref_mutex);
 }
 
@@ -1184,7 +1185,7 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
  **/
 static int sd_open(struct block_device *bdev, fmode_t mode)
 {
-	struct scsi_disk *sdkp = scsi_disk_get(bdev->bd_disk);
+	struct scsi_disk *sdkp = scsi_disk_get(bdev->bd_disk, true);
 	struct scsi_device *sdev;
 	int retval;
 
@@ -1239,7 +1240,7 @@ static int sd_open(struct block_device *bdev, fmode_t mode)
 	return 0;
 
 error_out:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 	return retval;	
 }
 
@@ -1273,7 +1274,7 @@ static void sd_release(struct gendisk *disk, fmode_t mode)
 	 * XXX is followed by a "rmmod sd_mod"?
 	 */
 
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 }
 
 static int sd_getgeo(struct block_device *bdev, struct hd_geometry *geo)
@@ -1525,11 +1526,11 @@ static int sd_sync_cache(struct scsi_disk *sdkp)
 
 static void sd_rescan(struct device *dev)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, true);
 
 	if (sdkp) {
 		revalidate_disk(sdkp->disk);
-		scsi_disk_put(sdkp);
+		scsi_disk_put(sdkp, true);
 	}
 }
 
@@ -3143,11 +3144,14 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
 /*
  * Send a SYNCHRONIZE CACHE instruction down to the device through
  * the normal SCSI command structure.  Wait for the command to
- * complete.
+ * complete.  Since this function can be called during SCSI LLD kernel
+ * module unload and since try_module_get() fails after kernel module
+ * unload has started this function must not try to increase the SCSI
+ * LLD kernel module refcount.
  */
 static void sd_shutdown(struct device *dev)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, false);
 
 	if (!sdkp)
 		return;         /* this can happen */
@@ -3166,12 +3170,12 @@ static void sd_shutdown(struct device *dev)
 	}
 
 exit:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, false);
 }
 
 static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, true);
 	int ret = 0;
 
 	if (!sdkp)
@@ -3197,7 +3201,7 @@ static int sd_suspend_common(struct device *dev, bool ignore_stop_errors)
 	}
 
 done:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 	return ret;
 }
 
@@ -3213,7 +3217,7 @@ static int sd_suspend_runtime(struct device *dev)
 
 static int sd_resume(struct device *dev)
 {
-	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
+	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev, true);
 	int ret = 0;
 
 	if (!sdkp->device->manage_start_stop)
@@ -3223,7 +3227,7 @@ static int sd_resume(struct device *dev)
 	ret = sd_start_stop_device(sdkp, 1);
 
 done:
-	scsi_disk_put(sdkp);
+	scsi_disk_put(sdkp, true);
 	return ret;
 }
 
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 3a4edd1..a4cb852 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -330,6 +330,8 @@ extern void scsi_remove_device(struct scsi_device *);
 extern int scsi_unregister_device_handler(struct scsi_device_handler *scsi_dh);
 void scsi_attach_vpd(struct scsi_device *sdev);
 
+extern int scsi_dev_get(struct scsi_device *, bool get_lld);
+extern void scsi_dev_put(struct scsi_device *, bool put_lld);
 extern int scsi_device_get(struct scsi_device *);
 extern void scsi_device_put(struct scsi_device *);
 extern struct scsi_device *scsi_device_lookup(struct Scsi_Host *,
-- 
2.1.2


[-- Attachment #5: 0004-IB-srp-Process-REQ_PREEMPT-requests-correctly.patch --]
[-- Type: text/x-patch, Size: 1166 bytes --]

>From 6f593a0e9fcfd9b6c99fd24ac981450ed6eb0a0f Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bvanassche@acm.org>
Date: Thu, 8 Jan 2015 09:42:45 +0100
Subject: [PATCH 4/4] IB/srp: Process REQ_PREEMPT requests correctly

Reported-by: Max Gurtuvoy <maxg@mellanox.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0747c05..77a7a2f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -2003,8 +2003,13 @@ static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 	if (in_scsi_eh)
 		mutex_lock(&rport->mutex);
 
+	/*
+	 * The "blocked" state of SCSI devices is ignored by the SCSI core for
+	 * REQ_PREEMPT requests. Hence the explicit check below for the SCSI
+	 * device state.
+	 */
 	scmnd->result = srp_chkready(target->rport);
-	if (unlikely(scmnd->result))
+	if (unlikely(scmnd->result != 0 || scsi_device_blocked(scmnd->device)))
 		goto err;
 
 	WARN_ON_ONCE(scmnd->request->tag < 0);
-- 
2.1.2


[-- Attachment #6: spread-mlx4-ib-interrupts --]
[-- Type: text/plain, Size: 1671 bytes --]

#!/bin/awk -f

BEGIN {
  "ls -1d /sys/devices/system/node/node* 2>&1 | wc -l" | getline nodes
  if (nodes > 1) {
    for (i = 0; i < nodes; i++) {
      cpus_per_node = 0
      while (("cd /sys/devices/system/cpu && ls -d cpu*/node" i " | sed 's/^cpu//;s,/.*,,'|sort -n" | getline j) > 0) {
        #print "[" i ", " cpus_per_node "]: " j
        cpu[i, cpus_per_node++] = j
      }
    }
  } else {
      cpus_per_node = 0
      while (("cd /sys/devices/system/cpu && ls -d cpu[0-9]* | sed 's/^cpu//'|sort -n" | getline j) > 0) {
        #print "[0, " cpus_per_node "]: " j
        cpu[0, cpus_per_node++] = j
      }
  }
  for (i = 0; i < nodes; i++)
      nextcpu[i] = 0
  while (("sed -n 's/.*mlx4-ib-\\([0-9]*\\)-[0-9]*@\\(.*\\)$/\\1 \\2/p' /proc/interrupts | uniq" | getline) > 0) {
    port = $1
    bus = substr($0, length($1) + 2)
    #print "port = " port "; bus = " bus
    irqcount = 0
    while (("sed -n 's/^[[:blank:]]*\\([0-9]*\\):[0-9[:blank:]]*[^[:blank:]]*[[:blank:]]*\\(mlx4-ib-" port "-[0-9]*@" bus "\\)$/\\1 \\2/p' </proc/interrupts" | getline) > 0) {
      irq[irqcount] = $1
      irqname[irqcount] = substr($0, length($1) + 2)
      irqcount++
    }
    for (i = 0; i < nodes; i++) {
      ch_start = i * irqcount / nodes
      ch_end = (i + 1) * irqcount / nodes
      for (ch = ch_start; ch < ch_end; ch++) {
        c = cpu[i, nextcpu[i]++ % cpus_per_node]
        if (nodes > 1)
            nodetxt =  " (node " i ")"
        else
            nodetxt = ""
        print "IRQ " irq[ch] " (" irqname[ch] "): CPU " c nodetxt
        cmd="echo " c " >/proc/irq/" irq[ch] "/smp_affinity_list"
	#print cmd
	system(cmd)
      }
    }
  }
  exit 0
}

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-14  9:16                     ` Bart Van Assche
@ 2015-01-14 18:59                       ` Mike Snitzer
  2015-01-15  8:11                         ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Mike Snitzer @ 2015-01-14 18:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On Wed, Jan 14 2015 at  4:16am -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/13/15 17:21, Mike Snitzer wrote:
> > OK, I assume you specified the mpath device for the test that failed.
> 
> Yes, of course ...
> 
> > This test works fine on my 100MB scsi_debug device with 4 paths exported
> > over virtio-blk to a guest that assembles the mpath device.
> > 
> > Could be a hang that is unique to scsi-mq.
> > 
> > Any chance you'd be willing to provide a HOWTO for setting up your
> > SRP/iscsi configuration?
> > 
> > Are you carrying any related changes that are not upstream?  (I can hunt
> > down the email in this thread where you describe your kernel tree...)
> > 
> > I'll try to reproduce but this info could be useful to others that are
> > more scsi-mq inclined who might need to chase this too.
> 
> The four patches I had used in my tests at the initiator side and that
> are not yet in v3.19-rc4 have been attached to this e-mail (I have not
> yet had the time to post all of these patches for review).
> 
> This is how my I had configured the initiator system:
> * If the version of the srptools package supplied by your distro is
> lower than 1.0.2, build and install the latest version from the source
> code available at git://git.openfabrics.org/~bvanassche/srptools.git/.git.
> * Install the latest version of lsscsi
> (http://sg.danny.cz/scsi/lsscsi.html). This version has SRP transport
> support but is not yet in any distro AFAIK.
> * Build and install a kernel >= v3.19-rc4 that includes the dm patches
> at the start of this e-mail thread.
> * Check whether the IB links are up (should display "State: Active"):
> ibstat | grep State:
> * Spread completion interrupts statically over CPU cores, e.g. via the
> attached script (spread-mlx4-ib-interrupts).
> * Check whether the SRP target system is visible from the SRP initiator
> system - the command below should print at least one line:
> ibsrpdm -c
> * Enable blk-mq:
> echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
> * Configure the SRP kernel module parameters as follows:
> echo 'options ib_srp cmd_sg_entries=255 dev_loss_tmo=60 ch_count=6' >
> /etc/modprobe.d/ib_srp.conf
> * Unload and reload the SRP initiator kernel module to apply these
> parameters:
> rmmod ib_srp; modprobe ib_srp
> * Start srpd and wait until SRP login has finished:
> systemctl start srpd
> while ! lsscsi -t | grep -q srp:; do sleep 1; done
> * Start multipathd and check the table it has built:
> systemctl start multipathd
> dmsetup table /dev/dm-0
> * Set the I/O scheduler to noop, disable add_random and set rq_affinity
> to 2 for all SRP and dm block devices.
> * Run the I/O load of your preference.

Thanks for all this info.  But I don't have an IB setup readily
available to test with.  We are setting up an IB testbed in the lab and
can hopefully work through your setup in the coming weeks.

IB aside, I haven't been following along close enough on scsi-mq
developments, but does a regular iscsi initiator have support for
scsi-mq?  I'd like to validate scsi-mq devices with the dm-mpath
changes.

Mike

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-14 18:59                       ` Mike Snitzer
@ 2015-01-15  8:11                         ` Bart Van Assche
  2015-01-15 15:43                           ` Mike Snitzer
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2015-01-15  8:11 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On 01/14/15 20:00, Mike Snitzer wrote:
> IB aside, I haven't been following along close enough on scsi-mq
> developments, but does a regular iscsi initiator have support for
> scsi-mq?  I'd like to validate scsi-mq devices with the dm-mpath
> changes.

Hello Mike,

scsi-mq support for the iSCSI initiator is being considered but has not
yet been implemented. For more information, see also Sagi Grimberg,
[LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion, linux-scsi mailing
list, January 7, 2015 (http://thread.gmane.org/gmane.linux.scsi/98199).

Bart.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-15  8:11                         ` Bart Van Assche
@ 2015-01-15 15:43                           ` Mike Snitzer
  2015-01-15 15:55                             ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Mike Snitzer @ 2015-01-15 15:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On Thu, Jan 15 2015 at  3:11am -0500,
Bart Van Assche <bart.vanassche@sandisk.com> wrote:

> On 01/14/15 20:00, Mike Snitzer wrote:
> > IB aside, I haven't been following along close enough on scsi-mq
> > developments, but does a regular iscsi initiator have support for
> > scsi-mq?  I'd like to validate scsi-mq devices with the dm-mpath
> > changes.
> 
> Hello Mike,
> 
> scsi-mq support for the iSCSI initiator is being considered but has not
> yet been implemented. For more information, see also Sagi Grimberg,
> [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion, linux-scsi mailing
> list, January 7, 2015 (http://thread.gmane.org/gmane.linux.scsi/98199).

OK, thanks.

FYI, I just used your fio test against an mpath device that is using 4
virtio-scsi devices with blk-mq enabled and all worked fine.

How easily could you reproduce your SRP hang?  Did it happen every run
or did it take multiple runs to see it?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: dm + blk-mq soft lockup complaint
  2015-01-15 15:43                           ` Mike Snitzer
@ 2015-01-15 15:55                             ` Bart Van Assche
  0 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2015-01-15 15:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig,
	device-mapper development, Jun'ichi Nomura, linux-scsi

On 01/15/15 16:44, Mike Snitzer wrote:
> How easily could you reproduce your SRP hang?  Did it happen every run
> or did it take multiple runs to see it?

Hello Mike,

This complaint was easy to trigger. It took less than one minute before
the soft lockup complaint appeared. The system did not hang after that
complaint appeared. Maybe this complaint means that a cond_resched()
call was missing from a loop ?

Bart.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-01-15 15:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <54B3FFAE.4070609@kernel.dk>
     [not found] ` <alpine.LNX.2.00.1501121738000.4026@localhost.lm.intel.com>
     [not found]   ` <54B40E8A.6010005@kernel.dk>
     [not found]     ` <alpine.LNX.2.00.1501121818510.4026@localhost.lm.intel.com>
     [not found]       ` <alpine.LNX.2.00.1501121830010.4026@localhost.lm.intel.com>
     [not found]         ` <20150112191138.GC21518@redhat.com>
     [not found]           ` <20150112202113.GA23234@redhat.com>
     [not found]             ` <54B50F9D.2040000@sandisk.com>
     [not found]               ` <20150113141746.GA30411@redhat.com>
     [not found]                 ` <54B52B73.7040807@sandisk.com>
2015-01-13 16:20                   ` dm + blk-mq soft lockup complaint Mike Snitzer
2015-01-14  9:16                     ` Bart Van Assche
2015-01-14 18:59                       ` Mike Snitzer
2015-01-15  8:11                         ` Bart Van Assche
2015-01-15 15:43                           ` Mike Snitzer
2015-01-15 15:55                             ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).