From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754167AbbJNOak (ORCPT ); Wed, 14 Oct 2015 10:30:40 -0400 Received: from bedivere.hansenpartnership.com ([66.63.167.143]:47850 "EHLO bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753617AbbJNOai (ORCPT ); Wed, 14 Oct 2015 10:30:38 -0400 Message-ID: <1444833036.2220.38.camel@HansenPartnership.com> Subject: Re: [PATCH 0/3] SCSI: Fix hard lockup in scsi_remove_target() From: James Bottomley To: Johannes Thumshirn Cc: Christoph Hellwig , Hannes Reinecke , linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org Date: Wed, 14 Oct 2015 07:30:36 -0700 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.12.11 Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2015-10-14 at 15:50 +0200, Johannes Thumshirn wrote: > Removing a SCSI target via scsi_remove_target() suspected to be racy. When a > sibling get's removed from the list it can occassionly happen that one CPU is > stuck endlessly looping around this code block > > list_for_each_entry(starget, &shost->__targets, siblings) { > if (starget->state == STARGET_DEL) > continue; How long is the __targets list? It seems a bit unlikely that this is the exact cause, because for a short list all in STARGET_DEL that loop should exit very quickly. Where in the code does scsi_remove_target +0x68/0x240 actually point to? Is it not a bit more likely that we're following a removed list element? Since that points back to itself, the list_for_each_entry() would then circulate forever. If that's the case the simple fix would be to use the safe version of the list traversal macro. James > Resulting in the following hard lockup. > > Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0 > [...] > Call Trace: > [] dump_trace+0x7d/0x2d0 > [] show_stack_log_lvl+0x94/0x170 > [] show_stack+0x21/0x50 > [] dump_stack+0x41/0x51 > [] panic+0xc8/0x1d7 > [] watchdog_overflow_callback+0xba/0xc0 > [] __perf_event_overflow+0x88/0x240 > [] intel_pmu_handle_irq+0x1fa/0x3e0 > [] perf_event_nmi_handler+0x26/0x40 > [] nmi_handle.isra.2+0x8d/0x180 > [] do_nmi+0x126/0x3c0 > [] end_repeat_nmi+0x1a/0x1e > [] scsi_remove_target+0x68/0x240 [scsi_mod] > [] process_one_work+0x172/0x420 > [] worker_thread+0x11a/0x3c0 > [] kthread+0xb4/0xc0 > [] ret_from_fork+0x58/0x90 > > This series attacks the issue by 1) decoupling the __targets and __devices > lists of struct Scsi_Host from the host_lock spinlock by introducing spinlocks > for each list and 2) de-coupling the list traversals needed for detecting > targets/devices to be removed from the actual removal by moving list entries to > be deleted to a second list and perform the deletion there. > > > The whole series survived a nearly 48h test loop of: > while [ $not_done ]; do > umount $mountpoint; > rmmod $module; > modprobe $module; > mount $mountpoint; > done > > This is a follow up of the patch proposed here: > http://marc.info/?l=linux-scsi&m=144377409311774&w=2 > incorporating Christoph's comment > > Johannes Thumshirn (3): > SCSI: Introduce device_lock and target_lock in Scsi_Host > SCSI: Rework list handling in scsi_target_remove > SCSI: Rework list handling in __scsi_target_remove > > drivers/scsi/53c700.c | 3 +++ > drivers/scsi/hosts.c | 2 ++ > drivers/scsi/scsi.c | 8 ++++---- > drivers/scsi/scsi_scan.c | 10 +++++----- > drivers/scsi/scsi_sysfs.c | 43 +++++++++++++++++++++---------------------- > include/scsi/scsi_host.h | 2 ++ > 6 files changed, 37 insertions(+), 31 deletions(-) >