From mboxrd@z Thu Jan 1 00:00:00 1970
From: Joe Eykholt
Subject: sd_ref_mutex and cpu_add_remove_lock deadlock
Date: Wed, 24 Jun 2009 21:06:44 -0700
Message-ID: <4A42F7D4.8070102@cisco.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from sj-iport-5.cisco.com ([171.68.10.87]:31652 "EHLO sj-iport-5.cisco.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750779AbZFYEGm (ORCPT ); Thu, 25 Jun 2009 00:06:42 -0400
Received: from sj-core-2.cisco.com (sj-core-2.cisco.com [171.71.177.254]) by sj-dkim-1.cisco.com (8.12.11/8.12.11) with ESMTP id n5P46jsi026444 for ; Wed, 24 Jun 2009 21:06:45 -0700
Received: from airfoil.local ([10.200.1.69]) by sj-core-2.cisco.com (8.13.8/8.14.3) with ESMTP id n5P46i8m029865 for ; Thu, 25 Jun 2009 04:06:45 GMT
Sender: linux-scsi-owner@vger.kernel.org
List-Id: linux-scsi@vger.kernel.org
To: Linux SCSI Mailing List

Has anyone seen this?

I'm getting a hang due to three threads in a deadly embrace
involving two mutexes.

A user process doing a close on /dev/sdx holds sd_ref_mutex and
is trying to get cpu_add_remove_lock.

Another process is doing a /sys write to destroy an fcoe instance.
It is in destroy_workqueue(), which holds cpu_add_remove_lock while
waiting for a work item to complete.

The third thread is running the work item and is waiting on
sd_ref_mutex.

To summarize:
    Worker thread wants sd_ref_mutex
    Close thread has sd_ref_mutex and wants cpu_add_remove_lock
    Destroy thread has cpu_add_remove_lock and waits for worker_thread to exit.

The stacks are shown below.

I'm not sure what the best solution would be or which locking rule
is being broken here.

Also, it seems to me there's a possible deadlock where sd_remove()
has sd_ref_mutex locked and is doing a put_device(). The release
function for this device is scsi_disk_release(), which also takes
sd_ref_mutex. Maybe it's known that this can't be the last
put_device().
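
For illustration only, the three-thread summary above can be written
down as a wait-for graph and checked for a cycle. This is a minimal
sketch in Python, not kernel code: the node names are just labels
taken from the summary, and find_cycle() is a hypothetical helper
introduced here.

```python
# Wait-for graph for the hang: each key waits for the value it maps to.
# Thread names are labels from the summary, not kernel symbols.
WAITS_FOR = {
    "worker thread": "sd_ref_mutex",          # worker wants sd_ref_mutex
    "sd_ref_mutex": "close thread",           # held by the close thread
    "close thread": "cpu_add_remove_lock",    # close wants cpu_add_remove_lock
    "cpu_add_remove_lock": "destroy thread",  # held by the destroy thread
    "destroy thread": "worker thread",        # destroy waits for the work item
}


def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = {}  # node -> index in path where it was first visited
    node = start
    while node is not None and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = graph.get(node)  # None when the chain ends without a cycle
    if node is None:
        return None
    return path[seen[node]:]


cycle = find_cycle(WAITS_FOR, "worker thread")
print(" -> ".join(cycle + [cycle[0]]))
```

Deleting any one entry from WAITS_FOR makes find_cycle() return None
for the same start node, which is all a fix has to achieve. Which
edge is the right one to break is the open question.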

This is based on the open-fcoe.org fcoe-next.git tree, which is
fairly up-to-date.

This first process may not be involved, but

# cat /proc/3727/stack
[] scsi_disk_get_from_dev+0x1a/0x49     wants sd_ref_mutex
[] sd_shutdown+0x12/0x117
[] sd_remove+0x51/0x8a
[] __device_release_driver+0x80/0xc9
[] device_release_driver+0x1e/0x2b
[] bus_remove_device+0xa8/0xc9
[] device_del+0x138/0x1a1
[] __scsi_remove_device+0x44/0x81
[] scsi_remove_device+0x26/0x33
[] __scsi_remove_target+0x93/0xd7
[] __remove_child+0x1e/0x25
[] device_for_each_child+0x38/0x6f
[] scsi_remove_target+0x3b/0x48
[] fc_starget_delete+0x21/0x25 [scsi_transport_fc]
[] fc_rport_final_delete+0xf6/0x188 [scsi_transport_fc]
[] worker_thread+0x1fa/0x30a
[] kthread+0x88/0x90
[] child_rip+0xa/0x20
[] 0xffffffffffffffff

# cat /proc/4230/stack
[] cpu_maps_update_begin+0x12/0x14      wants cpu_add_remove_lock
[] destroy_workqueue+0x2b/0x9e
[] scsi_host_dev_release+0x5a/0xbd
[] device_release+0x49/0x75
[] kobject_release+0x51/0x67
[] kref_put+0x43/0x4f
[] kobject_put+0x47/0x4b
[] put_device+0x12/0x14
[] fc_rport_dev_release+0x18/0x24 [scsi_transport_fc]
[] device_release+0x49/0x75
[] kobject_release+0x51/0x67
[] kref_put+0x43/0x4f
[] kobject_put+0x47/0x4b
[] put_device+0x12/0x14
[] scsi_target_dev_release+0x1d/0x21
[] device_release+0x49/0x75
[] kobject_release+0x51/0x67
[] kref_put+0x43/0x4f
[] kobject_put+0x47/0x4b
[] put_device+0x12/0x14
[] scsi_device_dev_release_usercontext+0x118/0x124
[] execute_in_process_context+0x2a/0x70
[] scsi_device_dev_release+0x17/0x19
[] device_release+0x49/0x75
[] kobject_release+0x51/0x67
[] kref_put+0x43/0x4f
[] kobject_put+0x47/0x4b
[] put_device+0x12/0x14
[] scsi_device_put+0x3d/0x42
[] scsi_disk_put+0x30/0x41              has sd_ref_mutex
[] sd_release+0x4d/0x54
[] __blkdev_put+0xa7/0x16e
[] blkdev_put+0xb/0xd
[] blkdev_close+0x37/0x3c
[] __fput+0xdf/0x186
[] fput+0x18/0x1a
[] filp_close+0x59/0x63
[] sys_close+0xa5/0xe4
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

# cat /proc/4236/stack
[] flush_cpu_workqueue+0x7b/0x87
[] cleanup_workqueue_thread+0x6a/0xb8
[] destroy_workqueue+0x63/0x9e          has cpu_add_remove_lock
[] fc_remove_host+0x148/0x171 [scsi_transport_fc]
[] fcoe_if_destroy+0x183/0x1eb [fcoe]
[] fcoe_destroy+0x35/0x76 [fcoe]
[] param_attr_store+0x25/0x35
[] module_attr_store+0x21/0x25
[] sysfs_write_file+0xe4/0x119
[] vfs_write+0xab/0x105
[] sys_write+0x47/0x6e
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff

Thanks,
Joe