From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: Lee Duncan <lduncan@suse.com>
Cc: <cleech@redhat.com>, <martin.petersen@oracle.com>,
<open-iscsi@googlegroups.com>, <linux-scsi@vger.kernel.org>,
<kernel@collabora.com>, Khazhismel Kumykov <khazhy@google.com>
Subject: Re: [PATCH] iscsi: Fix deadlock on recovery path during GFP_IO reclaim
Date: Mon, 18 May 2020 14:53:09 -0400
Message-ID: <85eerhf52i.fsf@collabora.com>
In-Reply-To: <eb7a3a10-bb04-f601-a01b-c50f482474fa@suse.com> (Lee Duncan's message of "Sun, 17 May 2020 10:30:40 -0700")
Lee Duncan <lduncan@suse.com> writes:
> On 5/7/20 10:59 PM, Gabriel Krisman Bertazi wrote:
>> iscsi suffers from a deadlock when a management command submitted via
>> the netlink socket sleeps on an allocation while holding
>> rx_queue_mutex, and that allocation triggers a memory reclaim that
>> writes back to a failed iscsi device. The recovery procedure can then
>> never make progress to recover the failed disk or abort outstanding IO
>> operations so that the reclaim can complete (since rx_queue_mutex is
>> held), thus locking up the system.
>>
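For anyone less familiar with the failure mode, here is a reduced sketch of
the cycle described above; the names are made up and this is not the actual
driver code:

#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(example_queue_mutex);

/* Netlink-side management command, analogous to what runs under
 * rx_queue_mutex. */
static void example_netlink_command(void)
{
	void *buf;

	mutex_lock(&example_queue_mutex);
	/*
	 * GFP_KERNEL may enter direct reclaim, and reclaim may wait on
	 * writeback to the already-failed iscsi device...
	 */
	buf = kmalloc(4096, GFP_KERNEL);
	kfree(buf);
	mutex_unlock(&example_queue_mutex);
}

/* Recovery path, analogous to iscsi_conn_failure(). */
static void example_recovery(void)
{
	/*
	 * ...but failing the connection, so that the stuck IO errors out
	 * and reclaim can finish, needs the same mutex: deadlock.
	 */
	mutex_lock(&example_queue_mutex);
	/* stop the broken connection, abort outstanding IO */
	mutex_unlock(&example_queue_mutex);
}
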
>> Nevertheless, just marking all allocations under rx_queue_mutex as
>> GFP_NOIO (or marking the userspace process with something like
>> PF_MEMALLOC_NOIO) is not enough, since the iscsi command code relies on
>> other subsystems that try to grab mutexes held by threads doing GFP_IO
>> allocations, leading to the same deadlock. One instance where this
>> situation can be observed is in the backtraces below, stitched together
>> from multiple bug reports, involving the kobject uevent sent when a
>> session is created.
>>
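The PF_MEMALLOC_NOIO alternative mentioned above would look roughly like the
sketch below (the handler and helper names are made up). It only constrains
allocations made by the current task, so it does not help when that task
merely blocks on a mutex whose holder is allocating with GFP_KERNEL:

#include <linux/netlink.h>
#include <linux/sched/mm.h>
#include <linux/skbuff.h>

static int example_handle_command(struct sk_buff *skb, struct nlmsghdr *nlh);

/* Hypothetical handler, not the actual iscsi_if_rx() code. */
static int example_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
{
	unsigned int noio_flags;
	int err;

	/* Allocations by this task now implicitly drop __GFP_IO/__GFP_FS. */
	noio_flags = memalloc_noio_save();
	err = example_handle_command(skb, nlh);
	memalloc_noio_restore(noio_flags);

	return err;
}
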
>> The root of the problem is not the fact that iscsi does GFP_IO
>> allocations; that is acceptable. The actual problem is that
>> rx_queue_mutex has a very large granularity, covering every unrelated
>> netlink command execution at the same time as the error recovery path.
>>
>> The proposed fix leverages the recently added mechanism to stop failed
>> connections from the kernel, by enabling it to execute even though a
>> management command from the netlink socket is being run (rx_queue_mutex
>> is held), provided that the command is known to be safe. It splits
>> rx_queue_mutex into two mutexes: one protecting against concurrent
>> command execution from the netlink socket, and one protecting stop_conn
>> from racing with other connection management operations that might
>> conflict with it.
>>
>> It is not very pretty, but it is the simplest way to resolve the
>> deadlock. I considered making it a lock per connection, but some
>> external mutex would still be needed to deal with iscsi_if_destroy_conn.
>>
>> The patch was tested by forcing a memory shrinker (an unrelated one;
>> dm-bufio/dm-verity was used) to reclaim ISCSI pages every time
>> ISCSI_UEVENT_CREATE_SESSION happens, which reasonably simulates the
>> reclaims that might happen with GFP_KERNEL on that path. Then, a
>> faulty, hung target causes a connection to fail during intensive IO at
>> the same time as a new session is added by iscsid.
>>
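For context on the kind of test harness this implies (illustrative only, not
the setup that was actually used; all names below are made up), a shrinker
that always reports reclaimable objects is enough to force its scan path to
run whenever reclaim is entered:

#include <linux/shrinker.h>

static unsigned long example_count_objects(struct shrinker *shrink,
					    struct shrink_control *sc)
{
	/* Always claim there is work, so scan_objects gets called. */
	return 1024;
}

static unsigned long example_scan_objects(struct shrinker *shrink,
					  struct shrink_control *sc)
{
	/*
	 * A test hook here can stall on IO to the failed device,
	 * reproducing the reclaim-vs-recovery dependency.
	 */
	return SHRINK_STOP;
}

static struct shrinker example_shrinker = {
	.count_objects	= example_count_objects,
	.scan_objects	= example_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};

/* Registered with register_shrinker(&example_shrinker) from module init. */
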
>> The following stack traces are stitched together from several bug
>> reports, showing a case where the deadlock can happen.
>>
>> iSCSI-write
>> holding: rx_queue_mutex
>> waiting: uevent_sock_mutex
>>
>> kobject_uevent_env+0x1bd/0x419
>> kobject_uevent+0xb/0xd
>> device_add+0x48a/0x678
>> scsi_add_host_with_dma+0xc5/0x22d
>> iscsi_host_add+0x53/0x55
>> iscsi_sw_tcp_session_create+0xa6/0x129
>> iscsi_if_rx+0x100/0x1247
>> netlink_unicast+0x213/0x4f0
>> netlink_sendmsg+0x230/0x3c0
>>
>> iscsi_fail iscsi_conn_failure
>> waiting: rx_queue_mutex
>>
>> schedule_preempt_disabled+0x325/0x734
>> __mutex_lock_slowpath+0x18b/0x230
>> mutex_lock+0x22/0x40
>> iscsi_conn_failure+0x42/0x149
>> worker_thread+0x24a/0xbc0
>>
>> EventManager_
>> holding: uevent_sock_mutex
>> waiting: dm_bufio_client->lock
>>
>> dm_bufio_lock+0xe/0x10
>> shrink+0x34/0xf7
>> shrink_slab+0x177/0x5d0
>> do_try_to_free_pages+0x129/0x470
>> try_to_free_mem_cgroup_pages+0x14f/0x210
>> memcg_kmem_newpage_charge+0xa6d/0x13b0
>> __alloc_pages_nodemask+0x4a3/0x1a70
>> fallback_alloc+0x1b2/0x36c
>> __kmalloc_node_track_caller+0xb9/0x10d0
>> __alloc_skb+0x83/0x2f0
>> kobject_uevent_env+0x26b/0x419
>> dm_kobject_uevent+0x70/0x79
>> dev_suspend+0x1a9/0x1e7
>> ctl_ioctl+0x3e9/0x411
>> dm_ctl_ioctl+0x13/0x17
>> do_vfs_ioctl+0xb3/0x460
>> SyS_ioctl+0x5e/0x90
>>
>> MemcgReclaimerD
>> holding: dm_bufio_client->lock
>> waiting: stuck io to finish (needs iscsi_fail thread to progress)
>>
>> schedule at ffffffffbd603618
>> io_schedule at ffffffffbd603ba4
>> do_io_schedule at ffffffffbdaf0d94
>> __wait_on_bit at ffffffffbd6008a6
>> out_of_line_wait_on_bit at ffffffffbd600960
>> wait_on_bit.constprop.10 at ffffffffbdaf0f17
>> __make_buffer_clean at ffffffffbdaf18ba
>> __cleanup_old_buffer at ffffffffbdaf192f
>> shrink at ffffffffbdaf19fd
>> do_shrink_slab at ffffffffbd6ec000
>> shrink_slab at ffffffffbd6ec24a
>> do_try_to_free_pages at ffffffffbd6eda09
>> try_to_free_mem_cgroup_pages at ffffffffbd6ede7e
>> mem_cgroup_resize_limit at ffffffffbd7024c0
>> mem_cgroup_write at ffffffffbd703149
>> cgroup_file_write at ffffffffbd6d9c6e
>> sys_write at ffffffffbd6662ea
>> system_call_fastpath at ffffffffbdbc34a2
>>
>> Reported-by: Khazhismel Kumykov <khazhy@google.com>
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
>> ---
>> drivers/scsi/scsi_transport_iscsi.c | 67 +++++++++++++++++++++--------
>> 1 file changed, 49 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
>> index 17a45716a0fe..d99c17306dff 100644
>> --- a/drivers/scsi/scsi_transport_iscsi.c
>> +++ b/drivers/scsi/scsi_transport_iscsi.c
>> @@ -1616,6 +1616,12 @@ static DECLARE_TRANSPORT_CLASS(iscsi_connection_class,
>> static struct sock *nls;
>> static DEFINE_MUTEX(rx_queue_mutex);
>>
>> +/*
>> + * conn_mutex protects the {start,bind,stop,destroy}_conn from racing
>> + * against the kernel stop_connection recovery mechanism
>> + */
>> +static DEFINE_MUTEX(conn_mutex);
>> +
>> static LIST_HEAD(sesslist);
>> static LIST_HEAD(sessdestroylist);
>> static DEFINE_SPINLOCK(sesslock);
>> @@ -2442,6 +2448,32 @@ int iscsi_offload_mesg(struct Scsi_Host *shost,
>> }
>> EXPORT_SYMBOL_GPL(iscsi_offload_mesg);
>>
>> +/*
>> + * This can be called without the rx_queue_mutex, if invoked by the kernel
>> + * stop work. But, in that case, it is guaranteed not to race with
>> + * iscsi_destroy by conn_mutex.
>> + */
>> +static void iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
>> +{
>> + /*
>> + * It is important that this path doesn't rely on
>> + * rx_queue_mutex, otherwise, a thread doing allocation on a
>> + * start_session/start_connection could sleep waiting on a
>> + * writeback to a failed iscsi device, that cannot be recovered
>> + * because the lock is held. If we don't hold it here, the
>> + * kernel stop_conn_work_fn has a chance to stop the broken
>> + * session and resolve the allocation.
>> + *
>> + * Still, the user invoked .stop_conn() needs to be serialized
>> + * with stop_conn_work_fn by a private mutex. Not pretty, but
>> + * it works.
>> + */
>> + mutex_lock(&conn_mutex);
>> + conn->transport->stop_conn(conn, flag);
>> + mutex_unlock(&conn_mutex);
>> +
>> +}
>> +
>> static void stop_conn_work_fn(struct work_struct *work)
>> {
>> struct iscsi_cls_conn *conn, *tmp;
>> @@ -2460,30 +2492,17 @@ static void stop_conn_work_fn(struct work_struct *work)
>> uint32_t sid = iscsi_conn_get_sid(conn);
>> struct iscsi_cls_session *session;
>>
>> - mutex_lock(&rx_queue_mutex);
>> -
>> session = iscsi_session_lookup(sid);
>> if (session) {
>> if (system_state != SYSTEM_RUNNING) {
>> session->recovery_tmo = 0;
>> - conn->transport->stop_conn(conn,
>> - STOP_CONN_TERM);
>> + iscsi_if_stop_conn(conn, STOP_CONN_TERM);
>> } else {
>> - conn->transport->stop_conn(conn,
>> - STOP_CONN_RECOVER);
>> + iscsi_if_stop_conn(conn, STOP_CONN_RECOVER);
>> }
>> }
>>
>> list_del_init(&conn->conn_list_err);
>> -
>> - mutex_unlock(&rx_queue_mutex);
>> -
>> - /* we don't want to hold rx_queue_mutex for too long,
>> - * for instance if many conns failed at the same time,
>> - * since this stall other iscsi maintenance operations.
>> - * Give other users a chance to proceed.
>> - */
>> - cond_resched();
>> }
>> }
>
> I'm curious about why you removed the cond_resched() here. Is it because
> it is no longer needed, with shorter (mutex) waiting time?
>
Hi Lee,

My thought was that the main reason for cond_resched() here was to give
other iscsi maintenance operations a chance to run. But since conn_mutex,
unlike rx_queue_mutex, won't block most of those operations, it felt
unnecessary. If you disagree, I can submit a v2 that doesn't change it.
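If a v2 keeps it, the loop body would look roughly like this (simplified
sketch, not the exact patch code; "failed_conns" is a placeholder for the
local list):

	list_for_each_entry_safe(conn, tmp, &failed_conns, conn_list_err) {
		iscsi_if_stop_conn(conn, STOP_CONN_RECOVER); /* takes only conn_mutex */
		list_del_init(&conn->conn_list_err);
		cond_resched(); /* could simply stay here */
	}
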
>>
>> @@ -2843,8 +2862,11 @@ iscsi_if_destroy_conn(struct iscsi_transport *transport, struct iscsi_uevent *ev
>> spin_unlock_irqrestore(&connlock, flags);
>>
>> ISCSI_DBG_TRANS_CONN(conn, "Destroying transport conn\n");
>> +
>> + mutex_lock(&conn_mutex);
>> if (transport->destroy_conn)
>> transport->destroy_conn(conn);
>> + mutex_unlock(&conn_mutex);
>>
>> return 0;
>> }
>> @@ -3686,9 +3708,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>> break;
>> }
>>
>> + mutex_lock(&conn_mutex);
>> ev->r.retcode = transport->bind_conn(session, conn,
>> ev->u.b_conn.transport_eph,
>> ev->u.b_conn.is_leading);
>> + mutex_unlock(&conn_mutex);
>> +
>> if (ev->r.retcode || !transport->ep_connect)
>> break;
>>
>> @@ -3709,25 +3734,31 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
>> break;
>> case ISCSI_UEVENT_START_CONN:
>> conn = iscsi_conn_lookup(ev->u.start_conn.sid, ev->u.start_conn.cid);
>> - if (conn)
>> + if (conn) {
>> + mutex_lock(&conn_mutex);
>> ev->r.retcode = transport->start_conn(conn);
>> + mutex_unlock(&conn_mutex);
>> + }
>> else
>> err = -EINVAL;
>> break;
>> case ISCSI_UEVENT_STOP_CONN:
>> conn = iscsi_conn_lookup(ev->u.stop_conn.sid, ev->u.stop_conn.cid);
>> if (conn)
>> - transport->stop_conn(conn, ev->u.stop_conn.flag);
>> + iscsi_if_stop_conn(conn, ev->u.stop_conn.flag);
>> else
>> err = -EINVAL;
>> break;
>> case ISCSI_UEVENT_SEND_PDU:
>> conn = iscsi_conn_lookup(ev->u.send_pdu.sid, ev->u.send_pdu.cid);
>> - if (conn)
>> + if (conn) {
>> + mutex_lock(&conn_mutex);
>> ev->r.retcode = transport->send_pdu(conn,
>> (struct iscsi_hdr*)((char*)ev + sizeof(*ev)),
>> (char*)ev + sizeof(*ev) + ev->u.send_pdu.hdr_size,
>> ev->u.send_pdu.data_size);
>> + mutex_unlock(&conn_mutex);
>> + }
>> else
>> err = -EINVAL;
>> break;
>>
>
> My question above is for my own information, so I'll still say:
>
> Reviewed-by: Lee Duncan <lduncan@suse.com>
--
Gabriel Krisman Bertazi