stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>,
	Jens Axboe <axboe@kernel.dk>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.5 34/65] io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
Date: Thu, 19 Mar 2020 14:04:16 +0100	[thread overview]
Message-ID: <20200319123937.204774554@linuxfoundation.org> (raw)
In-Reply-To: <20200319123926.466988514@linuxfoundation.org>

From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

[ Upstream commit bdcd3eab2a9ae0ac93f27275b6895dd95e5bf360 ]

After making ext4 support iopoll method:
  let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
we found fio can easily hang in fio_ioring_getevents() with below fio
job:
    rm -f testfile; sync;
    sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
-rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
-bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.

There are two issues that results in this hang, one reason is that
when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio
does not use io_uring_enter to get completed events, it relies on
kernel io_sq_thread to poll for completed events.

Another reason is that there is a race: when io_submit_sqes() in
io_sq_thread() submits a batch of sqes, variable 'inflight' will
record the number of submitted reqs, then io_sq_thread will poll for
reqs which have been added to poll_list. But note, if some previous
reqs have been punted to io worker, these reqs will won't be in
poll_list timely. io_sq_thread() will only poll for a part of previous
submitted reqs, and then find poll_list is empty, reset variable
'inflight' to be zero. If app just waits these deferred reqs and does
not wake up io_sq_thread again, then hang happens.

For app that entirely relies on io_sq_thread to poll completed requests,
let io_iopoll_req_issued() wake up io_sq_thread properly when adding new
element to poll_list, and when io_sq_thread prepares to sleep, check
whether poll_list is empty again, if not empty, continue to poll.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/io_uring.c | 59 +++++++++++++++++++++++----------------------------
 1 file changed, 27 insertions(+), 32 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 60a4832089982..c8f8cc2463986 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1435,6 +1435,10 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
 		list_add(&req->list, &ctx->poll_list);
 	else
 		list_add_tail(&req->list, &ctx->poll_list);
+
+	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
+	    wq_has_sleeper(&ctx->sqo_wait))
+		wake_up(&ctx->sqo_wait);
 }
 
 static void io_file_put(struct io_submit_state *state)
@@ -3857,9 +3861,8 @@ static int io_sq_thread(void *data)
 	const struct cred *old_cred;
 	mm_segment_t old_fs;
 	DEFINE_WAIT(wait);
-	unsigned inflight;
 	unsigned long timeout;
-	int ret;
+	int ret = 0;
 
 	complete(&ctx->completions[1]);
 
@@ -3867,39 +3870,19 @@ static int io_sq_thread(void *data)
 	set_fs(USER_DS);
 	old_cred = override_creds(ctx->creds);
 
-	ret = timeout = inflight = 0;
+	timeout = jiffies + ctx->sq_thread_idle;
 	while (!kthread_should_park()) {
 		unsigned int to_submit;
 
-		if (inflight) {
+		if (!list_empty(&ctx->poll_list)) {
 			unsigned nr_events = 0;
 
-			if (ctx->flags & IORING_SETUP_IOPOLL) {
-				/*
-				 * inflight is the count of the maximum possible
-				 * entries we submitted, but it can be smaller
-				 * if we dropped some of them. If we don't have
-				 * poll entries available, then we know that we
-				 * have nothing left to poll for. Reset the
-				 * inflight count to zero in that case.
-				 */
-				mutex_lock(&ctx->uring_lock);
-				if (!list_empty(&ctx->poll_list))
-					io_iopoll_getevents(ctx, &nr_events, 0);
-				else
-					inflight = 0;
-				mutex_unlock(&ctx->uring_lock);
-			} else {
-				/*
-				 * Normal IO, just pretend everything completed.
-				 * We don't have to poll completions for that.
-				 */
-				nr_events = inflight;
-			}
-
-			inflight -= nr_events;
-			if (!inflight)
+			mutex_lock(&ctx->uring_lock);
+			if (!list_empty(&ctx->poll_list))
+				io_iopoll_getevents(ctx, &nr_events, 0);
+			else
 				timeout = jiffies + ctx->sq_thread_idle;
+			mutex_unlock(&ctx->uring_lock);
 		}
 
 		to_submit = io_sqring_entries(ctx);
@@ -3928,7 +3911,7 @@ static int io_sq_thread(void *data)
 			 * more IO, we should wait for the application to
 			 * reap events and wake us up.
 			 */
-			if (inflight ||
+			if (!list_empty(&ctx->poll_list) ||
 			    (!time_after(jiffies, timeout) && ret != -EBUSY &&
 			    !percpu_ref_is_dying(&ctx->refs))) {
 				cond_resched();
@@ -3938,6 +3921,19 @@ static int io_sq_thread(void *data)
 			prepare_to_wait(&ctx->sqo_wait, &wait,
 						TASK_INTERRUPTIBLE);
 
+			/*
+			 * While doing polled IO, before going to sleep, we need
+			 * to check if there are new reqs added to poll_list, it
+			 * is because reqs may have been punted to io worker and
+			 * will be added to poll_list later, hence check the
+			 * poll_list again.
+			 */
+			if ((ctx->flags & IORING_SETUP_IOPOLL) &&
+			    !list_empty_careful(&ctx->poll_list)) {
+				finish_wait(&ctx->sqo_wait, &wait);
+				continue;
+			}
+
 			/* Tell userspace we may need a wakeup call */
 			ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP;
 			/* make sure to read SQ tail after writing flags */
@@ -3966,8 +3962,7 @@ static int io_sq_thread(void *data)
 		mutex_lock(&ctx->uring_lock);
 		ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true);
 		mutex_unlock(&ctx->uring_lock);
-		if (ret > 0)
-			inflight += ret;
+		timeout = jiffies + ctx->sq_thread_idle;
 	}
 
 	set_fs(old_fs);
-- 
2.20.1




  parent reply	other threads:[~2020-03-19 13:26 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-19 13:03 [PATCH 5.5 00/65] 5.5.11-rc1 review Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 01/65] gpiolib: Add support for the irqdomain which doesnt use irq_fwspec as arg Greg Kroah-Hartman
2020-03-19 13:33   ` Kevin Hao
2020-03-19 13:47     ` Greg Kroah-Hartman
2020-03-19 14:47       ` Kevin Hao
2020-03-19 14:58         ` Greg Kroah-Hartman
2020-03-19 22:27         ` Linus Walleij
2020-03-19 13:03 ` [PATCH 5.5 02/65] pinctrl: qcom: ssbi-gpio: Fix fwspec parsing bug Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 03/65] mmc: core: Default to generic_cmd6_time as timeout in __mmc_switch() Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 04/65] mmc: core: Allow host controllers to require R1B for CMD6 Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 05/65] mmc: sdhci-tegra: Fix busy detection by enabling MMC_CAP_NEED_RSP_BUSY Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 06/65] mmc: sdhci-omap: " Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 07/65] mmc: core: Respect MMC_CAP_NEED_RSP_BUSY for eMMC sleep command Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 08/65] mmc: core: Respect MMC_CAP_NEED_RSP_BUSY for erase/trim/discard Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 09/65] ACPI: watchdog: Allow disabling WDAT at boot Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 10/65] HID: apple: Add support for recent firmware on Magic Keyboards Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 11/65] ACPI: watchdog: Set default timeout in probe Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 12/65] HID: i2c-hid: add Trekstor Surfbook E11B to descriptor override Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 13/65] mips: vdso: fix jalr t9 crash in vdso code Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 14/65] MIPS: Disable VDSO time functionality on microMIPS Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 15/65] mips: vdso: add build time check that no jalr t9 calls left Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 16/65] HID: hid-bigbenff: fix general protection fault caused by double kfree Greg Kroah-Hartman
2020-03-19 13:03 ` [PATCH 5.5 17/65] HID: hid-bigbenff: call hid_hw_stop() in case of error Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 18/65] HID: hid-bigbenff: fix race condition for scheduled work during removal Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 19/65] riscv: set pmp configuration if kernel is running in M-mode Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 20/65] MIPS: vdso: Wrap -mexplicit-relocs in cc-option Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 21/65] kunit: run kunit_tool from any directory Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 22/65] selftests/rseq: Fix out-of-tree compilation Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 23/65] tracing: Fix number printing bug in print_synth_event() Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 24/65] cfg80211: check reg_rule for NULL in handle_channel_custom() Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 25/65] scsi: libfc: free response frame from GPN_ID Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 26/65] net: usb: qmi_wwan: restore mtu min/max values after raw_ip switch Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 27/65] net: ks8851-ml: Fix IRQ handling and locking Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 28/65] mac80211: rx: avoid RCU list traversal under mutex Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 29/65] net: ll_temac: Fix race condition causing TX hang Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 30/65] net: ll_temac: Add more error handling of dma_map_single() calls Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 31/65] net: ll_temac: Fix RX buffer descriptor handling on GFP_ATOMIC pressure Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 32/65] net: ll_temac: Handle DMA halt condition caused by buffer underrun Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 33/65] blk-mq: insert passthrough request into hctx->dispatch directly Greg Kroah-Hartman
2020-03-19 13:04 ` Greg Kroah-Hartman [this message]
2020-03-19 13:04 ` [PATCH 5.5 35/65] drm/amdgpu: fix memory leak during TDR test(v2) Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 36/65] io_uring: pick up link work on submit reference drop Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 37/65] kbuild: add dtbs_check to PHONY Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 38/65] kbuild: add dt_binding_check to PHONY in a correct place Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 39/65] signal: avoid double atomic counter increments for user accounting Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 40/65] net: bcmgenet: Clear ID_MODE_DIS in EXT_RGMII_OOB_CTRL when not needed Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 41/65] slip: not call free_netdev before rtnl_unlock in slip_open Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 42/65] net: phy: mscc: fix firmware paths Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 43/65] hinic: fix a irq affinity bug Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 44/65] hinic: fix a bug of setting hw_ioctxt Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 45/65] hinic: fix a bug of rss configuration Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 46/65] net: rmnet: fix NULL pointer dereference in rmnet_newlink() Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 47/65] net: rmnet: fix NULL pointer dereference in rmnet_changelink() Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 48/65] net: rmnet: fix suspicious RCU usage Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 49/65] net: rmnet: remove rcu_read_lock in rmnet_force_unassociate_device() Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 50/65] net: rmnet: do not allow to change mux id if mux id is duplicated Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 51/65] net: rmnet: use upper/lower device infrastructure Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 52/65] net: rmnet: fix bridge mode bugs Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 53/65] net: rmnet: fix packet forwarding in rmnet bridge mode Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 54/65] sfc: fix timestamp reconstruction at 16-bit rollover points Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 55/65] mlxsw: pci: Wait longer before accessing the device after reset Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 56/65] net: dsa: mv88e6xxx: Fix masking of egress port Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 57/65] jbd2: fix data races at struct journal_head Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 58/65] blk-mq: insert flush request to the front of dispatch queue Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 59/65] ARM: 8957/1: VDSO: Match ARMv8 timer in cntvct_functional() Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 60/65] ARM: 8958/1: rename missed uaccess .fixup section Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 61/65] mm: slub: add missing TID bump in kmem_cache_alloc_bulk() Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 62/65] HID: google: add moonball USB id Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 63/65] HID: add ALWAYS_POLL quirk to lenovo pixart mouse Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 64/65] ARM: 8961/2: Fix Kbuild issue caused by per-task stack protector GCC plugin Greg Kroah-Hartman
2020-03-19 13:04 ` [PATCH 5.5 65/65] ipv4: ensure rcu_read_lock() in cipso_v4_error() Greg Kroah-Hartman
2020-03-19 14:44 ` [PATCH 5.5 00/65] 5.5.11-rc1 review Guenter Roeck
2020-03-19 14:59   ` Greg Kroah-Hartman
2020-03-19 15:15     ` Guenter Roeck
2020-03-19 15:33       ` Greg Kroah-Hartman
2020-03-20 14:46       ` Sasha Levin
2020-03-22 19:51         ` Pavel Machek
2020-03-22 20:34           ` Sasha Levin
2020-03-23  6:48           ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200319123937.204774554@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=axboe@kernel.dk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=xiaoguang.wang@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).