[PATCH 0/2] fix double completion of timed out commands

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] fix double completion of timed out commands
@ 2019-10-21 19:56 Josef Bacik
  2019-10-21 19:56 ` [PATCH 1/2] nbd: protect cmd->status with cmd->lock Josef Bacik
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Josef Bacik @ 2019-10-21 19:56 UTC (permalink / raw)
  To: axboe, nbd, linux-block, kernel-team, mchristi

We noticed a problem where NBD sometimes double completes the same request when
things go wrong and we time out the request.  If the other side goes out to
lunch but happens to reply just as we're timing out the requests we can end up
with a double completion on the request.

We already keep track of the command status, we just need to make sure we
protect all cases where we set cmd->status with the cmd->lock, which is patch
#1.  Patch #2 is the fix for the problem, which catches the case where we race
with the timeout handler and the reply handler.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] nbd: protect cmd->status with cmd->lock
  2019-10-21 19:56 [PATCH 0/2] fix double completion of timed out commands Josef Bacik
@ 2019-10-21 19:56 ` Josef Bacik
  2019-10-21 19:56 ` [PATCH 2/2] nbd: handle racing with error'ed out commands Josef Bacik
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2019-10-21 19:56 UTC (permalink / raw)
  To: axboe, nbd, linux-block, kernel-team, mchristi

We already do this for the most part, except in timeout and clear_req.
For the timeout case we take the lock after we grab a ref on the config,
but that isn't really necessary because we're safe to touch the cmd at
this point, so just move the order around.

For the clear_req cause this is initiated by the user, so again is safe.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 drivers/block/nbd.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index a8e3815295fe..8fb8913074b8 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -368,17 +368,16 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 	struct nbd_device *nbd = cmd->nbd;
 	struct nbd_config *config;
 
+	if (!mutex_trylock(&cmd->lock))
+		return BLK_EH_RESET_TIMER;
+
 	if (!refcount_inc_not_zero(&nbd->config_refs)) {
 		cmd->status = BLK_STS_TIMEOUT;
+		mutex_unlock(&cmd->lock);
 		goto done;
 	}
 	config = nbd->config;
 
-	if (!mutex_trylock(&cmd->lock)) {
-		nbd_config_put(nbd);
-		return BLK_EH_RESET_TIMER;
-	}
-
 	if (config->num_connections > 1) {
 		dev_err_ratelimited(nbd_to_dev(nbd),
 				    "Connection timed out, retrying (%d/%d alive)\n",
@@ -775,7 +774,10 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved)
 {
 	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 
+	mutex_lock(&cmd->lock);
 	cmd->status = BLK_STS_IOERR;
+	mutex_unlock(&cmd->lock);
+
 	blk_mq_complete_request(req);
 	return true;
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] nbd: handle racing with error'ed out commands
  2019-10-21 19:56 [PATCH 0/2] fix double completion of timed out commands Josef Bacik
  2019-10-21 19:56 ` [PATCH 1/2] nbd: protect cmd->status with cmd->lock Josef Bacik
@ 2019-10-21 19:56 ` Josef Bacik
  2019-10-21 21:43 ` [PATCH 0/2] fix double completion of timed " Mike Christie
  2019-10-25 20:20 ` Jens Axboe
  3 siblings, 0 replies; 5+ messages in thread
From: Josef Bacik @ 2019-10-21 19:56 UTC (permalink / raw)
  To: axboe, nbd, linux-block, kernel-team, mchristi

We hit the following warning in production

print_req_error: I/O error, dev nbd0, sector 7213934408 flags 80700
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 25 PID: 32407 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
Workqueue: knbd-recv recv_work [nbd]
RIP: 0010:refcount_sub_and_test_checked+0x53/0x60
Call Trace:
 blk_mq_free_request+0xb7/0xf0
 blk_mq_complete_request+0x62/0xf0
 recv_work+0x29/0xa1 [nbd]
 process_one_work+0x1f5/0x3f0
 worker_thread+0x2d/0x3d0
 ? rescuer_thread+0x340/0x340
 kthread+0x111/0x130
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x1f/0x30
---[ end trace b079c3c67f98bb7c ]---

This was preceded by us timing out everything and shutting down the
sockets for the device.  The problem is we had a request in the queue at
the same time, so we completed the request twice.  This can actually
happen in a lot of cases, we fail to get a ref on our config, we only
have one connection and just error out the command, etc.

Fix this by checking cmd->status in nbd_read_stat.  We only change this
under the cmd->lock, so we are safe to check this here and see if we've
already error'ed this command out, which would indicate that we've
completed it as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 drivers/block/nbd.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 8fb8913074b8..e9f5d4e476e7 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -693,6 +693,12 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
 		ret = -ENOENT;
 		goto out;
 	}
+	if (cmd->status != BLK_STS_OK) {
+		dev_err(disk_to_dev(nbd->disk), "Command already handled %p\n",
+			req);
+		ret = -ENOENT;
+		goto out;
+	}
 	if (test_bit(NBD_CMD_REQUEUED, &cmd->flags)) {
 		dev_err(disk_to_dev(nbd->disk), "Raced with timeout on req %p\n",
 			req);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] fix double completion of timed out commands
  2019-10-21 19:56 [PATCH 0/2] fix double completion of timed out commands Josef Bacik
  2019-10-21 19:56 ` [PATCH 1/2] nbd: protect cmd->status with cmd->lock Josef Bacik
  2019-10-21 19:56 ` [PATCH 2/2] nbd: handle racing with error'ed out commands Josef Bacik
@ 2019-10-21 21:43 ` Mike Christie
  2019-10-25 20:20 ` Jens Axboe
  3 siblings, 0 replies; 5+ messages in thread
From: Mike Christie @ 2019-10-21 21:43 UTC (permalink / raw)
  To: Josef Bacik, axboe, nbd, linux-block, kernel-team

On 10/21/2019 02:56 PM, Josef Bacik wrote:
> We noticed a problem where NBD sometimes double completes the same request when
> things go wrong and we time out the request.  If the other side goes out to
> lunch but happens to reply just as we're timing out the requests we can end up
> with a double completion on the request.
> 
> We already keep track of the command status, we just need to make sure we
> protect all cases where we set cmd->status with the cmd->lock, which is patch
> #1.  Patch #2 is the fix for the problem, which catches the case where we race
> with the timeout handler and the reply handler.  Thanks,
> 

Patches look ok and tested ok for me.

Reviewed-by: Mike Christie <mchristi@redhat.com>


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/2] fix double completion of timed out commands
  2019-10-21 19:56 [PATCH 0/2] fix double completion of timed out commands Josef Bacik
                   ` (2 preceding siblings ...)
  2019-10-21 21:43 ` [PATCH 0/2] fix double completion of timed " Mike Christie
@ 2019-10-25 20:20 ` Jens Axboe
  3 siblings, 0 replies; 5+ messages in thread
From: Jens Axboe @ 2019-10-25 20:20 UTC (permalink / raw)
  To: Josef Bacik, nbd, linux-block, kernel-team, mchristi

On 10/21/19 1:56 PM, Josef Bacik wrote:
> We noticed a problem where NBD sometimes double completes the same request when
> things go wrong and we time out the request.  If the other side goes out to
> lunch but happens to reply just as we're timing out the requests we can end up
> with a double completion on the request.
> 
> We already keep track of the command status, we just need to make sure we
> protect all cases where we set cmd->status with the cmd->lock, which is patch
> #1.  Patch #2 is the fix for the problem, which catches the case where we race
> with the timeout handler and the reply handler.  Thanks,

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2019-10-25 20:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2019-10-21 19:56 [PATCH 0/2] fix double completion of timed out commands Josef Bacik
2019-10-21 19:56 ` [PATCH 1/2] nbd: protect cmd->status with cmd->lock Josef Bacik
2019-10-21 19:56 ` [PATCH 2/2] nbd: handle racing with error'ed out commands Josef Bacik
2019-10-21 21:43 ` [PATCH 0/2] fix double completion of timed " Mike Christie
2019-10-25 20:20 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).