Re: [PATCH 3/4] nvme: tcp: fix race between timeout and normal completion

From: Ming Lei <ming.lei@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	Christoph Hellwig <hch@lst.de>, Keith Busch <kbusch@kernel.org>,
	Chao Leng <lengchao@huawei.com>, Yi Zhang <yi.zhang@redhat.com>
Subject: Re: [PATCH 3/4] nvme: tcp: fix race between timeout and normal completion
Date: Tue, 20 Oct 2020 17:44:20 +0800
Message-ID: <20201020094420.GD1429635@T590>
In-Reply-To: <e9d2e28e-fb55-358c-3e8c-6f3e9dd91c25@grimberg.me>

On Tue, Oct 20, 2020 at 01:11:11AM -0700, Sagi Grimberg wrote:
> 
> > The NVMe TCP timeout handler aborts a request directly when the
> > controller isn't in LIVE state. nvme_tcp_error_recovery() updates the
> > controller state to RESETTING and schedules the reset work function. If
> > a new timeout fires before the work function runs, the newly timed-out
> > request is aborted directly. At that point the controller hasn't been
> > shut down yet, so the race between timeout abort and normal completion
> > is triggered.
> 
> This assertion is incorrect: before completing the request from
> the timeout handler, we call nvme_tcp_stop_queue, which guarantees upon
> return that no more completions will be seen from this queue.
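
As a rough illustration of the ordering described above, here is a small
userspace model, not the driver code. All model_* names are hypothetical,
and the mutex is only an assumed stand-in for however nvme_tcp_stop_queue
synchronizes with the receive path:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the driver. */
struct model_queue {
	pthread_mutex_t lock;	/* serializes the receive path against stop */
	bool live;		/* cleared by model_stop_queue() */
};

struct model_request {
	bool completed;
	int completions;	/* how many times the request was completed */
};

struct model_ctx {
	struct model_queue *q;
	struct model_request *rq;
};

/* Normal completion: only touches the request while the queue is live. */
static bool model_normal_completion(struct model_queue *q,
				    struct model_request *rq)
{
	bool done = false;

	pthread_mutex_lock(&q->lock);
	if (q->live && !rq->completed) {
		rq->completed = true;
		rq->completions++;
		done = true;
	}
	pthread_mutex_unlock(&q->lock);
	return done;
}

/* Once this returns, the receive path can no longer complete anything. */
static void model_stop_queue(struct model_queue *q)
{
	pthread_mutex_lock(&q->lock);	/* waits out an in-flight completion */
	q->live = false;
	pthread_mutex_unlock(&q->lock);
}

/* Timeout path: stop the queue first, then complete if still pending. */
static bool model_timeout_complete(struct model_queue *q,
				   struct model_request *rq)
{
	model_stop_queue(q);
	if (rq->completed)
		return false;	/* normal completion won before the stop */
	rq->completed = true;
	rq->completions++;
	return true;
}

static void *model_rx_thread(void *arg)
{
	struct model_ctx *ctx = arg;

	model_normal_completion(ctx->q, ctx->rq);
	return NULL;
}

/* Race both paths once; returns the completion count, or -1 on error. */
int model_run_once(void)
{
	struct model_queue q = { .live = true };
	struct model_request rq = { 0 };
	struct model_ctx ctx = { &q, &rq };
	pthread_t rx;

	if (pthread_mutex_init(&q.lock, NULL) != 0)
		return -1;
	if (pthread_create(&rx, NULL, model_rx_thread, &ctx) != 0) {
		pthread_mutex_destroy(&q.lock);
		return -1;
	}
	model_timeout_complete(&q, &rq);
	pthread_join(rx, NULL);
	pthread_mutex_destroy(&q.lock);
	return rq.completions;
}
```

Whichever side wins, the request is completed exactly once in this model:
either the normal completion lands before the stop and the timeout path
sees it, or the stop lands first and the receive path leaves the request
alone.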

OK, then it looks like the issue can be fixed by patches 1 & 2 alone.

Yi, can you test again and see if the issue is fixed with only patches 1 & 2 applied?

Thanks,
Ming