From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gabriel Krisman Bertazi
To: Jens Axboe
Subject: Re: Oops when completing request on the wrong queue
References: <87a8gltgks.fsf@linux.vnet.ibm.com>
 <871t1kq455.fsf@linux.vnet.ibm.com>
 <8fc9ae38-9488-ef52-f620-08499edebffa@kernel.dk>
 <87shu0hfye.fsf@linux.vnet.ibm.com>
 <87a8g39pg4.fsf@linux.vnet.ibm.com>
 <43693064-dd37-92ce-7753-2a8edb43eab5@kernel.dk>
 <164a4c63-065b-b766-36f3-bcef4aa46a38@kernel.dk>
 <49a954e6-2f96-8a63-ce15-2c82c1a1d36d@kernel.dk>
 <87d1krzbz6.fsf@linux.vnet.ibm.com>
 <5902c166-7aec-b2ae-72d7-07e8efeb5aa9@kernel.dk>
Date: Mon, 05 Sep 2016 09:02:56 -0300
In-Reply-To: <5902c166-7aec-b2ae-72d7-07e8efeb5aa9@kernel.dk> (Jens Axboe's
 message of "Mon, 29 Aug 2016 12:40:32 -0600")
MIME-Version: 1.0
Message-Id: <87mvjmzh8v.fsf@linux.vnet.ibm.com>
Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
 linux-nvme@lists.infradead.org, Keith Busch, Brian King, Christoph Hellwig
Content-Type: text/plain; charset="us-ascii"
Sender: "Linux-nvme"
Errors-To: linux-nvme-bounces+axboe=kernel.dk@lists.infradead.org

Jens Axboe writes:

> On 08/29/2016 12:06 PM, Gabriel Krisman Bertazi wrote:
>> Jens Axboe writes:
>>>> Can you try this patch? It's not perfect, but I'll be interested if it
>>>> makes a difference for you.
>>>
>>
>> Hi Jens,
>>
>> Sorry for the delay. I just got back to this and have been running your
>> patch on top of 4.8 without a crash for over 1 hour. I wanna give it
>> more time to make sure it's running properly, though.
>>
>> Let me get back to you after a few more rounds of test.
>
> Thanks, sounds good. The patches have landed in mainline too.

Hi Jens,

Our test teams ran stress tests on several machines over the last week
on a test kernel with your patches applied, and were no longer able to
reproduce the issue. Thanks a lot for helping out on this one.

>>> This one should handle the WARN_ON() for running the hw queue on the
>>> wrong CPU as well.
>>
>> On the workaround you added to prevent WARN_ON, we surely need to
>> prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first
>> place, right.. How do you feel about the following RFC? I know it's
>> not a complete fix, but it feels like a good improvement to me.
>>
>> http://www.spinics.net/lists/linux-scsi/msg98608.html
>
> But we can't completely prevent it, and I don't think we have to. I just
> don't want to trigger a warning for something that's a valid condition.
> I want the warning to trigger if this happens without the CPU going
> offline, since then it's indicative of a real bug in the mapping. Your
> patch isn't going to prevent it either - it'll shrink the window, at the
> expense of making blk_mq_hctx_next_cpu() more expensive. So I don't
> think it's worthwhile.

Right, I got your point. Your patch definitely prevents the WARN_ON from
occurring on CPU hotplug events too.
So thanks a lot for help on that too :)

-- 
Gabriel Krisman Bertazi
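
The warning condition discussed in the thread can be illustrated with a
small user-space model. This is only a sketch under stated assumptions,
not the blk-mq code or the patch itself: the names cpu_mask_t,
mask_test_cpu() and wrong_cpu_is_a_bug() are hypothetical, and CPU masks
are modelled as plain 64-bit bitmaps. It shows the idea that running a
hardware queue on an unmapped CPU is tolerated when the chosen CPU has
gone offline, and reported only when that CPU is still online.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpu_mask_t;

/* Return true if 'cpu' is set in 'mask'. */
bool mask_test_cpu(unsigned int cpu, cpu_mask_t mask)
{
	return cpu < 64 && ((mask >> cpu) & 1u) != 0;
}

/*
 * Decide whether running the hardware queue on 'this_cpu' indicates a
 * mapping bug (hypothetical helper, for illustration only).
 *
 *   this_cpu    - CPU the queue is actually running on
 *   next_cpu    - CPU that was selected to run the queue
 *   hctx_mask   - CPUs mapped to this hardware queue
 *   online_mask - CPUs currently online
 */
bool wrong_cpu_is_a_bug(unsigned int this_cpu, unsigned int next_cpu,
			cpu_mask_t hctx_mask, cpu_mask_t online_mask)
{
	/* Running on a CPU mapped to the queue is always fine. */
	if (mask_test_cpu(this_cpu, hctx_mask))
		return false;

	/*
	 * The selected CPU went offline, so the work ended up elsewhere.
	 * This is the valid hotplug case and should not warn.
	 */
	if (!mask_test_cpu(next_cpu, online_mask))
		return false;

	/* Wrong CPU although the selected CPU is online: report it. */
	return true;
}
```

For example, with a queue mapped to CPUs 1 and 2 (mask 0x6) and CPU 2
selected, running on CPU 0 is reported only while CPU 2 is still set in
the online mask. The check stays cheap because it is evaluated when the
queue runs and adds nothing to the CPU selection path.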