From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mout.gmx.net ([212.227.15.19]:56543 "EHLO mout.gmx.net"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752760AbeBPFjU (ORCPT ); Fri, 16 Feb 2018 00:39:20 -0500
Message-ID: <1518759556.17014.63.camel@gmx.de>
Subject: Re: [PATCH BUGFIX V3] block, bfq: add requeue-request hook
From: Mike Galbraith
To: Paolo Valente, Jens Axboe
Cc: Oleksandr Natalenko, stable
Date: Fri, 16 Feb 2018 06:39:16 +0100
In-Reply-To:
References: <20180207211920.6343-1-paolo.valente@linaro.org>
	<1518197379.26824.31.camel@gmx.de>
	<6394471.U0O273vb9H@natalenko.name>
	<9E24F648-C93A-4CEA-A1B6-B041540CEAAE@linaro.org>
	<1518434553.13087.7.camel@gmx.de>
	<805e9d5af9aed3fbc2697fdb0ce51e88@natalenko.name>
	<1518498175.6944.48.camel@gmx.de>
	<816E0B1B-B2D9-4604-A8CB-1E32AFBF6C22@linaro.org>
	<1518504174.6944.71.camel@gmx.de>
	<1518504640.6944.73.camel@gmx.de>
	<1518532236.15792.25.camel@gmx.de>
	<1518587910.5647.14.camel@gmx.de>
	<1518591888.6752.12.camel@gmx.de>
	<1518592512.6752.14.camel@gmx.de>
	<07DA441B-C2E8-4F65-B674-11C87D2F084B@linaro.org>
	<0cdbafe7-fe13-51b4-4e86-e7453026508e@kernel.dk>
	<6A22B75D-B033-4EB0-8CDB-91A2E6755664@linaro.org>
	<06d0f749-511b-885b-d55f-922d99dcc24e@kernel.dk>
Content-Type: text/plain; charset="ISO-8859-15"
Mime-Version: 1.0
Content-Transfer-Encoding: 8BIT
Sender: stable-owner@vger.kernel.org
List-ID:

On Thu, 2018-02-15 at 19:13 +0100, Paolo Valente wrote:
>
> > On 14 Feb 2018, at 16:44, Jens Axboe wrote:
> >
> > On 2/14/18 8:39 AM, Paolo Valente wrote:
> >>
> >>
> >>> On 14 Feb 2018, at 16:19, Jens Axboe wrote:
> >>>
> >>> On 2/14/18 1:56 AM, Paolo Valente wrote:
> >>>>
> >>>>
> >>>>> On 14 Feb 2018, at 08:15, Mike Galbraith wrote:
> >>>>>
> >>>>> On Wed, 2018-02-14 at 08:04 +0100, Mike Galbraith wrote:
> >>>>>>
> >>>>>> And _of course_, roughly two minutes later, IO stalled.
> >>>>>
> >>>>> P.S.
> >>>>>
> >>>>> crash> bt 19117
> >>>>> PID: 19117  TASK: ffff8803d2dcd280  CPU: 7  COMMAND: "kworker/7:2"
> >>>>>  #0 [ffff8803f7207bb8] __schedule at ffffffff81595e18
> >>>>>  #1 [ffff8803f7207c40] schedule at ffffffff81596422
> >>>>>  #2 [ffff8803f7207c50] io_schedule at ffffffff8108a832
> >>>>>  #3 [ffff8803f7207c60] blk_mq_get_tag at ffffffff8129cd1e
> >>>>>  #4 [ffff8803f7207cc0] blk_mq_get_request at ffffffff812987cc
> >>>>>  #5 [ffff8803f7207d00] blk_mq_alloc_request at ffffffff81298a9a
> >>>>>  #6 [ffff8803f7207d38] blk_get_request_flags at ffffffff8128e674
> >>>>>  #7 [ffff8803f7207d60] scsi_execute at ffffffffa0025b58 [scsi_mod]
> >>>>>  #8 [ffff8803f7207d98] scsi_test_unit_ready at ffffffffa002611c [scsi_mod]
> >>>>>  #9 [ffff8803f7207df8] sd_check_events at ffffffffa0212747 [sd_mod]
> >>>>> #10 [ffff8803f7207e20] disk_check_events at ffffffff812a0f85
> >>>>> #11 [ffff8803f7207e78] process_one_work at ffffffff81079867
> >>>>> #12 [ffff8803f7207eb8] worker_thread at ffffffff8107a127
> >>>>> #13 [ffff8803f7207f10] kthread at ffffffff8107ef48
> >>>>> #14 [ffff8803f7207f50] ret_from_fork at ffffffff816001a5
> >>>>> crash>
> >>>>
> >>>> This has evidently to do with tag pressure. I've looked for a way to
> >>>> easily reduce the number of tags online, so as to put your system in
> >>>> the bad spot deterministically. But to no avail. Does anyone know a
> >>>> way to do it?
> >>>
> >>> The key here might be that it's not a regular file system request,
> >>> which I'm sure bfq probably handles differently. So it's possible
> >>> that you are slowly leaking those tags, and we end up in this
> >>> miserable situation after a while.
> >>>
> >>
> >> Could you elaborate more on this? My mental model of bfq hooks in
> >> this respect is that they do only side operations, which AFAIK cannot
> >> block the putting of a tag. IOW, tag getting and putting is done
> >> outside bfq, regardless of what bfq does with I/O requests. Is there
> >> a flaw in this?
> >>
> >> In any case, is there any flag or the like in requests passed to
> >> bfq, that I could make bfq check, to raise some warning?
> >
> > I'm completely guessing, and I don't know if this trace is always what
> > Mike sees when things hang. It just seems suspect that we end up with a
> > "special" request here, since I'm sure the regular file system requests
> > outnumber them greatly. That raises my suspicion that the type is
> > related.
> >
> > But no, there should be no special handling on the freeing side, my
> > guess was that BFQ ends them a bit differently.
> >
>
> Hi Jens,
> whatever the exact cause of leakage is, a leakage in its turn does
> sound like a reasonable cause for these hangs. But also if leakage is
> the cause, it seems to me that reducing tags to just 1 might help
> trigger the problem quickly and reliably on Mike's machine. If you
> agree, Jens, which would be the quickest/easiest way to reduce tags?

Whatever the cause, seems this wants some instrumentation that can be
left in place for a while. I turned on CONFIG_BLK_DEBUG_FS for Jens,
but the little bugger didn't raise its ugly head all day long.

What you need most is more reproducers. My box swears there's
something amiss, and that something is BFQ.. but it's alone.

	-Mike
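
For readers following the tag-leak hypothesis, here is a toy model in Python of why a slow leak would stall I/O only after a while, and why cutting the tag depth to 1 would make it show up almost at once. It is only a sketch under stated assumptions: `TagPool`, `run`, and the "special"/"fs" labels are hypothetical names, none of this is blk-mq or bfq code, and the premise that the non-filesystem requests are the ones never put back is Jens's guess from the thread, not an established fact.

```python
class TagPool:
    """Toy fixed-depth tag pool. Hypothetical model, not blk-mq code."""

    def __init__(self, depth):
        self.free = list(range(depth))
        self.owner = {}  # tag -> label of the request type holding it

    def get(self, label):
        if not self.free:
            # A real caller would sleep here (compare blk_mq_get_tag in the trace).
            return None
        tag = self.free.pop()
        self.owner[tag] = label
        return tag

    def put(self, tag):
        del self.owner[tag]
        self.free.append(tag)

    def in_use_by_label(self):
        # The "instrumentation": who is holding tags right now?
        counts = {}
        for label in self.owner.values():
            counts[label] = counts.get(label, 0) + 1
        return counts


def run(depth, rounds, leak_every):
    """Issue `rounds` requests; every `leak_every`-th one is a "special"
    request whose tag is never put back (the assumed bug).
    Returns (round at which allocation stalled, or None; tags held per label)."""
    pool = TagPool(depth)
    for i in range(rounds):
        special = (i % leak_every == 0)
        tag = pool.get("special" if special else "fs")
        if tag is None:
            return i, pool.in_use_by_label()
        if not special:
            pool.put(tag)  # regular requests complete and free their tag
    return None, pool.in_use_by_label()


if __name__ == "__main__":
    print(run(depth=62, rounds=10000, leak_every=100))
    print(run(depth=1, rounds=10000, leak_every=100))
```

In this model a pool of 62 tags survives thousands of requests before stalling, while a pool of 1 stalls on the request right after the first leaked one. The per-label count at the stall also shows which request type is sitting on the tags, which is the sort of thing instrumentation left in place would need to report.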