From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 274F2C10F0E for ; Fri, 12 Apr 2019 08:10:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id EC9F920818 for ; Fri, 12 Apr 2019 08:10:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727021AbfDLIKS (ORCPT ); Fri, 12 Apr 2019 04:10:18 -0400 Received: from mx1.redhat.com ([209.132.183.28]:38580 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725973AbfDLIKS (ORCPT ); Fri, 12 Apr 2019 04:10:18 -0400 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 9E6033082EF1; Fri, 12 Apr 2019 08:10:17 +0000 (UTC) Received: from gondolin (dhcp-192-187.str.redhat.com [10.33.192.187]) by smtp.corp.redhat.com (Postfix) with ESMTP id 660795D6A9; Fri, 12 Apr 2019 08:10:16 +0000 (UTC) Date: Fri, 12 Apr 2019 10:10:13 +0200 From: Cornelia Huck To: Farhan Ali Cc: kvm@vger.kernel.org, linux-s390@vger.kernel.org, farman@linux.ibm.com, pasic@linux.ibm.com, pmorel@linux.ibm.com Subject: Re: [RFC v2 2/3] vfio-ccw: Prevent quiesce function going into an infinite loop Message-ID: <20190412101013.2bf4a5df.cohuck@redhat.com> In-Reply-To: References: <2c17cf29fbce8fc1cfbf60cfd04559d00c8eeac0.1554756534.git.alifm@linux.ibm.com> <20190411182434.07d5f685.cohuck@redhat.com> Organization: Red Hat GmbH MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.46]); Fri, 12 Apr 2019 08:10:17 +0000 (UTC) Sender: kvm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Thu, 11 Apr 2019 16:30:44 -0400 Farhan Ali wrote: > On 04/11/2019 12:24 PM, Cornelia Huck wrote: > > On Mon, 8 Apr 2019 17:05:32 -0400 > > Farhan Ali wrote: > > > >> The quiesce function calls cio_cancel_halt_clear() and if we > >> get an -EBUSY we go into a loop where we: > >> - wait for any interrupts > >> - flush all I/O in the workqueue > >> - retry cio_cancel_halt_clear > >> > >> During the period where we are waiting for interrupts or > >> flushing all I/O, the channel subsystem could have completed > >> a halt/clear action and turned off the corresponding activity > >> control bits in the subchannel status word. This means the next > >> time we call cio_cancel_halt_clear(), we will again start by > >> calling cancel subchannel and so we can be stuck between calling > >> cancel and halt forever. > >> > >> Rather than calling cio_cancel_halt_clear() immediately after > >> waiting, let's try to disable the subchannel. If we succeed in > >> disabling the subchannel then we know nothing else can happen > >> with the device. > >> > >> Suggested-by: Eric Farman > >> Signed-off-by: Farhan Ali > >> --- > >> drivers/s390/cio/vfio_ccw_drv.c | 27 ++++++++++++--------------- > >> 1 file changed, 12 insertions(+), 15 deletions(-) > >> > >> diff --git a/drivers/s390/cio/vfio_ccw_drv.c b/drivers/s390/cio/vfio_ccw_drv.c > >> index 5aca475..4405f2a 100644 > >> --- a/drivers/s390/cio/vfio_ccw_drv.c > >> +++ b/drivers/s390/cio/vfio_ccw_drv.c > >> @@ -43,26 +43,23 @@ int vfio_ccw_sch_quiesce(struct subchannel *sch) > >> if (ret != -EBUSY) > >> goto out_unlock; > >> > >> + iretry = 255; > >> do { > >> - iretry = 255; > >> > >> ret = cio_cancel_halt_clear(sch, &iretry); > >> - while (ret == -EBUSY) { > >> - /* > >> - * Flush all I/O and wait for > >> - * cancel/halt/clear completion. > >> - */ > >> - private->completion = &completion; > >> - spin_unlock_irq(sch->lock); > >> - > >> + /* > >> + * Flush all I/O and wait for > >> + * cancel/halt/clear completion. > >> + */ > >> + private->completion = &completion; > >> + spin_unlock_irq(sch->lock); > >> + > >> + if (ret == -EBUSY) > > > > I don't think you need to do the unlock/lock and change > > private->completion if you don't actually wait, no? > > If we don't end up waiting, then changing private->completion would not > be needed. But we would still need to release the spinlock due to [1]. > > > > > Looking at the possible return codes: > > * -ENODEV -> device is not operational anyway, in theory you should even > > not need to bother with disabling the subchannel > > * -EIO -> we've run out of retries, and the subchannel still is not > > idle; I'm not sure if we could do anything here, as disable is > > unlikely to work, either > > We could break out of the loop early for these cases. My thinking was I > wanted to depend on the result of trying to disable, because ultimately > that's what we want. > > I can add the cases to break out of the loop early. The -ENODEV case does not really hurt, as it will get us out of the loop anyway. But for the -EIO case, I think we'll get -EBUSY from the disable and stay within the loop endlessly? > > > > * -EBUSY -> we expect an interrupt (or a timeout), the loop looks fine > > for that > > * 0 -> the one thing that might happen is that we get an unsolicited > > interrupt between the successful cancel_halt_clear and the disable; > > not even giving up the lock here might even be better here? > > I didn't think of this case, but if cancel_halt_clear succeeds with 0 > then we should wait, no? For 0 I don't expect a solicited interrupt (documentation for the functions says that the subchannel is idle in that case); it's just the unsolicited interrupt that might get into the way. > > > > > I think this loop will probably work as it is after this patch, but > > giving up the lock when not really needed makes me a bit queasy... what > > do others think? > > > >> wait_for_completion_timeout(&completion, 3*HZ); > >> > >> - private->completion = NULL; > >> - flush_workqueue(vfio_ccw_work_q); > >> - spin_lock_irq(sch->lock); > >> - ret = cio_cancel_halt_clear(sch, &iretry); > >> - }; > >> - > >> + private->completion = NULL; > > [1] flush_workqueue can go to sleep so we would still need to release > spinlock and reacquire it again to try disabling the subchannel. Grr, I thought we could skip the flush in the !-EBUSY case, but I think we can't due to the possibility of an unsolicited interrupt... what simply adding handling for -EIO (although I'm not sure what we can sensibly do in that case) and leave the other cases as they are now? > > >> + flush_workqueue(vfio_ccw_work_q); > >> + spin_lock_irq(sch->lock); > >> ret = cio_disable_subchannel(sch); > >> } while (ret == -EBUSY); > >> out_unlock: > > > > >