Subject: Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
To: Willem de Bruijn
Cc: David Miller, Jakub Kicinski, elder@kernel.org, evgreen@chromium.org,
 bjorn.andersson@linaro.org, cpratapa@codeaurora.org,
 Subash Abhinov Kasiviswanathan, Network Development, LKML
References: <20210129202019.2099259-1-elder@linaro.org>
 <20210129202019.2099259-10-elder@linaro.org>
 <67f4aa5a-4a60-41e6-a049-0ff93fb87b66@linaro.org>
From: Alex Elder
Date: Mon, 1 Feb 2021 08:35:03 -0600
X-Mailing-List: netdev@vger.kernel.org

On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> On Sun, Jan 31, 2021 at 10:32 AM Alex Elder wrote:
>>
>> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
>>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder wrote:
>>>>
>>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
>>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder wrote:
>>>>>>
>>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
>>>>>> which quiesces channel activity, disables NAPI, and (on other than
>>>>>> SDM845) stops the channel.  Similarly, the start and resume paths
>>>>>> share __gsi_channel_start(), which starts the channel and re-enables
>>>>>> NAPI again.
>>>>>>
>>>>>> Disabling NAPI should be done when stopping a channel, but this
>>>>>> should *not* be done when suspending.
>>>>>> It's not necessary in the
>>>>>> suspend path anyway, because the stopped channel (or suspended
>>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>>>>>> and gsi_channel_trans_quiesce() won't return until there are no
>>>>>> more transactions to process in the NAPI polling loop.
>>>>>
>>>>> But why is it incorrect to do so?
>>>>
>>>> Maybe it's not; I also thought it was fine before, but...

. . .

>> The "hang" occurs on an RX endpoint, and in particular it
>> occurs on an endpoint that we *know* will be receiving a
>> packet as part of the suspend process (when clearing the
>> hardware pipeline).  I can go into that further but won't
>> unless asked.
>>
>>>> A stopped channel won't interrupt,
>>>> so we don't bother disabling the completion interrupt,
>>>> with no interrupts, NAPI won't be scheduled, so there's
>>>> no need to disable NAPI either.
>>>
>>> That sounds plausible. But it doesn't explain why napi_disable "should
>>> *not* be done when suspending" as the commit states.
>>>
>>> Arguably, leaving that won't have much effect either way, and is in
>>> line with other drivers.
>>
>> Understood and agreed.  In fact, if the hang occurs in
>> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
>> it would occur in napi_synchronize() as well.
>
> Agreed.
>
> So you have an environment to test a patch in, it might be worthwhile
> to test essentially the same logic reordering as in this patch set,
> but while still disabling napi.

What is the purpose of this test?  Just to guarantee
that the NAPI hang goes away?  Because you agree that
the napi_synchronize() call would *also* hang if that
problem exists, right?

Anyway, what you're suggesting is to simply test with
this last patch removed.  I can do that but I really
don't expect it to change anything.  I will start that
test later today when I'm turning my attention to
something else for a while.
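
To make the point about NAPI_STATE_SCHED concrete, here is a rough userspace sketch of why both calls would get stuck on the same condition.  This is a toy model and not kernel code: the toy_* names and the bounded poll count are made up for illustration, and the only behavior it assumes is that both napi_synchronize() and napi_disable() wait for the SCHED bit to clear, with napi_disable() then holding the bit itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the NAPI state bits; NOT the kernel's definitions. */
enum { TOY_SCHED = 1 << 0, TOY_DISABLE = 1 << 1 };

struct toy_napi {
	unsigned int state;
};

/*
 * Toy napi_synchronize(): poll until SCHED is clear.  Returns false if
 * SCHED is still set after max_polls tries, which stands in for a hang.
 */
static bool toy_napi_synchronize(const struct toy_napi *n, int max_polls)
{
	assert(max_polls > 0);
	for (int i = 0; i < max_polls; i++)
		if (!(n->state & TOY_SCHED))
			return true;
	return false;
}

/*
 * Toy napi_disable(): the same wait, but on success it takes ownership
 * of SCHED so that later scheduling attempts are ignored.
 */
static bool toy_napi_disable(struct toy_napi *n, int max_polls)
{
	assert(max_polls > 0);
	n->state |= TOY_DISABLE;
	for (int i = 0; i < max_polls; i++) {
		if (!(n->state & TOY_SCHED)) {
			n->state |= TOY_SCHED;
			n->state &= ~(unsigned int)TOY_DISABLE;
			return true;
		}
	}
	return false;	/* the real call would keep sleeping here */
}
```

In this model nothing ever clears a stuck SCHED bit, so both functions give up in the same way.  That is the reason swapping one call for the other is not expected to make a hang appear or disappear by itself.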
> The disappearing race may be due to another change rather than
> napi_disable vs napi_synchronize. A smaller, more targeted patch could
> also be a net (instead of net-next) candidate.

I am certain it is.

I can tell you that we have seen a hang (after I think 2500+
suspend/resume cycles) with the IPA code that is currently
upstream.

But with this latest series of 9, there is no hang after
10,000+ cycles.  That gives me a bisect window, but I really
don't want to go through a full bisect of even those 9,
because it's 4 tests, each of which takes days to complete.

Looking at the 9 patches, I think this one is the most
likely culprit:
  net: ipa: disable IEOB interrupt after channel stop

I think the race involves the I/O completion handler
interacting with NAPI in an unwanted way, but I have
not come up with the exact sequence that would lead
to getting stuck in napi_disable().

Here are some possible events that could occur on an
RX channel in *some* order, prior to that patch.  And
in the order I show there's at least a problem of a
receive not being processed immediately.

                . . .  (suspend initiated)

        replenish_stop()
        quiesce()
                                IRQ fires (receive ready)
        napi_disable()
                                napi_schedule() (ignored)
        irq_disable()
                                IRQ condition; pending
        channel_stop()

                . . .  (resume triggered)

        channel_start()
        irq_enable()
                                pending IRQ fires
                                napi_schedule() (ignored)
        napi_enable()

                . . .  (suspend initiated)

>> At this point
>> it's more about the whole set of rework here, and keeping
>> NAPI enabled during suspend seems a little cleaner.
>
> I'm not sure. I haven't looked if there is a common behavior across
> devices. That might be informative. igb, for one, releases all
> resources.

I tried to do a survey of that too and did not see a
consistent pattern.  I didn't look *that* hard because
doing so would be more involved than I wanted to get.
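
The RX-channel event ordering listed earlier can be replayed as a small userspace model, to show where the receive gets stranded.  This is only a sketch under stated assumptions: struct toy_chan and the toy_* helpers are hypothetical, "ignored" is modeled as napi_schedule() doing nothing while NAPI is disabled, and the napi_enable_first variant is an illustrative alternative ordering, not a description of what any patch in the series does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified channel state; not the IPA driver's structures. */
struct toy_chan {
	bool napi_enabled;	/* false between napi_disable() and napi_enable() */
	bool irq_enabled;
	bool irq_pending;	/* IRQ condition latched while the IRQ is disabled */
	int rx_ready;		/* receives waiting to be polled */
	int polls_scheduled;	/* napi_schedule() calls that took effect */
};

/* Toy napi_schedule(): silently ignored while NAPI is disabled. */
static void toy_napi_schedule(struct toy_chan *c)
{
	if (c->napi_enabled)
		c->polls_scheduled++;
}

/* Toy irq_enable(): a latched IRQ fires at once and tries to schedule NAPI. */
static void toy_irq_enable(struct toy_chan *c)
{
	assert(!c->irq_enabled);
	c->irq_enabled = true;
	if (c->irq_pending) {
		c->irq_pending = false;
		toy_napi_schedule(c);
	}
}

/* Replay one suspend/resume cycle and return the final channel state. */
static struct toy_chan toy_replay(bool napi_enable_first)
{
	struct toy_chan c = { .napi_enabled = true, .irq_enabled = true };

	/* suspend: replenish_stop() and quiesce() already done */
	c.rx_ready++;			/* IRQ fires (receive ready) */
	c.napi_enabled = false;		/* napi_disable() */
	toy_napi_schedule(&c);		/* napi_schedule() (ignored) */
	c.irq_enabled = false;		/* irq_disable() */
	c.irq_pending = true;		/* IRQ condition; pending */
					/* channel_stop() */

	/* resume: channel_start() */
	if (napi_enable_first)
		c.napi_enabled = true;	/* hypothetical reordering */
	toy_irq_enable(&c);		/* pending IRQ fires */
	c.napi_enabled = true;		/* napi_enable() */

	return c;
}
```

With toy_replay(false), the model ends with one receive ready and no poll scheduled, so nothing handles that receive until some later interrupt arrives.  With toy_replay(true), the pending IRQ schedules a poll.  The model says nothing about how napi_disable() itself could get stuck; it only shows that the outcome depends on the ordering.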

So in summary:
- I'm putting together version 2 of this series now
- Testing this past week seems to show that this series
  makes the hang in napi_disable() (or synchronize)
  go away.
- I think the most likely patch in this series that
  fixes the problem is the IRQ ordering one I mention
  above, but right now I can't cite a specific sequence
  of events that would prove it.
- I will begin some long testing later today without
  this last patch applied
    --> But I think testing without the IRQ ordering
        patch would be more promising, and I'd like
        to hear your opinion on that

Thanks again for your input and help on this.

                                        -Alex

. . .