Date: Thu, 12 Aug 2021 11:40:01 -0400
From: "J. Bruce Fields"
To: Olga Kornievskaia
Cc: Chuck Lever III, Bruce Fields, Timo Rothenpieler,
 Linux NFS Mailing List, Dai Ngo
Subject: Re: Spurious instability with NFSoRDMA under moderate load
Message-ID: <20210812154001.GB9536@fieldses.org>
References: <5DD80ADC-0A4B-4D95-8CF7-29096439DE9D@oracle.com>
 <0444ca5c-e8b6-1d80-d8a5-8469daa74970@rothenpieler.org>
 <3AF4F6CA-8B17-4AE9-82E2-21A2B9AA0774@oracle.com>
 <95DB2B47-F370-4787-96D9-07CE2F551AFD@oracle.com>
 <20210811201435.GA31574@fieldses.org>
X-Mailing-List: linux-nfs@vger.kernel.org

On Wed, Aug 11, 2021 at 04:40:04PM -0400, Olga Kornievskaia wrote:
> On Wed, Aug 11, 2021 at 4:14 PM J. Bruce Fields wrote:
> >
> > On Wed, Aug 11, 2021 at 08:01:30PM +0000, Chuck Lever III wrote:
> > > Probably not just CB_RECALL, but agreed, there doesn't seem to
> > > be any mechanism that can re-drive callback operations when the
> > > backchannel is replaced.
> >
> > The nfsd4_queue_cb() in nfsd4_cb_release() should queue a work item
> > to run nfsd4_run_cb_work, which should set up another callback client if
> > necessary.

But I think the result is it'll look to see if there's another
connection available for callbacks, and give up immediately if not.
There's no logic to wait for the client to fix the problem.

> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index 7325592b456e..ed0e76f7185c 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -1191,6 +1191,7 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata)
>                 case -ETIMEDOUT:
>                 case -EACCES:
>                         nfsd4_mark_cb_down(clp, task->tk_status);
> +                       cb->cb_need_restart = true;
>                 }
>                 break;
>         default:
>
> Something like this should requeue and retry the callback?

I think we'd need more than just that.

--b.