From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 019B81E888 for ; Mon, 1 Apr 2024 12:49:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711975792; cv=none; b=W3KheZztHcqTtUx+PykuiDNk+QoyT18YW8GJ6O+6yMyiZPffLY91Z33G0xQC0SkXORaBZmApsMHSzi+qcLc7WQKhf4G0LS1LHxyTD+8yP9O6wVMw8mLTD99wI6Ufkw9sGrquttHROg5ZN7lNiv+6q0UG6bRNg1oDb4YwMMps8x4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711975792; c=relaxed/simple; bh=lgz5YepQvaONK8JCHBTP8InjXGMT4xe42+ZcpcQ2oRA=; h=Message-ID:Subject:From:To:Cc:Date:In-Reply-To:References: Content-Type:MIME-Version; b=N41g6SXzqIWCHUrNja2ySEEIB2uYOGYq7F4uLe/sR7dF9IFsowtCoqMVV3mEG0JqofSK1N51WYoyM707wA6nL/7TyzeLgfH9hgZ6y9+9Y8EH1DQzXu6wpnPKrgeUnCQGbqtTlvVD1hi4NPcmdtB5bV3QamrmXMFNZKcN1cOiygY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=SzTF6HoR; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="SzTF6HoR" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 15C20C433F1; Mon, 1 Apr 2024 12:49:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1711975791; bh=lgz5YepQvaONK8JCHBTP8InjXGMT4xe42+ZcpcQ2oRA=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=SzTF6HoRh8DNJYPtifxiTkoPUVnAfwhp4m/7Sx+nMJ2PZlbilj0Zl9ZYnRlKMAjgX IbRA4po4NntWv0j/aKIbs1n/FaGZzgS363h5wEt6UZ85pti/Chb28bZg8aOILsct21 SdusGwaD6lDKvWWMGr5d2Cm2fVRog1hTHeGREEO9MVgCz6wVpfdL912EAfV3AKrk3Z W9kgxscDt23gAfmd3+URhQAS02FOkw0jNByEl0mVWEuFvbRRXML/h2etGrqoO0G2Ez Cwx+hisSflKbskfhWI6DiRNYLViWmwmCXNvcF8pC7vFywbICegbZPWufXntFdbC912 z+dp1dV27L5XQ== Message-ID: <039c7e38b70287541849ab03d92818740fdf5d43.camel@kernel.org> Subject: Re: [PATCH 1/1] NFSD: cancel CB_RECALL_ANY call when nfs4_client is about to be destroyed From: Jeff Layton To: Dai Ngo , Chuck Lever Cc: linux-nfs@vger.kernel.org Date: Mon, 01 Apr 2024 08:49:49 -0400 In-Reply-To: <84d6311e-a0a3-4fc6-a87e-e09857c765b3@oracle.com> References: <1711476809-26248-1-git-send-email-dai.ngo@oracle.com> <69914825-e9d5-4859-a5a8-60d17e8e8bf6@oracle.com> <88fac8af-c194-452b-94eb-7658b9056246@oracle.com> <97387ec5-bcb4-4c5e-81af-a0038f7fc311@oracle.com> <09da882f-4bad-4398-a2d0-ef8daefce374@oracle.com> <84d6311e-a0a3-4fc6-a87e-e09857c765b3@oracle.com> Autocrypt: addr=jlayton@kernel.org; prefer-encrypt=mutual; keydata=mQINBE6V0TwBEADXhJg7s8wFDwBMEvn0qyhAnzFLTOCHooMZyx7XO7dAiIhDSi7G1NPxwn8jdFUQMCR/GlpozMFlSFiZXiObE7sef9rTtM68ukUyZM4pJ9l0KjQNgDJ6Fr342Htkjxu/kFV1WvegyjnSsFt7EGoDjdKqr1TS9syJYFjagYtvWk/UfHlW09X+jOh4vYtfX7iYSx/NfqV3W1D7EDi0PqVT2h6v8i8YqsATFPwO4nuiTmL6I40ZofxVd+9wdRI4Db8yUNA4ZSP2nqLcLtFjClYRBoJvRWvsv4lm0OX6MYPtv76hka8lW4mnRmZqqx3UtfHX/hF/zH24Gj7A6sYKYLCU3YrI2Ogiu7/ksKcl7goQjpvtVYrOOI5VGLHge0awt7bhMCTM9KAfPc+xL/ZxAMVWd3NCk5SamL2cE99UWgtvNOIYU8m6EjTLhsj8snVluJH0/RcxEeFbnSaswVChNSGa7mXJrTR22lRL6ZPjdMgS2Km90haWPRc8Wolcz07Y2se0xpGVLEQcDEsvv5IMmeMe1/qLZ6NaVkNuL3WOXvxaVT9USW1+/SGipO2IpKJjeDZfehlB/kpfF24+RrK+seQfCBYyUE8QJpvTZyfUHNYldXlrjO6n5MdOempLqWpfOmcGkwnyNRBR46g/jf8KnPRwXs509yAqDB6sELZH+yWr9LQZEwARAQABtCVKZWZmIExheXRvbiA8amxheXRvbkBwb29jaGllcmVkcy5uZXQ+iQI7BBMBAgAlAhsDBgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIXgAUCTpXWPAIZAQAKCRAADmhBGVaCFc65D/4gBLNMHopQYgG/9RIM3kgFCCQV0pLv0hcg1cjr+bPI5f1PzJoOVi9s0wBDHwp8+vtHgYhM54yt43uI7Htij0RHFL5eFqoVT4TSfAg2qlvNemJEOY0e4daljjmZM7UtmpGs9NN0r9r50W82eb5Kw5bc/ r0kmR/arUS2st+ecRsCnwAOj6HiURwIgfDMHGPtSkoPpu3DDp/cjcYUg3HaOJuTjtGHFH963B+f+hyQ2BrQZBBE76ErgTDJ2Db9Ey0kw7VEZ4I2nnVUY9B5dE2pJFVO5HJBMp30fUGKvwaKqYCU2iAKxdmJXRIONb7dSde8LqZahuunPDMZyMA5+mkQl7kpIpR6kVDIiqmxzRuPeiMP7O2FCUlS2DnJnRVrHmCljLkZWf7ZUA22wJpepBligemtSRSbqCyZ3B48zJ8g5B8xLEntPo/NknSJaYRvfEQqGxgk5kkNWMIMDkfQOlDSXZvoxqU9wFH/9jTv1/6p8dHeGM0BsbBLMqQaqnWiVt5mG92E1zkOW69LnoozE6Le+12DsNW7RjiR5K+27MObjXEYIW7FIvNN/TQ6U1EOsdxwB8o//Yfc3p2QqPr5uS93SDDan5ehH59BnHpguTc27XiQQZ9EGiieCUx6Zh2ze3X2UW9YNzE15uKwkkuEIj60NvQRmEDfweYfOfPVOueC+iFifbQgSmVmZiBMYXl0b24gPGpsYXl0b25AcmVkaGF0LmNvbT6JAjgEEwECACIFAk6V0q0CGwMGCwkIBwMCBhUIAgkKCwQWAgMBAh4BAheAAAoJEAAOaEEZVoIViKUQALpvsacTMWWOd7SlPFzIYy2/fjvKlfB/Xs4YdNcf9qLqF+lk2RBUHdR/dGwZpvw/OLmnZ8TryDo2zXVJNWEEUFNc7wQpl3i78r6UU/GUY/RQmOgPhs3epQC3PMJj4xFx+VuVcf/MXgDDdBUHaCTT793hyBeDbQuciARDJAW24Q1RCmjcwWIV/pgrlFa4lAXsmhoac8UPc82Ijrs6ivlTweFf16VBc4nSLX5FB3ls7S5noRhm5/Zsd4PGPgIHgCZcPgkAnU1S/A/rSqf3FLpU+CbVBDvlVAnOq9gfNF+QiTlOHdZVIe4gEYAU3CUjbleywQqV02BKxPVM0C5/oVjMVx 3bri75n1TkBYGmqAXy9usCkHIsG5CBHmphv9MHmqMZQVsxvCzfnI5IO1+7MoloeeW/lxuyd0pU88dZsV/riHw87i2GJUJtVlMl5IGBNFpqoNUoqmvRfEMeXhy/kUX4Xc03I1coZIgmwLmCSXwx9MaCPFzV/dOOrju2xjO+2sYyB5BNtxRqUEyXglpujFZqJxxau7E0eXoYgoY9gtFGsspzFkVNntamVXEWVVgzJJr/EWW0y+jNd54MfPRqH+eCGuqlnNLktSAVz1MvVRY1dxUltSlDZT7P2bUoMorIPu8p7ZCg9dyX1+9T6Muc5dHxf/BBP/ir+3e8JTFQBFOiLNdFtB9KZWZmIExheXRvbiA8amxheXRvbkBzYW1iYS5vcmc+iQI4BBMBAgAiBQJOldK9AhsDBgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIXgAAKCRAADmhBGVaCFWgWD/0ZRi4hN9FK2BdQs9RwNnFZUr7JidAWfCrs37XrA/56olQl3ojn0fQtrP4DbTmCuh0SfMijB24psy1GnkPepnaQ6VRf7Dxg/Y8muZELSOtsv2CKt3/02J1BBitrkkqmHyni5fLLYYg6fub0T/8Kwo1qGPdu1hx2BQRERYtQ/S5d/T0cACdlzi6w8rs5f09hU9Tu4qV1JLKmBTgUWKN969HPRkxiojLQziHVyM/weR5Reu6FZVNuVBGqBD+sfk/c98VJHjsQhYJijcsmgMb1NohAzwrBKcSGKOWJToGEO/1RkIN8tqGnYNp2G+aR685D0chgTl1WzPRM6mFG1+n2b2RR95DxumKVpwBwdLPoCkI24JkeDJ7lXSe3uFWISstFGt0HL8EewP8RuGC8s5h7Ct91HMNQTbjgA+Vi1foWUVXpEintAKgoywaIDlJfTZIl6Ew8ETN/7DLy8bXYgq0XzhaKg3CnOUuGQV5/nl4OAX/3jocT5Cz/OtAiNYj5mLPeL5z2ZszjoCAH6caqsF2oLyA nLqRgDgR+wTQT6gMhr2IRsl+cp8gPHBwQ4uZMb+X00c/Amm9VfviT+BI7B66cnC7Zv6Gvmtu2rEjWDGWPqUgccB7hdMKnKDthkA227/82tYoFiFMb/NwtgGrn5n2vwJyKN6SEoygGrNt0SI84y6hEVbQlSmVmZiBMYXl0b24gPGpsYXl0b25AcHJpbWFyeWRhdGEuY29tPokCOQQTAQIAIwUCU4xmKQIbAwcLCQgHAwIBBhUIAgkKCwQWAgMBAh4BAheAAAoJEAAOaEEZVoIV1H0P/j4OUTwFd7BBbpoSp695qb6HqCzWMuExsp8nZjruymMaeZbGr3OWMNEXRI1FWNHMtcMHWLP/RaDqCJil28proO+PQ/yPhsr2QqJcW4nr91tBrv/MqItuAXLYlsgXqp4BxLP67bzRJ1Bd2x0bWXurpEXY//VBOLnODqThGEcL7jouwjmnRh9FTKZfBDpFRaEfDFOXIfAkMKBa/c9TQwRpx2DPsl3eFWVCNuNGKeGsirLqCxUg5kWTxEorROppz9oU4HPicL6rRH22Ce6nOAON2vHvhkUuO3GbffhrcsPD4DaYup4ic+DxWm+DaSSRJ+e1yJvwi6NmQ9P9UAuLG93S2MdNNbosZ9P8k2mTOVKMc+GooI9Ve/vH8unwitwo7ORMVXhJeU6Q0X7zf3SjwDq2lBhn1DSuTsn2DbsNTiDvqrAaCvbsTsw+SZRwF85eG67eAwouYk+dnKmp1q57LDKMyzysij2oDKbcBlwB/TeX16p8+LxECv51asjS9TInnipssssUDrHIvoTTXWcz7Y5wIngxDFwT8rPY3EggzLGfK5Zx2Q5S/N0FfmADmKknG/D8qGIcJE574D956tiUDKN4I+/g125ORR1v7bP+OIaayAvq17RP+qcAqkxc0x8iCYVCYDouDyNvWPGRhbLUO7mlBpjW9jK9e2fvZY9iw3QzIPGKtClKZWZmIExheXRvbiA8amVmZi5sYXl0 b25AcHJpbWFyeWRhdGEuY29tPokCOQQTAQIAIwUCU4xmUAIbAwcLCQgHAwIBBhUIAgkKCwQWAgMBAh4BAheAAAoJEAAOaEEZVoIVzJoQALFCS6n/FHQS+hIzHIb56JbokhK0AFqoLVzLKzrnaeXhE5isWcVg0eoV2oTScIwUSUapy94if69tnUo4Q7YNt8/6yFM6hwZAxFjOXR0ciGE3Q+Z1zi49Ox51yjGMQGxlakV9ep4sV/d5a50M+LFTmYSAFp6HY23JN9PkjVJC4PUv5DYRbOZ6Y1+TfXKBAewMVqtwT1Y+LPlfmI8dbbbuUX/kKZ5ddhV2736fgyfpslvJKYl0YifUOVy4D1G/oSycyHkJG78OvX4JKcf2kKzVvg7/Rnv+AueCfFQ6nGwPn0P91I7TEOC4XfZ6a1K3uTp4fPPs1Wn75X7K8lzJP/p8lme40uqwAyBjk+IA5VGd+CVRiyJTpGZwA0jwSYLyXboX+Dqm9pSYzmC9+/AE7lIgpWj+3iNisp1SWtHc4pdtQ5EU2SEz8yKvDbD0lNDbv4ljI7eflPsvN6vOrxz24mCliEco5DwhpaaSnzWnbAPXhQDWb/lUgs/JNk8dtwmvWnqCwRqElMLVisAbJmC0BhZ/Ab4sph3EaiZfdXKhiQqSGdK4La3OTJOJYZphPdGgnkvDV9Pl1QZ0ijXQrVIy3zd6VCNaKYq7BAKidn5g/2Q8oio9Tf4XfdZ9dtwcB+bwDJFgvvDYaZ5bI3ln4V3EyW5i2NfXazz/GA/I/ZtbsigCFc8ftCBKZWZmIExheXRvbiA8amxheXRvbkBrZXJuZWwub3JnPokCOAQTAQIAIgUCWe8u6AIbAwYLCQgHAwIGFQgCCQoLBBYCAwECHgECF4AACgkQAA5oQRlWghUuCg/+Lb/xGxZD2Q1oJVAE37uW308UpVSD2tAMJUvFTdDbfe3zKlPDTuVsyNsALBGclPLagJ5ZTP+Vp2irAN9uwBuac BOTtmOdz4ZN2tdvNgozzuxp4CHBDVzAslUi2idy+xpsp47DWPxYFIRP3M8QG/aNW052LaPc0cedYxp8+9eiVUNpxF4SiU4i9JDfX/sn9XcfoVZIxMpCRE750zvJvcCUz9HojsrMQ1NFc7MFT1z3MOW2/RlzPcog7xvR5ENPH19ojRDCHqumUHRry+RF0lH00clzX/W8OrQJZtoBPXv9ahka/Vp7kEulcBJr1cH5Wz/WprhsIM7U9pse1f1gYy9YbXtWctUz8uvDR7shsQxAhX3qO7DilMtuGo1v97I/Kx4gXQ52syh/w6EBny71CZrOgD6kJwPVVAaM1LRC28muq91WCFhs/nzHozpbzcheyGtMUI2Ao4K6mnY+3zIuXPygZMFr9KXE6fF7HzKxKuZMJOaEZCiDOq0anx6FmOzs5E6Jqdpo/mtI8beK+BE7Va6ni7YrQlnT0i3vaTVMTiCThbqsB20VrbMjlhpf8lfK1XVNbRq/R7GZ9zHESlsa35ha60yd/j3pu5hT2xyy8krV8vGhHvnJ1XRMJBAB/UYb6FyC7S+mQZIQXVeAA+smfTT0tDrisj1U5x6ZB9b3nBg65ke5Ag0ETpXRPAEQAJkVmzCmF+IEenf9a2nZRXMluJohnfl2wCMmw5qNzyk0f+mYuTwTCpw7BE2H0yXk4ZfAuA+xdj14K0A1Dj52j/fKRuDqoNAhQe0b6ipo85Sz98G+XnmQOMeFVp5G1Z7r/QP/nus3mXvtFsu9lLSjMA0cam2NLDt7vx3l9kUYlQBhyIE7/DkKg+3fdqRg7qJoMHNcODtQY+n3hMyaVpplJ/l0DdQDbRSZi5AzDM3DWZEShhuP6/E2LN4O3xWnZukEiz688d1ppl7vBZO9wBql6Ft9Og74diZrTN6lXGGjEWRvO55h6ijMsLCLNDRAVehPhZvSlPldtUuvhZLAjdWpwmzbRIwgoQcO51aWeKthpcpj8feDdKdlVjvJO9fgFD5kqZ QiErRVPpB7VzA/pYV5Mdy7GMbPjmO0IpoL0tVZ8JvUzUZXB3ErS/dJflvboAAQeLpLCkQjqZiQ/DCmgJCrBJst9Xc7YsKKS379Tc3GU33HNSpaOxs2NwfzoesyjKU+P35czvXWTtj7KVVSj3SgzzFk+gLx8y2Nvt9iESdZ1Ustv8tipDsGcvIZ43MQwqU9YbLg8k4V9ch+Mo8SE+C0jyZYDCE2ZGf3OztvtSYMsTnF6/luzVyej1AFVYjKHORzNoTwdHUeC+9/07GO0bMYTPXYvJ/vxBFm3oniXyhgb5FtABEBAAGJAh8EGAECAAkFAk6V0TwCGwwACgkQAA5oQRlWghXhZRAAyycZ2DDyXh2bMYvI8uHgCbeXfL3QCvcw2XoZTH2l2umPiTzrCsDJhgwZfG9BDyOHaYhPasd5qgrUBtjjUiNKjVM+Cx1DnieR0dZWafnqGv682avPblfi70XXr2juRE/fSZoZkyZhm+nsLuIcXTnzY4D572JGrpRMTpNpGmitBdh1l/9O7Fb64uLOtA5Qj5jcHHOjL0DZpjmFWYKlSAHmURHrE8M0qRryQXvlhoQxlJR4nvQrjOPMsqWD5F9mcRyowOzr8amasLv43w92rD2nHoBK6rbFE/qC7AAjABEsZq8+TQmueN0maIXUQu7TBzejsEbV0i29z+kkrjU2NmK5pcxgAtehVxpZJ14LqmN6E0suTtzjNT1eMoqOPrMSx+6vOCIuvJ/MVYnQgHhjtPPnU86mebTY5Loy9YfJAC2EVpxtcCbx2KiwErTndEyWL+GL53LuScUD7tW8vYbGIp4RlnUgPLbqpgssq2gwYO9m75FGuKuB2+2bCGajqalid5nzeq9v7cYLLRgArJfOIBWZrHy2m0C+pFu9DSuV6SNr2dvMQUv1V58h0FaSOxHVQnJdnoHn13g/CKKvyg2EMrMt/EfcXgvDwQbnG9we4xJiWOIOcsvrWcB6C6lWBDA+In7w7SXnnok kZWuOsJdJQdmwlWC5L5ln9xgfr/4mOY38B0U= Content-Type: text/plain; charset="ISO-8859-15" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.50.4 (3.50.4-1.fc39) Precedence: bulk X-Mailing-List: linux-nfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 On Sat, 2024-03-30 at 16:30 -0700, Dai Ngo wrote: > On 3/30/24 11:28 AM, Chuck Lever wrote: > > On Sat, Mar 30, 2024 at 10:46:08AM -0700, Dai Ngo wrote: > > > On 3/29/24 4:42 PM, Chuck Lever wrote: > > > > On Fri, Mar 29, 2024 at 10:57:22AM -0700, Dai Ngo wrote: > > > > > On 3/29/24 7:55 AM, Chuck Lever wrote: > > > > > > On Thu, Mar 28, 2024 at 05:31:02PM -0700, Dai Ngo wrote: > > > > > > > On 3/28/24 11:14 AM, Dai Ngo wrote: > > > > > > > > On 3/28/24 7:08 AM, Chuck Lever wrote: > > > > > > > > > On Wed, Mar 27, 2024 at 06:09:28PM -0700, Dai Ngo wrote: > > > > > > > > > > On 3/26/24 11:27 AM, Chuck Lever wrote: > > > > > > > > > > > On Tue, Mar 26, 2024 at 11:13:29AM -0700, Dai Ngo wro= te: > > > > > > > > > > > > Currently when a nfs4_client is destroyed we wait f= or > > > > > > > > > > > > the cb_recall_any > > > > > > > > > > > > callback to complete before proceed. This adds > > > > > > > > > > > > unnecessary delay to the > > > > > > > > > > > > __destroy_client call if there is problem communica= ting > > > > > > > > > > > > with the client. > > > > > > > > > > > By "unnecessary delay" do you mean only the seven-sec= ond RPC > > > > > > > > > > > retransmit timeout, or is there something else? > > > > > > > > > > when the client network interface is down, the RPC task= takes ~9s to > > > > > > > > > > send the callback, waits for the reply and gets ETIMEDO= UT. This process > > > > > > > > > > repeats in a loop with the same RPC task before being s= topped by > > > > > > > > > > rpc_shutdown_client after client lease expires. > > > > > > > > > I'll have to review this code again, but rpc_shutdown_cli= ent > > > > > > > > > should cause these RPCs to terminate immediately and safe= ly. Can't > > > > > > > > > we use that? > > > > > > > > rpc_shutdown_client works, it terminated the RPC call to st= op the loop. > > > > > > > >=20 > > > > > > > > > > It takes a total of about 1m20s before the CB_RECALL is= terminated. > > > > > > > > > > For CB_RECALL_ANY and CB_OFFLOAD, this process gets in = to a infinite > > > > > > > > > > loop since there is no delegation conflict and the clie= nt is allowed > > > > > > > > > > to stay in courtesy state. > > > > > > > > > >=20 > > > > > > > > > > The loop happens because in nfsd4_cb_sequence_done if c= b_seq_status > > > > > > > > > > is 1 (an RPC Reply was never received) it calls nfsd4_m= ark_cb_fault > > > > > > > > > > to set the NFSD4_CB_FAULT bit. It then sets cb_need_res= tart to true. > > > > > > > > > > When nfsd4_cb_release is called, it checks cb_need_rest= art bit and > > > > > > > > > > re-queues the work again. > > > > > > > > > Something in the sequence_done path should check if the s= erver is > > > > > > > > > tearing down this callback connection. If it doesn't, tha= t is a bug > > > > > > > > > IMO. > > > > > > > TCP terminated the connection after retrying for 16 minutes a= nd > > > > > > > notified the RPC layer which deleted the nfsd4_conn. > > > > > > The server should have closed this connection already. > > > > > Since there is no delegation conflict, the client remains in cour= tesy > > > > > state. > > > > >=20 > > > > > > Is it stuck > > > > > > waiting for the client to respond to a FIN or something? > > > > > No, in this case since the client network interface was down serv= er > > > > > TCP did not even receive ACK for the transmit so the server TCP > > > > > kept retrying. > > > > It sounds like the server attempts to maintain the client's > > > > transport while the client is in courtesy state? > > > Yes, TCP keeps retryingwhile the client is in courtesy state. > > So the client hasn't sent a lease-renewing operation recently, but > > the connection is still there. It then makes sense for the server to > > forcibly close the connection when a client transitions to the > > courtesy state, since that connection instance is suspect. >=20 > yes, this makes sense. This would make the RPC to stop within a > lease period. >=20 > I'll submit a patch to kill the back channel connection if there > is RPC pending and the client is about to enter courtesy state. >=20 > >=20 > > The server would then recognize immediately that it shouldn't post > > any new backchannel operations to that client until it gets a fresh > > connection attempt from it. >=20 > Currently deleg_reaper does not issue CB_RECALL_ANY for courtesy clients. >=20 > >=20 > >=20 > > > > I thought the > > > > purpose of courteous server was to more gracefully handle network > > > > partitions, in which case, the transport is not reliable. > > > >=20 > > > > If a courtesy client hasn't re-established a connection with a > > > > backchannel by the time a conflicting open/lock request arrives, > > > > the server has no choice but to expire that client's courtesy > > > > state immediately. Right? > > > Yes, that is the case for CB_RECALL but not for CB_RECALL_ANY. > > CB_RECALL_ANY is done by a shrinker, yes? >=20 > CB_RECALL_ANY is done by the memory shrinker or when > num_delegations >=3D max_delegations. >=20 > >=20 > > Instead of attempting to send a CB_RECALL_ANY to a courtesy client > > which is likely to be unresponsive, why not just expire the oldest > > courtesy client the server has? Or... if NFSD /already/ expires the > > oldest courtesy client as a result of memory pressure, then just > > don't ever send CB_RECALL_ANY to courtesy clients. >=20 > Currently deleg_reaper does not issue CB_RECALL_ANY for courtesy clients. >=20 > >=20 > > As soon as a courtesy client reconnects, it will send a lease- > > renewing operation, transition back to an active state, and at that > > point it re-qualifies for getting CB_RECALL_ANY. > >=20 > >=20 > > > > But maybe that's a side-bar. > > > >=20 > > > >=20 > > > > > > > But when nfsd4_run_cb_work ran again, it got into the infinit= e > > > > > > > loop caused by: > > > > > > > /* > > > > > > > * XXX: Ideally, we could wait for the client to > > > > > > > * reconnect, but I haven't figured out how > > > > > > > * to do that yet. > > > > > > > */ > > > > > > > nfsd4_queue_cb_delayed(cb, 25); > > > > > > >=20 > > > > > > > which was introduced by c1ccfcf1a9bf. Note that I'm using 6.9= -rc1. > > > > > > The whole paragraph is: > > > > > >=20 > > > > > > 1503 clnt =3D clp->cl_cb_client; > > > > > > 1504 if (!clnt) { > > > > > > 1505 if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->c= l_flags)) > > > > > > 1506 nfsd41_destroy_cb(cb); > > > > > > 1507 else { > > > > > > 1508 /* > > > > > > 1509 * XXX: Ideally, we could wait for= the client to > > > > > > 1510 * reconnect, but I haven't f= igured out how > > > > > > 1511 * to do that yet. > > > > > > 1512 */ > > > > > > 1513 nfsd4_queue_cb_delayed(cb, 25); > > > > > > 1514 } > > > > > > 1515 return; > > > > > > 1516 } > > > > > >=20 > > > > > > When there's no rpc_clnt and CB_KILL is set, the callback > > > > > > operation should be destroyed immediately. CB_KILL is set by > > > > > > nfsd4_shutdown_callback. It's only caller is > > > > > > __destroy_client(). > > > > > >=20 > > > > > > Why isn't NFSD4_CLIENT_CB_KILL set? > > > > > Since __destroy_client was not called, for the reason mentioned a= bove, > > > > > nfsd4_shutdown_callback was not called so NFSD4_CLIENT_CB_KILL wa= s not > > > > > set. > > > > Well, then I'm missing something. You said, above: > > > >=20 > > > > > Currently when a nfs4_client is destroyed we wait for > > > > And __destroy_client() invokes nfsd4_shutdown_callback(). Can you > > > > explain the usage scenario(s) to me again? > > > __destroy_client is not called in the case of CB_RECALL_ANY since > > > there is no open/lock conflict associated the the client. > > > > > Since the nfsd callback_wq was created with alloc_ordered_workque= ue, > > > > > once this loop happens the nfsd callback_wq is stuck since this w= orkqueue > > > > > executes at most one work item at any given time. > > > > >=20 > > > > > This means that if a nfs client is shutdown and the server is in = this > > > > > state then all other clients are effected; all callbacks to other > > > > > clients are stuck in the workqueue. And if a callback for a clien= t is > > > > > stuck in the workqueue then that client can not unmount its share= s. > > > > >=20 > > > > > This is the symptom that was reported recently on this list. I th= ink > > > > > the nfsd callback_wq should be created as a normal workqueue that= allows > > > > > multiple work items to be executed at the same time so a problem = with > > > > > one client does not effect other clients. > > > > My impression is that the callback_wq is single-threaded to avoid > > > > locking complexity. I'm not yet convinced it would be safe to simpl= y > > > > change that workqueue to permit multiple concurrent work items. > > > Do you have any specific concern about lock complexity related to > > > callback operation? > > If there needs to be a fix, I'd like something for v6.9-rc, so it > > needs to be a narrow change. Larger infrastructural changes have to > > go via a full merge window. > >=20 > >=20 > > > > It could be straightforward, however, to move the callback_wq into > > > > struct nfs4_client so that each client can have its own workqueue. > > > > Then we can take some time and design something less brittle and > > > > more scalable (and maybe come up with some test infrastructure so > > > > this stuff doesn't break as often). > > > IMHO I don't see why the callback workqueue has to be different > > > from the laundry_wq or nfsd_filecache_wq used by nfsd. > > You mean, you don't see why callback_wq has to be ordered, while > > the others are not so constrained? Quite possibly it does not have > > to be ordered. >=20 > Yes, I looked at the all the nfsd4_callback_ops on nfsd and they > seems to take into account of concurrency and use locks appropriately. > For each type of work I don't see why one work has to wait for > the previous work to complete before proceed. >=20 > >=20 > > But we might have lost the bit of history that explains why, so > > let's be cautious about making broad changes here until we have a > > good operational understanding of the code and some robust test > > cases to check any changes we make. >=20 > Understand, you make the call. >=20 commit 88382036674770173128417e4c09e9e549f82d54 Author: J. Bruce Fields Date: Mon Nov 14 11:13:43 2016 -0500 nfsd: update workqueue creation =20 No real change in functionality, but the old interface seems to be deprecated. =20 We don't actually care about ordering necessarily, but we do depend on running at most one work item at a time: nfsd4_process_cb_update() assumes that no other thread is running it, and that no new callbacks are starting while it's running. =20 Reviewed-by: Jeff Layton Signed-off-by: J. Bruce Fields ...so it may be as simple as just fixing up nfsd4_process_cb_update(). Allowing parallel recalls would certainly be a good thing. That said, a workqueue per client would be a great place to start. I don't see any reason to serialize callbacks to different clients. -- Jeff Layton