From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mx0b-00069f02.pphosted.com (mx0b-00069f02.pphosted.com [205.220.177.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6883222079 for ; Sat, 30 Mar 2024 23:30:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=fail smtp.client-ip=205.220.177.32 ARC-Seal:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711841417; cv=fail; b=VfGSKhP/50QVRXdout48JTg6fxVlvMVEc473gZzIDwRsFU0c9oOqejusiWCJ1iW7aNkMWhW9vAshFWj/oglpgr4P9xRAmvav+xziqOJA/y4y8TokCXeo6pIWqZ30oTlXKI3lWSGYeKw+UCg29UZD4nvEcGwSQh7SMv8ZnfF5vSg= ARC-Message-Signature:i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711841417; c=relaxed/simple; bh=ncZsIRhcNKipoP2U63Xs2QR4TTiOZRz5UCmfsZFfFQw=; h=Message-ID:Date:Subject:To:Cc:References:From:In-Reply-To: Content-Type:MIME-Version; b=HSpW973GeaooGcIDaqQromRsAtQWF4zrBlZrhuCKaQwuSlwu0ExmwnfRLZskIJ/i7FZXk/d+DXEHEB6NtgafWGLMIj21EsIjArmsiHY0+++zmkNqsKYvFn/7v0xTmf3IPWT+Ao9GxurUHCqMlZ+/uT3i426uNs/ul9ofJM34HSQ= ARC-Authentication-Results:i=2; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=oracle.com; spf=pass smtp.mailfrom=oracle.com; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b=QDi2wWJh; dkim=pass (1024-bit key) header.d=oracle.onmicrosoft.com header.i=@oracle.onmicrosoft.com header.b=0PFF5Xcn; arc=fail smtp.client-ip=205.220.177.32 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=oracle.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=oracle.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="QDi2wWJh"; dkim=pass (1024-bit key) header.d=oracle.onmicrosoft.com header.i=@oracle.onmicrosoft.com header.b="0PFF5Xcn" Received: from pps.filterd (m0246631.ppops.net [127.0.0.1]) by mx0b-00069f02.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 42UNUB2I019787; Sat, 30 Mar 2024 23:30:11 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=message-id : date : subject : to : cc : references : from : in-reply-to : content-type : content-transfer-encoding : mime-version; s=corp-2023-11-20; bh=ToK22UqOI5C609DJAEIyZYTnOKR+jvmKkDrEFoLFSD0=; b=QDi2wWJhwOU2RN1ro8tQApUUCH+Y8SIvporuuptAW7Fp2nTdYJxB0ERCiA+MMEPekvsP PNb9RybhYc2+VfWgY9QHfq0mXVSzmN2+vMB5+2SuXGNuznbTiE/oLvy+YQovP1MIjzUj XEnAQC8HFWDqw3MnvfeJoS9Y+AwfsVWwSG319TL7sNKzLPfcgjgtMPxLW3d4toqftWux +ll2MK9lC183ulPPu4tmdw3FC7YGb86zRG2zokZ3FULyum3EJdfuLv0sXxRVqcTBmlUb uHxsLORY5XT1wD/3TYkz99Q6kbN78b6SDrSuRQU7qntRAktfzSUyVIjp945tlGj8U6aJ 4w== Received: from phxpaimrmta01.imrmtpd1.prodappphxaev1.oraclevcn.com (phxpaimrmta01.appoci.oracle.com [138.1.114.2]) by mx0b-00069f02.pphosted.com (PPS) with ESMTPS id 3x69h4gn9f-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sat, 30 Mar 2024 23:30:10 +0000 Received: from pps.filterd (phxpaimrmta01.imrmtpd1.prodappphxaev1.oraclevcn.com [127.0.0.1]) by phxpaimrmta01.imrmtpd1.prodappphxaev1.oraclevcn.com (8.17.1.19/8.17.1.19) with ESMTP id 42ULU7Hc027995; Sat, 30 Mar 2024 23:30:09 GMT Received: from nam12-bn8-obe.outbound.protection.outlook.com (mail-bn8nam12lp2169.outbound.protection.outlook.com [104.47.55.169]) by phxpaimrmta01.imrmtpd1.prodappphxaev1.oraclevcn.com (PPS) with ESMTPS id 3x69643c1b-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sat, 30 Mar 2024 23:30:09 +0000 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=V7NtUDZ2h+z17GZkbiNYTXQlQBOzhUD/UIFevr9mnfyMNeQZYNlAqmNv8Tm+rEC8M6eNj6Ubc9afwHuCt4Ea6Cqjx3ZcZXpJaKiiXn6JSTBq8k+WTMRIwi727l7unWvNOJHawAwUAJ5APEjAIs958b0bTYpMsYYahNPgz5DgYZoTPX1fcIBnIp2K9imMi1ET8BUUNjaL/eADvZFtewffwkRuzBuKXGHgpfKd2sQ5MLNjYF6sa6H8vSZe8500/o+qZtgkv5cF3SFEPT8PDcdLbKN0Dkj9WpRb5J7MPugIZcz5i+2+amt38s05qvlHyABBZs7oE/GjXsDQcnaWd+Xjdw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=ToK22UqOI5C609DJAEIyZYTnOKR+jvmKkDrEFoLFSD0=; b=Z2n7UF2UNp5rH98x3SENo7BRjD8K15Cxg6dK7xzlbRnKST+BDxT+arOh3c1e2UDBoFm0DWeJAJ8PUz5dXFquwtIfKj3ho+sqZ1KZwm1H1c2rLybpDU8HyuT8DxApN4jGAdSZftRT5CEvuyz0AvpCjJe0jaALhchSAAlX4vmvbeSUK4ExsXbPhEDPEdwr5yy2PKDlRa7cjsK5h3JFgYd2zEUy1olVrrGjdFU+e2ZmEDjKTb4vyt++aApX6txJrtllFEo5Iz6AtfWLa6JM9d0JiI91y9QuRDoKpd1PSNhmvn5q/4vv7YAn24F1B6htrtjCKSgtgCKq2JJDYcdR/zxLQw== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=oracle.com; dmarc=pass action=none header.from=oracle.com; dkim=pass header.d=oracle.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.onmicrosoft.com; s=selector2-oracle-onmicrosoft-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=ToK22UqOI5C609DJAEIyZYTnOKR+jvmKkDrEFoLFSD0=; b=0PFF5Xcn/OJXLJiL7hHgt68ymPQihncY3RpyRGBnTeL08dg1Oorc/6nBP3NUwn4epCFaeAFavttqAk0cuOyhjlP5BcCw4pN/Ii6JB0djnXnV+rZCUz+46e35jCZNS5SjxZSRkrkbgp3s99xw9Z9W0F7ko4NGhnhLM+ZBVZj/Mz8= Received: from BY5PR10MB4257.namprd10.prod.outlook.com (2603:10b6:a03:211::21) by IA0PR10MB7372.namprd10.prod.outlook.com (2603:10b6:208:40f::19) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7409.33; Sat, 30 Mar 2024 23:30:07 +0000 Received: from BY5PR10MB4257.namprd10.prod.outlook.com ([fe80::b7ab:75e7:2cf9:efc3]) by BY5PR10MB4257.namprd10.prod.outlook.com ([fe80::b7ab:75e7:2cf9:efc3%3]) with mapi id 15.20.7409.042; Sat, 30 Mar 2024 23:30:07 +0000 Message-ID: <84d6311e-a0a3-4fc6-a87e-e09857c765b3@oracle.com> Date: Sat, 30 Mar 2024 16:30:05 -0700 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 1/1] NFSD: cancel CB_RECALL_ANY call when nfs4_client is about to be destroyed Content-Language: en-US To: Chuck Lever Cc: jlayton@kernel.org, linux-nfs@vger.kernel.org References: <1711476809-26248-1-git-send-email-dai.ngo@oracle.com> <69914825-e9d5-4859-a5a8-60d17e8e8bf6@oracle.com> <88fac8af-c194-452b-94eb-7658b9056246@oracle.com> <97387ec5-bcb4-4c5e-81af-a0038f7fc311@oracle.com> <09da882f-4bad-4398-a2d0-ef8daefce374@oracle.com> From: Dai Ngo In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-ClientProxiedBy: BYAPR06CA0059.namprd06.prod.outlook.com (2603:10b6:a03:14b::36) To BY5PR10MB4257.namprd10.prod.outlook.com (2603:10b6:a03:211::21) Precedence: bulk X-Mailing-List: linux-nfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: BY5PR10MB4257:EE_|IA0PR10MB7372:EE_ X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0; X-Microsoft-Antispam-Message-Info: grcECs83a1B8N6FxHA6CF/AVbP63MygHAQEW4wentr2BswhtZFE2Ure2NwfWtPcpEVICnzdZFSJNw/zgu7AHXiInXPw7zFk4FScrdVhGAlTdzECpgd5hh3eFZsyxbvvmrVhI6gpKYY9vba6ZaaWg/MBSv19j/dXPsBZF0ujEzA0RgyghJXSJjA8VW4qrLzwBDX4V20igPBe6BMqMaytqB99jXb0vtGPC7pxh+NbObb5xERC2lMUNAiF/L3kHTVCq5JJoohFcV/RrHrLwSLNXP3qn+y/VR0FIuSyHyGRunmgiSpp8gsSHP50bDaqlYqRunfyxUxyDuo0pvgPE58XBMxuLlaD5O/J4O+6T/nFMN8zvFPKoEFtTir4PTc2f1mOTLK1W/ejIFUR5fkOC4+dGlYyWz8xeva5hSV1kATzxUAFucFmuQfIAM1n0E7925yIpIoomx00f9XcO+a1Setq3pM/N5fml48ipw62W5qrAPTzg+ko9cEZLKG+6f7tbCBgXFxOLztAh9zaqaST/F7ZQkSC4uEQo02LY1dNx1hDhQdBxBJNLdTTRCJQ9pwENlXEhDcCTNg87XzIV/2YhdQ4g2066fEJkMvt1TwaSRoRXc0AjZsCDyCZYUqdXPFnc2a0JOuovU9EuESZaKWL/JvWc69P0kweJhE+gKY7qIlprYpE= X-Forefront-Antispam-Report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:BY5PR10MB4257.namprd10.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230031)(1800799015)(376005)(366007);DIR:OUT;SFP:1102; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?B?cGdPV3dQdURhbGVFR0JBNTM4M1BuakNVbWZWbVVnRy9KaFVJRmJqaDFOZkJS?= =?utf-8?B?MXlHc2NCYURzMjMvWUtkWW9WWVl4Z210K29QNGMrb1ZUMmFaNlB4aktPWTBR?= =?utf-8?B?c1lrVkpraURqNUxXWVhuUEQ4S0pvZ3pHNTJjUkVGQmZ1ejlKTnVwTHFLa0pB?= =?utf-8?B?Q1RTR2tOUTBVSmlHZnNMR3ZzazZ6OVpBT2l5RmRMK3IzSDJ2MTBPTmFBTHJB?= =?utf-8?B?QnJyNmNJaFBLazV3RGVVNUJyaWVyN1FUMURUdG1jMlBFWFFLZmZXZTh3a2lD?= =?utf-8?B?UTNDMlBkSmd2YXF6UldUYkVrRkQzSGxtc0tmOGFhYlk3Mks2VnVhdmlES2Fv?= =?utf-8?B?MkJ0b2VxNEZFREJxY1dqdVhmcE1wUzVuY0grTTljRUpsNUVwVmhiaUJzSlVL?= =?utf-8?B?aS9HZzRwOHlUcGdGeUFIQmVST2MrdENLN0Vnb3pzNUxMUXVKOEVBOEl3Q1d5?= =?utf-8?B?TnhxcmJmOUxDbXRTbGRCMnlldE5PdDNqNzNDazFZYy9DZFJmQnJVaVNFaENp?= =?utf-8?B?OWh4bHV6UkpjU2ZPeElqK042NnNEZVBraGtacG1PcC9PdTJvSUo0ZlRQblVq?= =?utf-8?B?OXo2TWZ5MDFrUmxDWTVadHU1aUpiMWtFTFFpWW9objhOK2hiUk1YNGk5Vjd0?= =?utf-8?B?aDYvdG1JUWtIZyttRS9vd01RNGdXaXlXckQwUWJKc0dhL1J6bkp0a1VFU29k?= =?utf-8?B?alhnM1ZZa3ZQeXpyYzc4R0hOQlNvK3hRS3lNbjFnZGh2dmt6cW5pdFpVR3BF?= =?utf-8?B?dU1lMllCL2tGdmxVbVdyZnZ0QkFjVVFnbW9IR01qcUpiUnk3b00ydWNNd0RP?= =?utf-8?B?OHMxNCtRZ2I1c1ZsNXNHdzk0N1BaVTZRSzVhVi92RVJjdzhBRkRocjZRM05E?= =?utf-8?B?dGUzdGp1OENDTFFHa2JXUS96YzB1aTJFZXRrc3d6YzZLa2trRWVxb2l4TnVK?= =?utf-8?B?UWNSVEFGdGNYOUV6ZXpxMHQzYXh6K0U0U0k1WEc5MHVrYkxUK1BQVG9NNHhX?= =?utf-8?B?UlpaR3J1enNIOFA0cCsvSnVTdWNwMWVmalJ5T09SN2c5cEw1aFI0d3ErM3Az?= =?utf-8?B?Si9XZ2p3bzJvbEhlWnVTR01obzFEV1JnNW8xaUgwckpyQmRhT0JIVVd6VS9t?= =?utf-8?B?RC9JdEtpdFNoa0RMZzhIMFYvcHVHQUo3a2w4VmQ5alFNVjEzRC9kallxdk9H?= =?utf-8?B?QTVnelpwai9nbEJ5WVpISVpyVFpGa3ZSVjdYYVd2Nit2RXhvWit4bUhJMndx?= =?utf-8?B?VEp0aE1JQjZHbVZLVFF1NXo2SnVmL2plTzdndUJjdG5RdGZlSVVlK0hYdWNL?= =?utf-8?B?eUFRQVd2bU1zaWs4SUxHNUd0NTNGTmVPWFprVkQxckdBOHZSVkRLU00rajBH?= =?utf-8?B?dUlaQWErZGY4U0V3WUNtaWxRdkdvMVFlQm1CQlp4a0RZNU5NdmhXeXJPRkQ4?= =?utf-8?B?RE04d1QxSTB3bEQxMHdnTEFJaWIrQzNRWTVmaTJibUZKT1Z2WXpPMVMybUFT?= =?utf-8?B?N2pXa1ZpWFBHMlpJVjJaNVhoS21DbUVhMmJRMHY4SGx3S3IyUXovL3BxaFdV?= =?utf-8?B?UzdSZ29EcGtxc1VvRVJZM0UzRGhBdTAxTWJEditCUzBReFBVeW9WL3VlYjN6?= =?utf-8?B?TThkYTJzS3UyRDhyYno3cjdhbUF1Q0h4bVVYRXYzQzBkTWRQekF6Szh2Ri9I?= =?utf-8?B?eWdHR0ZUMmFiaHdtT01nSERRbEcxT3ZwcTVockQrWFRBSHFqcGhJZU5IMk43?= =?utf-8?B?a3JsdWJKcHd4UXBYbDRlTW9aR0lYcUpyVndCWThyNm9sMzNzUEUzRnJLbmNs?= =?utf-8?B?a01Gdmd1RDcwSGhmaXhxd21sMUw1R1dPNkZVbi93SGxOSmI4SGs0UXV6dVJ3?= =?utf-8?B?Y3dDOHdFUmpvQUdtUVFDSTh6N2tKTmI0Z0pUWWpLM0hHTjJJWUt5YUN2a0Mx?= =?utf-8?B?TTRYOHIrNG5wRlJwWWFTUjhUbE84ZFNRSS8vSG1TZnkxWFpyZk50cG50MVVJ?= =?utf-8?B?QlJtYTQwVmVEdDVOTjlMUjhaRVFiRmRseXN3TmNRMS9KcTg4Z3daUFFpdUdV?= =?utf-8?B?K2dqYStieFNvY0FhMnlpUSsrUVFRM3F6eUtXUkdXZmpzalVScEh3V0U2SVIz?= =?utf-8?Q?iIXk5xulk/YDILswpHutMCv5A?= X-MS-Exchange-AntiSpam-ExternalHop-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-ExternalHop-MessageData-0: qhleWE0ptzS/tn92ZanEpVvQUhA+BSEck4x1supfs22ARnCXf1PCVzEiBewkxXHnQo1PsuRzKthcIMb5RrOjUPcJ5T5bDywLUxiUVTq6UP7DSpkyBMoogCv3lkq2/DcMOpz6pT5UXHGZ0uwFpf3MNMy++HmAr49gcJdMHJwudIZwvzy+yACvzYV+ufzgt4ju9vo9z9uvXIFoEYArNv8SFCet7Nrn/BvnBHsy1pNQLMt/V339jTvTezPZ9DF4b8aGErgwDmOreE3BLm/rzGQQf2fCqSi5MwKfWv2sZFhYDVT/UxUL91ZzlfIAw8F85W3BjIryCc7zfMeXD7c7TPhbbELMJtScNYBUhJK53v7hJnA5dW03oJ8s8mdKjke41ZvH1bcqB9Dr6obO9wJdgaoIUE08TQPd1M5n1rfTXdI/X+jEjo6lfZTVWH+pqBwaSDrfa95TgxYLKUgYMMEzG2ppbQMCbCPEwhaEkd1cC6kA7ZTGTsB6DMEctcSTTbm+z8XFttz5LN3WqlAp9RjRyWLLBQrCFEUXYYMIKrNlw++Fcg0uovoP/4CayqFBkL3X47xjRrDPSxyT+1jXOJ2aio827J7kHuNGsA45k1Kip5Cc8xQ= X-OriginatorOrg: oracle.com X-MS-Exchange-CrossTenant-Network-Message-Id: 4f807e1f-d14e-4743-5b14-08dc5111556d X-MS-Exchange-CrossTenant-AuthSource: BY5PR10MB4257.namprd10.prod.outlook.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-OriginalArrivalTime: 30 Mar 2024 23:30:07.0535 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-CrossTenant-Id: 4e2c6054-71cb-48f1-bd6c-3a9705aca71b X-MS-Exchange-CrossTenant-MailboxType: HOSTED X-MS-Exchange-CrossTenant-UserPrincipalName: SwzMVYSo8PLIx7Y/TIuVzJNVtSwFAVoomT620EvMpErlmqQQGj57WMxWNyN5Qw5brqriKCY3mY5+/UcgOkF/Rg== X-MS-Exchange-Transport-CrossTenantHeadersStamped: IA0PR10MB7372 X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.1011,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2024-03-30_17,2024-03-28_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 mlxlogscore=999 mlxscore=0 adultscore=0 spamscore=0 phishscore=0 bulkscore=0 suspectscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2403210000 definitions=main-2403300192 X-Proofpoint-ORIG-GUID: NVxdbQlFEzdJNwrAXruTn-3wV2orz9dA X-Proofpoint-GUID: NVxdbQlFEzdJNwrAXruTn-3wV2orz9dA On 3/30/24 11:28 AM, Chuck Lever wrote: > On Sat, Mar 30, 2024 at 10:46:08AM -0700, Dai Ngo wrote: >> On 3/29/24 4:42 PM, Chuck Lever wrote: >>> On Fri, Mar 29, 2024 at 10:57:22AM -0700, Dai Ngo wrote: >>>> On 3/29/24 7:55 AM, Chuck Lever wrote: >>>>> On Thu, Mar 28, 2024 at 05:31:02PM -0700, Dai Ngo wrote: >>>>>> On 3/28/24 11:14 AM, Dai Ngo wrote: >>>>>>> On 3/28/24 7:08 AM, Chuck Lever wrote: >>>>>>>> On Wed, Mar 27, 2024 at 06:09:28PM -0700, Dai Ngo wrote: >>>>>>>>> On 3/26/24 11:27 AM, Chuck Lever wrote: >>>>>>>>>> On Tue, Mar 26, 2024 at 11:13:29AM -0700, Dai Ngo wrote: >>>>>>>>>>> Currently when a nfs4_client is destroyed we wait for >>>>>>>>>>> the cb_recall_any >>>>>>>>>>> callback to complete before proceed. This adds >>>>>>>>>>> unnecessary delay to the >>>>>>>>>>> __destroy_client call if there is problem communicating >>>>>>>>>>> with the client. >>>>>>>>>> By "unnecessary delay" do you mean only the seven-second RPC >>>>>>>>>> retransmit timeout, or is there something else? >>>>>>>>> when the client network interface is down, the RPC task takes ~9s to >>>>>>>>> send the callback, waits for the reply and gets ETIMEDOUT. This process >>>>>>>>> repeats in a loop with the same RPC task before being stopped by >>>>>>>>> rpc_shutdown_client after client lease expires. >>>>>>>> I'll have to review this code again, but rpc_shutdown_client >>>>>>>> should cause these RPCs to terminate immediately and safely. Can't >>>>>>>> we use that? >>>>>>> rpc_shutdown_client works, it terminated the RPC call to stop the loop. >>>>>>> >>>>>>>>> It takes a total of about 1m20s before the CB_RECALL is terminated. >>>>>>>>> For CB_RECALL_ANY and CB_OFFLOAD, this process gets in to a infinite >>>>>>>>> loop since there is no delegation conflict and the client is allowed >>>>>>>>> to stay in courtesy state. >>>>>>>>> >>>>>>>>> The loop happens because in nfsd4_cb_sequence_done if cb_seq_status >>>>>>>>> is 1 (an RPC Reply was never received) it calls nfsd4_mark_cb_fault >>>>>>>>> to set the NFSD4_CB_FAULT bit. It then sets cb_need_restart to true. >>>>>>>>> When nfsd4_cb_release is called, it checks cb_need_restart bit and >>>>>>>>> re-queues the work again. >>>>>>>> Something in the sequence_done path should check if the server is >>>>>>>> tearing down this callback connection. If it doesn't, that is a bug >>>>>>>> IMO. >>>>>> TCP terminated the connection after retrying for 16 minutes and >>>>>> notified the RPC layer which deleted the nfsd4_conn. >>>>> The server should have closed this connection already. >>>> Since there is no delegation conflict, the client remains in courtesy >>>> state. >>>> >>>>> Is it stuck >>>>> waiting for the client to respond to a FIN or something? >>>> No, in this case since the client network interface was down server >>>> TCP did not even receive ACK for the transmit so the server TCP >>>> kept retrying. >>> It sounds like the server attempts to maintain the client's >>> transport while the client is in courtesy state? >> Yes, TCP keeps retryingwhile the client is in courtesy state. > So the client hasn't sent a lease-renewing operation recently, but > the connection is still there. It then makes sense for the server to > forcibly close the connection when a client transitions to the > courtesy state, since that connection instance is suspect. yes, this makes sense. This would make the RPC to stop within a lease period. I'll submit a patch to kill the back channel connection if there is RPC pending and the client is about to enter courtesy state. > > The server would then recognize immediately that it shouldn't post > any new backchannel operations to that client until it gets a fresh > connection attempt from it. Currently deleg_reaper does not issue CB_RECALL_ANY for courtesy clients. > > >>> I thought the >>> purpose of courteous server was to more gracefully handle network >>> partitions, in which case, the transport is not reliable. >>> >>> If a courtesy client hasn't re-established a connection with a >>> backchannel by the time a conflicting open/lock request arrives, >>> the server has no choice but to expire that client's courtesy >>> state immediately. Right? >> Yes, that is the case for CB_RECALL but not for CB_RECALL_ANY. > CB_RECALL_ANY is done by a shrinker, yes? CB_RECALL_ANY is done by the memory shrinker or when num_delegations >= max_delegations. > > Instead of attempting to send a CB_RECALL_ANY to a courtesy client > which is likely to be unresponsive, why not just expire the oldest > courtesy client the server has? Or... if NFSD /already/ expires the > oldest courtesy client as a result of memory pressure, then just > don't ever send CB_RECALL_ANY to courtesy clients. Currently deleg_reaper does not issue CB_RECALL_ANY for courtesy clients. > > As soon as a courtesy client reconnects, it will send a lease- > renewing operation, transition back to an active state, and at that > point it re-qualifies for getting CB_RECALL_ANY. > > >>> But maybe that's a side-bar. >>> >>> >>>>>> But when nfsd4_run_cb_work ran again, it got into the infinite >>>>>> loop caused by: >>>>>> /* >>>>>> * XXX: Ideally, we could wait for the client to >>>>>> * reconnect, but I haven't figured out how >>>>>> * to do that yet. >>>>>> */ >>>>>> nfsd4_queue_cb_delayed(cb, 25); >>>>>> >>>>>> which was introduced by c1ccfcf1a9bf. Note that I'm using 6.9-rc1. >>>>> The whole paragraph is: >>>>> >>>>> 1503 clnt = clp->cl_cb_client; >>>>> 1504 if (!clnt) { >>>>> 1505 if (test_bit(NFSD4_CLIENT_CB_KILL, &clp->cl_flags)) >>>>> 1506 nfsd41_destroy_cb(cb); >>>>> 1507 else { >>>>> 1508 /* >>>>> 1509 * XXX: Ideally, we could wait for the client to >>>>> 1510 * reconnect, but I haven't figured out how >>>>> 1511 * to do that yet. >>>>> 1512 */ >>>>> 1513 nfsd4_queue_cb_delayed(cb, 25); >>>>> 1514 } >>>>> 1515 return; >>>>> 1516 } >>>>> >>>>> When there's no rpc_clnt and CB_KILL is set, the callback >>>>> operation should be destroyed immediately. CB_KILL is set by >>>>> nfsd4_shutdown_callback. It's only caller is >>>>> __destroy_client(). >>>>> >>>>> Why isn't NFSD4_CLIENT_CB_KILL set? >>>> Since __destroy_client was not called, for the reason mentioned above, >>>> nfsd4_shutdown_callback was not called so NFSD4_CLIENT_CB_KILL was not >>>> set. >>> Well, then I'm missing something. You said, above: >>> >>>> Currently when a nfs4_client is destroyed we wait for >>> And __destroy_client() invokes nfsd4_shutdown_callback(). Can you >>> explain the usage scenario(s) to me again? >> __destroy_client is not called in the case of CB_RECALL_ANY since >> there is no open/lock conflict associated the the client. >>>> Since the nfsd callback_wq was created with alloc_ordered_workqueue, >>>> once this loop happens the nfsd callback_wq is stuck since this workqueue >>>> executes at most one work item at any given time. >>>> >>>> This means that if a nfs client is shutdown and the server is in this >>>> state then all other clients are effected; all callbacks to other >>>> clients are stuck in the workqueue. And if a callback for a client is >>>> stuck in the workqueue then that client can not unmount its shares. >>>> >>>> This is the symptom that was reported recently on this list. I think >>>> the nfsd callback_wq should be created as a normal workqueue that allows >>>> multiple work items to be executed at the same time so a problem with >>>> one client does not effect other clients. >>> My impression is that the callback_wq is single-threaded to avoid >>> locking complexity. I'm not yet convinced it would be safe to simply >>> change that workqueue to permit multiple concurrent work items. >> Do you have any specific concern about lock complexity related to >> callback operation? > If there needs to be a fix, I'd like something for v6.9-rc, so it > needs to be a narrow change. Larger infrastructural changes have to > go via a full merge window. > > >>> It could be straightforward, however, to move the callback_wq into >>> struct nfs4_client so that each client can have its own workqueue. >>> Then we can take some time and design something less brittle and >>> more scalable (and maybe come up with some test infrastructure so >>> this stuff doesn't break as often). >> IMHO I don't see why the callback workqueue has to be different >> from the laundry_wq or nfsd_filecache_wq used by nfsd. > You mean, you don't see why callback_wq has to be ordered, while > the others are not so constrained? Quite possibly it does not have > to be ordered. Yes, I looked at the all the nfsd4_callback_ops on nfsd and they seems to take into account of concurrency and use locks appropriately. For each type of work I don't see why one work has to wait for the previous work to complete before proceed. > > But we might have lost the bit of history that explains why, so > let's be cautious about making broad changes here until we have a > good operational understanding of the code and some robust test > cases to check any changes we make. Understand, you make the call. -Dai > >