From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1bf3afff-2ddd-486e-a921-ae72b4dc4cc0@oracle.com>
Date: Fri, 17 Jan 2025 15:27:48 -0500
User-Agent: Mozilla Thunderbird
Subject: Re: nfsd blocks indefinitely in nfsd4_destroy_session
To: Baptiste PELLEGRIN , Rik Theys , Christian Herzog
Cc: Salvatore Bonaccorso , Benjamin Coddington , Trond Myklebust , Harald Dunkel , Linux NFS Mailing List , Martin Svec , Michael Gernoth , Jeff Layton
References: <4c3080af-eec7-4af5-8b0d-c35ac98ec074@aixigo.com> <3cdcf2ee-46b3-463d-bc14-0f44062c0bd0@oracle.com> <7fb711b1-c557-48de-bf91-d522bdbcc575@oracle.com> <36f4892e1332e2322ab46e1343316eb187d78025.camel@kernel.org> <7428f722-3456-42e1-91ba-7cb52468f364@oracle.com> <7a3cfd98-b82d-4e96-971e-2f2658ff55d8@ac-grenoble.fr>
Content-Language: en-US
From: Chuck Lever
In-Reply-To: <7a3cfd98-b82d-4e96-971e-2f2658ff55d8@ac-grenoble.fr>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Precedence: bulk
X-Mailing-List: linux-nfs@vger.kernel.org
MIME-Version: 1.0

On 1/17/25 2:43 PM, Baptiste PELLEGRIN wrote:
>
>
> On 1/16/25 9:07 PM, Chuck Lever wrote:
>> On 1/11/25 3:14 AM, Pellegrin Baptiste wrote:
>>> On 09/01/25 17:32, Chuck Lever wrote:
>>>> On 1/9/25 7:42 AM, Jeff Layton wrote:
>>>> >On Thu, 2025-01-09 at 12:56 +0100, Christian Herzog wrote:
>>>> >>Dear Chuck,
>>>> >>
>>>> >>On Wed, Jan 08, 2025 at 10:07:49AM -0500, Chuck Lever wrote:
>>>> >>>On 1/8/25 9:54 AM, Christian Herzog wrote:
>>>> >>>>Dear Chuck,
>>>> >>>>
>>>> >>>>On Wed, Jan 08, 2025 at 08:33:23AM -0500, Chuck Lever wrote:
>>>> >>>>>On 1/7/25 4:17 PM, Salvatore Bonaccorso wrote:
>>>> >>>>>>Hi Chuck,
>>>> >>>>>>
>>>> >>>>>>Thanks for your time on this, much appreciated.
>>>> >>>>>>
>>>> >>>>>>On Wed, Jan 01, 2025 at 02:24:50PM -0500, Chuck Lever wrote:
>>>> >>>>>>>On 12/25/24 4:15 AM, Salvatore Bonaccorso wrote:
>>>> >>>>>>>>Hi Chuck, hi all,
>>>> >>>>>>>>
>>>> >>>>>>>>[it was not ideal to pick one of the messages for this followup, let me
>>>> >>>>>>>>know if you want a complete new thread, adding as well Benjamin and
>>>> >>>>>>>>Trond as they are involved in one mentioned patch]
>>>> >>>>>>>>
>>>> >>>>>>>>On Mon, Jun 17, 2024 at 02:31:54PM +0000, Chuck Lever III wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>>>On Jun 17, 2024, at 2:55 AM, Harald Dunkel wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>Hi folks,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>what would be the reason for nfsd getting stuck somehow and becoming
>>>> >>>>>>>>>>an unkillable process? See
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562
>>>> >>>>>>>>>>- https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2062568
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>Doesn't this mean that something inside the kernel gets stuck as
>>>> >>>>>>>>>>well? Seems odd to me.
>>>> >>>>>>>>>
>>>> >>>>>>>>>I'm not familiar with the Debian or Ubuntu kernel packages. Can
>>>> >>>>>>>>>the kernel release numbers be translated to LTS kernel releases
>>>> >>>>>>>>>please? Need both "last known working" and "first broken" releases.
>>>> >>>>>>>>>
>>>> >>>>>>>>>This:
>>>> >>>>>>>>>
>>>> >>>>>>>>>[ 6596.911785] RPC: Could not send backchannel reply error: -110
>>>> >>>>>>>>>[ 6596.972490] RPC: Could not send backchannel reply error: -110
>>>> >>>>>>>>>[ 6837.281307] RPC: Could not send backchannel reply error: -110
>>>> >>>>>>>>>
>>>> >>>>>>>>>is a known set of client backchannel bugs. Knowing the LTS kernel
>>>> >>>>>>>>>releases (see above) will help us figure out what needs to be
>>>> >>>>>>>>>backported to the LTS kernels in question.
>>>> >>>>>>>>>
>>>> >>>>>>>>>This:
>>>> >>>>>>>>>
>>>> >>>>>>>>>[11183.290619] wait_for_completion+0x88/0x150
>>>> >>>>>>>>>[11183.290623] __flush_workqueue+0x140/0x3e0
>>>> >>>>>>>>>[11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
>>>> >>>>>>>>>[11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
>>>> >>>>>>>>>
>>>> >>>>>>>>>is probably related to the backchannel errors on the client, but
>>>> >>>>>>>>>client bugs shouldn't cause the server to hang like this. We
>>>> >>>>>>>>>might be able to say more if you can provide the kernel release
>>>> >>>>>>>>>translations (see above).
>>>> >>>>>>>> >>>> >>>>>>>>In Debian we hstill have the bug #1071562 open and one >>>> person notified >>>> >>>>>>>>mye offlist that it appears that the issue get more frequent >>>> since >>>> >>>>>>>>they updated on NFS client side from Ubuntu 20.04 to Debian >>>> bookworm >>>> >>>>>>>>with a 6.1.y based kernel). >>>> >>>>>>>> >>>> >>>>>>>>Some people around those issues, seem to claim that the change >>>> >>>>>>>>mentioned in >>>> >>>>>>>>https://lists.proxmox.com/pipermail/pve-devel/2024- >>>> July/064614.html >>> devel/2024- July/064614.html> >>>> >>>>>>>>would fix the issue, which is as well backchannel related. >>>> >>>>>>>> >>>> >>>>>>>>This is upstream: 6ddc9deacc13 ("SUNRPC: Fix backchannel reply, >>>> >>>>>>>>again"). While this commit fixes 57331a59ac0d ("NFSv4.1: Use >>>> the >>>> >>>>>>>>nfs_client's rpc timeouts for backchannel") this is not >>>> something >>>> >>>>>>>>which goes back to 6.1.y, could it be possible that hte >>>> backchannel >>>> >>>>>>>>refactoring and this final fix indeeds fixes the issue? >>>> >>>>>>>> >>>> >>>>>>>>As people report it is not easily reproducible, so this >>>> makes it >>>> >>>>>>>>harder to identify fixes correctly. >>>> >>>>>>>> >>>> >>>>>>>>I gave a (short) stance on trying to backport commits up to >>>> >>>>>>>>6ddc9deacc13 ("SUNRPC: Fix backchannel reply, again") but >>>> this quickly >>>> >>>>>>>>seems to indicate it is probably still not the right thing for >>>> >>>>>>>>backporting to the older stable series. >>>> >>>>>>>> >>>> >>>>>>>>As at least pre-requisites: >>>> >>>>>>>> >>>> >>>>>>>>2009e32997ed568a305cf9bc7bf27d22e0f6ccda >>>> >>>>>>>>4119bd0306652776cb0b7caa3aea5b2a93aecb89 >>>> >>>>>>>>163cdfca341b76c958567ae0966bd3575c5c6192 >>>> >>>>>>>>f4afc8fead386c81fda2593ad6162271d26667f8 >>>> >>>>>>>>6ed8cdf967f7e9fc96cd1c129719ef99db2f9afc >>>> >>>>>>>>57331a59ac0d680f606403eb24edd3c35aecba31 >>>> >>>>>>>> >>>> >>>>>>>>and still there would be conflicting codepaths (and does not >>>> seem >>>> >>>>>>>>right). >>>> >>>>>>>> >>>> >>>>>>>>Chuck, Benjamin, Trond, is there anything we can provive on >>>> reporters >>>> >>>>>>>>side that we can try to tackle this issue better? >>>> >>>>>>> >>>> >>>>>>>As I've indicated before, NFSD should not block no matter >>>> what the >>>> >>>>>>>client may or may not be doing. I'd like to focus on the >>>> server first. >>>> >>>>>>> >>>> >>>>>>>What is the result of: >>>> >>>>>>> >>>> >>>>>>>$ cd >>>> >>>>>>>$ unset KBUILD_OUTPUT >>>> >>>>>>>$ make -j `nproc` >>>> >>>>>>>$ scripts/faddr2line \ >>>> >>>>>>> fs/nfsd/nfs4state.o \ >>>> >>>>>>> nfsd4_destroy_session+0x16d >>>> >>>>>>> >>>> >>>>>>>Since this issue appeared after v6.1.1, is it possible to bisect >>>> >>>>>>>between v6.1.1 and v6.1.90 ? >>>> >>>>>> >>>> >>>>>>First please note, at least speaking of triggering the issue in >>>> >>>>>>Debian, Debian has moved to 6.1.119 based kernel already (and >>>> soon in >>>> >>>>>>the weekends point release move to 6.1.123). 
>>>> >>>>>> >>>> >>>>>>That said, one of the users which regularly seems to be hit by >>>> the >>>> >>>>>>issue was able to provide the above requested information, >>>> based for >>>> >>>>>>6.1.119: >>>> >>>>>> >>>> >>>>>>~/kernel/linux-source-6.1# make kernelversion >>>> >>>>>>6.1.119 >>>> >>>>>>~/kernel/linux-source-6.1# scripts/faddr2line fs/nfsd/ >>>> nfs4state.o nfsd4_destroy_session+0x16d >>>> >>>>>>nfsd4_destroy_session+0x16d/0x250: >>>> >>>>>>__list_del_entry at /root/kernel/linux-source-6.1/./include/ >>>> linux/list.h:134 >>>> >>>>>>(inlined by) list_del at /root/kernel/linux-source-6.1/./ >>>> include/linux/list.h:148 >>>> >>>>>>(inlined by) unhash_session at /root/kernel/linux-source-6.1/ >>>> fs/ nfsd/nfs4state.c:2062 >>>> >>>>>>(inlined by) nfsd4_destroy_session at /root/kernel/linux- >>>> source-6.1/fs/nfsd/nfs4state.c:3856 >>>> >>>>>> >>>> >>>>>>They could provide as well a decode_stacktrace output for the >>>> recent >>>> >>>>>>hit (if that is helpful for you): >>>> >>>>>> >>>> >>>>>>[Mon Jan 6 13:25:28 2025] INFO: task nfsd:55306 blocked for >>>> more than 6883 seconds. >>>> >>>>>>[Mon Jan 6 13:25:28 2025]       Not tainted 6.1.0-28-amd64 #1 >>>> Debian 6.1.119-1 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] "echo 0 > /proc/sys/kernel/ >>>> hung_task_timeout_secs" disables this message. >>>> >>>>>>[Mon Jan 6 13:25:28 2025] task:nfsd            state:D >>>> stack:0     pid:55306 ppid:2      flags:0x00004000 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] Call Trace: >>>> >>>>>>[Mon Jan 6 13:25:28 2025]  >>>> >>>>>>[Mon Jan 6 13:25:28 2025] __schedule+0x34d/0x9e0 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] schedule+0x5a/0xd0 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] schedule_timeout+0x118/0x150 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] wait_for_completion+0x86/0x160 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] __flush_workqueue+0x152/0x420 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] nfsd4_destroy_session (debian/build/ >>>> build_amd64_none_amd64/include/linux/spinlock.h:351 debian/build/ >>>> build_amd64_none_amd64/fs/nfsd/nfs4state.c:3861) nfsd >>>> >>>>>>[Mon Jan 6 13:25:28 2025] nfsd4_proc_compound (debian/build/ >>>> build_amd64_none_amd64/fs/nfsd/nfs4proc.c:2680) nfsd >>>> >>>>>>[Mon Jan 6 13:25:28 2025] nfsd_dispatch (debian/build/ >>>> build_amd64_none_amd64/fs/nfsd/nfssvc.c:1022) nfsd >>>> >>>>>>[Mon Jan 6 13:25:28 2025] svc_process_common (debian/build/ >>>> build_amd64_none_amd64/net/sunrpc/svc.c:1344) sunrpc >>>> >>>>>>[Mon Jan 6 13:25:28 2025] ? svc_recv (debian/build/ >>>> build_amd64_none_amd64/net/sunrpc/svc_xprt.c:897) sunrpc >>>> >>>>>>[Mon Jan 6 13:25:28 2025] ? nfsd_svc (debian/build/ >>>> build_amd64_none_amd64/fs/nfsd/nfssvc.c:983) nfsd >>>> >>>>>>[Mon Jan 6 13:25:28 2025] ? nfsd_inet6addr_event (debian/ >>>> build/ build_amd64_none_amd64/fs/nfsd/nfssvc.c:922) nfsd >>>> >>>>>>[Mon Jan 6 13:25:28 2025] svc_process (debian/build/ >>>> build_amd64_none_amd64/net/sunrpc/svc.c:1474) sunrpc >>>> >>>>>>[Mon Jan 6 13:25:28 2025] nfsd (debian/build/ >>>> build_amd64_none_amd64/fs/nfsd/nfssvc.c:960) nfsd >>>> >>>>>>[Mon Jan 6 13:25:28 2025] kthread+0xd7/0x100 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] ? kthread_complete_and_exit+0x20/0x20 >>>> >>>>>>[Mon Jan 6 13:25:28 2025] ret_from_fork+0x1f/0x30 >>>> >>>>>>[Mon Jan  6 13:25:28 2025]  >>>> >>>>>> >>>> >>>>>>The question about bisection is actually harder, those are >>>> production >>>> >>>>>>systems and I understand it's not possible to do bisect there, >>>> while >>>> >>>>>>unfortunately not reprodcing the issue on "lab conditions". 
>>>> >>>>>> >>>> >>>>>>Does the above give us still a clue on what you were looking for? >>>> >>>>> >>>> >>>>>Thanks for the update. >>>> >>>>> >>>> >>>>>It's possible that 38f080f3cd19 ("NFSD: Move callback_wq into >>>> struct >>>> >>>>>nfs4_client"), while not a specific fix for this issue, might >>>> offer some >>>> >>>>>relief by preventing the DESTROY_SESSION hang from affecting >>>> all other >>>> >>>>>clients of the degraded server. >>>> >>>>> >>>> >>>>>Not having a reproducer or the ability to bisect puts a damper on >>>> >>>>>things. The next step, then, is to enable tracing on servers >>>> where this >>>> >>>>>issue can come up, and wait for the hang to occur. The >>>> following command >>>> >>>>>captures information only about callback operation, so it >>>> should not >>>> >>>>>generate much data or impact server performance. >>>> >>>>> >>>> >>>>>    # trace-cmd record -e nfsd:nfsd_cb\* >>>> >>>>> >>>> >>>>>Let that run until the problem occurs, then ^C, compress the >>>> resulting >>>> >>>>>trace.dat file, and either attach it to 1071562 or email it to me >>>> >>>>>privately. >>>> >>>>thanks for the follow-up. >>>> >>>> >>>> >>>>I am the "customer" with two affected file servers. We have >>>> since moved those >>>> >>>>servers to the backports kernel (6.11.10) in the hope of forward >>>> fixing the >>>> >>>>issue. If this kernel is stable, I'm afraid I can't go back to >>>> the 'bad' >>>> >>>>kernel (700+ researchers affected..), and we're also not able to >>>> trigger the >>>> >>>>issue on our test environment. >>>> >>> >>>> >>>Hello Dr. Herzog - >>>> >>> >>>> >>>If the problem recurs on 6.11, the trace-cmd I suggest above works >>>> >>>there as well. >>>> >>the bad news is: it just happened again with the bpo kernel. >>>> >> >>>> >>We then immediately started trace-cmd since usually there are >>>> several call >>>> >>traces in a row and we did get a trace.dat: >>>> >>http://people.phys.ethz.ch/~daduke/trace.dat >>> people.phys.ethz.ch/~daduke/trace.dat> >>>> >> >>>> >>we did notice however that the stack trace looked a bit different >>>> this time: >>>> >> >>>> >>     INFO: task nfsd:2566 blocked for more than 5799 seconds. >>>> >>     Tainted: G        W          6.11.10+bpo-amd64 #1 Debia> >>>> >>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables t> >>>> >>     task:nfsd            state:D stack:0     pid:2566  tgid:2566 > >>>> >>     Call Trace: >>>> >>     >>>> >>     __schedule+0x400/0xad0 >>>> >>     schedule+0x27/0xf0 >>>> >>     nfsd4_shutdown_callback+0xfe/0x140 [nfsd] >>>> >>     ? __pfx_var_wake_function+0x10/0x10 >>>> >>     __destroy_client+0x1f0/0x290 [nfsd] >>>> >>     nfsd4_destroy_clientid+0xf1/0x1e0 [nfsd] >>>> >>     ? svcauth_unix_set_client+0x586/0x5f0 [sunrpc] >>>> >>     nfsd4_proc_compound+0x34d/0x670 [nfsd] >>>> >>     nfsd_dispatch+0xfd/0x220 [nfsd] >>>> >>     svc_process_common+0x2f7/0x700 [sunrpc] >>>> >>     ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd] >>>> >>     svc_process+0x131/0x180 [sunrpc] >>>> >>     svc_recv+0x830/0xa10 [sunrpc] >>>> >>     ? __pfx_nfsd+0x10/0x10 [nfsd] >>>> >>     nfsd+0x87/0xf0 [nfsd] >>>> >>     kthread+0xcf/0x100 >>>> >>     ? __pfx_kthread+0x10/0x10 >>>> >>     ret_from_fork+0x31/0x50 >>>> >>     ? __pfx_kthread+0x10/0x10 >>>> >>     ret_from_fork_asm+0x1a/0x30 >>>> >>     >>>> >> >>>> >>and also the state of the offending client in `/proc/fs/nfsd/ >>>> clients/*/info` >>>> >>used to be callback state: UNKNOWN while now it is DOWN or FAULT. 
>>>> No idea >>>> >>what it means, but I thought it was worth mentioning. >>>> >> >>>> > >>>> >Looks like this is hung in nfsd41_cb_inflight_wait_complete() ? That >>>> >probably means that the cl_cb_inflight counter is stuck at >0. I'm >>>> >guessing that means that there is some callback that it's expecting to >>>> >complete that isn't. From nfsd4_shutdown_callback(): >>>> > >>>> >         /* >>>> >          * Note this won't actually result in a null callback; >>>> >          * instead, nfsd4_run_cb_null() will detect the killed >>>> >          * client, destroy the rpc client, and stop: >>>> >          */ >>>> >         nfsd4_run_cb(&clp->cl_cb_null); >>>> >         flush_workqueue(clp->cl_callback_wq); >>>> >         nfsd41_cb_inflight_wait_complete(clp); >>>> > >>>> >...it sounds like that isn't happening properly though. >>>> > >>>> >It might be interesting to see if you can track down the callback >>>> >client in /sys/kernel/debug/sunrpc and see what it's doing. >>>> >>>> Here's a possible scenario: >>>> >>>> The trace.dat shows the server is sending a lot of CB_RECALL_ANY >>>> operations. >>>> >>>> Normally a callback occurs only due to some particular client request. >>>> Such requests would be completed before a client unmounts. >>>> >>>> CB_RECALL_ANY is an operation which does not occur due to a particular >>>> client operation and can occur at any time. I believe this is the >>>> first operation of this type we've added to NFSD. >>>> >>>> My guess is that the server's laundromat has sent some CB_RECALL_ANY >>>> operations while the client is umounting -- DESTROY_SESSION is >>>> racing with those callback operations. >>>> >>>> deleg_reaper() was backported to 6.1.y in 6.1.81, which lines up >>>> more or less with when the issues started to show up. >>>> >>>> A quick way to test this theory would be to make deleg_reaper() a >>>> no-op. If this helps, then we know that shutting down the server's >>>> callback client needs to be more careful about cleaning up outstanding >>>> callbacks. >>>> >>>> >>>> -- >>>> Chuck Lever >>> >>> Hello. >>> >>> Thanks a lot for all of this work ! >>> >>> I have started recording traces using : >>> >>> # trace-cmd record -e nfsd:nfsd_cb\* >>> >>> and currently waiting for the bug to trigger. This generally happen >>> in one/two weeks for me. >>> >>> If you want I adjust some recording parameters let me know. >> >> After looking at your data and Rik's data, I decided to try reproducing >> the problem in my lab. No luck so far. >> >> The signatures of these issues are slightly different from each other: >> >> v6.1: >> - nfsd4_destroy_session hangs on the flush_workqueue in >>    nfsd4_shutdown_callback() >> - There might be outstanding callback RPCs that are not exiting >> - Lots of CB_RECALL_ANY traffic. Only place that's generated in >>    v6.1 is the NFSD shrinker when memory is short >> - There appear to be some unreachable clients -- evidence of client >>    crashes or network outages, perhaps >> >> v6.12: >> - destroy_client() hangs on the wait_var_event in >>    nfsd4_shtudown_callback() >> - There are no callback RPCs left after the hang >> - There are some stuck clients listed under /proc/fs/nfsd/ >> >> Similar, but not the same. >> >> I'm still open to clues, so please continue the regimen of starting >> the trace-cmd when the server boots, and doing an "echo t > >> /proc/sysrq-trigger" when the hang appears. 
>> >>   # trace-cmd record -e nfsd:nfsd_cb\* -e sunrpc:svc_xprt_detach \ >>     -p function -l nfsd4_destroy_session >> >> > > Hello Chuck. > > Sorry for my late reply but I need to find something > interesting for you. > > Did you see my call trace analyze here ? > > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071562 > > On v6.1 did you see that nfsd4_shutdown_callback is reached from > both nfsd4 laundromat_main and nfsd4_state_shrinker_worker ? > For me the hang in V6.1 come also from wait_var_event. > > Here the state of my RAM on my server with some load. > >                total        used        free      shared  buff/cache > available > Mem:           32083        4982         651           5       26919 > 27101 > Swap:          16629         842       15787 > > My desktop clients use heavily suspend. The computer are almost > never rebooted. When a user logout, suspending is automatically > launched. Do you think that some delegation can be accumulated, > locked by many suspended host, and can cause the > high memory/cache usage ? v6.1 I think has courteous server support, meaning the server will attempt to preserve all client state when the client stops sending state-renewing operations. Suspended clients probably don't clean up their network connections. The server could see that as a network partition. Still, the server is supposed to clear out an idle connection after 6 minutes. So these hangs should clear up if this were the only issue. > Here my crash timeline between the desktop client salleprof13 > (the one you think is unreachable) and the server fichdc02. > The failure come from suspend. > > > --- > Client salleprof13 wake from sleep at 09:55:37 > and was attached by the server. > The callback channel is initialized. > --- > > janv. 14 07:20:32 salleprofs13 kernel: PM: suspend entry (deep) > > janv. 14 09:55:37 salleprofs13 kernel: PM: suspend exit > janv. 14 09:55:37 salleprofs13 ModemManager[621]: [sleep-monitor- > systemd] system is resuming > janv. 14 09:55:37 salleprofs13 NetworkManager[576]: > [1736844937.2382] manager: sleep: wake requested (sleeping: yes enabled: > yes) > janv. 14 09:55:37 salleprofs13 systemd[1]: Stopping gssd-bug.service - > Gssd credential cache hook... > janv. 14 09:55:37 salleprofs13 systemd[1]: Starting winbind.service - > Samba Winbind Daemon... > janv. 14 09:55:37 salleprofs13 systemd[1]: Started winbind.service - > Samba Winbind Daemon. > janv. 14 09:55:40 salleprofs13 NetworkManager[576]: > [1736844940.6304] device (enp2s0): carrier: link connected > janv. 14 09:55:40 salleprofs13 kernel: r8169 0000:02:00.0 enp2s0: Link > is Up - 1Gbps/Full - flow control off > > nfsd-1695  [047] ..... 190512.862590: nfsd_cb_probe: > addr=172.16.118.121:0 client 67833e73:4dd32903 state=UNKNOWN > kworker/u96:8-225529 [047] ..... 190512.862782: nfsd_cb_setup: > addr=172.16.118.121:0 client 67833e73:4dd32903 proto=tcp flavor=sys > kworker/u96:8-225529 [047] ..... 190512.862784: nfsd_cb_state: > addr=172.16.118.121:0 client 67833e73:4dd32903 state=UP > 2025-01-14T09:55:48.290546+01:00 fichdc02 rpc.mountd[1363]: v4.2 client > attached: 0x4dd3290367833e73 from "172.16.118.121:825" > > > --- > Someone logon on the station. > Immediately the server start sending nfsd_cb_recall_any. > Strange no ? What can be recalled on that host that just > wake up... The server attempts to invoke CB_RECALL_ANY on all clients it knows about. If it believes the client has state (even when in courtesy) then the server will send this request to that client. 
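
For reference, here is a rough sketch of the reaper logic that queues
those callbacks (simplified; the structure and field names only
approximate the upstream fs/nfsd/nfs4state.c code and vary by kernel
version, so treat it as an illustration rather than the literal
implementation):

    /*
     * Sketch: the laundromat-driven reaper walks the per-net client
     * list and queues a CB_RECALL_ANY callback for every client that
     * still holds delegations and does not already have one pending.
     * Note there is no check for whether the client is reachable.
     */
    static void deleg_reaper_sketch(struct nfsd_net *nn)
    {
            struct nfs4_client *clp;
            LIST_HEAD(cblist);

            spin_lock(&nn->client_lock);
            list_for_each_entry(clp, &nn->client_lru, cl_lru) {
                    if (list_empty(&clp->cl_delegations) ||
                        test_bit(NFSD4_CLIENT_CB_RECALL_ANY, &clp->cl_flags))
                            continue;
                    list_add(&clp->cl_ra_cblist, &cblist);
            }
            spin_unlock(&nn->client_lock);

            while (!list_empty(&cblist)) {
                    clp = list_first_entry(&cblist, struct nfs4_client,
                                           cl_ra_cblist);
                    list_del_init(&clp->cl_ra_cblist);
                    /* keep=0: ask the client to return read delegations */
                    clp->cl_ra->ra_keep = 0;
                    clp->cl_ra->ra_bmval[0] = BIT(RCA4_TYPE_MASK_RDATA_DLG);
                    nfsd4_run_cb(&clp->cl_ra->ra_cb);
            }
    }

The nfsd_cb_recall_any entries in your trace (keep=0 bmval0=RDATA_DLG)
correspond to this path. In other words, the recall is driven by the
server's own memory pressure and laundromat schedule, not by anything
the client just did, which is why a host that has only just resumed can
immediately be asked to return delegations it acquired before
suspending.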
> --- > > janv. 14 09:55:53 salleprofs13 nfsrahead[121742]: setting /dnfs/wine > readahead to 128 > janv. 14 09:56:13 salleprofs13 systemd-logind[569]: New session 36 of > user xxxxxxxx. > > kworker/u96:3-227781 [047] ..... 190541.974364: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:2-220501 [013] ..... 190541.977732: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:2-220501 [000] ..... 190550.387742: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:0-228547 [003] ..... 190550.389602: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:2-220501 [021] ..... 190565.367759: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:1-230409 [000] ..... 190565.371391: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:2-220501 [013] ..... 190579.748430: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:8-225529 [026] ..... 190579.749946: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:8-225529 [013] ..... 190592.749738: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:2-220501 [001] ..... 190592.751192: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:0-228547 [000] ..... 190599.849333: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:0-228547 [000] ..... 190599.850872: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:2-220501 [045] ..... 190686.760138: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:1-230409 [014] ..... 190686.761181: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:1-230409 [013] ..... 190694.103480: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:8-225529 [022] ..... 190694.106120: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:8-225529 [000] ..... 190749.678378: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:2-220501 [007] ..... 190749.679870: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > kworker/u96:4-230705 [014] ..... 190782.830144: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:1-230409 [010] ..... 190782.831220: nfsd_cb_recall_any_done: > client 67833e73:4dd32903 status=0 > > > --- > Next the user suspend the host maybe by pressing > the power button without login out so the last > nfsd_cb_recall_any fail > Maybe the "unrecognized reply" related ? > --- > > janv. 14 10:01:18 salleprofs13 systemd-logind[569]: The system will > suspend now! > janv. 14 10:01:18 salleprofs13 NetworkManager[576]: > [1736845278.5631] manager: sleep: sleep requested (sleeping: no enabled: > yes) > janv. 14 10:01:18 salleprofs13 ModemManager[621]: [sleep-monitor- > systemd] system is about to suspend > janv. 14 10:01:20 salleprofs13 kernel: PM: suspend entry (deep) > > kworker/u96:2-220501 [047] ..... 190851.843698: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > kworker/u96:2-220501 [047] ..... 
190861.062828: nfsd_cb_state: > addr=172.16.118.121:0 client 67833e73:4dd32903 state=FAULT > > 2025-01-14T10:06:06.586140+01:00 fichdc02 kernel: [191131.106081] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000ad57929d xid 52751431 > > > --- > Got 5 unrecognized reply. > Seems not related . > --- > > 2025-01-14T10:06:56.762137+01:00 fichdc02 kernel: [191181.279788] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 000000002dcd99c8 xid 5d819d7d > 2025-01-14T10:06:56.762197+01:00 fichdc02 kernel: [191181.280005] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 000000002dcd99c8 xid 61819d7d > 2025-01-14T10:06:56.762201+01:00 fichdc02 kernel: [191181.280054] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 000000002dcd99c8 xid 60819d7d > 2025-01-14T10:06:56.762206+01:00 fichdc02 kernel: [191181.280223] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 000000002dcd99c8 xid 5f819d7d > 2025-01-14T10:06:56.762209+01:00 fichdc02 kernel: [191181.280364] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 000000002dcd99c8 xid 5e819d7d > > > --- > Just 10 minutes after suspend the host wake up (with xxxxxxxx still > logged on). > The gssd-bug.service is a custom script that kill all the user processes > if the kerberos cache file has expired. This is due to a rpc.gssd bug > that hammer the server when the cache has expired during suspend. > It run "killall -u xxxxxxxx". But here, I don't know if the killall > was called or not. > > Just after the wake. The server send again an nfsd_cb_recall_any. But > this time > there is no nfsd_cb_state with state=FAULT. From this time there no more > kworker > cb calls. I think this where the hang start. I'm thinking the hang started a little earlier, but I agree that this is the first symptom of it. I was attempting to reproduce by periodically dropping the connection, but it appears there needs to be one or two more steps to make it behave like a suspended client. I will see what can be done. > And we got again an unrecognized reply. > --- > > janv. 14 10:10:24 salleprofs13 kernel: PM: suspend exit > janv. 14 10:10:24 salleprofs13 systemd-sleep[124910]: System returned > from sleep state. > janv. 14 10:10:24 salleprofs13 systemd[1]: Stopping gssd-bug.service - > Gssd credential cache hook... > janv. 14 10:10:24 salleprofs13 NetworkManager[576]: > [1736845824.4117] manager: sleep: wake requested (sleeping: yes enabled: > yes) > janv. 14 10:10:24 salleprofs13 check-gssd-cache[124962]: id: > « xxxxxxxx » : the user don't exist > janv. 14 10:10:24 salleprofs13 systemd[1]: Started winbind.service - > Samba Winbind Daemon. > janv. 14 10:10:27 salleprofs13 kernel: r8169 0000:02:00.0 enp2s0: Link > is Up - 1Gbps/Full - flow control off > janv. 14 10:10:27 salleprofs13 NetworkManager[576]: > [1736845827.6412] device (enp2s0): carrier: link connected > janv. 14 10:10:35 salleprofs13 systemd[1]: Stopped gssd-bug.service - > Gssd credential cache hook. > > kworker/u96:2-220501 [012] ..... 191417.544355: nfsd_cb_recall_any: > addr=172.16.118.121:825 client 67833e73:4dd32903 keep=0 bmval0=RDATA_DLG > > 2025-01-14T10:10:58.426113+01:00 fichdc02 kernel: [191422.934976] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 000000007205cdb2 xid ee91a9d8 > > > --- > nfsd probe a failed State. Normal. > But as kworker hang, it can't reinit the callback channel > after the nfsd_cb_state "state=UNKNOWN". 
> > I don't know why but nfsd started also > to send nfsd_cb_recall... > Note that is less than 120 second before the "BLOCKED" message. > so the thread may be already locked. > --- > > nfsd-1690  [014] ..... 191424.483249: nfsd_cb_probe: > addr=172.16.118.121:0 client 67833e73:4dd32903 state=FAULT > nfsd-1690  [014] ..... 191424.483252: nfsd_cb_state: > addr=172.16.118.121:0 client 67833e73:4dd32903 state=UNKNOWN > nfsd-1695  [013] ...1. 191441.265047: nfsd_cb_recall: > addr=172.16.118.121:0 client 67833e73:4dd32903 stateid 00001c6e:00000001 > nfsd-1695  [014] ...1. 191441.646104: nfsd_cb_recall: > addr=172.16.118.121:0 client 67833e73:4dd32903 stateid 000015fc:00000001 > > > --- > Got some strange logs on the client > And 7 unrecognized reply on the server that seems not related. > --- > > janv. 14 10:11:00 salleprofs13 PackageKit[122249]: daemon quit > janv. 14 10:11:00 salleprofs13 systemd[1]: packagekit.service: > Deactivated successfully. > janv. 14 10:11:00 salleprofs13 systemd[1]: packagekit.service: Consumed > 28.223s CPU time. > janv. 14 10:11:02 salleprofs13 gnome-shell[122127]: JS ERROR: > Gio.IOErrorEnum: Erreur lors de l’appel de StartServiceByName pour > org.vmware.viewagent.Credentials : Le délai d’> > > _injectToMethod/klass[method]@resource:///org/gnome/gjs/modules/core/ > overrides/Gio.js:287:25 > > VmwareCredentials@resource:///org/gnome/shell/gdm/vmware.js:34:10 > > VmwareCredentialsManager@resource:///org/gnome/shell/gdm/vmware.js:41:29 > > getVmwareCredentialsManager@resource:///org/gnome/shell/gdm/vmware.js:51:37 > > ShellUserVerifier@resource:///org/gnome/shell/gdm/util.js:183:64 > > _init@resource:///org/gnome/shell/gdm/authPrompt.js:74:30 > > AuthPrompt@resource:///org/gnome/shell/gdm/authPrompt.js:51:4 > > _ensureAuthPrompt@resource:///org/gnome/shell/ui/unlockDialog.js:686:32 > > _showPrompt@resource:///org/gnome/shell/ui/unlockDialog.js:725:14 > > vfunc_key_press_event@resource:///org/gnome/shell/ui/unlockDialog.js:620:14 > janv. 14 10:11:05 salleprofs13 gsd-power[122329]: Error setting property > 'PowerSaveMode' on interface org.gnome.Mutter.DisplayConfig: Le délai > d’attente est dépassé (g-io-erro> > janv. 
14 10:11:27 salleprofs13 gnome-shell[122127]: JS ERROR: > Gio.IOErrorEnum: Erreur lors de l’appel de StartServiceByName pour > org.vmware.viewagent.Credentials : Le délai d’> > > _injectToMethod/klass[method]@resource:///org/gnome/gjs/modules/core/ > overrides/Gio.js:287:25 > > VmwareCredentials@resource:///org/gnome/shell/gdm/vmware.js:34:10 > > VmwareCredentialsManager@resource:///org/gnome/shell/gdm/vmware.js:41:29 > > getVmwareCredentialsManager@resource:///org/gnome/shell/gdm/vmware.js:51:37 > > ShellUserVerifier@resource:///org/gnome/shell/gdm/util.js:183:64 > > _init@resource:///org/gnome/shell/gdm/authPrompt.js:74:30 > > AuthPrompt@resource:///org/gnome/shell/gdm/authPrompt.js:51:4 > > _ensureAuthPrompt@resource:///org/gnome/shell/ui/unlockDialog.js:686:32 > > _showPrompt@resource:///org/gnome/shell/ui/unlockDialog.js:725:14 > > vfunc_key_press_event@resource:///org/gnome/shell/ui/unlockDialog.js:620:14 > > > > 2025-01-14T10:12:36.730114+01:00 fichdc02 kernel: [191521.235822] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000191d4459 xid d591b1ee > 2025-01-14T10:12:36.730165+01:00 fichdc02 kernel: [191521.236547] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000191d4459 xid d691b1ee > 2025-01-14T10:12:36.730169+01:00 fichdc02 kernel: [191521.236666] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000191d4459 xid da91b1ee > 2025-01-14T10:12:36.730171+01:00 fichdc02 kernel: [191521.236743] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000191d4459 xid db91b1ee > 2025-01-14T10:12:36.730172+01:00 fichdc02 kernel: [191521.236769] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000191d4459 xid d991b1ee > 2025-01-14T10:12:36.730174+01:00 fichdc02 kernel: [191521.236824] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000191d4459 xid d891b1ee > 2025-01-14T10:12:36.730177+01:00 fichdc02 kernel: [191521.236840] > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt > 00000000191d4459 xid d791b1ee > > > --- > And finally the hang syslog message > --- > > 2025-01-14T10:14:29.370164+01:00 fichdc02 kernel: [191633.871204] INFO: > task nfsd:1690 blocked for more than 120 seconds. > 2025-01-14T10:14:29.374846+01:00 fichdc02 kernel: [191633.875410] INFO: > task nfsd:1691 blocked for more than 120 seconds. > 2025-01-14T10:14:29.374966+01:00 fichdc02 kernel: [191633.878106] INFO: > task nfsd:1693 blocked for more than 120 seconds. > 2025-01-14T10:14:29.378303+01:00 fichdc02 kernel: [191633.880748] INFO: > task nfsd:1695 blocked for more than 120 seconds. > > > Hope this help. > > I'm still recording. > > Regards, > > Baptiste. > > -- Chuck Lever