Date: Wed, 6 May 2026 11:05:19 -0700
From: Matthew Brost
To: Thomas Hellström
CC: Christian Koenig, Huang Rui, Matthew Auld, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Simona Vetter
Subject: Re: [PATCH v5 2/2] drm/ttm/pool: back up at native page order
References: <20260505200443.3300962-1-matthew.brost@intel.com>
	<20260505200443.3300962-3-matthew.brost@intel.com>
	<47256c5547c75296af32ca87161188588cacf727.camel@linux.intel.com>
	<906de072af1f6744aed1eb914ff196f0f5e00016.camel@linux.intel.com>
In-Reply-To: <906de072af1f6744aed1eb914ff196f0f5e00016.camel@linux.intel.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
X-Mailing-List: linux-kernel@vger.kernel.org
On Wed, May 06, 2026 at 06:26:43PM +0200, Thomas Hellström wrote:
> On Wed, 2026-05-06 at 09:14 -0700, Matthew Brost wrote:
> > On Wed, May 06, 2026 at 04:23:29PM +0200, Thomas Hellström wrote:
> > > Hi, Matt
> > > 
> > > On Tue, 2026-05-05 at 13:04 -0700, Matthew Brost wrote:
> > > > ttm_pool_split_for_swap() splits high-order pool pages into order-0
> > > > pages during backup so each 4K page can be released to the system as
> > > > soon as it has been written to shmem. While this minimizes the
> > > > allocator's working set during reclaim, it actively fragments memory:
> > > > every TTM-backed compound page that the shrinker touches is shattered
> > > > into order-0 pages, even when the rest of the system would prefer
> > > > that the high-order block stay intact. Under sustained kswapd
> > > > pressure this is enough to drive other parts of MM into recovery
> > > > loops from which they cannot easily escape, because the memory TTM
> > > > just freed is no longer contiguous.
> > > > 
> > > > Stop unconditionally splitting on the backup path and back up each
> > > > compound at its native order in ttm_pool_backup():
> > > > 
> > > >   - For each non-handle slot, read the order from the head page and
> > > >     back up all 1 << order subpages, writing the resulting handles
> > > >     into tt->pages[] as we go.
> > > >   - On success, the compound is freed once at its native order. No
> > > >     split_page(), no per-4K refcount juggling, no fragmentation
> > > >     introduced from this path.
> > > >   - Slots that already hold a backup handle from a previous partial
> > > >     attempt are skipped. A compound that would extend past a
> > > >     fault-injection-truncated num_pages is skipped rather than split.
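
(Condensed from the patch further down: a rough sketch of the
per-compound loop those bullets describe, with fault injection, error
handling and the split fallback elided. The helper names are the ones
used in the patch itself.)

	unsigned int order = ttm_pool_page_order(pool, page);
	pgoff_t j, npages = 1UL << order;

	/* Back up every subpage of the compound, recording the shmem
	 * handle for each one in the tt page array. */
	for (j = 0; j < npages; ++j) {
		s64 handle = ttm_backup_backup_page(backup, page + j,
						    writeback, idx + j,
						    page_gfp, alloc_gfp);
		if (handle < 0)
			break;	/* OOM: take the split fallback below */
		tt->pages[idx + j] = ttm_backup_handle_to_page_ptr(handle);
	}

	/* Only on full success is the compound freed, once, at its
	 * native order: no split_page(), no fragmentation. */
	if (j == npages)
		__free_pages_gpu_account(page, order, false);
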
> > > > 
> > > > A per-subpage backup failure cannot be made fully atomic: backing up
> > > > a subpage allocates a shmem folio before the source page can be
> > > > released, so under true OOM any subpage in a compound (not just the
> > > > first) may fail to be backed up with the rest of the source compound
> > > > still live and contiguous. To make forward progress in that case,
> > > > fall back to splitting the source compound and backing up its
> > > > remaining subpages individually:
> > > > 
> > > >   - On the first per-subpage failure for a compound (and only if
> > > >     order > 0), call ttm_pool_split_for_swap() to split the source
> > > >     compound, release the subpages whose contents already live in
> > > >     shmem (their handles in tt->pages stay valid), and retry the
> > > >     failing subpage at order 0.
> > > >   - Subsequent successful subpage backups in the now-split compound
> > > >     free their source page individually as soon as the handle is
> > > >     written.
> > > >   - A second failure after splitting terminates the loop with partial
> > > >     progress; the remaining order-0 subpages stay in tt->pages as
> > > >     plain page pointers and are cleaned up by the normal
> > > >     ttm_pool_drop_backed_up() / ttm_pool_free_range() paths.
> > > > 
> > > > This restores the original split-on-OOM fallback behavior while
> > > > keeping the common, non-OOM case fragmentation-free. It also
> > > > preserves the "partial backup is allowed" contract: shrunken is
> > > > incremented per backed-up subpage so the caller still sees forward
> > > > progress when a compound only partially succeeds.
> > > > 
> > > > The restore-side leftover-page branch in ttm_pool_restore_commit() is
> > > > left as-is for now: that path can still split a previously-retained
> > > > compound, but in practice it is unreachable under realistic workloads
> > > > (per profiling we have not been able to trigger it), so it is not
> > > > worth complicating the restore state machine to avoid the split
> > > > there. If it ever becomes a problem in practice it can be addressed
> > > > independently.
> > > > 
> > > > ttm_pool_split_for_swap() itself is retained both for the OOM
> > > > fallback above and for the restore path's remaining caller. The
> > > > DMA-mapped pre-backup unmap loop, the purge path, ttm_pool_free_*,
> > > > and ttm_pool_unmap_and_free() already operate at native order and
> > > > are unchanged.
> > > > 
> > > > Cc: Christian Koenig
> > > > Cc: Huang Rui
> > > > Cc: Matthew Auld
> > > > Cc: Maarten Lankhorst
> > > > Cc: Maxime Ripard
> > > > Cc: Thomas Zimmermann
> > > > Cc: David Airlie
> > > > Cc: Simona Vetter
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Cc: stable@vger.kernel.org
> > > > Fixes: b63d715b8090 ("drm/ttm/pool, drm/ttm/tt: Provide a helper to
> > > > shrink pages")
> > > > Suggested-by: Thomas Hellström
> > > > Assisted-by: Claude:claude-opus-4.6
> > > > Signed-off-by: Matthew Brost
> > > > 
> > > > ---
> > > > 
> > > > A follow-up should attempt writeback to shmem at folio order as well,
> > > > but the API for doing so is unclear and may be incomplete.
> > > > 
> > > > This patch is related to the pending series [1] and significantly
> > > > reduces the likelihood of Xe entering a kswapd loop under
> > > > fragmentation.
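
(Aside: the split helper the fallback bullets reuse is, from memory of
the existing TTM pool code, essentially split_page() plus clearing the
pool's per-subpage private state. A rough sketch, not guaranteed
verbatim:)

	static void ttm_pool_split_for_swap(struct ttm_pool *pool,
					    struct page *p)
	{
		unsigned int order = ttm_pool_page_order(pool, p);
		pgoff_t nr;

		if (!order)
			return;

		/* Turn one order-N compound into 1 << N independent
		 * order-0 pages, then clear per-page pool state. */
		split_page(p, order);
		nr = 1UL << order;
		while (nr)
			p[--nr].private = 0;
	}
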
> > > > The kswapd → shrinker → Xe shrinker → TTM backup path is still
> > > > exercised; however, with this change the backup path no longer
> > > > worsens fragmentation, which previously amplified reclaim pressure
> > > > and reinforced the kswapd loop.
> > > > 
> > > > Nonetheless, the pathological case that [1] aims to address still
> > > > exists and requires a proper solution. Even with this patch, a
> > > > kswapd loop due to severe fragmentation can still be triggered,
> > > > although it is now substantially harder to reproduce.
> > > > 
> > > > v2:
> > > >  - Split pages and free immediately if backup fails on higher-order
> > > >    pages (Thomas)
> > > > v3:
> > > >  - Skip handles in purge path (sashiko)
> > > > v5:
> > > >  - Refactor into ttm_pool_backup_folio (Thomas)
> > > > 
> > > > [1] https://patchwork.freedesktop.org/series/165330/
> > > > ---
> > > >  drivers/gpu/drm/ttm/ttm_pool.c | 110 ++++++++++++++++++++++++++++-----
> > > >  1 file changed, 94 insertions(+), 16 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > index d380a3c7fe40..78efc8524133 100644
> > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > @@ -1019,6 +1019,70 @@ void ttm_pool_drop_backed_up(struct ttm_tt *tt)
> > > >  	ttm_pool_free_range(NULL, tt, ttm_cached, start_page, tt->num_pages);
> > > >  }
> > > >  
> > > > +static int ttm_pool_backup_folio(struct ttm_pool *pool, struct ttm_tt *tt,
> > > > +				 struct file *backup, struct folio *folio,
> > > > +				 unsigned int order, bool writeback,
> > > > +				 pgoff_t idx, gfp_t page_gfp, gfp_t alloc_gfp)
> > > 
> > > I don't really understand why we can't end up with a
> > > ttm_backup_backup_folio(), which I believe is the proper layering,
> > > already at this point? Please see a suggestion at
> > > 
> > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/ttm_swapout?ref_type=heads
> > > 
> > > Here the splitting logic is kept in the ttm_pool, but ttm_backup
> > > supports handing large folios to it.
> > > 
> > > Although the cumulative diffstat becomes larger, the end code becomes
> > > smaller and IMO easier to read, and we don't need to introduce code
> > > that we immediately have to refactor.
> > 
> > That version looks fine too. If that is the preference, no issue.
> 
> Cool. Note that there is a bug in that we don't pass the folio order
> into ttm_backup_backup_folio(). I'm force-pushing a fix for that.
> 
> > My goal with this series is to get something that can reasonably be
> > backported to LTS kernels so the desktop doesn't frequently enter
> > kswapd because of fragmentation. We now have at least 3 reports of
> > this being an issue.
> > 
> > There is a larger fix [1] which works in tandem but seems unlikely to
> > be backportable given it adds new concepts to the core MM.
> > 
> > [1] https://patchwork.freedesktop.org/series/165329/
> > 
> > > But I'm starting to question the general approach: Even if the
> > > *shrinker* can recover from a total kernel memory reserve depletion,
> > > it can't really be considered a reasonable practice, since if we
> > > frequently deplete the reserves, *other* important allocations in the
> > > system like GFP_ATOMIC, PF_MEMALLOC may spuriously start to fail and
> > > people will have a hard time finding out why.
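
(For context on the reserve semantics discussed here and just below:
roughly how these masks are composed in recent kernels'
include/linux/gfp_types.h.)

	#define GFP_ATOMIC	(__GFP_HIGH | __GFP_KSWAPD_RECLAIM)
	#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM | __GFP_NOWARN)

(Neither sets __GFP_DIRECT_RECLAIM, so neither enters direct reclaim or
runs shrinkers; __GFP_HIGH is what lets GFP_ATOMIC dip into the memory
reserves.)
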
> > Wouldn’t GFP_ATOMIC enter direct reclaim, hit our shrinker, and
> > eventually make progress—i.e., take the split path if needed? I’m not
> > 100% sure, but my initial reaction is that this concern may not be
> > valid; however, MM is hard to reason about.
> 
> No, GFP_ATOMIC just uses what's available without any reclaim at all.
> It's more aggressive than GFP_NOWAIT in that it allows dipping into the
> kernel reserves.
> 

Right - wrote this before I had my coffee.

> > Again, FWIW, I’ve tried a lot of things to trigger OOM—for example,
> > running WebGL tabs and then kicking off various very memory-intensive
> > workloads from the CLI—and I still haven’t hit OOM or seen memory
> > allocation failures or warnings.
> > 
> > > So I actually don't think we can avoid the splitting without direct
> > > insertion. FWIW, up until recently when shmem started supporting
> > 
> > I agree direct insertion is a better solution. Do you think this is
> > something we could reasonably get working and backport? I haven't done
> > any research on direct insertion yet, thus why I'm asking.
> 
> Yes I think so. The problem would be to get it accepted. Looking into
> that now, but hitting various kinds of subtle issues.
> 

Ok, I'm pretty unlikely to get the shrinker work to the finish line
before I go - fine with whatever lands in either part:

- Shrinking THP should not make fragmentation worse (this patch); a
  version of this should get Xe reasonably stable, and hopefully this
  fix can be backported.
- Avoid evicting working sets under fragmentation ([1] above)

Matt

> Thanks,
> Thomas
> 
> > > huge page swapping, other GPU drivers basically also split pages at
> > > swapout.
> > 
> > I wonder if other drivers have the same issue? The deadly combo is
> > allowing GPUs to subscribe all of system memory, allocating THP (or
> > higher-order) pages, and splitting them in the shrinker. Xe might be
> > the only driver with the right combo to hit this, but I'm not 100%
> > sure without a deep dive.
> > 
> > > Another idea for improving on the compaction loop, perhaps worth
> > > trying, is this change, shamelessly stolen from i915:
> > > 
> > > https://gitlab.freedesktop.org/thomash/xe-vibe/-/commits/shrinker_batch?ref_type=heads
> > 
> > I'd have to give this a try - I'm quickly running out of time before I
> > leave for a month though.
> > 
> > Matt
> > 
> > > /Thomas
> > > 
> > > 
> > > > +{
> > > > +	struct page *page = folio_page(folio, 0);
> > > > +	int shrunken = 0, npages = 1UL << order, ret = 0, i;
> > > > +	bool folio_has_been_split = false;
> > > > +
> > > > +	for (i = 0; i < npages; ++i) {
> > > > +		s64 shandle;
> > > > +
> > > > +try_again_after_split:
> > > > +		if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > > > +		    should_fail(&backup_fault_inject, 1))
> > > > +			shandle = -ENOMEM;
> > > > +		else
> > > > +			shandle = ttm_backup_backup_page(backup, page + i,
> > > > +							 writeback, idx + i,
> > > > +							 page_gfp, alloc_gfp);
> > > > +
> > > > +		if (shandle < 0 && !folio_has_been_split && order) {
> > > > +			pgoff_t j;
> > > > +
> > > > +			/*
> > > > +			 * True OOM: could not allocate a shmem folio
> > > > +			 * for the next subpage. Fall back to splitting
> > > > +			 * the source compound and backing up subpages
> > > > +			 * individually.
> > > > +			 * Release the already-backed-up
> > > > +			 * subpages whose contents now live in shmem;
> > > > +			 * any further failure terminates the loop with
> > > > +			 * partial progress (handled by the caller).
> > > > +			 */
> > > > +			folio_has_been_split = true;
> > > > +			ttm_pool_split_for_swap(pool, page);
> > > > +
> > > > +			for (j = 0; j < i; ++j) {
> > > > +				__free_pages_gpu_account(page + j, 0, false);
> > > > +				shrunken++;
> > > > +			}
> > > > +
> > > > +			goto try_again_after_split;
> > > > +		} else if (shandle < 0) {
> > > > +			ret = shandle;
> > > > +			goto out;
> > > > +		} else if (folio_has_been_split) {
> > > > +			__free_pages_gpu_account(page + i, 0, false);
> > > > +			shrunken++;
> > > > +		}
> > > > +
> > > > +		tt->pages[idx + i] = ttm_backup_handle_to_page_ptr(shandle);
> > > > +	}
> > > > +
> > > > +	if (!folio_has_been_split) {
> > > > +		/* Compound fully backed up; free at native order. */
> > > > +		page->private = 0;
> > > > +		__free_pages_gpu_account(page, order, false);
> > > > +		shrunken += npages;
> > > > +	}
> > > > +
> > > > +out:
> > > > +	return shrunken ? shrunken : ret;
> > > > +}
> > > > +
> > > >  /**
> > > >   * ttm_pool_backup() - Back up or purge a struct ttm_tt
> > > >   * @pool: The pool used when allocating the struct ttm_tt.
> > > > @@ -1045,12 +1109,11 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > >  {
> > > >  	struct file *backup = tt->backup;
> > > >  	struct page *page;
> > > > -	unsigned long handle;
> > > >  	gfp_t alloc_gfp;
> > > >  	gfp_t gfp;
> > > >  	int ret = 0;
> > > >  	pgoff_t shrunken = 0;
> > > > -	pgoff_t i, num_pages;
> > > > +	pgoff_t i, num_pages, npages;
> > > >  
> > > >  	if (WARN_ON(ttm_tt_is_backed_up(tt)))
> > > >  		return -EINVAL;
> > > > @@ -1070,7 +1133,8 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > >  		unsigned int order;
> > > >  
> > > >  		page = tt->pages[i];
> > > > -		if (unlikely(!page)) {
> > > > +		if (unlikely(!page ||
> > > > +			     ttm_backup_page_ptr_is_handle(page))) {
> > > >  			num_pages = 1;
> > > >  			continue;
> > > >  		}
> > > > @@ -1106,26 +1170,40 @@ long ttm_pool_backup(struct ttm_pool *pool, struct ttm_tt *tt,
> > > >  	if (IS_ENABLED(CONFIG_FAULT_INJECTION) &&
> > > >  	    should_fail(&backup_fault_inject, 1))
> > > >  		num_pages = DIV_ROUND_UP(num_pages, 2);
> > > >  
> > > > -	for (i = 0; i < num_pages; ++i) {
> > > > -		s64 shandle;
> > > > +	for (i = 0; i < num_pages; i += npages) {
> > > > +		unsigned int order;
> > > >  
> > > > +		npages = 1;
> > > >  		page = tt->pages[i];
> > > >  		if (unlikely(!page))
> > > >  			continue;
> > > >  
> > > > -		ttm_pool_split_for_swap(pool, page);
> > > > +		/* Already-handled entry from a previous attempt. */
> > > > +		if (unlikely(ttm_backup_page_ptr_is_handle(page)))
> > > > +			continue;
> > > >  
> > > > -		shandle = ttm_backup_backup_page(backup, page, flags->writeback, i,
> > > > -						 gfp, alloc_gfp);
> > > > -		if (shandle < 0) {
> > > > -			/* We allow partially shrunken tts */
> > > > -			ret = shandle;
> > > > +		order = ttm_pool_page_order(pool, page);
> > > > +		npages = 1UL << order;
> > > > +
> > > > +		/*
> > > > +		 * Back up the compound atomically at its native order. If
> > > > +		 * fault injection truncated num_pages mid-compound, skip
> > > > +		 * the partial tail rather than splitting.
> > > > +		 */
> > > > +		if (unlikely(i + npages > num_pages))
> > > > +			break;
> > > > +
> > > > +		ret = ttm_pool_backup_folio(pool, tt, backup, page_folio(page),
> > > > +					    order, flags->writeback, i, gfp,
> > > > +					    alloc_gfp);
> > > > +		if (unlikely(ret < 0))
> > > > +			break;
> > > > +
> > > > +		shrunken += ret;
> > > > +
> > > > +		/* partial backup */
> > > > +		if (unlikely(ret != npages))
> > > >  			break;
> > > > -		}
> > > > -		handle = shandle;
> > > > -		tt->pages[i] = ttm_backup_handle_to_page_ptr(handle);
> > > > -		__free_pages_gpu_account(page, 0, false);
> > > > -		shrunken++;
> > > >  	}
> > > >  
> > > >  	return shrunken ? shrunken : ret;
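
(To make that final return contract concrete: a hedged, hypothetical
caller-side sketch, not part of the patch.)

	struct ttm_backup_flags flags = { .writeback = true };
	long shrunk = ttm_pool_backup(pool, tt, &flags);

	if (shrunk > 0) {
		/* Forward progress: 'shrunk' subpages now live in shmem,
		 * even if the tt was only partially backed up. */
	} else if (shrunk < 0) {
		/* Hard failure before any subpage could be backed up. */
	}

(Per "return shrunken ? shrunken : ret;", partial progress is reported
as a positive count; an error is surfaced only when nothing at all was
backed up.)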