Date: Fri, 31 May 2024 04:12:37 +0100
From: Matthew Wilcox
To: Chris Li
Cc: Karim Manaouil, Jan Kara, Chuanhua Han, linux-mm,
    lsf-pc@lists.linux-foundation.org, ryan.roberts@arm.com,
    21cnbao@gmail.com, david@redhat.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
References: <039190fb-81da-c9b3-3f33-70069cdb27b0@oppo.com>
    <20240307140344.4wlumk6zxustylh6@quack3>

On Thu, May 30, 2024 at 03:53:49PM -0700, Chris Li wrote:
> On Wed, May 29, 2024 at 5:33 AM Matthew Wilcox wrote:
> > > In the anonymous memory case, the dirty page does not have to be
> > > written to swap.
> > > It is optional, so which page you choose to swap out is
> > > critical: you want to swap out the coldest page, the page that is
> > > least likely to be swapped in. Therefore, the LRU makes sense.
> >
> > Disagree. There are two things you want and the LRU serves neither
> > particularly well. One is that when you want to reclaim memory, you
> > want to find some memory that is likely to not be accessed in the next
> > few seconds/minutes/hours. It doesn't need to be the coldest, just in
> > (say) the coldest 10% or so of memory. And it needs to already be clean,
> > otherwise you have to wait for it to be written back, and you can't
> > afford that.
>
> Do you disagree that the LRU is necessary, or with the way we use the LRU?

I think we should switch to a scheme where we just don't use an LRU at all.

> To get the coldest 10% or so of pages, I assume you still need to
> maintain an LRU, no?

I don't think that's true. If you reframe the problem as "we need to find
some of the coldest pages in the system", then you can use a different
scheme.

> > The second thing you need to be able to do is find pages which are
> > already dirty, and not likely to be written to soon, and write those
> > back so they join the pool of clean pages which are eligible for reclaim.
> > Again, the LRU isn't really the best tool for the job.
>
> It seems you need the LRU to find which pages qualify for writeback.
> They should be both dirty and cold.
>
> The question is, can you do the reclaim writeback without the LRU for
> anonymous pages?
> If the LRU is unavoidable, then it is a necessary evil.

The point I was trying to make is that a simple physical scan is 40x
faster than an LRU scan. So if you just scan N pages, starting from
wherever you left off the scan last time, and even 1/10 of them are
eligible for reclaim (not referenced since the clock hand last swept
past them, perhaps), you're still reclaiming 4x as many pages in the
same time as an LRU scan does.

> > > In VMA swap out, the question is, which VMA do you choose first?
To > > > make things more complicated, the same page can map into different > > > processes in more than one VMA as well. > > > > This is why we have the anon_vma, to handle the same pages mapped from > > multiple VMAs. > > Can you clarify when you use anon_vma to organize the swap out and > swap in, do you want to write a range of pages rather than just one > page at a time? Will write back a sub list of the LRU work for you? > Ideally we shouldn't write back pages that are hot. anon_vma alone > does not give us that information. So filesystems do write back all pages in an inode that are dirty, regardless of whether they're hot. But, as noted, we do like to get the pagecache written back periodically even if the pages are going to be redirtied soon. And this is somewhere that I think there's a difference between anon & file pages. So maybe the algorithm looks something like this: A: write page fault causes page to be created B: scan unmaps page, marks it dirty, does not start writeout C: scan finds dirty, unmapped anon page, starts writeout D: scan finds clean unmapped anon page, frees it so it will actually take three trips around the whole of memory for the physical scan to evict an anon page. That should be adequate time for a workload to fault back in a page that's actually hot. (if a page fault finds a page in state B, it transitions back to state A and gets three more trips around the clock).