From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 21 Jan 2025 10:43:59 +0900
From: Byungchul Park
To: Vinay Banakar
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, willy@infradead.org, mgorman@suse.de,
	Wei Xu, Greg Thelen, kernel_team@skhynix.com
Subject: Re: [PATCH] mm: Optimize TLB flushes during page reclaim
Message-ID: <20250121014359.GA60549@system.software.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.9.4 (2018-02-28)

On Mon, Jan 20, 2025 at 04:47:29PM -0600, Vinay Banakar wrote:
> The current implementation in shrink_folio_list() performs full TLB
> flushes and issues IPIs for each individual page being reclaimed.
> This causes unnecessary overhead during memory reclaim, whether
> triggered by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios
> where applications are actively moving cold pages to swap while
> maintaining high performance requirements for hot pages.
> 
> The current code:
> 1. Clears the PTE and unmaps each page individually
> 2. Performs a full TLB flush on all cores using the VMA (via CR3 write) or
>    issues individual TLB shootdowns (invlpg+invpcid) for single-core usage
> 3. Submits each page individually to BIO
> 
> This approach results in:
> - Excessive full TLB flushes across all cores
> - Unnecessary IPI storms when processing multiple pages
> - Suboptimal I/O submission patterns
> 
> I initially tried using selective TLB shootdowns (invlpg) instead of
> a full TLB flush for each page to avoid interference with other
> threads. However, this approach still required sending IPIs to all
> cores for each page, which did not significantly improve application
> throughput.
> 
> This patch instead optimizes the process by batching operations,
> issuing one IPI per PMD instead of per page. This reduces interrupts
> by a factor of 512 and enables batching page submissions to BIO. The
> new approach:
> 1. Collect the dirty pages that need to be written back
> 2. Issue a single TLB flush for all dirty pages in the batch
> 3. Process the collected pages for writeback (submit them to BIO)

The *interesting* IPIs will be reduced to 1/512 at most.  Can we see
the improvement numbers?

	Byungchul

> Testing shows a significant reduction in application throughput impact
> during page-out operations. Applications maintain better performance
> during memory reclaim, especially when it is triggered by explicit
> madvise(MADV_PAGEOUT) calls.
> 
> I'd appreciate your feedback on this approach, especially on the
> correctness of the batched BIO submissions. Looking forward to your
> comments.
> 
> Signed-off-by: Vinay Banakar
> ---
>  mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
>  1 file changed, 74 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bd489c1af..1bd510622 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	struct folio_batch free_folios;
>  	LIST_HEAD(ret_folios);
>  	LIST_HEAD(demote_folios);
> +	LIST_HEAD(pageout_list);
>  	unsigned int nr_reclaimed = 0;
>  	unsigned int pgactivate = 0;
>  	bool do_demote_pass;
> @@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  			if (!sc->may_writepage)
>  				goto keep_locked;
> 
> -			/*
> -			 * Folio is dirty. Flush the TLB if a writable entry
> -			 * potentially exists to avoid CPU writes after I/O
> -			 * starts and then write it out here.
> -			 */
> -			try_to_unmap_flush_dirty();
> -			switch (pageout(folio, mapping, &plug)) {
> -			case PAGE_KEEP:
> -				goto keep_locked;
> -			case PAGE_ACTIVATE:
> -				goto activate_locked;
> -			case PAGE_SUCCESS:
> -				stat->nr_pageout += nr_pages;
> -
> -				if (folio_test_writeback(folio))
> -					goto keep;
> -				if (folio_test_dirty(folio))
> -					goto keep;
> -
> -				/*
> -				 * A synchronous write - probably a ramdisk. Go
> -				 * ahead and try to reclaim the folio.
> -				 */
> -				if (!folio_trylock(folio))
> -					goto keep;
> -				if (folio_test_dirty(folio) ||
> -				    folio_test_writeback(folio))
> -					goto keep_locked;
> -				mapping = folio_mapping(folio);
> -				fallthrough;
> -			case PAGE_CLEAN:
> -				; /* try to free the folio below */
> -			}
> +			/* Add to pageout list for deferred bio submissions */
> +			list_add(&folio->lru, &pageout_list);
> +			continue;
>  		}
> 
>  		/*
> @@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  	}
>  	/* 'folio_list' is always empty here */
> 
> +	if (!list_empty(&pageout_list)) {
> +		/*
> +		 * Batch TLB flushes by flushing once before processing all dirty pages.
> +		 * Since we operate on one PMD at a time, this batches TLB flushes at
> +		 * PMD granularity rather than per-page, reducing IPIs.
> +		 */
> +		struct address_space *mapping;
> +		try_to_unmap_flush_dirty();
> +
> +		while (!list_empty(&pageout_list)) {
> +			struct folio *folio = lru_to_folio(&pageout_list);
> +			list_del(&folio->lru);
> +
> +			/* Recheck if the page got reactivated */
> +			if (folio_test_active(folio) ||
> +			    (folio_mapped(folio) && folio_test_young(folio)))
> +				goto skip_pageout_locked;
> +
> +			mapping = folio_mapping(folio);
> +			pageout_t pageout_res = pageout(folio, mapping, &plug);
> +			switch (pageout_res) {
> +			case PAGE_KEEP:
> +				goto skip_pageout_locked;
> +			case PAGE_ACTIVATE:
> +				goto skip_pageout_locked;
> +			case PAGE_SUCCESS:
> +				stat->nr_pageout += folio_nr_pages(folio);
> +
> +				if (folio_test_writeback(folio) ||
> +				    folio_test_dirty(folio))
> +					goto skip_pageout;
> +
> +				/*
> +				 * A synchronous write - probably a ramdisk. Go
> +				 * ahead and try to reclaim the folio.
> +				 */
> +				if (!folio_trylock(folio))
> +					goto skip_pageout;
> +				if (folio_test_dirty(folio) ||
> +				    folio_test_writeback(folio))
> +					goto skip_pageout_locked;
> +
> +				/* Try to free the page */
> +				if (!mapping ||
> +				    !__remove_mapping(mapping, folio, true,
> +						      sc->target_mem_cgroup))
> +					goto skip_pageout_locked;
> +
> +				nr_reclaimed += folio_nr_pages(folio);
> +				folio_unlock(folio);
> +				continue;
> +
> +			case PAGE_CLEAN:
> +				if (!mapping ||
> +				    !__remove_mapping(mapping, folio, true,
> +						      sc->target_mem_cgroup))
> +					goto skip_pageout_locked;
> +
> +				nr_reclaimed += folio_nr_pages(folio);
> +				folio_unlock(folio);
> +				continue;
> +			}
> +
> +skip_pageout_locked:
> +			folio_unlock(folio);
> +skip_pageout:
> +			list_add(&folio->lru, &ret_folios);
> +		}
> +	}
> +
>  	/* Migrate folios selected for demotion */
>  	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
>  	/* Folios that could not be demoted are still in @demote_folios */