Date: Mon, 11 Mar 2024 03:41:35 +0000
From: Matthew Wilcox
To: Dave Chinner
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: On the optimum size of a batch

On Mon, Mar 11, 2024 at 01:12:45PM +1100, Dave Chinner wrote:
> > Batch size    Cost of allocating 100    thousand    million
> > 1             500 (5 * 100)             5000        5M
> > 2             300 (6 * 50)              3000        3M
> > 4             200 (8 * 25)              2000        2M
> > 8             156 (12 * 13)             1500        1.5M
> > 16            140 (20 * 7)              1260        1.25M
> > 32            144 (36 * 4)              1152        1.13M
> > 64            136 (68 * 2)              1088        1.06M
> > 128           132 (132 * 1)             1056        1.03M
>
> Isn't this just repeating the fundamental observation that SLUB is
> based on? i.e. it can use high-order pages so that it can
> pre-allocate optimally sized batches of objects regardless of their
> size? i.e. it tries to size the backing page order to allocate in
> chunks of 30-40 objects at a time?

What SLUB is currently doing is inefficient.
One of the conversations I had (off-list) about appropriate batch size
is in relation to SLUB-ng where one of the participants claimed that
the batch size of 32 was obviously too small (it wasn't; the
performance problem was due to a bug).

What you're thinking about is the cost of allocating from the page
allocator (which is now much cheaper than it used to be, but should be
cheaper than it currently is).  But there is another inefficiency to
consider, which is that the slab allocator has a per-slab lock, and
while you can efficiently remove and add a number of objects to a
single slab, you might only have one or two free objects per slab.

To work around this, some of the more performance-sensitive parts of
the kernel have implemented their own allocator in front of slab.
This is clearly a bad thing for all of us, and hence Vlastimil has
been working on a better approach.

https://lore.kernel.org/linux-mm/20231129-slub-percpu-caches-v3-0-6bcf536772bc@suse.cz/

> Except for SLUB we're actually allocating in the hundreds of
> millions to billions of objects on machines with TBs of RAM. IOWs we
> really want to be much further down the curve than 8 - batches of at
> least 32-64 have significantly lower cost and that matters when
> scaling to (and beyond) hundreds of millions of objects....

But that doesn't necessarily mean that you want a larger batch size.
Because you're not just allocating, you're also freeing, and over a
large enough timescale the number of objects allocated and freed is
approximately equal.  In the SLUB case, your batch size needs to be
large enough to absorb most of the allocation-vs-free bias jitter;
that is, if you know they always alternate AFAFAFAFAF, a batch size of
2 would be fine.  If you know you get four allocations followed by
four frees, having a batch size of 5 would be fine.  We'd never go to
the parent allocator if we got an AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF
pattern.

> > This is a simple model for only one situation.
> > If we have a locking contention breakdown, the overhead cost might
> > be much higher than 4 units, and that would lead us to a larger
> > batch size.
> >
> > Another consideration is how much of each object we have to touch.
> > put_pages_list() is frequently called with batches of 500 pages.  In
> > order to free a folio, we have to manipulate its contents, so
> > touching at least one cacheline per object.
>
> Right, that's simply the cost of the batch cache footprint issue
> rather than a "fixed cost mitigation" described for allocation.

No, it's not, it's an illustration that too large a batch size can
actively harm you.

> So I'm not sure what you're trying to say here? We've known about
> these batch optimisation considerations for a long, long time and
> that batch size optimisation is always algorithm and access pattern
> dependent, so.... ???

People forget these "things we've always known".  I went looking and
couldn't find a good writeup of this, so did my own.  In addition to
the percpu slub batch size, various people have opined that the
folio_batch size (15 objects) is too small for doing things like
writeback and readahead.  They're going to have to bring data to
convince me.

> > And we make multiple passes over the batch,
> > first decrementing the refcount, removing it from the lru list; second
> > uncharging the folios from the memcg (writes to folio->memcg_data);
> > third calling free_pages_prepare which, eg, sets ->mapping to NULL;
> > fourth putting the folio on the pcp list (writing to the list_head).
>
> Sounds like "batch cache footprint" would be reduced by inverting
> that algorithm and doing all the work on a single object in a single
> pass, rather than doing it in multiple passes. That way the cache
> footprint of the batch is determined entirely by the size of the
> data structures accessed to process each object in the batch.

Well, now you're just opining without having studied the problem, and I have, so I can say confidently that you're wrong. You could read the code if you like.
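
For anyone who wants to poke at the two models discussed above, here is a
minimal sketch, not kernel code.  batch_cost() assumes each trip costs a
fixed overhead of 4 units plus 1 unit per object, with partial batches
rounded up; the per-object cost of 1 is inferred from the figures in the
quoted table.  parent_trips() is a hypothetical cache that holds at most
`batch` objects and starts full (my assumption), refilling when an
allocation finds it empty and flushing when a free finds it full.  Both
function names are made up for illustration.

```python
def batch_cost(n_objects, batch, overhead=4, per_object=1):
    """Cost of allocating n_objects in whole batches of `batch` objects.

    Assumed model: every trip costs a fixed `overhead` plus `per_object`
    for each object in the batch, and partial batches are rounded up.
    """
    n_batches = -(-n_objects // batch)  # ceiling division
    return n_batches * (overhead + per_object * batch)


def parent_trips(pattern, batch):
    """Count trips to the parent allocator for a string of 'A'/'F' events.

    Hypothetical cache: holds at most `batch` objects and starts full.
    Allocating from an empty cache refills it; freeing into a full
    cache flushes it.  Each refill or flush counts as one trip.
    """
    held = batch
    trips = 0
    for event in pattern:
        if event == "A":
            if held == 0:        # nothing cached: refill from the parent
                held = batch
                trips += 1
            held -= 1
        elif event == "F":
            if held == batch:    # no room: flush to the parent
                held = 0
                trips += 1
            held += 1
        else:
            raise ValueError(f"unknown event {event!r}")
    return trips


if __name__ == "__main__":
    # Columns: batch size, cost for 100 objects, cost for 1000 objects.
    for batch in (1, 2, 4, 8, 16, 32, 64, 128):
        print(batch, batch_cost(100, batch), batch_cost(1000, batch))

    # Columns: batch size, trips to the parent for the pattern above.
    pattern = "AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF"
    for batch in (2, 5):
        print(batch, parent_trips(pattern, batch))
```

Changing `overhead` is the quickest way to see how a more expensive trip
(lock contention, say) pushes the sweet spot towards larger batches, and
feeding parent_trips() a real allocation trace instead of a toy string
would be the obvious next step.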