From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 11 Mar 2024 03:41:35 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: On the optimum size of a batch
Message-ID: <Ze59byUR80z42m8R@casper.infradead.org>
References: <Zeoble0xJQYEAriE@casper.infradead.org>
 <Ze5onaXsI+LT1+Be@dread.disaster.area>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Ze5onaXsI+LT1+Be@dread.disaster.area>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Mon, Mar 11, 2024 at 01:12:45PM +1100, Dave Chinner wrote:
> > Batch size      Cost of allocating 100          thousand        million
> > 1               500 (5 * 100)                   5000            5M
> > 2               300 (6 * 50)                    3000            3M
> > 4               200 (8 * 25)                    2000            2M
> > 8               156 (12 * 13)                   1500            1.5M
> > 16              140 (20 * 7)                    1260            1.25M
> > 32              144 (36 * 4)                    1152            1.13M
> > 64              136 (68 * 2)                    1088            1.06M
> > 128             132 (132 * 1)                   1056            1.03M
> 
> Isn't this just repeating the fundamental observation that SLUB is
> based on?  i.e. it can use high-order pages so that it can
> pre-allocate optimally sized batches of objects regardless of their
> size? i.e.  it tries to size the backing page order to allocate in
> chunks of 30-40 objects at a time?
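
For anyone checking the arithmetic: the quoted table is consistent with
a model where each batch costs a fixed 4 units plus 1 unit per object,
and a partially used final batch is charged in full.  A minimal Python
sketch of that model (the function name and default values are
illustrative assumptions, not taken from any kernel code):

```python
def total_cost(n_objects: int, batch: int, overhead: int = 4, per_object: int = 1) -> int:
    """Cost of allocating n_objects in batches of size `batch`.

    Assumed model: every batch pays `overhead` once plus `per_object`
    for each slot in the batch, and the last batch is paid for in full
    even if only part of it is needed.
    """
    batches = (n_objects + batch - 1) // batch  # integer ceiling division
    return batches * (overhead + per_object * batch)


if __name__ == "__main__":
    for batch in (1, 2, 4, 8, 16, 32, 64, 128):
        print(batch, total_cost(100, batch), total_cost(1000, batch))
```

Under this model the per-object cost approaches 1 unit as the batch
grows, which is why the curve flattens out well before 128.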

What SLUB is currently doing is inefficient.  One of the conversations
I had (off-list) about appropriate batch size is in relation to SLUB-ng
where one of the participants claimed that the batch size of 32 was
obviously too small (it wasn't; the performance problem was due to a
bug).

What you're thinking about is the cost of allocating from the page
allocator (which is now much cheaper than it used to be, but should
be cheaper than it currently is).  But there is another inefficiency
to consider, which is that the slab allocator has a per-slab lock,
and while you can efficiently remove and add a number of objects to
a single slab, you might only have one or two free objects per slab.
To work around this, some of the more performance-sensitive parts of the
kernel have implemented their own allocator in front of slab.  This is
clearly a bad thing for all of us, and hence Vlastimil has been working
on a better approach.

https://lore.kernel.org/linux-mm/20231129-slub-percpu-caches-v3-0-6bcf536772bc@suse.cz/

> Except for SLUB we're actually allocating in the hundreds of
> millions to billions of objects on machines with TBs of RAM. IOWs we
> really want to be much further down the curve than 8 - batches of at
> least 32-64 have significantly lower cost and that matters when
> scaling to (and beyond) hundreds of millions of objects....

But that doesn't necessarily mean that you want a larger batch size.
Because you're not just allocating, you're also freeing and over a
large enough timescale the number of objects allocated and freed is
approximately equal.  In the SLUB case, your batch size needs to be
large enough to absorb most of the allocation-vs-free bias jitter; that
is, if you know they always alternate AFAFAFAFAF, a batch size of 2 would
be fine.  If you know you get four allocations followed by four frees,
having a batch size of 5 would be fine.  We'd never go to the parent
allocator if we got a AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF pattern.
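
A rough way to play with this is a toy simulation.  The sketch below is
a deliberately simplified model and not how SLUB behaves: it assumes
the cache holds at most `batch` objects, starts full, refills
completely when an allocation finds it empty, and flushes completely
when a free finds it full.  The function name is hypothetical.

```python
def parent_trips(pattern: str, batch: int) -> int:
    """Count trips to the parent allocator for a string of 'A'/'F' events.

    Toy model (assumption): the cache starts with `batch` objects.
    An 'A' on an empty cache refills it to `batch` (one trip); an 'F'
    on a full cache flushes it to empty (one trip).
    """
    cached = batch
    trips = 0
    for event in pattern:
        if event == "A":
            if cached == 0:
                cached = batch  # refill from the parent allocator
                trips += 1
            cached -= 1
        elif event == "F":
            if cached == batch:
                cached = 0  # flush back to the parent allocator
                trips += 1
            cached += 1
        else:
            raise ValueError(f"unexpected event {event!r}")
    return trips


if __name__ == "__main__":
    mixed = "AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF"
    print(parent_trips("AF" * 5, 2))      # 0
    print(parent_trips("AAAAFFFF", 5))    # 0
    print(parent_trips(mixed, 5))         # 0
    print(parent_trips(mixed, 3))
```

In this model the three patterns above never reach the parent
allocator at the stated batch sizes, while shrinking the batch below
the deepest run of allocations in the mixed pattern does.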

> > This is a simple model for only one situation.  If we have a locking
> > contention breakdown, the overhead cost might be much higher than 4 units,
> > and that would lead us to a larger batch size.
> > 
> > Another consideration is how much of each object we have to touch.
> > put_pages_list() is frequently called with batches of 500 pages.  In order
> > to free a folio, we have to manipulate its contents, so touching at least
> > one cacheline per object.
> 
> Right, that's simply the cost of the batch cache footprint issue
> rather than a "fixed cost mitigation" described for allocation.

No, it's not, it's an illustration that too large a batch size can
actively harm you.
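
To put a number on it, here is back-of-the-envelope arithmetic for the
minimum cache footprint of a batch, assuming 64-byte cachelines (common
but not universal) and exactly one line touched per object.  The names
are illustrative only.

```python
CACHELINE_BYTES = 64  # assumption: typical for x86-64 and many arm64 CPUs


def min_footprint_bytes(batch: int, lines_per_object: int = 1) -> int:
    """Lower bound on bytes of cache touched while processing one batch."""
    return batch * lines_per_object * CACHELINE_BYTES


if __name__ == "__main__":
    print(min_footprint_bytes(500))  # 32000
    print(min_footprint_bytes(15))   # 960
```

A batch of 500 therefore touches at least 32000 bytes, which is about
the whole of a 32KiB L1 data cache where that is the L1 size (again an
assumption about the hardware), whereas a batch of 15 touches under
1KiB.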

> So I'm not sure what you're trying to say here? We've known about
> these batch optimisation considerations for a long, long time and
> that batch size optimisation is always algorithm and access pattern
> dependent, so.... ???

People forget these "things we've always known".  I went looking and
couldn't find a good writeup of this, so did my own.  In addition to the
percpu slub batch size, various people have opined that the folio_batch
size (15 objects) is too small for doing things like writeback and
readahead.  They're going to have to bring data to convince me.

> > And we make multiple passes over the batch,
> > first decrementing the refcount, removing it from the lru list; second
> > uncharging the folios from the memcg (writes to folio->memcg_data);
> > third calling free_pages_prepare which, eg, sets ->mapping to NULL;
> > fourth putting the folio on the pcp list (writing to the list_head).
> 
> Sounds like "batch cache footprint" would be reduced by inverting
> that algorithm and doing all the work on a single object in a single
> pass, rather than doing it in multiple passes.  That way the cache
> footprint of the batch is determined entirely by the size of the
> data structures accessed to process each object in the batch.

Well, now you're just opining without having studied the problem, and
I have, so I can say confidently that you're wrong.  You could read
the code if you like.