From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 7 Mar 2024 19:55:01 +0000
From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: On the optimum size of a batch
Message-ID: <Zeoble0xJQYEAriE@casper.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

I've had a few conversations recently about how many objects should be
in a batch in some disparate contexts, so I thought I'd write down my
opinion so I can refer to it in future.  TLDR: Start with a batch size
of around 10, and adjust it when measurements tell you to change it.

As a simple model, let's look at the cost of allocating N objects from an
allocator.  Assume there's a fixed cost, say 4 (units are not relevant
here), for going into the allocator and then there's a 1 unit cost per
object (eg we're taking a spinlock, pulling N objects out of the data
structure and releasing the spinlock again).

Allocating 100 objects one at a time would cost 500 units (100 * 5).  Our
best case is that we could save 396 units by allocating a single batch of
100 (4 + 100 = 104 units).  But we probably don't know how many objects
we're going to need to allocate, so we pull objects from the allocator in
smaller batches.  Here's a table showing the costs for different batch
sizes (in brackets: cost per batch * number of batches):

Batch size      Cost of allocating 100          thousand        million
1               500 (5 * 100)                   5000            5M
2               300 (6 * 50)                    3000            3M
4               200 (8 * 25)                    2000            2M
8               156 (12 * 13)                   1500            1.5M
16              140 (20 * 7)                    1260            1.25M
32              144 (36 * 4)                    1152            1.13M
64              136 (68 * 2)                    1088            1.06M
128             132 (132 * 1)                   1056            1.03M
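
As a minimal sketch, the table above can be reproduced with a few lines
of Python.  batch_cost() and its parameter names are made up for this
illustration; the fixed cost of 4 and per-object cost of 1 are the
values assumed in the model, and the number of batches is rounded up
because a partial batch still pays for a full one.

```python
def batch_cost(total, batch, fixed=4, per_object=1):
    """Cost of getting at least `total` objects in batches of `batch`.

    Hypothetical helper: `fixed` is the cost of entering the allocator,
    `per_object` the cost of each object pulled out of it.
    """
    calls = (total + batch - 1) // batch      # round up to whole batches
    return calls * (fixed + per_object * batch)


if __name__ == "__main__":
    print("batch     100    1000  1000000")
    for batch in (1, 2, 4, 8, 16, 32, 64, 128):
        costs = [batch_cost(n, batch) for n in (100, 1000, 1_000_000)]
        print(f"{batch:5d} {costs[0]:7d} {costs[1]:7d} {costs[2]:8d}")
```

Changing `fixed` or `per_object` shows how the knee moves for a
different cost model.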

You can see the knee of this curve is around 8.  It fluctuates a bit after
that depending on how many "left over" objects we have after allocating
the 100 it turned out that we needed.  Even if we think we're going to
be dealing with a _lot_ of objects (the thousand and million columns),
we've got most of the advantage by the time we get to 8 (eg a reduction
of 3.5M from a total possible reduction of 4M), and while I wouldn't
sneeze at getting a few more percentage points of overhead reduction,
we're scrabbling at the margins now, not getting big wins.
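
To put a number on "most of the advantage", here is a rough sketch that
computes what fraction of the possible saving a given batch size
captures.  The function names are hypothetical, and the "ideal" cost is
taken to be the per-object cost alone (no fixed overhead at all), which
is where the 4M figure for a million objects comes from.

```python
def batch_cost(total, batch, fixed=4, per_object=1):
    calls = (total + batch - 1) // batch      # round up to whole batches
    return calls * (fixed + per_object * batch)


def savings_captured(total, batch, fixed=4, per_object=1):
    """Fraction of the possible overhead reduction achieved by `batch`."""
    worst = batch_cost(total, 1, fixed, per_object)   # one object per call
    ideal = total * per_object                        # no fixed cost at all
    actual = batch_cost(total, batch, fixed, per_object)
    return (worst - actual) / (worst - ideal)


if __name__ == "__main__":
    for batch in (8, 16, 128):
        print(batch, savings_captured(1_000_000, batch))
```

For a million objects, a batch size of 8 gives 0.875, ie 3.5M of the
possible 4M.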

This is a simple model for only one situation.  If we have a locking
contention breakdown, the overhead cost might be much higher than 4 units,
and that would lead us to a larger batch size.
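
One way to illustrate this, under an assumption that is mine rather than
part of the model above: pick the smallest batch size for which the
fixed cost, amortised over the batch, is no more than some fraction of
the per-object cost.  The 50% threshold below is arbitrary; it was
chosen only because it gives 8 for a fixed cost of 4.  The function name
is hypothetical.

```python
import math


def min_batch_for_overhead(fixed, per_object=1, max_overhead=0.5):
    """Smallest batch with fixed / batch <= max_overhead * per_object.

    `max_overhead` is an arbitrary threshold (an assumption of this
    sketch), not something measured.
    """
    return math.ceil(fixed / (max_overhead * per_object))


if __name__ == "__main__":
    for fixed in (4, 16, 40):
        print(fixed, min_batch_for_overhead(fixed))
```

Under this criterion the batch size scales linearly with the fixed cost,
so a tenfold increase in overhead (4 to 40) moves the answer from 8
to 80.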

Another consideration is how much of each object we have to touch.
put_pages_list() is frequently called with batches of 500 pages.  In order
to free a folio, we have to manipulate its contents, so touching at least
one cacheline per object.  And we make multiple passes over the batch:
first decrementing the refcount and removing the folio from the LRU list;
second uncharging the folios from the memcg (writes to folio->memcg_data);
third calling free_pages_prepare() which, eg, sets ->mapping to NULL;
fourth putting the folio on the pcp list (writing to the list_head).

With 500 folios on the list, that uses 500 * 64 = 32000 bytes of cache,
which just barely fits into the L1 cache of a modern laptop CPU (never
mind whatever else we might want to have in the L1).  Capping the batch
size at 15 (as my recent patches do) uses just under 1kB of L1 (15 * 64 =
960 bytes), which is a much more reasonable amount of cache to take up.
We can be pretty sure the first folio is still in L1 when the last one
has finished being processed.
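
The cache arithmetic can be sketched the same way.  The 64-byte
cacheline and one line touched per object come from the text above; the
32KiB L1 data cache size is my assumption for "a modern laptop CPU", and
the names are made up for this example.

```python
CACHELINE_BYTES = 64        # one cacheline touched per object, as above
L1D_BYTES = 32 * 1024       # ASSUMPTION: 32KiB L1 data cache


def batch_footprint(batch, lines_per_object=1):
    """Bytes of cache touched when processing `batch` objects."""
    return batch * lines_per_object * CACHELINE_BYTES


if __name__ == "__main__":
    for batch in (500, 15):
        used = batch_footprint(batch)
        share = 100 * used / L1D_BYTES
        print(f"{batch:4d} objects: {used:6d} bytes, {share:.1f}% of L1")
```

With these assumptions, 500 objects take about 97.7% of the L1 and 15
objects about 2.9%.  If freeing an object touches more than one
cacheline, raise `lines_per_object` accordingly.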