Message-ID: <5b07232a4bdbf99cdd117c595eb897bb4eeb02ce.camel@linux.ibm.com>
Subject: Re: [PATCH] [RFC PATCH v2] mm/slub: Optimize slub memory usage
From: Jay Patel <jaypatel@linux.ibm.com>
Reply-To: jaypatel@linux.ibm.com
To: Vlastimil Babka, Hyeonggon Yoo <42.hyeyoo@gmail.com>, Feng Tang
Cc: "Sang, Oliver", oe-lkp@lists.linux.dev, lkp, linux-mm@kvack.org,
 "Huang, Ying", "Yin, Fengwei", cl@linux.com, penberg@kernel.org,
 rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org,
 aneesh.kumar@linux.ibm.com, tsahu@linux.ibm.com, piyushs@linux.ibm.com
Date: Thu, 10 Aug 2023 16:08:56 +0530
In-Reply-To: <91bd907e-adc0-d7c7-7eaa-da199689c99c@suse.cz>
References: <20230628095740.589893-1-jaypatel@linux.ibm.com>
 <202307172140.3b34825a-oliver.sang@intel.com>
 <91bd907e-adc0-d7c7-7eaa-da199689c99c@suse.cz>

On Wed, 2023-07-26 at 12:06 +0200, Vlastimil Babka wrote:
> On 7/25/23 05:13, Hyeonggon Yoo wrote:
> > On Mon, Jul 24, 2023 at 11:43 PM Feng Tang wrote:
> > > On Thu, Jul 20, 2023 at 11:05:17PM +0800, Hyeonggon Yoo wrote:
> > > > > > > let me introduce our test process.
> > > > > > >
> > > > > > > we make sure the tests upon a commit and its parent have the
> > > > > > > exact same environment except the kernel difference, and we
> > > > > > > also make sure the configs to build the commit and its
> > > > > > > parent are identical.
> > > > > > >
> > > > > > > we run tests for one commit at least 6 times to make sure
> > > > > > > the data is stable.
> > > > > > >
> > > > > > > such as for this case, we rebuilt the commit's and its
> > > > > > > parent's kernels; the config is attached FYI.
> > > > > >
> > > > > > Hello Oliver,
> > > > > >
> > > > > > Thank you for confirming the testing environment is totally
> > > > > > fine. And I'm sorry, I didn't mean to imply that your tests
> > > > > > were bad.
> > > > > >
> > > > > > It was more like "oh, the data totally doesn't make sense to
> > > > > > me" and I blamed the tests rather than my poor understanding
> > > > > > of the data ;)
> > > > > >
> > > > > > Anyway, as the data shows a repeatable regression, let's
> > > > > > think more about the possible scenario:
> > > > > >
> > > > > > I can't stop thinking that the patch must've affected the
> > > > > > system's reclamation behavior in some way.
> > > > > > (I think more active anon pages with a similar total number
> > > > > > of anon pages implies the kernel scanned more pages.)
> > > > > >
> > > > > > It might be because kswapd was woken up more frequently
> > > > > > (possible if skbs were allocated with GFP_ATOMIC), but the
> > > > > > data provided is not enough to support this argument.
> > > > > >
> > > > > > >   2.43 ± 7%   +4.5   6.90 ± 11%  perf-profile.children.cycles-pp.get_partial_node
> > > > > > >   3.23 ± 5%   +4.5   7.77 ±  9%  perf-profile.children.cycles-pp.___slab_alloc
> > > > > > >   7.51 ± 2%   +4.6  12.11 ±  5%  perf-profile.children.cycles-pp.kmalloc_reserve
> > > > > > >   6.94 ± 2%   +4.7  11.62 ±  6%  perf-profile.children.cycles-pp.__kmalloc_node_track_caller
> > > > > > >   6.46 ± 2%   +4.8  11.22 ±  6%  perf-profile.children.cycles-pp.__kmem_cache_alloc_node
> > > > > > >   8.48 ± 4%   +7.9  16.42 ±  8%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> > > > > > >   6.12 ± 6%   +8.6  14.74 ±  9%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> > > > > >
> > > > > > And this increase in cycles in the SLUB slow path implies
> > > > > > that the actual number of objects available in the per-cpu
> > > > > > partial list has decreased, possibly because of inaccuracy in
> > > > > > the heuristic?
> > > > > > (Because of the assumption that the slabs cached per cpu are
> > > > > > half-filled, and that the slabs' order is s->oo.)
> > > > >
> > > > > From the patch:
> > > > >
> > > > >  static unsigned int slub_max_order =
> > > > > -	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
> > > > > +	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > > > >
> > > > > Could this be related? It reduces the order for some slab
> > > > > caches, so each per-cpu slab will have fewer objects, which
> > > > > makes the contention on the per-node spinlock 'list_lock' more
> > > > > severe when slab allocation is under pressure from many
> > > > > concurrent threads.
> > > >
> > > > hackbench uses skbuff_head_cache intensively, so we need to check
> > > > whether skbuff_head_cache's order was increased or decreased. On
> > > > my desktop skbuff_head_cache's order is 1 and I roughly guessed
> > > > it was increased (but it's still worth checking in the testing
> > > > env).
> > > >
> > > > But a decreased slab order does not necessarily mean a decreased
> > > > number of cached objects per CPU, because when oo_order(s->oo) is
> > > > smaller, it caches more slabs in the per-cpu partial list.
> > > >
> > > > I think the more problematic situation is when oo_order(s->oo) is
> > > > higher, because the heuristic in SLUB assumes that each slab has
> > > > order oo_order(s->oo) and is half-filled. If it allocates slabs
> > > > with order lower than oo_order(s->oo), the number of cached
> > > > objects per CPU decreases drastically due to the inaccurate
> > > > assumption.
> > > >
> > > > So yeah, a decreased number of cached objects per CPU could be
> > > > the cause of the regression, due to the heuristic.
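For reference, the "half-filled" assumption described above is the one
baked into the per-cpu partial sizing in mm/slub.c. A simplified
sketch, based on roughly v6.4 sources and lightly trimmed, so treat it
as illustrative rather than the exact code:

/*
 * s->cpu_partial is an object budget derived from the cache's object
 * size. It is converted into a slab count by assuming every cached
 * slab has order oo_order(s->oo) and is half-full -- exactly the
 * assumption that goes wrong when lower-order slabs get allocated.
 */
static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
{
	unsigned int nr_slabs;

	s->cpu_partial = nr_objects;

	/* assume half-full slabs of order oo_order(s->oo) */
	nr_slabs = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
	s->cpu_partial_slabs = nr_slabs;
}

So if slabs actually arrive with a lower order than oo_order(s->oo),
cpu_partial_slabs caps the list well below the intended object budget.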
> > > > And I have another theory: it allocated high-order slabs from a
> > > > remote node even when there were lower-order slabs in the local
> > > > node.
> > > >
> > > > Of course we need further experiments, but I think both improving
> > > > the accuracy of the heuristic and avoiding allocating high-order
> > > > slabs from remote nodes would make SLUB more robust.
> > >
> > > I ran the reproduce command on a local 2-socket box:
> > >
> > >   "/usr/bin/hackbench" "-g" "128" "-f" "20" "--process" "-l" "30000" "-s" "100"
> > >
> > > And found 2 kmem_caches were boosted: 'kmalloc-cg-512' and
> > > 'skbuff_head_cache'. Only the order of 'kmalloc-cg-512' was reduced
> > > from 3 to 2 with the patch, while its 'cpu_partial_slabs' was
> > > bumped from 2 to 4. The setting of 'skbuff_head_cache' was kept
> > > unchanged.
> > >
> > > And this is consistent with the perf-profile info from 0Day's
> > > report, that the 'list_lock' contention is increased with the
> > > patch:
> > >
> > >     13.71%    13.70%  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
> > >         5.80% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;__unfreeze_partials;skb_release_data;consume_skb;unix_stream_read_generic;unix_stream_recvmsg;sock_recvmsg;sock_read_iter;vfs_read;ksys_read;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_read
> > >         5.56% native_queued_spin_lock_slowpath;_raw_spin_lock_irqsave;get_partial_node.part.0;___slab_alloc.constprop.0;__kmem_cache_alloc_node;__kmalloc_node_track_caller;kmalloc_reserve;__alloc_skb;alloc_skb_with_frags;sock_alloc_send_pskb;unix_stream_sendmsg;sock_write_iter;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;__libc_write
> >
> > Oh... neither of the assumptions was true.
> > AFAICS it's a case of decreasing slab order increasing lock
> > contention.
>
> Oh good, that would be the least surprising result, at least :) Yeah
> I've pointed out in my reply to this v2 that this patch should not
> result in a decreased slab order, at least for 4k pages.
>
> The v3/v4 is indeed different in that it only affects 64k pages. But
> the initial goal from v1, to increase the order for 4k, is also no
> longer there. Maybe that's fine, as there are two things to consider
> here IMHO: 1) the order could be increased for 4k pages for some cache
> sizes to minimize waste (that's what v1 did, but also for 64k where it
> was not an improvement); 2) the orders we have might be too large for
> 64k pages. Now v4 addresses 2) AFAICS. We could also return to 1)
> separately if it shows benefits.

Yes, so V4 currently targets larger page sizes for slub memory wastage
reduction, but I will also work on point 1 later on, as it shows some
benefits :)

> In any case it means the benchmark results on v2 are no longer
> applicable, so we could move the discussion to v4:
>
> https://lore.kernel.org/all/20230720102337.2069722-1-jaypatel@linux.ibm.com/

So, any reviews/feedback for V4?

> Now I noticed in v4 there's only M: folks from the MAINTAINERS slab
> section on Cc: but not R: folks. Next time please Cc: also R:
> (Hyeonggon and Roman). Thanks!

Sure, next time I will also add the R: folks :)

Thanks
Jay Patel

> > The number of cached objects per CPU is mostly the same (not exactly
> > the same, because the cpu slab is not accounted for), but it only
> > increases the number of slabs to process while taking slabs
> > (get_partial_node()) and while flushing the current cpu partial list
> > (put_cpu_partial() -> __unfreeze_partials()).
> >
> > Can we do better in this situation? Improve __unfreeze_partials()?
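For context, here is a simplified sketch of that flush path, modeled
on __unfreeze_partials() in mm/slub.c around v6.1; the real function
also rebuilds each slab's freelist with a cmpxchg loop and discards
slabs that became empty, which is omitted here:

/*
 * Every slab on the detached per-cpu partial list goes back onto its
 * node's partial list under n->list_lock. The lock is only dropped
 * when the node changes, so a longer list of smaller slabs directly
 * means more add_partial() work done while holding the lock.
 */
static void unfreeze_partials_sketch(struct kmem_cache *s,
				     struct slab *partial_list)
{
	struct kmem_cache_node *n = NULL, *n2;
	unsigned long flags = 0;
	struct slab *slab;

	while (partial_list) {
		slab = partial_list;
		partial_list = slab->next;

		n2 = get_node(s, slab_nid(slab));
		if (n != n2) {
			/* batch: keep the lock across slabs of one node */
			if (n)
				spin_unlock_irqrestore(&n->list_lock, flags);
			n = n2;
			spin_lock_irqsave(&n->list_lock, flags);
		}
		add_partial(n, slab, DEACTIVATE_TO_TAIL);
	}

	if (n)
		spin_unlock_irqrestore(&n->list_lock, flags);
}

That matches the contention signature in the profiles above: the
allocation side (get_partial_node()) and the flush side
(__unfreeze_partials()) serialize on the same n->list_lock.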
> > > Also, I tried restoring slub_max_order to 3, and the regression
> > > was gone:
> > >
> > >  static unsigned int slub_max_order =
> > > -	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 2;
> > > +	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 3;
> > >  static unsigned int slub_min_objects;
> > >
> > > Thanks,
> > > Feng
> > >
> > > > > I don't have direct data to back it up, and I can try some
> > > > > experiments.
> > > >
> > > > Thank you for taking the time to experiment!
> > > >
> > > > Thanks,
> > > > Hyeonggon
> > > >
> > > > > > > then retest on this test machine:
> > > > > > > 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz (Ice Lake) with 256G memory
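As a footnote on why that one-line change can move 'kmalloc-cg-512'
between order 3 and order 2: below is a sketch of the order search,
modeled on calc_slab_order() in mm/slub.c around v6.4, with the
MAX_OBJS_PER_PAGE guard of the real code trimmed. Its caller,
calculate_order(), historically picks min_objects =
4 * (fls(num_online_cpus()) + 1), i.e. 36 on a 128-CPU box, and clamps
it to what one slub_max_order slab can hold. Unclamped, 36 x 512 bytes
asks for ~18KB, an order-3 slab; with slub_max_order = 2 the clamp
(32 objects = 16KB) lands the search exactly on order 2.

/*
 * Find the lowest order >= get_order(min_objects * size), up to
 * max_order, whose leftover space (slab_size % size) is at most
 * slab_size / fract_leftover.
 */
static unsigned int calc_slab_order(unsigned int size,
		unsigned int min_objects, unsigned int max_order,
		unsigned int fract_leftover)
{
	unsigned int min_order = slub_min_order;
	unsigned int order;

	for (order = max(min_order, (unsigned int)get_order(min_objects * size));
			order <= max_order; order++) {
		unsigned int slab_size = (unsigned int)PAGE_SIZE << order;
		unsigned int rem = slab_size % size;

		if (rem <= slab_size / fract_leftover)
			break;
	}

	return order;
}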