From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 896A1C433FE
	for <linux-mm@archiver.kernel.org>; Fri, 26 Nov 2021 16:26:49 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id DE22C6B0075; Fri, 26 Nov 2021 11:26:38 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D90C36B0078; Fri, 26 Nov 2021 11:26:38 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C58EB6B007B; Fri, 26 Nov 2021 11:26:38 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0007.hostedemail.com [216.40.44.7])
	by kanga.kvack.org (Postfix) with ESMTP id B645D6B0075
	for <linux-mm@kvack.org>; Fri, 26 Nov 2021 11:26:38 -0500 (EST)
Received: from smtpin12.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id 79C5918549A6A
	for <linux-mm@kvack.org>; Fri, 26 Nov 2021 16:26:28 +0000 (UTC)
X-FDA: 78851609214.12.4A14300
Received: from mail-pl1-f177.google.com (mail-pl1-f177.google.com [209.85.214.177])
	by imf26.hostedemail.com (Postfix) with ESMTP id 593C120019EB
	for <linux-mm@kvack.org>; Fri, 26 Nov 2021 16:26:26 +0000 (UTC)
Received: by mail-pl1-f177.google.com with SMTP id y8so7027089plg.1
        for <linux-mm@kvack.org>; Fri, 26 Nov 2021 08:26:27 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=date:from:to:cc:subject:message-id:references:mime-version
         :content-disposition:in-reply-to:user-agent;
        bh=q/HRGw2r5wG7sszT9Fb1vk0NIxwkVXLQR9pjE/nA1Tc=;
        b=hZXEtwgIgWBHlarUnLTfnMEddvX1ZvCgyKfbW91zU5qVw4BWPEk6h2BeN9eKocOl/5
         iG1H9wm6oQQbaTqDQpw1p+zMdg1suCwqLCY/vnG5DYCGq6ZzwyVFZvB2Bs4p61bvyAwE
         Hryp4nPZZV2AKt6xkJYPE0MnXv87asXXdQcdf9Kz/oQwrAgC/gMe4/YUVxiWY7Rd67qf
         g04fTUHIM77oZ4dZa0Z8QqUGqBo/wihmAFXnw0+Hz4Ub/GHawVCN/6vuyEqlZP8rKY9V
         pM9enbAwzLwDrbZyd8qpj7L3ye4okDSJ6kEgPAuyvqe1bDdJJmGxnE8Sv6J2ee4aHrOc
         rzIQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:from:to:cc:subject:message-id:references
         :mime-version:content-disposition:in-reply-to:user-agent;
        bh=q/HRGw2r5wG7sszT9Fb1vk0NIxwkVXLQR9pjE/nA1Tc=;
        b=iiqNVLlFqSBBMDSZY2SMj7CGzB0xREQu3C/gQE8MvaMaZDeFzS1kQwTFJFs8ose0Bb
         5r6sgIybLQG2Z6ri60Ge2YGwJGqZwlqd9pSGJtXJRIyaRZrvwbS4g0/aqOKDB5fjkq/j
         c1Cr8NzKkr/2O0biTHhyNJMuETxCDjHXCzx6BkQnasjrB6alONpT6OmfmK5DnH5So0u4
         g9liUQLmvJHZJCbjiSBWeU3xdBzvP8y/50ucybsgTegc3piVbWnm1y0bQL/GKb+GDVOs
         BedxyH8d9ZhDRsdDSA8QQq9Uz41wzqsrK26aLgKrV4His1JgTX+JyQ+dECsJVDaZtHW3
         WG2w==
X-Gm-Message-State: AOAM531qKS4aXIjMPlMIx1TyX8JT4VNggH8S6XWr7ioxVDmp0HdOjgzk
	YrxC9UsEysS90kqhQQLXrvM=
X-Google-Smtp-Source: ABdhPJwKTFwC5iHnm/7QHrutv643MnseHPiTXxhG5z3/B6Sy5xf47ltm167TKmFsuvjEvKgR7sSn3g==
X-Received: by 2002:a17:90b:4a05:: with SMTP id kk5mr16792155pjb.232.1637943986231;
        Fri, 26 Nov 2021 08:26:26 -0800 (PST)
Received: from haolee.io ([2600:3c01::f03c:91ff:fe02:b162])
        by smtp.gmail.com with ESMTPSA id kk7sm12001805pjb.19.2021.11.26.08.26.25
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Fri, 26 Nov 2021 08:26:25 -0800 (PST)
Date: Fri, 26 Nov 2021 16:26:23 +0000
From: Hao Lee <haolee.swjtu@gmail.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>, Linux MM <linux-mm@kvack.org>,
	Johannes Weiner <hannes@cmpxchg.org>, vdavydov.dev@gmail.com,
	Shakeel Butt <shakeelb@google.com>, cgroups@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: reduce spinlock contention in release_pages()
Message-ID: <20211126162623.GA10277@haolee.io>
References: <20211124151915.GA6163@haolee.io>
 <YZ5o/VmU59evp65J@dhcp22.suse.cz>
 <CA+PpKPmy-u_BxYMCQOFyz78t2+3uM6nR9mQeX+MPyH6H2tOOHA@mail.gmail.com>
 <YZ8DZHERun6Fej2P@casper.infradead.org>
 <20211125080238.GA7356@haolee.io>
 <YZ9e3pzHKmn5nev0@dhcp22.suse.cz>
 <20211125123133.GA7758@haolee.io>
 <YZ+bI1fNpKar0bSU@dhcp22.suse.cz>
 <CA+PpKP=hsuBmvv09OcD2Nct8B8Cqa03UfKFHAHzKxwE0SXGP4g@mail.gmail.com>
 <YaC7BcTSijFj+bxR@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <YaC7BcTSijFj+bxR@dhcp22.suse.cz>
User-Agent: Mutt/1.12.1 (2019-06-15)
X-Rspamd-Server: rspam05
X-Rspamd-Queue-Id: 593C120019EB
X-Stat-Signature: 5ujqhpm7j58eqx96kakrkkswyx51x5xi
Authentication-Results: imf26.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20210112 header.b=hZXEtwgI;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf26.hostedemail.com: domain of haolee.swjtu@gmail.com designates 209.85.214.177 as permitted sender) smtp.mailfrom=haolee.swjtu@gmail.com
X-HE-Tag: 1637943986-673411
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Fri, Nov 26, 2021 at 11:46:29AM +0100, Michal Hocko wrote:
> On Fri 26-11-21 14:50:44, Hao Lee wrote:
> > On Thu, Nov 25, 2021 at 10:18 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Could you share more about requirements for those? Why is unmapping in
> > > any of their hot paths which really require low latencies? Because as
> > > long as unmapping requires a shared resource - like lru lock - then you
> > > have a bottleneck.
> > 
> > We deploy best-effort (BE) jobs (e.g. bigdata, machine learning) and
> > latency-critical (LC) jobs (e.g. map navigation, payments services) on the
> > same servers to improve resource utilization. BE jobs run for a very short
> > time, but their memory consumption is large, and they run periodically. The
> > LC jobs are long-running services and are sensitive to delays because
> > jitter may cause customer churn.
> 
> Have you tried to isolate those workloads by memory cgroups? That could
> help for lru lock at least.

Sure. LC and BE jobs are in different memory cgroups (containers). memcg
does avoid lru lock contention between LC and BE, although it can't reduce
contention between jobs in the same cgroup. Lock contention among BE jobs
can still cause cpu jitter and thereby interfere with LC jobs.

> You are likely going to hit other locks on
> the way though. E.g. zone lock in the page allocator but that might be
> less problematic in the end.

Yes, but we haven't encountered lock contention in the allocation path so
far. Maybe this is because BE jobs still allocate memory gradually rather
than in a very short burst.

> If you isolate your long running services
> to a different NUMA node then you can get even less interaction.

Agreed.

> 
> > If a batch of BE jobs finishes simultaneously, a lot of memory is freed,
> > and spinlock contention occurs. BE jobs don't care about this contention,
> > but it causes them to spend more time in kernel mode, and thus LC jobs
> > running on the same cpu cores are delayed and jitter occurs. (The
> > kernel preemption is disabled on our servers, and we try not to separate
> > LC/BE using cpuset in order to achieve "complete mixture deployment"). Then
> > LC services people will complain about the poor service stability. This
> > scenario has occurred several times, so we want to find a way to avoid it.
> 
> It will be hard and a constant fight to get reasonably low latencies on
> a non preemptible kernel. It would likely be better to partition CPUs
> between latency sensitive and BE jobs. I can see how that might not be
> really practical but especially with non-preemptible kernels you have a
> large space for priority inversions that is hard to foresee or contain.

Agreed. It's really hard. Maybe we will eventually use cpuset to separate LC
and BE if we can't find a better way to mix them on the same set of cpus.

I will try Matthew's idea of using a semaphore or mutex to limit the number
of BE jobs that are in the exit path at the same time. This sounds like a
feasible approach for our scenario.
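
To make the idea concrete, here is a rough userspace sketch (not kernel code
and not a patch). It uses a POSIX counting semaphore so that at most `limit`
threads run the teardown step concurrently, and it records the peak
concurrency so the bound can be checked. All names here (exit_throttle,
teardown_memory, be_job_exit, run_be_jobs, MAX_JOBS) are made up for
illustration, and the busy loop only stands in for the real freeing work. It
assumes Linux with unnamed POSIX semaphores and C11 atomics.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

#define MAX_JOBS 64                 /* hypothetical cap on simulated jobs */

static sem_t exit_throttle;         /* counts free "exit slots" */
static atomic_int in_teardown;      /* jobs currently tearing down */
static atomic_int peak_in_teardown; /* highest concurrency observed */

/* Stand-in for the contended memory-freeing work of an exiting job. */
static void teardown_memory(void)
{
    int now = atomic_fetch_add(&in_teardown, 1) + 1;
    int peak = atomic_load(&peak_in_teardown);

    /* Raise the recorded peak if this job pushed concurrency higher. */
    while (now > peak &&
           !atomic_compare_exchange_weak(&peak_in_teardown, &peak, now))
        ;

    for (volatile int i = 0; i < 100000; i++)
        ;                           /* simulated work */

    atomic_fetch_sub(&in_teardown, 1);
}

/* Each simulated BE job must take a slot before it starts tearing down. */
static void *be_job_exit(void *arg)
{
    (void)arg;
    while (sem_wait(&exit_throttle) == -1 && errno == EINTR)
        ;                           /* retry if interrupted by a signal */
    teardown_memory();
    sem_post(&exit_throttle);
    return NULL;
}

/*
 * Let up to njobs jobs exit "simultaneously" with at most `limit` of them
 * in teardown at once. Returns the peak concurrency, or -1 on error.
 */
static int run_be_jobs(int njobs, unsigned int limit)
{
    pthread_t tids[MAX_JOBS];
    int started = 0;

    assert(limit > 0 && njobs >= 0);
    if (njobs > MAX_JOBS)
        njobs = MAX_JOBS;

    atomic_store(&in_teardown, 0);
    atomic_store(&peak_in_teardown, 0);
    if (sem_init(&exit_throttle, 0, limit) != 0)
        return -1;

    for (int i = 0; i < njobs; i++) {
        if (pthread_create(&tids[started], NULL, be_job_exit, NULL) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    sem_destroy(&exit_throttle);
    return atomic_load(&peak_in_teardown);
}
```

For example, run_be_jobs(16, 2) should never report a peak above 2, no matter
how the 16 threads are scheduled. Whether a similar throttle is acceptable
around the real exit path is something I still need to test.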

Thanks

> -- 
> Michal Hocko
> SUSE Labs