From: "Huang, Ying"
To: Prathu Baronia
Cc: Alexander Duyck, Chintan Pandya, Michal Hocko,
	"akpm@linux-foundation.org", "linux-mm@kvack.org",
	"gregkh@linuxfoundation.org", "gthelen@google.com",
	"jack@suse.cz", Ken Lin, Gasine Xu
Subject: Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases
Date: Tue, 14 Apr 2020 09:10:01 +0800
Message-ID: <871roq7tzq.fsf@yhuang-dev.intel.com>
In-Reply-To: <20200413153351.GB13136@oneplus.com> (Prathu Baronia's message of "Mon, 13 Apr 2020 21:03:52 +0530")
References: <20200403081812.GA14090@oneplus.com>
	<20200403085201.GX22681@dhcp22.suse.cz>
	<20200409152913.GA9878@oneplus.com>
	<20200409154538.GR18386@dhcp22.suse.cz>
	<87lfn390db.fsf@yhuang-dev.intel.com>
	<20200413153351.GB13136@oneplus.com>
Prathu Baronia writes:

> The 04/11/2020 13:47, Alexander Duyck wrote:
>>
>> This is an interesting data point. So running things in reverse seems
>> much more expensive than running them forward. As such I would imagine
>> process_huge_page is going to be significantly more expensive on ARM64
>> then, since it will wind through the pages in reverse order from the
>> end of the page all the way down to wherever the page was accessed.
>>
>> I wonder if we couldn't simply modify process_huge_page to process
>> pages in two passes: the first from addr_hint + some offset to the
>> end of the page, then loop back around to the start of the page for
>> the second pass and process up to where we started the first pass.
>> The idea would be that the offset would be enough so that the 4K that
>> was accessed, plus some range before and after that address, is
>> hopefully still in the L1 cache after we are done.
>
> That's a great idea. We were working on a similar idea for the v2
> patch, and your suggestion has reassured us about our approach. This
> will incorporate the benefits of the optimized memset and will keep
> the cache hot around the faulting address.
>
> Earlier we had taken this offset as 0.5MB, and after your response we
> have set it to 32KB. As we understand there is a trade-off associated
> with making this value too large, we would really appreciate it if you
> could suggest a method to derive an appropriate value for this offset
> from the L1 cache size.

I don't think we should only keep the L1 cache hot. It is good to keep
the L2 cache hot too; that could be 1 MB on an x86 machine. In theory,
it's better to keep as much cache hot as possible. I understand that on
your system the benefit of cache hotness is offset by the slower
backward zeroing, so you need to balance between them. But because
backward zeroing is as fast as forward zeroing on x86, we should
consider that too. Maybe we need two different implementations on x86
and ARM, or some parameter to tune the behavior for different
architectures.

Best Regards,
Huang, Ying
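
P.S. For concreteness, below is a rough, untested userspace sketch of
the two-pass ordering discussed above. The names KEEP_HOT_SIZE and
clear_huge_page_two_pass() are made up for illustration, and memset()
stands in for the per-subpage clearing; the real process_huge_page()
in mm/memory.c clears 4K subpages with clear_user_highpage() rather
than doing one big memset.

#include <stddef.h>
#include <string.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* one 2MB THP */
#define KEEP_HOT_SIZE	(32UL * 1024)		/* window kept cache-hot; tunable */

/* Zero a huge page in two forward passes so that the bytes around
 * fault_off are written last and therefore stay hottest in cache. */
static void clear_huge_page_two_pass(void *page, size_t fault_off)
{
	char *base = page;
	/* Start the first pass a bit past the faulting address so the
	 * hot window around it is left for the very end. */
	size_t start = fault_off + KEEP_HOT_SIZE;

	if (start > HPAGE_SIZE)
		start = HPAGE_SIZE;	/* fault near the end: pass 1 shrinks to nothing */

	/* Pass 1: from just past the hot window to the end of the page. */
	memset(base + start, 0, HPAGE_SIZE - start);

	/* Pass 2: wrap around and clear from the start of the page up to
	 * where pass 1 began, finishing in the neighborhood of the
	 * faulting address. */
	memset(base, 0, start);
}

Since both passes run forward, this avoids the reverse iteration that
was slow on ARM64 in your measurements, and KEEP_HOT_SIZE could become
the per-architecture tuning parameter mentioned above (for example,
closer to the L2 size on x86, smaller on ARM64).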