From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Date: Thu, 7 Apr 2022 12:08:16 +0200
To: Mike Kravetz, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, "Aneesh Kumar K . V",
 Andrea Arcangeli, "Kirill A . Shutemov", Davidlohr Bueso, Prakash Sangappa,
 James Houghton, Mina Almasry, Ray Fucillo, Andrew Morton
Subject: Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
Message-ID: <045a59a1-0929-a969-b184-1311f81504b8@redhat.com>
In-Reply-To: <20220406204823.46548-1-mike.kravetz@oracle.com>
References: <20220406204823.46548-1-mike.kravetz@oracle.com>

On 06.04.22 22:48, Mike Kravetz wrote:
> hugetlb fault scalability regressions have recently been reported [1].
> This is not the first such report, as regressions were also noted when
> commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> synchronization") was added [2] in v5.7. At that time, a proposal to
> address the regression was suggested [3] but went nowhere.
>
> To illustrate the regression, I created a simple program that does the
> following in an infinite loop:
> - mmap a 4GB hugetlb file (the size ensures pmd sharing)
> - fault in all pages
> - unmap the hugetlb file
>
> The hugetlb fault code was then instrumented to collect the number of
> times the mutex was locked and the wait time. Samples are from 10 second
> intervals on a 4 CPU VM with 8GB memory. Eight instances of the
> map/fault/unmap program are running.
>
> v5.17
> -----
> [ 708.763114] Wait_debug: faults sec 3622
> [ 708.764010]             num faults 36220
> [ 708.765016]             num waits 36220
> [ 708.766054]             intvl wait time 54074 msecs
> [ 708.767287]             max_wait_time 31000 usecs
>
> v5.17 + this series (similar to v5.6)
> -------------------------------------
> [ 282.191391] Wait_debug: faults sec 1777939
> [ 282.192571]             num faults 17779393
> [ 282.193746]             num locks 5517
> [ 282.194858]             intvl wait time 19907 msecs
> [ 282.196226]             max_wait_time 43000 usecs
>
> As can be seen, fault time suffers when there are other operations
> taking i_mmap_rwsem in write mode, such as unmap.
>
> This series proposes reverting c0d0381ade79 and 87bf91d39bb5, which
> depends on c0d0381ade79. This moves acquisition of i_mmap_rwsem in the
> fault path back to huge_pmd_share, where it is only taken when necessary.
> After reverting these patches we still need to handle:
>
> fault and truncate races
> - Catch and properly back out faults beyond i_size.
>   Backing out reservations is much easier after 846be08578ed expanded
>   restore_reserve_on_error functionality.
>
> unshare and fault/lookup races
> - Since the pointer returned from huge_pte_offset or huge_pte_alloc may
>   become invalid before we take the page table lock, we must revalidate
>   after taking the lock. Code paths must back out and possibly retry if
>   the page table pointer changes.
>
> The commit message in patch 5 suggests that it is not safe to use
> SPLIT_PMD_PTLOCKS for hugetlb mappings if sharing is possible. If
> others confirm/agree, then additional work will be needed.
>
> Please help with comments or suggestions. I would like to come up with
> something that is performant and safe.
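[A minimal userspace sketch of the map/fault/unmap loop described in the
quoted cover letter might look like the code below. The hugetlbfs mount
point, the file name, and the one-write-per-page fault-in are assumptions
for illustration, not details of Mike's actual instrumented test program.]

/*
 * Sketch: mmap a 4GB hugetlb file, touch every huge page, unmap, repeat.
 * Assumes a hugetlbfs mount at /dev/hugepages and 2MB huge pages; 4GB is
 * large enough to contain PUD-aligned ranges, which is what makes PMD
 * sharing possible.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE   (4UL << 30)          /* 4GB mapping            */
#define HPAGE_SIZE (2UL << 20)          /* assumes 2MB huge pages */

int main(void)
{
        const char *path = "/dev/hugepages/pmd_share_test"; /* assumed */

        for (;;) {
                int fd = open(path, O_CREAT | O_RDWR, 0600);

                if (fd < 0 || ftruncate(fd, MAP_SIZE)) {
                        perror("open/ftruncate");
                        exit(1);
                }

                char *addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
                if (addr == MAP_FAILED) {
                        perror("mmap");
                        exit(1);
                }

                /* Fault in all pages: one write per huge page. */
                for (unsigned long off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
                        addr[off] = 1;

                munmap(addr, MAP_SIZE);
                close(fd);
        }
        return 0;
}

[Eight concurrent instances of something along these lines would hammer the
hugetlb fault path while the unmaps take i_mmap_rwsem in write mode, which
is what the Wait_debug numbers above measure.]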
May I challenge the existence of huge PMD sharing? TBH I am not convinced
that the code complexity is worth the benefit.

Let me know if I get something wrong:

Let's assume a 4 TiB device and a 2 MiB hugepage size. That's 2097152 huge
pages. With one 8-byte PMD entry per huge page, that's 16 MiB of page
tables per process.

Sure, with thousands of processes sharing that memory, the size of the
page tables required would increase with each and every process. But TBH,
that's in no way different from other file systems, where we're even
dealing with PTE tables.

Which results in me wondering whether

a) We should simply use gigantic pages for such extreme use cases. That
   allows for freeing up more memory via vmemmap either way.

b) We should instead look into reclaiming reconstructable page tables.
   It's hard to imagine that each and every process accesses each and
   every part of the gigantic file all of the time.

c) We should instead establish a more generic page table sharing
   mechanism.

Consequently, I'd be much more in favor of ripping it out :/ but that's
just my personal opinion.

--
Thanks,

David / dhildenb
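[A quick back-of-the-envelope sketch of the arithmetic above: 4 TiB of
2 MiB huge pages with one 8-byte PMD entry each. This is plain userspace
arithmetic, nothing kernel specific.]

#include <stdio.h>

int main(void)
{
        /* Figures from the discussion above. */
        unsigned long long file_size  = 4ULL << 40;     /* 4 TiB              */
        unsigned long long hpage_size = 2ULL << 20;     /* 2 MiB huge pages   */
        unsigned long long pmd_entry  = 8;              /* bytes per entry    */

        unsigned long long hpages = file_size / hpage_size;

        printf("huge pages:             %llu\n", hpages);        /* 2097152 */
        printf("PMD tables per process: %llu MiB\n",
               hpages * pmd_entry >> 20);                        /* 16 MiB  */
        return 0;
}

[With PMD sharing those 16 MiB are paid once; without it, each of the
thousands of processes carries its own copy, which is the trade-off being
discussed.]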