Date: Thu, 24 May 2018 10:27:29 +0200
From: Michal Hocko
To: TSUKADA Koutaro
Cc: Johannes Weiner, Vladimir Davydov, Jonathan Corbet,
	"Luis R. Rodriguez", Kees Cook, Andrew Morton, Roman Gushchin,
	David Rientjes, Mike Kravetz, "Aneesh Kumar K.V", Naoya Horiguchi,
	Anshuman Khandual, Marc-Andre Lureau, Punit Agrawal, Dan Williams,
	Vlastimil Babka, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org
Subject: Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able
	to charge to memcg
Message-ID: <20180524082729.GX20441@dhcp22.suse.cz>
References: <20180522135148.GA20441@dhcp22.suse.cz>

On Thu 24-05-18 13:26:12, TSUKADA Koutaro wrote:
[...]
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
>
> I am one of the developers for software that managing the resource used
> from user job at HPC-Cluster with Linux. The resource is memory mainly.
> The HPC-Cluster may be shared by multiple people and used. Therefore, the
> memory used by each user must be strictly controlled, otherwise the
> user's job will runaway, not only will it hamper the other users, it will
> crash the entire system in OOM.
>
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing with MPI communication using multiple compute nodes.
> Since CPU wait time will occur when synchronizing, they want to minimize
> the variation in execution time at each node to reduce waiting times as
> much as possible. We call this variation a noise.
>
> THP does not guarantee to use the Huge Page, but may use the normal page.
> This mechanism is one cause of variation(noise).
>
> The users who know this mechanism will be hesitant to use THP. However,
> the users also know the benefits of the Huge Page's TLB hit rate
> performance, and the Huge Page seems to be attractive. It seems natural
> that these users are interested in HugeTLBfs, I do not know at all
> whether it is the right approach or not.

Sure, asking for guarantee makes hugetlb pages attractive. But nothing
is really for free, especially any resource _guarantee_, and you have to
pay an additional configuration price usually.

> At the very least, our HPC system is pursuing high versatility and we
> have to consider whether we can provide it if users want to use HugeTLBfs.
>
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case sharing nodes, it would be impossible to create, delete or
> resize the pool.

Why? I can see this would be quite a PITA but not really impossible.
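
For reference, the persistent pool discussed above is sized through
/proc/sys/vm/nr_hugepages, and its state is visible in the HugePages_*
counters of /proc/meminfo. The following is only a minimal sketch of how
a job manager might compute a new pool size before starting a job. The
function names parse_hugetlb_counters and pool_size_for_job are
hypothetical, and the sizing policy is an assumption made for
illustration, not something proposed in this thread.

```python
import re

# Real /proc/meminfo field names; everything else here is illustrative.
HUGETLB_FIELDS = (
    "HugePages_Total",
    "HugePages_Free",
    "HugePages_Rsvd",
    "HugePages_Surp",
)


def parse_hugetlb_counters(meminfo_text):
    """Return the hugetlb page counters found in /proc/meminfo-style text."""
    counters = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match and match.group(1) in HUGETLB_FIELDS:
            counters[match.group(1)] = int(match.group(2))
    return counters


def pool_size_for_job(counters, pages_needed):
    """Hypothetical policy: persistent pool size that covers `pages_needed`.

    Assumes HugePages_Total includes surplus pages, so the persistent
    part is Total minus Surp, and that free pages which are already
    reserved cannot be handed to a new job.
    """
    persistent = counters["HugePages_Total"] - counters["HugePages_Surp"]
    available = counters["HugePages_Free"] - counters["HugePages_Rsvd"]
    shortfall = max(0, pages_needed - available)
    return persistent + shortfall


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        current = parse_hugetlb_counters(f.read())
    # The value printed here would be written to /proc/sys/vm/nr_hugepages.
    print(pool_size_for_job(current, pages_needed=4))
```

Writing the result to /proc/sys/vm/nr_hugepages requires root, and the
kernel may allocate fewer pages than requested when memory is fragmented,
so the counters should be read back after the write.
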

> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool(this is the surplus hugepage).
>
> Surplus hugepages is hugetlb page, but I think at least that consuming
> buddy pool is a decisive difference from hugetlb page of persistent pool.
> If nr_overcommit_hugepages is assumed to be infinite, allocating pages for
> surplus hugepages from buddy pool is all unlimited even if being limited
> by memcg.

Not really, you can specify how much you can overcommit hugetlb pages.

> In extreme cases, overcommitment will allow users to exhaust
> the entire memory of the system. Of course, this can be prevented by the
> hugetlb cgroup, but even if we set the limit for memcg and hugetlb cgroup
> respectively, as I asked in the first mail(set limit to 10GB), the
> control will not work.

-- 
Michal Hocko
SUSE Labs