Date: Wed, 12 Dec 2018 12:00:16 -0500
From: Andrea Arcangeli
To: Michal Hocko
Cc: David Rientjes, Linus Torvalds, mgorman@techsingularity.net, Vlastimil Babka, ying.huang@intel.com, s.priebe@profihost.ag, Linux List Kernel Mailing, alex.williamson@redhat.com, lkp@01.org,
 kirill@shutemov.name, Andrew Morton, zi.yan@cs.rutgers.edu
Subject: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression
Message-ID: <20181212170016.GG1130@redhat.com>
In-Reply-To: <20181212095051.GO1286@dhcp22.suse.cz>

On Wed, Dec 12, 2018 at 10:50:51AM +0100, Michal Hocko wrote:
> I can be convinced that larger pages really require a different behavior
> than base pages but you should better show _real_ numbers on a wider
> variety workloads to back your claims. I have only heard hand waving and

I agree with your point about node_reclaim, and I think David's
complaint of "I got remote THP instead of local 4k" with our proposed
fix is going to morph into "I got remote 4k instead of local 4k" with
his favorite fix. Because David stopped calling reclaim with
__GFP_THISNODE, the moment the node is full of pagecache the
node_reclaim behavior will go away and even 4k pages will start to be
allocated remotely (and because __GFP_THISNODE is still set in the THP
allocation, all readily available or trivial-to-compact remote THP
will be ignored too).

What David needs, I think, is a way to set __GFP_THISNODE for THP *and
4k* allocations, and if both fail in a row with __GFP_THISNODE set, we
need to repeat the whole thing without __GFP_THISNODE set (ideally
with a mask to skip the node that we already scraped down to the
bottom during the initial __GFP_THISNODE pass).
This way his proprietary software binary will work even better than
before when the local node is fragmented, and he'll finally be able to
get the speedup from remote THP too in case the local node is truly
OOM but all other nodes are full of readily available THP.

To achieve this without a new MADV_THISNODE/MADV_NODE_RECLAIM, we'd
need a way to start with __GFP_THISNODE and then draw the line in
reclaim and decide to drop __GFP_THISNODE when too much pressure
mounts on the local node. But as you said, that becomes like
node_reclaim, and it would be better if it could be done with an
opt-in like MADV_HUGEPAGE, because not all workloads would benefit
from the extra pagecache reclaim cost (just as not all workloads
benefit from synchronous compaction).

I think some NUMA reclaim-mode semantics ended up being embedded and
hidden in the THP MADV_HUGEPAGE, but they imposed a massive slowdown
on all workloads that can't cope with the node_reclaim-mode behavior
because they don't fit in a single node.

Adding MADV_THISNODE/MADV_NODE_RECLAIM will guarantee his proprietary
software binary runs at maximum performance without cache
interference, and he's happy to accept the risk of a massive slowdown
in case the local node is truly OOM. The fallback, despite being very
inefficient, will still happen without the OOM killer triggering.

Thanks,
Andrea