Date: Fri, 24 Jan 2025 10:48:16 -0500
From: Gregory Price
To: Matthew Wilcox
Cc: Joshua Hahn, hyeonggon.yoo@sk.com, ying.huang@linux.alibaba.com,
    rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
    akpm@linux-foundation.org, honggyu.kim@sk.com, rakie.kim@sk.com,
    dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
    dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
    linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
    linux-mm@kvack.org, kernel-team@meta.com
Subject: Re: [PATCH v3] Weighted interleave auto-tuning
References: <20250115185854.1991771-1-joshua.hahnjy@gmail.com>

On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
>
> I still don't get it. You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down. But you're
> not really talking about NUMA, are you? You're talking about CXL.
>
> And CXL is terrible for bandwidth. I just ran the numbers.
>
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
>
> For each CXL lane, you take a lane of PCIe gen5 away. So that's
> notionally 32Gbit/s, or 4GB/s per lane. But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to
> 3GB/s. You're not going to use all 80 lanes for CXL (presumably these
> CPUs are going to want to do I/O somehow), so maybe allocate 20 of them
> to CXL. That's 60GB/s, or a 20% improvement in bandwidth. On top of
> that, it's slow, with a minimum of 10ns latency penalty just from the
> CXL encode/decode penalty.

From the original - the performance tests show considerable opportunity
in the scenarios where DRAM bandwidth is pressured - as you can either:

1) Lower the DRAM bandwidth pressure by offloading some cachelines to
   CXL - reducing latency on DRAM and reducing average latency overall.
   The latency cost on CXL lines gets amortized over all DRAM fetches
   no longer hitting stalls.

2) Under full-pressure scenarios (DRAM and CXL are saturated), the
   additional lanes / buffers provide more concurrent fetches - i.e.
   you're just doing more work (and avoiding going to storage). This
   is the weaker of the two scenarios.

No one is proposing we switch the default policy to weighted interleave.
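For context, none of this happens by default - weights are set globally
through sysfs and a task has to opt in. A minimal sketch of what opting
in looks like (illustrative only: it assumes a kernel with the
weighted_interleave sysfs interface and MPOL_WEIGHTED_INTERLEAVE from
6.9+, needs root for the sysfs writes, links with -lnuma for numaif.h,
and the node numbers and 4:1 weights are made up for the example):

#include <stdio.h>
#include <numaif.h>	/* set_mempolicy() wrapper, from libnuma */

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* older uapi headers lack it */
#endif

/* Set the global interleave weight for one node via sysfs. */
static int set_node_weight(int node, int weight)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d",
		 node);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", weight);
	return fclose(f);
}

int main(void)
{
	/* node 0 = DRAM, node 2 = CXL; ~4:1 tracks the bandwidth gap */
	unsigned long nodemask = (1UL << 0) | (1UL << 2);

	if (set_node_weight(0, 4) || set_node_weight(2, 1))
		perror("sysfs weights");

	/* Only this task's future allocations interleave 4:1. */
	if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
			  sizeof(nodemask) * 8)) {
		perror("set_mempolicy");
		return 1;
	}

	/* ... run the bandwidth-bound workload ... */
	return 0;
}

The auto-tuning being discussed in this patch derives those weights from
reported bandwidth rather than leaving them hand-set.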
= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda

Hardware: Single-socket, multiple CXL memory expanders.

Workload: W2
Data Signature: 2:1 read:write
DRAM only bandwidth (GBps): 298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only: 1.38x
Gain over default interleave: 2.64x

Workload: W5
Data Signature: 1:1 read:write
DRAM only bandwidth (GBps): 273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only: 1.4x
Gain over default interleave: 2.26x

=====================================================================
Performance test - Stream
From - Gregory Price

Hardware: Single socket, single CXL expander

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times

Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
mbind weights      : +2.5% to +4% (consistently better than DRAM)

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji

Hardware: Single socket, single CXL memory expander
NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory

Threads: 56
Lookups: 170,000,000

Summary: +19% over DRAM. +47% over default interleave.

> Putting page cache in the CXL seems like nonsense to me. I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it. But I just can't see the point of putting
> pagecache on CXL.

No one said anything about page cache - but it depends. If you can keep
your entire working set in-memory and on-CXL, as opposed to swapping to
disk - you win. "Swapping to CXL" incurs a bunch of page faults; that
sounds like a loss.

However - the Stream test from the original proposal agrees with you
that just making everything interleaved (code, pagecache, etc.) is at
best a wash:

Global weighting : -6% to +4% (workload dependent)

But targeting specific regions can provide a modest bump:

mbind weights : +2.5% to +4% (consistently better than DRAM)

~Gregory
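P.S. To make the "mbind weights" row concrete: rather than opting the
whole task in, you bind just the hot region and leave everything else
(code, pagecache) node-local. Again a rough sketch, not the actual test
harness - same assumptions as the sketch above, with an 18GB anonymous
mapping standing in for the Stream arrays:

#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>	/* mbind() wrapper, from libnuma */

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6
#endif

int main(void)
{
	size_t len = 18UL << 30;	/* ~18GB, as in the Stream run */
	unsigned long nodemask = (1UL << 0) | (1UL << 2);
	void *hot;

	hot = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (hot == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Spread only this mapping across nodes 0 and 2 using the
	 * global sysfs weights; the rest of the task stays on the
	 * default local policy.
	 */
	if (mbind(hot, len, MPOL_WEIGHTED_INTERLEAVE, &nodemask,
		  sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	/* ... populate and stream through the arrays ... */
	return 0;
}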