From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH 1/1] docs: adding NUMA documentation for pseries
From: Daniel Henrique Barboza
To: Greg Kurz
Cc: qemu-ppc@nongnu.org, qemu-devel@nongnu.org, David Gibson
Date: Mon, 3 Aug 2020 10:43:24 -0300
Message-ID: <2e83b3fe-100e-c75a-4a77-c6c3758d681d@gmail.com>
In-Reply-To: <20200803145311.55864d02@bahia.lan>
References: <20200729125756.224846-1-danielhb413@gmail.com>
 <20200730005852.GN84173@umbus.fritz.box>
 <20200803134917.48c5e7a5@bahia.lan>
 <8985209a-427b-1ec5-7d90-6a760e58f1cd@gmail.com>
 <20200803145311.55864d02@bahia.lan>

David,

This patch is breaking the build, as Greg mentioned below. Just sent a
v2 that works properly.
If you prefer, you can squash this into the existing patch to fix it:

$ git diff HEAD^ docs/specs/index.rst
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index 426632a475..1b0eb979d5 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -12,6 +12,7 @@ Contents:
 
    ppc-xive
    ppc-spapr-xive
+   ppc-spapr-numa
    acpi_hw_reduced_hotplug
    tpm
    acpi_hest_ghes

Thank you, Greg, for reporting it. This went under my radar completely.


Daniel

On 8/3/20 9:53 AM, Greg Kurz wrote:
> On Mon, 3 Aug 2020 09:14:22 -0300
> Daniel Henrique Barboza wrote:
>
>>
>>
>> On 8/3/20 8:49 AM, Greg Kurz wrote:
>>> On Thu, 30 Jul 2020 10:58:52 +1000
>>> David Gibson wrote:
>>>
>>>> On Wed, Jul 29, 2020 at 09:57:56AM -0300, Daniel Henrique Barboza wrote:
>>>>> This patch adds a new documentation file, ppc-spapr-numa.rst,
>>>>> informing what developers and users can expect of the NUMA distance
>>>>> support for the pseries machine, up to QEMU 5.1.
>>>>>
>>>>> In the (hopefully soon) future, when we rework the NUMA mechanics
>>>>> of the pseries machine to at least attempt to contemplate user
>>>>> choice, this doc will be extended to inform about the new
>>>>> support.
>>>>>
>>>>> Signed-off-by: Daniel Henrique Barboza
>>>>
>>>> Applied to ppc-for-5.2, thanks.
>>>>
>>>
>>> I'm now hitting this:
>>>
>>> Warning, treated as error:
>>> docs/specs/ppc-spapr-numa.rst:document isn't included in any toctree
>>
>> How are you hitting this? I can't reproduce this error. Tried running
>> ./autogen.sh and 'make' and didn't see it.
>>
>
> I do out-of-tree builds and my configure line is:
>
> configure \
>     --enable-docs \
>     --disable-strip \
>     --disable-xen \
>     --enable-trace-backend=log \
>     --enable-kvm \
>     --enable-linux-aio \
>     --enable-vhost-net \
>     --enable-virtfs \
>     --enable-seccomp \
>     --target-list='ppc64-softmmu'
>
>> Checking what other docs are doing, I figured that this might be missing:
>>
>> $ git diff
>> diff --git a/docs/specs/index.rst b/docs/specs/index.rst
>> index 426632a475..1b0eb979d5 100644
>> --- a/docs/specs/index.rst
>> +++ b/docs/specs/index.rst
>> @@ -12,6 +12,7 @@ Contents:
>>
>>     ppc-xive
>>     ppc-spapr-xive
>> +   ppc-spapr-numa
>>     acpi_hw_reduced_hotplug
>>     tpm
>>     acpi_hest_ghes
>>
>>
>> Can you please check if this solves the error?
>>
>
> Yes it does! Thanks!
>
>>
>> Thanks,
>>
>>
>> Daniel
>>
>>>
>>>>> ---
>>>>>  docs/specs/ppc-spapr-numa.rst | 191 ++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 191 insertions(+)
>>>>>  create mode 100644 docs/specs/ppc-spapr-numa.rst
>>>>>
>>>>> diff --git a/docs/specs/ppc-spapr-numa.rst b/docs/specs/ppc-spapr-numa.rst
>>>>> new file mode 100644
>>>>> index 0000000000..e762038022
>>>>> --- /dev/null
>>>>> +++ b/docs/specs/ppc-spapr-numa.rst
>>>>> @@ -0,0 +1,191 @@
>>>>> +
>>>>> +NUMA mechanics for sPAPR (pseries machines)
>>>>> +============================================
>>>>> +
>>>>> +NUMA in sPAPR works differently from the System Locality Distance
>>>>> +Information Table (SLIT) in ACPI. The logic is explained in LOPAPR
>>>>> +1.1, chapter 15, "Non Uniform Memory Access (NUMA) Option". This
>>>>> +document aims to complement that specification, providing details
>>>>> +of the elements that impact how QEMU views NUMA in pseries.
>>>>> +
>>>>> +Associativity and ibm,associativity property
>>>>> +--------------------------------------------
>>>>> +
>>>>> +Associativity is defined as a group of platform resources that have
>>>>> +similar mean performance (or, in our context here, distance) relative to
>>>>> +everyone else outside of the group.
>>>>> +
>>>>> +The format of the ibm,associativity property varies with the value of
>>>>> +bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
>>>>> +bit 0 equal to zero is deprecated. The current format, with bit 0
>>>>> +equal to one, makes the ibm,associativity property represent the
>>>>> +physical hierarchy of the platform, as one or more lists that start
>>>>> +with the highest-level grouping and go down to the smallest. Consider the
>>>>> +following topology:
>>>>> +
>>>>> +::
>>>>> +
>>>>> +    Mem M1 ---- Proc P1    |
>>>>> +    -----------------      | Socket S1  ---|
>>>>> +          chip C1          |               |
>>>>> +                                           | HW module 1 (MOD1)
>>>>> +    Mem M2 ---- Proc P2    |               |
>>>>> +    -----------------      | Socket S2  ---|
>>>>> +          chip C2          |
>>>>> +
>>>>> +The ibm,associativity property for the processors would be:
>>>>> +
>>>>> +* P1: {MOD1, S1, C1, P1}
>>>>> +* P2: {MOD1, S2, C2, P2}
>>>>> +
>>>>> +Each allocable resource has an ibm,associativity property. The LOPAPR
>>>>> +specification allows multiple lists to be present in this property,
>>>>> +considering that the same resource can have multiple connections to the
>>>>> +platform.
>>>>> +
>>>>> +Relative Performance Distance and ibm,associativity-reference-points
>>>>> +--------------------------------------------------------------------
>>>>> +
>>>>> +The ibm,associativity-reference-points property is an array used to
>>>>> +define the relevant performance/distance-related boundaries, i.e. the
>>>>> +NUMA levels for the platform.
>>>>> +
>>>>> +The definition of its elements also varies with the value of bit 0 of byte 5
>>>>> +of the ibm,architecture-vec-5 property.
>>>>> +The format with bit 0 equal to zero
>>>>> +is also deprecated. With the current format, each integer of
>>>>> +ibm,associativity-reference-points represents a 1-based ordinal index (i.e.
>>>>> +the first element is 1) into the ibm,associativity array. The first
>>>>> +boundary is the most significant to application performance, followed by
>>>>> +less significant boundaries. Allocated resources that belong to the
>>>>> +same performance boundary are expected to have relative NUMA distances
>>>>> +that match the relevance of the boundary itself. Resources that belong
>>>>> +to the same first boundary will have the shortest distance from each
>>>>> +other. Subsequent boundaries represent greater distances and degraded
>>>>> +performance.
>>>>> +
>>>>> +Using the previous example, the following reference-points setting defines
>>>>> +three NUMA levels:
>>>>> +
>>>>> +* ibm,associativity-reference-points = {0x3, 0x2, 0x1}
>>>>> +
>>>>> +The first NUMA level (0x3) is interpreted as the third element of each
>>>>> +ibm,associativity array, the second level is the second element and
>>>>> +the third level is the first element. Let's also consider that elements
>>>>> +belonging to the first NUMA level have distance equal to 10 from each
>>>>> +other, and that each NUMA level doubles the distance of the previous one. This
>>>>> +means that the second level would be 20 and the third level 40.
>>>>> +For the P1 and P2 processors, we would have the following NUMA levels:
>>>>> +
>>>>> +::
>>>>> +
>>>>> +  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
>>>>> +
>>>>> +  * P1: associativity{MOD1, S1, C1, P1}
>>>>> +
>>>>> +    First NUMA level (0x3)  => associativity[2] = C1
>>>>> +    Second NUMA level (0x2) => associativity[1] = S1
>>>>> +    Third NUMA level (0x1)  => associativity[0] = MOD1
>>>>> +
>>>>> +  * P2: associativity{MOD1, S2, C2, P2}
>>>>> +
>>>>> +    First NUMA level (0x3)  => associativity[2] = C2
>>>>> +    Second NUMA level (0x2) => associativity[1] = S2
>>>>> +    Third NUMA level (0x1)  => associativity[0] = MOD1
>>>>> +
>>>>> +  P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40
>>>>> +
>>>>> +Changing the ibm,associativity-reference-points array changes the performance
>>>>> +distance attributes for the same associativity arrays, as the following
>>>>> +example illustrates:
>>>>> +
>>>>> +::
>>>>> +
>>>>> +  * ibm,associativity-reference-points = {0x2}
>>>>> +
>>>>> +  * P1: associativity{MOD1, S1, C1, P1}
>>>>> +
>>>>> +    First NUMA level (0x2) => associativity[1] = S1
>>>>> +
>>>>> +  * P2: associativity{MOD1, S2, C2, P2}
>>>>> +
>>>>> +    First NUMA level (0x2) => associativity[1] = S2
>>>>> +
>>>>> +  P1 and P2 do not have a common performance boundary. Since this is a
>>>>> +  one-level NUMA configuration, the distance between them is one boundary
>>>>> +  above the first level, 20.
>>>>> +
>>>>> +
>>>>> +In a hypothetical platform where all resources inside the same hardware module
>>>>> +are considered to be on the same performance boundary:
>>>>> +
>>>>> +::
>>>>> +
>>>>> +  * ibm,associativity-reference-points = {0x1}
>>>>> +
>>>>> +  * P1: associativity{MOD1, S1, C1, P1}
>>>>> +
>>>>> +    First NUMA level (0x1) => associativity[0] = MOD1
>>>>> +
>>>>> +  * P2: associativity{MOD1, S2, C2, P2}
>>>>> +
>>>>> +    First NUMA level (0x1) => associativity[0] = MOD1
>>>>> +
>>>>> +  P1 and P2 belong to the same first-order boundary.
>>>>> +  The distance between them is 10.
>>>>> +
>>>>> +
>>>>> +How the pseries Linux guest calculates NUMA distances
>>>>> +=====================================================
>>>>> +
>>>>> +Another key difference between ACPI SLIT and LOPAPR regarding NUMA is
>>>>> +how the distances are expressed. The SLIT table provides the NUMA distance
>>>>> +value between the relevant resources. LOPAPR does not provide a standard
>>>>> +way to calculate it. We have the ibm,associativity property for each
>>>>> +resource, which provides a common-performance hierarchy, and the
>>>>> +ibm,associativity-reference-points array that tells which levels of
>>>>> +associativity are considered relevant or not.
>>>>> +
>>>>> +The result is that each OS is free to implement and interpret the distance
>>>>> +as it sees fit. For the pseries Linux guest, each NUMA level doubles
>>>>> +the distance of the previous level, and the maximum number of levels is
>>>>> +limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the
>>>>> +kernel tree). This results in the following distances:
>>>>> +
>>>>> +* both resources in the first NUMA level: 10
>>>>> +* resources one NUMA level apart: 20
>>>>> +* resources two NUMA levels apart: 40
>>>>> +* resources three NUMA levels apart: 80
>>>>> +* resources four NUMA levels apart: 160
>>>>> +
>>>>> +
>>>>> +Consequences for QEMU NUMA tuning
>>>>> +---------------------------------
>>>>> +
>>>>> +The way the pseries Linux guest calculates NUMA distances has a direct effect
>>>>> +on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
>>>>> +the default ibm,associativity-reference-points being used in the pseries
>>>>> +machine:
>>>>> +
>>>>> +ibm,associativity-reference-points = {0x4, 0x4, 0x2}
>>>>> +
>>>>> +The first and second levels are equal, 0x4, and a third one was added in
>>>>> +commit a6030d7e0b35 exclusively for NVLink GPU support.
>>>>> +This means that,
>>>>> +regardless of how the ibm,associativity properties are created in
>>>>> +the device tree, the pseries Linux guest will only recognize three scenarios
>>>>> +as far as NUMA distance goes:
>>>>> +
>>>>> +* if the resources belong to the same first NUMA level: distance = 10
>>>>> +* the second level is skipped, since it's equal to the first
>>>>> +* all resources that aren't an NVLink GPU are guaranteed to belong
>>>>> +  to the same third NUMA level, with distance = 40
>>>>> +* for NVLink GPUs: distance = 80 from everything else
>>>>> +
>>>>> +In short, we can summarize the NUMA distances seen in pseries Linux guests,
>>>>> +using QEMU up to 5.1, as follows:
>>>>> +
>>>>> +* local distance, i.e. the distance of the resource to its own NUMA node: 10
>>>>> +* if it's an NVLink GPU device, distance: 80
>>>>> +* every other resource, distance: 40
>>>>> +
>>>>> +This also means that user input on the QEMU command line does not change the
>>>>> +NUMA distances inside the guest for the pseries machine.
>>>>
>>>
>
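[Editorial aside, not part of the thread: the distance rule the quoted document describes — walk ibm,associativity-reference-points in order, return the current distance at the first boundary where two resources match, and double it for each boundary where they don't — can be sketched in Python. The helper name `numa_distance` and the list-based encoding of the associativity arrays are illustrative assumptions, not QEMU or kernel code.]

```python
MAX_DISTANCE_REF_POINTS = 4  # cap described in arch/powerpc/mm/numa.c


def numa_distance(assoc_a, assoc_b, ref_points):
    """Distance between two resources per the doubling rule above.

    ref_points holds 1-based ordinal indices into the associativity
    lists, most significant boundary first.
    """
    distance = 10  # local distance
    for idx in ref_points[:MAX_DISTANCE_REF_POINTS]:
        if assoc_a[idx - 1] == assoc_b[idx - 1]:
            return distance  # resources share this performance boundary
        distance *= 2        # one NUMA level further apart
    return distance


# Worked examples from the document:
p1 = ["MOD1", "S1", "C1", "P1"]
p2 = ["MOD1", "S2", "C2", "P2"]
print(numa_distance(p1, p2, [3, 2, 1]))  # 40: only the HW module is shared
print(numa_distance(p1, p2, [2]))        # 20: no common boundary, one level up
print(numa_distance(p1, p2, [1]))        # 10: same first-order boundary
```

The three prints reproduce the 40/20/10 results of the three reference-points examples in the quoted doc.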