From: "David E. Box" <david.e.box@linux.intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>,
Sergey Dolgov <sergey.v.dolgov@gmail.com>
Cc: linux-pci@vger.kernel.org,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com>,
"Artem S. Tashkinov" <aros@gmx.com>,
Mika Westerberg <mika.westerberg@linux.intel.com>
Subject: Re: [Bug 219984] New: [BISECTED] High power usage since 'PCI/ASPM: Correct LTR_L1.2_THRESHOLD computation'
Date: Fri, 02 May 2025 14:30:33 -0700
Message-ID: <2f33f07841d5a20a1eb14c73b2c3000dd45a031b.camel@linux.intel.com>
In-Reply-To: <20250410220915.GA326095@bhelgaas>
Hi all,
On Thu, 2025-04-10 at 17:09 -0500, Bjorn Helgaas wrote:
> On Thu, Apr 10, 2025 at 02:59:41PM +0100, Sergey Dolgov wrote:
> > Dear Bjorn,
> >
> > one power consumer (probably the main one) is the CPU sitting in
> > shallow C-states post 7afeb84d14ea. Even under some load (like web
> > browsing), the CPU spends most of its time in C7 after reverting
> > 7afeb84d14ea, in contrast to C3 even at idle on the original 6.14.0.
> > So the main question is: what can keep the CPU busy with larger
> > LTR_L1.2_THRESHOLDs?
>
> That's a good question and I have no idea what the answer is.
> Obviously a larger LTR_L1.2_THRESHOLD means less time in L1.2, but I
> don't know how that translates to CPU C states.
>
> These bugs:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=218394
> https://bugzilla.kernel.org/show_bug.cgi?id=215832
>
> mention C states and ASPM. I added some of those folks to cc.
>
> > I do have Win10 too, but neither the Windows binaries of pciutils nor
> > Device Manager show LTR_L1.2_THRESHOLD. lspci -vv run as Administrator
> > does report some "latencies" though. Some of them are significantly
> > smaller than the LTR_L1.2_THRESHOLDs calculated by Linux, e.g. "Exit
> > Latency L0s <1us, L1 <16us" for the bridge 00:1d.6, while others are
> > significantly larger, e.g. "Exit Latency L1 unlimited" for the NVMe
> > 6e:00.0. The full log is attached.
>
> I think I'm missing your point here. The L0s/L1 Acceptable Latencies
> and the L0s/L1 Exit Latencies I see in your Win10 lspci are the same
> as in Linux, which is what I would expect because these are read-only
> Device and Link Capability registers and the OS can't influence them:
>
> 00:1d.6 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #15
>   LnkCap: Port #15, Speed 8GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <16us
>
> 6e:00.0 Non-Volatile memory controller: Intel Corporation Optane NVME SSD H10
>   DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
>   LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM L1, Exit Latency L1 unlimited
>
> The DevCap L0s and L1 Acceptable Latencies are "the acceptable total
> latency that an Endpoint can withstand due to transition from L0s or
> L1 to L0. It is essentially an indirect measure of the Endpoint's
> internal buffering."
>
> The LnkCap L0s and L1 Exit Latencies are the "time the Port requires
> to complete transitions from L0s or L1 to L0."
>
> I don't know how to relate LTR_L1.2_THRESHOLD to L1. I do know that
> L0s and L1 were part of PCIe r2.1, but it wasn't until r3.1 that the
> L1.1 and L1.2 substates were added and L1 was renamed to L1.0. So I
> expect the L1 latencies to be used to enable L1.0 by itself, and I
> assume LTR and LTR_L1.2_THRESHOLD are used separately to further
> enable L1.2.
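
As a side note: on the Linux side, lspci's L1 PM Substates decode shows
the threshold the kernel actually programmed, which may help compare
against whatever Windows does. A sketch (the 00:1d.6 port is taken from
the log above):

  sudo lspci -s 00:1d.6 -vvv | grep -A4 'L1 PM Substates'

The L1SubCtl1 line there reports the current LTR1.2_Threshold in ns.
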
>
> > But do we need to care about precise values? At least we know now that
> > 7afeb84d14ea has only slightly increased the thresholds. What happens
> > if they are underestimated? Can this lead to severe problems, e.g.
> > data corruption on NVMes?
>
> IIUC, LTR messages are essentially a way for the device to say "I have
> enough local buffer space to hold X ns worth of traffic while I'm
> waiting for the link to return to L0."
>
> Then we should only put the link in L1.2 if we can get to L1.2 and
> back to L0 in X ns or less, and LTR_L1.2_THRESHOLD is basically the
> minimum L0 -> L1.2 -> L0 time.
>
> If we set LTR_L1.2_THRESHOLD lower than it should be, it seems like
> we're at risk of overrunning the device's local buffer. Maybe that's
> OK and the device needs to be able to tolerate that, but it does feel
> risky to me.
>
> There's also the LTR Capability that "specifies the maximum latency a
> device is permitted to request. Software should set this to the
> platform's maximum supported latency or less." Evidently drivers can
> set this (only amdgpu does, AFAICS), but I have no idea how to use it.
>
> I suppose setting it to something less than LTR_L1.2_THRESHOLD might
> cause L1.2 to be used more? This would be writable via setpci, and it
> looks like it can be updated any time. If you want to play with it,
> the value and scale are encoded the same way as
> encode_l12_threshold(), and PCI_EXT_CAP_ID_LTR and related #defines
> show the register layouts.
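
For reference, a sketch of how one could inspect and tweak that with
setpci (assuming a pciutils new enough to know the ECAP_LTR name;
otherwise use the raw offset from lspci -vv). Per the PCI_LTR_* defines,
the 16-bit Max Snoop/No-Snoop Latency registers sit at offsets 0x4 and
0x6 in the capability, latency value in bits 9:0, scale in bits 12:10:

  # Read the current Max Snoop / Max No-Snoop Latency of the NVMe:
  sudo setpci -s 6e:00.0 ECAP_LTR+0x4.w
  sudo setpci -s 6e:00.0 ECAP_LTR+0x6.w
  # Illustrative write: value 0x3c, scale 2 (1024 ns units) = 0x083c, ~61 us:
  sudo setpci -s 6e:00.0 ECAP_LTR+0x4.w=0x083c
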
>
> > If not (and I've never seen such a problem in 4 years on 5.15 kernels), can
> > we reprogram LTR_L1.2_THRESHOLDs at runtime? Like for the CPU,
> > introduce 'performance' and 'powersave' governors for the PCI, which
> > set the thresholds to, say, 2x and 0.5x (2 + 4 + t_common_mode +
> > t_power_on), respectively.
>
> I don't think I would support a sysfs or similar interface to tweak
> this. Right now computing LTR_L1.2_THRESHOLD already feels like a bit
> of black magic, and tweaking it would be farther down the road of
> "well, it seems to help this situation, but we don't really know why."
>
> > On Wed, Apr 9, 2025 at 12:18 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > On Tue, Apr 08, 2025 at 09:02:46PM +0100, Sergey Dolgov wrote:
> > > > Dear Bjorn,
> > > >
> > > > here are both dmesg from the kernels with your info patch.
> > >
> > > Thanks again! Here's the difference:
> > >
> > > - pre 7afeb84d14ea
> > > + post 7afeb84d14ea
> > >
> > > pci 0000:02:00.0: parent CMRT 0x28 child CMRT 0x00
> > > pci 0000:02:00.0: parent T_POWER_ON 0x2c usec (val 0x16 scale 0)
> > > pci 0000:02:00.0: child T_POWER_ON 0x0a usec (val 0x5 scale 0)
> > > pci 0000:02:00.0: t_common_mode 0x28 t_power_on 0x2c l1_2_threshold 0x5a
> > > -pci 0000:02:00.0: encoded LTR_L1.2_THRESHOLD value 0x02 scale 3
> > > +pci 0000:02:00.0: encoded LTR_L1.2_THRESHOLD value 0x58 scale 2
> > >
> > > We computed LTR_L1.2_THRESHOLD == 0x5a == 90 usec == 90000 nsec.
> > >
> > > Prior to 7afeb84d14ea, we computed *scale = 3, *value = (90000 >> 15)
> > > == 0x2. But per PCIe r6.0, sec 6.18, this is a latency value of only
> > > 0x2 * 32768 == 65536 ns, which is less than the 90000 ns we requested.
> > >
> > > After 7afeb84d14ea, we computed *scale = 2, *value =
> > > roundup(threshold_ns, 1024) / 1024 == 0x58, which is a latency value
> > > of 90112 ns, which is almost exactly what we requested.
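
For concreteness, the two encodings of the 90000 ns threshold as quick
shell arithmetic (a sketch of the math only, not the kernel code):

  threshold_ns=90000
  # pre-7afeb84d14ea: scale 3 (32768 ns units), truncating shift:
  echo $(( (threshold_ns >> 15) * 32768 ))          # 65536 ns, undershoots
  # post-7afeb84d14ea: scale 2 (1024 ns units), rounded up:
  echo $(( (threshold_ns + 1023) / 1024 * 1024 ))   # 90112 ns
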
> > >
> > > In essence, before 7afeb84d14ea we tell the Root Port that it can
> > > enter L1.2 and get back to L0 in 65536 ns or less, and after
> > > 7afeb84d14ea, we tell it that it may take up to 90112 ns.
> > >
> > > It's possible that the calculation of LTR_L1.2_THRESHOLD itself in
> > > aspm_calc_l12_info() is too conservative, and we don't actually need
> > > 90 usec, but I think the encoding done by 7afeb84d14ea itself is more
> > > correct. I don't have any information about how to improve the 90 usec
> > > estimate. (If you happen to have Windows on that box, it would be
> > > really interesting to see how it sets LTR_L1.2_THRESHOLD.)
> > >
> > > If the device has sent LTR messages indicating a latency requirement
> > > between 65536 ns and 90112 ns, the pre-7afeb84d14ea kernel would allow
> > > L1.2 while post 7afeb84d14ea would not. I don't think we can actually
> > > see the LTR messages sent by the device, but my guess is they must be
> > > in that range. I don't know if that's enough to account for the major
> > > difference in power consumption you're seeing.
If the Root Port is attached to a controller in the South Complex, which would
be the case on a Cannon Lake based platform, you can observe the resulting LTR
value sent from the Port using the pmc_core driver:

  cat /sys/kernel/debug/pmc_core/ltr_show | grep SOUTHPORT

This requires CONFIG_INTEL_PMC_CORE, which the major distros enable.
The SOUTHPORTs correspond to Root Ports. Unfortunately, we don't currently have
a mapping between the internal PMC SOUTHPORT_X designation and the PCI bus
enumeration. However, since this behavior clearly affects C-state entry, you
should be able to narrow it down by monitoring this file, ideally capturing
several snapshots, since the values can change depending on device activity.
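
One way to do that, using watch(1) to snapshot the file once a second
while reproducing the workload (just a suggestion; any periodic capture
works):

  watch -n1 'grep SOUTHPORT /sys/kernel/debug/pmc_core/ltr_show'
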
Note that the value shown may not exactly match what was sent in the LTR
message, but it won’t be smaller. My current assumption (pending confirmation)
is that simply entering L1.2 increases the effective LTR value observed by the
CPU, since it’s unlikely that the LTR message value itself changes solely as a
result of modifying the threshold.
Incidentally, you can also make the PMC ignore the LTR from the Port by writing
the bit value (first column) to the ltr_ignore file in the same directory. This
is for testing only, since it disregards device activity. But you should see
deeper C-state residency after ignoring the problem Port, which would be
another way to narrow down the SOUTHPORT. The LTR consideration can be restored
by writing the same bit value to the ltr_restore file.
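
A hypothetical end-to-end run (the bit index 42 below is illustrative;
use whatever the first column of ltr_show reports for your suspect
SOUTHPORT):

  cd /sys/kernel/debug/pmc_core
  grep SOUTHPORT ltr_show     # note the first-column bit index
  echo 42 > ltr_ignore        # testing only: PMC now ignores this port's LTR
  # ...check C-state residency here, e.g. with turbostat...
  echo 42 > ltr_restore       # undo
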
David
> > >
> > > The AX200 at 6f:00.0 is in exactly the same situation as the
> > > Thunderbolt bridge at 02:00.0 (LTR_L1.2_THRESHOLD 90 usec, RP set to
> > > 65536 ns before 7afeb84d14ea and 90112 ns after).
> > >
> > > For the NVMe devices at 6d:00.0 and 6e:00.0, LTR_L1.2_THRESHOLD is
> > > 3206 usec (!), and we set the RP to 3145728 ns (slightly too small)
> > > before, 3211264 ns after.
> > >
> > > For the RTS525A at 70:00.0, LTR_L1.2_THRESHOLD is 126 usec, and we set
> > > the RP to 98304 ns before, 126976 ns after.
> > >
> > > Sorry, no real answers here yet, still puzzled.
> > >
> > > Bjorn