Linux MIPS Architecture development
 help / color / mirror / Atom feed
* [PATCH AUTOSEL 5.15 6/9] MIPS: Octeon: Add PCIe link status check
From: Sasha Levin @ 2024-06-05 11:54 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Songyang Li, Thomas Bogendoerfer, Sasha Levin, rdunlap, bhelgaas,
	linux-mips
In-Reply-To: <20240605115415.2964165-1-sashal@kernel.org>

From: Songyang Li <leesongyang@outlook.com>

[ Upstream commit 29b83a64df3b42c88c0338696feb6fdcd7f1f3b7 ]

The standard PCIe configuration read-write interface is used to
access the configuration space of the peripheral PCIe devices
of the mips processor after the PCIe link surprise down, it can
generate kernel panic caused by "Data bus error". So it is
necessary to add PCIe link status check for system protection.
When the PCIe link is down or in training, assigning a value
of 0 to the configuration address can prevent read-write behavior
to the configuration space of peripheral PCIe devices, thereby
preventing kernel panic.

Signed-off-by: Songyang Li <leesongyang@outlook.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/mips/pci/pcie-octeon.c | 6 ++++++
 1 file changed, 6 insertions(+)
 mode change 100644 => 100755 arch/mips/pci/pcie-octeon.c

diff --git a/arch/mips/pci/pcie-octeon.c b/arch/mips/pci/pcie-octeon.c
old mode 100644
new mode 100755
index d919a0d813a17..38de2a9c3cf1a
--- a/arch/mips/pci/pcie-octeon.c
+++ b/arch/mips/pci/pcie-octeon.c
@@ -230,12 +230,18 @@ static inline uint64_t __cvmx_pcie_build_config_addr(int pcie_port, int bus,
 {
 	union cvmx_pcie_address pcie_addr;
 	union cvmx_pciercx_cfg006 pciercx_cfg006;
+	union cvmx_pciercx_cfg032 pciercx_cfg032;
 
 	pciercx_cfg006.u32 =
 	    cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG006(pcie_port));
 	if ((bus <= pciercx_cfg006.s.pbnum) && (dev != 0))
 		return 0;
 
+	pciercx_cfg032.u32 =
+		cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG032(pcie_port));
+	if ((pciercx_cfg032.s.dlla == 0) || (pciercx_cfg032.s.lt == 1))
+		return 0;
+
 	pcie_addr.u64 = 0;
 	pcie_addr.config.upper = 2;
 	pcie_addr.config.io = 1;
-- 
2.43.0


^ permalink raw reply related

* [PATCH AUTOSEL 6.1 07/12] MIPS: Octeon: Add PCIe link status check
From: Sasha Levin @ 2024-06-05 11:53 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Songyang Li, Thomas Bogendoerfer, Sasha Levin, bhelgaas, rdunlap,
	linux-mips
In-Reply-To: <20240605115334.2963803-1-sashal@kernel.org>

From: Songyang Li <leesongyang@outlook.com>

[ Upstream commit 29b83a64df3b42c88c0338696feb6fdcd7f1f3b7 ]

The standard PCIe configuration read-write interface is used to
access the configuration space of the peripheral PCIe devices
of the mips processor after the PCIe link surprise down, it can
generate kernel panic caused by "Data bus error". So it is
necessary to add PCIe link status check for system protection.
When the PCIe link is down or in training, assigning a value
of 0 to the configuration address can prevent read-write behavior
to the configuration space of peripheral PCIe devices, thereby
preventing kernel panic.

Signed-off-by: Songyang Li <leesongyang@outlook.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/mips/pci/pcie-octeon.c | 6 ++++++
 1 file changed, 6 insertions(+)
 mode change 100644 => 100755 arch/mips/pci/pcie-octeon.c

diff --git a/arch/mips/pci/pcie-octeon.c b/arch/mips/pci/pcie-octeon.c
old mode 100644
new mode 100755
index c9edd3fb380df..9eaacd3d33880
--- a/arch/mips/pci/pcie-octeon.c
+++ b/arch/mips/pci/pcie-octeon.c
@@ -230,12 +230,18 @@ static inline uint64_t __cvmx_pcie_build_config_addr(int pcie_port, int bus,
 {
 	union cvmx_pcie_address pcie_addr;
 	union cvmx_pciercx_cfg006 pciercx_cfg006;
+	union cvmx_pciercx_cfg032 pciercx_cfg032;
 
 	pciercx_cfg006.u32 =
 	    cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG006(pcie_port));
 	if ((bus <= pciercx_cfg006.s.pbnum) && (dev != 0))
 		return 0;
 
+	pciercx_cfg032.u32 =
+		cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG032(pcie_port));
+	if ((pciercx_cfg032.s.dlla == 0) || (pciercx_cfg032.s.lt == 1))
+		return 0;
+
 	pcie_addr.u64 = 0;
 	pcie_addr.config.upper = 2;
 	pcie_addr.config.io = 1;
-- 
2.43.0


^ permalink raw reply related

* [PATCH AUTOSEL 6.6 10/20] MIPS: Octeon: Add PCIe link status check
From: Sasha Levin @ 2024-06-05 11:51 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Songyang Li, Thomas Bogendoerfer, Sasha Levin, rdunlap, bhelgaas,
	linux-mips
In-Reply-To: <20240605115225.2963242-1-sashal@kernel.org>

From: Songyang Li <leesongyang@outlook.com>

[ Upstream commit 29b83a64df3b42c88c0338696feb6fdcd7f1f3b7 ]

The standard PCIe configuration read-write interface is used to
access the configuration space of the peripheral PCIe devices
of the mips processor after the PCIe link surprise down, it can
generate kernel panic caused by "Data bus error". So it is
necessary to add PCIe link status check for system protection.
When the PCIe link is down or in training, assigning a value
of 0 to the configuration address can prevent read-write behavior
to the configuration space of peripheral PCIe devices, thereby
preventing kernel panic.

Signed-off-by: Songyang Li <leesongyang@outlook.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/mips/pci/pcie-octeon.c | 6 ++++++
 1 file changed, 6 insertions(+)
 mode change 100644 => 100755 arch/mips/pci/pcie-octeon.c

diff --git a/arch/mips/pci/pcie-octeon.c b/arch/mips/pci/pcie-octeon.c
old mode 100644
new mode 100755
index c9edd3fb380df..9eaacd3d33880
--- a/arch/mips/pci/pcie-octeon.c
+++ b/arch/mips/pci/pcie-octeon.c
@@ -230,12 +230,18 @@ static inline uint64_t __cvmx_pcie_build_config_addr(int pcie_port, int bus,
 {
 	union cvmx_pcie_address pcie_addr;
 	union cvmx_pciercx_cfg006 pciercx_cfg006;
+	union cvmx_pciercx_cfg032 pciercx_cfg032;
 
 	pciercx_cfg006.u32 =
 	    cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG006(pcie_port));
 	if ((bus <= pciercx_cfg006.s.pbnum) && (dev != 0))
 		return 0;
 
+	pciercx_cfg032.u32 =
+		cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG032(pcie_port));
+	if ((pciercx_cfg032.s.dlla == 0) || (pciercx_cfg032.s.lt == 1))
+		return 0;
+
 	pcie_addr.u64 = 0;
 	pcie_addr.config.upper = 2;
 	pcie_addr.config.io = 1;
-- 
2.43.0


^ permalink raw reply related

* [PATCH AUTOSEL 6.8 11/24] MIPS: Octeon: Add PCIe link status check
From: Sasha Levin @ 2024-06-05 11:50 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Songyang Li, Thomas Bogendoerfer, Sasha Levin, rdunlap, bhelgaas,
	linux-mips
In-Reply-To: <20240605115101.2962372-1-sashal@kernel.org>

From: Songyang Li <leesongyang@outlook.com>

[ Upstream commit 29b83a64df3b42c88c0338696feb6fdcd7f1f3b7 ]

The standard PCIe configuration read-write interface is used to
access the configuration space of the peripheral PCIe devices
of the mips processor after the PCIe link surprise down, it can
generate kernel panic caused by "Data bus error". So it is
necessary to add PCIe link status check for system protection.
When the PCIe link is down or in training, assigning a value
of 0 to the configuration address can prevent read-write behavior
to the configuration space of peripheral PCIe devices, thereby
preventing kernel panic.

Signed-off-by: Songyang Li <leesongyang@outlook.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/mips/pci/pcie-octeon.c | 6 ++++++
 1 file changed, 6 insertions(+)
 mode change 100644 => 100755 arch/mips/pci/pcie-octeon.c

diff --git a/arch/mips/pci/pcie-octeon.c b/arch/mips/pci/pcie-octeon.c
old mode 100644
new mode 100755
index 2583e318e8c6b..b080c7c6cc463
--- a/arch/mips/pci/pcie-octeon.c
+++ b/arch/mips/pci/pcie-octeon.c
@@ -230,12 +230,18 @@ static inline uint64_t __cvmx_pcie_build_config_addr(int pcie_port, int bus,
 {
 	union cvmx_pcie_address pcie_addr;
 	union cvmx_pciercx_cfg006 pciercx_cfg006;
+	union cvmx_pciercx_cfg032 pciercx_cfg032;
 
 	pciercx_cfg006.u32 =
 	    cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG006(pcie_port));
 	if ((bus <= pciercx_cfg006.s.pbnum) && (dev != 0))
 		return 0;
 
+	pciercx_cfg032.u32 =
+		cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG032(pcie_port));
+	if ((pciercx_cfg032.s.dlla == 0) || (pciercx_cfg032.s.lt == 1))
+		return 0;
+
 	pcie_addr.u64 = 0;
 	pcie_addr.config.upper = 2;
 	pcie_addr.config.io = 1;
-- 
2.43.0


^ permalink raw reply related

* [PATCH AUTOSEL 6.9 12/28] MIPS: Octeon: Add PCIe link status check
From: Sasha Levin @ 2024-06-05 11:48 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Songyang Li, Thomas Bogendoerfer, Sasha Levin, bhelgaas, rdunlap,
	linux-mips
In-Reply-To: <20240605114927.2961639-1-sashal@kernel.org>

From: Songyang Li <leesongyang@outlook.com>

[ Upstream commit 29b83a64df3b42c88c0338696feb6fdcd7f1f3b7 ]

The standard PCIe configuration read-write interface is used to
access the configuration space of the peripheral PCIe devices
of the mips processor after the PCIe link surprise down, it can
generate kernel panic caused by "Data bus error". So it is
necessary to add PCIe link status check for system protection.
When the PCIe link is down or in training, assigning a value
of 0 to the configuration address can prevent read-write behavior
to the configuration space of peripheral PCIe devices, thereby
preventing kernel panic.

Signed-off-by: Songyang Li <leesongyang@outlook.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/mips/pci/pcie-octeon.c | 6 ++++++
 1 file changed, 6 insertions(+)
 mode change 100644 => 100755 arch/mips/pci/pcie-octeon.c

diff --git a/arch/mips/pci/pcie-octeon.c b/arch/mips/pci/pcie-octeon.c
old mode 100644
new mode 100755
index 2583e318e8c6b..b080c7c6cc463
--- a/arch/mips/pci/pcie-octeon.c
+++ b/arch/mips/pci/pcie-octeon.c
@@ -230,12 +230,18 @@ static inline uint64_t __cvmx_pcie_build_config_addr(int pcie_port, int bus,
 {
 	union cvmx_pcie_address pcie_addr;
 	union cvmx_pciercx_cfg006 pciercx_cfg006;
+	union cvmx_pciercx_cfg032 pciercx_cfg032;
 
 	pciercx_cfg006.u32 =
 	    cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG006(pcie_port));
 	if ((bus <= pciercx_cfg006.s.pbnum) && (dev != 0))
 		return 0;
 
+	pciercx_cfg032.u32 =
+		cvmx_pcie_cfgx_read(pcie_port, CVMX_PCIERCX_CFG032(pcie_port));
+	if ((pciercx_cfg032.s.dlla == 0) || (pciercx_cfg032.s.lt == 1))
+		return 0;
+
 	pcie_addr.u64 = 0;
 	pcie_addr.config.upper = 2;
 	pcie_addr.config.io = 1;
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers
From: Christoph Hellwig @ 2024-06-05  8:24 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Mina Almasry, Christoph Hellwig, netdev, linux-kernel, linux-doc,
	linux-alpha, linux-mips, linux-parisc, sparclinux,
	linux-trace-kernel, linux-arch, bpf, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Donald Hunter, Jonathan Corbet, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	James E.J. Bottomley, Helge Deller, Andreas Larsson,
	Jesper Dangaard Brouer, Ilias Apalodimas, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Arnd Bergmann,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, David Wei,
	Jason Gunthorpe, Yunsheng Lin, Shailend Chand,
	Harshitha Ramamurthy, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi
In-Reply-To: <5aee4bba-ca65-443c-bd78-e5599b814a13@gmail.com>

On Mon, Jun 03, 2024 at 03:52:32PM +0100, Pavel Begunkov wrote:
> The question for Christoph is what exactly is the objection here? Why we
> would not be using well defined ops when we know there will be more
> users?

The point is that there should be no more users.  If you need another
case you are doing something very wrong.


^ permalink raw reply

* Re: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers
From: Christoph Hellwig @ 2024-06-05  8:23 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Christoph Hellwig, netdev, linux-kernel, linux-doc, linux-alpha,
	linux-mips, linux-parisc, sparclinux, linux-trace-kernel,
	linux-arch, bpf, linux-kselftest, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Donald Hunter, Jonathan Corbet, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	James E.J. Bottomley, Helge Deller, Andreas Larsson,
	Jesper Dangaard Brouer, Ilias Apalodimas, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Arnd Bergmann,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Pavel Begunkov,
	David Wei, Jason Gunthorpe, Yunsheng Lin, Shailend Chand,
	Harshitha Ramamurthy, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi
In-Reply-To: <CAHS8izMU_nMEr04J9kXiX6rJqK4nQKA+W-enKLhNxvK7=H2pgA@mail.gmail.com>

On Mon, Jun 03, 2024 at 07:17:05AM -0700, Mina Almasry wrote:
> On Fri, May 31, 2024 at 10:35 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Thu, May 30, 2024 at 08:16:01PM +0000, Mina Almasry wrote:
> > > I'm unsure if the discussion has been resolved yet. Sending the series
> > > anyway to get reviews/feedback on the (unrelated) rest of the series.
> >
> > As far as I'm concerned it is not.  I've not seen any convincing
> > argument for more than page/folio allocator including larger order /
> > huge page and dmabuf.
> >
> 
> Thanks Christoph, this particular patch series adds dmabuf, so I
> assume no objection there. I assume the objection is that you want the
> generic, extensible hooks removed.

Exactly!  Note that this isn't a review of the dmabuf bits as there
are people more qualified with me.

> To be honest, I don't think the hooks are an integral part of the
> design, and at this point I think we've argued for them enough. I
> think we can easily achieve the same thing with just raw if statements
> in a couple of places. We can always add the hooks if and only if we
> actually justify many memory providers.
> 
> Any objections to me removing the hooks and directing to memory
> allocations via simple if statements? Something like (very rough
> draft, doesn't compile):

I like this approach, thanks! 

You might still want to keep the static key, though.

^ permalink raw reply

* Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator
From: Andrew Lunn @ 2024-06-05  0:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jason Gunthorpe, Paolo Abeni, Mina Almasry, netdev, linux-kernel,
	linux-doc, linux-alpha, linux-mips, linux-parisc, sparclinux,
	linux-trace-kernel, linux-arch, bpf, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Donald Hunter, Jonathan Corbet, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	James E.J. Bottomley, Helge Deller, Andreas Larsson,
	Jesper Dangaard Brouer, Ilias Apalodimas, Masami Hiramatsu,
	Mathieu Desnoyers, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Pavel Begunkov,
	David Wei, Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20240604202738.3aab6308@gandalf.local.home>

> How is the compiler going to know which path is going to be taken the most?
> There's two main paths in the ring buffer logic. One when an event stays on
> the sub-buffer, the other when the event crosses over to a new sub buffer.
> As there's 100s of events that happen on the same sub-buffer for every one
> time there's a cross over, I optimized the paths that stayed on the
> sub-buffer, which caused the time for those events to go from 250ns down to
> 150 ns!. That's a 40% speed up.
> 
> I added the unlikely/likely and 'always_inline' and 'noinline' paths to
> make sure the "staying on the buffer" path was always the hot path, and
> keeping it tight in cache.
> 
> How is a compiler going to know that?

It might have some heuristics to try to guess unlikely/likely, but
that is not what we are talking about here.

How much difference did 'always_inline' and 'noinline' make? Hopefully
the likely is enough of a clue it should prefer to inline whatever is
in that branch, where as for the unlikely case it can do a function
call.

But compilers is not my thing, which is why i would reach out to the
compiler people and ask them, is it expected to get this wrong, could
it be made better?

   Andrew

^ permalink raw reply

* Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator
From: Steven Rostedt @ 2024-06-05  0:27 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jason Gunthorpe, Paolo Abeni, Mina Almasry, netdev, linux-kernel,
	linux-doc, linux-alpha, linux-mips, linux-parisc, sparclinux,
	linux-trace-kernel, linux-arch, bpf, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Donald Hunter, Jonathan Corbet, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	James E.J. Bottomley, Helge Deller, Andreas Larsson,
	Jesper Dangaard Brouer, Ilias Apalodimas, Masami Hiramatsu,
	Mathieu Desnoyers, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Pavel Begunkov,
	David Wei, Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <3be107ce-3d9f-4528-b9f7-1c9e38da0688@lunn.ch>

On Wed, 5 Jun 2024 01:44:37 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> > Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> > adding strategic __always_inline, noinline, likely() and unlikely()
> > throughout the code. It had to do with what was considered the fast path
> > and slow path, and not actually the size of the function. gcc got it
> > horribly wrong.  
> 
> And what did the compiler people say when you reported gcc was getting
> it wrong?
> 
> Our assumption is, the compiler is better than a human at deciding
> this. Or at least, a human who does not spend a long time profiling
> and tuning. If this assumption is not true, we probably should be
> trying to figure out why, and improving the compiler when
> possible. That will benefit everybody.
> 

How is the compiler going to know which path is going to be taken the most?
There's two main paths in the ring buffer logic. One when an event stays on
the sub-buffer, the other when the event crosses over to a new sub buffer.
As there's 100s of events that happen on the same sub-buffer for every one
time there's a cross over, I optimized the paths that stayed on the
sub-buffer, which caused the time for those events to go from 250ns down to
150 ns!. That's a 40% speed up.

I added the unlikely/likely and 'always_inline' and 'noinline' paths to
make sure the "staying on the buffer" path was always the hot path, and
keeping it tight in cache.

How is a compiler going to know that?

-- Steve

^ permalink raw reply

* Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator
From: Andrew Lunn @ 2024-06-04 23:44 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jason Gunthorpe, Paolo Abeni, Mina Almasry, netdev, linux-kernel,
	linux-doc, linux-alpha, linux-mips, linux-parisc, sparclinux,
	linux-trace-kernel, linux-arch, bpf, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Donald Hunter, Jonathan Corbet, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	James E.J. Bottomley, Helge Deller, Andreas Larsson,
	Jesper Dangaard Brouer, Ilias Apalodimas, Masami Hiramatsu,
	Mathieu Desnoyers, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Pavel Begunkov,
	David Wei, Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20240604124243.66203a46@gandalf.local.home>

> Interesting, as I sped up the ftrace ring buffer by a substantial amount by
> adding strategic __always_inline, noinline, likely() and unlikely()
> throughout the code. It had to do with what was considered the fast path
> and slow path, and not actually the size of the function. gcc got it
> horribly wrong.

And what did the compiler people say when you reported gcc was getting
it wrong?

Our assumption is, the compiler is better than a human at deciding
this. Or at least, a human who does not spend a long time profiling
and tuning. If this assumption is not true, we probably should be
trying to figure out why, and improving the compiler when
possible. That will benefit everybody.

       Andrew


^ permalink raw reply

* Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
From: Sean Christopherson @ 2024-06-04 23:36 UTC (permalink / raw)
  To: Oliver Upton
  Cc: James Houghton, Andrew Morton, Paolo Bonzini, Albert Ou,
	Ankit Agrawal, Anup Patel, Atish Patra, Axel Rasmussen, Bibo Mao,
	Catalin Marinas, David Matlack, David Rientjes, Huacai Chen,
	James Morse, Jonathan Corbet, Marc Zyngier, Michael Ellerman,
	Nicholas Piggin, Palmer Dabbelt, Paul Walmsley,
	Raghavendra Rao Ananta, Ryan Roberts, Shaoqin Huang, Shuah Khan,
	Suzuki K Poulose, Tianrui Zhao, Will Deacon, Yu Zhao, Zenghui Yu,
	kvm-riscv, kvm, kvmarm, linux-arm-kernel, linux-doc, linux-kernel,
	linux-kselftest, linux-mips, linux-mm, linux-riscv, linuxppc-dev,
	loongarch
In-Reply-To: <Zl-cjHVKaQ0iQE5d@linux.dev>

On Tue, Jun 04, 2024, Oliver Upton wrote:
> On Tue, Jun 04, 2024 at 03:20:20PM -0700, James Houghton wrote:
> > On Fri, May 31, 2024 at 12:18 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > > On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > > > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> > > Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> > > now. Maybe demote it to:
> > >
> > >   r = kvm_pgtable_walk(...);
> > >   WARN_ON_ONCE(r && r != -EAGAIN);
> > 
> > Oh, indeed, thank you. Just to make sure -- does it make sense to
> > retry the cmpxchg if it fails? For example, the way I have it now for
> > x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
> > move on to the next one having done nothing. Does something like that
> > make sense for arm64?
> 
> At least for arm64 I do not see a need for retry. The only possible
> races are:
> 
>  - A stage-2 fault handler establishing / adjusting the mapping for the
>    GFN. If the guest is directly accessing the GFN in question, what's
>    the point of wiping out AF?
> 
>    Even when returning -EAGAIN we've already primed stage2_age_data::young,
>    so we report the correct state back to the primary MMU.
> 
>  - Another kvm_age_gfn() trying to age the same GFN. I haven't even
>    looked to see if this is possible from the primary MMU POV, but in
>    theory one of the calls will win the race and clear AF.
> 
> Given Yu's concerns about making pending writers wait, we should take
> every opportunity to bail on the walk.

+1.  The x86 path that retries is, for all intents and purposes, limited to Intel
CPUs that don't support EPT A/D bits, i.e. to pre-HSW CPUs.  I wouldn't make any
decisions based on that code.

^ permalink raw reply

* Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
From: Oliver Upton @ 2024-06-04 23:00 UTC (permalink / raw)
  To: James Houghton
  Cc: Andrew Morton, Paolo Bonzini, Albert Ou, Ankit Agrawal,
	Anup Patel, Atish Patra, Axel Rasmussen, Bibo Mao,
	Catalin Marinas, David Matlack, David Rientjes, Huacai Chen,
	James Morse, Jonathan Corbet, Marc Zyngier, Michael Ellerman,
	Nicholas Piggin, Palmer Dabbelt, Paul Walmsley,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Shuah Khan, Suzuki K Poulose, Tianrui Zhao,
	Will Deacon, Yu Zhao, Zenghui Yu, kvm-riscv, kvm, kvmarm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-kselftest,
	linux-mips, linux-mm, linux-riscv, linuxppc-dev, loongarch
In-Reply-To: <CADrL8HV4SZ9BEQg1j3ojG-v5umL_d3sa4e1k2vMQCMmBEgeFpQ@mail.gmail.com>

On Tue, Jun 04, 2024 at 03:20:20PM -0700, James Houghton wrote:
> On Fri, May 31, 2024 at 12:18 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> >
> > On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> > >
> > > [...]
> > >
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 9e2bbee77491..eabb07c66a07 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > > >     data->young = true;
> > > >
> > > >     /*
> > > > -    * stage2_age_walker() is always called while holding the MMU lock for
> > > > -    * write, so this will always succeed. Nonetheless, this deliberately
> > > > -    * follows the race detection pattern of the other stage-2 walkers in
> > > > -    * case the locking mechanics of the MMU notifiers is ever changed.
> > > > +    * This walk may not be exclusive; the PTE is permitted to change
> > > > +    * from under us.
> > > >      */
> > > >     if (data->mkold && !stage2_try_set_pte(ctx, new))
> > > >             return -EAGAIN;
> > >
> > > It is probably worth mentioning that if there was a race to update the
> > > PTE then the GFN is most likely young, so failing to clear AF probably
> > > isn't even consequential.
> 
> Thanks Oliver.
> 
> >
> > Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> > now. Maybe demote it to:
> >
> >   r = kvm_pgtable_walk(...);
> >   WARN_ON_ONCE(r && r != -EAGAIN);
> 
> Oh, indeed, thank you. Just to make sure -- does it make sense to
> retry the cmpxchg if it fails? For example, the way I have it now for
> x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
> move on to the next one having done nothing. Does something like that
> make sense for arm64?

At least for arm64 I do not see a need for retry. The only possible
races are:

 - A stage-2 fault handler establishing / adjusting the mapping for the
   GFN. If the guest is directly accessing the GFN in question, what's
   the point of wiping out AF?

   Even when returning -EAGAIN we've already primed stage2_age_data::young,
   so we report the correct state back to the primary MMU.

 - Another kvm_age_gfn() trying to age the same GFN. I haven't even
   looked to see if this is possible from the primary MMU POV, but in
   theory one of the calls will win the race and clear AF.

Given Yu's concerns about making pending writers wait, we should take
every opportunity to bail on the walk.

-- 
Thanks,
Oliver

^ permalink raw reply

* Re: [PATCH v4 6/7] KVM: arm64: Relax locking for kvm_test_age_gfn and kvm_age_gfn
From: James Houghton @ 2024-06-04 22:20 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Andrew Morton, Paolo Bonzini, Albert Ou, Ankit Agrawal,
	Anup Patel, Atish Patra, Axel Rasmussen, Bibo Mao,
	Catalin Marinas, David Matlack, David Rientjes, Huacai Chen,
	James Morse, Jonathan Corbet, Marc Zyngier, Michael Ellerman,
	Nicholas Piggin, Palmer Dabbelt, Paul Walmsley,
	Raghavendra Rao Ananta, Ryan Roberts, Sean Christopherson,
	Shaoqin Huang, Shuah Khan, Suzuki K Poulose, Tianrui Zhao,
	Will Deacon, Yu Zhao, Zenghui Yu, kvm-riscv, kvm, kvmarm,
	linux-arm-kernel, linux-doc, linux-kernel, linux-kselftest,
	linux-mips, linux-mm, linux-riscv, linuxppc-dev, loongarch
In-Reply-To: <Zloicw4IU8_-V5Ns@linux.dev>

On Fri, May 31, 2024 at 12:18 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> On Fri, May 31, 2024 at 12:11:33PM -0700, Oliver Upton wrote:
> > On Wed, May 29, 2024 at 06:05:09PM +0000, James Houghton wrote:
> >
> > [...]
> >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 9e2bbee77491..eabb07c66a07 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -1319,10 +1319,8 @@ static int stage2_age_walker(const struct kvm_pgtable_visit_ctx *ctx,
> > >     data->young = true;
> > >
> > >     /*
> > > -    * stage2_age_walker() is always called while holding the MMU lock for
> > > -    * write, so this will always succeed. Nonetheless, this deliberately
> > > -    * follows the race detection pattern of the other stage-2 walkers in
> > > -    * case the locking mechanics of the MMU notifiers is ever changed.
> > > +    * This walk may not be exclusive; the PTE is permitted to change
> > > +    * from under us.
> > >      */
> > >     if (data->mkold && !stage2_try_set_pte(ctx, new))
> > >             return -EAGAIN;
> >
> > It is probably worth mentioning that if there was a race to update the
> > PTE then the GFN is most likely young, so failing to clear AF probably
> > isn't even consequential.

Thanks Oliver.

>
> Oh, and the WARN_ON() in kvm_pgtable_stage2_test_clear_young() is bogus
> now. Maybe demote it to:
>
>   r = kvm_pgtable_walk(...);
>   WARN_ON_ONCE(r && r != -EAGAIN);

Oh, indeed, thank you. Just to make sure -- does it make sense to
retry the cmpxchg if it fails? For example, the way I have it now for
x86[1], we retry the cmpxchg if the spte is still a leaf, otherwise we
move on to the next one having done nothing. Does something like that
make sense for arm64?

[1]: https://lore.kernel.org/linux-mm/20240529180510.2295118-6-jthoughton@google.com/

^ permalink raw reply

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
From: Conor Dooley @ 2024-06-04 20:17 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: yunhui cui, Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
In-Reply-To: <CAHVXubhy1yEKOx91gc9S++yKOoQa+sJ5EDSiMFwR6qepwzRMew@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1982 bytes --]

On Tue, Jun 04, 2024 at 01:44:15PM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley <conor@kernel.org> wrote:
> >
> > On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > > >
> > > > > As for the current status of the patch, there are two points that can
> > > > > be optimized:
> > > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> > >
> > > That would produce a non-portable kernel. But I'm not opposed to that
> > > at all, let me check how we handle other extensions. Maybe @Conor
> > > Dooley has some feedback here?
> >
> > To be honest, not really sure what to give feedback on. Could you
> > elaborate on exactly what the option is going to do? Given the
> > portability concern, I guess you were proposing that the option would
> > remove the preventative fences, rather than your current patch that
> > removes them via an alternative?
> 
> No no, I won't do that, we need a generic kernel for distros so that's
> not even a question. What Yunhui was asking about (to me) is: can we
> introduce a Kconfig option to always remove the preventive fences,
> bypassing the use of alternatives altogether?
> 
> To me, it won't make a difference in terms of performance. But if we
> already offer such a possibility for other extensions, well I'll do
> it. Otherwise, the question is: should we start doing that?

We don't do that for other extensions yet, because currently all the
extensions we have options for are additive. There's like 3 alternative
patchsites, and they are all just one nop? I don't see the point of
having a Kconfig knob for that.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator
From: Steven Rostedt @ 2024-06-04 16:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paolo Abeni, Mina Almasry, netdev, linux-kernel, linux-doc,
	linux-alpha, linux-mips, linux-parisc, sparclinux,
	linux-trace-kernel, linux-arch, bpf, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Donald Hunter, Jonathan Corbet, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	James E.J. Bottomley, Helge Deller, Andreas Larsson,
	Jesper Dangaard Brouer, Ilias Apalodimas, Masami Hiramatsu,
	Mathieu Desnoyers, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Pavel Begunkov,
	David Wei, Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20240604163158.GB21513@ziepe.ca>

On Tue, 4 Jun 2024 13:31:58 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> > On Tue, 04 Jun 2024 12:13:15 +0200
> > Paolo Abeni <pabeni@redhat.com> wrote:
> >   
> > > On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:  
> > > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > > --- a/net/core/devmem.c
> > > > +++ b/net/core/devmem.c
> > > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > > >  	kfree(owner);
> > > >  }
> > > >  
> > > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)    
> > > 
> > > Minor nit: please no 'inline' keyword in c files.  
> > 
> > I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> > time.  
> 
> It mostly comes from Documentation/process/coding-style.rst:
> 
> 15) The inline disease
> ----------------------
> 
> There appears to be a common misperception that gcc has a magic "make me
> faster" speedup option called ``inline``. While the use of inlines can be
> appropriate (for example as a means of replacing macros, see Chapter 12), it
> very often is not. Abundant use of the inline keyword leads to a much bigger
> kernel, which in turn slows the system as a whole down, due to a bigger
> icache footprint for the CPU and simply because there is less memory
> available for the pagecache. Just think about it; a pagecache miss causes a
> disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
> that can go into these 5 milliseconds.
> 
> A reasonable rule of thumb is to not put inline at functions that have more
> than 3 lines of code in them. An exception to this rule are the cases where
> a parameter is known to be a compiletime constant, and as a result of this
> constantness you *know* the compiler will be able to optimize most of your
> function away at compile time. For a good example of this later case, see
> the kmalloc() inline function.
> 
> Often people argue that adding inline to functions that are static and used
> only once is always a win since there is no space tradeoff. While this is
> technically correct, gcc is capable of inlining these automatically without
> help, and the maintenance issue of removing the inline when a second user
> appears outweighs the potential value of the hint that tells gcc to do
> something it would have done anyway.
> 

Interesting, as I sped up the ftrace ring buffer by a substantial amount by
adding strategic __always_inline, noinline, likely() and unlikely()
throughout the code. It had to do with what was considered the fast path
and slow path, and not actually the size of the function. gcc got it
horribly wrong.

-- Steve

^ permalink raw reply

* Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator
From: Jason Gunthorpe @ 2024-06-04 16:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Paolo Abeni, Mina Almasry, netdev, linux-kernel, linux-doc,
	linux-alpha, linux-mips, linux-parisc, sparclinux,
	linux-trace-kernel, linux-arch, bpf, linux-kselftest, linux-media,
	dri-devel, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Donald Hunter, Jonathan Corbet, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	James E.J. Bottomley, Helge Deller, Andreas Larsson,
	Jesper Dangaard Brouer, Ilias Apalodimas, Masami Hiramatsu,
	Mathieu Desnoyers, Arnd Bergmann, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Pavel Begunkov,
	David Wei, Yunsheng Lin, Shailend Chand, Harshitha Ramamurthy,
	Shakeel Butt, Jeroen de Borst, Praveen Kaligineedi,
	Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <20240604121551.07192993@gandalf.local.home>

On Tue, Jun 04, 2024 at 12:15:51PM -0400, Steven Rostedt wrote:
> On Tue, 04 Jun 2024 12:13:15 +0200
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
> > On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > > index d82f92d7cf9ce..d5fac8edf621d 100644
> > > --- a/net/core/devmem.c
> > > +++ b/net/core/devmem.c
> > > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> > >  	kfree(owner);
> > >  }
> > >  
> > > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)  
> > 
> > Minor nit: please no 'inline' keyword in c files.
> 
> I'm curious. Is this a networking rule? I use 'inline' in my C code all the
> time.

It mostly comes from Documentation/process/coding-style.rst:

15) The inline disease
----------------------

There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called ``inline``. While the use of inlines can be
appropriate (for example as a means of replacing macros, see Chapter 12), it
very often is not. Abundant use of the inline keyword leads to a much bigger
kernel, which in turn slows the system as a whole down, due to a bigger
icache footprint for the CPU and simply because there is less memory
available for the pagecache. Just think about it; a pagecache miss causes a
disk seek, which easily takes 5 milliseconds. There are a LOT of cpu cycles
that can go into these 5 milliseconds.

A reasonable rule of thumb is to not put inline at functions that have more
than 3 lines of code in them. An exception to this rule are the cases where
a parameter is known to be a compiletime constant, and as a result of this
constantness you *know* the compiler will be able to optimize most of your
function away at compile time. For a good example of this later case, see
the kmalloc() inline function.

Often people argue that adding inline to functions that are static and used
only once is always a win since there is no space tradeoff. While this is
technically correct, gcc is capable of inlining these automatically without
help, and the maintenance issue of removing the inline when a second user
appears outweighs the potential value of the hint that tells gcc to do
something it would have done anyway.

Jason

^ permalink raw reply

* Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator
From: Steven Rostedt @ 2024-06-04 16:15 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Mina Almasry, netdev, linux-kernel, linux-doc, linux-alpha,
	linux-mips, linux-parisc, sparclinux, linux-trace-kernel,
	linux-arch, bpf, linux-kselftest, linux-media, dri-devel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Donald Hunter,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Masami Hiramatsu, Mathieu Desnoyers, Arnd Bergmann,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Steffen Klassert, Herbert Xu, David Ahern, Willem de Bruijn,
	Shuah Khan, Sumit Semwal, Christian König, Pavel Begunkov,
	David Wei, Jason Gunthorpe, Yunsheng Lin, Shailend Chand,
	Harshitha Ramamurthy, Shakeel Butt, Jeroen de Borst,
	Praveen Kaligineedi, Willem de Bruijn, Kaiyuan Zhang
In-Reply-To: <bea8b8bf1630309bb004f614e4a3c7f684a6acb6.camel@redhat.com>

On Tue, 04 Jun 2024 12:13:15 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index d82f92d7cf9ce..d5fac8edf621d 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
> >  	kfree(owner);
> >  }
> >  
> > +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)  
> 
> Minor nit: please no 'inline' keyword in c files.

I'm curious. Is this a networking rule? I use 'inline' in my C code all the
time.

-- Steve

^ permalink raw reply

* Re: [PATCH 3/7] MIPS: csrc-r4k: Select HAVE_UNSTABLE_SCHED_CLOCK if 64BIT
From: Jiaxun Yang @ 2024-06-04 14:51 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Thomas Bogendoerfer, Serge Semin, Daniel Lezcano, Thomas Gleixner,
	linux-mips@vger.kernel.org, linux-kernel
In-Reply-To: <alpine.DEB.2.21.2406012326580.23854@angie.orcam.me.uk>



在2024年6月1日六月 下午11:30,Maciej W. Rozycki写道:
> On Sat, 11 May 2024, Jiaxun Yang wrote:
>
>> csrc-r4k suffers from SMP synchronization overhead.
> [...]
>> diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
>> index f1aa1bf11166..fa8ca0287568 100644
>> --- a/arch/mips/Kconfig
>> +++ b/arch/mips/Kconfig
>> @@ -1083,6 +1083,7 @@ config CSRC_IOASIC
>>  
>>  config CSRC_R4K
>>  	select CLOCKSOURCE_WATCHDOG if CPU_FREQ
>> +	select HAVE_UNSTABLE_SCHED_CLOCK if 64BIT
>
>  Shouldn't it be:
>
> 	select HAVE_UNSTABLE_SCHED_CLOCK if 64BIT && SMP

This option is not depending on SMP on x86 and PARISC, so I followed
the same approach.

But yes, it makes sense to depend on SMP.

Thanks
>
> then?
>
>   Maciej

-- 
- Jiaxun

^ permalink raw reply

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
From: Alexandre Ghiti @ 2024-06-04 11:44 UTC (permalink / raw)
  To: Conor Dooley
  Cc: yunhui cui, Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
In-Reply-To: <20240604-dazzling-envy-1dcf111eb2c5@spud>

On Tue, Jun 4, 2024 at 10:52 AM Conor Dooley <conor@kernel.org> wrote:
>
> On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> > On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > >
> > > > As for the current status of the patch, there are two points that can
> > > > be optimized:
> > > > 1. Some chip hardware implementations may not cache TLB invalid
> > > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> >
> > That would produce a non-portable kernel. But I'm not opposed to that
> > at all, let me check how we handle other extensions. Maybe @Conor
> > Dooley has some feedback here?
>
> To be honest, not really sure what to give feedback on. Could you
> elaborate on exactly what the option is going to do? Given the
> portability concern, I guess you were proposing that the option would
> remove the preventative fences, rather than your current patch that
> removes them via an alternative?

No no, I won't do that, we need a generic kernel for distros so that's
not even a question. What Yunhui was asking about (to me) is: can we
introduce a Kconfig option to always remove the preventive fences,
bypassing the use of alternatives altogether?

To me, it won't make a difference in terms of performance. But if we
already offer such a possibility for other extensions, well I'll do
it. Otherwise, the question is: should we start doing that?

> I don't think we have any extension
> related options that work like that at the moment, and making that an
> option will just mean that distros that look to cater for multiple
> platforms won't be able to turn it on.
>
> Thanks,
> Conor.

^ permalink raw reply

* Re: [PATCH net-next v10 11/14] tcp: RX path for devmem TCP
From: Paolo Abeni @ 2024-06-04 10:53 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, linux-alpha,
	linux-mips, linux-parisc, sparclinux, linux-trace-kernel,
	linux-arch, bpf, linux-kselftest, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Donald Hunter,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Pavel Begunkov, David Wei, Jason Gunthorpe, Yunsheng Lin,
	Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <20240530201616.1316526-12-almasrymina@google.com>

On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> @@ -2317,6 +2318,213 @@ static int tcp_inq_hint(struct sock *sk)
>  	return inq;
>  }
>  
> +/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. */
> +struct tcp_xa_pool {
> +	u8		max; /* max <= MAX_SKB_FRAGS */
> +	u8		idx; /* idx <= max */
> +	__u32		tokens[MAX_SKB_FRAGS];
> +	netmem_ref	netmems[MAX_SKB_FRAGS];
> +};
> +
> +static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p,
> +			       bool lock)
> +{
> +	int i;
> +
> +	if (!p->max)
> +		return;
> +	if (lock)
> +		xa_lock_bh(&sk->sk_user_frags);

The conditional lock here confuses sparse.

I think you can avoid it providing a unlocked version (no need to check
for '!p->max' the only caller wanting the unlocked version already
performs such check) and a locked one, calling the other.

Cheers,

Paolo


^ permalink raw reply

* Re: [PATCH net-next v10 10/14] net: add support for skbs with unreadable frags
From: Paolo Abeni @ 2024-06-04 10:46 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, linux-alpha,
	linux-mips, linux-parisc, sparclinux, linux-trace-kernel,
	linux-arch, bpf, linux-kselftest, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Donald Hunter,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Pavel Begunkov, David Wei, Jason Gunthorpe, Yunsheng Lin,
	Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <20240530201616.1316526-11-almasrymina@google.com>

On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 26f09c3e830b7..7b9d018f552bd 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -422,6 +422,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
>  {
>  	struct skb_shared_info *pinfo = skb_shinfo(skb);
>  
> +	if (WARN_ON_ONCE(!skb_frags_readable(skb)))
> +		return;
> +
>  	BUG_ON(skb->end - skb->tail < grow);
>  
>  	memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> @@ -443,7 +446,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
>  {
>  	int grow = skb_gro_offset(skb) - skb_headlen(skb);
>  
> -	if (grow > 0)
> +	if (grow > 0 && skb_frags_readable(skb))
>  		gro_pull_from_frag0(skb, grow);
>  }

I'm unsure if this was already mentioned, so please pardon the eventual
duplicate...

The above code is quite critical performance wise, and the previous
patch already prevent frag0 from being set to a non paged frag, so what
about dropping the above additional checks?

thanks!

Paolo


^ permalink raw reply

* Re: [PATCH net-next v10 06/14] page_pool: convert to use netmem
From: Paolo Abeni @ 2024-06-04 10:23 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, linux-alpha,
	linux-mips, linux-parisc, sparclinux, linux-trace-kernel,
	linux-arch, bpf, linux-kselftest, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Donald Hunter,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Pavel Begunkov, David Wei, Jason Gunthorpe, Yunsheng Lin,
	Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, linux-mm, Matthew Wilcox
In-Reply-To: <20240530201616.1316526-7-almasrymina@google.com>

On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h
> index 6834356b2d2ae..c5b6383ff2760 100644
> --- a/include/trace/events/page_pool.h
> +++ b/include/trace/events/page_pool.h
> @@ -42,51 +42,52 @@ TRACE_EVENT(page_pool_release,
>  TRACE_EVENT(page_pool_state_release,
>  
>  	TP_PROTO(const struct page_pool *pool,
> -		 const struct page *page, u32 release),
> +		 netmem_ref netmem, u32 release),

This causes a sparse warning, as the caller is still passing a 'page'
argument.

>  
> -	TP_ARGS(pool, page, release),
> +	TP_ARGS(pool, netmem, release),
>  
>  	TP_STRUCT__entry(
>  		__field(const struct page_pool *,	pool)
> -		__field(const struct page *,		page)
> +		__field(netmem_ref,			netmem)
>  		__field(u32,				release)
>  		__field(unsigned long,			pfn)
>  	),
>  
>  	TP_fast_assign(
>  		__entry->pool		= pool;
> -		__entry->page		= page;
> +		__entry->netmem		= netmem;
>  		__entry->release	= release;
> -		__entry->pfn		= page_to_pfn(page);
> +		__entry->pfn		= netmem_to_pfn(netmem);
>  	),
>  
> -	TP_printk("page_pool=%p page=%p pfn=0x%lx release=%u",
> -		  __entry->pool, __entry->page, __entry->pfn, __entry->release)
> +	TP_printk("page_pool=%p netmem=%lu pfn=0x%lx release=%u",
> +		  __entry->pool, (__force unsigned long)__entry->netmem,
> +		  __entry->pfn, __entry->release)
>  );
>  
>  TRACE_EVENT(page_pool_state_hold,
>  
>  	TP_PROTO(const struct page_pool *pool,
> -		 const struct page *page, u32 hold),
> +		 netmem_ref netmem, u32 hold),

Same here.

Thanks,

Paolo


^ permalink raw reply

* Re: [PATCH net-next v10 05/14] netdev: netdevice devmem allocator
From: Paolo Abeni @ 2024-06-04 10:13 UTC (permalink / raw)
  To: Mina Almasry, netdev, linux-kernel, linux-doc, linux-alpha,
	linux-mips, linux-parisc, sparclinux, linux-trace-kernel,
	linux-arch, bpf, linux-kselftest, linux-media, dri-devel
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Donald Hunter,
	Jonathan Corbet, Richard Henderson, Ivan Kokshaysky, Matt Turner,
	Thomas Bogendoerfer, James E.J. Bottomley, Helge Deller,
	Andreas Larsson, Jesper Dangaard Brouer, Ilias Apalodimas,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	Arnd Bergmann, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Steffen Klassert, Herbert Xu, David Ahern,
	Willem de Bruijn, Shuah Khan, Sumit Semwal, Christian König,
	Pavel Begunkov, David Wei, Jason Gunthorpe, Yunsheng Lin,
	Shailend Chand, Harshitha Ramamurthy, Shakeel Butt,
	Jeroen de Borst, Praveen Kaligineedi, Willem de Bruijn,
	Kaiyuan Zhang
In-Reply-To: <20240530201616.1316526-6-almasrymina@google.com>

On Thu, 2024-05-30 at 20:16 +0000, Mina Almasry wrote:
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index d82f92d7cf9ce..d5fac8edf621d 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
>  	kfree(owner);
>  }
>  
> +static inline dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)

Minor nit: please no 'inline' keyword in c files.

Thanks,

Paolo


^ permalink raw reply

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
From: Conor Dooley @ 2024-06-04  8:51 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: yunhui cui, Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
In-Reply-To: <CAHVXubg4vtfjSJ-w8-7suzZ9L5ZmTo8udUMaYjJ5veKBmikNjA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1269 bytes --]

On Tue, Jun 04, 2024 at 09:17:26AM +0200, Alexandre Ghiti wrote:
> On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > >
> > > As for the current status of the patch, there are two points that can
> > > be optimized:
> > > 1. Some chip hardware implementations may not cache TLB invalid
> > > entries, so it doesn't matter whether svvptc is available or not. Can
> > > we consider adding a CONFIG_RISCV_SVVPTC to control it?
> 
> That would produce a non-portable kernel. But I'm not opposed to that
> at all, let me check how we handle other extensions. Maybe @Conor
> Dooley has some feedback here?

To be honest, not really sure what to give feedback on. Could you
elaborate on exactly what the option is going to do? Given the
portability concern, I guess you were proposing that the option would
remove the preventative fences, rather than your current patch that
removes them via an alternative? I don't think we have any extension
related options that work like that at the moment, and making that an
option will just mean that distros that look to cater for multiple
platforms won't be able to turn it on.

Thanks,
Conor.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [External] [PATCH RFC/RFT v2 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
From: Alexandre Ghiti @ 2024-06-04  7:17 UTC (permalink / raw)
  To: yunhui cui, Conor Dooley
  Cc: Catalin Marinas, Will Deacon, Thomas Bogendoerfer,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Andrew Morton,
	Ved Shanbhogue, Matt Evans, Dylan Jhong, linux-arm-kernel,
	linux-kernel, linux-mips, linuxppc-dev, linux-riscv, linux-mm
In-Reply-To: <CAHVXubi+s2Q0y_xLbHpQJpz+yXvKWJ8e96wwAHP6A9C7U-He7g@mail.gmail.com>

On Tue, Jun 4, 2024 at 9:15 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> Hi Yunhui,
>
> On Tue, Jun 4, 2024 at 8:21 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> >
> > Hi Alexandre,
> >
> > On Mon, Jun 3, 2024 at 8:02 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > >
> > > Hi Yunhui,
> > >
> > > On Mon, Jun 3, 2024 at 4:26 AM yunhui cui <cuiyunhui@bytedance.com> wrote:
> > > >
> > > > Hi Alexandre,
> > > >
> > > > On Thu, Feb 1, 2024 at 12:03 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > >
> > > > > In 6.5, we removed the vmalloc fault path because that can't work (see
> > > > > [1] [2]). Then in order to make sure that new page table entries were
> > > > > seen by the page table walker, we had to preventively emit a sfence.vma
> > > > > on all harts [3] but this solution is very costly since it relies on IPI.
> > > > >
> > > > > And even there, we could end up in a loop of vmalloc faults if a vmalloc
> > > > > allocation is done in the IPI path (for example if it is traced, see
> > > > > [4]), which could result in a kernel stack overflow.
> > > > >
> > > > > Those preventive sfence.vma needed to be emitted because:
> > > > >
> > > > > - if the uarch caches invalid entries, the new mapping may not be
> > > > >   observed by the page table walker and an invalidation may be needed.
> > > > > - if the uarch does not cache invalid entries, a reordered access
> > > > >   could "miss" the new mapping and traps: in that case, we would actually
> > > > >   only need to retry the access, no sfence.vma is required.
> > > > >
> > > > > So this patch removes those preventive sfence.vma and actually handles
> > > > > the possible (and unlikely) exceptions. And since the kernel stacks
> > > > > mappings lie in the vmalloc area, this handling must be done very early
> > > > > when the trap is taken, at the very beginning of handle_exception: this
> > > > > also rules out the vmalloc allocations in the fault path.
> > > > >
> > > > > Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
> > > > > Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
> > > > > Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
> > > > > Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
> > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > ---
> > > > >  arch/riscv/include/asm/cacheflush.h  | 18 +++++-
> > > > >  arch/riscv/include/asm/thread_info.h |  5 ++
> > > > >  arch/riscv/kernel/asm-offsets.c      |  5 ++
> > > > >  arch/riscv/kernel/entry.S            | 84 ++++++++++++++++++++++++++++
> > > > >  arch/riscv/mm/init.c                 |  2 +
> > > > >  5 files changed, 113 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > index a129dac4521d..b0d631701757 100644
> > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > @@ -37,7 +37,23 @@ static inline void flush_dcache_page(struct page *page)
> > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > >
> > > > >  #ifdef CONFIG_64BIT
> > > > > -#define flush_cache_vmap(start, end)           flush_tlb_kernel_range(start, end)
> > > > > +extern u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > > +extern char _end[];
> > > > > +#define flush_cache_vmap flush_cache_vmap
> > > > > +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> > > > > +{
> > > > > +       if (is_vmalloc_or_module_addr((void *)start)) {
> > > > > +               int i;
> > > > > +
> > > > > +               /*
> > > > > +                * We don't care if concurrently a cpu resets this value since
> > > > > +                * the only place this can happen is in handle_exception() where
> > > > > +                * an sfence.vma is emitted.
> > > > > +                */
> > > > > +               for (i = 0; i < ARRAY_SIZE(new_vmalloc); ++i)
> > > > > +                       new_vmalloc[i] = -1ULL;
> > > > > +       }
> > > > > +}
> > > > >  #define flush_cache_vmap_early(start, end)     local_flush_tlb_kernel_range(start, end)
> > > > >  #endif
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/thread_info.h b/arch/riscv/include/asm/thread_info.h
> > > > > index 5d473343634b..32631acdcdd4 100644
> > > > > --- a/arch/riscv/include/asm/thread_info.h
> > > > > +++ b/arch/riscv/include/asm/thread_info.h
> > > > > @@ -60,6 +60,11 @@ struct thread_info {
> > > > >         void                    *scs_base;
> > > > >         void                    *scs_sp;
> > > > >  #endif
> > > > > +       /*
> > > > > +        * Used in handle_exception() to save a0, a1 and a2 before knowing if we
> > > > > +        * can access the kernel stack.
> > > > > +        */
> > > > > +       unsigned long           a0, a1, a2;
> > > > >  };
> > > > >
> > > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > > diff --git a/arch/riscv/kernel/asm-offsets.c b/arch/riscv/kernel/asm-offsets.c
> > > > > index a03129f40c46..939ddc0e3c6e 100644
> > > > > --- a/arch/riscv/kernel/asm-offsets.c
> > > > > +++ b/arch/riscv/kernel/asm-offsets.c
> > > > > @@ -35,6 +35,8 @@ void asm_offsets(void)
> > > > >         OFFSET(TASK_THREAD_S9, task_struct, thread.s[9]);
> > > > >         OFFSET(TASK_THREAD_S10, task_struct, thread.s[10]);
> > > > >         OFFSET(TASK_THREAD_S11, task_struct, thread.s[11]);
> > > > > +
> > > > > +       OFFSET(TASK_TI_CPU, task_struct, thread_info.cpu);
> > > > >         OFFSET(TASK_TI_FLAGS, task_struct, thread_info.flags);
> > > > >         OFFSET(TASK_TI_PREEMPT_COUNT, task_struct, thread_info.preempt_count);
> > > > >         OFFSET(TASK_TI_KERNEL_SP, task_struct, thread_info.kernel_sp);
> > > > > @@ -42,6 +44,9 @@ void asm_offsets(void)
> > > > >  #ifdef CONFIG_SHADOW_CALL_STACK
> > > > >         OFFSET(TASK_TI_SCS_SP, task_struct, thread_info.scs_sp);
> > > > >  #endif
> > > > > +       OFFSET(TASK_TI_A0, task_struct, thread_info.a0);
> > > > > +       OFFSET(TASK_TI_A1, task_struct, thread_info.a1);
> > > > > +       OFFSET(TASK_TI_A2, task_struct, thread_info.a2);
> > > > >
> > > > >         OFFSET(TASK_TI_CPU_NUM, task_struct, thread_info.cpu);
> > > > >         OFFSET(TASK_THREAD_F0,  task_struct, thread.fstate.f[0]);
> > > > > diff --git a/arch/riscv/kernel/entry.S b/arch/riscv/kernel/entry.S
> > > > > index 9d1a305d5508..c1ffaeaba7aa 100644
> > > > > --- a/arch/riscv/kernel/entry.S
> > > > > +++ b/arch/riscv/kernel/entry.S
> > > > > @@ -19,6 +19,78 @@
> > > > >
> > > > >         .section .irqentry.text, "ax"
> > > > >
> > > > > +.macro new_vmalloc_check
> > > > > +       REG_S   a0, TASK_TI_A0(tp)
> > > > > +       REG_S   a1, TASK_TI_A1(tp)
> > > > > +       REG_S   a2, TASK_TI_A2(tp)
> > > > > +
> > > > > +       csrr    a0, CSR_CAUSE
> > > > > +       /* Exclude IRQs */
> > > > > +       blt     a0, zero, _new_vmalloc_restore_context
> > > > > +       /* Only check new_vmalloc if we are in page/protection fault */
> > > > > +       li      a1, EXC_LOAD_PAGE_FAULT
> > > > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > > > +       li      a1, EXC_STORE_PAGE_FAULT
> > > > > +       beq     a0, a1, _new_vmalloc_kernel_address
> > > > > +       li      a1, EXC_INST_PAGE_FAULT
> > > > > +       bne     a0, a1, _new_vmalloc_restore_context
> > > > > +
> > > > > +_new_vmalloc_kernel_address:
> > > > > +       /* Is it a kernel address? */
> > > > > +       csrr    a0, CSR_TVAL
> > > > > +       bge     a0, zero, _new_vmalloc_restore_context
> > > > > +
> > > > > +       /* Check if a new vmalloc mapping appeared that could explain the trap */
> > > > > +
> > > > > +       /*
> > > > > +        * Computes:
> > > > > +        * a0 = &new_vmalloc[BIT_WORD(cpu)]
> > > > > +        * a1 = BIT_MASK(cpu)
> > > > > +        */
> > > > > +       REG_L   a2, TASK_TI_CPU(tp)
> > > > > +       /*
> > > > > +        * Compute the new_vmalloc element position:
> > > > > +        * (cpu / 64) * 8 = (cpu >> 6) << 3
> > > > > +        */
> > > > > +       srli    a1, a2, 6
> > > > > +       slli    a1, a1, 3
> > > > > +       la      a0, new_vmalloc
> > > > > +       add     a0, a0, a1
> > > > > +       /*
> > > > > +        * Compute the bit position in the new_vmalloc element:
> > > > > +        * bit_pos = cpu % 64 = cpu - (cpu / 64) * 64 = cpu - (cpu >> 6) << 6
> > > > > +        *         = cpu - ((cpu >> 6) << 3) << 3
> > > > > +        */
> > > > > +       slli    a1, a1, 3
> > > > > +       sub     a1, a2, a1
> > > > > +       /* Compute the "get mask": 1 << bit_pos */
> > > > > +       li      a2, 1
> > > > > +       sll     a1, a2, a1
> > > > > +
> > > > > +       /* Check the value of new_vmalloc for this cpu */
> > > > > +       REG_L   a2, 0(a0)
> > > > > +       and     a2, a2, a1
> > > > > +       beq     a2, zero, _new_vmalloc_restore_context
> > > > > +
> > > > > +       /* Atomically reset the current cpu bit in new_vmalloc */
> > > > > +       amoxor.w        a0, a1, (a0)
> > > > > +
> > > > > +       /* Only emit a sfence.vma if the uarch caches invalid entries */
> > > > > +       ALTERNATIVE("sfence.vma", "nop", 0, RISCV_ISA_EXT_SVVPTC, 1)
> > > > > +
> > > > > +       REG_L   a0, TASK_TI_A0(tp)
> > > > > +       REG_L   a1, TASK_TI_A1(tp)
> > > > > +       REG_L   a2, TASK_TI_A2(tp)
> > > > > +       csrw    CSR_SCRATCH, x0
> > > > > +       sret
> > > > > +
> > > > > +_new_vmalloc_restore_context:
> > > > > +       REG_L   a0, TASK_TI_A0(tp)
> > > > > +       REG_L   a1, TASK_TI_A1(tp)
> > > > > +       REG_L   a2, TASK_TI_A2(tp)
> > > > > +.endm
> > > > > +
> > > > > +
> > > > >  SYM_CODE_START(handle_exception)
> > > > >         /*
> > > > >          * If coming from userspace, preserve the user thread pointer and load
> > > > > @@ -30,6 +102,18 @@ SYM_CODE_START(handle_exception)
> > > > >
> > > > >  .Lrestore_kernel_tpsp:
> > > > >         csrr tp, CSR_SCRATCH
> > > > > +
> > > > > +       /*
> > > > > +        * The RISC-V kernel does not eagerly emit a sfence.vma after each
> > > > > +        * new vmalloc mapping, which may result in exceptions:
> > > > > +        * - if the uarch caches invalid entries, the new mapping would not be
> > > > > +        *   observed by the page table walker and an invalidation is needed.
> > > > > +        * - if the uarch does not cache invalid entries, a reordered access
> > > > > +        *   could "miss" the new mapping and traps: in that case, we only need
> > > > > +        *   to retry the access, no sfence.vma is required.
> > > > > +        */
> > > > > +       new_vmalloc_check
> > > > > +
> > > > >         REG_S sp, TASK_TI_KERNEL_SP(tp)
> > > > >
> > > > >  #ifdef CONFIG_VMAP_STACK
> > > > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > > > > index eafc4c2200f2..54c9fdeda11e 100644
> > > > > --- a/arch/riscv/mm/init.c
> > > > > +++ b/arch/riscv/mm/init.c
> > > > > @@ -36,6 +36,8 @@
> > > > >
> > > > >  #include "../kernel/head.h"
> > > > >
> > > > > +u64 new_vmalloc[NR_CPUS / sizeof(u64) + 1];
> > > > > +
> > > > >  struct kernel_mapping kernel_map __ro_after_init;
> > > > >  EXPORT_SYMBOL(kernel_map);
> > > > >  #ifdef CONFIG_XIP_KERNEL
> > > > > --
> > > > > 2.39.2
> > > > >
> > > > >
> > > >
> > > > Can we consider using new_vmalloc as a percpu variable, so that we
> > > > don't need to add a0/1/2 in thread_info?
> > >
> > > At first, I used percpu variables. But then I realized that percpu
> > > areas are allocated in the vmalloc area, so if somehow we take a trap
> > > when accessing the new_vmalloc percpu variable, we could not recover
> > > from this as we would trap forever in new_vmalloc_check. But
> > > admittedly, not sure that can happen.
> > >
> > > And how would that remove a0, a1 and a2 from thread_info? We'd still
> > > need to save some registers somewhere to access the percpu variable
> > > right?
> > >
> > > > Also, try not to do too much
> > > > calculation logic in new_vmalloc_check, after all, handle_exception is
> > > > a high-frequency path. In this case, can we consider writing
> > > > new_vmalloc_check in C language to increase readability?
> > >
> > > If we write that in C, we don't have the control over the allocated
> > > registers and then we can't correctly save the context.
> >
> > If we use C language, new_vmalloc_check is written just like do_irq(),
> > then we need _save_context, but for new_vmalloc_check, it is not worth
> > the loss, because exceptions from user mode do not need
> > new_vmalloc_check, which also shows that it is reasonable to put
> > new_vmalloc_check after _restore_kernel_tpsp.
> >
> > Saving is necessary. We can save a0, a1, a2 without using thread_info.
> > We can choose to save on the kernel stack of the current tp, but we
> > need to add the following instructions:
> > REG_S sp, TASK_TI_USER_SP(tp)
> > REG_L sp, TASK_TI_KERNEL_SP(tp)
> > addi sp, sp, -(PT_SIZE_ON_STACK)
> > It seems that saving directly on thread_info is more direct, but
> > saving on the kernel stack is more logically consistent, and there is
> > no need to increase the size of thread_info.
>
> You can't save on the kernel stack since kernel stacks are allocated
> in the vmalloc area.
>
> >
> > As for the current status of the patch, there are two points that can
> > be optimized:
> > 1. Some chip hardware implementations may not cache TLB invalid
> > entries, so it doesn't matter whether svvptc is available or not. Can
> > we consider adding a CONFIG_RISCV_SVVPTC to control it?

That would produce a non-portable kernel. But I'm not opposed to that
at all, let me check how we handle other extensions. Maybe @Conor
Dooley has some feedback here?

> >
> > 2. .macro new_vmalloc_check
> > REG_S a0, TASK_TI_A0(tp)
> > REG_S a1, TASK_TI_A1(tp)
> > REG_S a2, TASK_TI_A2(tp)
> > When executing blt a0, zero, _new_vmalloc_restore_context, you can not
> > save a1, a2 first
>
> Ok, I can do that :)
>
> Thanks again for your inputs,
>
> Alex
>
> >
> > >
> > > Thanks for your interest in this patchset :)
> > >
> > > Alex
> > >
> > > >
> > > > Thanks,
> > > > Yunhui
> >
> > Thanks,
> > Yunhui

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox